A Confrontation Decision-Making Method with Deep Reinforcement Learning and Knowledge Transfer for Multi-Agent System
Abstract
1. Introduction
1.1. Reinforcement Learning
1.2. Multi-Agent Confrontation in the Real-Time Strategy Game
1.3. Confrontation Decision-Making with Reinforcement Learning for Multi-Agent System
1.4. Deep Reinforcement Learning
1.5. Knowledge Transfer
1.6. Contributions in This Work
1.7. Paper Structure
2. Background
2.1. Reinforcement Learning
2.2. Neural Network
3. Description of Continuous Control Problem and a Classical Method
3.1. Optimal Control Problem Using Value Function in SMDPs Process
3.2. Classical Method for Optimal Control Using Actor–Critic
Algorithm 1: Basic Procedure for Actor–Critic Algorithm 
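As a concrete illustration of the basic actor–critic procedure named above, the following minimal sketch trains a softmax actor and a scalar critic on a hypothetical one-state, two-action task (not the paper's environment; all names are illustrative). The critic's one-step TD error drives both the value update and the policy update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-state environment: action 1 pays reward 1, action 0 pays 0.
def env_reward(action):
    return 1.0 if action == 1 else 0.0

prefs = np.zeros(2)                  # actor: softmax action preferences
v = 0.0                              # critic: value estimate of the single state
alpha_actor, alpha_critic = 0.1, 0.1

for _ in range(2000):
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()             # softmax policy
    a = rng.choice(2, p=probs)
    r = env_reward(a)
    td_error = r - v                 # one-step TD error (episode ends here)
    v += alpha_critic * td_error     # critic update toward the observed return
    grad_log = -probs                # gradient of log pi(a) for a softmax policy
    grad_log[a] += 1.0
    prefs += alpha_actor * td_error * grad_log   # actor update along the TD error
```

After training, the actor strongly prefers the higher-reward action, showing how the critic's error signal shapes the policy without the actor ever seeing rewards directly.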

4. Multi-Agent DDPG Algorithm with an Auxiliary Controller for Confrontation Decision-Making
4.1. Back Propagation Algorithm with a Momentum Mechanism
Algorithm 2: Procedure for the Back-Propagation Algorithm with Momentum
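The momentum mechanism of Section 4.1 augments plain gradient descent with a decaying velocity term that accumulates past gradients. A minimal sketch, assuming the classical heavy-ball form (the paper's exact variant may differ), applied to minimizing $f(x) = x^2$:

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    """One parameter update with a momentum mechanism.

    The velocity is an exponentially decayed sum of past gradients,
    which damps oscillation and accelerates progress along directions
    where the gradient sign is consistent.
    """
    velocity = beta * velocity - lr * grad   # accumulate momentum
    theta = theta + velocity                 # apply the update
    return theta, velocity

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.0.
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = momentum_step(theta, 2.0 * theta, v)
```

With the decay factor `beta = 0.9`, roughly the last ten gradients contribute meaningfully to each step, so the iterate converges toward the minimum at zero faster than plain gradient descent with the same learning rate.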

4.2. Multi-Agent DDPG Algorithm with Parameter Sharing
Algorithm 3: Multi-Agent DDPG Algorithm with Parameter Sharing
Initialization:
Initialize the critic's current network $Q\left(s, a_1, a_2, \dots, a_M \mid \theta^{Q}\right)$ with weights $\theta^{Q}$; initialize the actor's current network $\mu\left(s \mid \theta^{\mu}\right)$ with weights $\theta^{\mu}$.
Initialize the target networks for the critic and the actor: $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$.
Initialize the sampling pool (replay buffer) $R$.
For each agent $k = 1, \dots, K$ do
 For episode $= 1, \dots, M$ do
  Initialize a random noise process $\aleph$;
  Observe the initial state $s_1$;
  For $t = 1, \dots, T$ do
   Select $a_t = \mu\left(s_t \mid \theta^{\mu}\right) + \aleph_t$; execute the joint action $a_t = \left(a_1^t, a_2^t, \dots, a_M^t\right)$, observe the reward $r_t$ and the next state $s_{t+1}$, and store the transition: $R \leftarrow \left(s_t, a_t, r_t, s_{t+1}\right)$; then set $s_t \leftarrow s_{t+1}$;
   Randomly select $N$ samples $\left(s_i, a_i, r_i, s_{i+1}\right)$ from the sampling pool $R$;
   Compute the target value $y_i = r_i + \gamma Q'\left(s_{i+1}, \mu'\left(s_{i+1} \mid \theta^{\mu'}\right) \mid \theta^{Q'}\right)$;
   Update the critic's current network with the momentum mechanism by minimizing the loss $L\left(\theta^{Q}\right) = \frac{1}{N} \sum_i \left(y_i - Q\left(s_i, a_1^i, a_2^i, \dots, a_M^i \mid \theta^{Q}\right)\right)^2$;
   Update the actor's current network with the momentum mechanism using the sampled policy gradient $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_a Q\left(s, a_1, a_2, \dots, a_M \mid \theta^{Q}\right)\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)\big|_{s=s_i}$;
   Update the target networks for the critic and the actor separately: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau)\,\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\,\theta^{\mu'}$;
  End for
 End for
End for
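The final step of Algorithm 3 updates the target networks by soft (Polyak) averaging with rate $\tau$, so the targets trail the current networks slowly and stabilize the TD targets $y_i$. A minimal sketch, with toy weight dictionaries standing in for the actor/critic parameters (names are illustrative):

```python
import numpy as np

def soft_update(target, current, tau=0.001):
    """Soft target-network update: theta' <- tau * theta + (1 - tau) * theta'.

    Applied element-wise to every parameter tensor; with a small tau the
    target network changes only slightly per training step.
    """
    return {k: tau * current[k] + (1.0 - tau) * target[k] for k in target}

# Toy parameter dictionaries standing in for network weights.
current = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
target = {"w": np.zeros(2), "b": np.zeros(1)}

# With tau = 0.5 the target moves halfway toward the current weights.
target = soft_update(target, current, tau=0.5)
```

In practice $\tau$ is small (the tables below use 0.001 and 0.01), so hundreds of updates are needed before the target networks fully reflect the current ones.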
4.3. Neural Network Structure for the DDPG Algorithm
4.4. An Auxiliary Controller Using a Policy-Based RL Method
5. Knowledge Transfer Method
6. Effect Test for the Proposed Policy-Based RL Method
7. Experiment on StarCraft Task
7.1. Reinforcement Learning Model for StarCraft Task
7.2. Experimental Configuration
7.3. Confrontation Task: Goliaths vs. Zealots
7.4. Confrontation Task: Goliaths vs. Zerglings
7.5. Large-Scale Confrontation Scenario
8. Experiment on Tank War Task
8.1. Experimental Configuration for Tank War Task
8.2. Experimental Results and Analysis
9. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
Parameter  Value

Learning rate of DRL $\alpha $  0.0015
Discount rate $\gamma $  0.95
Mini-batch size  64
Soft target update rate $\tau $  0.001
Maximum time steps $T$  1000
Adam algorithm factors ${\gamma}_{1}$ and ${\gamma}_{2}$  0.9
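The Adam factors $\gamma_1$ and $\gamma_2$ in the table above are the optimizer's first- and second-moment decay rates. A sketch of a single Adam update wired to the table's values (variable names are illustrative, not from the paper):

```python
import numpy as np

# Hyperparameters taken from the table above (StarCraft task).
cfg = {
    "lr": 0.0015,        # learning rate of DRL, alpha
    "gamma": 0.95,       # discount rate
    "batch_size": 64,    # mini-batch size
    "tau": 0.001,        # soft target update rate
    "max_steps": 1000,   # maximum time steps T
    "beta1": 0.9,        # Adam factor gamma_1 (first-moment decay)
    "beta2": 0.9,        # Adam factor gamma_2 (second-moment decay)
}

def adam_step(theta, grad, m, v, t, cfg, eps=1e-8):
    """One Adam update using the table's decay factors.

    m and v are running estimates of the gradient's first and second
    moments; the bias correction divides out their warm-up shrinkage.
    """
    m = cfg["beta1"] * m + (1 - cfg["beta1"]) * grad
    v = cfg["beta2"] * v + (1 - cfg["beta2"]) * grad ** 2
    m_hat = m / (1 - cfg["beta1"] ** t)       # bias-corrected first moment
    v_hat = v / (1 - cfg["beta2"] ** t)       # bias-corrected second moment
    theta = theta - cfg["lr"] * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Note that setting $\gamma_2 = 0.9$ (rather than the more common 0.999) makes the second-moment estimate adapt faster to recent gradient magnitudes.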
Target Task  Intermediate Task 1  Intermediate Task 2  Intermediate Task 3

Mar 13 vs. Zer 15  Mar 6 vs. Zer 8  Mar 9 vs. Zer 12  Mar 11 vs. Zer 14
Mar 23 vs. Zer 35  Mar 14 vs. Zer 16  Mar 18 vs. Zer 25  Mar 22 vs. Zer 29
Parameter  Value

Learning rate of DRL $\alpha $  0.001
Discount rate $\gamma $  0.8
Mini-batch size  32
Soft target update rate $\tau $  0.01
Maximum HPs  100
Adam algorithm factors ${\gamma}_{1}$ and ${\gamma}_{2}$  0.9
Hu, C. A Confrontation Decision-Making Method with Deep Reinforcement Learning and Knowledge Transfer for Multi-Agent System. Symmetry 2020, 12, 631. https://doi.org/10.3390/sym12040631