Research on Wargame Decision-Making Method Based on Multi-Agent Deep Deterministic Policy Gradient
1. Introduction
- MADDPG was adapted to the wargame environment. A partially observable Markov decision process (POMDP) models the decision process, a joint action-value function is used to train the centralized critics, and the Gumbel-Softmax estimator fits the discrete policies of wargames.
- Supervised learning was placed before reinforcement learning to improve training efficiency and reduce the action space. The wargame decision-making method is accordingly divided into a supervised learning phase and a reinforcement learning phase. In the supervised learning phase, the state-action pair data were split into training and testing sets, and the model was trained with a supervised learning algorithm to obtain the primary agent.
- In the reinforcement learning phase, a policy gradient estimator was adopted to reduce the action space and to seek the globally optimal solution, and an additional reward function was designed to address the sparse reward problem.
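The Gumbel-Softmax estimator mentioned above draws a relaxed one-hot sample from a categorical policy, so a deterministic-policy method can still output discrete actions. The following is a minimal standard-library sketch of the sampling step only, not the authors' implementation; the function names, the temperature value, and the uniform logits are assumptions made for illustration.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Return a relaxed one-hot sample over len(logits) discrete actions."""
    rng = rng or random.Random()
    perturbed = [(x + sample_gumbel(rng)) / temperature for x in logits]
    m = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def to_action(probs):
    """Discretize the relaxed sample by taking the index of its largest entry."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical actor output: uniform logits over 10 discrete actions.
logits = [0.0] * 10
relaxed = gumbel_softmax(logits, temperature=0.5, rng=random.Random(0))
action = to_action(relaxed)
```

Lower temperatures push the relaxed sample closer to a true one-hot vector, while higher temperatures make it smoother.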
2. Related Work
2.1. Labeled and Real Combat Data Shortage
2.2. Markov Decision Process
2.3. DDPG
3. Improved MADDPG for Wargame Decision-Making
3.1. Multi-Agent Reinforcement Learning
3.2. Improved MADDPG
3.2.1. Partially Observable Markov Decision Process
3.2.2. Joint Action-Value Function
3.2.3. Gumbel-Softmax Estimator for Discrete Policy
4. Wargame Decision-Making Method
4.1. Supervised Learning Phase
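The introduction states that, in this phase, the state-action pair data are separated into training and testing sets before a supervised model is fitted. A minimal sketch of that split follows; the 80/20 ratio, the seed, the function name `split_pairs`, and the synthetic data are assumptions, since the source does not specify them here.

```python
import random


def split_pairs(pairs, test_fraction=0.2, seed=0):
    """Shuffle (state, action) pairs and split them into train and test sets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be in (0, 1)")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one test sample even for very small data sets.
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: each state is a small feature tuple, each action an index 0-9.
pairs = [((i, i % 7), i % 10) for i in range(100)]
train_set, test_set = split_pairs(pairs, test_fraction=0.2, seed=42)
```

The training set would then feed whichever supervised learner produces the primary agent, and the test set would be used to check its action-prediction accuracy.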
4.2. Reinforcement Learning Phase
4.2.1. Policy Gradient Estimator
4.2.2. Additional Reward Function
5. Experiments
5.1. Experimental Platform
5.2. Experimental Settings
5.3. Experimental Results and Analysis
6. Conclusions and Future Work
Author Contributions
Informed Consent Statement
Conflicts of Interest
References
- [1] Yuksek, B.; Guner, G.; Karali, H.; Candan, B.; Inalhan, G. Intelligent Wargaming Approach to Increase Course of Action Effectiveness in Military Operations. In Proceedings of the AIAA SCITECH 2023 Forum, Online, 22–27 January 2023; p. 2531.
- [2] Weilan, G.; Hao, Y.; Jieqiang, Z.; Fengyun, L. Research on the training of decision-making quantitative ability of decision-making assistants based on AHP method: Take X’s car purchase decision as an example. In Proceedings of the 2nd International Conference on Applied Mathematics, Modelling, and Intelligent Computing, Kunming, China, 25–27 March 2022; p. 1225958.
- [3] Wu, K.; Liu, M.; Cui, P.; Zhang, Y. A Training Model of Wargaming Based on Imitation Learning and Deep Reinforcement Learning. In Proceedings of the 2022 Chinese Intelligent Systems Conference: Volume I, Beijing, China, 15–16 October 2022; pp. 786–795.
- [4] Kase, S.E.; Hung, C.P.; Krayzman, T.; Hare, J.Z.; Rinderspacher, B.C.; Su, S.M. The Future of Collaborative Human-Artificial Intelligence Decision-Making for Mission Planning. Front. Psychol. 2022, 13, 1246.
- [5] Bell, A.; Bollfrass, A. To Hell with the Cell: The Case for Immersive Statecraft Education. Int. Stud. Perspect. 2022, 23, 129–150.
- [6] Chen, Y. Rethinking Adversarial Examples in Wargames. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 100–106.
- [7] Davis, P.K.; Bracken, P. Artificial intelligence for wargaming and modeling. J. Def. Model. Simul. 2022, 15485129211073126.
- [8] Xiaoling, L.; Fang, W.; Yuanzhou, L. Prediction method of equipment maintenance time based on deep learning. In Proceedings of the AOPC 2020: Display Technology; Photonic MEMS, THz MEMS, and Metamaterials; and AI in Optics and Photonics, Beijing, China, 5 November 2020; p. 115650M.
- [9] Peng, J.; Zhang, P. Velocity Prediction Method of Quadrotor UAV Based on BP Neural Network. In Proceedings of the 2020 International Symposium on Autonomous Systems (ISAS), Guangzhou, China, 6–8 December 2020; pp. 23–28.
- [10] Wu, Z.; Zhou, Y.; Wang, H.; Jiang, Z. Depth prediction of urban flood under different rainfall return periods based on deep learning and data warehouse. Sci. Total Environ. 2020, 716, 137077.
- [11] Liu, M.; Zhang, H.; Hao, W.; Qi, X.; Cheng, K.; Jin, D.; Feng, X. Introduction of a new dataset and method for location predicting based on deep learning in wargame. J. Intell. Fuzzy Syst. 2021, 40, 9259–9275.
- [12] Chen, L.; Liang, X.; Feng, Y.; Zhang, L.; Yang, J.; Liu, Z. Online Intention Recognition with Incomplete Information Based on a Weighted Contrastive Predictive Coding Model in Wargame. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1–14.
- [13] Czaczkes, T.J. How to not get stuck—Negative feedback due to crowding maintains flexibility in ant foraging. J. Theor. Biol. 2014, 360, 172–180.
- [14] de Moura Oliveira, P.B.; Pires, E.J.S.; Novais, P. Revisiting the Simulated Annealing Algorithm from a Teaching Perspective. In Proceedings of the International Joint Conference SOCO’16-CISIS’16-ICEUTE’16, San Sebastián, Spain, 19–21 October 2016; pp. 718–727.
- [15] Li, W.-T.; Li, J.-Q.; Chen, B.-K.; Huang, X.; Wang, Z. Information feedback strategy for beltways in intelligent transportation systems. Europhys. Lett. 2016, 113, 64001.
- [16] Liu, Y.; Heidari, A.A.; Cai, Z.; Liang, G.; Chen, H.; Pan, Z.; Alsufyani, A.; Bourouis, S. Simulated annealing-based dynamic step shuffled frog leaping algorithm: Optimal performance design and feature selection. Neurocomputing 2022, 503, 325–362.
- [17] Zhang, C.; Wan, L.; Liu, Y. Ship Heading Control Based on Fuzzy PID Control. In Proceedings of the 2019 34rd Youth Academic Annual Conference of Chinese Association of Automation (YAC), Jinzhou, China, 6–8 June 2019; pp. 607–612.
- [18] Li, Y.; Bertino, E.; Abdel-Khalik, H.S. Effectiveness of Model-Based Defenses for Digitally Controlled Industrial Systems: Nuclear Reactor Case Study. Nucl. Technol. 2020, 206, 82–93.
- [19] Ma, H. Optimization of Hotel Financial Management Information System Based on Computational Intelligence. Wirel. Commun. Mob. Comput. 2021, 2021, 8680306.
- [20] Sun, Y.; Yuan, B.; Xiang, Q.; Zhou, J.; Yu, J.; Dai, D.; Zhou, X. Intelligent Decision-Making and Human Language Communication Based on Deep Reinforcement Learning in a Wargame Environment. IEEE Trans. Hum. Mach. Syst. 2023, 53, 201–214.
- [21] Wu, W.; Liao, M.; Lv, P.; Duan, X.; Zhao, X. Performance Comparison Between Genetic Fuzzy Tree and Reinforcement Learning in Gaming Environment. In Proceedings of the Cognitive Systems and Signal Processing, Beijing, China, 29 November–1 December 2018; pp. 256–267.
- [22] Choi, M.; Moon, H.; Han, S.; Choi, Y.; Lee, M.; Cho, N. Experimental and Computational Study on the Ground Forces CGF Automation of Wargame Models Using Reinforcement Learning. IEEE Access 2022, 10, 128970–128982.
- [23] Boron, J.; Darken, C. Developing Combat Behavior through Reinforcement Learning in Wargames and Simulations. In Proceedings of the 2020 IEEE Conference on Games (CoG), Osaka, Japan, 24–27 August 2020; pp. 728–731.
- [24] Hung, C.P.; Hare, J.Z.; Rinderspacher, B.C.; Peregrim, W.; Kase, S.; Su, S.; Raglin, A.; Richardson, J.T. ARL Battlespace: A platform for developing novel AI for complex adversarial reasoning in MDO. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications IV, Orlando, FL, USA, 2–4 April 2022; pp. 294–304.
- [25] Zhao, Y.; Hemberg, E.; Derbinsky, N.; Mata, G.; O’Reilly, U.-M. Simulating a logistics enterprise using an asymmetrical wargame simulation with soar reinforcement learning and coevolutionary algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, Lille, France, 10–14 July 2021; pp. 1907–1915.
- [26] Chen, L.; Zhang, Y.; Feng, Y.; Zhang, L.; Liu, Z. A Human-Machine Agent Based on Active Reinforcement Learning for Target Classification in Wargame. IEEE Trans. Neural Netw. Learn. Syst. 2023; in press.
- [27] Xue, Y.; Sun, Y.; Zhou, J.; Peng, L.; Zhou, X. Multi-attribute decision-making in wargames leveraging the Entropy-Weight method in conjunction with deep reinforcement learning. IEEE Trans. Games 2023; in press.
- [28] Güneri, B.; Deveci, M. Evaluation of Supplier Selection in the Defense Industry Using q-Rung Orthopair Fuzzy Set based EDAS Approach. Expert Syst. Appl. 2023, 222, 119846.
- [29] Xiong, S.-H.; Zhu, C.-Y.; Chen, Z.-S.; Deveci, M.; Chiclana, F.; Skibniewski, M.J. On extended power geometric operator for proportional hesitant fuzzy linguistic large-scale group decision-making. Inf. Sci. 2023, 632, 637–663.
- [30] Cogburn, R. Markov Chains in Random Environments: The Case of Markovian Environments. Ann. Probab. 1980, 8, 908–916.
- [31] Chung, K.L. The general theory of Markov processes according to Doeblin. Z. Für Wahrscheinlichkeitstheorie Und Verwandte Geb. 1964, 2, 230–254.
- [32] Orey, S. Limit Theorems for Markov Chain Transition Probabilities; Van Nostrand: London, UK, 1971.
- [33] Cogburn, R. The ergodic theory of Markov chains in random environments. Z. Für Wahrscheinlichkeitstheorie Und Verwandte Geb. 1984, 66, 109–128.
- [34] Cogburn, R. On the Central Limit Theorem for Markov Chains in Random Environments. Ann. Probab. 1991, 19, 587–604.
- [35] Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
- [36] Li, J.; Shi, H.; Hwang, K.-S. Using Fuzzy Logic to Learn Abstract Policies in Large-Scale Multiagent Reinforcement Learning. IEEE Trans. Fuzzy Syst. 2022, 30, 5211–5224.
- [37] Busoniu, L.; Babuska, R.; De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2008, 38, 156–172.
- [38] Zhao, Q.; Tong, L.; Swami, A.; Chen, Y.X. Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework. IEEE J. Sel. Areas Commun. 2007, 25, 589–600.
- [39] Peng, B.; Rashid, T.; Schroeder de Witt, C.; Kamienny, P.-A.; Torr, P.; Böhmer, W.; Whiteson, S. Facmac: Factored multi-agent centralised policy gradients. Adv. Neural Inf. Process. Syst. 2021, 34, 12208–12221.
- [40] Wang, L.; Wang, K.; Pan, C.; Xu, W.; Aslam, N.; Hanzo, L. Multi-Agent Deep Reinforcement Learning-Based Trajectory Planning for Multi-UAV Assisted Mobile Edge Computing. IEEE Trans. Cogn. Commun. Netw. 2021, 7, 73–84.
- [41] Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144.
- [42] Schwartz, P.J.; O’Neill, D.V.; Bentz, M.E.; Brown, A.; Doyle, B.S.; Liepa, O.C.; Lawrence, R.; Hull, R.D. AI-enabled wargaming in the military decision making process. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications II, Online, 27 April–8 May 2020; pp. 118–134.
- [43] Huang, G.-B.; Zhu, Q.-Y.; Siew, C.-K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501.
- [44] Schulman, J.; Heess, N.; Weber, T.; Abbeel, P. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
- [45] Song, W.; Shi, C.; Xiao, Z.; Duan, Z.; Xu, Y.; Zhang, M.; Tang, J. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1161–1170.
- [46] Wang, Y.; Han, B.; Wang, T.; Dong, H.; Zhang, C. Off-policy multi-agent decomposed policy gradients. arXiv 2020, arXiv:2007.12322.
- [47] Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
Contributors | Contributed Content
A.A. Markov | Created Markov processes
Cogburn, R. [30] | Gave formulas for Markov chains in stochastic environments
Chung, K.L. [31] | Presented limit theorems in the general context
Orey, S. [32] | Compiled the limit theorems concerning the transition probabilities of Markov chains
Cogburn, R. [33] | Analyzed the dependence between environmental factors and controlled Markov chains
Cogburn, R. [34] | Established the central limit theorem for functions of Markov chains in a stochastic environment
Our Side’s Agent | Action Speed (s/Grid) | Initial Position
Tank | 20 | 5947
Chariot | 20 | 6048
Infantry Squad | 144 | 6048

Enemy’s Agent | Action Speed (s/Grid) | Initial Position
Heavy Tank | 15 | 3427
Heavy Chariot | 15 | 3526
Infantry Squad | 144 | 3526
Action Number | Agent Action |
0 | Move to the left |
1 | Move to the right |
2 | Move to the upper left |
3 | Move to the upper right |
4 | Move to the lower left |
5 | Move to the lower right |
6 | Attack enemy’s heavy tank |
7 | Attack enemy’s heavy chariot |
8 | Attack enemy’s infantry squad |
9 | Switch to covert status |
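The ten-entry action table above can be captured as an enumeration so that policy outputs are decoded by name rather than by bare index. This is an illustrative sketch only: the member names, the `MOVES` and `ATTACKS` groupings, and the `decode` helper are assumptions introduced here, while the index-to-action mapping follows the table.

```python
from enum import IntEnum


class Action(IntEnum):
    """Discrete actions, indexed as in the action table."""
    MOVE_LEFT = 0
    MOVE_RIGHT = 1
    MOVE_UPPER_LEFT = 2
    MOVE_UPPER_RIGHT = 3
    MOVE_LOWER_LEFT = 4
    MOVE_LOWER_RIGHT = 5
    ATTACK_HEAVY_TANK = 6
    ATTACK_HEAVY_CHARIOT = 7
    ATTACK_INFANTRY_SQUAD = 8
    GO_COVERT = 9


# Hypothetical groupings that are convenient for action masking.
MOVES = frozenset(a for a in Action if a <= Action.MOVE_LOWER_RIGHT)
ATTACKS = frozenset(
    a for a in Action
    if Action.ATTACK_HEAVY_TANK <= a <= Action.ATTACK_INFANTRY_SQUAD
)


def decode(one_hot):
    """Map a one-hot (or relaxed one-hot) vector of length 10 to an Action."""
    if len(one_hot) != len(Action):
        raise ValueError("expected one entry per action")
    index = max(range(len(one_hot)), key=one_hot.__getitem__)
    return Action(index)


chosen = decode([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
```

The six movement directions are consistent with the six neighbours of a cell on a hexagonal grid, although the table itself does not state the grid type.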
Parameters | Values |
action_selector | "gumbel" |
epsilon_start | 0.5 |
epsilon_finish | 0.05 |
epsilon_anneal_time | 50,000 |
obs_last_action | True |
batch_size_run | 1 |
batch_size | 32 |
buffer_size | 5000 |
act_noise | 0.1 |
gamma | 0.9 |
target_update_interval | 200 |
target_update_mode | "hard" |
target_update_tau | 0.001 |
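To make the exploration and target-network parameters above concrete, the sketch below shows one plausible reading of them. The linear shape of the epsilon schedule, the use of training steps as the unit for `target_update_interval`, and the helper names `epsilon_at` and `update_target` are assumptions; only the parameter names and values come from the table.

```python
CONFIG = {
    "epsilon_start": 0.5,
    "epsilon_finish": 0.05,
    "epsilon_anneal_time": 50_000,
    "gamma": 0.9,
    "target_update_interval": 200,
    "target_update_mode": "hard",
    "target_update_tau": 0.001,
}


def epsilon_at(step, cfg=CONFIG):
    """Linearly anneal epsilon from start to finish, then hold it (assumed schedule)."""
    frac = min(max(step, 0) / cfg["epsilon_anneal_time"], 1.0)
    return cfg["epsilon_start"] + frac * (cfg["epsilon_finish"] - cfg["epsilon_start"])


def update_target(target, online, step, cfg=CONFIG):
    """Update the target weights in place from the online weights.

    "hard" copies the weights every target_update_interval steps;
    any other mode blends them on every call using tau (soft update).
    """
    if cfg["target_update_mode"] == "hard":
        if step % cfg["target_update_interval"] == 0:
            target[:] = online
    else:
        tau = cfg["target_update_tau"]
        target[:] = [(1 - tau) * t + tau * o for t, o in zip(target, online)]
    return target


# Hypothetical flat weight vectors standing in for network parameters.
target_weights = [0.0, 0.0]
online_weights = [1.0, 2.0]
update_target(target_weights, online_weights, step=199)  # not a multiple of 200: no copy
unchanged = list(target_weights)
update_target(target_weights, online_weights, step=200)  # multiple of 200: hard copy
```

Under this reading, `target_update_tau` has no effect while `target_update_mode` is "hard"; it would only matter if the mode were switched to a soft update.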
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yu, S.; Zhu, W.; Wang, Y. Research on Wargame Decision-Making Method Based on Multi-Agent Deep Deterministic Policy Gradient. Appl. Sci. 2023, 13, 4569.