KnowRU: Knowledge Reuse via Knowledge Distillation in Multi-Agent Reinforcement Learning
Abstract
1. Introduction
- We propose a task-independent KD framework for MARL that focuses not only on accelerating the training phase but also on improving asymptotic performance on new tasks (a sketch of the underlying distillation loss follows this list).
- We explore different strategies to further improve knowledge-transfer performance.
- Extensive empirical experiments demonstrate the effectiveness of KnowRU in different task scenarios with different MARL algorithms.
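To make the core idea concrete, the following is a minimal sketch of a soft-target distillation loss of the kind used to transfer knowledge from a previous policy (teacher) to a new one (student). It assumes PyTorch and discrete action logits; the function name `kd_loss` and the temperature parameter are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-target distillation loss between a previous (teacher) policy and
    the current (student) policy, in the spirit of Hinton et al.

    Both arguments are unnormalized action logits of shape (batch, n_actions).
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```

In practice this term is added to the agent's ordinary actor loss with a tunable weight, so the student is pulled toward the teacher's behavior early in training without being forced to copy it permanently.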
2. Related Work
3. Methodology
3.1. Preliminaries and Notations
3.2. KD
3.3. Knowledge Reuse via Mimicking
Algorithm 1: The training process based on the actor-critic (AC) framework.
Initialization: the parameters of the student's actor and critic networks; the parameters of the previous (teacher) policy model; and the knowledge-reuse weight and its decay value.
Output: the actor and critic network parameters for every agent.
for episode = 1 to max-episodes:
    for step i = 1 to max-steps-in-episode:
        take an action, get the reward from the environment, and observe the new state
        store the transition into the replay buffer
    end for
    randomly sample N transitions from the replay buffer
    for j = 1 to N by step k:
        take the next minibatch of samples in order
        get the teacher's output by feeding the observations to the previous policy model
        get the student's output by feeding the same observations to the actor network
        compute the knowledge-reuse (distillation) loss between the two outputs
        optimize the actor network by minimizing its ordinary policy loss plus the distillation loss scaled by the knowledge-reuse weight
        optimize the critic network by minimizing the temporal-difference (TD) error
    end for
    if the decay condition is met: multiply the knowledge-reuse weight by the decay value
end for
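The per-agent update step of Algorithm 1 can be summarized in code. Below is a minimal sketch assuming a DDPG/MADDPG-style continuous-action actor-critic in PyTorch; the function `knowru_update`, the MSE mimicking term, and the names `alpha` (knowledge-reuse weight) and `decay` are illustrative stand-ins for the paper's notation, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def knowru_update(actor, critic, target_critic, teacher_actor,
                  batch, alpha, actor_opt, critic_opt,
                  gamma=0.95, decay=0.99):
    """One KnowRU-style update for a single agent (sketch of Algorithm 1).

    batch: dict of tensors sampled from the replay buffer with keys
    'obs', 'act', 'rew', 'next_obs', 'done'.  teacher_actor is the frozen
    policy trained on the previous task; alpha is the current
    knowledge-reuse weight.  Returns the decayed alpha.
    """
    obs, act, rew, next_obs, done = (batch[k] for k in
                                     ("obs", "act", "rew", "next_obs", "done"))

    # Critic step: minimize the TD error, as in a standard actor-critic.
    with torch.no_grad():
        next_act = actor(next_obs)
        target_q = rew + gamma * (1.0 - done) * target_critic(next_obs, next_act)
    critic_loss = F.mse_loss(critic(obs, act), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor step: ordinary policy loss plus the weighted mimicking term.
    new_act = actor(obs)
    policy_loss = -critic(obs, new_act).mean()
    with torch.no_grad():
        teacher_act = teacher_actor(obs)            # knowledge from the previous task
    mimic_loss = F.mse_loss(new_act, teacher_act)   # student mimics the teacher's action
    actor_loss = policy_loss + alpha * mimic_loss
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Gradually reduce reliance on the previous policy.
    return alpha * decay
```

Each agent would call such an update once per training step after sampling a batch from its replay buffer; as the knowledge-reuse weight decays, the agent relies less on the previous task's policy and more on its own reward signal.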
4. Experiments and Analysis
4.1. Experimental Setup
4.2. Simple Spread Scenario
4.3. Simple Adversary Scenario
4.4. Cooperative Treasure Collection Scenario
4.5. Component Analysis and Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References