Forward and Backward Bellman Equations Improve the Efficiency of the EM Algorithm for DEC-POMDP
Abstract
1. Introduction
2. DEC-POMDP
3. EM Algorithm for DEC-POMDP
3.1. Control as Inference
3.2. EM Algorithm
3.3. M Step
3.4. E Step
3.5. Summary
Algorithm 1 EM algorithm for DEC-POMDP
4. Bellman EM Algorithm
4.1. Forward and Backward Bellman Equations
4.2. Bellman EM Algorithm (BEM)
4.3. Comparison of EM and BEM
Algorithm 2 Bellman EM algorithm (BEM)
5. Modified Bellman EM Algorithm
5.1. Forward and Backward Bellman Operators
5.2. Modified Bellman EM Algorithm (MBEM)
5.3. Comparison of EM, BEM, and MBEM
Algorithm 3 Modified Bellman EM algorithm (MBEM)
6. Summary of EM, BEM, and MBEM
- EM obtains the forward and backward quantities of the E step by running the forward–backward algorithm up to a finite time horizon. This horizon needs to be large enough to reduce the approximation errors of these quantities, which impairs the computational efficiency.
- BEM obtains the same quantities by solving the forward and backward Bellman equations. BEM can be more efficient than EM because it solves these equations directly instead of running the forward–backward algorithm up to a large horizon. However, BEM is not always more efficient than EM when the state space or the memory space is large, because solving the forward and backward Bellman equations requires computing a matrix inverse.
- MBEM obtains the same quantities by applying the forward and backward Bellman operators repeatedly to initial functions. Since MBEM does not need to compute a matrix inverse, it can remain efficient even when the problem size is large, which resolves the drawback of BEM. Although the number of operator applications needs to be large enough to reduce the approximation errors, which is the same issue as in EM, MBEM can bound these errors more tightly owing to the contraction property of the forward and backward Bellman operators, and it can reuse the results of the previous iteration as the initial functions. These properties enable MBEM to be more efficient than EM. A minimal sketch contrasting the three E steps follows this list.
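To make the contrast between the three E steps concrete, the following is a minimal Python sketch. It is not the paper's DEC-POMDP implementation: the matrix P, the vector r, the discount factor gamma, and all function names are illustrative assumptions, and a single linear fixed-point equation v = r + gamma·P·v stands in for both the forward and backward Bellman equations over the joint state–memory space. The sketch shows EM-style truncation, BEM-style exact solution, and MBEM-style operator iteration with a contraction-based stopping rule and warm start.

```python
import numpy as np

# Illustrative sketch (not the paper's notation). A single linear fixed point
#   v = r + gamma * P @ v
# stands in for the forward and backward Bellman equations, where P is a
# transition matrix over the joint state-memory space under the current policy.


def e_step_em(P, r, gamma, T):
    """EM-style E step: truncate the forward-backward recursion at horizon T."""
    v = np.zeros_like(r)
    term = r.copy()
    for _ in range(T):
        v += term                 # accumulate sum_{k < T} gamma^k P^k r
        term = gamma * P @ term   # next term of the geometric series
    return v                      # truncation error is O(gamma^T / (1 - gamma))


def e_step_bem(P, r, gamma):
    """BEM-style E step: solve the Bellman equation (I - gamma P) v = r exactly."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)   # O(n^3) linear solve


def e_step_mbem(P, r, gamma, v_init, eps):
    """MBEM-style E step: apply the Bellman operator repeatedly, warm-started
    from the previous iteration's result, until a contraction-based bound on
    the approximation error drops below eps."""
    v = v_init.copy()
    while True:
        v_new = r + gamma * P @ v
        # Contraction gives ||v_new - v*||_inf <= gamma/(1-gamma) * ||v_new - v||_inf.
        if gamma / (1.0 - gamma) * np.max(np.abs(v_new - v)) <= eps:
            return v_new
        v = v_new


# Tiny usage example with a random row-stochastic matrix.
rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(4)
gamma = 0.95

v_exact = e_step_bem(P, r, gamma)
v_em = e_step_em(P, r, gamma, T=200)
v_mbem = e_step_mbem(P, r, gamma, v_init=np.zeros(4), eps=1e-6)
print(np.max(np.abs(v_em - v_exact)), np.max(np.abs(v_mbem - v_exact)))
```

In the MBEM-style sketch, passing the previous iteration's result as v_init typically shortens the loop, which mirrors how MBEM reuses the functions obtained in the previous EM iteration as initial functions.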
7. Numerical Experiment
8. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Proof in Section 3
Appendix A.1. Proof of Theorem 1
Appendix A.2. Proof of Proposition 1
Appendix A.3. Proof of Proposition 2
Appendix B. Proof in Section 4
Proof of Theorem 2
Appendix C. Proof in Section 5
Appendix C.1. Proof of Proposition 3
Appendix C.2. Proof of Proposition 4
Appendix C.3. Proof of Proposition 5
Appendix C.4. Proof of Proposition 6
Appendix D. A Note on the Algorithm Proposed by Song et al.
References
- Bertsekas, D.P. Dynamic Programming and Optimal Control: Vol. 1; Athena Scientific: Belmont, MA, USA, 2000.
- Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 2014.
- Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning; MIT Press: Cambridge, MA, USA, 1998; Volume 135.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- Kochenderfer, M.J. Decision Making under Uncertainty: Theory and Application; MIT Press: Cambridge, MA, USA, 2015.
- Oliehoek, F. Value-Based Planning for Teams of Agents in Stochastic Partially Observable Environments; Amsterdam University Press: Amsterdam, The Netherlands, 2010.
- Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; Springer: Berlin/Heidelberg, Germany, 2016; Volume 1.
- Becker, R.; Zilberstein, S.; Lesser, V.; Goldman, C.V. Solving transition independent decentralized Markov decision processes. J. Artif. Intell. Res. 2004, 22, 423–455.
- Nair, R.; Varakantham, P.; Tambe, M.; Yokoo, M. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI-05), Pittsburgh, PA, USA, 9–13 July 2005; pp. 133–139.
- Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 2002, 27, 819–840.
- Bernstein, D.S.; Hansen, E.A.; Zilberstein, S. Bounded policy iteration for decentralized POMDPs. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, UK, 30 July–5 August 2005; pp. 52–57.
- Bernstein, D.S.; Amato, C.; Hansen, E.A.; Zilberstein, S. Policy iteration for decentralized control of Markov decision processes. J. Artif. Intell. Res. 2009, 34, 89–132.
- Amato, C.; Bernstein, D.S.; Zilberstein, S. Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Auton. Agents Multi-Agent Syst. 2010, 21, 293–320.
- Amato, C.; Bonet, B.; Zilberstein, S. Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010.
- Amato, C.; Bernstein, D.S.; Zilberstein, S. Optimizing memory-bounded controllers for decentralized POMDPs. arXiv 2012, arXiv:1206.5258.
- Kumar, A.; Zilberstein, S. Anytime planning for decentralized POMDPs using expectation maximization. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, 8–11 July 2010; pp. 294–301.
- Kumar, A.; Zilberstein, S.; Toussaint, M. Probabilistic inference techniques for scalable multiagent decision making. J. Artif. Intell. Res. 2015, 53, 223–270.
- Toussaint, M.; Storkey, A. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 945–952.
- Todorov, E. General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico, 9–11 December 2008; pp. 4286–4292.
- Kappen, H.J.; Gómez, V.; Opper, M. Optimal control as a graphical model inference problem. Mach. Learn. 2012, 87, 159–182.
- Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv 2018, arXiv:1805.00909.
- Sun, X.; Bischl, B. Tutorial and survey on probabilistic graphical model and variational inference in deep reinforcement learning. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, 6–9 December 2019; pp. 110–119.
- Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
- Toussaint, M.; Harmeling, S.; Storkey, A. Probabilistic Inference for Solving (PO)MDPs; Technical Report EDI-INF-RR-0934; School of Informatics, University of Edinburgh: Edinburgh, UK, 2006.
- Toussaint, M.; Charlin, L.; Poupart, P. Hierarchical POMDP controller optimization by likelihood maximization. UAI 2008, 24, 562–570.
- Kumar, A.; Zilberstein, S.; Toussaint, M. Scalable multiagent planning using probabilistic inference. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011.
- Pajarinen, J.; Peltonen, J. Efficient planning for factored infinite-horizon DEC-POMDPs. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; Volume 22, p. 325.
- Pajarinen, J.; Peltonen, J. Periodic finite state controllers for efficient POMDP and DEC-POMDP planning. Adv. Neural Inf. Process. Syst. 2011, 24, 2636–2644.
- Pajarinen, J.; Peltonen, J. Expectation maximization for average reward decentralized POMDPs. In Machine Learning and Knowledge Discovery in Databases, Proceedings of ECML PKDD 2013, Prague, Czech Republic, 23–27 September 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 129–144.
- Wu, F.; Zilberstein, S.; Jennings, N.R. Monte-Carlo expectation maximization for decentralized POMDPs. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013.
- Liu, M.; Amato, C.; Anesta, E.; Griffith, J.; How, J. Learning for decentralized control of multiagent systems in large, partially-observable stochastic environments. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
- Song, Z.; Liao, X.; Carin, L. Solving DEC-POMDPs by expectation maximization of value function. In Proceedings of the AAAI Spring Symposia, Palo Alto, CA, USA, 21–23 March 2016.
- Kumar, A.; Mostafa, H.; Zilberstein, S. Dual formulations for optimizing Dec-POMDP controllers. In Proceedings of the AAAI, Phoenix, AZ, USA, 12–17 February 2016.
- Bertsekas, D.P. Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 2011, 9, 310–335.
- Liu, D.R.; Li, H.L.; Wang, D. Feature selection and feature learning for high-dimensional batch reinforcement learning: A survey. Int. J. Autom. Comput. 2015, 12, 229–242.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Hallak, A.; Mannor, S. Consistent on-line off-policy evaluation. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1372–1383.
- Gelada, C.; Bellemare, M.G. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 3647–3655.
- Levine, S.; Kumar, A.; Tucker, G.; Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv 2020, arXiv:2005.01643.
- Hansen, E.A.; Bernstein, D.S.; Zilberstein, S. Dynamic programming for partially observable stochastic games. In Proceedings of the AAAI, Palo Alto, CA, USA, 22–24 March 2004; Volume 4, pp. 709–715.
- Seuken, S.; Zilberstein, S. Improved memory-bounded dynamic programming for decentralized POMDPs. arXiv 2012, arXiv:1206.5295.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).