# Robust Reinforcement Learning: A Review of Foundations and Recent Advances

## Abstract

## 1. Introduction

## 2. Preliminaries

#### 2.1. Optimization

#### 2.2. Optimal Control

#### Hamilton-Jacobi-Bellman Equation

#### 2.2.1. Robust Control

#### Relations to Game Theory

#### 2.2.2. Differential Games

#### 2.2.3. Nash Equilibrium

#### 2.2.4. Two-Player Zero-Sum Game

#### 2.3. Reinforcement Learning

#### 2.3.1. Single Agent Reinforcement Learning

#### 2.3.2. Multi Agent Reinforcement Learning

## 3. Robustness in Reinforcement Learning

#### 3.1. Transition and Reward Robust Designs

(**s**, **a**)-rectangular uncertainty set can be reformulated as (r)-rectangular uncertainty set. The authors further propose a robust value iteration algorithm based on (r)-rectangular uncertainty sets for finite-state MDPs. The provided experiments show significantly less conservative behavior compared to the (s)-rectangular approach in [45] while still achieving improved robust performance w.r.t. nominal non-robust MDPs.
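As a point of comparison for the rectangularity notions discussed here, the following is a minimal sketch of robust value iteration on a finite-state MDP. It assumes a finite, (s, a)-rectangular uncertainty set given as a stack of candidate transition kernels, which is the simplest textbook setting and not the (r)-rectangular algorithm of the cited work. The function name `robust_value_iteration`, the arrays `P_set` and `R`, and the toy numbers are all hypothetical.

```python
import numpy as np


def robust_value_iteration(P_set, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Robust value iteration for a finite MDP (illustrative sketch).

    P_set: array of shape (K, S, A, S) holding K candidate transition kernels.
           Under (s, a)-rectangularity the adversary may pick a different
           candidate for every state-action pair (assumption of this sketch).
    R:     array of shape (S, A) with rewards.
    Returns the robust value function and a greedy robust policy.
    """
    _, n_states, _, _ = P_set.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Expected next-state value under every candidate kernel: (K, S, A).
        expected_next = P_set @ V
        # The adversary minimises independently for each (s, a) pair.
        worst_case = expected_next.min(axis=0)
        V_new = (R + gamma * worst_case).max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P_set @ V).min(axis=0)
    return V, Q.argmax(axis=1)


# Toy problem with two states and two actions (hypothetical numbers).
P_nominal = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # state 0: action 0, action 1
    [[0.5, 0.5], [0.1, 0.9]],   # state 1: action 0, action 1
])
P_perturbed = np.array([
    [[0.6, 0.4], [0.5, 0.5]],
    [[0.8, 0.2], [0.4, 0.6]],
])
R = np.array([[0.0, 0.0], [1.0, 1.0]])  # reward only in state 1

V_nom, pi_nom = robust_value_iteration(P_nominal[None], R)
V_rob, pi_rob = robust_value_iteration(np.stack([P_nominal, P_perturbed]), R)
```

Because the adversary minimises over a larger set in the second call, `V_rob` cannot exceed `V_nom` in any state. The gap between the two is one way to see how conservative the rectangular worst case is on a given problem.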

#### 3.2. Disturbance Robust Designs

#### 3.3. Action Robust Designs

#### 3.4. Observation Robust Designs

#### 3.5. Relations to Maximum Entropy RL and Risk Sensitivity

## 4. Conclusions

#### 4.1. Summary

**Transition robust** methods define an uncertainty set of possible transition functions [23,24,25,44,45]. For finite-state MDPs, convergence guarantees are given [24,25]. However, tractability requires the strict assumption of rectangularity, and a consequence is overly pessimistic policies. Modern contributions center on these deficiencies of traditional transition robust RL and tackle the pessimistic behavior in three different ways. First, the authors in [39,41,52] propose a trade-off between robust and non-robust performance: in a multi-objective optimization scheme, the importance of robust performance is lowered in favor of non-robust performance measures. Second, another set of works identifies the rectangularity property as the source of the pessimistic behavior [42,47,48]. They propose non-rectangular sets of coupled uncertainty that remain tractable; the non-rectangularity effectively restricts worst-case outcomes to more realistic cases. Third, as a combination of stochastic and robust optimization, distributionally robust methods weaken the worst-case formulation. Instead of a definite worst-case transition function, the adversary only chooses a worst-case distribution over transition functions. This additional layer of uncertainty prevents convergence to overly pessimistic policies [38,40,43,46,49,50,51,53,54]. In addition, the literature addresses the restriction of traditional transition robust designs to finite-state MDPs. As the dimensionality of the state and action spaces grows, the classical methods suffer from the curse of dimensionality, and modern systems can rarely be described as low-dimensional discrete problems. Proposals include linear and non-linear function approximation, e.g., approximate dynamic programming [10,136].
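For orientation, the worst-case objectives behind these method families can be sketched as follows. The notation is assumed for illustration and is not taken from this summary: $\mathcal{P}_{s,a}$ is the set of admissible next-state distributions for a state-action pair, $r$ the reward, $\gamma$ the discount factor, and $\mathcal{M}$ an ambiguity set of distributions over transition functions. Under (s, a)-rectangularity the uncertainty set factorizes over state-action pairs, and the robust value function satisfies a Bellman-type fixed-point equation:

$$
V(s) = \max_{a \in \mathcal{A}} \; \min_{p \in \mathcal{P}_{s,a}} \Big[ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s')\, V(s') \Big],
\qquad
\mathcal{P} = \prod_{(s,a) \in \mathcal{S} \times \mathcal{A}} \mathcal{P}_{s,a}.
$$

A distributionally robust formulation, in the same assumed notation, lets the adversary choose only a distribution $\mu$ over transition functions rather than a single worst-case transition function:

$$
\max_{\pi} \; \min_{\mu \in \mathcal{M}} \; \mathbb{E}_{P \sim \mu} \Big[ \mathbb{E}^{\pi, P} \Big[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \Big] \Big].
$$

When $\mathcal{M}$ contains only point masses, the second objective reduces to the plain worst case over transition functions, which is one way to read the claim that the distributional layer weakens the adversary.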

**Disturbance robust** designs rely on external forces to express uncertainty in the system dynamics. Methods utilize this relation to define disturbing adversaries [7,8,9]. A core advantage is that no explicit uncertainty set is required. However, the more recent contributions are demonstrated only empirically, without mathematical guarantees [8,9]. Compared to the other categories, the disturbance robust design has received few scientific contributions.

**Action robust** designs instead model disturbances as perturbations of the agent’s actions. The literature introduces two variations of action robust designs, each depicting a different type of external disturbance. Probabilistic action robust MDPs (PRMDPs) consider sudden disrupting forces and simulate rare but catastrophic events, e.g., crashes in autonomous driving [11,55,56]. Noisy action robust MDPs, on the other hand, describe continuous perturbations of actions to simulate changes in physical parameters [11]. Both variants define a joint policy as a linear interpolation between the protagonist’s and the adversary’s policy. Further, recent work adopts the concept of adversarial attacks to produce action robust agents [57]. Adversarial attacks are mainly known from input perturbations in deep learning [161,162].
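The two action robust variants can be sketched in a few lines. This is an illustrative reading of the interpolation described above, assuming a mixing weight `alpha` in [0, 1] and continuous actions stored as NumPy arrays. The function names and the parameter `alpha` are hypothetical and not taken from the cited works.

```python
import numpy as np


def noisy_robust_action(a_agent, a_adversary, alpha):
    """Noisy variant: the executed action is a convex combination of both
    actions, i.e. a continuous perturbation of the agent's action."""
    a_agent = np.asarray(a_agent, dtype=float)
    a_adversary = np.asarray(a_adversary, dtype=float)
    return (1.0 - alpha) * a_agent + alpha * a_adversary


def probabilistic_robust_action(a_agent, a_adversary, alpha, rng):
    """Probabilistic variant: with probability alpha the adversary's action
    replaces the agent's action entirely (a rare, abrupt disruption)."""
    if rng.random() < alpha:
        return np.asarray(a_adversary, dtype=float)
    return np.asarray(a_agent, dtype=float)


rng = np.random.default_rng(0)
blended = noisy_robust_action([1.0, 0.0], [-1.0, 0.0], alpha=0.25)
never_disturbed = probabilistic_robust_action([1.0, 0.0], [-1.0, 0.0], 0.0, rng)
always_disturbed = probabilistic_robust_action([1.0, 0.0], [-1.0, 0.0], 1.0, rng)
```

In the noisy case every executed action is shifted slightly towards the adversary. In the probabilistic case most actions are untouched, but occasionally the adversary takes over completely.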

**Observation robust** designs leverage the vulnerability of policies to input perturbations [61,62,63,64,145]. The adversary exploits this vulnerability to distort the protagonist’s perception. Consequently, the decision-making process is redirected to produce worst-case transitions. Most works define adversarial attacks as a direct optimization over the observation or state space. As such, the presented methods effectively separate the optimization procedure: instead of utilizing an adversarial RL formulation, the robust policy is obtained through classical RL algorithms [61,62,63,64]. Another work focuses on limiting an agent’s response to adversarial attacks to provide robustness guarantees and certification bounds [65].
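To make the idea of an observation attack concrete, the sketch below applies a single fast-gradient-sign step to the observation of a linear softmax policy. The fast gradient sign method is used here only as a familiar stand-in for the attacks discussed above, and the linear policy, the weight matrix `W`, and the budget `eps` are assumptions chosen to keep the gradient analytic.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def fgsm_observation_attack(W, obs, eps):
    """One signed-gradient step on the observation of a linear softmax policy.

    The policy is pi(a | obs) = softmax(W @ obs)[a]. The attack lowers the
    log-probability of the action preferred at the clean observation while
    staying inside an L-infinity ball of radius eps around obs.
    """
    probs = softmax(W @ obs)
    a_star = int(np.argmax(probs))
    # Gradient of log pi(a_star | obs) with respect to obs.
    grad = W[a_star] - probs @ W
    return obs - eps * np.sign(grad)


W = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # hypothetical policy weights
obs = np.array([1.0, 0.8])
obs_adv = fgsm_observation_attack(W, obs, eps=0.2)

p_clean = softmax(W @ obs)
p_adv = softmax(W @ obs_adv)
```

With these toy numbers the perturbation stays within the budget, and the probability of the originally preferred action drops from `p_clean[0]` to `p_adv[0]`. Training a policy on such perturbed observations with a standard RL algorithm is one way to realize the separation between attack and policy optimization described above.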

#### 4.2. Outlook

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
- Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 2014.
- Franklin, G.F.; Powell, J.D.; Emami-Naeini, A. Feedback Control of Dynamic Systems; Addison-Wesley: Reading, MA, USA, 1994.
- Bennett, S. A brief history of automatic control. IEEE Control Syst. Mag. **1996**, 16, 17–25.
- Bryson, A.E. Optimal control-1950 to 1985. IEEE Control Syst. Mag. **1996**, 16, 26–33.
- Kirk, D.E. Optimal Control Theory: An Introduction; Courier Corporation: Chelmsford, MA, USA, 2012.
- Morimoto, J.; Doya, K. Robust Reinforcement Learning. In Advances in Neural Information Processing Systems 13; Leen, T.K., Dietterich, T.G., Tresp, V., Eds.; MIT Press: Cambridge, MA, USA, 2001.
- Pinto, L.; Davidson, J.; Gupta, A. Supervision via Competition: Robot Adversaries for Learning Tasks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017.
- Pinto, L.; Davidson, J.; Sukthankar, R.; Gupta, A. Robust Adversarial Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017.
- Tamar, A.; Xu, H.; Mannor, S. Scaling up robust MDPs by reinforcement learning. arXiv **2013**, arXiv:1306.6189.
- Tessler, C.; Efroni, Y.; Mannor, S. Action Robust Reinforcement Learning and Applications in Continuous Control. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR, Long Beach, CA, USA, 9–15 June 2019.
- Zhou, K.; Doyle, J.C. Essentials of Robust Control; Prentice Hall: Upper Saddle River, NJ, USA, 1998.
- Ben-Tal, A.; El Ghaoui, L.; Nemirovski, A. Robust Optimization; Princeton University Press: Princeton, NJ, USA, 2009.
- Hansen, L.P.; Sargent, T.J. Robustness; Princeton University Press: Princeton, NJ, USA, 2016.
- Ben-Tal, A.; Nemirovski, A. Robust convex optimization. Math. Oper. Res. **1998**, 23, 769–805.
- Ben-Tal, A.; Nemirovski, A. Robust optimization–methodology and applications. Math. Program. **2002**, 92, 453–480.
- Safonov, M.G. Origins of robust control: Early history and future speculations. Annu. Rev. Control **2012**, 26, 173–181.
- Zames, G. Feedback and optimal sensitivity: Model reference transformations, multiplicative seminorms, and approximate inverses. IEEE Trans. Autom. Control **1981**, 26, 301–320.
- Doyle, J. Analysis of feedback systems with structured uncertainties. In IEE Proceedings D-Control Theory and Applications; IET: London, UK, 1982.
- Zames, G.; Francis, B. Feedback, minimax sensitivity, and optimal robustness. IEEE Trans. Autom. Control **1983**, 28, 585–601.
- Doyle, J.C.; Glover, K.; Khargonekar, P.P.; Francis, B.A. State-space solutions to standard H_{2} and H_{∞} control problems. IEEE Trans. Autom. Control **1989**, 1691–1696.
- Van Der Schaft, A.J. L_{2}-gain analysis of nonlinear systems and nonlinear state feedback H_{∞} control. IEEE Trans. Autom. Control **1992**, 37, 770–784.
- Bagnell, J.A.; Ng, A.Y.; Schneider, J.G. Solving Uncertain Markov Decision Processes; Carnegie Mellon University, the Robotics Institute: Pittsburgh, PA, USA, 2001.
- Nilim, A.; El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Oper. Res. **2005**, 53, 780–798.
- Iyengar, G.N. Robust dynamic programming. Math. Oper. Res. **2005**, 30, 257–280.
- Glover, K.; Doyle, J.C. State-space formulae for all stabilizing controllers that satisfy an H_{∞}-norm bound and relations to risk sensitivity. Syst. Control Lett. **1988**, 11, 167–172.
- Basar, T.; Bernhard, P. H_{∞}-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach; Birkhäuser: Boston, MA, USA, 2008.
- Limebeer, D.J.N.; Anderson, B.D.O.; Khargonekar, P.P.; Green, M. A Game Theoretic Approach to H_{∞} Control for Time-varying Systems. SIAM J. Control Optim. **1992**, 30, 262–283.
- McEneaney, W.M. Robust control and differential games on a finite time horizon. Math. Control Signals Syst. **1995**, 8, 138–166.
- Isaacs, R. Differential Games I: Introduction; Technical Report; Rand Corp: Santa Monica, CA, USA, 1954.
- Owen, G. Game Theory; Academic Press: Cambridge, MA, USA, 1982.
- Ho, Y.; Bryson, A.; Baron, S. Differential games and optimal pursuit-evasion strategies. IEEE Trans. Autom. Control **1965**, 10, 385–389.
- Starr, A.W.; Ho, Y.C. Nonzero-sum differential games. J. Optim. Theory Appl. **1969**, 3, 184–206.
- Littman, M.L. Value-function reinforcement learning in Markov games. Cogn. Syst. Res. **2001**, 2, 55–66.
- Uther, W.; Veloso, M. Adversarial Reinforcement Learning; Carnegie Mellon University: Pittsburgh, PA, USA, 1997.
- Shoham, Y.; Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations; Cambridge University Press: Cambridge, UK, 2008.
- LaValle, S.M. Robot Motion Planning: A Game-Theoretic Foundation. Algorithmica **2000**, 26, 430–465.
- Charalambous, C.D.; Rezaei, F. Stochastic uncertain systems subject to relative entropy constraints: Induced norms and monotonicity properties of minimax games. IEEE Trans. Autom. Control **2007**, 52, 647–663.
- Xu, H.; Mannor, S. The robustness-performance tradeoff in Markov decision processes. In Advances in Neural Information Processing Systems (NIPS); MIT Press: Cambridge, MA, USA, 2006.
- Xu, H.; Mannor, S. Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems (NIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2010.
- Delage, E.; Mannor, S. Percentile optimization for Markov decision processes with parameter uncertainty. Oper. Res. **2010**, 58, 203–213.
- Mannor, S.; Mebel, O.; Xu, H. Lightning does not strike twice: Robust MDPs with coupled uncertainty. arXiv **2012**, arXiv:1206.4643.
- Hu, Z.; Hong, L.J. Kullback-Leibler Divergence Constrained Distributionally Robust Optimization. Available at Optimization Online. 2013. Available online: https://asset-pdf.scinapse.io/prod/2562747313/2562747313.pdf (accessed on 12 March 2022).
- Lim, S.H.; Xu, H.; Mannor, S. Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems (NIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2013.
- Wiesemann, W.; Kuhn, D.; Rustem, B. Robust Markov Decision Processes. Math. Oper. Res. **2013**, 38, 153–183.
- Yu, P.; Xu, H. Distributionally robust counterpart in Markov decision processes. IEEE Trans. Autom. Control **2015**, 61, 2538–2543.
- Mannor, S.; Mebel, O.; Xu, H. Robust MDPs with k-rectangular uncertainty. Math. Oper. Res. **2016**, 41, 1484–1509.
- Goyal, V.; Grand-Clement, J. Robust Markov Decision Process: Beyond Rectangularity. arXiv **2018**, arXiv:1811.00215.
- Smirnova, E.; Dohmatob, E.; Mary, J. Distributionally robust reinforcement learning. arXiv **2019**, arXiv:1902.08708.
- Coulson, J.; Lygeros, J.; Dörfler, F. Regularized and Distributionally Robust Data-Enabled Predictive Control. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 11–13 December 2019.
- Derman, E.; Mannor, S. Distributional robustness and regularization in reinforcement learning. arXiv **2020**, arXiv:2003.02894.
- Turchetta, M.; Krause, A.; Trimpe, S. Robust model-free reinforcement learning with multi-objective Bayesian optimization. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020.
- Abdulsamad, H.; Dorau, T.; Belousov, B.; Zhu, J.J.; Peters, J. Distributionally Robust Trajectory Optimization Under Uncertain Dynamics via Relative-Entropy Trust Regions. arXiv **2021**, arXiv:2103.15388.
- Yang, I. Wasserstein Distributionally Robust Stochastic Control: A Data-Driven Approach. IEEE Trans. Autom. Control **2021**, 66, 3863–3870.
- Klima, R.; Bloembergen, D.; Kaisers, M.; Tuyls, K. Robust temporal difference learning for critical domains. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS). International Foundation for Autonomous Agents and Multiagent Systems, Montreal, QC, Canada, 13–17 May 2019.
- Pan, X.; Seita, D.; Gao, Y.; Canny, J. Risk Averse Robust Adversarial Reinforcement Learning. arXiv **2019**, arXiv:1901.08021.
- Tan, K.L.; Esfandiari, Y.; Lee, X.Y.; Sarkar, S. Robustifying reinforcement learning agents via action space adversarial training. In Proceedings of the 2020 American Control Conference (ACC), Denver, CO, USA, 1–3 July 2020.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27 (NIPS); Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014.
- Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv **2014**, arXiv:1412.6572.
- Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; Abbeel, P. Adversarial Attacks on Neural Network Policies. arXiv **2017**, arXiv:1702.02284.
- Mandlekar, A.; Zhu, Y.; Garg, A.; Fei-Fei, L.; Savarese, S. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017.
- Pattanaik, A.; Tang, Z.; Liu, S.; Bommannan, G.; Chowdhary, G. Robust Deep Reinforcement Learning with Adversarial Attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS). International Foundation for Autonomous Agents and Multiagent Systems, Stockholm, Sweden, 10–15 July 2018.
- Gleave, A.; Dennis, M.; Kant, N.; Wild, C.; Levine, S.; Russell, S. Adversarial Policies: Attacking Deep Reinforcement Learning. arXiv **2019**, arXiv:1905.10615.
- Zhang, H.; Chen, H.; Xiao, C.; Li, B.; Liu, M.; Boning, D.; Hsieh, C.J. Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations. arXiv **2020**, arXiv:2003.08938.
- Lütjens, B.; Everett, M.; How, J.P. Certified adversarial robustness for deep reinforcement learning. In Proceedings of the Conference on Robot Learning (CoRL), Osaka, Japan, 30 October–1 November 2019.
- Grünwald, P.D.; Dawid, A.P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Stat. **2004**, 32, 1367–1433.
- Osogami, T. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems (NIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2012.
- Eysenbach, B.; Levine, S. If MaxEnt RL is the Answer, What is the Question? arXiv **2019**, arXiv:1910.01913.
- Eysenbach, B.; Levine, S. Maximum entropy RL (provably) solves some robust RL problems. arXiv **2021**, arXiv:2103.06257.
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
- Papageorgiou, M.; Leibold, M.; Buss, M. Optimierung; Springer: Berlin/Heidelberg, Germany, 1991.
- Kall, P.; Wallace, S.W. Stochastic Programming; Springer: Berlin/Heidelberg, Germany, 1994.
- Beyer, H.G.; Sendhoff, B. Robust optimization–A comprehensive survey. Comput. Methods Appl. Mech. Eng. **2007**, 196, 3190–3218.
- Xu, H.; Caramanis, C.; Mannor, S. A distributional interpretation of robust optimization. Math. Oper. Res. **2012**, 37, 95–110.
- Wiesemann, W.; Kuhn, D.; Sim, M. Distributionally robust convex optimization. Oper. Res. **2014**, 62, 1358–1376.
- Heger, M. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994.
- Scarf, H.E. A Min-Max Solution of an Inventory Problem; Technical Report; Rand Corp: Santa Monica, CA, USA, 1957.
- Bolza, O. Vorlesungen über Variationsrechnung; BG Teubner: Stuttgart, Germany, 1909; Available online: https://diglib.uibk.ac.at/ulbtirol/content/titleinfo/372088 (accessed on 12 March 2022).
- McShane, E.J. On multipliers for Lagrange problems. Am. J. Math. **1939**, 61, 809–819.
- Bliss, G.A. Lectures on the Calculus of Variations; University of Chicago Press: Chicago, IL, USA, 1946.
- Cicala, P. An Engineering Approach to the Calculus of Variations; Libreria Editrice Universitaria Levrotto & Bella: Turin, Italy, 1957.
- Pontryagin, L.S. Mathematical Theory of Optimal Processes; Routledge: London, UK, 2018.
- Bellman, R. The theory of dynamic programming. Bull. Am. Math. Soc. **1954**, 60, 503–515.
- Doya, K. Reinforcement Learning in Continuous Time and Space. Neural Comput. **2000**, 12, 219–245.
- Bellman, R. A Markovian decision process. J. Math. Mech. **1957**, 6, 679–684.
- Kalman, R.E. Contributions to the theory of optimal control. Bol. Soc. Mat. Mex. **1960**, 5, 102–119.
- Kalman, R.E.; Bertram, J.E. Control system analysis and design via the “second method” of Lyapunov: I—Continuous-time systems. J. Basic Eng. **1960**, 82, 371–393.
- Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. **1960**, 82, 35–45.
- Von Neumann, J.; Morgenstern, O. Theory of Games and Economic Behavior; Princeton University Press: Princeton, NJ, USA, 1944.
- Isaacs, R. Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization; Dover Publications: Mineola, NY, USA, 1999.
- Nash, J. Non-cooperative games. Ann. Math. **1951**, 54, 286–295.
- Awheda, M. On Multi-Agent Reinforcement Learning in Matrix, Stochastic and Differential Games. Ph.D. Thesis, Carleton University, Ottawa, ON, Canada, 2017.
- Bowling, M.H.; Veloso, M.M. An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning. 2000. Available online: https://apps.dtic.mil/sti/citations/ADA385122 (accessed on 12 March 2022).
- Howard, R.A. Dynamic Programming and Markov Processes. 1960. Available online: https://psycnet.apa.org/record/1961-01474-000 (accessed on 12 March 2022).
- Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994.
- Sharma, R.; Gopal, M. A robust Markov game controller for nonlinear systems. Appl. Soft Comput. **2007**, 7, 818–827.
- Monahan, G.E. State of the art—A survey of partially observable Markov decision processes: Theory, models, and algorithms. Manag. Sci. **1982**, 28, 1–16.
- Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and acting in partially observable stochastic domains. Artif. Intell. **1998**, 101, 99–134.
- Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. **1999**, 12.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv **2013**, arXiv:1312.5602.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv **2015**, arXiv:1509.02971.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning PMLR, Stockholm, Sweden, 10–15 July 2018.
- Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. **1992**, 8, 229–256.
- Kakade, S.M. A natural policy gradient. In Advances in Neural Information Processing Systems; NIPS: Cambridge, MA, USA, 2001; Volume 14.
- Peters, J.; Schaal, S. Natural actor-critic. Neurocomputing **2008**, 71, 1180–1190.
- Peters, J.; Mulling, K.; Altun, Y. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010.
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv **2017**, arXiv:1707.06347.
- Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Perolat, J.; Silver, D.; Graepel, T. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. In Advances in Neural Information Processing Systems; NIPS: Cambridge, MA, USA, 2017; Volume 30.
- Laurent, G.J.; Matignon, L.; Fort-Piat, L. The world of independent learners is not Markovian. Int. J. Knowl.-Based Intell. Eng. Syst. **2011**, 15, 55–64.
- Claus, C.; Boutilier, C. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. In Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence. American Association for Artificial Intelligence, Madison, WI, USA, 26–30 July 1998.
- Shapley, L.S. Stochastic games. Proc. Natl. Acad. Sci. USA **1953**, 39, 1095–1100.
- Buşoniu, L.; Babuška, R.; Schutter, B.D. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1; Springer: Berlin/Heidelberg, Germany, 2010.
- Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. **2021**, 55, 895–943.
- Littman, M.L.; Szepesvári, C. A generalized reinforcement-learning model: Convergence and applications. ICML **1996**, 96, 310–318.
- Szepesvári, C.; Littman, M.L. A unified analysis of value-function-based reinforcement-learning algorithms. Neural Comput. **1999**, 11, 2017–2060.
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature **2016**, 529, 484–489.
- Foerster, J.; Assael, I.A.; De Freitas, N.; Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. Adv. Neural Inf. Process. Syst. **2016**, 29.
- Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems; NIPS: Cambridge, MA, USA, 2017; Volume 30.
- Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Stockholm, Sweden, 10–15 July 2018.
- Mahajan, A.; Rashid, T.; Samvelyan, M.; Whiteson, S. Maven: Multi-agent variational exploration. In Advances in Neural Information Processing Systems (NIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2019.
- Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.E.; Yi, Y. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR, Long Beach, CA, USA, 9–15 June 2019.
- Rashid, T.; Farquhar, G.; Peng, B.; Whiteson, S. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Adv. Neural Inf. Process. Syst. **2020**, 33, 10199–10210.
- Zhu, Y.; Zhao, D. Online minimax Q network learning for two-player zero-sum Markov games. IEEE Trans. Neural Netw. Learn. Syst. **2020**, 33, 1228–1241.
- Yu, C.; Velu, A.; Vinitsky, E.; Wang, Y.; Bayen, A.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games. arXiv **2021**, arXiv:2103.01955.
- Xu, H.; Mannor, S. Robustness and generalization. Mach. Learn. **2012**, 86, 391–423.
- Satia, J.K.; Lave Jr, R.E. Markovian decision processes with uncertain transition probabilities. Oper. Res. **1973**, 21, 728–740.
- Xiao, C.; Li, B.; Zhu, J.Y.; He, W.; Liu, M.; Song, D. Generating adversarial examples with adversarial networks. arXiv **2018**, arXiv:1801.02610.
- White, C.C., III; Eldeib, H.K. Markov decision processes with imprecise transition probabilities. Oper. Res. **1994**, 42, 739–749.
- Givan, R.; Leach, S.; Dean, T. Bounded parameter Markov decision processes. In European Conference on Planning; Springer: Berlin/Heidelberg, Germany, 1997.
- Littman, M.L. Memoryless policies: Theoretical limitations and practical results. In From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior; MIT Press: Cambridge, MA, USA, 1994.
- Zhu, J.J.; Jitkrittum, W.; Diehl, M.; Schölkopf, B. Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem. In Proceedings of the 2020 59th IEEE Conference on Decision and Control (CDC), Jeju, Korea, 14–18 December 2020.
- Gupta, V. Near-optimal Bayesian ambiguity sets for distributionally robust optimization. Manag. Sci. **2019**, 65, 4242–4260.
- Rahimian, H.; Mehrotra, S. Distributionally robust optimization: A review. arXiv **2019**, arXiv:1908.05659.
- Badrinath, K.P.; Kalathil, D. Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees. In Proceedings of the International Conference on Machine Learning PMLR, Virtual, 18–24 July 2021.
- Abdullah, M.A.; Ren, H.; Ammar, H.B.; Milenkovic, V.; Luo, R.; Zhang, M.; Wang, J. Wasserstein robust reinforcement learning. arXiv **2019**, arXiv:1907.13196.
- Todorov, E.; Erez, T.; Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012.
- O’Donoghue, B.; Osband, I.; Munos, R.; Mnih, V. The uncertainty Bellman equation and exploration. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
- Derman, E.; Mankowitz, D.; Mann, T.; Mannor, S. A Bayesian Approach to Robust Reinforcement Learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), PMLR, Tel Aviv, Israel, 22–25 July 2019.
- Rajeswaran, A.; Ghotra, S.; Ravindran, B.; Levine, S. Epopt: Learning robust neural network policies using model ensembles. arXiv **2016**, arXiv:1610.01283.
- Tamar, A.; Glassner, Y.; Mannor, S. Optimizing the CVaR via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015.
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv **2016**, arXiv:1606.01540.
- Mankowitz, D.J.; Levine, N.; Jeong, R.; Shi, Y.; Kay, J.; Abdolmaleki, A.; Springenberg, J.T.; Mann, T.; Hester, T.; Riedmiller, M. Robust reinforcement learning for continuous control with model misspecification. arXiv **2019**, arXiv:1906.07516.
- Lutter, M.; Mannor, S.; Peters, J.; Fox, D.; Garg, A. Robust Value Iteration for Continuous Control Tasks. arXiv **2021**, arXiv:2105.12189.
- Papernot, N.; McDaniel, P.; Jha, S.; Fredrikson, M.; Celik, Z.B.; Swami, A. The limitations of deep learning in adversarial settings. In Proceedings of the 2016 IEEE European Symposium on Security and Privacy (EuroS&P), Saarbruecken, Germany, 21–24 March 2016.
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv **2013**, arXiv:1312.6199.
- Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
- Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011.
- Salman, H.; Yang, G.; Zhang, H.; Hsieh, C.J.; Zhang, P. A convex relaxation barrier to tight robustness verification of neural networks. arXiv **2019**, arXiv:1902.08722.
- Ziebart, B.D. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy; Carnegie Mellon University: Pittsburgh, PA, USA, 2010.
- Howard, R.A.; Matheson, J.E. Risk-sensitive Markov decision processes. Manag. Sci. **1972**, 18, 356–369.
- Jaquette, S.C. A utility criterion for Markov decision processes. Manag. Sci. **1976**, 23, 43–49.
- Denardo, E.V.; Rothblum, U.G. Optimal stopping, exponential utility, and linear programming. Math. Program. **1979**, 16, 228–244.
- Patek, S.D. On terminating Markov decision processes with a risk-averse objective function. Automatica **2001**, 37, 1379–1386.
- Osogami, T. Iterated risk measures for risk-sensitive Markov decision processes with discounted cost. arXiv **2012**, arXiv:1202.3755.
- Whittle, P. Risk-sensitive linear quadratic Gaussian control. Adv. Appl. Probab. **1981**, 13, 764–777.
- Whittle, P. Risk sensitivity, a strangely pervasive concept. Macroecon. Dyn. **2002**, 6, 5–18.
- Nass, D.; Belousov, B.; Peters, J. Entropic Risk Measure in Policy Search. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019.
- Petersen, I.; James, M.; Dupuis, P. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Trans. Autom. Control
**2000**, 45, 398–412. [Google Scholar] [CrossRef] [Green Version] - Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv
**2017**, arXiv:1706.06083. [Google Scholar] - Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z.B.; Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, Abu Dhabi, United Arab Emirates, 2–6 April 2017. [Google Scholar]

**Figure 1.** (**a**) A plant without model uncertainty, where $\mathit{w}$ is a vector signal containing external noise, disturbances, and the reference signal. The system output is given in $\mathit{z}$. Measurements are represented by $\mathit{y}$, while $\mathit{u}$ is the control signal. (**b**) A plant with all possible model uncertainty expressed as $\Delta$. Here ${\mathit{w}}_{0}$ depicts external noise, disturbances, and the reference signal. Now $\mathit{w}$ is a signal representing parameter perturbations and model uncertainty. The system output is again described with ${\mathit{z}}_{0}$ and $\mathit{z}$. Measurement and control signals are given by $\mathit{y}$ and $\mathit{u}$, respectively. Both plants are stabilized by a controller K [12].

**Figure 2.** Illustration of a Nash equilibrium in a two-player zero-sum game. The protagonist minimizes the objective $\min_{{\mathit{u}}_{1}}{J}_{1}({\mathit{u}}_{1},{\mathit{u}}_{2})$ while the adversary counteracts through maximization $\max_{{\mathit{u}}_{2}}{J}_{2}({\mathit{u}}_{1},{\mathit{u}}_{2})$. In this equilibrium, each player achieves an optimal payoff when both follow the optimal actions $({\mathit{u}}_{1}^{*},{\mathit{u}}_{2}^{*})$. (i) In case 1, the protagonist deviates from the optimal action by $\Delta {\mathit{u}}_{1}$ while the adversary keeps its optimal action. Consequently, the protagonist obtains a higher value of the objective it minimizes, i.e., a worse outcome, whereas the adversary reaches a better objective. (ii) In case 2, the protagonist follows the optimal action, but the adversary changes its action by $\Delta {\mathit{u}}_{2}$. As a result, the protagonist achieves a better outcome and the adversary's outcome deteriorates. Therefore, the Nash equilibrium corresponds to a solution in which both players achieve the best possible outcome w.r.t. each other. A unilateral change in policy causes a loss for the deviating player and should be avoided.
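The two deviation cases in Figure 2 can be checked numerically. The following is a minimal sketch and not part of the reviewed methods: the objective `J`, the candidate point, and the tested deviations are hypothetical, and it assumes a single shared objective that the protagonist minimizes and the adversary maximizes.

```python
def J(u1: float, u2: float) -> float:
    """Hypothetical zero-sum objective: convex in u1, concave in u2."""
    return u1**2 - u2**2 + u1 * u2


def is_saddle_point(u1: float, u2: float,
                    deltas=(-0.5, -0.1, 0.1, 0.5)) -> bool:
    """Check that no tested unilateral deviation benefits the deviating player."""
    base = J(u1, u2)
    # Case (i): the protagonist (minimizer) deviates, the adversary stays put.
    protagonist_ok = all(J(u1 + d, u2) >= base for d in deltas)
    # Case (ii): the adversary (maximizer) deviates, the protagonist stays put.
    adversary_ok = all(J(u1, u2 + d) <= base for d in deltas)
    return protagonist_ok and adversary_ok


# (0, 0) is the stationary point of this particular J; (1, 0) is not.
print(is_saddle_point(0.0, 0.0))
print(is_saddle_point(1.0, 0.0))
```

The check only samples a few finite deviations, so it illustrates the equilibrium condition rather than proving it.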

**Figure 3.** Schematic representation of the Markov decision process. Transitions are depicted in a discrete-time formulation to define variables at every time step. Each state is composed of a vector of observations or sensory inputs of the agent. The agent can choose an action from a given action space at every step, which is evaluated through a reward function.

**Figure 4.** Schematic representation of the two-player zero-sum Markov game. Transitions are depicted in a discrete-time formulation to define variables at every time step. Each state is composed of a vector of observations or sensory inputs of the agents. The agents can choose an action from their respective action spaces evaluated through a reward function at every step. For the specific two-player zero-sum case as depicted here, ${\overline{r}}_{t}=-{r}_{t}$.

**Figure 5.** Schematic representation of methods following a transition robust design. The framework considers adversaries taking control of the transition function. The adversary selects a transition function (here described as a distribution) from a predefined uncertainty set. This decision is based on the current state of the environment and the action chosen by the protagonist.

**Figure 6.** Illustration of the underlying concept of disturbance robust designs. Uncertainties in the system dynamics are modeled as disruptive forces. These forces represent an additional condition on the transition probabilities. As such, the transition function shifts according to the adversarial action to produce the worst possible outcomes for the protagonist.

**Figure 7.** The underlying concept behind action robust designs. The framework considers two different scenarios: (i) the protagonist's action is distorted by the adversary through immediate gradient optimization; (ii) a joint action is defined as a linear combination of the protagonist's and the adversary's actions.
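Scenario (ii) of Figure 7 can be illustrated with a short sketch. The convex weighting with a mixing coefficient `alpha` in [0, 1] is an assumption made here for illustration, since the caption only states that the joint action is a linear combination. The function name and the example values are hypothetical.

```python
from typing import List, Sequence


def joint_action(a_protagonist: Sequence[float],
                 a_adversary: Sequence[float],
                 alpha: float) -> List[float]:
    """Mix two action vectors as (1 - alpha) * a_protagonist + alpha * a_adversary.

    alpha = 0 leaves the protagonist in full control; alpha = 1 hands
    control to the adversary (assumed convention, not taken from the source).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(a_protagonist) != len(a_adversary):
        raise ValueError("action vectors must have the same dimension")
    return [(1.0 - alpha) * p + alpha * q
            for p, q in zip(a_protagonist, a_adversary)]


# The adversary contributes a quarter of the executed action.
executed = joint_action([1.0, 0.0], [-1.0, 2.0], alpha=0.25)
print(executed)
```

Under this assumed convention, `alpha` acts as a knob for how much adversarial influence the protagonist must tolerate.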

**Figure 8.** Illustration of observation robust designs. The adversary distorts the states or the protagonist’s perception of the states. As a consequence, the protagonist makes detrimental decisions. The design relies on the vulnerability of neural networks to input perturbations.

Item | Definition |
---|---|
State | $\mathit{s}\in \mathcal{S}$ |
Action | $\mathit{a}\in \mathcal{A}$ |
Reward | r |
Stochastic Transition Matrix | $\mathcal{P}$ |
Transition probability | $p({\mathit{s}}^{\prime}\mid \mathit{s},\mathit{a})$ |
Horizon | T |
Uncertainty | $\mathcal{U}$ |
Objective | $J$ |
Bellman Operator | $\tau $ |
History | $\mathit{h}\in \mathcal{H}$ |
Parameter | $\theta $ |
Value Function | V |
Q Function | Q |
Policy/Strategy | $\pi $ |
Discount Factor | $\gamma $ |
Expectation | $\mathbb{E}$ |
Variance | $\mathbb{V}\mathrm{ar}$ |
Identity Matrix | $\mathcal{I}$ |
Probability Distribution | $\mathbb{P}$ |
Learning Rate | $\alpha $ |
Adversary | A |
Protagonist | P |

**Table 2.** Similarities in the notations of reinforcement learning and optimal control. Here the transition function of reinforcement learning is formulated in discrete time, while the dynamics in optimal control are in continuous time.

Reinforcement Learning | Optimal Control |
---|---|
State $\mathit{s}$ | State $\mathit{x}$ |
Action $\mathit{a}$ | Control $\mathit{u}$ |
Reward r | Cost $\mathit{c}$ |
Observations o | Measurements y |
Transition ${\mathit{s}}^{\prime}\sim p({\mathit{s}}^{\prime}\mid \mathit{s},\mathit{a})$ | Dynamics $\dot{\mathit{x}}=f(\mathit{x},\mathit{u},\mathit{w})$ |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Moos, J.; Hansel, K.; Abdulsamad, H.; Stark, S.; Clever, D.; Peters, J.
Robust Reinforcement Learning: A Review of Foundations and Recent Advances. *Mach. Learn. Knowl. Extr.* **2022**, *4*, 276-315.
https://doi.org/10.3390/make4010013
