Robust Reinforcement Learning: A Review of Foundations and Recent Advances

Abstract: Reinforcement learning (RL) has become a highly successful framework for learning in Markov decision processes (MDPs). As RL is increasingly adopted in realistic and complex environments, solution robustness becomes an ever more important aspect of RL deployment. Nevertheless, current RL algorithms struggle with robustness to uncertainty, disturbances, or structural changes in the environment. We survey the literature on robust approaches to reinforcement learning and categorize these methods in four different ways: (i) Transition robust designs account for uncertainties in the system dynamics by manipulating the transition probabilities between states; (ii) Disturbance robust designs leverage external forces to model uncertainty in the system behavior; (iii) Action robust designs redirect transitions of the system by corrupting an agent's output; (iv) Observation robust designs exploit or distort the perceived system state of the policy. Each of these robust designs alters a different aspect of the MDP. Additionally, we address the connection of robustness to the risk-based and entropy-regularized RL formulations. The resulting survey covers all fundamental concepts underlying the approaches to robust reinforcement learning and their recent advances.


Introduction
In recent years, RL research has started shifting towards deployment on realistic problems. In an effort to mimic human learning behavior, RL utilizes trial-and-error-based designs, contrary to traditional control designs [1,2]. In control, the optimal behavior is derived from analytical reasoning about physical constraints [1,3-6]. Such designs are known as white-box models. RL, on the other hand, assumes a black-box approach where the system is unknown. The agent continuously observes the system's response through interactions. The observed data drives the optimization of the agent's behavior to achieve a given objective [1,2]. However, the solutions of standard RL methods are not inherently robust to uncertainties, perturbations, or structural changes in the environment, phenomena that are frequently observed in real-world settings.

Definition 1. Robustness, in the scope considered in this survey, refers to the ability to cope with variations or uncertainty in one's environment. In the context of reinforcement learning and control, robustness is pursued w.r.t. specific uncertainties in system dynamics, e.g., varying physical parameters.
For example, a common manifestation of this phenomenon is encountered when evaluating policies trained in simulation on the real environment: even for advanced simulators, discrepancies between the simulated and the real dynamics remain. Observation robust designs exploit the susceptibility of neural networks to input perturbations. These methods are inspired by recent developments in image-based deep learning [58,59]. The adversary leverages this vulnerability to distort the protagonist's perception. As a consequence, the decision-making process is redirected to a worst-case transition [60-65]. The protagonist perceives the shift in transitions as changing dynamics. Hence, transition and disturbance robust designs target elements of the environment, whereas action and observation robust designs target elements surrounding the decision-making process. Aside from direct robust formulations, we discuss literature on the connection of robustness to risk-based and entropy-regularized RL formulations. Mathematical proofs show the equivalence of certain risk-based and entropy-regularized RL formulations to transition robust designs [66-69].
First, we discuss the basics of optimization, optimal control, and reinforcement learning. We clarify how these research fields connect to and relate with each other. Further, the extension to multi-agent environments and MARL is explained. We then present our understanding of robust reinforcement learning separated into the four aforementioned categories. We discuss several extensions and concepts proposed in each category over the past two decades. As most research has been presented in the context of transition robust design, this part is significantly larger than the other categories. Finally, we briefly introduce the connections of robustness to risk-based and entropy-regularized formulations. This survey centers around the core ideas and mathematical framework behind each approach. We cover a large research body of both reinforcement learning and control theory. As such, notations of both topics can be found in this paper. Table 1 clarifies the similarities in the notations we use. Not all work on robustness is covered in this survey. However, our goal is to provide aspiring researchers with a solid foundation on robust RL, detailing the fundamental idea behind the four categories and their possible extensions.

Preliminaries
The concepts behind robust reinforcement learning are not unique to RL; rather, they are multidisciplinary. Closely related research areas are optimization, optimal control, and game theory. Ideas and concepts from these areas have been repurposed and built upon. This chapter summarizes the fundamentals, i.e., optimization, optimal control, and reinforcement learning. The relations to each other and to game theory are highlighted. We aim to provide a better understanding of the core ideas and their interdisciplinary application.

Optimization
Optimization provides mathematical tools for solving problems in various areas, e.g., control theory, decision theory, and finance [70-72]. In general, an objective function J(x) is optimized by finding the correct design parameters x ∈ X. Typically, the domain X represents a subset of a Euclidean space X ⊆ Rⁿ [71]. The problem is formulated as

	min_{x ∈ X} J(x)   s.t.   h(x) = 0,   g(x) ≥ 0,	(1)

where the vectors h(x) and g(x) represent equality and inequality constraints, respectively. In reality, however, the optimization problem must often deal with uncertainties and perturbations [16,73]. Possible causes can be changes in environmental parameters, e.g., angle, temperature, or material. Furthermore, the optimization must be able to handle measurement and approximation errors due to the approximation of real physical systems [73]. Robust optimization approaches tackle these issues by considering a deterministic uncertainty model [13-16,74,75]. The general assumption considers an extended objective J(x, ζ) with an additional vector of uncertain problem parameters ζ ∈ Rᵐ. This objective is commonly known as the robust counterpart. The vector ζ belongs to a given uncertainty set U_ζ and is not exactly specified. As such, the optimization problem is extended to a bi-level optimization problem

	min_{x ∈ X} max_{ζ ∈ U_ζ} J(x, ζ).	(2)

A bi-level problem corresponds to a game, where a protagonist tries to minimize the objective J(x, ζ). Meanwhile, an adversary controls ζ to achieve the worst possible outcome for the protagonist. The approach, however, does not consider the benefits of any available distributional information on the problem. Thus, the outcome of the worst-case scenario in Equation (2) is often too conservative [73,75]. Stochastic optimization (SO) takes advantage of such distributional information. In comparison to Equation (2), SO treats ζ not as a vector from an uncertainty set. Rather, ζ is a random variable drawn from a known distribution P [70,72,73].
The objective is reformulated as

	min_{x ∈ X} E_{ζ∼P}[J(x, ζ)],	(3)

where the protagonist seeks to minimize the expectation over the uncertain parameters. SO captures a less conservative risk attitude towards the outcome than robust optimization given a stochastic domain. While risk, i.e., exposure to uncertain outcomes of a known probability distribution, is captured in both approaches, neither addresses ambiguity, i.e., the case where the probability distribution itself is subject to uncertainty [75,76].
Distributionally robust optimization [77] aims to handle both situations. The approach keeps the potentially uncertain probability distribution P in a known set of distributions U_P [74,75]. Due to the a priori known U_P, the distributionally robust optimization problem is formalized as

	min_{x ∈ X} max_{P ∈ U_P} E_{ζ∼P}[J(x, ζ)],	(4)

which again results in a bi-level optimization problem similar to robust optimization. In this scenario, however, an adversary chooses the worst possible distribution P ∈ U_P. This way, the distributionally robust counterpart captures not only the decision maker's risk attitude but also an aversion towards ambiguity [75]. The basic optimization concept (see Equation (1)) is a core part of optimal control and reinforcement learning. Both aim to optimize a payoff function, which is defined by the underlying environment surrounding the controller or agent. Robust approaches in RL further utilize the concept of robust optimization (see Equation (2)) or distributionally robust optimization (see Equation (4)), either directly or from a game-theoretic point of view.
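As a concrete illustration, the robust, stochastic, and distributionally robust formulations can be compared on a toy quadratic objective. The following sketch solves all three by grid search; the objective, the uncertainty set, and the distributions are illustrative assumptions, not taken from the surveyed literature.

```python
import numpy as np

# Toy objective with an uncertain parameter zeta (all numbers illustrative).
def J(x, zeta):
    return (x - zeta) ** 2

xs = np.linspace(-2.0, 2.0, 401)       # candidate decisions x
zetas = np.array([-1.0, 0.0, 1.0])     # uncertainty set U_zeta

# Robust counterpart: minimize the worst case over U_zeta.
worst = np.array([max(J(x, z) for z in zetas) for x in xs])
x_robust = xs[np.argmin(worst)]

# Stochastic optimization: minimize the expectation under one known P.
P = np.array([0.6, 0.2, 0.2])
expected = np.array([sum(p * J(x, z) for p, z in zip(P, zetas)) for x in xs])
x_so = xs[np.argmin(expected)]

# Distributionally robust: worst expectation over an ambiguity set U_P.
U_P = [np.array([0.6, 0.2, 0.2]), np.array([0.2, 0.2, 0.6])]
dr = np.array([max(sum(p * J(x, z) for p, z in zip(Pk, zetas)) for Pk in U_P)
               for x in xs])
x_dr = xs[np.argmin(dr)]

print(x_robust, x_so, x_dr)
```

The robust solution hedges against every element of the uncertainty set, the stochastic solution tracks the mean of a single distribution, and the distributionally robust solution hedges over the whole ambiguity set.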

Optimal Control
Optimal control is a practical application of optimization for dynamic (variational) systems. Such optimization problems are generalized in the calculus of variations (CV), formulated as

	min_x J(x) = ∫_a^b L(x(t), ẋ(t), t) dt,

where J(x) is an objective or cost function optimized w.r.t. the variable x. The problem is constrained by the conditions x(a) = x_a and x(b) = x_b on the start- and endpoint, respectively. The Lagrangian L(x(t), ẋ(t), t) describes the cost at each point in time. Works published by Bolza [78], McShane [79], Bliss [80], and Cicala [81] turned out to be important foundations for the adaption of CV to optimal control. In more detail, optimal control aims to calculate optimal trajectories by minimizing an objective function under differential constraint equations describing the system dynamics [1]. Thus, the optimal control problem can be written in the form

	min_u J(u) = Φ(x(t_f), t_f) + ∫_0^{t_f} L(x(t), ẋ(t), t) dt	(5)
	s.t. ẋ(t) = f(x(t), u(t)),
	     x_i(0) = x_{i,0} = const.,
	     d(x(t_f), t_f) = 0,
	     g(x(t), u(t)) ≥ 0,

known as the Bolza form. The goal here is to optimize the cost function J(u) w.r.t. the control variable u(t). The first-order differential constraints ẋ(t) describe the system dynamics given a current state x(t) and control u(t). Constraints on the terminal state are given by d(x(t_f), t_f) with final time t_f, which is often written as x(t_f) = x_{t_f}. Optionally, control and state-space constraints are defined as g(x(t), u(t)). As in CV, the Lagrange term L(x(t), ẋ(t), t) describes the cost at each point in time, while the Mayer term Φ(x(t_f), t_f) represents the constant cost in the terminal state [71]. This approach of using first-order differential equations became known as the state-space formulation of control [4]. With the maximum principle, Pontryagin [82] formulated the Hamiltonian

	H(x, u, λ, t) = λᵀ(t) f(x(t), u(t)) − L(x(t), ẋ(t), t)	(6)

for optimal control. The maximum principle is an approach for solving such non-classical variational problems (see Equation (5)) by transforming the problem into nonlinear subproblems [5].
Pontryagin [82] states that for a minimizing trajectory satisfying the Euler-Lagrange equations there exists a control u maximizing the Hamiltonian.

Hamilton-Jacobi-Bellman Equation
In optimal control, Bellman [83] introduced an optimal return or value function to define a performance measure from state x(t) and time t to a terminal state x(t_f) under optimal control u(t) [5]. The value function, an extension of the Hamilton-Jacobi equation, takes the form of a first-order nonlinear partial differential equation later known as the Hamilton-Jacobi-Bellman equation (HJB) [71,83,84]. The HJB equation was originally presented in the context of dynamic programming (DP). DP solves the optimal control problem discretized w.r.t. states and actions [83]. Due to the discretization, DP becomes infeasible in higher-dimensional spaces, a problem known as the curse of dimensionality [1,5]. Discretized w.r.t. time, the HJB equation is commonly known as the Bellman equation (see Equation (14)) and is a fundamental concept of RL (see Section 2.3). The time discretization causes a sequence of states for which Bellman [85] proposed a Markovian framework known as the Markov decision process. It represents a discrete stochastic version of the optimal control problem [1,85].

Robust Control
Parallel to dynamic programming [83], Kalman et al. [86] proposed a more control theory-driven approach known as linear quadratic regulator control (LQR). LQR formed the foundation for the linear quadratic Gaussian control (LQG) [86][87][88]. The goal of control theory remains to find a controller that stabilizes a dynamic system. In these systems, robustness against modeling errors, parameter uncertainty, and disturbances in an environment has long been a big challenge. As such, a robust controller stabilizes a system under these uncertainties and disturbances. While optimal control already accounted for disturbances, it is still not robust under modeling errors and parameter uncertainty [12]. Since the 1980s, research tackling robustness became more prominent under the name robust control [5]. During that time, a new form of optimal control called H ∞ -control emerged, based on sensitivity minimization [18][19][20][21][22]. For this survey, we want to focus on H ∞ -control as it provides interesting relations to game theory and robust reinforcement learning.
H∞-control generally optimizes finite linear time-invariant dynamical systems described by the following linear constant-coefficient differential equations

	ẋ(t) = A x(t) + B u(t) + E w(t),
	y(t) = C x(t) + D u(t) + F w(t),

where x describes the system state vector and u denotes a control vector with a disturbance given in w. A, B, E, C, D, and F are real constant matrices. In the Laplace space, the relation of system output y and control u is given as a linear transfer function

	y(s) = G(s) u(s).

Such a linear system can be reshaped to the form depicted in Figure 1a. The dynamics of this system are defined as

	z = P₁₁ w + P₁₂ u,    y = P₂₁ w + P₂₂ u,    u = K y,

where a stable transfer matrix P represents the plant. The vector signal w contains noise, disturbances, and the reference signal, while z includes all controlled signals and tracking errors. The measurement and control signals are represented by y and u, respectively. The linear transformation z = T_zw w is governed by the transformation matrix

	T_zw = P₁₁ + P₁₂ K (I − P₂₂ K)⁻¹ P₂₁.

Optimal H∞-control is defined as finding all admissible controllers K such that

	‖T_zw‖_∞ := sup_ω σ_max(T_zw(jω))	(9)

is minimized, where σ_max corresponds to the largest singular value. By definition, a controller is admissible if it internally stabilizes the system [12]. This formulation represents a worst-case design as it minimizes the system's sensitivity to the worst possible case of disturbances. In the perturbed setting of Figure 1b, w₀ depicts external noise, disturbances, and the reference signal, while w is a signal representing parameter perturbations and model uncertainty. The system output is again described with z₀ and z. Measurement and control signals are given by y and u, respectively. Both plants are stabilized by a controller K [12].
Robust stability, however, is only given if a controller K stabilizes the system not only under noise and disturbances but also under parameter perturbations and model uncertainties as depicted by ∆ in Figure 1b. According to the small gain theorem, this system is deemed stable for any perturbation in ∆ with ‖∆‖_∞ < 1/γ if the controller ensures that ‖T_zw‖_∞ ≤ γ for some γ > 0 [12,27]. Coming from robust optimization, ∆ represents a known uncertainty set containing all possible disturbances and parameter perturbations of the nominal plant.
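As a rough numerical illustration, the H∞ norm of a stable plant can be approximated by gridding the frequency axis and taking the largest singular value of the transfer matrix at each grid point. The plant matrices below are illustrative assumptions, and the gridding only approximates the supremum.

```python
import numpy as np

# Illustrative stable plant: x' = Ax + Bw, z = Cx + Dw (matrices made up).
A = np.array([[-1.0, 0.5], [0.0, -2.0]])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

def sigma_max(omega):
    """Largest singular value of T_zw(j*omega) = C (j*omega*I - A)^-1 B + D."""
    T = C @ np.linalg.solve(1j * omega * np.eye(2) - A, B) + D
    return np.linalg.svd(T, compute_uv=False)[0]

# Grid-based estimate of the H-infinity norm ||T_zw||_inf = sup_w sigma_max.
omegas = np.logspace(-3, 3, 2000)
h_inf = max(sigma_max(w) for w in omegas)

# Small gain theorem: stability holds for any perturbation Delta with
# ||Delta||_inf < 1/gamma whenever ||T_zw||_inf <= gamma.
gamma = 2.0
print(h_inf, h_inf <= gamma)
```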

Relations to Game Theory
An important observation is that the H∞-control problem minimizes a maximum norm. Reformulating Equation (9) under the small gain theorem yields a cost function of the form

	J(u, w) := ‖z‖₂² − γ²‖w‖₂² ≤ 0,

where z depends on u and w. Using this cost function, an optimal value function V* is formulated as

	V* = min_u max_w J(u, w).

As such, H∞-control represents a mini-max optimization problem or a mini-max game. Generally, mini-max games are zero-sum games, where the controller and the disturbances or uncertainties are each represented as one player. While H∞-control is in the frequency domain, robust control in the time domain corresponds to a differential game. Basar and Bernhard [27] show the connection between H∞-control and differential games in more detail in their book.

Differential Games
Game theory has its roots with Von Neumann and Morgenstern [89] defining various types of games, settings, and strategies [31]. The differential game, however, was only later introduced by Isaacs [30]. These differential games provide a unique game-theoretic perspective on optimal control. In general, a differential game is written as a multi-objective optimization problem representing a situation where N players act in the same environment with different goals or objectives. We will refer to these players as agents. The objectives are defined as a payoff [90] or cost function [33] for each agent i, with

	J_i(u_1, …, u_N) = Φ_i(x(t_f), t_f) + ∫_0^{t_f} L_i(x(t), u_1(t), …, u_N(t), t) dt.

Note that the payoffs are written in Bolza form, similar to optimal control (see Equation (5)). The Lagrangian of each agent i depends on the actions of the whole set of agents. Isaacs [90] proposed two kinds of games: the game of kind describes discrete cost functions, the game of degree continuous ones. The former, in most cases, describes an objective of a yes-no type of possible payoffs, e.g., win or loss of a game.
The system dynamics in a multi-objective optimization problem are written as a first-order differential equation

	ẋ = f(x, u_1, …, u_N),

depending on the world's state x and the agents' actions u_1, …, u_N [90]. An agent follows either a pure or mixed strategy determining which action u_i to choose from a set of actions the agent is given for each state [90]. Pure strategies are of the exploiting deterministic type. In each state, the agent chooses the best possible action without exploration. Mixed strategies, in contrast, are stochastic and thus also incorporate exploration. The agent maintains a distribution over the given action set for each state and draws samples to determine the next action to execute. An agent playing a mixed strategy can exploit an agent restricted to pure strategies, since the latter's deterministic behavior is predictable. In multi-objective optimization, the set of strategies is expressed as

	φ := (φ_1, …, φ_N), with u_i(t) = φ_i(x(t), t).

Approaches to solving such a game depend on the available information for each agent. Commonly, each agent is aware of the current value of the state, the system parameters, and the cost function. The strategies of the adversaries, however, are unknown [33]. Another important piece of information is how the agents interact with each other. In game theory, there are different types of games determining the interaction of agents. A cooperative game refers to a situation where a subset of agents acts in unison to reach a mutually beneficial outcome. This setting can be extended to all agents acting in unison to reach a common goal, which corresponds to a fully cooperative or team game. Respectively, a non-cooperative game represents a situation without cooperation. Each agent chooses its actions regardless of the cost inflicted on other agents [31,37,89,91,92]. The solutions to these types of games provide a balance between the independent interests of the agents called an equilibrium [31,90,91].

Nash Equilibrium
In non-cooperative games, however, convergence to a globally optimal equilibrium is not guaranteed; instead, solutions are characterized by Nash equilibria [91]. In a Nash equilibrium, it is guaranteed that each agent deviating unilaterally from the equilibrium increases its costs. In turn, agents acting w.r.t. the equilibrium keep their costs minimal given the other agents' strategies. A strategy set φ* is a Nash equilibrium if

	J_i(φ*_1, …, φ*_i, …, φ*_N) ≤ J_i(φ*_1, …, φ_i, …, φ*_N)

holds for each objective J_i, where φ_i is any admissible strategy for agent i. There are methods utilizing this property, e.g., the value function approach (dynamic programming) or the variational approach (analytic, see calculus of variations) [33]. However, the existence of a Nash equilibrium is not guaranteed for all non-cooperative games. Another challenge in general non-cooperative games is the lack of information, as agents do not exchange information about their strategies. Hence, the agents cannot be sure that their adversaries act w.r.t. the Nash equilibrium. The resulting optimal strategy for an agent might therefore not necessarily follow the Nash equilibrium. An intuitive but excessively pessimistic approach is the assumption that every other agent behaves adversarially. This approach corresponds to a situation where an agent's cost is maximized by all of its adversaries. An agent thus aims to minimize its maximum cost, the basic idea behind mini-max games. Such problems are defined in the following form

	φ*_i = arg min_{φ_i} max_{φ_j, j≠i} J_i(φ_1, …, φ_N),

which does not take the payoffs of the other agents into account for agent i [31,33,35,90].
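The defining property of a Nash equilibrium, that no unilateral deviation lowers an agent's own cost, can be checked directly on a small matrix game. The cost matrices below are an illustrative prisoner's-dilemma-style example, not taken from the surveyed literature.

```python
import numpy as np

# Cost matrices of a 2x2 non-cooperative game (prisoner's dilemma in cost
# form): entry C_i[a1, a2] is agent i's cost; action 0 = cooperate, 1 = defect.
C1 = np.array([[1.0, 10.0], [0.0, 5.0]])
C2 = np.array([[1.0, 0.0], [10.0, 5.0]])

def is_nash(a1, a2):
    """A pure profile is Nash if no unilateral deviation lowers an agent's cost."""
    best1 = all(C1[a1, a2] <= C1[d, a2] for d in range(2))
    best2 = all(C2[a1, a2] <= C2[a1, d] for d in range(2))
    return best1 and best2

print(is_nash(1, 1), is_nash(0, 0))  # mutual defection is the equilibrium
```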

Two-Player Zero-Sum Game
An extreme case is the two-player zero-sum game, where two agents play against each other as adversaries such that the objectives J_i are directly opposed:

	J(u_1, u_2) := J_1(u_1, u_2) = −J_2(u_1, u_2).
The sum of the objectives of both agents amounts to zero [31,90,91]. As shown in Figure 2, the Nash equilibrium corresponds to a saddle point solution in a two-player zero-sum game. So we can rewrite the Nash equilibrium condition to become

	J(u*_1, u_2) ≤ J(u*_1, u*_2) ≤ J(u_1, u*_2).

If this equation holds, there is a Nash equilibrium [27,90,91]. Further, if a Nash equilibrium exists, this equilibrium is equivalent to the solution of a mini-max optimization for two-player games [33]. Additionally, the existence of a Nash equilibrium is guaranteed if mixed strategies are allowed. Therefore, the two-player zero-sum game guarantees the existence of a mini-max solution for mixed strategies [93]. However, this property does not hold for N > 2 player games. A special variant of the zero-sum game is the game against nature. It describes a situation in which one agent's actions correspond to environmental disturbances (nature) that other agents must overcome. Thus, nature and the agents work on completely different action spaces [37]. In terms of optimal control, this situation comes closest to H∞-control as a worst-case design under uncertainties and disturbances. It is also possible to give nature control over the parameters of the environment to incorporate modeling errors and thus more closely represent robust control.

Figure 2. The figure illustrates a Nash equilibrium in a two-player zero-sum game. The protagonist minimizes the objective min_{u_1} J_1(u_1, u_2) while the adversary counteracts through maximization max_{u_2} J_2(u_1, u_2). In this equilibrium, each player achieves an optimal payoff when following the optimal action (u*_1, u*_2). (i) In case 1, the protagonist deviates from the optimal action by ∆u_1 while the adversary pursues the optimal action. Consequently, the protagonist incurs a higher cost, and the adversary reaches a better objective. (ii) In case 2, the protagonist follows the optimal action, but the adversary changes its action by ∆u_2.
As a result, the protagonist achieves a better outcome while the opponent's outcome decreases. Therefore, the Nash equilibrium corresponds to a solution in which both players achieve the best possible outcome w.r.t. each other. A unilateral change in policy causes a loss and should be avoided.
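For matrix games, the existence of such a pure-strategy saddle point can be checked by comparing the protagonist's minimax value with the adversary's maximin value. The payoff matrix below is an illustrative example; the two values coincide exactly when a saddle point exists.

```python
import numpy as np

# Payoff matrix of a zero-sum game: J[i, j] is the protagonist's cost when it
# plays row i (minimizing) and the adversary plays column j (maximizing).
J = np.array([[3.0, 5.0],
              [2.0, 1.0]])

minimax = J.max(axis=1).min()   # protagonist: best worst-case row
maximin = J.min(axis=0).max()   # adversary: best worst-case column

# A pure-strategy saddle point (Nash equilibrium) exists iff the values match;
# here both equal J[1, 0] = 2.
print(minimax, maximin, minimax == maximin)
```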

Reinforcement Learning
Despite the long history of research and success in optimal control, it differs considerably from human control (Table 2). Instead of defining the environment's behavior mathematically and designing a system from scratch to control said behavior, humans learn to control by repetitively interacting with their environment. This human approach to control is adopted in reinforcement learning. An agent learns to map certain situations or states in an environment to actions while maximizing its short- and/or long-term reward [1]. Still, reinforcement learning is based on ideas from optimal control. Every action the agent takes has an impact on the environment and, in most cases, causes a change of the environment's state. Such a change in turn is observed and acted upon.

Table 2. Similarities in the notations of reinforcement learning and optimal control. Here the transition function of reinforcement learning is formulated in discrete time while the dynamics in optimal control are in continuous time.

Single Agent Reinforcement Learning
A fundamental concept for the environment was introduced through Bellman's research with the development of the Markov decision process. A schematic of this mathematical framework describing the interplay of an agent and its environment is shown in Figure 3. A key assumption behind the MDP is the Markovian nature of the system, commonly known as the Markov property. This property states that a fully observed state is statistically sufficient. In a statistically sufficient state, the transition probability can be determined based only on the observed state and the taken action, without knowing the history of previous states, i.e., the transition probability is conditionally independent of the transition history. In the standard formulation, we assume a fully observable MDP where the agent can fully observe the state of the environment at each time step [1,2,9,34,85,94-96]. As described in Section 2.2.1, the MDP was designed to tackle a version of the state-space formulation of optimal control discretized w.r.t. time [85]. The MDP is defined by the tuple M := (S, A, r, P, γ), where all possible states of the environment are represented by S ⊆ Rⁿ, while A ⊆ Rᵐ contains all possible actions of the agent. The system dynamics are described by a deterministic or stochastic transition function P: S × A → S mapping each pair of state s ∈ S and action a ∈ A to a new state s′ ∈ S. By this definition, the transition function of a finite state MDP can also be seen as a 3-dimensional matrix containing a probability for each (s′, s, a)-tuple. For a specific state s, this matrix reduces to a 2-dimensional matrix. Each row then describes a probability distribution p: S × A × S → [0, 1] over next states s′ corresponding to a specific action, as defined by

	p(s′|s, a) = Pr{s_t = s′ | s_{t−1} = s, a_{t−1} = a} = Σ_{r∈R} p(s′, r|s, a).
The actions are rewarded by a continuous function r: S × A → R. As an example, let us assume a system with two possible states s₁, s₂ and two possible actions a₁ and a₂. Further, we assume to be in state s₁. Then the transition matrix for state s₁ can be written as

	P_{s₁} = [ p(s₁|s₁, a₁)  p(s₂|s₁, a₁)
	           p(s₁|s₁, a₂)  p(s₂|s₁, a₂) ].
As many papers in this survey assume a finite state space, we will use the terms transition function and transition matrix interchangeably. The full stochastic dynamics are then defined with a probability function p: S × R × S × A → [0, 1] formulated as

	p(s′, r|s, a) = Pr{s_t = s′, r_t = r | s_{t−1} = s, a_{t−1} = a}.
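The finite-state view of the transition function as a matrix can be made concrete with a small numerical sketch; the probabilities below are illustrative, not from the surveyed literature.

```python
import numpy as np

# Transition tensor P[a, s, s'] for the two-state, two-action example; each
# row is a probability distribution over next states (numbers illustrative).
P = np.array([[[0.9, 0.1],    # action a1, from s1 and s2
               [0.3, 0.7]],
              [[0.2, 0.8],    # action a2, from s1 and s2
               [0.5, 0.5]]])

# Every row of the per-state matrices must sum to one.
assert np.allclose(P.sum(axis=-1), 1.0)

# The 2x2 matrix P_{s1}: rows correspond to actions a1 and a2 in state s1.
P_s1 = P[:, 0, :]

# Sampling a next state given (s1, a1), i.e. s' ~ p(.|s1, a1).
rng = np.random.default_rng(0)
s_next = rng.choice(2, p=P[0, 0])
print(P_s1, s_next)
```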

Figure 3. Schematic representation of the Markov decision process. Transitions are depicted in a discrete-time formulation to define variables at every time step. Each state is composed of a vector of observations or sensory inputs of the agent. The agent can choose an action from a given action space at every step, which is evaluated through a reward function.
In cases where the observed state no longer fully describes the true state of the environment, the observed state loses its property of being statistically sufficient. For example, the observed state may be identical in two different states of the environment. Such scenarios often occur in real-world applications. To track the true state, the agent must maintain a belief that contains knowledge of past states to estimate which of the true states is represented by the observed state [97,98]. A standard MDP formulation is then solved based on statistically sufficient belief states representing this belief. This concept is known as the partially observable Markov decision process (POMDP) [97,98].
In a fully observable MDP setting, reinforcement learning aims to find an optimal deterministic policy π*: S → A. From its origin, the optimal policy formulates an optimal control at each time step, maximizing the expected return of the objective function

	J(π) = E_π[ Σ_{t=0}^{T} γᵗ r(s_t, a_t) ],	(11)

where γ ∈ [0, 1) describes the discount factor reducing the influence of future rewards. The horizon T is defined as the maximum time of operation, possibly infinite, or the time required to reach a final or terminal state. In the infinite horizon case, the optimal policy becomes stationary [2]. Converging to an optimal policy π*, however, is, in theory, only guaranteed if every action is executed infinitely often in each state [1]. A greedy policy that always exploits the best known action seems a reasonable choice as it maximizes the known rewards. Nevertheless, there is no guarantee that the known actions are best, as some actions might not have been tested yet. Actions that have not been explored have no estimate for their long-term reward and, as a result, will never be chosen. Exploring new actions carries the risk of lowering the short-term reward, but, in turn, the long-term reward may rise eventually. Finding admissible solutions requires a fine trade-off between exploring the unknown and exploiting the known, commonly referred to as the exploration vs. exploitation dilemma [1]. A common approach is using a stochastic policy π: S × A → R₊ sampling an action from a distribution in a given state, a ∼ π(a|s). As each action in a state is initially given a probability π(a|s) > 0, exploration is guaranteed. Still, any stochastic policy can converge to a stationary deterministic policy π*, fully exploiting the system to gain the maximum reward [1,2].
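A minimal sketch of the discounted objective and of sampling from a stochastic policy; the policy probabilities are illustrative assumptions.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Sum_t gamma^t * r_t, the discounted return of one trajectory."""
    return sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))

# A stochastic policy pi(a|s): a per-state distribution over actions; as long
# as every probability stays positive, exploration is guaranteed.
pi = {0: np.array([0.8, 0.2]), 1: np.array([0.5, 0.5])}  # illustrative

rng = np.random.default_rng(1)
a = rng.choice(2, p=pi[0])   # a ~ pi(.|s=0)

print(discounted_return([1.0, 1.0, 1.0]), a)
```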
The question remains how to evaluate each action. Using the immediate reward is possible but contains no information about the future. An action might be attractive now but lead to a low series of rewards in the future. A common concept most reinforcement learning methods are based on is the value function, a simplified HJB equation [1,83,85]. Even though the value function cannot be calculated directly, [83] shows that it can be estimated backward from the final to the initial state in the discrete-time setting. Hence, the state-value function V and the action-value or Q-function Q are defined as

	V^π(s) = E_π[ Σ_{t=0}^{T} γᵗ r(s_t, a_t) | s₀ = s ],
	Q^π(s, a) = r(s, a) + γ E_{s′∼P(s′|s,a)}[ V^π(s′) ].
As the name suggests, the optimal state-value function V* describes the maximum value one can obtain from state s onwards. In turn, the optimal action-value function Q* represents the maximum value one can obtain if action a is chosen in state s. Bellman [83] introduced the key for a large variety of RL algorithms in the form of dynamic programming for finite space MDPs. Even though dynamic programming was developed in the context of optimal control, one might argue that it is a shared part of the history of optimal control and reinforcement learning [1]. This algorithm breaks down the backward estimation into subproblems. Each subproblem only considers the reward between two consecutive states in a trajectory. The true value function can then be found by recursively solving the subproblems. Thus, a policy π* will be optimal if it behaves greedily w.r.t. this function [1,2]. This solution requires finite state spaces as it keeps a lookup table with the values of every state. As the lookup table grows exponentially with the number of state variables, each variable introducing a new dimension, dynamic programming suffers from the curse of dimensionality. Methods derived from dynamic programming directly using the same table-based formulation are commonly known as tabular methods and suffer from the same problem. The RL community tackled the curse of dimensionality using function approximators [1,99-102]. Still, as we will see in Section 3.1, dynamic programming is often used as a foundation to show convergence guarantees under different MDP formulations. One concept regularly used in reinforcement learning for optimizing value functions is the temporal difference (TD) error. The TD error, described by

	δ_t = r_t + γ V(s′) − V(s),

is the difference between the current estimate of the value function V(s) and a new sample r_t + γ V(s′) from interacting with the environment. One can optimize the value estimate by minimizing the TD error.
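Dynamic programming and the TD error can be illustrated on a tiny two-state MDP. The following value iteration sketch (dynamics and rewards are illustrative assumptions) applies the Bellman backup until convergence and then computes a single TD error for one hypothetical transition.

```python
import numpy as np

# Tiny finite MDP (all numbers illustrative): P[a, s, s'] and R[a, s].
P = np.array([[[0.9, 0.1], [0.3, 0.7]],
              [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Tabular value iteration: V(s) <- max_a [ r(s,a) + gamma * E V(s') ].
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q[a, s], expectation over next states
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

pi = Q.argmax(axis=0)              # greedy policy w.r.t. the converged values

# TD error for one observed transition (s=0, r=1, s'=1):
delta = 1.0 + gamma * V[1] - V[0]
print(V, pi, delta)
```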
Other common approaches instead utilize parametric policies π_θ: S × A → R_+, whose parameters are updated by gradient ascent w.r.t. the objective J_θ (see Equation (11)) [99,103-108].
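As a toy illustration of such a gradient-ascent update (a sketch with invented constants, using a softmax policy on a two-armed bandit rather than a full MDP; the REINFORCE-style score-function gradient is ∂ log π_θ(a)/∂θ_i = 1{i = a} − π_θ(i)):

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]          # policy parameters: one logit per action
ARM_MEANS = [0.2, 0.8]      # hypothetical bandit reward means
ALPHA = 0.1                 # step size

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]   # sample action from pi_theta
    r = random.gauss(ARM_MEANS[a], 0.1)            # sampled reward
    # gradient ascent on J(theta): theta_i += alpha * r * d log pi(a) / d theta_i
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += ALPHA * r * grad_log
```

After training, the policy strongly prefers the higher-paying arm.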

Multi-Agent Reinforcement Learning
Reinforcement learning, as discussed to this point, has assumed a lone agent in a fixed, stationary environment. This secluded view of a learning environment is often unrealistic: agents frequently have to interact with other agents in the same environment, and the environment might not be stationary [95]. The crucial Markov property is violated by the existence of other agents as soon as their policies are non-stationary [34,109]. These additional agents cause the environment to become non-stationary and non-Markovian from the perspective of a single traditionally trained agent [110]. A straightforward approach to training agents in multi-agent environments is called independent learners. Each agent relies on the simple assumption that no other agents exist, or treats them as stationary, which renders the environment stationary again [93]. For independent learners, the optimal policy π* is stationary and deterministic, which for standard MDPs is undominated, meaning that this policy achieves the highest possible reward from any state amongst all possible policies. This property no longer holds in multiplayer environments, as the performance of any policy highly depends on the other players' choices. A deterministic policy will never be optimal in a multiplayer environment, as it is unable to react to other players or opponents and is prone to being exploited [34]. A simple example is rock, paper, scissors: an agent who deterministically chooses the same action every round will always be defeated by its opponent, while an optimal stochastic policy at least breaks even [34]. A more reasonable approach is joint action learners, which take information about their opponents into account [111]. As the multi-agent system exhibits the nature of a game-theoretic framework, the Nash equilibrium (see Section 2.2.3) is certainly a possible solution.
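The rock-paper-scissors argument can be verified numerically (a toy computation of our own): against the deterministic policy "always rock", the best response "always paper" wins every round, while the uniform mixed policy breaks even in expectation against any opponent.

```python
# Rock-paper-scissors payoff for player 1: +1 win, 0 draw, -1 loss.
# Rows: player 1 plays R, P, S; columns: player 2 plays R, P, S.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def expected_payoff(p1, p2):
    """Expected payoff for player 1 given mixed strategies p1 and p2."""
    return sum(p1[i] * p2[j] * PAYOFF[i][j]
               for i in range(3) for j in range(3))

always_rock = [1.0, 0.0, 0.0]     # deterministic, exploitable
best_response = [0.0, 1.0, 0.0]   # always paper: exploits always rock
uniform = [1 / 3, 1 / 3, 1 / 3]   # mixed equilibrium strategy
```

The uniform strategy earns exactly zero against every opponent, which is the Nash-equilibrium guarantee for this zero-sum game.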
In this equilibrium, an optimal policy is defined as the best response to all other policies, assuming the other agents also act according to the equilibrium [34]. Due to the guarantees of the Nash equilibrium in two-player zero-sum games, the adversarial setting is the easiest case of multi-agent reinforcement learning.
The stochastic game, as introduced by Shapley [112], provides an extension of game theory to MDP environments. Hence, the stochastic game or Markov game is a natural extension of the reinforcement learning framework to multi-agent systems [95,113,114]. The Markov game is defined by a tuple M := (S, A_1, . . . , A_k, r_1, . . . , r_k, P, γ) with a set of states S ⊆ R^n and a collection of action sets A_1, . . . , A_k, each assigned to a single agent in the environment. The transition function is extended to the form P: S × A_1 × . . . × A_k → S, depending on the current state s ∈ S and a chosen action a_i ∈ A_i of each agent, and can again be formulated as a probability p(s' | s, a_1, . . . , a_k). Instead of a single reward function, each agent i is given its own reward function r_i: S × A_1 × . . . × A_k → R. As before, γ ∈ [0, 1) denotes the discount factor [9,34,36,95]. The state-value function V and Q-function Q are then rewritten from the perspective of agent i as

V_i^{π_1,...,π_k}(s) = E[ Σ_{t=0}^∞ γ^t r_{it} | s_0 = s ],
Q_i^{π_1,...,π_k}(s, a_1, . . . , a_k) = r_i(s, a_1, . . . , a_k) + E_{s'∼P(s,a_1,...,a_k)}[ V_i^{π_1,...,π_k}(s') ],

where r_{it} = r_i(s_t, a_{1t}, . . . , a_{kt}) [11,34]. The Markov game describes a series of matrix games, where each matrix game or subgame represents a state of the Markov game. In the case of two agents, these subgames are defined by a matrix r in which each component r_{ij} contains the instantaneous reward for an action i of agent 1 and an action j of agent 2 [95]. This matrix game has to be solved at each stage of the Markov game. As a generalized formulation, the matrix game is defined by a tuple M := (n, A_1, . . . , A_n, r_1, . . . , r_n) with n players, where each player i is given an action set A_i and a payoff function r_i. The payoffs are visualized as an n-dimensional matrix r [93].
These types of games require stochastic policies as any adaptive agent may learn to exploit deterministic behavior as described in Section 2.3.2. In fact, optimal behavior is achieved when the agent is in a Nash equilibrium, where the agent's policy is the best response to every other policy in the equilibrium. That means any deviation from that policy results in a decrease in reward [34]. Since only two-person zero-sum games guarantee the existence of a Nash equilibrium at every stage of the game, they are the only case with a tractable solution [34,115,116].
For two-player zero-sum cases, the definition of the Markov game can be reduced to a tuple M := (S, A, Ā, r, P, γ). The collection of action spaces is reduced to A ⊆ R^n for the agent and Ā ⊆ R^m for its adversary, respectively. Due to the nature of zero-sum games, both agents optimize a single reward function r in opposite directions, as described in Section 2.2.1 [9,95]. On this basis, the state-value and Q-function simplify to

V^{π,π̄}(s) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ],
Q^{π,π̄}(s, a, ā) = r(s, a, ā) + E_{s'∼P(s,a,ā)}[ V^{π,π̄}(s') ],

where r_t = r(s_t, a_t, ā_t) [11]. Since the protagonist maximizes the reward while the adversary minimizes it, this definition is equivalent to the generalized form in Equations (16) and (17). Littman [95] was one of the first to tackle two-player zero-sum games using reinforcement learning in a multi-agent learning approach. He proposed a new variant of Q-learning, which, in its standard form, tries to approximate the Q-function by minimizing the error between the Q estimate in the current state and the observed reward plus the Q estimate of the following state. Instead of a maximum Q estimate depending only on a single agent, he proposed a mini-max update depending on the maximizing action of one agent and the minimizing action of its opponent [95]. His training environment was designed as a two-player soccer game on a rectangular position grid. Littman [95] tested four different scenarios: two agents were trained using standard Q-learning (independent learners) and two using mini-max Q-learning (joint action learners). In both cases, one agent was trained against an opponent choosing random actions and one against an identical copy of itself. All trained agents performed well in a test match against an opponent choosing random actions. However, the results showed that the joint action learner trained against a random opponent performed worse than its independent learner counterpart.
Since the independent learner was not taken advantage of by the purely random opponent, its optimization was more effective than that of the joint action learner, which assumed an optimal opponent. In another experiment, Littman kept the trained independent and joint action learners fixed and trained a new challenger against each of the four agents. Unsurprisingly, the result changed dramatically: both independent learners did not win a single game, while the joint action learners still won ∼35% of the games. The joint action learner that was trained against itself performed better than the one trained against a random opponent [95].
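The mini-max backup can be sketched compactly (our own simplification: Littman's minimax-Q computes the mixed-strategy game value with a linear program, whereas this pure-strategy version only coincides with it when the stage game has a saddle point):

```python
GAMMA = 0.9

def minimax_value(q_state):
    """Max over own actions of the worst-case (min) over opponent actions."""
    return max(min(row) for row in q_state)

def minimax_q_update(Q, s, a, o, r, s_next, alpha=0.1):
    """One minimax-Q step for the observed transition (s, a, o, r, s_next)."""
    target = r + GAMMA * minimax_value(Q[s_next])
    Q[s][a][o] += alpha * (target - Q[s][a][o])

# Toy tables Q[s][a][o]: 2 states, 2 own actions, 2 opponent actions.
Q = {0: [[0.0, 0.0], [0.0, 0.0]],
     1: [[1.0, 0.0], [0.0, 1.0]]}
minimax_q_update(Q, s=0, a=0, o=0, r=1.0, s_next=1)
```

In state 1 every own action can be countered (pure-strategy minimax value 0), so the bootstrapped target for state 0 is just the immediate reward.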
Uther and Veloso [35] picked up on Littman's work to further investigate multi-agent reinforcement learning techniques in a two-player zero-sum setting they described as an adversarial environment. Their work introduced the term adversarial reinforcement learning as a special variant of MARL. One insight of their work is the significant disadvantage of mini-max Q-learning in assuming an optimal opponent, i.e., pessimistic behavior. They proposed incorporating a modeled probability distribution over the opponent's actions in each state to take advantage of an opponent's suboptimal behavior [35]. However, a major problem of tabular methods like Q-learning, which maintain a table of Q values for each state-action pair, is the massive explosion of possible states with an increasing number of agents. Uther and Veloso [35] discussed two solutions to this problem. The first generalizes over similar states using function approximators, effectively reducing the number of samples required to obtain an estimate for each possible state. The second is watching the opponent to gather more samples from the same number of moves and thus converge more quickly to good estimates of the Q values in the table [35].
The approaches presented in [34,35,95] boost the performance of reinforcement learning in two-player games compared to independent learning approaches, but they do not tackle a central issue of reinforcement learning: robustness. Still, they are major milestones for significant contributions to the research on solving two-player games and beyond [117-126]. It is known from robust optimization and robust control that the two-player zero-sum game, equivalent to a mini-max formulation, can be used to gain robustness against parameter perturbations and modeling errors. The question here is how the adversary must be designed to gain robustness against specific environmental variations. This topic will be thoroughly discussed in the following chapter.

Robustness in Reinforcement Learning
Despite the success and attention classic reinforcement learning has received over the last years, it often struggles with robustness and generalization. This problem is mainly caused by agents overfitting to the specific training environment, which becomes a major challenge at deployment time. Training of reinforcement learning agents is often done in simulation due to the high cost of interaction with physical systems. In turn, the simulation is an imperfect representation of reality containing modeling errors and imprecise parameters. The difference between simulation and reality is often too large for the trained policy to handle during transfer. Even policies trained directly on the real system do not perform well under previously unencountered uncertainties or disturbances. Slight deviations in the environment's parameters, e.g., mass or friction, can have a significant impact on a policy's performance. Such changes are a common occurrence in test scenarios and can be the difference between success and failure [7-9]. As shown in Section 2.2, it is possible to leverage the idea of two-player zero-sum games or mini-max solutions to gain robustness to disturbances and environmental parameter variations. Further, Xu and Mannor [127] show that supervised learning algorithms robust to noise and disturbances also provide desirable generalization properties. Even though it is not clear how this proof carries over to reinforcement learning, a similar connection is believed to exist [11].
Our main interest lies in the robustness of reinforcement learning to parameter variations of the environment. Considering the fundamental definition of the MDP as a tuple M := (S, A, r, P, γ), there are multiple places where additional uncertainty can enter. Any change in the environment's state is a reaction to the agent's actions described by the transition function P. Thus, arguably the most intuitive choice is robustness to uncertainty in the transition function, such that not only the transitions but also the function itself is subject to uncertainty. Based on the concept of robust optimization, introducing an uncertainty set over transition functions P ∈ U_P, one can formulate a robust MDP [23-25]. As robust optimization is defined as a mini-max optimization, the uncertainty set can be seen as the action space of an adversary in a two-player zero-sum game, an adversary governing the system's dynamics in the form of the transition function [128]. A similar formulation has also been presented for uncertainty in the reward function [23-25].
While intuitive from the perspective of the MDP formulation, uncertainty in the transition function is not the only way to achieve robustness against environmental parameter variations and disturbances. Another approach leverages the idea of robust control by formulating parameter uncertainty as disturbances in the environment represented by an adversary. This adversary applies forces to the protagonist's body to simulate changes, e.g., in mass, gravity, or friction [7-9].
Similar to how disturbances can describe a change in certain environmental parameters such as friction, a change in these parameters can also describe a disturbance. The common ground between disturbances and parameter variations is that both cause a change in the environment's behavior: the agent can no longer rely on the transition probabilities being fixed for any given state-action pair. During our research, we identified a similarity in all approaches to robustness against changing system dynamics caused by parameter uncertainty, disturbances, and modeling errors: the key component the different approaches aim for is variability in the transition probabilities. As such, an adversary can also be designed to manipulate the protagonist's actions directly. Manipulating the actions alters the transition probabilities from the perspective of the originally chosen action. The agent perceives this alteration as an unexpected, possibly undesirable transition between states [11,55,56].
The same effect is achieved by manipulating the protagonist's observation of states. By tricking the protagonist into wrong beliefs about the environment's current state, the protagonist maneuvers itself into unexpected worst case situations [62,63,129].
Over the last years, promising results to parameter robustness have been published using these four basic approaches. We will discuss the different ideas in detail in this section, including some of their variations presented over the years. We try to show how the approaches connect to the basic concepts introduced in the previous section and other related research to the best of our understanding.

Transition and Reward Robust Designs
Given the underlying stochastic nature of an MDP, every interaction between agent and environment causes a shift in the environment's state. This change is governed by the transition probability model P. A consequence of this stochastic process is the risk of potentially encountering critical states. To minimize this risk in traditional RL, Heger [76] adopted a two-player framework similar to Littman [95]. In Heger [76], however, the adversary takes control of P and hence is modeled in the environment. This framework provides a guaranteed minimum performance by considering the transition with the worst outcome at each time step. As a result, the optimal agent behaves risk-averse but also very conservatively [23]. Moreover, this approach abandons the underlying stochastic properties of the environment, replacing the stochastic nature with a deterministic worst-case design [23]. Further, Heger [76] does not seek robustness w.r.t. errors in the approximation, disturbances, or other outside influences. Instead, the goal should be to derive optimization approaches that are robust against these external uncertainties while also retaining the underlying stochastic nature of the environment. As such, robust reinforcement learning rather considers a layered approach of robust stochastic optimization. This consideration first gained attention in the 1970s, when it was applied to Markov decision processes with imprecisely known transition probabilities (MDPIP) [23,128,130,131]. The unknown transition matrix P is assumed to be fixed and to lie within a known uncertainty set of possible transition matrices U_P. In the case of an infinite horizon T = ∞ with discrete finite state space S and action space A, the MDPIP framework can be formulated as a zero-sum stochastic game between protagonist and adversary [112].
Thus, compared to [76], the goal is then changed to optimizing over the worst possible transition matrix P instead of a single worst-case transition, a mini-max problem of the form

max_π min_{P∈U_P} E_{P,π}[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ].        (18)

In this formulation, the optimization over U_P is still a deterministic process, but the optimization over π remains stochastic, as each P ∈ U_P is a stochastic transition matrix. In general, however, finding an optimal memoryless policy for Equation (18) is NP-hard [132]. In contrast to a two-player simultaneous game, in which both agents act simultaneously, here the protagonist first tries to maximize the expected return; then, based on the protagonist's action, the adversary selects the worst possible transition matrix P within the uncertainty set U_P. The interaction of this two-player game is shown in Figure 5.
Intuitively, due to this additional information, the behavior of an optimal policy for the adversary becomes static: each time the adversary encounters the pair (s, a), it will select the same transition matrix P_s^a out of U_P(s, a). Thus, it is also possible to find an optimal stationary policy for the protagonist, which behaves deterministically [128,130]. Nevertheless, finding a solution for statically behaving adversaries is computationally expensive.

Figure 5. Schematic representation of methods following a transition robust design. The framework considers adversaries taking control of the transition function. The adversary selects a transition function-here described as a distribution-from a predefined uncertainty set. This decision is based on the current state of the environment and the action chosen by the protagonist.

Therefore, Bagnell et al. [23] considered a two-player zero-sum dynamic game of complete information, defined by

max_π min_{(P_t)_{t=0,...,T}, P_t∈U_P} E[ Σ_{t=0}^T γ^t r(s_t, a_t) ],        (19)

as a lower bound for the static game. The gap between the solution of the dynamic game, Equation (19), and the static game, Equation (18), goes to zero as the horizon T goes to ∞ [23,24]. In the case of the dynamic game, the adversary's policy is not restricted to remaining static during the game. The adversary is thus able to explore the uncertainty set to find the worst possible transition matrix for each (s, a)-pair. Finding a solution for general uncertainty sets, however, is still NP-hard [45].
To improve tractability, Nilim and El Ghaoui [24] and Iyengar [25] have modeled the uncertainty set as a Cartesian product of individual (s, a)-dependent uncertainty sets

U_P = ×_{(s,a)∈S×A} U_P(s, a),  with  U_P(s, a) ⊆ R_+^{|S|}.        (20)

Here, each U_P(s, a) is a subset of the probability simplex in R_+^{|S|} that describes the uncertainty over the next state given action a in state s [24]. This property (Equation (20)) is known as the (s, a)-rectangularity property. Intuitively, it permits the adversary to select a transition matrix from U_P(s, a) without considering transition matrices of other state-action pairs. This property provides the foundation of the robust Markov decision process (RMDP), defined by a tuple M := (S, A, r, U_P, γ).
Initially, uncertainty sets were constructed based on polytopes or interval matrices [23,128,130,131]. Inspired by game theory, Bagnell et al. [23] have proven that there are optimal stationary deterministic policies both for the protagonist and for the adversary if the uncertainty set is a compact and convex polytope. On this basis, the authors stated that a robust value iteration exists to find optimal policies. However, solving this bi-level optimization problem is computationally expensive [23]. Although such sets can satisfy the (s, a)-rectangularity property, they are not statistically accurate enough to represent uncertainties [24,25]. For this reason, Nilim and El Ghaoui [24] and Iyengar [25] have constructed uncertainty sets described by likelihood or entropy bounds, which also incur significantly lower computational effort. Using these bounds together with an estimated or known reference distribution P_0 is a natural way to construct statistically accurate uncertainty sets. Although these sets are not convex, Nilim and El Ghaoui [24] and Iyengar [25] have proven that the robust dynamic programming algorithm will find an optimal stationary deterministic policy for all uncertainty sets as long as the (s, a)-rectangularity property is satisfied. Moreover, given this uncertainty set, the complexity of solving the problem is only slightly higher than for a classical MDP with a fixed transition matrix [24,25].
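Under the (s, a)-rectangularity property, robust dynamic programming reduces to a value iteration with an inner worst-case step per state-action pair. A minimal sketch follows (a finite list of candidate transition vectors per pair stands in for the uncertainty set; all numbers are our own toy example):

```python
GAMMA = 0.9

def robust_value_iteration(n_states, n_actions, R, U, iters=200):
    """V(s) = max_a [ R[s][a] + gamma * min_{p in U[s][a]} sum_s' p[s'] V(s') ]."""
    V = [0.0] * n_states
    for _ in range(iters):
        V = [
            max(
                R[s][a] + GAMMA * min(            # adversary picks worst p
                    sum(p[s2] * V[s2] for s2 in range(n_states))
                    for p in U[s][a]
                )
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
    return V

# Toy robust MDP: state 1 is absorbing with zero reward; in state 0 the
# adversary may pick the candidate transition that leaks more mass into state 1.
R = [[1.0], [0.0]]
U = [[[[0.9, 0.1], [0.5, 0.5]]],   # U[0][0]: two candidate rows
     [[[0.0, 1.0]]]]               # U[1][0]: absorbing
V = robust_value_iteration(2, 1, R, U)
```

The worst case in state 0 is the second candidate, giving the fixed point V(0) = 1/(1 − 0.45) ≈ 1.818 and V(1) = 0.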
Subsequently, Wiesemann et al. [45] have presented a generalization of the (s, a)-rectangular uncertainty set known as (s)-rectangularity. Robust dynamic programming algorithms constrained to this less restrictive property can still find optimal stationary policies. While optimal policies under Equation (20) are deterministic, they may now also be stochastic [45].
A slightly different setting from standard robust MDPs is presented by Lim et al. [44], who aim to solve the robust MDP in the presence of an unknown adversary, meaning that the full extent of nature's ability to change is unknown. An MDP described by the tuple M := (S, A, r, U_P, γ) with finite state space S and action space A is considered. As in standard robust MDPs, a possibly history-dependent compact uncertainty set U_P(s, a) over transition matrices P(s, a) is defined for every state-action pair. However, only a subset F of state-action pairs is truly adversarial, while all others behave purely stochastically, i.e., with a fixed P(s, a), as in non-robust MDPs. By optimizing a regret, one can determine a policy as good as the mini-max policy without knowing either F or P(s, a). Such solutions slightly deviate from the common solution to robust MDPs as they are more optimistic and hence indirectly address a major issue of robust approaches based on worst-case analysis.
The worst-case analysis is prone to produce overly conservative policies that achieve only mediocre performance across all possible model parameters in exchange for improving worst-case performance [39]. As such, in cases where the nominal model parameters already provide a reasonable representation of the real physical system, a non-robust approach leads to higher-performing policies on that system. In other words, applying worst-case analysis to problems with only a small sim-to-real gap may lead to worse performance. Xu and Mannor [39], therefore, propose a trade-off as a weighted sum between a nominal and a robust performance criterion. In their paper, the authors consider an MDP with an uncertain reward function r ∈ U_r. The performance criteria are then defined as the expected return from step t onward under the respective parameters,

P_t(π, s) = E_π[ Σ_{t'≥t} γ^{t'−t} r̄(s_{t'}) | s_t = s ],
R_t(π, s) = min_{r∈U_r} E_π[ Σ_{t'≥t} γ^{t'−t} r(s_{t'}) | s_t = s ],

where r̄ is the nominal reward, P_t is the nominal criterion, and R_t is the robust criterion. The weighted sum is therefore

λ P_t(π, s) + (1 − λ) R_t(π, s),

with λ ∈ [0, 1] being the weighting parameter. A policy π is said to be Pareto efficient if it obtains the maximum of P_t(π, s) among all policies with a certain value of R_t(π, s). Xu and Mannor [39] show that this problem can then be solved using parametric linear programming for the whole set of Pareto efficient policies. Unfortunately, this approach only considers uncertainties in the reward function. In the case of uncertain transitions, as presented in [23,24], Xu and Mannor [39] prove that a solution is not Markovian and, as such, may be intractable. However, the idea of trading off nominal and robust performance has been adopted for a robust policy optimization algorithm for unknown, noisy system dynamics with possibly noisy observations [52]. The algorithm is based on multi-objective Bayesian optimization, where the objectives represent a nominal performance measure f_1(θ) and a robust performance measure f_2(θ), with θ being the policy parameters.
The solution to this optimization problem is defined as a Pareto set Θ* that contains all parameters θ* for which no other θ ∈ Θ improves one objective f_i without decreasing the other [52]. Their work has shown that the underlying concept of a trade-off between nominal and robust performance is viable for uncertain transition functions. A different realization of such trade-offs is presented in [41]. The authors propose a chance constraint formulation based on risk assessments for the expected performance of a policy. Assuming that the transition function p and reward function r are drawn from probability distributions f(p) and f(r), Delage and Mannor [41] define the chance constraint optimization problem as

max_{y∈R, π∈Π} y   s.t.   P_{r,p}( E_π[ Σ_{t=0}^∞ γ^t r(s_t) ] ≥ y ) ≥ 1 − ε,

which describes the risk-adjusted discounted performance of an uncertain MDP. The constraint in the optimization guarantees with probability 1 − ε that the expected performance of π will be greater than or equal to y given r ∼ f(r) and p ∼ f(p). For ε = 0, this problem becomes equivalent to the worst-case analysis of robust MDPs [41]. In this sense, the presented approach relaxes the constraint of worst-case analysis, which is to guarantee minimum performance at all times.
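A sampling caricature of this chance constraint (our own construction; `sampled_return` is a hypothetical stand-in for evaluating π under one draw r ∼ f(r), p ∼ f(p)): the largest certifiable level y is the empirical ε-quantile of the sampled returns.

```python
import random

random.seed(1)
EPS = 0.1   # allowed violation probability

def sampled_return():
    """Hypothetical: expected return of a fixed policy under one sampled MDP."""
    return random.gauss(10.0, 1.0)

returns = sorted(sampled_return() for _ in range(10_000))
y = returns[int(EPS * len(returns))]                     # empirical eps-quantile
coverage = sum(r >= y for r in returns) / len(returns)   # ~ 1 - eps by design
```

Setting ε = 0 pushes y down to the worst sampled return, recovering the worst-case flavor of robust MDPs on the sampled models.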
While the trade-off in [39,52] is an intuitive counter to the conservatism of the worst-case analysis, other research targets the rectangularity assumption as the main source of the problem [42,47,48]. The rectangularity assumption, i.e., that the uncertainty in each state is uncoupled from all other states, is necessary because general coupled uncertainty sets are intractable [42,47,48]. The consensus is that defining tractable uncertainty sets with coupled uncertainty across states mitigates the problem of overly conservative solutions.
Mannor et al. [42] initially proposed the idea of lightning does not strike twice (LDST). The algorithm targets systems where the parameters of the transition and reward function deviate from their nominal representations p̄_s, r̄_s only in a small number of states s. This consideration is made under the assumption that a deviation has a low probability. The total number of states allowed to deviate from their nominal parameters is hence bounded by D [42]. Following the example of [24], the authors consider a stationary and a time-variant model for applying LDST. In the stationary model, all deviations from the nominal parameters are chosen at the beginning of an episode and kept fixed thereafter. The resulting optimization problem is the mini-max

max_π min_{(p,r)} E[ Σ_t γ^t r_{s_t}(s_t, a_t) ],

where p_s and r_s refer to state-specific representations of the transition and reward function, and the adversary may let them deviate from p̄_s, r̄_s in at most D states. Accordingly, the time-variant model describes a sequential game, where the deviation is chosen upon entering the corresponding state; the optimization takes the same mini-max form, but rather than the number of states, the number of decision stages in which deviations occur is bounded by D. In this case, it is also possible to visit the same state multiple times, each time with different parameters. Their experiments show improved performance compared not only to worst-case robust policies but also to nominal non-robust policies.
In [47], the authors propose the concept of k-rectangular uncertainty sets, of which LDST is a special case. The k-rectangularity is a generalization of the standard rectangularity concept that captures the computational difficulty of uncertainty sets. If the projection of an uncertainty set onto any subset of states, conditioned on the parameters chosen for the remaining states, is one of at most k different possible sets, the uncertainty set is considered k-rectangular. Here, k is an integer upper bound that compactly encodes the coupling of uncertainty among different sets [47]. If S' ⊂ S is a nonempty subset of states, then the projection P_{S'} U of an uncertainty set U onto S' is the set of parameter choices for the states in S' that are consistent with some element of U. Mannor et al. [47] define an uncertainty set as k-rectangular if |U_{S*}| ≤ k for all S* ⊆ S, where U_{S*} denotes the class of conditional projection sets

U_{S*}(p_{S'} : S' ⊂ S \ S*, p_{S'} ∈ P_{S'} U),

defined for all S' ⊂ S \ S*. This definition limits the number of sets the class of projection sets can contain to k. Mannor et al. [47] noted that k-rectangularity generalizes the standard rectangularity concept such that standard rectangularity equals 1-rectangularity. With LDST being a special case of k-rectangularity, the authors focused their simulations on a comparison to LDST, showing further improvements in performance. Their work is a first attempt at providing coupled uncertainty sets flexible enough to overcome conservatism while remaining computationally tractable for the wider applicability of robust MDPs.
Another concept for tractable coupled uncertainty is factor matrix uncertainty sets [48]. A factor matrix W = (ω_1, . . . , ω_r) comprised of r factors is defined, where each factor is chosen from a corresponding uncertainty set W_i [48]. The factors are chosen independently to retain tractability despite coupled uncertainty, such that W ∈ W with W = W_1 × . . . × W_r being a Cartesian product. This property is referred to as (r)-rectangularity [48]. Each factor ω_i represents a probability distribution over the next state s_{t+1} ∈ S. It follows the factor matrix uncertainty set U_P ⊆ R^{S×A×S} with transition probabilities of the form

P(· | s_t, a_t) = Σ_{i=1}^r u_i^{s_t a_t} ω_i,

where the u_i^{s_t a_t} are coefficients of the convex combination describing all possible transition probabilities for a specific state-action pair. Goyal and Grand-Clement [48] show that if W is a Cartesian product, any (s, a)-rectangular uncertainty set can be reformulated as an (r)-rectangular uncertainty set. The authors further propose a robust value iteration algorithm based on (r)-rectangular uncertainty sets for finite-state MDPs. The provided experiments show significantly less conservative behavior compared to the (s)-rectangular approach in [45] while still achieving improved robust performance w.r.t. nominal non-robust MDPs.
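Numerically, the construction looks as follows (toy numbers of our own): the transition row for a state-action pair is a fixed convex combination of the r shared factors, so perturbing one factor couples the uncertainty across all pairs that use it.

```python
# Two shared factors, each a distribution over three next states.
factors = [
    [0.8, 0.1, 0.1],   # omega_1
    [0.1, 0.1, 0.8],   # omega_2
]
u = [0.3, 0.7]         # fixed convex coefficients u_i^{sa} for one (s, a) pair

# Transition row P(. | s, a) = sum_i u_i^{sa} * omega_i
p = [sum(u[i] * factors[i][s2] for i in range(len(u))) for s2 in range(3)]
```

Because each ω_i is a distribution and the u_i^{sa} sum to one, the resulting row p is automatically a valid probability distribution.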
A third viable solution to the conservatism problem of worst-case analysis is a distributional robust design (see Section 2.1). Robust MDPs take uncertainties related to the assumed transition probability model into account; nevertheless, potential ambiguities within the matrix P itself remain untouched. The distributional robust optimization framework adds these uncertainties by shifting the focus to distributions D_P over transition probability matrices that, on average, lead to the worst-case models P ∼ D_P but tolerate deviations. Consequently, this formulation implies less conservative agents. From a different perspective, a common interpretation of this distributional information is that of a prior in Bayesian formulations [40,46,53]. A prior incorporates an additional layer of uncertainty and prevents overfitting to, in this case, a deterministic worst-case optimization over possible transition matrices. A typical choice of prior information is that all distributions D_P ∈ U_D lie within a certain range of a nominal distribution D_{P,0}. This range is defined by some difference measure D, e.g., the KL-divergence or the Wasserstein metric, giving rise to a ball of possible distributions surrounding the nominal distribution [38,43,49-51,53,54]. Other approaches define the ambiguity set in terms of constraints on the first and second moments of distributions: the first constraint ensures that the first moment lies within an ellipse, while the second enforces that the second-moment matrix lies within a positive semi-definite cone [41,43]. Such constraints may, for example, be confidence regions placing more plausible transition matrices into higher confidence intervals [40,41,43,46]. Further works include ambiguity sets based on a reproducing kernel Hilbert space metric [133] or near-optimal Bayesian ambiguity sets [134].
While this survey will not further detail distributional approaches, we advise the reader to look into [135] for a comprehensive review focusing solely on distributional robust optimization. So far, it has been shown that robust MDPs can be solved using a robust dynamic programming approach with convergence guarantees [24,25]. Dynamic programming, however, becomes intractable in large state spaces, a common occurrence in practical problems, due to the curse of dimensionality. Solutions to this problem have already been researched for non-robust MDPs in the form of linear and non-linear function approximations. While non-linear function approximations, such as deep neural networks, are more versatile, they no longer guarantee convergence to globally optimal value functions [136]. Linear approximations, on the other hand, retain convergence guarantees while mitigating the curse of dimensionality [136]. One of the first approaches applying linear function approximation to robust MDPs is presented by Tamar et al. [10]. The authors propose a robust variant of approximate dynamic programming (ADP). Given a standard robust MDP with an uncertain transition function and a known uncertainty set U_P, the Q-function is derived as

Q^π(s, a) = r(s, a) + γ min_{P∈U_P(s,a)} E_{s'∼P}[ V^π(s') ].

This Q-function is then approximated by Q̂^π(s, a) = φ(s, a)^T ω, where ω ∈ R^k are weights related to a feature representation φ(s, a) ∈ R^k of the states and actions. Optimizing over the linear approximation of the Q-function yields a greedy policy π_ω(s) = arg max_a φ(s, a)^T ω for a given ω. Tamar et al. [10] further provide convergence guarantees under certain conditions. Only recently, Badrinath and Kalathil [136] showed further development in linear approximations for robust MDPs. The authors derive a robust variant of least squares policy evaluation and least squares policy iteration by defining an approximate robust TD(λ) operator as a more general model-free learning framework, an aspect lacking in [10].
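The linear architecture itself is easy to sketch (toy features and weights of our own, not the features used in [10]):

```python
def q_hat(phi, omega, s, a):
    """Linear Q estimate: Q_hat(s, a) = phi(s, a)^T omega."""
    return sum(f * w for f, w in zip(phi(s, a), omega))

def greedy_action(phi, omega, s, actions):
    """Greedy policy pi_omega(s) = argmax_a phi(s, a)^T omega."""
    return max(actions, key=lambda a: q_hat(phi, omega, s, a))

# Hypothetical 3-dimensional feature map: bias, state, action.
phi = lambda s, a: [1.0, float(s), float(a)]
omega = [0.0, 0.5, 1.0]
```

With the weight on the action feature positive, the greedy policy simply picks the largest available action in this toy setup; the robust ADP variant would fit ω against worst-case targets instead of nominal ones.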
A similar effort has been pursued by Abdullah et al. [137]. It is argued that most robust learning algorithms fail to extend to a generalized robust learning framework, as they are often bound to the exploitation of task-specific or other properties, such as low-dimensional discrete state and action spaces. To mitigate this problem, Abdullah et al. [137] propose Wasserstein robust reinforcement learning (WR2L). The algorithm is designed to work in both discrete and continuous spaces as well as in low- and high-dimensional problems. Their framework relies on the Wasserstein metric, which, compared to other measures of distance between distributions, such as the KL-divergence, is a genuine distance, exhibiting symmetry. Assuming a possibly unknown reference dynamics model P_0, a set of candidate dynamics P is defined as the ε-Wasserstein ball around P_0, where ε ∈ R+ denotes the degree of robustness. This set represents the action space of an adversary in a two-player zero-sum game, similar to uncertainty sets. Following this notion, the optimization problem is described as max_π min_P E[Σ_t γ^t r(s_t, a_t)] subject to E_{(s,a)}[W_2^2(p(· | s, a), p_0(· | s, a))] ≤ ε, for all possible candidate dynamics P bounded by the expected Wasserstein distance. Abdullah et al. [137] show promising results for improved robustness compared to nominal non-robust RL and other robust algorithms in both low- and high-dimensional MuJoCo robotics environments [138]. A well-fitting but rarely used methodology when encountering uncertainties in RL, whether internal or external, is a Bayesian treatment. One of the traditional Bayesian formulations of RL is posterior sampling. While such sampling methods are typically designed for low-dimensional tabular settings, solutions like the uncertainty Bellman equation (UBE) [139] scale posterior sampling methods up to large domains [140]. The extension to robust MDPs is presented as the uncertainty robust Bellman equation (URBE) [140].
Following the insights of [24], the recursive robust Q-function of a robust MDP with finite horizon T is Q_t(s, a) = r(s, a) + inf_{p∈U_P(s,a)} E_{s'∼p}[max_{a'} Q_{t+1}(s', a')]. Accordingly, Derman et al. [140] derive a posterior of this Q-function for a posterior uncertainty set Û_P(ψ), with ψ being state-action-dependent confidence levels. This uncertainty set is constructed for each episode based on the observed data from all previous ones. A solution ω to the URBE then quantifies the posterior uncertainty of the robust Q-function. This approach offers a trade-off between robustness and conservatism for robust policies. Derman et al. [140] propose a DQN-URBE algorithm, for which they show that it can adapt significantly faster to changing dynamics online compared to existing robust techniques with fixed uncertainty sets. A slightly out-of-scope approach is ensemble policy optimization (EPOpt) [141], an algorithm that uses an ensemble of simulated source domains representing different parameter settings and slight variations of the true target environment. These source domains M(φ) contain parameterized stochastic transition and reward functions P_φ, r_φ whose parameters are drawn from a distribution D_φ. The goal is to learn an optimal policy π*_θ(s) with good performance for all source domains while simultaneously adapting D_φ to better approximate the target domain W. The algorithm is split into two alternating steps: (i) given a source distribution, find a robust policy; (ii) gather data from the target domain using said robust policy and adapt the source distribution [141]. The parameterized policy π_θ is evaluated by two nested expectations, J(θ, φ) = E_τ[Σ_t γ^t r(s_t, a_t) | φ] over trajectories within a fixed source domain and J(θ) = E_{φ∼D_φ}[J(θ, φ)] over the domain distribution, optimized for the conditional value at risk (cVaR) to find soft-robust policies following the work of [142]. The adaptation step of the source domain distribution is defined as a Bayesian update using data acquired by applying the current policy to the target domain. Experiments on the OpenAI Gym Hopper environment [143] show no performance losses over a wide range of torso masses.
While the source domains were specifically chosen to include the target domain in these experiments, further experiments have shown that even badly initialized sets of source domains only require a few iterations to adapt to the target domain [141].
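The domain-selection step of this cVaR objective can be sketched in a few lines (the returns and ε below are illustrative; EPOpt additionally reuses the trajectories from the selected domains for the policy update, which is omitted here):

```python
import numpy as np

def epopt_worst_fraction(returns, epsilon=0.1):
    """EPOpt-style selection: keep the worst epsilon-fraction of sampled
    source domains (a cVaR objective over domain parameters).

    returns : array of per-domain returns, shape (n_domains,)
    Returns the indices of the selected (worst-performing) domains.
    """
    n_keep = max(1, int(np.ceil(epsilon * len(returns))))
    return np.argsort(returns)[:n_keep]

rng = np.random.default_rng(0)
# hypothetical returns of one policy across 20 sampled domain parameters
domain_returns = rng.normal(100.0, 15.0, size=20)
worst = epopt_worst_fraction(domain_returns, epsilon=0.2)
cvar_estimate = domain_returns[worst].mean()   # lower tail, hence <= mean
```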
All of the discussed approaches follow the traditional discrete-time reinforcement learning paradigm. However, a few approaches aim for continuous-time designs [7,144,145]. As one of the first to derive a robust reinforcement learning approach, Morimoto and Doya [7] rely heavily on the concept of H ∞ -control and differential games (see Section 2.2). Their approach will be discussed in more detail in Section 3.2 in the context of the disturbance robust design. Mankowitz et al. [144], however, focus on uncertainties in the transition function, presenting a robust variant of maximum a posteriori policy optimization (MPO).
Instead of optimizing the squared TD error, the authors propose an optimization of the worst-case squared TD error, (r(s_t, a_t) + γ inf_{p∈U_P(s_t,a_t)} E_{s'∼p}[V(s')] − V(s_t))^2, with U_P being a state-action-dependent uncertainty set. To further deal with the problem of overly conservative policies, Mankowitz et al. [144] suggest an entropy regularization of the robust Bellman operator from [25]. Most approaches consider uncertainties only in transitions, actions, observations, or disturbances. Lutter et al. [145], instead, propose a robust procedure akin to dynamic programming for continuous state-action spaces and continuous-time formulations, which accounts for perturbations in states, actions, observations, and model parameters all at once. The authors present their algorithm as robust fitted value iteration (rFVI), a robust variant of their previously presented algorithm, continuous fitted value iteration (cFVI). Assuming a priori known or learned transition dynamics that are non-linear in the system state but affine in the action, and a separable reward function, the optimal policy and perturbations can be calculated analytically in closed form for each type of perturbation. A separable reward function decomposes into the sum of an action-dependent and a state-dependent reward function, where both summands are non-linear, positive definite, and strictly convex [145]. Following that assumption, Lutter et al. [145] extend the policy evaluation step of the cFVI algorithm to a closed-form mini-max optimization.

Disturbance Robust Designs
Even though uncertainty in transition matrices is arguably the most intuitive choice for achieving parameter robustness, other intriguing and promising approaches have been proposed over the years. It is known from robust control that parameter changes or modeling errors can also be described as disturbance forces acting during state transitions of the environment [7,12]. For example, a shift in surface friction is represented as a disturbance force applied to the contact points of the agent with that surface. A decrease in friction eases movement across contact surfaces, equivalent to a pushing force, while an increase in friction acts similar to an opposing force. This concept was applied by the control community in the context of H∞-control. As in H∞-control, the disturbance robust design can be represented as a two-player zero-sum game. The adversary's action space is that of external forces (see Figure 6). In H∞-control, a controller is stable under all disturbances ‖w‖ < 1/γ if the H∞-norm of the closed-loop transfer function satisfies ‖T_zw‖_∞ ≤ γ (see Section 2.2.1). Solving the problem in Equation (9) corresponds to finding a control u in a dynamic system ẋ = f(x, u, w) that satisfies the constraint under all possible disturbances w with the initial state x(0) = 0. Minimizing this value function V under the maximum disturbance is equivalent to solving a differential game with an optimal value function V*(x) = min_u max_w V(x, u, w). From there, the Hamilton-Jacobi-Isaacs equation (HJI) is derived as a condition for the optimal value function. On this basis, Morimoto and Doya [7] formulate robust reinforcement learning for a continuous-time dynamic system ẋ = f(x, u) with an augmented value function V(x(t)) = ∫_t^∞ e^{−(σ−t)/τ} q(σ) dσ. Here, q(t) = r(x(t), u(t)) + b(w(t)) is the reward function augmented by a term b(w(t)) for withstanding disturbances. Their formulation relies on the contributions made in [84] for the continuous-time variant of reinforcement learning. The parameter τ denotes the time constant of discounting.
The optimal value function is derived as a solution of an HJI variant, (1/τ) V*(x) = max_u min_w [q(x, u, w) + (∂V*/∂x) f(x, u, w)].
Morimoto and Doya [7] propose an actor-disturber-critic architecture for a model-free implementation, where the policies are defined as u(t) = A_u(x(t); v_u) + n_u(t) and w(t) = A_w(x(t); v_w) + n_w(t), respectively. Here, A_u and A_w are function approximators with parameter vectors v_u and v_w and additive exploration noise n_u and n_w. Morimoto and Doya [7] derive closed-form updates for both policies. It is further proven that this new paradigm coincides with the analytic solution of H∞-control in the linear case. Experiments on a non-linear dynamical system show robust behavior against weight and friction changes where nominal RL approaches fail. Pinto et al. [8,9] utilize disturbances in reinforcement learning not only for robustness but also for sample efficiency in real-world learning. In their earlier work [8], the authors propose an adversarial framework of two real-world robots for learning grasping tasks. It is known that mining good and hard samples leads to faster convergence and better performance. Pinto et al. [8] show that the existence of a destabilizing adversary helps to reject weak notions of success, meaning that actions resulting in only a loose grip on the object are rejected, leading to faster and better learning. Besides learning quality, their framework also increases the robustness of grasping positions. This work is further extended in [9] into a formal robust reinforcement learning framework. Pinto et al. [9] propose a two-player zero-sum Markov game between a protagonist π(· | s) and a destabilizing adversary π̄(· | s), defined by the tuple M := (S, A, Ā, P, r, γ). Designing the Markov game with continuous action and state spaces and utilizing neural networks for non-linear function approximation allows for a broad variety of applications. The reward function is defined from the protagonist's perspective as R = E_{s_0, a∼π(s), ā∼π̄(s)}[Σ_t r(s_t, a_t, ā_t)]. Moreover, Pinto et al.
[9] deploy an iterative update procedure in which protagonist and adversary alternate between being updated and being kept fixed every n steps. Experiments show robustness against adversarial disturbances and variations in mass and friction across various OpenAI Gym environments, including complex robot walking. Even in the absence of any parameter changes or disturbances, the algorithm has shown improved performance compared to a trust region policy optimization (TRPO) [107] baseline. However, no theoretical guarantees have been provided [11].
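This alternating scheme can be illustrated on a simple concave-convex zero-sum payoff (the quadratic payoff J(u, w) = −(u − 3)² + uw + w² is a hypothetical stand-in for the RL return; in [9] both players are neural network policies trained with TRPO):

```python
def grad_u(u, w):  # dJ/du for J(u, w) = -(u - 3)**2 + u*w + w**2
    return -2.0 * (u - 3.0) + w

def grad_w(u, w):  # dJ/dw of the same payoff
    return u + 2.0 * w

u, w, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # (i) update the protagonist (gradient ascent on J), adversary fixed
    for _ in range(5):
        u += lr * grad_u(u, w)
    # (ii) update the adversary (gradient descent on J), protagonist fixed
    for _ in range(5):
        w -= lr * grad_w(u, w)
# the iterates approach the saddle point (u, w) = (2.4, -1.2)
```

For this strongly concave-convex payoff the alternating best responses form a contraction, so the iterates converge; for neural network players no such guarantee exists, matching the caveat above.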
Figure 6. Illustration of the underlying concept of disturbance robust designs. Uncertainties in the system dynamics are modeled as disruptive forces. These forces represent an additional condition on the transition probabilities. As such, the transition function shifts according to the adversarial action to produce the worst possible outcomes for the protagonist.

Action Robust Designs
So far, we have discussed a substantial amount of research on robust MDPs in the context of transition and reward uncertainty. While the amount of research on that topic is impressive, the majority is presented for the tabular case of mainly low-dimensional finite spaces [23-25,39,42,45]. Few contributions have touched upon linear [10] and non-linear function approximation, and especially non-linear approximations have only been addressed in recent years. It is, furthermore, often unclear how to obtain the mentioned uncertainty sets [11]. There have been further advances in non-linear robust reinforcement learning in the context of disturbance-based robustness, but partly without any theoretical guarantees [9,11]. Tessler et al. [11] instead argue that a more natural approach to introducing robustness is action perturbations. Naturally, a deviation in the action space also simulates environmental changes to a certain extent. Consider a magnitude reduction for a chosen continuous action that encodes some force or speed. Such a reduction has the same effect as increasing friction or introducing an opposing force. As such, action perturbations change the expected behavior of an environment. A schematic representation of the methods discussed in action robust designs is shown in Figure 7.
Following this general idea, Tessler et al. [11] propose two types of action robust MDPs, the noisy action robust MDP (NR-MDP) and the probabilistic action robust MDP (PR-MDP). Both MDPs are defined by the tuple M := (S, A, r, P, γ) with some joint policy π^mix(π, π̄) between the protagonist π and adversary π̄. In the case of the NR-MDP, the action space A denotes a compact and convex metric space for the joint actions to ensure that the mixture actions are valid. The reason behind proposing two different MDP formulations lies in the inherent nature of the robustness they encode. The NR-MDP is designed to represent constant interrupting forces applied to the agent, e.g., through the unexpected weight of a robot arm constantly applying a downward force. This constant force takes on the form of adversarially chosen noise added by the adversary through the joint policy. As such, Tessler et al. [11] define a noisy joint policy π^mix_{N,η}(π, π̄) as π^mix_{N,η}(a | s) = E_{b∼π(·|s), b̄∼π̄(·|s)}[I_{a=(1−η)b+ηb̄}] for all s ∈ S, a ∈ A, where η scales the impact of the constant disturbance applied by the adversary. The optimal η-noisy robust policy then follows as the solution of max_{π∈Θ(Π)} min_{π̄∈Π} E[Σ_t γ^t r(s_t, a_t)] with a_t ∼ π^mix_{N,η}(π(s_t), π̄(s_t)), where Θ(Π) is a set of stationary stochastic policies and Π is a set of stationary deterministic policies. The optimal policy is then robust w.r.t. any bounded perturbations added by the adversary. Naturally, by choosing η = 0, the NR-MDP collapses back to the standard non-robust MDP.
In contrast, the PR-MDP describes interruptions of the protagonist's movements through, e.g., sudden pushes. In this type of MDP, there is a certain probability η that the adversary takes control of the decision-making process to perform a worst-case action. This probability is encoded into the probabilistic joint policy π^mix_{P,η}(π, π̄), defined as π^mix_{P,η}(a | s) = (1 − η)π(a | s) + ηπ̄(a | s) for all s ∈ S.
The optimal probabilistic robust policy is given as the solution of max_{π∈Θ(Π)} min_{π̄∈Π} E[Σ_t γ^t r(s_t, a_t)] with a_t ∼ π^mix_{P,η}(π(s_t), π̄(s_t)). For their experiments, Tessler et al. [11] introduce a robust variant of deep deterministic policy gradient (DDPG) [101] based on soft policy iteration to train two deterministic policy networks, one for the protagonist and one for the adversary. Similar to standard DDPG, a critic is trained to estimate the Q-function of the joint policy. The experiments showed that, with few exceptions, their approaches performed better than a baseline in several MuJoCo robotics environments in which, with increasing probability, random noise is applied instead of the chosen action. Surprisingly, their approaches also outperformed the baseline in the absence of any perturbations. Compared to the classical robustness approach based on known uncertainty sets, the PR-MDP and NR-MDP approaches do not require any knowledge of the uncertainty sets, as these are implicitly given through η. However, this advantage also restricts the MDPs, as they cannot handle arbitrary worst-case perturbations. Tessler et al. [11] further show that the PR-MDP is a specific case of the robust MDP formulation.
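A minimal sketch of the two mixture executions (η, the actions, and the adversary below are placeholders; in [11] both policies are learned networks):

```python
import numpy as np

rng = np.random.default_rng(42)

def nr_mdp_action(b, b_adv, eta):
    """Noisy action robust MDP: the executed action is a fixed convex
    mixture of protagonist action b and adversary action b_adv."""
    return (1.0 - eta) * b + eta * b_adv

def pr_mdp_action(b, b_adv, eta):
    """Probabilistic action robust MDP: with probability eta the
    adversary takes over the decision entirely."""
    return b_adv if rng.random() < eta else b

b, b_adv = np.array([1.0, 0.0]), np.array([-1.0, -1.0])
a_nr = nr_mdp_action(b, b_adv, eta=0.1)   # -> [0.8, -0.1]
a_pr = pr_mdp_action(b, b_adv, eta=0.1)   # b with prob 0.9, b_adv with prob 0.1
```

With η = 0 both mixtures reduce to the protagonist's action, mirroring the collapse to the non-robust MDP noted above.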
A similar idea to the PR-MDP was presented by Klima et al. [55], who extended TD learning algorithms by a new robust operator κ to improve robustness against potential attacks and perturbations in critical control domains. The approach has similarities to mini-max Q-learning in two-player zero-sum games, as proposed by Littman [95], but does not assume a minimization over the opponent's action space. Instead, attacks are defined to minimize over the protagonist's action space, such that both policies learn, but do not enact, simultaneously. Klima et al. [55] formalize this idea by replacing the mini-max simultaneous actions with stochastic transitions between multiple controllers with arbitrary objectives taking control in the next state s'. This formulation is similar to an adversary taking control with probability η in the PR-MDP setting. The value of the next state s' then depends on who is in control. Thus, the TD error (see Equation (15)) changes to δ = r + γ[(1 − η) max_{a'} Q(s', a') + η min_{a'} Q(s', a')] − Q(s, a), with η either known a priori or estimated by the agent. This framework allows the learning of robust policies in the presence of an adversary without ever executing critical adversarial actions on the system, as only the future estimate is affected by the adversary. Klima et al. [55] apply this approach to off-policy Q(κ)-learning and on-policy Expected SARSA(κ), where both algorithms lie in between the worst-case design and risk sensitivity. Klima et al. [55] show that the algorithms converge to the optimal Q-function Q* and the robust Q-function Q*_κ. Both algorithms, Expected SARSA(κ) and Q(κ)-learning, are compared to nominal Q-learning, SARSA, and Expected SARSA. All experiments show improved performance and robustness compared to the baseline methods.
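Such a robust TD target, blending the protagonist's best and the adversary's worst next-state action by the takeover probability, can be sketched as follows (a simplified rendering of the κ-operator idea, not the exact operator from [55]):

```python
import numpy as np

def robust_td_target(r, q_next, eta, gamma=0.95):
    """Robust TD target: the next state's value is a mixture of the
    protagonist's best action (prob 1 - eta) and the adversary's worst
    action (prob eta). eta = 0 recovers the plain Q-learning target."""
    return r + gamma * ((1.0 - eta) * q_next.max() + eta * q_next.min())

q_next = np.array([2.0, 5.0, -4.0])
t_nominal = robust_td_target(1.0, q_next, eta=0.0)   # 1 + 0.95 * 5 = 5.75
t_robust = robust_td_target(1.0, q_next, eta=0.3)    # pessimistic, hence lower
```

Note that the adversarial action only enters the target estimate; it is never executed, which is exactly the safety property emphasized above.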
In general, worst-case scenarios, in which an adversary attempts to minimize the expected return, do not automatically cause the agent to experience catastrophic but highly unlikely events, as the adversary does not actively pursue such outcomes. Pan et al. [56] argue that a robust policy should not only aim to maximize the expected return but also be risk-averse. The authors improve the framework of Pinto et al. [9] to provide increasingly harder challenges such that the protagonist learns a policy with minimal variance in its rewards. Pan et al. [56] present the risk-averse robust adversarial reinforcement learning algorithm (RARARL), assuming a two-player zero-sum sequential game where the protagonist and adversary take turns in controlling the environment. The protagonist takes a sequence of m actions, followed by a sequence of n actions chosen by the adversary. The game is formulated as a tuple of the form M := (S, A, r, P, γ), with S defining a possibly infinite state space, while the agents share the same action space A. Pan et al. [56] model risk as the variance across an ensemble of k Q-value estimates: the variance for an action a is calculated according to σ²(s, a) = (1/k) Σ_{i=1}^k (Q_i(s, a) − μ(s, a))², where Q_i is realized as the i-th head of a Q-value network and μ denotes the ensemble mean. The same formula is applied for the adversary using its Q-function Q̄. The models of the ensemble are trained across different sets of data. Pan et al. [56] utilize an asymmetric reward function to increase the probability of catastrophic events: good behavior receives small positive rewards, while catastrophic outcomes result in highly negative rewards. An important observation is that the protagonist needs to be trained separately first to achieve rudimentary control, as the protagonist is otherwise unable to keep up with the adversary. Pan et al. [56] show that, by introducing risk-averse behavior in the presence of a risk-seeking adversary, far fewer catastrophic events occur during the test phase.
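The ensemble-variance risk estimate and the resulting risk-averse action choice can be sketched as follows (the Q-values and the trade-off weight lam are illustrative):

```python
import numpy as np

def ensemble_q_variance(q_heads):
    """Per-action variance across the k heads of a Q-ensemble, used as a
    risk estimate for the current state.

    q_heads : array of shape (k, n_actions), one row per head
    """
    return q_heads.var(axis=0)

def risk_averse_action(q_heads, lam=1.0):
    """Pick the action maximizing mean Q minus lam times the variance."""
    score = q_heads.mean(axis=0) - lam * ensemble_q_variance(q_heads)
    return int(np.argmax(score))

# the two heads disagree strongly about action 1, so it is deemed risky
q_heads = np.array([[1.0, 5.0],
                    [1.0, -1.0]])
```

Here the mean Q-values alone would favor action 1, but its high ensemble variance makes the risk-averse rule fall back to action 0.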
While most robust designs formulate some bi-level optimization problem for which a Nash equilibrium needs to be found, Tan et al. [57] propose a new perspective in the context of action robust design. In their paper, Tan et al. [57] discuss the importance of robustness in deep reinforcement learning (DRL) for its validity, especially in safety-critical applications. The authors approach robustness from the perspective of adversarial attacks [59,146] on deep neural networks, as they are known from image classification. It has been discovered that neural networks are highly susceptible to perturbations of their input vectors [147]. Tan et al. [57] project adversarial attacks onto action robust control as bounded white-box attacks on the protagonist's actions. The robust optimization is defined as max_θ min_{‖δ‖≤ε} E[Σ_t γ^t r(s_t, π_θ(s_t) + δ_t)], where δ denotes the adversarial attack on the action space. The inner optimization problem is solved by projected gradient descent, while the outer problem is optimized through standard policy gradient techniques [57]. An important factor in this formulation is that the adversarial perturbations δ on the action space are optimized until convergence for each episode of the training process. The approach has been shown to improve the performance of DRL-based controllers in OpenAI Gym environments. Using adversarial attacks to formulate a robust reinforcement learning framework, however, is not new. Prior works have utilized this very idea in the context of an AI's perception of its environment [60,62].
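The inner projected-gradient-descent step can be sketched on a differentiable toy reward (grad_fn is a hypothetical stand-in for the gradient of the return w.r.t. the perturbed action; [57] obtain it through the learned critic and policy):

```python
import numpy as np

def pgd_action_attack(a, grad_fn, eps, lr=0.1, steps=50):
    """Projected gradient descent on an additive action perturbation
    delta, minimizing the reward subject to ||delta||_inf <= eps."""
    delta = np.zeros_like(a)
    for _ in range(steps):
        delta = delta - lr * grad_fn(a + delta)   # descend the reward
        delta = np.clip(delta, -eps, eps)          # project onto the ball
    return delta

# toy reward R(a) = -(a - 1)^2, maximized at a = 1
reward_fn = lambda a: -np.sum((a - 1.0) ** 2)
grad_fn = lambda a: -2.0 * (a - 1.0)               # dR/da
a = np.array([0.5])
delta = pgd_action_attack(a, grad_fn, eps=0.2)     # driven to the ball boundary
```

Since the attack can only move within the ε-ball, the converged perturbation sits on the boundary that hurts the reward most, which is the quantity the outer policy update then hardens against.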

Observation Robust Designs
Adversarial attacks on observations in reinforcement learning are closely related to generative adversarial networks (GAN) [58]. GANs are a framework for estimating generative models via an adversarial process. The approach describes a mini-max game between a discriminative model (DM) and a generative model (GM). The goal is to train generative models capable of creating data indistinguishable from a given data set. The GM generates samples that get mixed into the true data set while the adversary tries to identify the artificially created data of the GM. As such, the GM minimizes the probability of samples being distinguished from the true data, while the DM maximizes the probability of correctly separating true from generated data [58]. Loosely speaking, a GM is trying to trick a classifier into falsely classifying generated samples as true samples. However, considering the nature of classification tasks with more than two classes, there is no way to determine a truly worst-case classification as there is only right and wrong. This problem does not exist in reinforcement learning, i.e., control tasks, where a worst-case action or state can be identified.
In a similar fashion to GANs, adversarial attacks on observations aim to trick a protagonist into perceiving the true state of the environment as another (see Figure 8). The goal is to warp the protagonist's perception to the point that the perceived state causes the protagonist to act adversarially to its own interests [62]. Interestingly, such an approach is not aimed at robustness against sensory errors or sensor noise but against perturbations in the environmental parameters. The protagonist is forced into unwanted and unexpected transitions of the environment states, which can be seen as a change of the system dynamics. A significant advantage of adversarial attacks, compared to the max-min formulation most research has followed thus far, is that no equilibrium solution needs to be found [62]. Adversarial attacks are optimized for each step of the learning process, such that, from the perspective of the protagonist, a traditional single-player MDP still exists. As such, traditional RL algorithms can be applied [60,62]. Huang et al. [60] have presented early results in the context of image-based reinforcement learning for Atari games, using the fast gradient sign method (FGSM) [59] to find the worst possible actions. However, these results focus on the vulnerability of deep neural networks to perturbations in high-dimensional image data, where this vulnerability was initially found [62]. Pattanaik et al. [62] show that this vulnerability is not restricted to high dimensions but also applies to the standard lower-dimensional state observations known from control environments such as the Cartpole or Hopper. Further, while the objective function used in [60] for the FGSM to identify adversarial attacks leads to a decrease in the probability of taking the best possible action, it does not necessarily increase the probability of taking the worst possible action [62]. As such, Pattanaik et al.
[62] propose a more efficient objective function: the cross-entropy loss between the optimal policy of the protagonist π*_i(a_i | s) and an adversarial probability distribution p_i = P(a_i). The adversarial distribution has a probability of 1 if a_i is the worst possible action and 0 otherwise. The FGSM is incorporated as an additional step into traditional RL methods that identifies the worst possible perceived state according to the objective function in Equation (21) before querying the protagonist for the next action. Experiments on several OpenAI Gym and MuJoCo environments have shown that the presented approach adds robustness to environmental parameter perturbations in double deep Q-networks (DDQN) [148] and DDPG. However, despite the promising results, a theoretical connection of these adversarial attacks to robustness against parameter perturbations has not been provided [62].
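A toy version of this attack for a linear softmax policy (the weight matrix W, state, and step size are hypothetical; [62] apply the analogous sign-gradient step to deep policies):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def fgsm_state_attack(W, s, eps):
    """One FGSM step on the perceived state, decreasing the cross-entropy
    to a one-hot distribution on the 'worst' action (here simply the
    currently least likely action, a proxy for the Q-based choice in [62]).
    The policy is a hypothetical linear softmax: pi(a|s) = softmax(W @ s)."""
    pi = softmax(W @ s)
    worst = int(np.argmin(pi))
    one_hot = np.zeros_like(pi); one_hot[worst] = 1.0
    grad = (pi - one_hot) @ W            # d CE / d s for the linear policy
    return s - eps * np.sign(grad), worst

W = np.array([[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]])
s = np.array([0.7, -0.2])
s_adv, worst = fgsm_state_attack(W, s, eps=0.1)
p_before = softmax(W @ s)[worst]
p_after = softmax(W @ s_adv)[worst]      # probability of the worst action rises
```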
Parallel work by Mandlekar et al. [61] includes not only white-box adversarial attacks on the input space but also on the dynamics. The authors propose an extension of the training process in TRPO with a curriculum learning-based procedure that considers physically plausible perturbations. The approach targets dynamic systems of the form s_{t+1} = f(s_t, a_t; µ) + ν_t with observations o_t = s_t + ω_t, where µ, ν, and ω correspond to the dynamics noise, process noise, and observation noise, respectively. While ν and ω directly perturb the state or observation of the dynamical system, µ represents the uncertainty in the physical parameters of the model, e.g., mass and friction. During learning, the frequency with which perturbations occur in the system is increased over time through a parameter δ, following the curriculum learning paradigm. Through this process, the agent becomes increasingly more robust to both modeling errors and adversarial perturbations of the input space. In contrast to [62], Mandlekar et al. [61] use the full gradient to optimize the adversarial attacks. They argue that the FGSM is designed for high-dimensional image spaces and may cause scaling issues when applied to low-dimensional dynamic systems. Experiments on the MuJoCo robotics simulator show significant robustness improvements compared to nominal TRPO across all environments. While better results are achieved with adversarial perturbations during training, even random perturbations already show significant improvement. A more formal approach is presented in the form of state-adversarial MDPs (SA-MDP) [64]. The SA-MDP is defined as a tuple M := (S, A, B, r, P, γ), where the protagonist's observations are perturbed by a bounded function ν(s) ∈ B(s). With these perturbations, the authors aim for robustness against sensory errors and sensor noise. Naturally, the protagonist policy is then described as π(a | ν(s)). The SA-MDP produces a bi-level optimization problem for which finding a solution is challenging.
However, leveraging stochastic gradient Langevin dynamics [149] allows for a separate optimization of the inner problem up to a local optimum first. As a consequence, a possibly large gap to the global optimum remains. This gap is bounded by the total variation distance D_TV(π(· | s), π(· | ŝ)) or the KL-divergence D_KL(π(· | s) || π(· | ŝ)), where ŝ ∈ B(s) is the perturbed state observation, using convex relaxations for neural networks [150]. These bounds then allow for an outer optimization of an inner lower or upper bound, which provides robustness certificates for guaranteed minimum performance. The total variation distance and KL-divergence are integrated as a robust policy regularization into various traditional DRL algorithms, such as DDPG, proximal policy optimization (PPO) [108], and deep Q-networks (DQN) [100].
Further extending existing work on robustness guarantees and certification-bound algorithms from computer vision, Lütjens et al. [65] propose certified adversarially-robust reinforcement learning (CARRL) to solidify the value of deep reinforcement learning in safety-critical domains. Certification bounds theoretically provide guaranteed deviation bounds on the output of a neural network given an input perturbation. For deep learning, these bounds are relevant mainly because they guarantee bounded outputs even when non-linear activation functions are used. CARRL extends classical deep reinforcement learning algorithms such as DQN to guarantee protection against adversarially perturbed observations or sensor noise. In CARRL, the agent does not rely on the given observed state but rather assumes that the observed state is corrupted. Lütjens et al. [65] instead propose to exploit the possibility that the true state lies somewhere within an ε-ball B_p(s_adv, ε) = {s : ‖s − s_adv‖_p ≤ ε} around the assumed corrupted observed state s_adv. Based on the parameter ε, an upper and a lower bound, the certification bounds, are calculated for the predicted Q-value. Assuming that the training process induces the network to converge to the optimal value function given the lower bound, CARRL handles perturbed observations. While for ε = 0 CARRL reduces to nominal DQN, an increase in ε corresponds to increasingly conservative behavior of the policy. Experiments have shown that the proposed method outperforms nominal DQN on benchmarks with perturbed observations.
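For a linear Q-function the certified lower bound over an ℓ∞ ε-ball has a closed form, which makes the CARRL action rule easy to sketch (deep networks require convex relaxations instead; all numbers here are illustrative):

```python
import numpy as np

def carrl_action(Wq, bq, s_adv, eps):
    """Pick the action maximizing a certified lower bound of Q over the
    l_inf eps-ball around the (possibly corrupted) observation s_adv.

    For a linear Q-function Q(s, a) = Wq[a] @ s + bq[a], the exact lower
    bound over {s : ||s - s_adv||_inf <= eps} is
        Q_lb(a) = Wq[a] @ s_adv + bq[a] - eps * ||Wq[a]||_1.
    """
    q_nominal = Wq @ s_adv + bq
    q_lb = q_nominal - eps * np.abs(Wq).sum(axis=1)
    return int(np.argmax(q_lb)), q_lb

Wq = np.array([[2.0, 0.0], [0.5, 0.1]])
bq = np.array([0.0, 0.8])
s = np.array([1.0, 0.0])
a_nominal, _ = carrl_action(Wq, bq, s, eps=0.0)   # reduces to nominal DQN
a_robust, q_lb = carrl_action(Wq, bq, s, eps=0.6) # prefers the flatter Q row
```

The example mirrors the behavior described above: with ε = 0 the highest nominal Q-value wins, while a larger ε penalizes actions whose Q-values are sensitive to the observation and shifts the choice to a more conservative action.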
Most of the work evolving around observation robust designs assumes that the adversary has direct access to the protagonist's observations, which Gleave et al. [63] argue is a critical assumption not present in realistic situations. An example given by Gleave et al. [63] is autonomous driving, where pedestrians and other drivers can take actions that affect an AI's input but cannot directly manipulate any sensor data. Gleave et al. [63] instead propose adversarial policies acting in a multi-agent environment. This environment is described by a classical two-player zero-sum Markov game of the form M := (S, A, Ā, r, P, γ), where S denotes a finite state space, while A and Ā denote the finite action spaces of the protagonist and adversary. This formulation is not designed to target the protagonist's observations specifically. The experiments are designed as MuJoCo robotics simulations, where the protagonist and adversary are the same type of robot, just with different goals. The authors consider a win-loss reward function typically known from games such as chess or Go. The protagonist wins if it successfully completes a given task, while the adversary wins by preventing the completion of the given task, such that the reward function is given as r : S × A × Ā × S → R with r̄ = −r. An interesting result, and the reason why this work is placed into observation robust designs, is the unique way in which the trained adversaries win. Instead of solving the game for two evolving players, Gleave et al. [63] investigate the players' learning behavior for a fixed opponent in MuJoCo robotics simulations. First, the authors show that, even without directly manipulating the protagonist's perception, an adversary can be trained with PPO that decimates a fixed protagonist's performance simply by being present in the observations. These results further solidify what is known about the vulnerability of neural network policies to adversarial settings. Second, Gleave et al.
[63] present a fine-tuning of the protagonist for a fixed adversary to improve the robustness properties of the protagonist policy against such adversarial settings. However, as either one of the players is kept fixed, the learning process will eventually overfit to the fixed opponent and thus be again susceptible to new attacks [63].
Especially regarding neural network inputs, and hence observation robustness, research has shown how vulnerable neural network policies are to adversarial attacks and settings. Reaching a better understanding of adversarial training and behavior is crucial for achieving robustness, security, and a deeper understanding of deep reinforcement learning [63].

Relations to Maximum Entropy RL and Risk Sensitivity
Research in robust reinforcement learning has progressed far, and issues regarding conservatism, the construction of ambiguity sets, and, in part, convergence have been addressed. The latter, however, remains a concern; i.e., the existence of multiple Nash equilibria destabilizes convergence to meaningful robust policies. In recent years, an interesting alternative has been proven to provide robustness properties. With their research, Eysenbach and Levine [68,69] have investigated the relationship between maximum entropy reinforcement learning (MaxEnt RL) and robustness. This research dates back to work by Grünwald et al. [66], where it was shown that maximizing the entropy of a distribution is equivalent to maximizing the worst-case log-loss in prediction problems, which also translates to conditional distributions [68]. In [68], these insights are extended to reinforcement learning for certain classes of uncertain reward functions. The objective of MaxEnt RL is given as an extension of the classical reinforcement learning objective by an entropy term, max_π E_π[Σ_t r(s_t, a_t) + H_π(· | s_t)], where H_π refers to the Shannon entropy. This objective is equivalent to a reward robust objective given an uncertainty set U_r. This uncertainty set is specified by the definition of the entropy term in the objective, which, in this case, relates to logarithmic functions due to the Shannon entropy; given other entropy formulations, the definition of the uncertainty set changes [68]. These results have further been extended to consider uncertainties in the dynamics [69]. First, the authors propose a different definition of the uncertainty set for the reward robust objective as all functions r̃(s_t, a_t) satisfying a constraint that keeps them sufficiently close, in expectation, to the original reward, where ε > 0 is some positive constant.
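The entropy-regularized objective above is typically realized through a soft Bellman backup; a tabular sketch (the temperature α and Q-values are illustrative):

```python
import numpy as np

def soft_value(q_row, alpha):
    """Soft (MaxEnt) state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha).
    As alpha -> 0 this recovers the hard max over actions."""
    m = q_row.max()                                   # numerical stability
    return m + alpha * np.log(np.exp((q_row - m) / alpha).sum())

def maxent_policy(q_row, alpha):
    """The corresponding policy, pi(a|s) proportional to exp(Q(s, a) / alpha)."""
    z = np.exp((q_row - q_row.max()) / alpha)
    return z / z.sum()

q = np.array([1.0, 3.0])
# soft_value(q, alpha) exceeds max(q) = 3.0 by the entropy bonus for alpha > 0
```

The entropy bonus keeps probability mass on suboptimal actions, which is exactly the hedging against perturbed rewards that the equivalence above formalizes.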
For adversarial dynamics, the authors then consider an MDP with a transformed reward function r̃(s_t, a_t) = (1/T) log r(s_t, a_t) + H(s_{t+1} | s_t, a_t), which takes the entropy of the state transition probability into account. The MaxEnt RL objective then defines a lower bound on the robust objective for adversarial dynamics drawn from an uncertainty set U_P of transition functions p̃(s_{t+1} | s_t, a_t) that satisfy a constraint, taken in expectation under π(a_t | s_t), on their log-likelihood ratio to the nominal dynamics. According to Eysenbach and Levine [69] and Ziebart [151], this definition can be interpreted as the set of all dynamics p̃(s_{t+1} | s_t, a_t) sufficiently close to the original dynamics. For both cases, the authors provide formal proofs that these relations hold. Considering the difficulty of solving bi-level optimization problems like those defined in robust reinforcement learning, MaxEnt RL becomes an attractive and easier-to-solve alternative while still providing robustness properties to a certain extent. Further alternatives for achieving robustness have been found in the context of risk sensitivity. Contrary to the robust MDP, the risk-sensitive MDP [152][153][154] does not consider any uncertainty in its parameters. Instead, the objective aims to optimize some risk measure of the cumulative cost [67]. As in standard MDPs, early approaches build upon dynamic programming with finite state and action spaces [152][153][154][155][156]. Since then, research has progressed to more refined formulations with continuous spaces, such as Linear Quadratic Gaussian control [157,158] and relative entropy policy search [159]. A connection of risk-sensitive MDPs to robust MDPs has been established for two specific formulations: iterated risk measures and expected exponential utility functions [67].
Osogami [67] has shown that robust MDPs whose uncertainty is governed by a parameter 0 ≤ α ≤ 1, such that the uncertainty set U_P contains all transition functions p with p(s_{t+1} | s_t, a_t) ≤ (1/α) p_0(s_{t+1} | s_t, a_t), are equivalent to risk-sensitive MDPs with an iterated conditional tail expectation (ICTE) as objective. Here, p_0(s_{t+1} | s_t, a_t) refers to the nominal transition function. As such, Osogami [67] states that max_π min_{p ∈ U_P} E_{π,p}[r(π)] = max_π ICTE_α^{(N)}[r(π)], where the ICTE involves N recursive applications of CVaR_α, with r(π) = ∑_{t=1}^T r(s_t, a_t) being the cumulative reward. Osogami [67] further extends these proofs to simultaneous transition and reward uncertainty and to a more general class of robust MDPs using coherent risk measures. Xu and Mannor [40] also state that a coherent risk measure is equivalent to a distributionally robust formulation. For specific details, we refer the reader to [67].
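The ICTE can be illustrated on sampled returns: CVaR_α averages the worst α-fraction of outcomes, and the iterated version applies this measure stage-wise along the horizon. The two-stage toy example below, with illustrative function names, is a sketch rather than the dynamic-programming formulation of [67]:

```python
import numpy as np

def cvar(samples, alpha):
    """CVaR_alpha for rewards: the mean of the worst alpha-fraction
    of the sampled outcomes (risk-averse tail expectation)."""
    s = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(alpha * len(s))))
    return float(s[:k].mean())

def icte_two_stage(stage2_outcomes, alpha):
    """Iterated conditional tail expectation on a two-stage tree:
    apply CVaR_alpha to each second-stage outcome set, then apply
    CVaR_alpha again across the resulting first-stage values."""
    stage1_values = [cvar(outcomes, alpha) for outcomes in stage2_outcomes]
    return cvar(stage1_values, alpha)
```

Note that iterating the tail expectation is generally more pessimistic than applying CVaR once to the full return distribution, which is exactly what links it to the nested worst-case minimization of the robust MDP.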
The second equivalence to robust MDPs holds for expected exponential utility functions [67]. The exponential utility function is defined as U(r) = exp(γ r(π)), where the risk-sensitivity factor γ governs the resulting shape of the function as convex, linear, or concave. As such, this factor also determines whether the optimization behaves risk-seeking, risk-neutral, or risk-averse [158,159]. Optimizing the expectation of this utility function is equivalent to optimizing the entropic risk measure (ERM), since the logarithm is monotone: max_π (1/γ) log E[exp(γ r(π))] is equivalent to max_π E[exp(γ r(π))] for a risk-sensitivity factor γ > 0. The relation to robust MDPs relies on the dual representation of the ERM, (1/γ) log E_{q_0}[exp(γ r(π))] = max_{q ∈ U_{q_0}} { E_q[r(π)] − (1/γ) KL(q ‖ q_0) }, where q_0 is a probability mass function for r(π) and U_{q_0} denotes a set of probability mass functions whose support is contained in the support of q_0 [67,160]; in the risk-averse case γ < 0, the maximization over q turns into a minimization. Osogami [67] then derives that risk-sensitive MDPs optimizing the expected exponential utility are equivalent to robust MDPs with the objective max_π min_{p ∈ U_{p_0}} E_{π,p}[ r(π) − (1/γ) ∑_{t=1}^{T−1} KL(p(s_{t+1} | s_t, a_t) ‖ p_0(s_{t+1} | s_t, a_t)) ].
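The dual representation of the ERM can be checked numerically: the maximizer is the exponentially tilted distribution q* ∝ q_0 exp(γ r), at which the dual objective attains the ERM exactly. The finite-support example below, with illustrative function names, sketches this:

```python
import numpy as np

def erm(rewards, q0, gamma):
    """Entropic risk measure (1/gamma) * log E_{q0}[exp(gamma * r)]."""
    return float(np.log(np.dot(q0, np.exp(gamma * np.asarray(rewards)))) / gamma)

def kl(q, q0):
    """KL divergence between distributions with shared support."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / q0[mask])))

def dual_value(rewards, q0, gamma):
    """Evaluate E_q[r] - (1/gamma) KL(q || q0) at the tilted
    distribution q* proportional to q0 * exp(gamma * r), which
    attains the maximum of the dual representation."""
    r = np.asarray(rewards)
    w = q0 * np.exp(gamma * r)
    q = w / w.sum()
    return float(np.dot(q, r) - kl(q, q0) / gamma)
```

Since the tilted distribution attains the supremum, the two quantities agree to machine precision, which is a convenient sanity check when implementing entropic risk objectives.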
This formulation has also been extended to simultaneous transition and reward uncertainty [67].
Even though restrictions on the involved uncertainty sets exist, as in the MaxEnt RL approach, this connection between risk sensitivity and robustness provides efficient and easier-to-solve alternative algorithms for achieving parameter robustness.

Conclusions
The great success of RL in recent years is evidence of how far theoretical research has progressed. Using a trial and error-based design, RL mimics human learning behavior [1,2]. Current research shifts the attention to deployment in realistic environments. Classic RL methods, however, lack robustness to uncertainties, perturbations, or structural changes in the environment, all of which realistic problems inevitably introduce.

Summary
Our survey provides a comprehensive summary of robust RL and its underlying foundations. To this end, we cover multidisciplinary concepts from optimization, optimal control, and game theory. We conduct a literature review on robust RL and separate methods into four categories: (i) transition robust design, (ii) disturbance robust design, (iii) action robust design, and (iv) observation robust design. Each category targets a different aspect of the MDP formulation.
Transition robust methods define an uncertainty set of possible transition functions [23][24][25]44,45]. For finite-state MDPs, convergence guarantees are given [24,25]. However, to remain tractable, the strict assumption of rectangularity is required. A consequence is overly pessimistic policies. Modern contributions center around the deficiencies of traditional transition robust RL. The pessimistic behavior is tackled in three different ways. First, the authors in [39,41,52] propose a trade-off between robust and non-robust performance. In a multi-objective optimization scheme, the importance of robust performance is lowered in favor of the non-robust performance measures. Another set of works identifies the rectangularity property as the source of pessimistic behavior [42,47,48]. They propose a non-rectangular set of coupled uncertainty that remains tractable. The non-rectangularity effectively restricts worst-case outcomes to more realistic cases. Thirdly, as a combination of stochastic and robust optimization, distributionally robust methods weaken the worst-case formulation. Instead of a definite worst-case transition function, the adversary only chooses a worst-case distribution over transition functions. The additional layer of uncertainty prevents convergence to overly pessimistic policies [38,40,43,46,[49][50][51]53,54]. Additionally, the literature addresses the restriction of traditional transition robust designs to finite-state MDPs. As the dimensionality of state and action spaces grows, the classical methods suffer from the curse of dimensionality. Modern systems are rarely describable as low-dimensional discrete problems. Proposed remedies include linear and non-linear function approximation, e.g., approximate dynamic programming [10,136].
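The worst-case dynamic programming behind transition robust methods can be sketched for a tiny tabular MDP. In this illustrative, rectangular construction (all names and the simple interval-style uncertainty set are our own, not taken from a specific paper), the adversary may shift up to beta probability mass from the most favorable to the least favorable successor state of the nominal kernel:

```python
import numpy as np

def robust_value_iteration(P0, R, beta, discount=0.9, iters=200):
    """Worst-case value iteration under an (s,a)-rectangular set:
    for each (s,a), the adversary moves up to `beta` probability
    mass in the nominal kernel P0[s,a] from the successor with the
    highest value estimate to the one with the lowest.

    P0: (num_states, num_actions, num_states) nominal transitions
    R:  (num_states, num_actions) rewards
    """
    n_s, n_a, _ = P0.shape
    V = np.zeros(n_s)
    for _ in range(iters):
        Q = np.empty((n_s, n_a))
        for s in range(n_s):
            for a in range(n_a):
                p = P0[s, a].copy()
                worst = int(np.argmin(V))   # adversary's target state
                best = int(np.argmax(V))    # mass is taken from here
                shift = min(beta, p[best])  # keep p a valid distribution
                p[best] -= shift
                p[worst] += shift
                Q[s, a] = R[s, a] + discount * (p @ V)
        V = Q.max(axis=1)                   # agent maximizes, adversary minimized
    return V
```

With beta = 0 this reduces to nominal value iteration; any beta > 0 can only lower the resulting values, which is exactly the pessimism that the non-rectangular and distributionally robust refinements above aim to temper.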
Disturbance robust designs rely on external forces to express uncertainty in the system dynamics. Methods utilize this relation to define disturbing adversaries [7][8][9]. A core advantage is the removal of explicit uncertainty sets. However, more recent contributions are only demonstrated empirically without mathematical guarantees [8,9]. Compared to the other categories, the disturbance robust design has received relatively few scientific contributions.
Action robust designs, instead, model disturbances as perturbations of the agent's actions. The literature introduces two variations of action robust designs, each depicting a different type of external disturbance. Probabilistic action robust MDPs consider sudden disrupting forces. The PR-MDP simulates rare but catastrophic events, e.g., crashes in autonomous driving [11,55,56]. Noisy action robust MDPs, on the other hand, describe continuous perturbations of actions to simulate changes in physical parameters [11]. Both variants define a joint policy as a linear interpolation between the protagonist's and adversary's policy. Further, recent work adopts the concept of adversarial attacks to produce action robust agents [57]. Adversarial attacks are mainly known from input perturbations in deep learning [161,162].
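The two joint policies can be sketched directly from the descriptions above; the function names and the use of plain callables for the two policies are illustrative choices, not the formulation of [11]:

```python
import numpy as np

rng = np.random.default_rng(0)

def pr_mdp_action(protagonist, adversary, state, alpha=0.1):
    """Probabilistic action robust MDP: with probability alpha the
    adversary's action is executed instead of the protagonist's,
    modeling rare but abrupt disturbances."""
    return adversary(state) if rng.random() < alpha else protagonist(state)

def nr_mdp_action(protagonist, adversary, state, alpha=0.1):
    """Noisy action robust MDP: the executed action is the linear
    interpolation (1 - alpha) * a_prot + alpha * a_adv, modeling
    persistent perturbations of physical parameters."""
    return (1.0 - alpha) * protagonist(state) + alpha * adversary(state)
```

In both sketches alpha = 0 recovers the unperturbed protagonist, while increasing alpha hands the adversary more influence over the executed action.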
Observation robust designs leverage the vulnerability of policies to input perturbations [61][62][63][64]145]. The adversary exploits this vulnerability to distort the protagonist's perception. Consequently, the decision-making process is redirected to produce worst-case transitions. Most works define adversarial attacks as direct optimization of the observation or state space. As such, the presented methods effectively separate the optimization procedure. Instead of utilizing an adversarial RL formulation, the robust policy is obtained through classical RL algorithms [61][62][63][64]. Another work focuses on limiting an agent's response to adversarial attacks to provide robustness guarantees and certification bounds [65].
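Many observation robust attacks build on gradient-based input perturbations known from deep learning [161,162]. A minimal fast-gradient-sign-style sketch, where the gradient of the policy loss with respect to the observation is assumed to be supplied by the surrounding deep RL framework, looks as follows:

```python
import numpy as np

def fgsm_observation_attack(observation, grad_loss_wrt_obs, epsilon=0.01):
    """Fast-gradient-sign-style perturbation of the agent's observation:
    step each component in the direction that increases the policy's
    loss, bounded in the l-infinity norm by epsilon."""
    return observation + epsilon * np.sign(grad_loss_wrt_obs)
```

The resulting perturbation is imperceptibly small per component, yet it is exactly the kind of input distortion that redirects a vulnerable policy's decision making toward worst-case transitions.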
Aside from the core contributions, we covered literature on tightly connected areas. The authors in [7,144,145] discuss robust RL methods for continuous-time problems. Risk-based and entropy-regularized RL exhibit a strong connection to the transition robust design. In [66][67][68][69], the authors prove an equivalence of risk-sensitive MDPs to robust MDPs. Further, Osogami [67] shows a similar equivalence when optimizing expected exponential utility functions [158,159].

Outlook
Robust RL has produced various designs and approaches over the past two decades. Starting from promising mathematical guarantees, impressive results on a few realistic and complex problems have been achieved in recent years. However, the shift towards applicability research is just beginning and is not yet complete. While surveying the literature on robust RL and the interdisciplinary connections to other areas, we found three major issues.
First, no coherent baseline exists for comparing different robust RL methods. Most research is benchmarked against non-robust methods. Therefore, an important objective for future research is the development of a common baseline, including a set of benchmark tasks and an evaluation of pre-existing methods.
Second, the presented literature lacks a consistent measure of robustness. Recent experiments mainly center around variations of a few physical parameters of the environments. However, realistic problems are rarely restricted to variations of single parameters. An important direction is therefore the development of a metric that assesses adaptability to uncertainty across large sets of system parameters and disturbances.
The third issue is tractability and the resulting portability to realistic and complex systems. Convergence guarantees are only given for strictly constrained frameworks, mostly in transition robust methods. With the extension to high-dimensional problems with non-linear function approximation, these guarantees mostly become invalid. Therefore, especially w.r.t. equilibrium solutions, it remains an open research question to push robust methods to realistic and complex systems. In turn, some of the presented literature pursues practically driven research. While the results are promising, these works lack mathematical evidence.
Continued research of adversarial policies and robustness may provide a better understanding of deep RL [63]. Xu and Mannor [127] further state that robustness could yield desirable generalization properties. As such, there are still various directions in which research in robustness needs to progress-as well as opportunities for contribution.