Inverse Reinforcement Learning as the Algorithmic Basis for Theory of Mind: Current Methods and Open Problems

Abstract: Theory of mind (ToM) is the psychological construct by which we model another’s internal mental states. Through ToM, we adjust our own behaviour to best suit a social context, and therefore it is essential to our everyday interactions with others. In adopting an algorithmic (rather than a psychological or neurological) approach to ToM, we gain insights into cognition that will aid us in building more accurate models for the cognitive and behavioural sciences, as well as enable artificial agents to be more proficient in social interactions as they become more embedded in our everyday lives. Inverse reinforcement learning (IRL) is a class of machine learning methods by which to infer the preferences (rewards as a function of state) of a decision maker from its behaviour (trajectories in a Markov decision process). IRL can provide a computational approach for ToM, as recently outlined by Jara-Ettinger, but this will require a better understanding of the relationship between ToM concepts and existing IRL methods at the algorithmic level. Here, we provide a review of prominent IRL algorithms and their formal descriptions, and discuss the applicability of IRL concepts as the algorithmic basis of a ToM in AI.


Introduction
Our everyday interactions with others rely on us being aware of their mental states so that we can adapt our behaviour to best suit a social context. The increasing complexity and interconnectedness of our world requires that we interact with different types of agents and systems, including other humans, autonomous artificial agents, companies, and institutions. Theory of mind (ToM) is the psychological construct by which we infer the mental states of others we see as intentional, based on their behaviour [1]. Taking an intentional stance toward a system means treating it as a rational agent in order to predict or explain its behaviour by the desires and beliefs it is assumed to have given its purpose, irrespective of what the system is composed of [2]. An agent is said to be rational when it acts so as to fulfill a desire on the basis of its perception and beliefs, or in decision-theoretic terms, when it seeks to optimise a measure of reward for the decisions it makes (noting that in the context of machine intelligence, it is advisable to consider rich psychological concepts such as ToM, desires, and beliefs with care [3]).
ToM can be cast as an information-processing problem: transforming raw information from observations into representations that are useful for reasoning about others' mental states. As such, its function can be replicated algorithmically, but doing so requires knowing what representations are useful, and from what information and what transformations of it these representations can be obtained [4]. The design of these algorithms can be aided by insights into the neural correlates of social cognition from functional imaging studies [5], or from behavioural data [6]. A recent result for artificial agents on this front is the achievement of human-level performance in the game Diplomacy (a challenging task that requires inferring the beliefs and intentions of other players from natural language in order to negotiate and coordinate with them) by the Cicero AI agent [7]. A key step in their approach is to model an agent's action choices by assuming it simultaneously attempts to maximise the expected value of an action given other players' actions and minimise the difference between its action choices and those of a model fitted to human behaviour data, in a reward function that is structurally similar to Equation (41) below. As artificial agents become more embedded in the world, not only will humans need to take an intentional stance toward them [8], but artificial agents will need to have that same ability toward others [9][10][11]. With these points in mind, the ability to socially integrate AI is important for its future development, and ToM will play a central role in achieving it.
An important part of such an AI ToM is the specification of desires (goals), beliefs, and intentions that are fundamental to how agents make choices [12]. Early work in the intersection between psychology and AI (published the same year as the first paper on ToM [13]) provided algorithmic methods by which to infer goals and plan structures from actions, by using linguistic descriptions of action sequences [14], and was later extended to account for differing beliefs between the actor (i.e., the agent whose internal states are being modelled) and observer (i.e., the agent doing the modelling) [15]. Others showed semantic representations of the relationship between intentions and beliefs [16]. Notable work by Yoshida et al. [17] proposed a Game ToM model wherein the value function in a Markov decision process (MDP) is defined over the joint state spaces of all agents in the environment. This leads to a recursive optimisation of the joint value function in each agent up to a certain level of sophistication. Under the assumption that the rewards are the same for and known by all agents, instead of inferring rewards from behaviour, it is sufficient to infer others' level of sophistication in order to act strategically. More recent and oft-cited computational implementations of ToM, Bayesian ToM [18,19] and Machine ToM [20], seek to recover agent goals as well as their beliefs in an MDP setting. A class of machine learning methods that is particularly designed to operate in the MDP framework is inverse reinforcement learning (IRL), the objective of which is to infer the reward function of an agent from its state-action trajectories. The potential suitability of preference learning, and IRL in particular, as a computational approach for ToM was recently outlined by Langley et al. [21] and Jara-Ettinger [22], respectively. IRL has seen a recent resurgence of interest, with multiple reviews of methods appearing in the last few years [23][24][25][26][27].
Simultaneously, a growing body of research focuses on computational approaches to modelling other agents [28][29][30]. In spite of these contributions, a better understanding of the relationship between ToM concepts and existing IRL methods at the algorithmic level is required to adopt IRL as the algorithmic basis of ToM.
Here we provide a review of prominent IRL algorithms and their formal descriptions and discuss the applicability of IRL concepts as foundations for an algorithmic ToM. Section 2 provides background on IRL, including the conceptual formulation of the problem, its foundations on reinforcement learning (RL), important concepts and notation, and its relation to ToM. Section 3 explains the connection between desires and rewards and reviews algorithmic approaches to two issues that arise: how to discriminate between different reward functions that equally explain observed behaviour (Section 3.1), and how to characterise the reward function in the context of the problem (Section 3.2). Section 4 discusses the importance of beliefs in the IRL problem and their interpretation in this context as relating to transition dynamics (Section 4.1) and state observability (Section 4.2). Section 5 covers methods that relate to the intentions of an agent, including how suboptimal behaviour (Section 5.1) and multiple intentions (Section 5.2) are accounted for. Section 6 highlights important and promising considerations for expanding IRL and making it more suitable as an algorithmic approach to ToM.

Background
RL algorithms learn to optimise agent actions given observations of the state of the agent's environment, with respect to a reward function. This reward function is the most succinct representation of a task. We use the terms "reward" and "utility" interchangeably. Utility has more general connotations and widespread use in economics and game theory, whereas reward is more common in AI, and specifically in RL. Conceptual efforts in economics led to the development of theories of rational multiobjective decision making based on the attributes of each available choice, in what is known as multiattribute utility theory. A central pillar in this line of work is the quantification of the decision maker's preferences [31]. Russell [32] called attention to the lack of work on the computational aspects of this problem, which he related to machine learning as the dual of RL and named IRL. The task was characterised as follows.
Given: (1) measurements of an agent's behaviour over time, in a variety of circumstances; (2) if needed, measurements of the sensory inputs to that agent; (3) if available, a model of the environment. Determine: the reward function being optimised.
Under the principle of rationality, a rational agent's behaviour is driven by a tendency to optimise for its desires given its beliefs. The intentional stance invokes this principle to attribute causality for behaviour to mental states [33]. An agent's reward function is the driver of its behaviour and can therefore operate as a representation of its desires. On the other hand, the agent's beliefs about the world inform what behaviour is appropriate or feasible, and play a crucial role in planning toward fulfilling its desires. IRL may serve as an algorithmic paradigm for inferring the mental states (beliefs, desires) of others based on their observed behaviour (i.e., ToM) [22].

Problem Formulation and Notation
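The problem setting for IRL, as for RL, is a (finite) MDP, characterised by the tuple $(S, A, T, D, \gamma, R)$:
• $S$, a finite set of states
• $A$, a finite set of actions
• $T : S \times A \times S \to [0, 1]$, the state transition probabilities, with $T(s, a, s') = P(s' \mid s, a)$
• $D$, a probability distribution over $S$ from which the initial state is drawn ($s_0 \sim D$)
• $\gamma \in [0, 1)$, a time discount factor
• $R : S \times A \to \mathbb{R}$, a reward function whose absolute value is bounded by $R_{\max}$.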
Behaviour within an MDP is dictated by a policy. A deterministic policy is a function $\pi : S \to A$, yielding an action choice $a$ for a given state $s$. A stochastic policy is a probability distribution over actions given a state, $\pi(a \mid s) = P(a \mid s)$. A mixed policy $\psi$ is a distribution over a set of deterministic stationary policies $\Pi$, with $\lambda_k = P(\pi = \pi_k)$, or equivalently, a convex combination with coefficients $\lambda_k \geq 0$, $\sum_k \lambda_k = 1$. Mixed policies are executed by selecting a policy $\pi_k$ with probability $\lambda_k$ at the start of the MDP and following this policy for the entirety of the problem. A policy is optimal when it maximises its associated value function, or "expected sum of discounted rewards",
$$V^{\pi} = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, \pi, T, D\right] = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{(s, a) \sim d_{\pi,T,t}}\left[R(s, a)\right], \quad (1)$$
where $d_{\pi,T,t}$ is the state-action distribution at time $t$, a result of the agent's policy and the environment's transition probabilities, with $s_0$ drawn from $D$. The Bellman equation provides a way to recursively compute the values of states under a policy. For a given MDP and a deterministic policy $\pi$,
$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s') \, V^{\pi}(s').$$
An additional auxiliary function commonly used in MDP settings is the Q-function,
$$Q^{\pi}(s, a; R) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \, V^{\pi}(s'), \quad (3)$$
which defines the cumulative reward to be expected from performing action $a$ while in state $s$, and can be used to obtain an optimal policy $\pi^*(s) \in \arg\max_{a \in A} Q^{\pi}(s, a; R)$ for a given $R$.
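The Bellman recursion and the Q-function described above can be illustrated on a toy problem. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP (the arrays `T` and `R`, the discount `GAMMA`, and all function names are illustrative and not taken from the reviewed works). It evaluates a fixed deterministic policy by iterating the Bellman equation, computes the corresponding Q-values, and then acts greedily on them.

```python
from typing import List

# Hypothetical two-state, two-action MDP used only for illustration.
# T[s][a][s2] is the probability of moving from s to s2 under action a:
# action 0 stays in place, action 1 switches state.
T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [[0.0, 0.0], [1.0, 1.0]]  # R[s][a]: reward 1 for acting while in state 1
GAMMA = 0.9


def evaluate_policy(T, R, policy: List[int], gamma: float,
                    tol: float = 1e-10) -> List[float]:
    """Iterate the Bellman equation for a deterministic policy until it converges."""
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [
            R[s][policy[s]]
            + gamma * sum(T[s][policy[s]][s2] * V[s2] for s2 in range(n))
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new


def q_function(T, R, V: List[float], gamma: float) -> List[List[float]]:
    """Q(s, a) = R(s, a) + gamma * sum_s2 T(s, a, s2) * V(s2)."""
    n_s, n_a = len(R), len(R[0])
    return [
        [R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
         for a in range(n_a)]
        for s in range(n_s)
    ]


def greedy_policy(Q: List[List[float]]) -> List[int]:
    """Pick, in every state, an action with the largest Q-value."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


V_stay = evaluate_policy(T, R, policy=[0, 0], gamma=GAMMA)  # "always stay"
Q_stay = q_function(T, R, V_stay, GAMMA)
improved = greedy_policy(Q_stay)  # switch out of state 0, stay in state 1
```

In this toy setting, acting greedily on the Q-values of the "always stay" policy moves the agent to the rewarding state and keeps it there.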
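A useful representation of a policy is its discounted state-action visitation distribution, or occupancy. A policy $\pi$ and its occupancy measure $\mu^{\pi}$ can be used interchangeably for a given environment: occupancy provides a representation of the policy as influenced by the transition dynamics of the environment. By employing Kronecker delta notation ($\delta_{ij} = 1$ if $i = j$, 0 otherwise), the occupancy is
$$\mu^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, \delta_{s_t s} \, \delta_{a_t a} \,\middle|\, \pi, T, D\right]$$
and is sufficiently defined through the linear Bellman flow constraints [34]
$$\sum_{a \in A} \mu^{\pi}(s, a) = D(s) + \gamma \sum_{s' \in S} \sum_{a' \in A} T(s', a', s) \, \mu^{\pi}(s', a') \quad \forall s \in S, \qquad \mu^{\pi}(s, a) \geq 0 \quad \forall (s, a) \in S \times A.$$
These constraints define a set $G$ of all the constraint-satisfying occupancies, each of which can be represented as a vector $\mu^{\pi} \in \mathbb{R}^{|S \times A|}$ [35].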
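To make the notion of occupancy concrete, the sketch below accumulates the discounted state-action visitation of a stochastic policy by propagating the state distribution forward in time, and then checks the Bellman flow balance numerically. It assumes a hypothetical two-state MDP; `T`, `D`, `policy`, the truncation `horizon`, and the helper names are illustrative choices, not part of the reviewed methods.

```python
from typing import List

# Hypothetical MDP: action 0 stays in place, action 1 switches state.
T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
D = [1.0, 0.0]                       # initial state distribution
policy = [[0.5, 0.5], [0.9, 0.1]]    # policy[s][a] = pi(a | s)
GAMMA = 0.9


def occupancy(T, D, policy, gamma: float, horizon: int = 500) -> List[List[float]]:
    """Discounted state-action occupancy, truncated after `horizon` steps."""
    n_s, n_a = len(policy), len(policy[0])
    mu = [[0.0] * n_a for _ in range(n_s)]
    d = list(D)          # state distribution at the current time step
    discount = 1.0
    for _ in range(horizon):
        for s in range(n_s):
            for a in range(n_a):
                mu[s][a] += discount * d[s] * policy[s][a]
        # Push the state distribution one step through policy and dynamics.
        d = [
            sum(d[s] * policy[s][a] * T[s][a][s2]
                for s in range(n_s) for a in range(n_a))
            for s2 in range(n_s)
        ]
        discount *= gamma
    return mu


def flow_residuals(T, D, mu, gamma: float) -> List[float]:
    """Left-hand side minus right-hand side of the flow balance, per state."""
    n_s, n_a = len(mu), len(mu[0])
    return [
        sum(mu[s]) - D[s]
        - gamma * sum(T[s2][a2][s] * mu[s2][a2]
                      for s2 in range(n_s) for a2 in range(n_a))
        for s in range(n_s)
    ]


mu = occupancy(T, D, policy, GAMMA)
total_mass = sum(sum(row) for row in mu)   # close to 1 / (1 - GAMMA)
residuals = flow_residuals(T, D, mu, GAMMA)
```

The total mass of the (unnormalised) occupancy approaches 1/(1 − γ), and the residuals vanish up to truncation error, which is what membership of the constraint-satisfying set amounts to here.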

IRL Concepts
The notation MDP\R is used to denote an MDP where the reward function $R$ is not given. The IRL task consists in finding an $R$ for which the agent's observed behaviour is optimal given an MDP\R, usually working within a parametric class $\{R_{\theta} : \theta \in \Theta\}$. The canonical approach to this is to use a linear approximation of $R$ from features $\phi(s, a) \in \mathbb{R}^d$ of each state and action, with weights $\theta \in \mathbb{R}^d$, such that $R(s, a) = \theta^T \phi(s, a)$. This is explained in further detail, with alternative approaches, in Section 3.2. The idea of approximating utility (reward) as a linear function of subutilities (features) dates back to the work of Carmel and Markovitch [36], used for opponent modelling in extensive form games (chess), and is closely related to earlier work in economics, such as in [37]. Often $R$ and $\phi$ are functions of the state only, instead of state-action pairs. The extension to this setting is trivial if we adopt the state-action formulation. Under a policy, we have feature expectations (expected discounted value of features) $v^{\pi} \in \mathbb{R}^d$:
$$v^{\pi} = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t, a_t) \,\middle|\, \pi, T, D\right].$$
A feature matrix $F \in \mathbb{R}^{d \times |S \times A|}$ can be employed to encapsulate the features for each state-action pair, $F_{(\cdot, s, a)} = \phi(s, a)$, resulting in $v^{\pi} = F \mu^{\pi}$. Under a linear approximation of the rewards, $V^{\pi}$ (Equation (1)) can alternatively be obtained from features (through linearity of expectations) with
$$V^{\pi} = \theta^T v^{\pi}. \quad (7)$$
Feature counts from a given trajectory $\tau$ provide a compact representation of the trajectory:
$$v^{\tau} = \sum_{t=0}^{H} \gamma^t \phi(s_t, a_t).$$
The measurements of the behaviour of the agent of interest (here actor or expert) over time are given as demonstrations $\mathcal{D}$, usually taking the form of an unordered set $\mathcal{D} = \{\tau_j^{\pi}\}_{j=1}^{m}$ of sequential paths (i.e., trajectories) $\tau_j^{\pi} = ((s, a)_t)_{t=0}^{H_j}$ of length $H_j + 1$. Additional information may be provided in the demonstrations, including feature matrices, occupancies, etc.
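Empirical estimates of occupancy and feature expectations can be computed as average counts from the observed trajectories:
$$\tilde{\mu}^{\pi}(s, a) = \frac{1}{m} \sum_{j=1}^{m} \sum_{t=0}^{H_j} \gamma^t \, \delta_{s_t^j s} \, \delta_{a_t^j a}, \qquad \tilde{v}^{\pi} = \frac{1}{m} \sum_{j=1}^{m} \sum_{t=0}^{H_j} \gamma^t \phi(s_t^j, a_t^j). \quad (9)$$
The above results all apply to mixed policies through linearity of expectations.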
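As an illustration of how such empirical estimates can be obtained from demonstrations, the sketch below computes discounted feature counts per trajectory and averages them over the demonstration set. The one-hot state feature map `phi`, the two toy trajectories, and the weights `theta` are hypothetical and only serve to make the arithmetic easy to follow; discounting the counts by γ is an assumption consistent with the discounted feature expectations defined in the text.

```python
from typing import Callable, List, Sequence, Tuple

Trajectory = Sequence[Tuple[int, int]]  # sequence of (state, action) pairs


def phi(s: int, a: int) -> List[float]:
    """Hypothetical feature map: one-hot encoding of the state (ignores the action)."""
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]


def feature_counts(tau: Trajectory,
                   phi: Callable[[int, int], List[float]],
                   gamma: float) -> List[float]:
    """Discounted sum of features along a single trajectory."""
    d = len(phi(*tau[0]))
    counts = [0.0] * d
    for t, (s, a) in enumerate(tau):
        f = phi(s, a)
        for i in range(d):
            counts[i] += (gamma ** t) * f[i]
    return counts


def empirical_feature_expectations(demos: Sequence[Trajectory],
                                   phi: Callable[[int, int], List[float]],
                                   gamma: float) -> List[float]:
    """Average of the per-trajectory discounted feature counts."""
    per_traj = [feature_counts(tau, phi, gamma) for tau in demos]
    d = len(per_traj[0])
    return [sum(c[i] for c in per_traj) / len(per_traj) for i in range(d)]


demos = [
    [(0, 1), (1, 0), (1, 0)],
    [(0, 0), (0, 1), (1, 0)],
]
v_tilde = empirical_feature_expectations(demos, phi, gamma=0.5)

# With linear rewards, the value estimate is the inner product with the weights.
theta = [0.0, 1.0]
V_tilde = sum(w * v for w, v in zip(theta, v_tilde))
```

With γ = 0.5 the two trajectories yield counts [1, 0.75] and [1.5, 0.25], so the averaged estimate is [1.25, 0.5].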

Boltzmann Policies
To make stochastic policies more robust to environmental changes, it is desirable that they assign nonzero probabilities to actions other than the optimal one to allow for exploration. According to Jaynes' Maximum Entropy Principle (MaxEnt) [38], to find a probability distribution that minimises bias for a given partial set of information requires maximising the amount of entropy (or uncertainty) in the distribution, subject to the known information. Recall the definition of a deterministic optimal policy through the Q-function (Equation (3)). To define a stochastic policy, a distribution over action choices can be determined from their associated Q-values instead. Subject to moment matching constraints for the zeroth and first moments,
$$\sum_{a \in A} \pi(a \mid s) = 1 \quad \forall s \in S, \qquad \mathbb{E}[Q^{\pi}(s, a)] = \sum_{a \in A} \pi(a \mid s) \, Q^{\pi}(s, a) \quad \forall s \in S,$$
the entropy-maximising distribution is the Boltzmann distribution, or "Boltzmann policy" as it is known in the RL context,
$$\pi(a \mid s) = \Pr(a \mid s, R, \pi) = \frac{1}{Z} \exp\left(\alpha Q^{\pi}(s, a; R)\right),$$
with normalising constant, or partition function, $Z(s, R, \pi, \alpha) = \sum_{a' \in A} \exp(\alpha Q^{\pi}(s, a'; R))$ and (negative) potential energy $Q^{\pi}(s, a; R)$. The hyperparameter $\alpha$ serves as an inverse temperature parameter defining the steepness of the policy distribution, or how "greedy" for optimal Q-values it is. This greediness may be understood as the level of rationality attributed to the agent by the ToM observer. This distribution is known as the Softmax function in the machine learning literature. A maximum entropy optimal policy can be obtained through the soft Q-function
$$Q^{\text{soft}}(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \, V^{\text{soft}}(s'), \qquad V^{\text{soft}}(s) = \log \sum_{a' \in A} \exp Q^{\text{soft}}(s, a'),$$
where the value function is defined through the LogSumExp function [39]. Actions with higher Q-values incur less regret, and rational agents are expected to prefer them. An alternative and equivalent interpretation of this policy is as proportional to the exponential advantage of an action:
$$\pi(a \mid s) \propto \exp\left(Q^{\pi}(s, a) - V^{\pi}(s)\right). \quad (13)$$
The Boltzmann distribution provides a smooth parametric model for the action choice distribution in a given state (i.e., a policy) that is shaped by the Q-function at the state. These qualities, along with the alignment with MaxEnt and the degree of freedom in the temperature hyperparameter, are desirable attributes in modelling rational agents. For these reasons, a sizable proportion of IRL algorithms resort to a Boltzmann assumption when characterising the policy (e.g., Sections 3.1.1.6, 3.1.2.2, 3.1.2.3, 3.1.3.2 and 3.1.6.3). It is common practice to assume the Q-function is given in a converged state or obtained through dynamic programming (value iteration) or RL methods (e.g., Q-learning). In the ToM interpretation, the accuracy of the Q-function with respect to the true values of the MDP encodes, in part, the accuracy of the agent's beliefs, a core mental attitude of ToM, as discussed in Section 4. Desires, another core mental attitude in models of ToM, are encoded as rewards in rational agent models. In the following section, we review IRL algorithms whose emphasis is on recovering these rewards, and discuss how they can provide an effective computational approach to inferring an agent's desires from their behaviour.
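The role of the inverse temperature can be seen in a few lines of code. The sketch below is a minimal illustration, assuming a hypothetical vector of Q-values for a single state: it computes a Boltzmann policy with the usual max-subtraction trick for numerical stability, and checks the exponential-advantage form against a LogSumExp state value for α = 1. The function names and the Q-values are illustrative.

```python
import math
from typing import List


def boltzmann_policy(q_row: List[float], alpha: float) -> List[float]:
    """pi(a | s) proportional to exp(alpha * Q(s, a)), computed stably."""
    scaled = [alpha * q for q in q_row]
    m = max(scaled)                        # subtract the max to avoid overflow
    weights = [math.exp(x - m) for x in scaled]
    z = sum(weights)                       # partition function (up to exp(m))
    return [w / z for w in weights]


def log_sum_exp(q_row: List[float]) -> float:
    """Soft state value V(s) = log sum_a exp(Q(s, a))."""
    m = max(q_row)
    return m + math.log(sum(math.exp(q - m) for q in q_row))


q_values = [1.0, 2.0, 0.0]                 # hypothetical Q(s, .) for one state

uniform = boltzmann_policy(q_values, alpha=0.0)    # alpha = 0: no preference
soft = boltzmann_policy(q_values, alpha=1.0)
greedy = boltzmann_policy(q_values, alpha=50.0)    # large alpha: near-deterministic

# Exponential-advantage form for alpha = 1: pi(a | s) = exp(Q(s, a) - V(s)).
v_soft = log_sum_exp(q_values)
advantage_form = [math.exp(q - v_soft) for q in q_values]
```

An observer attributing a small α to the actor treats its choices as close to random, whereas a large α corresponds to attributing near-perfect rationality.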

Inferring an Agent's Desires
The problem of inferring an agent's desires with computer science methods was first approached by Russell [32], who coined the term IRL and suggested a possible algorithmic direction based on the use of a parametric form of the reward function, as is common in econometrics. In the IRL problem setting, this function can be fitted using Pr(τ|R θ ), the likelihood of observing behaviour τ if the true reward function were R θ , as a loss function. The parameterisation (or lack thereof) of R offers a design choice (see Section 2.2). Any optimisation method can be employed given this formulation, usually involving interleaving policy optimisation and reward function selection. However, there may exist multiple reward functions for which the observed trajectories are optimal, including degenerate solutions. This issue was the first to be addressed in the IRL literature [40], and provides our first classification axis for algorithms, namely by how they discriminate between plausible reward functions, as covered in Section 3.1. Another important concern is how the reward function can be characterised beyond a linear parameterisation, which we explore in Section 3.2.

Reward Function Discrimination
The first algorithmic treatment of the IRL task was by Ng and Russell [40], proving IRL soluble for moderately sized discrete and continuous state spaces. They characterised, analytically, the set of all reward functions for which a given policy is optimal for finite state spaces, and suggested heuristics to constrain said set of reward functions in the form of penalties for the cost of single-step deviations from the given (optimal) policy and regularisation of the rewards (modulated by hyperparameter $\lambda$). For finite state spaces, $R$ (and any other function of the states) can be represented as a vector $\mathbf{R}$ whose $i$th element is $R(s_i)$. Similarly, the state transition probabilities can be encapsulated in a tensor $T \in [0, 1]^{|S| \times |S| \times |A|}$, which can be indexed by action to obtain a matrix $T_a$ where each $(i, j)$ element is the probability of transitioning from state $s_i$ to state $s_j$ upon performing action $a$. In the resulting formulation, including the penalties, the goal is to find the $\mathbf{R}$ that maximises
$$\sum_{i=1}^{|S|} \min_{a \in A \setminus a_1} \left\{ \left(T_{a_1}(i) - T_a(i)\right) \left(I - \gamma T_{a_1}\right)^{-1} \mathbf{R} \right\} - \lambda \|\mathbf{R}\|_1,$$
with $a_1 \equiv \pi(s)$ and subject to the constraints
$$\left(T_{a_1} - T_a\right) \left(I - \gamma T_{a_1}\right)^{-1} \mathbf{R} \succeq 0 \quad \forall a \in A \setminus a_1, \qquad |\mathbf{R}_i| \leq R_{\max}, \quad i = 1, \dots, |S|,$$
where $T_a(i)$ denotes the $i$th row of $T_a$ and $\succeq$ represents elementwise inequality (for all elements). To extend the method to infinite state spaces, the authors resorted to a linear approximation of $R$ with given fixed features. This approach to the problem is the canonical form of the so-called maximum-margin class of IRL techniques, the goal of which is to estimate a reward function that maximises the difference between an optimal policy and the rest of the available policies [27], or equivalently, a set of weights $\theta$ that parameterise a reward function $R_{\theta}$, such that under $R_{\theta}$, $V^{\pi^*} \geq V^{\pi}$ for all $\pi$. This particular method requires a way to approximate the value of $V^{\pi}$ under any MDP. Invoking Equation (7), the task is then to find the $\theta$ that maximises, for a sample $S_0 \subset S$ of the state space,
$$\sum_{s \in S_0} \min_{a \in A \setminus a_1} p\left( \mathbb{E}_{s' \sim T(s, a_1, \cdot)}\left[V^{\pi}(s')\right] - \mathbb{E}_{s' \sim T(s, a, \cdot)}\left[V^{\pi}(s')\right] \right) \quad \text{s.t.} \quad |\theta_i| \leq 1, \quad i = 1, \dots, d, \quad (15)$$
where $p$, with $p(x) = x$ if $x \geq 0$ and $p(x) = 2x$ otherwise, is a penalty function for states in which $\pi$ is not optimal under $\hat{R}$. The penalty weight value of 2 in $p$ is arbitrarily chosen. The original paper asserts that results were not sensitive to this value.
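The effect of the penalty on single-step deviations can be illustrated numerically. The sketch below assumes that the penalty takes the piecewise-linear form used by Ng and Russell (identity for non-negative arguments, doubled for negative ones) and that a value vector `V` for the candidate reward is already available. The toy dynamics `T`, the values, and the function names are hypothetical.

```python
from typing import List, Sequence


def penalty(x: float) -> float:
    """p(x) = x if x >= 0, else 2x: violations of optimality count double."""
    return x if x >= 0 else 2.0 * x


def sampled_margin_objective(T, V: List[float], policy: List[int],
                             sample_states: Sequence[int]) -> float:
    """Sum over sampled states of the smallest penalised one-step advantage
    of the policy's action over every alternative action."""
    n_s = len(V)

    def expected_next_value(s: int, a: int) -> float:
        return sum(T[s][a][s2] * V[s2] for s2 in range(n_s))

    total = 0.0
    for s in sample_states:
        a1 = policy[s]
        total += min(
            penalty(expected_next_value(s, a1) - expected_next_value(s, a))
            for a in range(len(T[s])) if a != a1
        )
    return total


# Hypothetical MDP: action 0 stays in place, action 1 switches state.
T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
V = [9.0, 10.0]   # assumed state values under some candidate reward

# A policy that heads for (and stays in) the higher-valued state ...
objective_good = sampled_margin_objective(T, V, policy=[1, 0], sample_states=[0, 1])
# ... versus one that heads for the lower-valued state.
objective_bad = sampled_margin_objective(T, V, policy=[0, 1], sample_states=[0, 1])
```

Here the consistent policy scores 2.0, whereas the inconsistent one scores −4.0 because each violation is doubled by the penalty.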
Finally, further generalising the method, they introduced an algorithm to find $\hat{R}$ such that a policy $\pi$ to be determined maximises $V^{\pi}$ when a set $\mathcal{D}$ of trajectories $\tau$ through $S$ is given in lieu of $\pi_E$. This algorithm requires (i) a way to approximate $V^{\pi}$ (as above), (ii) a way to find an optimal policy $\pi_k$ under any $R$ (techniques from the RL literature can be employed to this end), and (iii) the ability to simulate trajectories starting from $s_0$ under policy $\pi$ in the MDP.
Approximations of the feature expectations and the value function can be obtained by performing $m$ Monte Carlo trajectories of length $H$ under $\pi$, and averaging over their values (for a large $H$, the difference as compared with an infinite time horizon is negligible):
$$\tilde{v}^{\pi} = \frac{1}{m} \sum_{j=1}^{m} \sum_{t=0}^{H} \gamma^t \phi(s_t^j, a_t^j), \qquad \tilde{V}^{\pi} = \theta^T \tilde{v}^{\pi}. \quad (17)$$
Similarly, we can obtain approximations for the expert's feature expectations, but note that their accuracy is contingent on the size of the set of demonstrations provided, and that some states may not be visited in some cases.
In their final algorithm, shown in Algorithm 1, the optimisation step is similar to Equation (15), with the same constraints, but replacing the expectation by the empirical average $\tilde{V}^{\pi}$ from Equation (17), and the domain $S \times A \setminus a_1$ by the set of policies $\Pi$. The initial state $s_0$ is fixed for all the trajectories, but this results in no loss of generality if we let $s_0$ be a dummy state and set $T(s_0, a, s_1) = D(s_1)$ for all $a \in A$.
In the context of ToM, the given trajectories represent the observer's knowledge of the actor's behaviour. The longer the trajectories and the larger the set of trajectories (hyperparameters $H$ and $m$, respectively), the better the observer can be said to know the actor. The resultant $\hat{R}$ is the observer's model of what drives the actor's behaviour (i.e., its utilities), which may be used in conjunction with policy estimates $\pi \in \Pi$ (i.e., its probabilities) to predict its future behaviour. In Figure 1, we group these two variables together conceptually as the observer's model of the agent (green, dashed outline). The rationality of the actor is based on these two sources of information [41]. The observer requires a model of the environment (MDP\R) to be able to estimate the model of the agent. This is a sensible requirement for any agent. In Algorithm 1, it is assumed to be completely faithful to the real environment (Figure 1, yellow with dashed and solid outlines, respectively).
One outstanding question is the meaning of the basis functions, or environment features, $\phi_i$ in the context of ToM. We place them conceptually within the observer, as depicted in Figure 1 (orange). The cardinality $d$ of the space $\Theta$ in which we perform the linear approximation of the reward function, and thus its expressivity, depends on how many features the observer makes use of. Intuitively, they stand for the perceptual acuity of the observer: the number of different "stimuli" the observer can differentiate amongst and attribute value to. They are likely to differ from those of the actor, if the actor has them in the first place; it may not know its subutilities and simply be guided by its reward function. Simple examples of features in the scenario of an agent crossing the road include whether there is a car present, the speed of the car, the state of the pedestrian crossing traffic lights, etc. Moreover, not only the features, but the state observations themselves may differ between the actor and the observer (e.g., first-person vs. third-person point of view). In Ng and Russell [40] they are "given" and fixed.
As new trajectories are observed, the same algorithm can be used to update the weights if the current $\theta$ and $\pi_k$ are used instead of randomly initialising them. We call attention to the fact that, although this was not stated in the algorithm as presented, it can yield the set of policies $\pi \in \Pi$, as well as their respective $\tilde{v}^{\pi}$, $\tilde{V}^{\pi}$, and the different reward function estimates $\hat{R}$ under which each of the policies was optimised.
Figure 1. Given trajectories $\tau_E$, the observer constructs a model of the actor comprising a policy $\pi$ and reward function $R$ (dashed green), employing a model of the environment (i.e., a model of the MDP\R, dashed yellow, which is usually assumed to be a priori known by the observer and equal to the actual environment, yellow) to generate candidate trajectories $\tau_{\pi}$. Both trajectories are compared (blue) with the aid of features $\phi$ (orange) that are intrinsic to the observer to update the weights $\theta$. The weights characterise the reward function in conjunction with the features. Iteratively repeating this process yields a suitable reward function.

Feature Expectation Matching
Abbeel and Ng [42] contributed modifications to the max-margin approach under somewhat stricter constraints: $R$ is bounded in absolute value by 1, which requires $\|\theta^*\|_1 \leq 1$, and therefore $\|\theta^*\|_2 \leq 1$. Casting the problem as apprenticeship learning (AL), deviating slightly from the IRL premise, their goal is to find a policy $\pi$ that performs close to the expert policy under the unknown reward function $R^*$. Their focus is on the feature expectations: the estimated policy $\pi$ must obtain (empirical) feature expectations close to the expert's, i.e., satisfy $\|v^{\pi} - v^{\pi_E}\|_2 \leq \epsilon$. In other words, the true goal is not uncovering the reward function: although the algorithm guarantees finding a policy whose feature expectations are within $\epsilon$ of the expert's, the reward function recovered as part of this process may not be correct. Because the 2-norm of the linear approximation weights $\theta$ is restricted to be at most 1, matching feature expectations also bounds the difference between the value functions of the expert's ($\pi_E$) and estimated ($\pi$) policies: $|V^{\pi_E} - V^{\pi}| = |\theta^T (v^{\pi_E} - v^{\pi})| \leq \|\theta\|_2 \, \|v^{\pi_E} - v^{\pi}\|_2 \leq \epsilon$. Such a policy (and pertaining feature expectations) can be obtained almost identically to Algorithm 1, with the key difference being in the weights' update step, as per Algorithm 2, and the change in the loop exit condition to $t < \epsilon$. They provide an additional, simpler algorithm based on computing the orthogonal projection of the feature expectations onto the segment between previous iterations' expectations and demonstrate its faster convergence compared to their max-margin method.

Algorithm 2: Excerpt from the algorithm in [42], with adapted notation (weights update step, loop terminal condition, and final step returning $\theta$).
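A single iteration of the simpler projection-based variant can be sketched as follows. This is an illustrative reading of the projection step in [42], assuming that the running estimate `v_bar` is projected onto the line through itself and the newest policy's feature expectations `v_new`, that the weights are then set to the residual towards the expert, and that `v_new` differs from `v_bar`. All names and the two-dimensional example are hypothetical.

```python
import math
from typing import List, Tuple


def projection_step(v_bar: List[float], v_new: List[float],
                    v_expert: List[float]) -> Tuple[List[float], List[float], float]:
    """Project the expert's feature expectations onto the line through
    v_bar and v_new; return the updated v_bar, the weights, and the margin t."""
    direction = [n - b for n, b in zip(v_new, v_bar)]
    numerator = sum(d * (e - b) for d, e, b in zip(direction, v_expert, v_bar))
    denominator = sum(d * d for d in direction)   # assumed non-zero
    step = numerator / denominator
    v_bar_next = [b + step * d for b, d in zip(v_bar, direction)]
    theta = [e - b for e, b in zip(v_expert, v_bar_next)]
    t = math.sqrt(sum(w * w for w in theta))      # distance left to the expert
    return v_bar_next, theta, t


v_bar_next, theta, t = projection_step(
    v_bar=[0.0, 0.0], v_new=[2.0, 0.0], v_expert=[1.0, 1.0]
)
# In a full loop, an RL solver would now find a policy for the reward
# theta^T phi, and iteration would stop once t drops below a tolerance.
```

In this example the projection lands at [1, 0], so the next reward weights [0, 1] point along the feature on which the estimate still falls short of the expert.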

Multiplicative Weights Apprenticeship Learning
Syed and Schapire [43] expand on the apprenticeship learning algorithm in [42] with algorithmic tools from game theory. Further constraining the ranges of $\phi(s) \in [-1, 1]^d$ and $\theta \in \Theta_C = \{\theta \in \mathbb{R}^d : \|\theta\|_1 = 1 \text{ and } \theta \succeq 0\}$ allows for defining the margin
$$V^{\psi} - V^{\pi_E} = \theta^T \left(v^{\psi} - v^{\pi_E}\right).$$
The goal is defined through the game value (note the difference in symbols $\nu$, $v$)
$$\nu^* = \max_{\psi \in \Psi} \min_{\theta \in \Theta_C} \theta^T \left(v^{\psi} - v^{\pi_E}\right),$$
with $\psi^*$ the maximising mixed policy, i.e., $\psi^*$ is the mixed policy that maximises $V^{\psi} - V^{\pi_E}$ for the worst-case possibility for $\theta^*$, a sensible constraint because $\theta^*$ is unknown. This allows for a zero-sum game formulation, though only abstractly, so the "players" are not the observer and actor, but the rewards and the policy (this is the foundational concept of adversarial IRL methods, reviewed in Section 3.1.7). "Min player" sets the reward by choosing $\theta$, and "max player" chooses a mixed policy $\psi$, adversarially. As such, the game can be defined via a $d \times |\Pi|$ game matrix with
$$G(i, k) = v_i^{\pi_k} - v_i^{\pi_E},$$
where $i$ indexes the feature dimensions $d$ and $k$ the policies in $\Pi$, the space of policies $\pi$. From this, we have
$$0 \leq \nu^* = \max_{\psi \in \Psi} \min_{\theta \in \Theta_C} \theta^T G \psi = \min_{\theta \in \Theta_C} \max_{\psi \in \Psi} \theta^T G \psi$$
in Von Neumann's minimax form [44]. The 0 lower bound is explained as follows. The stricter constraint setting $\theta \in \Theta_C$ is equivalent to assuming all the features "got the sign right" in relation to how they contribute to the reward (because the weights are all positive). This assumption results in $\psi^*$ having higher value than $\pi_E$ when $v^{\psi^*} \succeq v^{\pi_E}$ regardless of the value of the actual weights $\theta^*$.
To solve this optimisation problem, they adapt the multiplicative weights algorithm from [45]. This algorithm has two main steps. (1) Given min player "strategy" $\theta$, find $\psi^* = \arg\max_{\psi \in \Psi} \theta^T G \psi$ (i.e., find an optimal policy in the MDP with known $R$, through any MDP solver); (2) Given max player "strategy" $\psi$, compute $(\theta^{(i)})^T G \psi$ for each of the $d$ pure (i.e., one-hot) strategies $\theta^{(i)}$ (i.e., compute the feature expectations $v \in \mathbb{R}^d$ of the given policy $\psi$, which can be done by solving $d$ systems of linear equations, or approximated iteratively). These steps in Algorithm 3 are equivalent to the projection algorithm from [42]. The complexity of these steps scales with the size of MDP\R, and not with $G$. The upper bound approximation step in [46] (Algorithm 4, line 12) is similar.
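The min player's side of these two steps can be sketched with a generic multiplicative-weights update. The rescaling of the game-matrix entries to [0, 1] and the fixed value of `beta` below are assumptions made for illustration (they follow the general shape of multiplicative-weights schemes and should not be read as the exact constants of [43]); the feature expectations are hypothetical.

```python
from typing import List


def mw_update(weights: List[float], v_policy: List[float],
              v_expert: List[float], beta: float, gamma: float) -> List[float]:
    """Shrink the weight of every feature in proportion to how much the
    current policy outperforms the expert on it (min player's update)."""
    updated = []
    for w, vp, ve in zip(weights, v_policy, v_expert):
        # With features in [-1, 1], (1 - gamma) * (vp - ve) lies in [-2, 2];
        # the affine map below rescales it to a loss in [0, 1] (assumed form).
        loss = ((1.0 - gamma) * (vp - ve) + 2.0) / 4.0
        updated.append(w * beta ** loss)
    return updated


def normalise(weights: List[float]) -> List[float]:
    """Map positive weights onto the simplex, giving the reward weights theta."""
    total = sum(weights)
    return [w / total for w in weights]


weights = [1.0, 1.0]
v_policy = [5.0, 0.0]   # hypothetical feature expectations of the current policy
v_expert = [0.0, 5.0]   # hypothetical expert feature expectations

weights = mw_update(weights, v_policy, v_expert, beta=0.5, gamma=0.9)
theta = normalise(weights)
# The next step would hand theta to an MDP solver to obtain the max player's
# best-response policy, whose feature expectations feed the next update.
```

After the update, the min player places more weight on the second feature, where the current policy does worst relative to the expert, which is the adversarial behaviour the game formulation calls for.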
The mixed policy returned by the Multiplicative Weights AL algorithm consists of a uniform distribution over estimated policies $\hat{\pi}$ that are $\epsilon_{\pi}$-optimal, meaning $|V(\hat{\pi}) - V(\pi^*)| \leq \epsilon_{\pi}$. The game matrix $G$ is slightly modified and makes use of $\epsilon_v$-good feature expectations estimates, meaning $\|\hat{v} - v^{\pi}\|_{\infty} \leq \epsilon_v$.

Linear Programming Apprenticeship Learning
The nondeterministic nature of mixed policies may not be desirable. Later work by Syed et al. [34] demonstrated that stationary policies can be obtained from Algorithm 3 by finding the optimal policies through linear programming, showing up to two orders of magnitude improvement in running time. The resultant linear program to find the maximum margin is
$$\max_{\nu, \mu} \; \nu \quad \text{s.t.} \quad \nu \leq (F\mu)_i - v_i^{\pi_E}, \quad i = 1, \dots, d, \qquad \mu \in G,$$
with resulting stationary policy
$$\pi(a \mid s) = \frac{\mu(s, a)}{\sum_{a' \in A} \mu(s, a')}.$$
Empirical estimates for occupancy values can be obtained from given trajectories by using Equation (9).
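Recovering a stationary policy from an occupancy measure is a simple normalisation. The sketch below assumes the usual construction in which each state's action probabilities are its occupancies divided by their sum, and adds a uniform fallback for unvisited states, which is an illustrative choice rather than part of the reviewed method. The occupancy table is hypothetical.

```python
from typing import List


def policy_from_occupancy(mu: List[List[float]]) -> List[List[float]]:
    """pi(a | s) = mu(s, a) / sum_a' mu(s, a'), uniform where a state is unvisited."""
    policy = []
    for row in mu:
        total = sum(row)
        n_actions = len(row)
        if total > 0.0:
            policy.append([x / total for x in row])
        else:
            # Assumption: fall back to a uniform choice in unvisited states.
            policy.append([1.0 / n_actions] * n_actions)
    return policy


mu = [[3.0, 1.0], [0.0, 0.0]]   # hypothetical occupancy table mu[s][a]
pi = policy_from_occupancy(mu)
```

For the table above, state 0 yields probabilities 0.75 and 0.25, and the unvisited state 1 falls back to the uniform choice.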
The last three methods we reviewed [34,42,43] are instances of AL. Although the objective in AL is to learn a policy that resembles the expert's, as opposed to learning the reward function, AL and IRL are largely overlapping and share core techniques, specifically in the two main tasks of policy estimation from observed behaviour (goal of AL), and the inference of rewards from a given policy (goal of IRL). Knowing an agent's policy may also be considered a form of ToM, as it is internal to the agent and reflects their intentions/modus operandi. The use of these two core tasks in ToM may be better understood through a simile with the "theory theory" and the simulation theory accounts of mentalising. The theory theory perspective assumes that we make inferences about hidden mental states through logic and abstraction, as we do in the natural sciences for the unobservable causal phenomena of the world. This is similar to the reward learning approach. In the "simulation theory" account, mental states are represented through perspective-taking, by using our own cognitive resources to simulate another's [47,48]. This is similar to the AL approach (e.g., [42,49]), as well as the less-sophisticated behavioural cloning (BC), whereby agents learn state-action mappings through supervised learning (with the limitation in applicability to observed state-action pairs only). The simulation account can be extended to IRL. For example, in [50] the observer models a human's reward function by proposing counterfactual scenarios.

Maximum Margin Planning
The use of the (estimated) state-action occupancy measure from demonstrations is further extended by Ratliff et al. [49], whose goal is to find a reward function under which the optimal policy is similar to the expert's. To do so, they cast the problem as structured prediction, relying on a loss-augmented reward function $R_l = \theta^T F_j + l_j^T$, where $l_j \in \mathbb{R}_+^{|S \times A|}$ is a loss vector defining the cost of deviating from the expert policy for every state-action pair.
Each demonstration may be generated in a different MDP, and is given by a tuple containing its feature matrix $F_j$, expert occupancy $\mu_j$, and loss vector $l_j$, alongside the corresponding MDP\R. From this, they introduce an objective function measuring the difference in performance between a policy with occupancy $\mu_{\theta} = \arg\max_{\mu \in G_j} (\theta^T F_j + l_j^T)\mu$ (optimal under the loss-augmented reward function), and the given demonstration $\mu_j$ (without loss-augmentation), based on a quadratic programming formulation (in accordance with the hinge loss form)
$$c_q(\theta) = \frac{1}{m} \sum_{j=1}^{m} \beta_j \left( \max_{\mu \in G_j} (\theta^T F_j + l_j^T)\mu - \theta^T F_j \mu_j \right)^q + \frac{\lambda}{2} \|\theta\|_2^2.$$
The form of the loss function imposes a margin by which the solution obtained is better than any other possible solutions. Using the occupancy measures from the expert policy has the effect of making rewards for high occupancy state-action pairs larger, which in turn encourages similarity between the policies, as well as discouraging degenerate solutions [51]. The optimisation of the weights $\theta$ is performed through gradient descent by using the subgradient of the objective function
$$g_{\theta} = \frac{1}{m} \sum_{j=1}^{m} q \beta_j \left( (\theta^T F_j + l_j^T)\mu_{\theta} - \theta^T F_j \mu_j \right)^{q-1} F_j \left(\mu_{\theta} - \mu_j\right) + \lambda \theta,$$
where $\beta_j$ is a data-dependent normalisation coefficient, $q \in \{1, 2\}$ is a choice of slack penalty type (for $\ell_1$- and $\ell_2$-loss, respectively), and $\lambda \geq 0$ is a regularisation coefficient. Boularias and Chaib-draa suggest the use of a loss vector $l(s, a) = 1 - \mu_j(s, a)$ [35]. The (near-)optimal policy $\pi_{\theta}$ is obtained with RL methods in the particular MDP\R under the loss-augmented reward function, and provides the occupancy $\mu_{\theta}$. Optionally, the weights $\theta$ can be projected on to additional problem-specific constraints after every update. The weights can be learned offline from a training set $\mathcal{D}$, or online for each observation $\mathcal{D}_j$. The occupancies $\mu_j$ can be obtained equivalently from trajectories $\tau_j$ when the demonstrations are in said format instead. Prior knowledge can be included in the form of further constraints on $\theta$, such as by explicitly penalising certain features, or regularising the learning procedure around a prior belief about $\theta$ instead of approximately 0. Additionally, it can be included through the loss vector $l$ if certain state-action pairs are known to be poor choices [49].
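One subgradient update of the weights can be sketched for the simplest case. The example assumes a single demonstration, the linear slack penalty (q = 1), a unit normalisation coefficient, and that the occupancy `mu_star` of the loss-augmented optimal policy has already been produced by an MDP solver. The feature matrix, occupancies, step size, and function names are hypothetical.

```python
from typing import List


def mmp_subgradient(theta: List[float], F: List[List[float]],
                    mu_star: List[float], mu_expert: List[float],
                    lam: float) -> List[float]:
    """Subgradient of the regularised hinge objective for q = 1, beta = 1:
    F (mu_star - mu_expert) + lam * theta."""
    diff = [ms - me for ms, me in zip(mu_star, mu_expert)]
    return [
        sum(F[i][k] * diff[k] for k in range(len(diff))) + lam * theta[i]
        for i in range(len(theta))
    ]


def descent_step(theta: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain (sub)gradient descent update."""
    return [w - lr * g for w, g in zip(theta, grad)]


F = [[1.0, 0.0], [0.0, 1.0]]   # d x |S x A| feature matrix (identity here)
mu_star = [1.0, 0.0]           # occupancy of the loss-augmented optimal policy
mu_expert = [0.0, 1.0]         # occupancy estimated from the demonstration
theta = [0.0, 0.0]

grad = mmp_subgradient(theta, F, mu_star, mu_expert, lam=0.0)
theta_new = descent_step(theta, grad, lr=0.1)
```

The update lowers the weight of the feature favoured by the current loss-augmented policy and raises the weight of the feature the expert visits, which is how rewards for high-occupancy expert state-action pairs grow over iterations.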
When there is no single reward function that maximises the margin, such as when the agent's behaviour is suboptimal or no data is available for parts of the state space, this method is limited [52].
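As a rough illustration of the subgradient update used in maximum margin planning, the following sketch performs a single step for the $q = 1$ case, assuming the subgradient takes the form $\beta_j F_j(\mu_\theta - \mu_j) + \lambda\theta$ for a single demonstration. The function names, the regularisation weight `lam`, and the step size `lr` are assumptions made for illustration, and the loss-augmented occupancy `mu_theta` is assumed to come from an external planner or RL solver.

```python
def loss_vector(mu_expert):
    """Loss l(s, a) = 1 - mu_j(s, a), the choice suggested in [35]."""
    return [1.0 - m for m in mu_expert]


def mat_vec(F, v):
    """Multiply a d x n feature matrix F (list of rows) by a length-n vector v."""
    return [sum(f * x for f, x in zip(row, v)) for row in F]


def mmp_subgradient_step(theta, F, mu_theta, mu_expert, beta=1.0, lam=0.1, lr=0.5):
    """One subgradient descent step for a single demonstration (q = 1).

    mu_theta is the occupancy of the policy that is optimal under the
    loss-augmented reward; mu_expert is the demonstrated occupancy.
    lam and lr are hypothetical hyperparameters.
    """
    # Difference in feature expectations between the two occupancies.
    diff = [a - b for a, b in zip(mu_theta, mu_expert)]
    g = mat_vec(F, diff)
    grad = [beta * gi + lam * ti for gi, ti in zip(g, theta)]
    # Descent step: features the expert visits more often gain weight.
    return [ti - lr * gi for ti, gi in zip(theta, grad)]


# Toy example: three state-action pairs, two features.
F = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
mu_expert = [1.0, 0.0, 0.0]   # expert always uses the first pair
mu_theta = [0.0, 1.0, 0.0]    # current loss-augmented policy uses the second
theta_new = mmp_subgradient_step([0.0, 0.0], F, mu_theta, mu_expert)
```

In this toy case the weight of the feature visited by the expert increases and the weight of the feature visited only by the loss-augmented policy decreases, which is the intended effect of the update.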

Policy Matching
Neu and Szepesvári [53] set out to find a $\theta$ for which $\pi_\theta$ matches the expert policy $\pi_E$, or rather the empirical occupancy estimate $\hat{\mu}_{\pi_E}$ thereof, through gradient descent on a loss function (similar to [49]). Thus, their performance measure is the difference between the proposed and expert policies, as opposed to the proposed policy's performance with respect to the original reward function as is the case in [42]. They select the squared loss function and assert that it can be approximated from the empirical estimates obtained from the demonstrations. Obtaining the occupancies $\mu_\theta$ requires a policy $\pi_\theta$. They employ a Boltzmann policy (Section 2.3), acting as a smooth map from the parameter space to the policy space. As in [49], the parameterised policy $\pi_\theta$ is trained through an MDP solver to be (near-)optimal under $R$ in the MDP\R. The reward function parameters are obtained through gradient descent on the loss function. Others have taken a similar approach of matching the expert's state occupancy through gradient techniques [54].

Probabilistic Methods
Probabilistic methods cast the IRL problem as a (Bayesian) inference problem, aiming to find an estimate for $R$ that best explains the given demonstrations (interpreted as noisy observations of the expert's policy) [55]. The agent's action choice probabilities $\Pr(a|s, R_E)$ are modelled as a Boltzmann distribution (Section 2.3) (or alternatively made deterministic as $\arg\max_{a \in A} Q^*(s, a; R_E)$). This allows us to define the likelihood of a pair $(s, a) \in S \times A$ under any given $R$, with temperature $\alpha$ and normalising constant $Z$, as

$$\Pr((s, a)|R) = \frac{1}{Z} \exp\left(\alpha Q^*(s, a; R)\right).$$

Under an assumption of independence between state-action pairs in a given demonstration (based on a stationarity assumption for the agent's policy), the likelihood of the demonstration is

$$\Pr(\tau|R) = \prod_{(s_t, a_t) \in \tau} \Pr((s_t, a_t)|R).$$

Combining the likelihood from the demonstrations with a given prior over rewards $P(R)$, we obtain the posterior

$$\Pr(R|\tau) = \frac{\Pr(\tau|R) P(R)}{\Pr(\tau)} \propto \Pr(\tau|R) P(R).$$

Two solution methods are used to find this posterior in the literature: gradient-based methods are used to directly find an (approximate) maximum a posteriori estimate for $R$, and Markov chain Monte Carlo (MCMC) methods to approximate the entire posterior distribution of $R$ [55]; more recently, variational inference-based methods have been proposed to this end, e.g., [56][57][58].
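To make the likelihood and posterior concrete, the sketch below evaluates the Boltzmann log-likelihood of a short demonstration and adds a log-prior to obtain an unnormalised log-posterior. It assumes the optimal Q-values for the candidate reward are already available as a table `Q[s][a]`, and that the likelihood is normalised per state over actions; the function names and the temperature argument `alpha` are illustrative.

```python
import math


def boltzmann_policy(q_row, alpha=1.0):
    """Action probabilities proportional to exp(alpha * Q(s, a)) for one state."""
    m = max(q_row)  # subtract the maximum for numerical stability
    weights = [math.exp(alpha * (q - m)) for q in q_row]
    z = sum(weights)
    return [w / z for w in weights]


def log_likelihood(demo, Q, alpha=1.0):
    """Sum of log action-choice probabilities over (state, action) pairs,
    assuming independence between pairs."""
    return sum(math.log(boltzmann_policy(Q[s], alpha)[a]) for s, a in demo)


def log_posterior(demo, Q, log_prior, alpha=1.0):
    """Unnormalised log-posterior: log-likelihood plus log-prior of the reward."""
    return log_likelihood(demo, Q, alpha) + log_prior


# Hypothetical Q-table for two states and two actions.
Q = [[1.0, 0.0],
     [0.0, 0.0]]
demo = [(0, 0), (1, 1)]
ll = log_likelihood(demo, Q)
```

A larger `alpha` concentrates the action distribution on high-valued actions, which corresponds to the observer expecting the actor to be closer to optimal.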

Tree Traversal
Chajewska et al. [46] provide the first algorithm to treat the reward as a random variable. Of further interest to this review, they show the usefulness of their method for strategic interactions in the two-player game setting. They work with a game decision tree instead of an MDP (though the method can equivalently be applied in the MDP setting), allowing the observer to consider the actor's actions as well as their own and "nature"/chance decision nodes.
Similarly to [40], they use $\tau_E$ to fit linear constraints on $\Theta$, a space of coefficient values for a linear approximation of $R$. They set out to obtain a posterior distribution $q(\theta|\tau)$ over a constrained region of the parameter space $\Theta_C$, by conditioning a prior $p(\theta)$ on the evidence from the demonstrations $\Pr(\tau|\theta)$. The constrained region $\Theta_C$ is contained in $\Theta^* \subseteq [0, 1]^d$, the polytope defined by $V^{\pi^*} \geq V^{\pi}\ \forall \pi \in \Pi$. The prior $p(\theta)$ over $\Theta^*$ is obtained through density estimation on population reward function data (from many actors, as in e.g., [41]). Because $q(\theta|\tau)$ can be prohibitively complex to compute, the method approximates it through an MCMC procedure (Algorithm 4), specifically by using the Metropolis-Hastings (MH) algorithm over a quantisation of the convex set $\Theta_C$, with $p$ as the acceptance probability distribution. $\Theta_C$ is obtained by traversing the tree and assigning upper ($V_{hi}(s)$) and lower ($V_{lo}(s)$) bounds on the value of each node, $V_{lo}(s) \leq V(s) \leq V_{hi}(s)$, with the set of constraints $C$ built with constraints $V_{hi}(s) \geq V_{lo}(s')$ from each of the expert's decision nodes. We use $S(s)$ as shorthand for the subset of $S$ that is reachable from $s$, i.e., $S(s) = \{s' : T(s, a, s') > 0\ \forall a \in A\}$. The use of $\pi_E(s)$ denotes choices that are perceived as chance by the expert, including the observer's actions and nature's actions (passive dynamics). Although the original paper did not, we make use of the Bellman equations in our elucidation of the algorithm where applicable, for closer correspondence with IRL. The algorithms are equivalent if $\gamma = 1$ and $\phi(s) = 0$ for all $s$ that are not leaf nodes.

Policy Walk
Ramachandran and Amir [59] formally state the problem as Bayesian inference (hereafter Bayesian IRL). Their PolicyWalk MH algorithm (Algorithm 5) allows for domain knowledge to be incorporated in the prior, with the potential for further improvement in estimation accuracy.
Their approach models the likelihood of state-action pairs as the occupancy under the expert's policy $\Pr((s_t, a_t)|R) = \mu_E(s_t, a_t)$ with a Boltzmann distribution assumption (Section 2.3). Unlike its use in [53] (Section 3.1.1.6), here the normalisation in the likelihood has to be done over all $(s_t, a_t) \in \tau$, which may be intractable depending on the state space. Fortunately, because the algorithm only uses ratios of the densities, the normalising constant $Z$ can be discarded, and the resultant likelihood is

$$\Pr(\tau|R) \propto \exp\Big(\alpha \sum_{(s_t, a_t) \in \tau} Q^*(s_t, a_t; R)\Big);$$

hence, the posterior is

$$p(R|\tau) \propto \exp\Big(\alpha \sum_{(s_t, a_t) \in \tau} Q^*(s_t, a_t; R)\Big) \Pr(R),$$

where $p$ is used as the acceptance probability distribution in the MH procedure.

Algorithm 5: Policy Walk Algorithm [59]
Algorithm PolicyWalk($\tau$, $P$)
        Update $\tilde{\pi}$ wrt $\tilde{R}$
11    else
12      $\tilde{\pi} \leftarrow \pi$
      end
13    $(R, \pi) \leftarrow (\tilde{R}, \tilde{\pi})$ with probability $\min\{1, p(\tilde{R}, \tilde{\pi}) / p(R, \pi)\}$
    end
14  return $R$

For the prior $\Pr(R)$, they invoke the principle of maximum entropy to assume the rewards are independent and identically distributed. Three different prior distribution candidates are proposed:
• for prior-agnostic contexts, a uniform distribution over $[-R_{max}, R_{max}]^{|S|}$ or an improper prior $\Pr(R) = 1$ over $\mathbb{R}^{|S|}$;
• for real-world MDPs with parsimonious reward structures, a Gaussian or Laplacian prior (over $\mathbb{R}^{|S|}$); and
• for planning-type problems, where most states can be expected to have low or negative rewards, with some having high rewards, a Beta distribution (over $\mathbb{R}^{|S|}$).
An important distinction is that the PolicyWalk algorithm estimates the value of R at each of the states directly, and therefore it does not make use of features or a (linear) function approximator for R as in the previously discussed methods. Although they show substantial improvements over the method in [40] for |S| ≤ 1000, larger (or infinite) state spaces may not be feasible to learn over with this approach. The loss functions used in their evaluation experiments are the 2-norm between the estimated and true reward functions R, and the 1-norm for the estimated policy π evaluated under the true R. Bayesian IRL methods are limited in scalability, as they have to define probabilities and/or calculate value or quality function values for every point in the state-action space. Furthermore, as in other previous methods, large computational overheads are associated with the need to optimise the policy in the MDP.
The authors posit that the max-margin method from [40] is a special case of Bayesian IRL where the obtained R is the MAP estimate with a Laplacian prior. This argument is later extended by Choi and Kim [60] (Section 3.1.4). The solution space for this method is the same as in [53], as per the analysis in [61].
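The sampling procedure can be illustrated with a simplified Metropolis-Hastings random walk over a grid of reward vectors. This is a sketch in the spirit of PolicyWalk rather than a reproduction of Algorithm 5: it recomputes the optimal Q-values by value iteration for every proposal instead of updating a policy incrementally, it uses a per-state Boltzmann likelihood (an assumption of this sketch) with a uniform prior on $[-R_{max}, R_{max}]^{|S|}$, and the three-state chain MDP and all names are hypothetical.

```python
import math
import random

# Hypothetical chain MDP: three states, actions move left (-1) or right (+1).
N_STATES, ACTIONS, GAMMA = 3, (-1, +1), 0.9


def step(s, a):
    """Deterministic transition, clipped at the ends of the chain."""
    return min(max(s + a, 0), N_STATES - 1)


def q_star(R, iters=100):
    """Optimal Q-values for state rewards R, via value iteration."""
    V = [0.0] * N_STATES
    for _ in range(iters):
        V = [R[s] + GAMMA * max(V[step(s, a)] for a in ACTIONS) for s in range(N_STATES)]
    return [[R[s] + GAMMA * V[step(s, a)] for a in ACTIONS] for s in range(N_STATES)]


def log_posterior(R, demo, alpha, r_max):
    """Unnormalised log-posterior: Boltzmann log-likelihood + uniform log-prior."""
    if any(abs(r) > r_max + 1e-9 for r in R):
        return -math.inf  # outside the support of the uniform prior
    Q = q_star(R)
    total = 0.0
    for s, a in demo:
        row = [alpha * q for q in Q[s]]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += row[ACTIONS.index(a)] - log_z
    return total


def policy_walk(demo, alpha=2.0, delta=0.25, r_max=1.0, n_iter=300, seed=0):
    """Random walk over the reward grid with Metropolis-Hastings acceptance."""
    rng = random.Random(seed)
    R = [0.0] * N_STATES
    lp = log_posterior(R, demo, alpha, r_max)
    samples = []
    for _ in range(n_iter):
        # Propose a neighbouring grid point by perturbing one coordinate.
        R_new = list(R)
        R_new[rng.randrange(N_STATES)] += rng.choice((-delta, delta))
        lp_new = log_posterior(R_new, demo, alpha, r_max)
        # Accept with probability min{1, p(R_new) / p(R)}.
        if math.log(rng.random() + 1e-300) < lp_new - lp:
            R, lp = R_new, lp_new
        samples.append(list(R))
    return samples


demo = [(0, +1), (1, +1), (2, +1)]  # the actor always moves right
samples = policy_walk(demo)
posterior_mean = [sum(r[s] for r in samples) / len(samples) for s in range(N_STATES)]
```

Because the reward is estimated separately at every state and each proposal requires solving the MDP, the cost of this procedure grows quickly with $|S|$, which reflects the scalability limitation noted above.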

Structured Generalisation
Rothkopf and Dimitrakakis [62] contribute a principled generalisation of the Bayesian IRL approach (Section 3.1.2.2) with structured priors on rewards and policies. Their statistical model comprises a controlled Markov process $\nu = \{S, A, T\}$ and a discount factor $\gamma$ (or priors thereof), a prior for a stochastic reward function $\Pr(\rho|\nu)$ over the space of reward functions $\mathcal{R}$, and a prior for the policy $\Pr(\pi|\rho_E, \nu)$ over the policy space $\Pi$, which together define a joint prior over rewards and policies. With this statistical model and the usual Boltzmann action choice probability assumption (Section 2.3), they derive two MH algorithms: direct sampling from the joint posterior distribution $\Pr(\pi, \rho|\tau)$, and a hybrid Gibbs sampler procedure with a reward sequence augmentation of the model with $\Pr(r_t|s_t, a_t, \rho_E)$. These algorithms do not require the demonstrations to be optimal, and are capable of finding policies that outperform the agent's actual policy with respect to its reward function, as well as revealing policies that perform better than those recovered with previous IRL methods.

Maximum Entropy Methods
Bayesian IRL methods compute the likelihood as the product of the probabilities of each action choice in a trajectory. This fails to account for global interdependencies of choices along a trajectory. Maximum entropy methods focus on modelling the likelihood of entire trajectories $\Pr(\tau|R)$ as a whole, as opposed to individual action choices. They resolve the ambiguity between trajectories that match the same feature counts by maximising the entropy of the distribution subject to that feature-matching constraint.

Maximum Entropy IRL
Ziebart et al. [52] provided a definitive method by which to discriminate between reward function candidates by focusing on the characterisation of the likelihood Pr(τ|θ). Although previous probabilistic approaches worked with distributions over policies, inevitably focusing on local action choices [53,59], this method is based on a distribution over entire trajectories that is normalised globally. Multiple trajectories with the same feature counts may obtain the same rewards under a given R θ . Through the principle of maximum entropy, they obtain a distribution that removes any preferences for trajectories beyond the requirement of matching feature counts, thereby resolving the ambiguity. It attributes equal probabilities to trajectories with equal rewards, and exponentially higher probabilities to trajectories with higher rewards, and does so globally over the trajectories (as opposed to locally at the action level as is the case in [49,59]).
For observed trajectory realisations $\tau_E^{j=1:m}$, this is equivalent to maximising the likelihood of the observed trajectories under a maximum entropy exponential family of distributions $(\exp(\theta^T \Phi(\tau)))_{\theta \in \Theta}$. Thus, learning from observation entails finding $\theta^* = \arg\max_\theta L(\theta)$, where

$$L(\theta) = \frac{1}{m} \sum_{j=1}^{m} \log \Pr(\tau_j|\theta, T).$$

With a partition function $Z$ assumed constant for all $(s, a, s')$, and assuming the effects of transition dynamics on behaviour are negligible, the distribution of interest for nondeterministic MDPs (which extends trivially to deterministic MDPs) is

$$\Pr(\tau|\theta, T) \approx \frac{\exp(\theta^T \Phi(\tau))}{Z(\theta, T)} \prod_{(s, a, s') \in \tau} T(s, a, s').$$

The gradient of $L(\theta)$ is the difference between the average feature counts from observed trajectories and the expected feature counts over all trajectories in the MDP. The latter can be expressed equivalently taking the expectation over states in the MDP instead, requiring the state visitation frequencies $\mu_\theta(s)$:

$$\nabla_\theta L(\theta) = \frac{1}{m} \sum_{j=1}^{m} \Phi(\tau_j) - \sum_{\tau} \Pr(\tau|\theta, T) \Phi(\tau) = \frac{1}{m} \sum_{j=1}^{m} \Phi(\tau_j) - \sum_{s \in S} \mu_\theta(s) \phi(s).$$

Thus, for the optimal $\theta$, the feature expectations over the MDP match the empirical feature expectations from the observed trajectories. The state visitation frequencies $\mu_\theta(s)$ for an infinite time horizon can be approximated for a large time horizon $H$ by using a sample-based algorithm (Algorithm 6). The above is equivalent to calculating the feature expectations $v^\pi$ with $\gamma = 1$; thus, here too we try to minimise the difference in trajectory values between observed trajectories and trajectories from a parameterised policy, but avoid actually computing the policy in favour of using state occupancies obtained with Algorithm 6.

This approach is resilient to expert behaviour being suboptimal (cf. Section 5.1), as well as the stochasticity of the environment. Although the algorithm is efficient by using all paths below a fixed length, in their experiments with taxi driver path data Ziebart et al. [52] work within a smaller, fixed class of reasonable trajectories resulting in significant improvements in speed.
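The gradient computation can be sketched with a backward pass that accumulates partition function values, followed by a forward pass that accumulates expected state visitation frequencies. This is a minimal sketch under stated assumptions rather than a reproduction of Algorithm 6: rewards are linear in state features, the stationary policy from the last backward iteration is used for the forward pass, and the data layout (`T[s][a]` as a list of `(next_state, probability)` pairs) and all names are hypothetical.

```python
import math


def maxent_gradient(theta, phi, T, p0, emp_features, horizon):
    """Return (gradient, state visitation frequencies) for MaxEnt IRL.

    phi[s] is the feature vector of state s, T[s][a] a list of
    (next_state, probability) pairs, p0 the initial state distribution,
    and emp_features the average feature counts of the demonstrations.
    """
    n = len(phi)
    r = [sum(t * f for t, f in zip(theta, phi[s])) for s in range(n)]

    # Backward pass: state and action partition values.
    z_s = [1.0] * n
    for _ in range(horizon):
        z_a = [[math.exp(r[s]) * sum(p * z_s[s2] for s2, p in T[s][a])
                for a in range(len(T[s]))] for s in range(n)]
        z_s = [sum(z_a[s]) for s in range(n)]
    policy = [[za / z_s[s] for za in z_a[s]] for s in range(n)]

    # Forward pass: propagate the state distribution and accumulate visits.
    d_t = list(p0)
    mu = [0.0] * n
    for _ in range(horizon):
        mu = [m + d for m, d in zip(mu, d_t)]
        nxt = [0.0] * n
        for s in range(n):
            for a, pa in enumerate(policy[s]):
                for s2, p in T[s][a]:
                    nxt[s2] += d_t[s] * pa * p
        d_t = nxt

    # Gradient: empirical feature counts minus expected feature counts.
    expected = [sum(mu[s] * phi[s][k] for s in range(n)) for k in range(len(theta))]
    grad = [e - x for e, x in zip(emp_features, expected)]
    return grad, mu


# Two states with indicator features; action 0 stays, action 1 switches.
phi = [[1.0, 0.0], [0.0, 1.0]]
T = [[[(0, 1.0)], [(1, 1.0)]],
     [[(1, 1.0)], [(0, 1.0)]]]
grad, mu = maxent_gradient([0.0, 0.0], phi, T, p0=[1.0, 0.0],
                           emp_features=[2.0, 0.0], horizon=2)
```

With zero weights the induced policy is uniform, so the expected visitation is spread across both states, and the gradient points towards increasing the weight of the feature that the demonstrations visit more often.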

Maximum Causal Entropy IRL
Transition dynamics play an important role in MDPs. Ziebart et al. [39] frame the agent's decision making in the MDP as two interacting stochastic processes: the environment's transition dynamics T (assumed to be known-given or estimated from data), and the agent's policy π (unknown in the IRL problem). For each time step t in the sequence (1, . . . , H), the state and action values are random variables S t , A t . These can be collected into the random sequences S 1:H , A 1:H , respectively, and they determine the interaction between the processes. When considered together, they form a trajectory τ.
In the MaxEnt approach [52], information is lost by failing to consider causality (time direction) in the trajectory distribution $\Pr(\tau|\theta, T) \equiv \Pr(S_{1:H}, A_{1:H})$. This is addressed in [39,63] by proposing to use an alternative way to decompose this joint probability by using causally conditioned probabilities,

$$\Pr(A_{1:H} \| S_{1:H}) = \prod_{t=1}^{H} \Pr(A_t|S_{1:t}, A_{1:t-1}),$$

wherein future state variable outcomes have no effect on preceding variables. The state transition dynamics follow the Markov property, and thus have a causally conditioned probability $T(S_{1:H} \| A_{1:H-1})$. An agent's policy can also be modelled as a causally conditioned probability distribution, though the factors in the product of probabilities may not be Markovian, so $\pi(A_{1:H} \| S_{1:H}) = \prod_{t=1}^{H} \pi(A_t|A_{1:t-1}, S_{1:t})$. The goal in this framing of the IRL problem is to find the maximum causal entropy (MaxCausalEnt) policy estimator

$$\hat{\pi}^* = \arg\max_{\pi} H(A_{1:H} \| S_{1:H}),$$

such that the expected feature counts under $\pi$ and $T$ match the empirical feature counts of the demonstrations, and $\sum_{A_t} \Pr(A_t|S_{1:t}, A_{1:t-1}) = 1\ \forall S_{1:t}, A_{1:t-1}$.
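Assuming, as is usually the case, that the features decompose linearly in time makes the optimisation much simpler. The distribution (first-order Markovian policy) that optimises this constrained problem is a Boltzmann policy (Section 2.3) and takes the recursive form

$$\pi_\theta(a_t|s_t) = \exp\left(Q_\theta(s_t, a_t) - V_\theta(s_t)\right),$$
$$Q_\theta(s_t, a_t) = \theta^T \phi(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim T}\left[V_\theta(s_{t+1})\right], \qquad V_\theta(s_t) = \log \sum_{a \in A} \exp Q_\theta(s_t, a).$$

An estimate of $\theta$ can be obtained through optimisation on the gradient computed through the calculation procedure in Algorithm 7. MaxCausalEnt was later expanded to the infinite time horizon setting [65,66].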
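The recursive form of the MaxCausalEnt policy corresponds to a "soft" value iteration in which the maximum over actions is replaced by a log-sum-exp. The following finite-horizon sketch assumes a tabular reward `R[s][a]` (for example, $\theta^T \phi(s, a)$ evaluated in advance) and transitions stored as lists of `(next_state, probability)` pairs; the names and data layout are assumptions for illustration, and this is not a reproduction of Algorithm 7.

```python
import math


def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(R, T, horizon):
    """Finite-horizon soft Bellman recursion.

    Returns (policies, V), where policies[t][s][a] is the Boltzmann policy
    at time step t and V holds the soft values at the first time step.
    """
    n = len(R)
    V = [0.0] * n  # soft value after the final step
    policies = []
    for _ in range(horizon):  # iterate backwards in time
        Q = [[R[s][a] + sum(p * V[s2] for s2, p in T[s][a])
              for a in range(len(R[s]))] for s in range(n)]
        V = [logsumexp(Q[s]) for s in range(n)]
        policies.append([[math.exp(q - V[s]) for q in Q[s]] for s in range(n)])
    policies.reverse()  # index 0 now corresponds to the first time step
    return policies, V


# Single state with two self-looping actions; the first is worth log(3) more.
R = [[math.log(3.0), 0.0]]
T = [[[(0, 1.0)], [(0, 1.0)]]]
policies, V = soft_value_iteration(R, T, horizon=2)
```

In this example the first action is chosen with probability 0.75 at every time step, since the action probabilities are proportional to the exponentiated soft Q-values.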

Extensions
Though the MaxEnt approach was groundbreaking and has been adopted as the de facto canonical model for IRL, it shares shortcomings with other previous methods, such as reliance on the feature map $\phi$ being given, and on a defined model $T$ of the environment's transition dynamics. Work that addresses these issues is discussed in Sections 3.2 and 4.1, respectively. Boularias et al. [68] provide a model-free method based on minimising the relative entropy (KL-divergence) between the empirical distribution of trajectories produced by a baseline policy and the distribution of demonstrated trajectories produced by a learned policy. With $p(\tau) = \Pr(\tau)$ defined over the space of possible trajectories, and $p_{\pi,T}(\tau) = \Pr(\tau|\pi, T)$ the probability of a trajectory under a policy and transition dynamics, the objective is

$$\min_{p} \sum_{\tau} p(\tau) \ln \frac{p(\tau)}{p_{\pi,T}(\tau)} \qquad (39)$$

with constraints requiring $p$ to be a valid probability distribution whose expected feature counts match the empirical feature counts of the demonstrations (up to a tolerance). The objective is minimised through stochastic gradient descent, and this method is capable of learning from small demonstration samples. More recently, Snoswell et al. [69] provide a model-free MaxEnt IRL method based on a unified view of the MaxEnt and relative entropy methods that is capable of handling trajectories of variable lengths (with time complexity linear in the longest trajectory length), state-dependent action spaces, and nonlinear reward characterisations (Section 3.2). An approach that is similar to MaxEnt IRL and extends to continuous time and continuous state and action spaces is presented in [70]. Others have explored the use of semisupervised techniques by including unsupervised trajectories in addition to expert trajectories in training [71]. Connections of MaxEnt IRL with GAN and energy-based models have been drawn [72].
The MaxCausalEnt IRL method has been improved by including both (labelled) successful and failed demonstrations [73], and by considering its performance degradation as a result of diverging transition dynamics models in the agent and observer [74]. Its connections to other methods from econometrics have been studied under a unified perspective [75].

MAP Inference Generalisation
We have seen ways to obtain the posterior reward distribution for given trajectories through Bayesian and maximum entropy methods. Choi and Kim [60] analyse how best to obtain point estimates for the reward function from the posterior. Although the posterior mean is commonly used because it minimises the mean squared error, this measure entails integrating over the entire space of reward functions, including those that are not consistent with observed behaviour. Motivated by this issue, the authors suggest the MAP estimate as a more robust alternative and introduce a gradient method by which to obtain MAP estimates of the reward function, based on the (sub)differentiability of the distribution. In an effort to unify previous methods under the Bayesian perspective, they demonstrate that most of the IRL methods can be alternatively viewed as finding the MAP estimate, because they work by maximising an objective (equivalent to the posterior in Bayesian terms) that is comprised of an assessment term measuring compatibility between the reward and the demonstrations (equivalent to the likelihood in Bayesian terms), and a regularisation term measuring preference over realisations of the reward function (equivalent to the prior in Bayesian terms).
The ability to use prior knowledge to model an agent's reward function is an important point from the ToM perspective. Actions alone do not provide enough information about the desires driving them, and there is great advantage in utilising information from other sources, such as task context or the type of actor [22], which can be incorporated in the form of priors in probabilistic approaches. For example, in the algorithm in [59] (Section 3.1.2.2) the observer has two "preconceptions" of the actor: the temperature α, representing how capable of choosing high-valued actions the observer expects the actor to be, and the prior Pr(R), a distribution over the reward functions that may be chosen based on the type of actor or environment. Moreover, working with uncertainties may be useful to the observer. In an interactive setting, they may choose to act more conservatively when there are high uncertainties, or act to elicit more information from the actor to improve the confidence in the reward function estimate.

Linearly-Solvable MDPs
A severe limitation of the approaches we have reviewed thus far is their requirement to solve the forward MDP on each iteration, which comes at a high computational cost. Dvijotham and Todorov [76] present a method based on linearly solvable MDPs (LMDP) [77], which was the first to not require solving the forward problem. LMDPs provide an approximation to MDPs that enables finding solutions faster at a small cost in accuracy. This is achieved through a decomposition of the dynamics into the environment's passive dynamics $\Pr(s'|s)$ (not to be confused with transition dynamics $\Pr(s'|s, a)$), assumed to be given, and the control dynamics $\pi(s'|s)$ (not to be confused with the policy $\pi(a|s)$, though they are closely related).
The reward function (here, cost to be minimised) $R$ comprises a state term $r(s) \geq 0$ (to be inferred) and a control term that is the KL-divergence between the control dynamics and the passive dynamics (in order for the KL divergence to be defined, it is required that $\pi(s'|s) = 0$ when $\Pr(s'|s) = 0$, a condition that is imposed), $R(s, \pi(\cdot|s)) = r(s) + D_{KL}(\pi(\cdot|s) \| \Pr(\cdot|s))$.
When the demonstration sample size is larger than the number of states, the method can recover the value function analytically, as the maximum likelihood estimate (MLE) obtained by optimising an unconstrained, convex function, which is uniquely defined. The policy and reward function are subsequently recovered from the value function estimate through the desirability function $z(s) = \exp(-V(s))$. When the size of the demonstration sample is smaller than the number of states, the (negative) likelihood can be optimised with respect to $z(s)$ instead, although the resulting function is nonconvex and its optimisation is slower and susceptible to local minima. The value function can be represented as a look-up table or approximated as a linear function of features. Additionally, they suggest a method to automatically initialise and adapt the features in continuous space, employing Gaussian radial basis function kernels. A further potential advantage of this method is that it does not require state-action trajectories $(s, a)$, operating over state transitions $(s, s')$ instead.
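Under passive dynamics, $\Pr(\tau|s_0) = \prod_{t=1}^{H} \Pr(s_t|s_{t-1})$ is the probability of a trajectory. For the same trajectory to occur when the control dynamics are applied, the probability is

$$\Pr_\pi(\tau|s_0) = \prod_{t=1}^{H} \pi(s_t|s_{t-1}),$$

which, for the optimal control dynamics, is proportional to $\Pr(\tau|s_0) \exp\big(-\sum_{t=1}^{H} r(s_t)\big)$. Note the similarity with MaxEnt IRL. Under uniform passive dynamics, MaxEnt IRL is an equivalent approach for LMDP.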
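The recovery of the control dynamics and state costs from a value function estimate can be sketched as follows. The sketch assumes the standard LMDP relations $z(s) = \exp(-V(s))$ and $\pi(s'|s) \propto \Pr(s'|s) z(s')$, and, for the state cost, a first-exit formulation in which $z(s) = \exp(-r(s)) \sum_{s'} \Pr(s'|s) z(s')$; these relations, the function names, and the toy passive dynamics are assumptions of this illustration rather than details taken from [76].

```python
import math


def lmdp_control(passive, V):
    """Optimal control dynamics pi(s'|s) proportional to Pr(s'|s) * z(s')."""
    z = [math.exp(-v) for v in V]
    control = []
    for row in passive:
        weights = [p * zj for p, zj in zip(row, z)]
        g = sum(weights)
        control.append([w / g for w in weights])
    return control


def lmdp_state_cost(passive, V):
    """State cost r(s) = V(s) + log sum_s' Pr(s'|s) z(s') (first-exit assumption)."""
    z = [math.exp(-v) for v in V]
    return [v + math.log(sum(p * zj for p, zj in zip(row, z)))
            for v, row in zip(V, passive)]


# Hypothetical passive dynamics on a three-state chain; state 2 is absorbing.
passive = [[0.5, 0.5, 0.0],
           [0.0, 0.5, 0.5],
           [0.0, 0.0, 1.0]]
V = [2.0, 1.0, 0.0]  # lower cost-to-go towards the right
control = lmdp_control(passive, V)
r = lmdp_state_cost(passive, V)
```

Transitions that are impossible under the passive dynamics remain impossible under the control dynamics, which keeps the KL-divergence term well defined, while transitions towards states with lower cost-to-go are made more likely.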

Direct Methods
Direct methods take a more analytical approach to solving the IRL problem, exploiting the algebraic structure of the problem definition. Two classification methods proposed by Klein et al. [78,79] address the important limitation in previous work of needing to solve the MDP at every iteration. Orthogonally, two policy search methods operate through direct loss minimisation [80] and policy gradient minimisation [81].

Structured Classification-Based IRL
In Klein et al. [78], a multinomial classifier with output labels for each action $a \in A$ is trained to yield a classification score given the states. The critical insight is that the classification score $q(s, a)$ can be interpreted as a proxy for the $Q(s, a)$ function, assigning a value to each state-action pair. This additionally affords a policy approximator $\pi_C(s) = \arg\max_a q(s, a)$ (Equation (3)). The training dataset comes from expert trajectories $D_C = \{(s_t, a_t = \pi_E(s_t))_t\}$.

Cascaded Supervised IRL
Subsequent work by Klein et al. [79] retrieves a reward function estimate by chaining two generic supervised learning steps. First, a multinomial classification step yields a Q-function surrogate $q$, as in [78]. If the transition dynamics for the environment are known, a reward function can be obtained directly based on the Bellman equation from this classifier:

$$R(s, a) = q(s, a) - \gamma \sum_{s'} T(s, a, s')\, q(s', \pi_C(s')).$$

A key contribution of this work is removing the requirement of knowing the transition dynamics by approximating $R$ through regression, as the second step in the process. Although the regressors $(s, a)$ are provided, this requires a response variable $\hat{r}$, obtained from the Bellman equation with

$$\hat{r}_t = q(s_t, a_t) - \gamma\, q(s_{t+1}, \pi_C(s_{t+1})).$$

The resultant dataset is $D_R = \{(s_t, a_t), \hat{r}_t\}_t$. However, samples for state-actions that differ from the expert's $(s_k, a \neq a_k)$ are needed to reduce the regression error. The authors address this with a synthetic augmentation of the regression dataset with artificial samples $((s_t = s_k, a), r_{lo})_{t, \forall a \neq \pi_E(s_t)}$. The reward for these samples is set to ensure it is always lower than that of the expert's samples: $r_{lo} = \min_k \hat{r}_k - 1$.
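The construction of the regression dataset can be sketched as follows, assuming the classifier scores are available as a table `q[s][a]` and that the greedy classifier policy is used in the Bellman backup (so that $q(s', \pi_C(s'))$ is the maximum score in the next state). The function name, the data layout, and the toy scores are hypothetical, and the regression step itself is omitted.

```python
def csi_regression_data(transitions, q, n_actions, gamma=0.9):
    """Build ((state, action), reward_target) samples from expert transitions.

    transitions is a list of (s, a, s_next) tuples from the expert and
    q[s][a] is the classification score used as a Q-function surrogate.
    """
    # Response variable from the sampled Bellman equation.
    targets = [q[s][a] - gamma * max(q[s_next]) for s, a, s_next in transitions]
    # Artificial reward for non-expert actions, below every expert target.
    r_lo = min(targets) - 1.0
    data = [((s, a), r) for (s, a, _), r in zip(transitions, targets)]
    data += [((s, b), r_lo)
             for s, a, _ in transitions
             for b in range(n_actions) if b != a]
    return data


# Hypothetical scores for two states and two actions.
q = [[1.0, 0.0],
     [2.0, 0.0]]
transitions = [(0, 0, 1), (1, 0, 1)]
data = csi_regression_data(transitions, q, n_actions=2, gamma=0.5)
```

Any off-the-shelf regressor could then be fitted to `data` to obtain a reward estimate that generalises beyond the demonstrated state-action pairs.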

Empirical Q-Space Estimation
Melo et al. [61] provide analytical solutions to the IRL problem through constraints imposed by the policy observations, which can be optimal, perturbed, or incomplete. The so-called inverse Bellman equation shows a one-to-one relationship between Q-functions and rewards. If the transition dynamics are known, all we need to obtain a valid $R$ is the Q-function, because the optimal policy is assumed to be either deterministic (Equation (3)) or Boltzmann (Equation (11)). Given the demonstrations and a prior on policies we can obtain an empirical Bayesian estimate of the policy $\hat{\pi}(s, a)$. If the optimal policy is known for a given state, we have $Q(s, \cdot) = 0$ for actions that are suboptimal and uniform across the optimal actions. If the optimal policy is noisy, $Q(s, a) = \frac{\log(\pi(s, a))}{\alpha} + V(s)$ and we can set $V$ arbitrarily. If no information is available for the policy at a given state, invoking the advantage function $A(s, a) = Q(s, a) - V(s)$ we have multiple degrees of freedom: arbitrary $V(s)$, and $A(s, \cdot)$ constrained to be $\leq 0$ and have at least one zero-valued element because every state has at least one optimal action.
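The two relationships used here, recovering Q-values from a noisy (Boltzmann) policy and recovering rewards from Q-values, can be sketched as follows. The sketch assumes known transition dynamics stored as lists of `(next_state, probability)` pairs and an inverse Bellman equation of the form $R(s, a) = Q(s, a) - \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$; the function names and toy values are hypothetical.

```python
import math


def q_from_policy(pi_row, alpha, v=0.0):
    """Q(s, .) = log(pi(s, .)) / alpha + V(s) for a Boltzmann policy.

    V(s) can be set arbitrarily; it does not change the induced policy.
    """
    return [math.log(p) / alpha + v for p in pi_row]


def reward_from_q(Q, T, gamma):
    """Inverse Bellman equation with a greedy backup (an assumption of this sketch)."""
    return [[Q[s][a] - gamma * sum(p * max(Q[s2]) for s2, p in T[s][a])
             for a in range(len(Q[s]))] for s in range(len(Q))]


# A noisy policy at one state, inverted with alpha = 2 and V(s) = 1.
q_row = q_from_policy([0.75, 0.25], alpha=2.0, v=1.0)

# Hypothetical two-state MDP: action 0 leads to state 1, action 1 to state 0.
Q = [[1.0, 0.0],
     [2.0, 0.0]]
T = [[[(1, 1.0)], [(0, 1.0)]],
     [[(1, 1.0)], [(0, 1.0)]]]
R = reward_from_q(Q, T, gamma=0.5)
```

Applying a Boltzmann distribution with the same `alpha` to `q_row` returns the original action probabilities regardless of the value chosen for `v`, which illustrates the degree of freedom in $V(s)$.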

Direct Loss Minimisation
Doerr et al. [80] propose performing direct (deterministic) policy search on a reward function $R_j(\tau) = -L(\tau_j, \tau)$ that reflects the loss between observed ($\tau_j$) and proposed ($\tau$) trajectories. That is, optimise the reward parameterisation weights through

$$\theta^* = \arg\max_\theta \sum_j R_j(\tau_\theta) = \arg\min_\theta \sum_j L(\tau_j, \tau_\theta) \qquad (48)$$

where $\tau_\theta$ is a trajectory under parameters $\theta$ (e.g., generated by the optimal policy for $R_\theta$) in the same MDP as the demonstrations. Any off-the-shelf policy search method can be used to optimise $\theta$, with the authors employing the covariance matrix adaptation evolutionary strategy (CMA-ES) optimiser.

Policy Gradient Minimisation
An alternative direct method is to find the reward function for which the parameterised policy gradient is minimised, as is done by Pirotta and Restelli [81] and Metelli et al. [82]. This removes the need for solving the MDP.

Adversarial Methods
Ho and Ermon [83] introduced a model-free adversarial framework to learn a policy, which is trained through RL under a reward function obtained through MaxCausalEnt IRL. Although this work did not contribute new IRL algorithms, it proposed a new framing of the problem analogous to generative adversarial networks (GAN) [84].
In tasks with large state-action spaces or unknown transition dynamics, the computation of the partition function in the MaxEnt objective is intractable [85]. Adversarial IRL methods [72,86] approximate the MaxEnt objective through sampling. A synthetic policy $\pi_\omega$ generates trajectories maximising an entropy-regularised policy objective $\mathbb{E}[\sum_t \hat{R}_\theta(s_t, a_t) - \log \pi_\omega(a_t|s_t)]$, whereas a binary discriminator

$$D_\theta(s, a) = \frac{\exp(\hat{R}_\theta(s, a))}{\exp(\hat{R}_\theta(s, a)) + \pi_\omega(a|s)} \qquad (49)$$

discerns between synthetic and expert trajectories. The two networks are trained adversarially, resulting in a reward function approximator $\hat{R}_\theta$ and a policy $\pi_\omega$. This shares similarities with the earlier approach in [43] (Section 3.1.1.3). The adversarial IRL approach has been extended for metalearning [85,87] (Section 6); improved with an information bottleneck [88], semantic rewards [89], or end-to-end differentiability through self-attention [90]; and adapted to language-conditioned tasks [91].
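The discriminator used in adversarial IRL can be sketched as a logistic function of the difference between the reward estimate and the log-probability of the action under the synthetic policy. The sketch below assumes the form $D(s, a) = \exp(\hat{R}) / (\exp(\hat{R}) + \pi_\omega(a|s))$ and a standard binary cross-entropy loss; scalar inputs stand in for the outputs of the reward and policy networks, and the function names are hypothetical.

```python
import math


def discriminator(reward, log_pi):
    """D = exp(R) / (exp(R) + pi(a|s)), written as a logistic function of R - log pi."""
    return 1.0 / (1.0 + math.exp(log_pi - reward))


def discriminator_loss(expert, synthetic):
    """Binary cross-entropy over (reward, log_pi) pairs.

    Expert samples are labelled 1 and synthetic samples 0.
    """
    loss = -sum(math.log(discriminator(r, lp)) for r, lp in expert)
    loss -= sum(math.log(1.0 - discriminator(r, lp)) for r, lp in synthetic)
    return loss / (len(expert) + len(synthetic))


# When exp(R) equals pi(a|s), the discriminator cannot tell the sources apart.
d_balanced = discriminator(math.log(0.5), math.log(0.5))
d_expert_like = discriminator(2.0, math.log(0.5))
loss = discriminator_loss(expert=[(0.0, 0.0)], synthetic=[(0.0, 0.0)])
```

At the balanced point the discriminator output is 0.5, so neither source can be distinguished, which is the situation the adversarial training is expected to approach as the synthetic policy improves.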
In this subsection, we have outlined the many approaches that have been proposed to discriminate between the several reward functions that could explain a given set of behavioural demonstrations. Maximum margin methods do so by attempting to maximise the margin between the chosen reward function and any other alternatives (Section 3.1.1). Probabilistic methods interpret the rewards as a random variable and the state-action pair demonstrations as evidence, framing the problem as Bayesian inference to obtain a posterior distribution over rewards (Section 3.1.2). This is extended in maximum entropy methods, which seek to account for interdependencies between action choices at the trajectory level to provide a more accurate way to select from plausible reward functions (Section 3.1.3). A gradient method is proposed to obtain a MAP estimate of the rewards without needing to integrate over the entire solution space, showing how most previous methods can be unified under this perspective (Section 3.1.4). Others, by approximating the environment by using the LMDP construct, are able to recover the reward function without needing the actions to be given in the demonstrations (Section 3.1.5). A class of more direct methods exploit the algebraic definition of the IRL problem to find solutions by means of optimisation techniques (Section 3.1.6). Finally, adversarial methods train a synthetic policy to generate trajectories and a discriminator to discern between expert and synthetic trajectories, converging into a useful reward approximator (Section 3.1.7). All of these approaches assume the solution space for reward functions is defined a priori. In what ways may this solution space be defined? In other words, how may these reward functions be characterised?

Reward Function Characterisation
Early IRL methods assumed linear approximations of the reward function over basis functions, or features φ(s, a). Features have a natural interpretation as elements of the environment that can be perceived and on the basis of which decisions are made. Each feature may be more or less valuable to the decision-making process given context and goals, and they may have complex, logical, hierarchical relationships amongst them and with respect to the rewards (beyond linear). Some approaches include "raw" or "primitive" features that can simply be enumerated without taking their relevance into account, and which form the basis on which to compose more complex or abstract features. In most IRL methods, R defines how to combine and how much value to attribute to features, placing them at the core of the problem. Important considerations arise from this, such as whether they are perceived equally by the actor and the observer, including issues of partial observability, beliefs, perspective taking, and differing ways to interpret and combine raw features. A notable exception in early methods is Bayesian IRL [59] (Section 3.1.2.2), which constructs a Markov chain in R space to sample from the space directly, avoiding the use of features.
Although functions of features afford the possibility of using kernelisation to incorporate nonlinearity, the kernel versions of these functions can have intractable computation and memory requirements. Methods based on matching feature expectations or feature counts do not hold when the reward function is nonlinear in the features. The policy matching method [53] (Section 3.1.1.6) used linear functions in the experiments but applies to any R that is differentiable with respect to θ. They show their method to produce results even when the knowledge of the features is incomplete, through experiments with features that are transformed (linearly) and perturbed (by uniform noise).
Ratliff et al. [92] introduce an algorithmic boosting procedure based on maximum margin planning [49] (Section 3.1.1.5) to learn a nonlinear mapping from a set of feature primitives that is capable of inducing new features, thereby reducing the feature engineering problem to a simple classification problem. The loss vector $l_{sa} = (1 - I[(s, a) \in \tau_E]) \in \{0, 1\}^d$, where $I$ is the indicator function, provides the loss for failing to match the demonstrations $\tau_E$. The procedure iterates through the following.
1. From current features $F_k$, optimise $c(\theta; F_k)$ (Equations (22) and (23)) and compute the loss-augmented reward function $R_\theta = \theta^T F_k + l^T$.
2. Train $\pi_\theta$ under current $R_\theta$ and obtain the best loss-augmented $\mu_\theta$. Early on in the process, this may differ greatly from the given $\mu_E$, as the features are not yet expressive enough.
3. Gather a training dataset $D_\phi$ of features for the classifier, comprising the feature primitives of the state-action pairs visited under the loss-augmented $\mu_\theta$ and of those visited under the demonstrated $\mu_E$, labelled as opposite classes.
4. Train a classifier on this data $D_\phi$ to generalise to other $(s, a) \notin D_\phi$.
5. Update the feature matrix $F_k$ (expanding in $d$) by classifying every $(s, a) \in S \times A$ with the classifier.
Subsequent work by the authors [93] generalises the boosting technique with a functional approach, with a simpler and nonlinear variant with faster convergence and better performance in experiments. They do so by replacing the cost term $\theta^T F_j \mu$ in the maximum margin planning objective function by a more general $\sum_{(s,a) \in S \times A} R(\phi_j(s, a)) \mu(s, a)$; thus, the objective becomes a functional in $R$ and can be optimised through functional gradient descent. Additionally, they derive an exponentiated functional gradient algorithm to ensure $R$ is positive everywhere, with the aim to make it compatible with path-planning algorithms such as A*.
A method to automatically initialise and adapt feature parameterisations in continuous space is proposed in Dvijotham and Todorov [76] (Section 3.1.5). The features are normalised Gaussian radial basis function kernels with $G(s) = [1, s_k, s_k s_l]\ \forall k \leq l$. The value function is linear in the kernels.
In Levine et al. [94], the observer learns a regression tree over $S$ to represent the reward function, with the branching determined by (binary) feature primitives $\phi^{(0)}(s) \in \{0, 1\}^{d_0}$, yielding features $\phi$ that are logical combinations of these primitives. This way, instead of minimising a measure of deviation from expert demonstrations as in previous methods, their algorithm discovers regions of the state-action space where the expressiveness of the features is insufficient with respect to $R$, and updates the features accordingly, by iteratively alternating between an $R$ optimisation step and a $\phi$ fitting step. The tree has $d$ leaf nodes each containing a set of states $S_i \subseteq S$, for $i = 1, \ldots, d$. The features can be interpreted as indicator functions $\phi_i(s) = I(s \in S_i)$. Features deeper in the tree are more complex combinations of feature primitives.
For the optimisation step, R is constrained by two things: the demonstrations D, because the optimal policy under R must be consistent with them; and the current features φ, so that R must minimise the sum of squares error with its projection onto the feature space. The projection is performed by means of G Rφ ∈ R^{d×|S|} and G φR ∈ R^{|S|×d} , defined so that the vector G φR G Rφ R ∈ R^{|S|} encodes the reward for each state, computed as the average reward over the states in the S i that s belongs to. They set the optimisation step as a sparse quadratic program, in which the regularisation term discourages similar features from taking new values by employing a sparse matrix N ∈ R^{K×d} of feature distances, where each row k out of K = d(d − 1)/2 corresponds to a pair of features (φ i , φ j ), and N k,i = −N k,j = ∆(φ i , φ j ). The use of an ℓ1-penalty for this term is justified by the preference for potentially mergeable features to be very similar to each other, rather than having minimal distance to all others. In the feature optimisation step, a reward function candidate is computed at each node with

\hat{R}(s, a) = \begin{cases} |S_i|^{-1} \sum_{s' \in S_i} R(s', a) & \text{if } s \in S_i \\ R(s, a) & \text{otherwise} \end{cases} \qquad (53)

and the pertaining optimal policy is obtained with value iteration. If the optimal policy for R̂ is consistent with D, the node is set as a leaf node, R ← R̂, and the iteration terminates. The feature distance measure ∆(φ i , φ j ) is defined to be proportional to the depth of the deepest common parent node for φ i and φ j and acts as a measure against overfitting. Additionally, the maximum allowed depth of the tree is increased with each iteration. Their algorithm consistently converges in very few iterations. It does not scale to continuous space because it needs to enumerate all s ∈ S for the optimisation step, though approximation techniques may be used to construct a tractable set of constraints to allow for this. Incorporating priors in the fitting step may make learning more efficient.
Other regression techniques (including neural networks) can be used instead of regression trees.
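The projection onto the leaf-indicator features described above can be illustrated with a minimal sketch. It assumes a state-only reward vector and a given assignment of states to leaves; the function name `leaf_projection` and the toy values are hypothetical and not taken from [94].

```python
import numpy as np

def leaf_projection(reward, leaf_of_state):
    """Replace each state's reward by the mean reward of the leaf S_i it belongs to."""
    reward = np.asarray(reward, dtype=float)
    leaf_of_state = np.asarray(leaf_of_state)
    n_leaves = leaf_of_state.max() + 1
    # Indicator features phi_i(s) = I(s in S_i), shape (|S|, d).
    phi = np.eye(n_leaves)[leaf_of_state]
    leaf_sizes = phi.sum(axis=0)
    # Leaf means: plays the role of G_{R phi} R (a vector of length d).
    leaf_means = (phi.T @ reward) / leaf_sizes
    # Back to state space: plays the role of G_{phi R} G_{R phi} R.
    return phi @ leaf_means, phi

# Toy example (assumed values): five states grouped into three leaves.
reward = np.array([1.0, 3.0, 0.0, 0.0, 10.0])
leaves = np.array([0, 0, 1, 1, 2])
projected, phi = leaf_projection(reward, leaves)

# Sum-of-squares error between R and its projection; a large value suggests
# the current features are not expressive enough in some region.
residual = float(np.sum((reward - projected) ** 2))
```

Here the first leaf averages two states with different rewards, so the residual is nonzero there, which is the kind of region a further split of the tree would target.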
A limitation of the above nonlinear reward function methods is that they assume optimal demonstrations. Two concurrent but differing works [95,96] leverage Gaussian processes (GP) to learn nonlinear reward functions of the features that do not require the expert behaviour to be optimal. Furthermore, unlike the above methods, which use the max-margin heuristic to discriminate between reward functions, they are probabilistic. Jin et al. [95] extend [42]'s projection method to continuous spaces by using kernels (GP). The use of kernel machines has issues with scalability, with complexity increasing in the amount of data, and requiring large numbers of training samples for tasks with high variability in the reward structure [97]. Grounded on the MaxEnt perspective, the algorithm in Levine et al. [96] learns a reward function and a kernel function by means of a probabilistic model of the demonstrations and a GP prior on rewards. The learned kernel function comprises feature weights that capture the relevance of each feature to the agent's reward function, an important capability from the ToM perspective. Though they use the mean posterior of the learned reward distribution, they suggest that the entire distribution could be used for different exploration/exploitation tradeoffs in policies, or to elicit more information for regions of high uncertainty. Because it is linear in state, it may not converge in large spaces. This was addressed in subsequent work by local approximation of the reward function likelihood [98].
Kim and Park [99] extend the original AL method [42] with reproducing kernels, simplifying the training and making it robust to local optima, as well as robust and efficient when only small demonstration samples are available.
Choi and Kim [100] propose a nonparametric method to construct the features based on Bayesian IRL. These features are again constructed from logical combinations of primitives. The number of features does not need to be defined beforehand. The prior is an Indian buffet process (IBP).
An alternative proposed by Michini and How [101] is to partition the demonstrations into smaller subtrajectories to simplify the complexity requirements of the reward function approximator. Interpreting them as subgoals, simpler reward functions are then obtained for each. They contribute a Bayesian, nonparametric algorithm that automates the partitioning based on a generative model. With a Chinese restaurant process (CRP) prior, the number of partitions does not need to be predetermined and has no limitation in number. This has a number of advantages. A subgoal may be as simple as a single state or feature, so sparse reward functions can be obtained through this method. It also removes sequential dependencies, making it robust to changes in the initial conditions and better able to handle cyclic trajectories.
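To make the role of the CRP prior concrete, the following is a minimal sketch of sampling a partition from a Chinese restaurant process. It only illustrates why the number of partitions need not be fixed in advance; it is not the inference procedure of [101], and the function name and concentration value are assumptions.

```python
import random

def sample_crp_partition(n_items, alpha, rng):
    """Draw partition labels for n_items from a CRP with concentration alpha."""
    assignments = []
    counts = []  # counts[k] = number of items already assigned to partition k
    for _ in range(n_items):
        # An existing partition k is chosen with weight counts[k];
        # a new partition is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

rng = random.Random(0)
labels = sample_crp_partition(n_items=20, alpha=1.0, rng=rng)
n_partitions = len(set(labels))
```

Larger values of `alpha` make new partitions (subgoals, in the setting above) more likely, while the "rich get richer" weighting favours reusing existing ones.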
Metelli et al. [82] induce the features which, taken as basis functions, span the subspace of reward functions for which the policy gradient is zero (i.e., under which the policy is optimal). The reward function for which deviations from the demonstrations have the highest penalty is selected from this subspace.
The advent of deep architectures provided a way to learn reward functions directly from "raw" state representations (such as images). Wulfmeier et al. [97] leverage neural networks trained through backpropagation, under the MaxEnt paradigm, to approximate complex, nonlinear reward functions. The features may be learned by the network (e.g., convolutional NN for visual states), without having to rely on handcrafted (given) feature functions. Neural networks aptly scale to complex reward structures in large state spaces. As the computational complexity of this method does not increase with the number of demonstrations, it is suitable for lifelong learning, a desideratum for ToM-IRL. However, it requires access to the MDP to train a policy at each iteration. Wulfmeier et al. [102] extend their deep MaxEnt IRL approach [97] with new architectures for more complex environments. Their approach is shown to be scalable to large demonstration datasets. Similarly, Bogdanovic et al. [103] demonstrate learning to play simple video games in pixel state spaces from expert demonstrations with deep AL. They also show that their method can be extended with an approach similar to [79] to retrieve the reward function [104]. NN are also used to approximate R in [105], avoiding the need to solve the MDP at each iteration. Others propose a binomial logistic regression classifier-based method to learn the value and (nonlinear) reward functions without needing to solve the forward MDP [106].
Training models through Bayesian variational inference has been successful in uncovering nonlinear reward functions. Jin et al. [56] employ deep GP to concurrently learn abstract representations of state features and the reward function. The reward function is modelled as a zero-mean GP prior as in [96], and representations are learned through stacked latent GP layers. Bayesian neural networks (BNN) are finite-dimensional equivalents to GP. Roa-Vicens et al. [57] apply BNN to solve the IRL problem by exploiting their ability to robustly and efficiently characterise a reward function from point estimates obtained by MaxCausalEnt. The process consists of an inference step optimising the likelihood of the demonstrations to obtain point estimates of the rewards, and a learning step that uses the point estimates to train a BNN mapping features to rewards.
Two approximations to MaxCausalEnt IRL for tasks with unknown dynamics have been proposed. Finn et al. [107] propose an adversarial, sample-based approximation algorithm for MaxEnt IRL that is capable of learning nonlinear reward functions as well as efficiently scaling to continuous, high-dimensional state spaces, without relying on a transition dynamics model. Fu et al. [86] introduce adversarial IRL (AIRL). Focusing on scalability to large, high-dimensional tasks with unknown dynamics, their algorithm obtains reward functions that are robust to changes in environment dynamics, and are thereby able to generalise better beyond training. Following [97], they use a NN as a reward function approximator (i.e., there is no need for a feature map). Furthermore, by estimating the gradient through sampling, it does not require a transition dynamics model to be given (although it requires access to the MDP for simulation).
In this subsection, we have seen the important role that features play in characterising the reward function. The expressivity of the reward function has a direct dependency on the complexity of the features and their relationships. It is important for our discussion to note that the features in the algorithms belong, phenomenologically, in the observer. Though the agent's decision making does indeed depend on features (things in the world that it can perceive), there may or may not be an overlap in the features that the agent and the observer perceive, depending on how the problem is framed conceptually. As a simple illustrative example, consider a blind person walking on the street. As they navigate by using tactile and auditory features, one may infer their "reward" function (e.g., where they want to go) based on visual features that they are certainly not making use of. Future IRL approaches in the context of ToM could benefit from preemptively selecting features based on the type the agent is perceived to be, as a form of perspective-taking. This could be achieved by means of the priors that some of the algorithms above have available as "stored sources" of information, to be used in combination with "immediate sources" observed from the external world [108]. Sections 4.1 and 4.2 provide support for this point of view.
Learning an agent's exact R * is usually not possible, nor is it necessary, because the purpose of knowing R is to act strategically in the context of a particular interaction [46]. This is supported by Samuelson's Theory of Revealed Preferences [37], which states that consumer behaviour is the most reliable indicator of their preferences (read utilities or rewards). The identifiability of the reward function has been flagged and addressed as a fundamental problem in IRL since its conception, but only recently has analysis of the problem been undertaken. Kim et al. [109] formalise the problem and show its relation to properties of the MDP, providing algorithms to establish whether an MDP's rewards are identifiable. This analysis is extended by Cao et al. [110], who find that a single reward function is not identifiable even if the optimal policy is fully known, and that because the value function parameterises the reward space, it is all that is required in conjunction with the optimal policy to recover a suitable reward function (cf. Section 3.1.6.1). Interestingly, they also show that, in the absence of a value function, rewards can be uniquely identified up to a constant if a policy under different discount factors or transition dynamics is given. This highlights the importance of parameters of the MDP (γ, T) beyond the reward function. These recent findings ought to be incorporated into any new IRL algorithms.
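A small numerical illustration of why a single reward function cannot be identified from an optimal policy alone is sketched below. It uses the standard potential-based shaping transformation R'(s, a, s') = R(s, a, s') + γΦ(s') − Φ(s), which leaves the optimal policy unchanged. The random MDP, the potential Φ, and all names are assumptions for illustration; this is not the analysis of [109] or [110].

```python
import numpy as np

def greedy_policy(T, R, gamma, n_iter=500):
    """Value iteration with T[s, a, s'] transition probabilities and R[s, a, s'] rewards."""
    V = np.zeros(T.shape[0])
    for _ in range(n_iter):
        # Q(s, a) = sum_s' T(s, a, s') * (R(s, a, s') + gamma * V(s'))
        Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9
T = rng.random((n_s, n_a, n_s))
T /= T.sum(axis=2, keepdims=True)        # normalise rows into distributions
R = rng.random((n_s, n_a, n_s))

# Arbitrary potential over states (hypothetical values).
potential = rng.normal(size=n_s)
R_shaped = R + gamma * potential[None, None, :] - potential[:, None, None]

pi, V = greedy_policy(T, R, gamma)
pi_shaped, V_shaped = greedy_policy(T, R_shaped, gamma)
```

Both reward functions induce the same greedy policy, while the value functions differ exactly by the potential, so an observer who sees only the policy cannot tell the two rewards apart.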

Inferring an Agent's Beliefs
The relationship between desires and beliefs is tightly coupled and influences an agent's perceived rationality: an agent's behaviour that appears irrational under a set of beliefs may turn out to be completely rational under another. The majority of work in IRL has focused on recovering the reward function of the agent (i.e., its desires), and has mostly neglected its beliefs. On the other hand, the beliefs of agents can be inferred by observing their behaviour [111].
In the structural estimation of MDPs [112], the econometrics counterpart of IRL that inspired its conception, an agent is represented by the tuple of "primitives" (R, T, γ), which in conjunction with its policy π results in a "controlled stochastic process" {(s t , a t )} t=1,... . The discount factor γ has an effect on the value function, capturing the agent's time preference. As such, it may be a parameter of interest in modelling an agent's decision-making in ToM, but IRL attempts at inferring this parameter are scarce. Suggestions include using a prior over γ [62] or jointly optimising R and γ [113]. Other work addresses the challenge of entanglement of rewards over time [110] and bias for short-term rewards [114]. Note that in contrast to the assumption we have seen in IRL algorithms so far, the transition dynamics T are modelled as part of the agent, i.e., they represent the agent's subjective beliefs about the outcomes of its actions. Seeing a given set of trajectories as realisations of the controlled stochastic process, the goal is to uncover the driving policy and the primitives that generated it (including both its desires and beliefs about the effects of its actions on the environment). Other agent beliefs to be considered are those about the actual state of the environment, since observations alone may not be sufficient to know with certainty the exact state of the environment. In this section, we review how IRL methods have addressed beliefs as relating to the transition dynamics in Section 4.1 and state observability in Section 4.2.

Transition Dynamics
A strong assumption of previous IRL methods is that the transition dynamics model of the agent is known by the observer (e.g., [40,59]), or assumed to have a negligible effect on the agent's decision making (e.g., [52]) [115]. Numerous model-free IRL methods have been proposed (e.g., [106,116–118]). Here we are interested in how differences between the actual dynamics and the expert's beliefs thereof may be modelled. IRL methods to estimate the environment's actual transition dynamics T(s, a, s′) = Pr(s′ | s, a) and the agent's belief about them T E (s, a, s′) = Pr(s′ | s, a) as well as the rewards have been proposed: by mapping transition probabilities to distributions over features [115]; maximising the likelihood of demonstrations with respect to parameters for the reward function, real transition dynamics, and agent's belief about the transition dynamics θ = (θ R , θ T , θ E ) [119]; or by first observing the agent in tasks with known rewards and subsequently learning the parameterisation θ = (θ j=1:m , θ E ) (requiring the real transition dynamics to be known) [120]. Others perform reward learning with biased beliefs about dynamics [121], study degradation in performance as the transition functions differ between actor and observer [74], or study the impacts of changes in the environment dynamics [122,123]. In earlier work, internal dynamics models are learnt from demonstrations without learning the reward, in a subset of tasks with linear-Gaussian dynamics and quadratic rewards [124], or by selecting from a discrete set of candidate models [125,126].
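The distinction between T and T E can be illustrated with a toy sketch in which the agent plans optimally under its believed dynamics while its behaviour is evaluated under the true dynamics. The three-state MDP, the numbers, and the function names are all assumptions chosen for illustration and do not correspond to any of the cited methods.

```python
import numpy as np

def value_iteration(T, R, gamma, n_iter=1000):
    """T[s, a, s'] transition probabilities, R[s, a] rewards; returns greedy policy and V."""
    V = np.zeros(T.shape[0])
    for _ in range(n_iter):
        Q = R + gamma * (T @ V)          # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

def evaluate_policy(T, R, gamma, policy):
    """Exact value of a deterministic policy under dynamics T."""
    idx = np.arange(T.shape[0])
    T_pi, R_pi = T[idx, policy], R[idx, policy]
    return np.linalg.solve(np.eye(len(idx)) - gamma * T_pi, R_pi)

gamma = 0.9
# States: 0 = start, 1 = goal (absorbing), 2 = hazard (absorbing).
# Actions: 0 = safe, 1 = risky.
R = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
T_true = np.array([
    [[0.7, 0.3, 0.0], [0.0, 0.5, 0.5]],
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
# The agent is overconfident about the risky action (its belief T_E).
T_belief = T_true.copy()
T_belief[0, 1] = [0.0, 0.95, 0.05]

pi_true, V_true = value_iteration(T_true, R, gamma)
pi_agent, _ = value_iteration(T_belief, R, gamma)
V_agent_in_world = evaluate_policy(T_true, R, gamma, pi_agent)
```

In this example the agent chooses the risky action in the start state, which is optimal under T E but suboptimal under T; an observer who assumes the agent knows T would misattribute the behaviour to different rewards or to noise.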

State Observability
False belief tasks are prominent ToM assessment tasks [6]. MDPs, while providing a compact paradigm for the study of rational decision making, have a critical limitation in the context of algorithmic approaches to ToM: agents' beliefs about the state of the world are always accurate [19]. Partially observable MDPs (POMDPs) are an extension of the MDP construct in which states may not be fully identified by the agent from perceptual information. Instead, agents have a belief distribution b(s) over S, given the evidence up to the current time step, denoting the agent's belief that a given state is the actual current state. Several IRL methods have been proposed under the POMDP formulation, including a general framework to extend existing IRL methods where agents act according to their beliefs, as opposed to the actual state, so policies are a mapping π : ∆ → A, where ∆ is the belief simplex in |S| − 1 dimensions [127]. Others specifically provide a computational implementation of ToM, using Bayesian inference to reconstruct agents' joint belief state and desires [19]. In this approach, the observer encodes this joint distribution in a dynamic Bayesian network (DBN) with world state (Y), agent state (X), percept (O), belief (B), reward (R), and action (A) variables. The world and agent states result from a decomposition of the MDP state S, because the agent state x ∈ X is fully observed, but the world state y ∈ Y is not. The agent's belief at a given time t is a probability distribution b(y) over Y denoting the agent's belief that a given y is the actual state of the world (fixed for the entirety of a given episode). The observer's joint distribution (conditioned on a given sequence up to time T ≥ t) of belief at time t and reward function can be recovered from the DBN. Beliefs are smoothed retrospectively: the observer's model of the beliefs of the agent at time t is informed by behaviour up to time T ≥ t, bearing similarity to how humans model others' beliefs. Earlier work by the authors, under the name of Bayesian inverse planning [33,128], addresses a subset of the IRL problem, wherein the reward function is known to be constrained to R(s) = −1 ∀s ∈ S \ {s g }, where s g is the absorbing goal state.
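For reference, the belief distribution b(s) in a discrete POMDP is maintained with the standard Bayes filter, b′(s′) ∝ Pr(o | s′) ∑ s T(s, a, s′) b(s). A minimal sketch follows; the array layout, the name `obs_model`, and the toy numbers are assumptions for illustration and are not specific to [19] or [127].

```python
import numpy as np

def belief_update(belief, action, observation, T, obs_model):
    """One Bayes-filter step for a discrete POMDP.

    T[s, a, s'] are transition probabilities and obs_model[s', o] is the
    probability of perceiving o when the state is s' (assumed layout).
    """
    predicted = belief @ T[:, action, :]            # sum_s T(s, a, s') b(s)
    unnormalised = obs_model[:, observation] * predicted
    total = unnormalised.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return unnormalised / total

# Toy example: two states, one action that leaves the state unchanged.
T = np.eye(2)[:, None, :]                           # shape (2, 1, 2)
obs_model = np.array([[0.8, 0.2],
                      [0.3, 0.7]])
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=0, T=T, obs_model=obs_model)
```

After perceiving observation 0, which is more likely in state 0, the belief shifts toward state 0 without becoming certain; a false belief corresponds to such a distribution placing most of its mass on a state other than the true one.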
To summarise, one cannot assume that agents have perfect knowledge of the effects of performing an action (i.e., their transition dynamics model may not match that of the environment), or what the actual state of the world is (i.e., their beliefs about the current state may be a more or less accurate distribution over states based on observations of the environment). Furthermore, from a ToM perspective, the observer's transition dynamics and current state models may differ from both the actual environment's and the agent's. For IRL methods that rely on the actual state and transition dynamics being given to be suitable in the ToM context, they need to be expanded to incorporate these considerations about beliefs.
Orthogonally, there is yet another factor to be considered. Even if an agent were to have a clearly defined reward function and perfectly accurate beliefs, its behaviour may not be in accordance with them, be it due to a lack of skill or due to other extrinsic effects. Bridging the gap between desires and beliefs on the one hand and behaviour (actions) on the other are an agent's intentions.

Agent's Intentions
Intentions reflect an agent's commitment to acting, guided by their beliefs, toward states of the world that align with their desires. In the MDP formulation, intentions manifest as the selection of actions so as to maximise expected utility, i.e., in the agent's policy. Here we take a brief look at IRL methods that have taken the agent's intentions into account, be it through considering potentially suboptimal behaviour with respect to the true rewards (Section 5.1), or the possibility that agents have multiple intentions (Section 5.2).

Suboptimal Demonstrations
The policy is ultimately what generates behaviour (trajectories). Although, under the assumption of rationality, the policy is expected to choose actions so as to maximise value, it may not achieve this optimally, reflecting the skill of the agent. This creates a challenge for the recovery of a reward function; we may recover a reward function under which the observed behaviour is optimal, but the behaviour may not have been optimal with respect to the true reward function in the first place. Some methods avoid dealing with this by making their goal to mimic the expert directly, in a manner agnostic about the underlying MDP and R (although they employ policy occupancies obtained in the MDP under the proposed rewards) [49] (Section 3.1.1.5). Another example [98], designed for large state spaces, requires local optimality only. The use of Boltzmann policies in many of the methods reviewed (e.g., [53,59]) entails a certain level of suboptimality given by the stochasticity of the policy: the agent's rationality is captured, to some degree, in the temperature parameter that defines the bias of the action choice distribution toward actions with higher expected value. This may also be related to curiosity or openness to risk.
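The effect of the temperature parameter in a Boltzmann policy can be sketched as follows. This is a generic softmax over action values, with hypothetical Q-values, rather than the exact parameterisation used in [53] or [59].

```python
import numpy as np

def boltzmann_policy(q_values, temperature):
    """Action probabilities proportional to exp(Q(s, a) / temperature)."""
    q = np.asarray(q_values, dtype=float)
    logits = q / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

q = [1.0, 2.0, 3.0]                       # hypothetical action values in one state
near_greedy = boltzmann_policy(q, temperature=0.01)
near_uniform = boltzmann_policy(q, temperature=100.0)
```

As the temperature approaches zero the policy concentrates on the highest-valued action (a fully rational agent), whereas a high temperature yields nearly uniform choices regardless of value.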
Relaxing the assumption that behaviour needs to be consistent with rewards allows for modelling suboptimal agents in an influence diagram framework (including agents with changing desires) [129], or through a model-free relative-entropy approach [68]. Alternatively, suboptimal behaviour may in fact be optimal with respect to an agent's incorrect belief about the transition dynamics (see Section 4.1) [120], or it may otherwise be attributed to random perturbations in the environment [130,131], or modelled as different experts with a common underlying reward function [132]. Other work seeks a balance between how compatible a reward function is with demonstrations, and how effective it is for learning a policy [113]. If an approximate ranking of demonstrations is provided, a reward function that explains the ranking can be extrapolated to train policies that outperform the demonstrator [133]. Others have shown the usefulness of (labelled) successful and failed demonstrations for learning [73]. The importance of modelling biases in human behaviour (beyond noisy rationality or simplicity assumptions) has been highlighted in [134].

Multiple Intentions
When inferring an agent's reward function from its behaviour, it is important to take context into account. There may be different types of agents, or agents with different preferences for achieving the same goal, with different skills [135] or options [136] or intentions [116], or different types of goal for a single agent, so being able to produce multiple reward functions is desirable. Thus, in what is known as multiple intention IRL (MI-IRL), two problems need to be solved: clustering trajectories based on their intentions, and recovering the reward function for each of the clusters. Methods to achieve this can be classified by whether the number of distinct reward functions is known ex-ante or not.
Parametric methods require the number of reward functions to be known ex-ante. Unsupervised clustering via expectation-maximisation (EM) can be employed to discern common intentions to find a R θ k for each cluster k, with the assumption that there are K < m clusters [137]. This approach has been extended with gradient methods by constraining the optimisation problem to rewards that are stationary points of the value function, with the selected reward being the MLE of the estimated policy gradient [116]. Subsequent work built on this method to account for nonstationarity in the environment and the expert's policy [138]. To scale Bayesian IRL to complex environments with large, high-dimensional state spaces (e.g., robotics), others propose metalearning the reward function parameters, finding a parameterisation for each of the provided demonstrations by assuming the reward weights are close to the mean of the weights over the tasks [139].
Nonparametric methods do not require the number of different reward functions to be known beforehand. Bayesian nonparametric methods have been proposed to achieve this, extending previous parametric clustering methods with the structured generalisation of Bayesian IRL from [62] in [140], or using a Dirichlet process mixture model to draw cluster assignments and reward functions for each cluster through an MCMC algorithm, with the ability to transfer modelled information to new observations [141]. MaxEnt methods combining Dirichlet process-based clustering of demonstrations have also been proposed, including a gradient-based solution based on a Lagrangian relaxation of the resulting nonlinear optimisation problem [142], and employing a deep reward network [143]. Others extend this thinking to continuous action spaces via the path integral MaxEnt method from [70] with hierarchical clustering [144]. A more recent method based on contextual MDPs is able to learn from different experts with nonstationary policies without an assumption of optimality, employing subgradient-based optimisation [145].
As we have seen, mentalising complex agents in the real world will require algorithms that can handle discrepancies between intentions and behaviour, manifesting as suboptimal behaviour with respect to the true reward function, as well as the possibility that agents have multiple intentions they behave in accordance with, and whose number may be unknown. Despite the number of operational issues that IRL needs to overcome to be a practicable algorithmic basis for ToM, our review has shown that there is a wealth of methods that aptly address each or some combination of them. What remains to be accomplished is the development of methods capable of modelling not only desires, but beliefs and intentions too, and to do so in large and complex spaces with degrees of uncertainty. For these methods to be truly effective, they ought to heed the considerations in the following section, toward which valuable contributions have been made independently, as we outline.

Further Considerations
Having reviewed the main approaches to IRL and how they relate to desires, beliefs, and intentions, here we outline some remaining important considerations and open challenges. IRL approaches to ToM need to be able to handle vast, complex state spaces to be successful in the real world. A number of methods made advances toward extrapolating to a large state space from demonstrations in a small subset of the space, through minimising the relative entropy between the observed and a learned policy's trajectories [68], by using local approximations of the reward function [98], or employing deep neural networks (DNN) as reward function approximators [102,146] or feature encoders [147]. Others scale Bayesian IRL to large state spaces, through approximate variational inference (with the additional advantage of not requiring it to solve the forward MDP at each iteration) [58], or leveraging multiple RL algorithms with different configurations as approximators to create a multifidelity Bayesian optimization framework [148]. Recent work has shown promising results in large state spaces such as pixel inputs in the Atari suite [133,147,149], or real-world driving data [69,150].
The number of demonstrations required for accurate modelling is an important consideration for inference, and thus for ToM algorithms. Feature expectation estimates can be obtained from observed trajectories, and (under a linear reward parameterisation) the number of expert demonstrations required scales with the dimensions of the features d, but not with |S| or the complexity of π E . Estimating the actor's policy from empirical averages has the advantage of not requiring the transition dynamics model to be known. On the other hand, it requires large amounts of trajectory data to be accurate, as well as being limited to states that are visited by the actor. This may be addressed through synthetic data, such as generating trajectories from the learned reward function mimicking the expert to generalise the expert's actions to unseen state space regions [35]. Sample efficiency also affects adaptation to new tasks (e.g., new agents, or new contexts). Metalearning methods seek to uncover the structural similarities of different tasks to be able to more readily adapt to new tasks. They have been used to learn effective initialisation of the reward network parameters in AIRL [87,139], and to learn the similarities between tasks to build a prior, showing good performance in navigation tasks from pixels [151], or to adapt to small and state-only demonstration samples by conditioning the function approximators on context [152]. These, however, require known transition dynamics, a shortcoming addressed by disentangling the reward function from the environment dynamics through probabilistic embeddings, adapting to different tasks from single demonstrations through conditioning the rewards and policy on a latent context variable [85].
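The empirical estimate of feature expectations mentioned above can be sketched as the average, over observed trajectories, of the discounted sum of features along each trajectory. The one-hot feature map and the toy trajectories below are assumptions for illustration.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma):
    """Empirical estimate: mean over trajectories of sum_t gamma^t * phi(s_t)."""
    total = None
    for states in trajectories:
        discounts = gamma ** np.arange(len(states))
        feats = np.array([phi(s) for s in states], dtype=float)   # shape (T, d)
        mu = discounts @ feats                                    # shape (d,)
        total = mu if total is None else total + mu
    return total / len(trajectories)

def one_hot(state, n_states=3):
    """Hypothetical feature map: indicator of the visited state."""
    v = np.zeros(n_states)
    v[state] = 1.0
    return v

demos = [[0, 1, 2], [0, 0, 2]]            # two toy state trajectories
mu_hat = feature_expectations(demos, one_hot, gamma=0.5)
```

No transition model is needed for this estimate, but features of states that the actor never visits (none in this toy example) would receive zero weight, which is the limitation noted above.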
Another important consideration is how noisy or incomplete the observations are. The effect of noise in features can be mitigated by propagating information between states [153]. Incomplete trajectories, for example due to occlusion, have been addressed with a generalisation of the MaxEnt IRL approach [154]. Work has gone into establishing whether an observation is sufficient to recover a (linear) reward function, allowing for new information to be included incrementally and identifying irrelevant features [155]. Both occlusions in trajectories and noisy perception by the observer are addressed with an approach grounded in Bayesian IRL (Section 3.1.2.2) and the MAP inference generalisation (Section 3.1.4) [156]. In more realistic settings, the actions at each timestep may not be available and the observer may need to work with state-only trajectories. This is known as the imitation from observation (IfO) problem (see [157] for a recent review). Some existing IRL algorithms naturally extend to this setting, e.g., [76,149,152]. IfO considers challenges relevant to ToM, such as perceptual encoding (vision, proprioception) [158], embodiment mismatch [159], and differences in viewpoint [160].
A clear direction for expansion of IRL methods, and especially so for their applicability as algorithmic ToM, is to settings where the observer not only observes but also acts in the environment or is able to communicate with the actor. In the cooperative setting, although the reward function may be common to both, the policies must complement each other in maximising rewards [161]. Incorporating human feedback in the learning process can be done by querying the expert for action at specific states [55], correcting suboptimal behaviour as it occurs [162], providing pairwise preferences between segments of trajectories [163], or evaluating counterfactual ("what if?") scenarios generated by the observer and thus reducing the number of interactions with the environment [50]. Human expertise can also be used to teach features to the observer [164].
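Learning from pairwise preferences between trajectory segments is often modelled with a Bradley-Terry (logistic) likelihood over segment returns. The sketch below assumes this model together with a linear reward in hypothetical features; it is one common choice and not necessarily the exact formulation in [163].

```python
import numpy as np

def segment_return(theta, segment_features):
    """Sum of linear rewards theta^T phi(s_t) over the steps of a segment."""
    return float(np.sum(np.asarray(segment_features, dtype=float) @ theta))

def preference_probability(theta, seg_a, seg_b):
    """Bradley-Terry probability that segment A is preferred over segment B."""
    diff = segment_return(theta, seg_a) - segment_return(theta, seg_b)
    return 1.0 / (1.0 + np.exp(-diff))

def preference_nll(theta, comparisons):
    """Negative log-likelihood of (preferred, other) segment pairs."""
    return -sum(np.log(preference_probability(theta, a, b)) for a, b in comparisons)

theta = np.array([1.0, -1.0])                      # hypothetical reward weights
seg_good = np.array([[1.0, 0.0], [1.0, 0.0]])      # return  2 under theta
seg_bad = np.array([[0.0, 1.0], [0.0, 0.0]])       # return -1 under theta
p = preference_probability(theta, seg_good, seg_bad)
nll = preference_nll(theta, [(seg_good, seg_bad)])
```

Minimising `preference_nll` with respect to `theta` (e.g., by gradient descent) would yield reward weights that explain the stated preferences; only the likelihood is shown here.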
The natural and most promising extension of the participative setting is to interactions where both agents have ToM abilities. This gives rise to game-theoretic considerations as both agents model each other's strategies. In psychological game theory, payoffs associated with emotions such as guilt or anger are operationalised into the utility function, going beyond the material payoff-based utility of classical game theory [29]. This framework has been used to predict behaviour in cooperative games [165] and to model the perception of others' intentions [166,167]. Emotions and mental states are closely interrelated, and computational ToM approaches may benefit from incorporating empathy and affective mentalising, as well as providing a foundation to develop standalone models thereof [168,169].
The overlap between IRL and game theory is studied in the game identification literature in econometrics [170,171], wherein the payoffs are estimated from behaviour analytically. Others do so algorithmically, by employing the game-theoretic concept of regret in conjunction with MaxEnt [172], or efficient linear programming in succinct games [173], or learning both the system dynamics and the reward function of multiagent nonzero-sum multistage games [174]. The extension from two-player games to the multiagent setting (e.g., [118,175–177]) is nontrivial and may result in emergent phenomena, particularly when sophisticated ToM is present.
An interesting application of algorithmic ToM is as applied to oneself, i.e., as a means to introspection [178]. Behaviour-rating methods for reward learning (e.g., [121]) could provide a way for an agent to rate their own behaviour retrospectively and adjust their own mental models accordingly. Most of the considerations, issues, and extensions covered in this review may be applicable to the introspection application of algorithmic ToM, providing an avenue for more sophisticated artificial agents with better social abilities.
Priors and inductive biases are built into IRL algorithms and play an important role in narrowing down the space of reward function candidates. These afford a degree of flexibility for encoding relevant ToM heuristics in an algorithm's design, such as normative assumptions [134,179], neurophysiological correlates [21,180,181], or models of human decision-making [182] and habits [183]. More generally, an observer may incorporate their experiences into the prior over time, and develop different structured priors in a hierarchical model, selecting among them based on the agent's perceived "type" [184,185]. Furthermore, beyond generalisation, the observer may build agent-specific models that encode an agent's idiosyncrasies, refined through repeated interactions [22].
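The role of a prior over agent "types" can be sketched as a simple Bayesian update. The example below is a minimal illustration under stated assumptions, not an implementation from the cited works: each type is represented directly by a table of action values (in a full IRL setting these would be obtained by planning under that type's reward function), the agent is assumed to be Boltzmann-rational with a fixed inverse temperature `beta`, and all names and numbers are hypothetical.

```python
import numpy as np

def boltzmann_policy(q_values, beta=2.0):
    """Softmax action probabilities per state; q_values has shape (S, A)."""
    z = beta * (q_values - q_values.max(axis=1, keepdims=True))
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def posterior_over_types(prior, q_by_type, observations, beta=2.0):
    """Posterior P(type | observed (state, action) pairs)."""
    log_post = np.log(np.asarray(prior, dtype=float))
    for k, q in enumerate(q_by_type):
        pi = boltzmann_policy(q, beta)
        # Add the log-likelihood of the observed actions under type k.
        log_post[k] += sum(np.log(pi[s, a]) for s, a in observations)
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Two hypothetical types over 2 states and 2 actions:
# type 0 values action 0 in every state, type 1 values action 1.
q_by_type = [
    np.array([[1.0, 0.0], [1.0, 0.0]]),
    np.array([[0.0, 1.0], [0.0, 1.0]]),
]
observations = [(0, 0), (1, 0), (0, 0)]  # the agent keeps choosing action 0
posterior = posterior_over_types([0.5, 0.5], q_by_type, observations)
```

After three observations of action 0, nearly all posterior mass falls on type 0. A more informative prior would shift how much evidence is needed before the observer commits to a type.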
Models for planning, including ToM, must be abstract, causal, and structured [11]. The IRL approach to ToM is highly abstract, compressing an agent's mental dispositions into a single reward function and environment model. The MDP framework in which it is modelled provides a causal structure, and prominent foundational algorithms, e.g., [39,63], place emphasis on this causality. Some work has gone into incorporating further structure into IRL-ToM, including the use of structured priors [62], dynamic Bayesian networks [19,128], compositional desires [186], hierarchical IRL [136,187,188], or the extension of IRL to relational domains [122,189] and contextual MDPs [145]. However, the expressiveness of MDPs as a way to encapsulate the decision-making task may be limited. Others have explored IRL and similar problems in alternative task representations such as decision trees [46], influence diagrams [129], Markov random field-based graphs [153], adaptive state graphs [190], and and-or graphs [191]. These and other more structured representations afford an avenue for further research for IRL-ToM.
Although the last two decades have produced a substantial number of IRL methods with practical results, the algorithms are highly specific to the tasks and environments for which they are trained [29]. For IRL algorithms to be useful for ToM, they must be able to make use of different cues available in different contexts [4]. Environments from the field of control have provided a useful basis on which to develop foundational IRL methods, but there is promise in expanding to more dynamic environments involving other agents, such as lane switching for autonomous driving [146], games with strategic modelling of others [192], and specific benchmarks for evaluation [143,193,194,195,196]. We need environments in which we can test both the individual differences in ToM and the degree to which some tasks require ToM more than others [197].

Conclusions
We have provided background on the IRL problem, reviewed the main algorithmic approaches with their formal descriptions, and discussed the applicability of IRL concepts as the algorithmic basis of a ToM in AI. The main goal in the IRL problem is to retrieve the reward function that best explains an agent's behaviour, which in the ToM context corresponds to the agent's desires. The foremost challenge in IRL comes from it being an ill-posed problem: a policy may be optimal under any number of reward functions, including degenerate ones; therefore, algorithms must incorporate heuristics to discriminate between solutions. Another important consideration is how the reward function is characterised, usually as a function of features of the environment. Different approaches have been taken to define the features and structure that characterise this function.
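The ill-posedness can be made concrete with a small numerical sketch. The three-state chain, the discount factor, and the candidate rewards below are assumptions chosen for illustration only. The same observed policy is optimal under the original reward, under a positively rescaled and shifted copy of it, and under the degenerate all-zero reward, while a reversed reward does not explain it.

```python
import numpy as np

def q_values(P, R, gamma=0.9, iters=500):
    """Value iteration. P has shape (A, S, S); R has shape (S,)."""
    Q = np.zeros((P.shape[0], P.shape[1]))
    for _ in range(iters):
        V = Q.max(axis=0)                  # greedy state values
        Q = R[None, :] + gamma * (P @ V)   # Bellman optimality backup
    return Q

def is_optimal(policy, Q, tol=1e-8):
    """True if the policy's action attains the maximal Q-value in every state."""
    states = np.arange(Q.shape[1])
    return bool(np.all(Q[policy, states] >= Q.max(axis=0) - tol))

# Deterministic 3-state chain (hypothetical): action 0 = left, action 1 = right.
P = np.zeros((2, 3, 3))
for s in range(3):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, 2)] = 1.0

R = np.array([0.0, 0.0, 1.0])               # reward only in the rightmost state
observed_policy = q_values(P, R).argmax(axis=0)

candidates = {
    "original": R,
    "rescaled": 3.0 * R + 1.0,   # positive scaling plus a constant shift
    "all_zero": np.zeros(3),     # degenerate: every policy is optimal
    "reversed": R[::-1].copy(),  # rewards the leftmost state instead
}
explains = {
    name: is_optimal(observed_policy, q_values(P, r))
    for name, r in candidates.items()
}
```

Three of the four candidates are consistent with the observed behaviour, so optimality alone cannot single out the agent's reward. This is why the heuristics, priors, and structural assumptions discussed in this review are needed.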
Some IRL methods also address other core ToM attitudes: beliefs (about the environmental dynamics, modelled as transition probabilities, and about the states, in the form of observations) and intentions (with consideration of potentially suboptimal observed behaviour with respect to the agent's true goals, and the modelling of multiple intentions). Further considerations have been addressed in IRL algorithms, including the size and complexity of the state space, sample efficiency and robustness to noisy or incomplete observations, the participation of the observer in the environment and its game-theoretical and recursive consequences, ToM as introspection, the incorporation of prior knowledge and structured representations, and the environments and benchmarks available for further research and applications.
As demonstrated in this review, the IRL framework encapsulates the core elements of ToM succinctly while providing enough flexibility for many and various solution methods and extensions to be developed. As such, it holds great promise as a cradle for the algorithmic basis of a ToM in AI.