Using Inverse Reinforcement Learning with Real Trajectories to Get More Trustworthy Pedestrian Simulations

Abstract: Reinforcement learning is one of the most promising machine learning techniques for obtaining intelligent behaviors for embodied agents in simulations. The output of the classic Temporal Difference family of Reinforcement Learning algorithms takes the form of a value function expressed as a numeric table or a function approximator. The learned behavior is then derived using a greedy policy with respect to this value function. Nevertheless, sometimes the learned policy does not meet expectations, and the task of authoring is difficult and unsafe because the modification of one value or parameter in the learned value function has unpredictable consequences in the space of the policies it represents. This invalidates direct manipulation of the learned value function as a method to modify the derived behaviors. In this paper, we propose the use of Inverse Reinforcement Learning to incorporate real behavior traces in the learning process to shape the learned behaviors, thus increasing their trustworthiness (in terms of conformance to reality). To do so, we adapt the Inverse Reinforcement Learning framework to the navigation problem domain. Specifically, we use Soft Q-learning, an algorithm based on the maximum causal entropy principle, with MARL-Ped (a Reinforcement Learning-based pedestrian simulator) to include information from trajectories of real pedestrians in the process of learning how to navigate inside a virtual 3D space that represents the real environment. A comparison with the behaviors learned using a classic Reinforcement Learning algorithm (Sarsa(λ)) shows that the Inverse Reinforcement Learning behaviors fit the real trajectories significantly better.


Introduction
Reinforcement learning (RL) [1] has been extensively used in recent years as a challenging and promising machine learning field in problem domains such as robot control, simulation, quality control, and logistics [2][3][4][5]. Its nature-inspired insight, based on interaction with the environment, makes it especially suitable for solving decision-making problems in which we do not clearly know all the aspects involved in the process. RL algorithms are conceptually clear, and their validity is usually demonstrated on problems that involve a discrete world with few states, a set of actions on that environment, and a tabular representation of the learned knowledge (value function). Nevertheless, when applying RL to real-life problems, we have to use more complex configurations.

Algorithm 1
Set $\theta_0$ to arbitrary values and $f_\pi^0 = 0$; $i \leftarrow 0$;
while $\| f_\pi^i - f_\varsigma \| > \epsilon$ do
    Use Soft Q-learning with the current reward function $R_i(s, a) = \sum_k \theta_{i,k}\, \phi_k(s, a)$ and find a policy $\pi_{\theta_i}$;
    Evaluate policy $\pi_{\theta_i}$ and calculate the new $f_\pi^{i+1}$;
    Gradient update $\theta_{i+1} = \theta_i + \alpha(i)\,(f_\varsigma - f_\pi^{i+1})$;
    $i \leftarrow i + 1$;
end

In this paper, we propose the use of Inverse Reinforcement Learning (IRL) to include information from the real world inside the learning process. The introduction of real examples is a way of calibrating the learning process during the process itself, guaranteeing that the learned control or behavior is similar to that provided in the examples without giving up the power of generalization inherent to the learning process. In this work, we focus on the problem of simulating the navigation of a pedestrian inside a 3D environment representing a simple maze. We have captured real traces of pedestrian trajectories in a maze recorded in our laboratory. Then, we define an adequate IRL approach for this problem domain and compare the resulting behavior with that obtained after a classic RL process in the same domain.
The results show that the IRL simulation captures fundamental aspects of the real pedestrian examples that make the simulation more realistic than the classic RL one.
The main contribution of this paper is to adapt the IRL formalism based on maximum causal entropy to the simulation of human navigation, as a means of including data from the real world to mitigate the problem of calibrating and adjusting RL-based simulators and, thus, of getting more trustworthy simulations. A challenging aspect of our problem domain is the continuous state space, which requires the use of a function approximator. The importance of getting realistic pedestrian simulations is justified by the increase in the use of simulated pedestrian flows in many kinds of social and environmental simulation scenarios such as facility design, building and infrastructure design, crowd disaster analysis, etc. [8].
The paper is organized as follows: first, we introduce the RL and IRL fundamentals; second, we present our approach to IRL for the navigation domain; then, we explain our experimental set-up and discuss the results. Finally, we present our conclusions and future work.

Background
In this section, we briefly introduce the fundamentals of RL and IRL.

Reinforcement Learning
RL [9] is an area of Machine Learning concerned with the problem of sequential decision-making. This problem has been modeled in decision theory as a Markov Decision Process (MDP). An MDP is a 4-tuple $\{S, A, P, R\}$ where $S$ is the state space, $A$ is the action space, $P$ defines the (stochastic) transition from one state to another, and $R : S \times A \times S \to \mathbb{R}$ is the reward function. The function $P : S \times A \times S \to [0, 1]$ gives the probability of changing to a certain state from another one after performing an action, and models the interactions with the environment. The state signal $s_t$ describes the environment at discrete time $t$. Let $A$ be a discrete set. In the state $s_t$, the decision process can select an action from the action space, $a_t \in A$. The execution of the action in the environment changes the state to $s_{t+1} \in S$ following the probabilistic transition function $P(s, a, s') = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$. Each decision triggers an immediate scalar reward given by the reward function, $r_{t+1} = R(s_t, a_t, s_{t+1})$, which represents the value of the decision taken in the state $s_t$. The goal of the process is to maximize at each time-step $t$ the expected discounted return defined as:

$$E\left\{\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}\right\} \qquad (1)$$

where the parameter $\gamma \in [0, 1]$ is the discount factor and the expectation $E$ is taken with respect to the probabilistic state transition $P$ [10]. The discounted return takes into account not only the immediate reward obtained at time $t$ but also the future rewards. The discount factor controls the importance of the future rewards. A policy is a function $\pi : S \to A$ and represents a particular behavior inside the MDP. Another way of representing a policy is as a probability distribution over the state-action pairs, $\pi : S \times A \to [0, 1]$. The action-value function (Q-function) $Q^\pi : S \times A \to \mathbb{R}$ is the expected return of a state-action pair given the policy $\pi$:

$$Q^\pi(s, a) = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a\right\} \qquad (2)$$

The goal of the RL algorithm is to find an optimal $Q^*$ such that $Q^*(s, a) \geq Q^\pi(s, a)\ \forall s \in S, a \in A, \forall \pi$.
The optimal policy $\pi^*(s)$ is automatically derived from $Q^*$ as defined in Equation (3):

$$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a) \qquad (3)$$

This problem is closely related to the Multi-Armed Bandit problem, in which it has a first motivation and benchmark [11][12][13].
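As a minimal illustration of the two definitions above, the following sketch computes a discounted return over a finite reward sequence and derives a greedy action from a tabular Q-function. The table `Q_TABLE`, the state names, and the action names are hypothetical values chosen only for the example; they do not come from the pedestrian domain.

```python
from typing import Dict, List, Tuple


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Finite-horizon discounted return: sum over k of gamma**k * rewards[k]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def greedy_action(q: Dict[Tuple[str, str], float], state: str, actions: List[str]) -> str:
    """Greedy policy: the action with the largest Q(state, action)."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical two-state, two-action Q-table, for illustration only.
ACTIONS = ["left", "right"]
Q_TABLE = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 0.5, ("s1", "right"): 0.1,
}
```

With these values, the greedy policy picks `right` in `s0` and `left` in `s1`, which is how a behavior is read off a learned value function.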
Two main streams exist in the RL framework: value function algorithms and policy gradient algorithms. The former tries to find a function that maps state-action pairs with their expected return for the desired task. The latter works in the policies space trying to adjust a parametric policy via gradient ascent to find one that accomplishes the task. Soft Q-learning, used in this paper (see Section 2.2.1), should be included in the first group.

Inverse Reinforcement Learning
In RL, the source of information for the learned control or behavior is the reward function. This signal must be rich enough to evaluate all the key decisions taken in the learning process to converge to the optimal policy. Nevertheless, in many cases, the design of the reward function is difficult, especially when the policy we want to find is very specific with respect to the selection of actions (i.e., a bad selection of an action can give a result far from the expected control or trajectory). In these cases, the definition of a rough reward function can result in unsatisfactory policies. IRL addresses the problem from another point of view. Given a set of examples taken from real life or from an expert, the goal of IRL is to find a reward function that can provide, after an RL process, a policy that generates results similar to those presented as examples [14][15][16]. Formally, given $\{S, A, P\}$ and a policy $\pi$ (from examples or an expert), we want to find the set of reward functions $\{R_i\}$ for which $\pi$ is an optimal policy in the MDP $\{S, A, P, R_i\}$. In practical cases, we are not provided with a policy $\pi$ but with a set of examples derived from it. In this case, the goal is to define rewards that generate policies consistent with the provided examples.
Several IRL approaches consider the existence of a vector $\phi$ of feature functions $\phi_i : S \times A \to \mathbb{R}$ and a set of examples $\varsigma_i$, that is, sequences of pairs $(s^i_j, a^i_j)$ produced by an expert or by a real phenomenon in nature. It is assumed that the feature functions $\phi_i$ measure properties that are relevant for the decision process (e.g., in the experiment proposed in [17], they are the current occupancy of a specific cell in the gridworld; in a driving simulator, they can be properties of the driving style such as lane changes or crashes with other cars). Given an MDP, a discounted feature expectation vector is defined for a policy $\pi$ and a discount factor $\gamma$ as

$$f_\pi = E_\pi\left[\sum_{t=0}^{T} \gamma^{t}\, \phi(s_t, a_t)\right] \qquad (4)$$

In the case of the description of the real process, we do not know the implicit policy, and the vector is estimated using the set of $N$ examples $\varsigma_i$:

$$f_\varsigma = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{T} \gamma^{j}\, \phi(s^i_j, a^i_j) \qquad (5)$$

These definitions are also valid for episodes with infinite time horizon ($T \to \infty$). Considering a model of the reward function as a linear combination of the feature functions, $R(s, a) = \sum_i \theta_i \phi_i(s, a)$, the feature expectation vector of a policy $\pi$ completely determines the expected sum of discounted rewards for acting using this policy, since that sum equals $\theta^\top f_\pi$ [16,17]. The goal of IRL is to optimize the reward function to generate a policy $\pi$ with a discounted feature expectation vector $f_\pi$ that satisfies:

$$f_\pi = f_\varsigma \qquad (6)$$

Matching feature expectation vectors is an ill-posed problem: there are many solutions to this optimization problem, including the degenerate solution (that is, assigning probability zero to the trajectories of the set $\varsigma$). Several formulations of the IRL problem have been proposed. The work by [16] formulates the problem as a linear programming optimization, while the work in [17] proposes a quadratic programming optimization problem. The difficulty with these approaches is that they provide as output a mixture of policies that satisfies Equation (6) only in expectation. Given a finite set of policies $\pi_i$, $i = 1, 2, \ldots, n$, a new mixed policy can be built whose discounted feature expectation vector is a convex combination of those of the set of policies,

$$f_{\pi_{mix}} = \sum_{i=1}^{n} \lambda_i f_{\pi_i}, \qquad \lambda_i \geq 0, \qquad \sum_{i=1}^{n} \lambda_i = 1,$$

where the probability of choosing policy $\pi_i$ is $\lambda_i$. We can calculate the probabilities $\lambda_i$ by finding the closest point to $f_\varsigma$ in the convex closure of $f_{\pi_i}$, $i = 1, 2, \ldots, n$, which amounts to solving a quadratic programming problem [17]. In the simulation domain (and specifically in pedestrian simulation), a mixture of policies that satisfies feature matching only in expectation is not acceptable because it can produce a feeling of random and non-realistic behaviors. Other works [18,19] have focused on avoiding the degenerate solution of the optimization problem.
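The empirical feature expectation vector estimated from examples can be sketched as follows. This is only an illustration under stated assumptions: trajectories are lists of (state, action) pairs, the estimate is a plain average over trajectories, and the names `feature_expectation`, `State`, and `Trajectory` are hypothetical helpers introduced here, not part of the authors' simulator.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Action = int
Trajectory = Sequence[Tuple[State, Action]]
Feature = Callable[[State, Action], float]


def feature_expectation(
    trajectories: Sequence[Trajectory],
    features: Sequence[Feature],
    gamma: float,
) -> List[float]:
    """Average over trajectories of the discounted sum of each feature."""
    totals = [0.0] * len(features)
    for trajectory in trajectories:
        for j, (state, action) in enumerate(trajectory):
            weight = gamma ** j  # discount applied to the j-th step
            for k, phi in enumerate(features):
                totals[k] += weight * phi(state, action)
    return [total / len(trajectories) for total in totals]


# Toy data (assumed): one state coordinate, two actions, two short examples.
FEATURES = [lambda s, a: s[0], lambda s, a: 1.0 if a == 1 else 0.0]
EXAMPLES = [
    [((1.0,), 0), ((2.0,), 1)],
    [((3.0,), 1)],
]
```

The same function can be applied to trajectories sampled from a learned policy, so that both vectors being matched are computed in the same way.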
Recently, deep learning has been incorporated into the IRL framework for problems with large or continuous state and action spaces. The work by [20] proposes a version of the Maximum Entropy IRL problem with a nonlinear representation of the reward function using a neural network for robotic manipulation. The work by [21] derives, from the generative adversarial network paradigm, a model-free imitation learning algorithm that directly extracts a policy from data.
Some of these approaches have been used in the problem domain of the pedestrian simulation. The work by [22] uses the approach discussed in [17]. The authors propose a geometric interpretation that builds a convex hull extracting critical information from the normals of the facets of that hull. The problem is that the process suffers from the curse of dimensionality with the number of chosen features. The work in [23] extends the problem to the multi-agent setting. The authors study a traffic-routing domain and find a formulation similar to that proposed in [16]. The multi-agent problem is decomposed in multiple agent-based optimizations, each one focusing on a part of the state space assuming a complete observation of it. Recently, interest has also focused on deep learning-based solutions. The work by [24] uses Deep Inverse Reinforcement Learning for multiple path planning for predicting crowd behavior in the distant future. The work by [25] implemented Inverse Reinforcement Learning using two algorithms: the Maximum Entropy algorithm and the Feature Matching algorithm to estimate the reward function of cyclists in following and overtaking interactions with pedestrians.
In this paper, we propose formulating IRL as a maximum causal entropy estimation task [26,27] to overcome the problems of policy mixture creation and degenerate solutions of the previously cited proposals. The maximum entropy principle [28] is an extensively used principle in statistics and probability theory. Applied in the IRL context, it prescribes the policy that is consistent with the example set of trajectories and makes no other assumptions (bias) about the problem or data [26]. This criterion is applied by maximizing Shannon's conditional entropy $H(Y|X) = E_{Y,X}[-\log P(Y|X)]$, where $Y$ are the predicted variables of the model, $X$ is the side information about the problem that we do not want to model, and the expectation is taken with respect to the joint probability of $Y$ and $X$. In the MDP context, the predicted variables $Y$ are sequences of actions in $A$ and the side information variables $X$ are the provided sequences of states in $S$ generated by interaction with the environment [26]. The causal entropy [29,30] goes a step further, considering that the causally conditioned probability of a random variable $A_t$ depends only on the previous sequence of states and actions in the time sequence, $P(A \| S) = \prod_{t=1}^{T} P(A_t \mid S_{1:t}, A_{1:t-1})$ (note the double vertical line that distinguishes causal probability from conditional probability, and note that the sequences of variables run from 1 to $t$ for states and from 1 to $t-1$ for actions). The causal entropy is then defined for an infinite horizon context as [27,31]:

$$H(A \| S) = \sum_{t=0}^{\infty} \gamma^{t}\, E_{S_t, A_t}\left[-\log \pi(A_t \mid S_t)\right]$$

where the expectation in each time period $t$ is taken with respect to the joint probability distribution of $A_t$ and $S_t$, which depends on the transition probability of the MDP ($P$) and the policy $\pi$. The formulation of IRL as a maximum causal entropy estimation is as follows:

$$\max_{\pi}\ H(A \| S) \qquad (8)$$

subject to:
$f_\pi = f_\varsigma$
$\pi(a|s) \geq 0 \quad \forall a \in A, s \in S$
$\sum_{a \in A} \pi(a|s) = 1 \quad \forall s \in S$.
To solve this non-convex optimization problem, it is converted to an equivalent convex one by considering a Lagrangian relaxation of Equation (8) and a dual problem formulation. From now on in this subsection, we follow the discussion and the proofs given in [31]. Relaxing the feature-matching constraint with a vector of multipliers $\theta$ gives:

$$\max_{\pi}\ H(A \| S) + \theta^\top (f_\pi - f_\varsigma) \qquad (9)$$

subject to:
$\pi(a|s) \geq 0 \quad \forall a \in A, s \in S$
$\sum_{a \in A} \pi(a|s) = 1 \quad \forall s \in S$.

For a fixed $\theta$, the objective function can be maximized since the feasible set is closed and bounded. Then, we define $g(\theta)$ as the value of this optimization problem (Equation (9)) for a fixed $\theta$. Since any solution of the optimization problem shown in Equation (8) is feasible for this problem and makes the relaxation term vanish, $g(\theta)$ is an upper bound of the optimal value of Equation (8). The dual problem formulation finds the lowest upper bound:

$$\min_{\theta}\ g(\theta) \qquad (10)$$

This is a convex optimization problem. The work by [31] proves that strong duality holds for the original problem (Equation (8)) and the dual problem (Equation (10)). Therefore, if we denote by $p^*$ the primal optimal value, then the following holds: $p^* = \min_\theta g(\theta)$.
We can relax the optimization problem of Equation (9) for the other constraints, leading to another Lagrangian relaxation optimization problem:

$$\max_{\pi}\ H(A \| S) + \theta^\top (f_\pi - f_\varsigma) + \sum_{s \in S} \lambda_s \Big( \sum_{a \in A} \pi(a|s) - 1 \Big) \qquad (11)$$

subject to: $\pi(a|s) \geq 0 \quad \forall a \in A, s \in S$.

Following the same reasoning as before, the dual problem

$$\min_{\theta, \lambda_s}\ g(\theta, \lambda) \qquad (12)$$

is a convex optimization problem.

Soft Q-Learning
The practical solution of the problem formulated in Equation (12) takes the form of a gradient-based algorithm [31]. It has three steps: first, find a parameterized policy $\pi_{\theta_i}$ with the current reward $R_i(s, a) = \sum_k \theta_{i,k}\, \phi_k(s, a)$. Second, evaluate the policy and calculate $f_{\pi_{\theta_i}}$. Third, use gradient ascent with respect to the parameters $\theta$ to adjust the linear reward; the ascent direction is $f_\varsigma - f_\pi$. When the transition probabilities $P$ are known, the policy can be found using Dynamic Programming. If they are not known, as in this case (model-free case), the authors propose the Soft Q-learning algorithm, a variation of Q-learning, which is a well-known temporal difference learning method in RL. The scheme is summarized in Algorithm 1.
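The three-step loop can be sketched as below. This is an illustrative outline, not the authors' implementation: the callback `evaluate_policy` stands for the whole "learn a policy with Soft Q-learning, then estimate its feature expectations" stage, and `toy_feature_expectation` is a made-up stand-in (a component-wise tanh) used only so that the sketch runs on its own. The names `irl_outer_loop`, `step_size`, and `eps` are assumptions.

```python
import math
from typing import Callable, List


def irl_outer_loop(
    f_expert: List[float],
    evaluate_policy: Callable[[List[float]], List[float]],
    step_size: Callable[[int], float],
    eps: float,
    max_iters: int,
) -> List[float]:
    """Adjust theta until the policy's feature expectations match f_expert."""
    theta = [0.0] * len(f_expert)
    for i in range(max_iters):
        # Steps 1 and 2: obtain a policy for the reward theta . phi and evaluate it.
        f_pi = evaluate_policy(theta)
        diff = [fe - fp for fe, fp in zip(f_expert, f_pi)]
        if math.sqrt(sum(d * d for d in diff)) <= eps:
            break
        # Step 3: move theta along f_expert - f_pi.
        theta = [t + step_size(i) * d for t, d in zip(theta, diff)]
    return theta


def toy_feature_expectation(theta: List[float]) -> List[float]:
    """Hypothetical stand-in for 'run Soft Q-learning and evaluate the policy'."""
    return [math.tanh(t) for t in theta]
```

In the real scheme each call to `evaluate_policy` hides a full RL run, which is why the number of outer iterations dominates the computational cost.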
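The Soft Q-learning algorithm uses a recursive update rule to back up the value function, defined as:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t \Big[ R(s_t, a_t) + \gamma \log \sum_{a' \in A} \exp Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \Big] \qquad (13)$$

where $\alpha_t$ is the learning rate. In [31], the authors prove that this rule is a contraction that converges to a unique parameterized softmax policy $\pi^{soft}_\theta$. This policy is defined as

$$\pi^{soft}_\theta(a|s) = \frac{\exp Q_\theta(s, a)}{\sum_{a' \in A} \exp Q_\theta(s, a')}$$

for a given (learned) value function $Q_\theta$ with fixed $\theta$. The problem is then solved with an iterative scheme in which the reward is adjusted using gradient ascent, a new policy $\pi_\theta$ is obtained by means of a Soft Q-learning process, and a feature vector $f_\pi$ closer to $f_\varsigma$ is calculated.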
Soft Q-learning [31] is described in Algorithm 2.

Algorithm 2: Soft Q-Learning
Result: Value function $Q$
Input: Constant vector $\theta$, collection of functions $\phi_i(s, a)$, parameter $\gamma$;
Set $t = 0$ and $s_0$;
Initialize $Q_0(\cdot, \cdot)$ arbitrarily;
Use a uniform random policy $\pi$;
while the maximum number of episodes has not been reached do
    generate a sample $(s_t, a_t, s_{t+1}, R(s_t, a_t))$ using $\pi$;
    update $Q_{t+1}(s_t, a_t)$ following the rule of Equation (13);
end
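A tabular sketch of this procedure is given below, assuming the usual soft backup in which the maximum over next actions is replaced by a log-sum-exp, and a constant learning rate `lr`. The three-state chain (`chain_step`, `chain_reward`) is a hypothetical toy problem introduced only to make the sketch runnable; the paper itself works with a continuous state space and a function approximator.

```python
import math
import random
from typing import Callable, List


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) used as the soft maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def soft_q_learning(n_states: int, n_actions: int,
                    step: Callable[[int, int], int],
                    reward: Callable[[int, int], float],
                    gamma: float, lr: float,
                    episodes: int, horizon: int, seed: int = 0) -> List[List[float]]:
    """Learn a soft Q-table from samples drawn with a uniform random policy."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0  # every episode starts in state 0 (toy assumption)
        for _ in range(horizon):
            a = rng.randrange(n_actions)
            s_next = step(s, a)
            target = reward(s, a) + gamma * log_sum_exp(q[s_next])
            q[s][a] += lr * (target - q[s][a])
            s = s_next
    return q


def softmax_policy(q_row: List[float]) -> List[float]:
    """Action probabilities proportional to exp(Q(s, a))."""
    z = log_sum_exp(q_row)
    return [math.exp(v - z) for v in q_row]


def chain_step(s: int, a: int) -> int:
    """Toy dynamics: action 1 moves right, action 0 moves left, states 0..2."""
    return min(s + 1, 2) if a == 1 else max(s - 1, 0)


def chain_reward(s: int, a: int) -> float:
    """Toy reward: moving right is rewarded."""
    return 1.0 if a == 1 else 0.0
```

For example, `softmax_policy(soft_q_learning(3, 2, chain_step, chain_reward, 0.9, 0.1, 500, 20)[0])` yields a distribution that favors the rewarded action while keeping a nonzero probability for the other one, which is the stochastic behavior the maximum entropy formulation prescribes.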

IRL for the Navigation Domain
In this section, we briefly describe our problem domain and the decisions made to implement the IRL framework described in Section 2.

MDP Setup
In our set-up, an embodied 3D agent, represented by a circular shape with dimensions similar to the mean personal space of a pedestrian, moves inside a 3D virtual environment. The agent has a set of actions that allow it to modify its speed and direction. The goal is to arrive at a target place in the environment. The state of the agent is defined in each step by a vector of real-valued features. The agent has a limited number of steps to reach the goal in one episode. At each step, the agent selects a pair of actions, one to change the speed and the other to change the direction of the velocity.
The environment uses a physics engine called ODE (www.ode.org) that is calibrated to represent human interaction with real environments while walking. The environment consists of two walls that configure a maze (see figures of Section 4). The embodied agent is placed in a point opposite to the goal and is tasked with reaching the target crossing the maze using the internal corridor formed by the two walls. Next, we describe briefly the main characteristics of the state space and the actions space. The interested reader can find a complete description of the environment in [6,32].
The action space is discrete and has two types of actions. One type controls the speed and the other one controls the direction of the velocity. We have divided each type of action into nine different intensities. For the control of the speed, these are: four to increase the speed by $\frac{1}{16}$, $\frac{1}{8}$, $\frac{1}{4}$, or $\frac{1}{2}$ of a reference value (maximum velocity), four to decrease the speed in the same proportions, and a 'Do nothing' action. For the control of the direction of the velocity, we set an imaginary line between the agent and the goal. Then, we define four actions to turn to the right of the line by $\frac{\pi}{32}$, $\frac{\pi}{16}$, $\frac{\pi}{8}$, or $\frac{\pi}{4}$, the same for the left of the line, and a 'Do nothing' action.
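The resulting joint action set can be enumerated as in the sketch below. Several details are assumptions made for illustration and are not stated in the text: positive turn values are taken to mean "left", speed changes are applied as increments clamped to the range from zero to the maximum velocity, and the helper names (`signed_levels`, `apply_action`) are hypothetical.

```python
import math
from itertools import product
from typing import List, Tuple

SPEED_FRACTIONS = [1 / 16, 1 / 8, 1 / 4, 1 / 2]              # fractions of the maximum velocity
TURN_ANGLES = [math.pi / 32, math.pi / 16, math.pi / 8, math.pi / 4]  # radians


def signed_levels(magnitudes: List[float]) -> List[float]:
    """Nine intensities: four negative, 0.0 ('Do nothing'), four positive."""
    return [-m for m in reversed(magnitudes)] + [0.0] + list(magnitudes)


SPEED_ACTIONS = signed_levels(SPEED_FRACTIONS)
TURN_ACTIONS = signed_levels(TURN_ANGLES)
# The agent picks one action of each type per step: 9 x 9 = 81 joint actions.
JOINT_ACTIONS = list(product(SPEED_ACTIONS, TURN_ACTIONS))


def apply_action(speed: float, heading: float,
                 action: Tuple[float, float], v_max: float) -> Tuple[float, float]:
    """Apply a (speed change, turn) pair; clamping to [0, v_max] is an assumption."""
    d_speed, d_angle = action
    new_speed = min(max(speed + d_speed * v_max, 0.0), v_max)
    return new_speed, heading + d_angle
```

A value function over this domain therefore needs one output per joint action, 81 in total, for each state.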
The state space is a vector space constituted by nine real-valued features described in Table 1 and shown in Figure 1.

Table 1
Feature        Description
$S_{ag}$       Speed of the embodied agent
$A_v$          Angle of the velocity with respect to a line joining the agent with the goal
$D_{goal}$     Distance to the goal
$D_{ob_i}$     Distance to wall $i$
$A_{ob_i}$     Angle of the position of wall $i$ relative to a line joining the agent with the goal

Since the features are continuous, a generalization method is needed to map the infinite values of the state space to a discrete value function $Q(s, a)$. We use a generalization method called tile coding [33], derived from CMAC [34]. It is a linear function approximation method based on discretization, with sparse binary features. It is based on the definition of several grids that cover the state space. Each grid defines a set of cells. The representation of a point of the space is given by the active cells (those that contain the point) corresponding to each grid. The value function is then a linear combination of the active cells for a state at each time, $Q(s, a) = \sum_{i=1}^{T} \omega_i \varphi_i$, where $T$ is the total number of cells of all grids, $\varphi_i$ is a binary feature with value '1' if the corresponding cell $i$ is active and '0' otherwise, and $\omega_i$ is the value of cell $i$ that is updated by the learning algorithm.
To cope with the curse of dimensionality, a hash table is used so that the cells are mapped to entries in the hash table. This method is effective when the problem is sparse and, therefore, collisions of different cells in the same entry are rare. This is the case here because the part of the state space visited by the policies in the learning process is much smaller than the whole state space. A discussion of this generalization technique in the context of RL can be found in [9].
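A minimal sketch of hashed tile coding follows. It assumes uniformly offset grids with the same cell width in every dimension and uses Python's built-in `hash` on integer tuples to map cells to table entries; the class `TileCodedQ`, the function `active_tiles`, and all parameter values are hypothetical and do not reproduce the configuration used in the experiments.

```python
from typing import List, Sequence


def active_tiles(state: Sequence[float], n_tilings: int,
                 tile_width: float, table_size: int) -> List[int]:
    """One hashed cell index per grid; each grid is shifted by a fraction of a cell."""
    indices = []
    for tiling in range(n_tilings):
        offset = tiling * tile_width / n_tilings
        coords = tuple(int((x + offset) // tile_width) for x in state)
        indices.append(hash((tiling,) + coords) % table_size)
    return indices


class TileCodedQ:
    """Linear Q(s, a): sum of the weights of the active cells, one table per action."""

    def __init__(self, n_actions: int, n_tilings: int,
                 tile_width: float, table_size: int) -> None:
        self.n_tilings = n_tilings
        self.tile_width = tile_width
        self.table_size = table_size
        self.w = [[0.0] * table_size for _ in range(n_actions)]

    def _tiles(self, state: Sequence[float]) -> List[int]:
        return active_tiles(state, self.n_tilings, self.tile_width, self.table_size)

    def value(self, state: Sequence[float], action: int) -> float:
        return sum(self.w[action][i] for i in self._tiles(state))

    def update(self, state: Sequence[float], action: int,
               target: float, lr: float) -> None:
        """Move Q(state, action) toward target, sharing the step among active cells."""
        tiles = self._tiles(state)
        error = target - self.value(state, action)
        for i in tiles:
            self.w[action][i] += lr / len(tiles) * error
```

Because nearby states activate mostly the same cells, an update at one state also changes the value of its neighbors, which is the generalization effect the method relies on.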

IRL Setup
An important decision in the configuration of the IRL framework is the form of the functions $\phi_i(s)$ that are used to calculate the reward $R$ and the feature expectation vectors $f_\pi$ and $f_\varsigma$. It is a problem domain-dependent decision. The work by [17] indicates that the functions must represent key features of the problem (e.g., the distance to an obstacle in an autonomous vehicle problem), while [31] suggests that "[the functions $\phi_i(s, a)$] measure quantities that direct expert decisions". We propose the use of radial basis functions (RBFs) based on the distance of the agent state to several prototypical states as a measure of the similarity of the learned behavior and the real trajectories. The idea is as follows: we have reproduced the real trajectories of the set of examples $\varsigma$ with an agent in our virtual 3D environment and we have collected the states through which the agent goes. Then, we have used K-means to get a set of prototypes that generalize these states. Figure 2 displays a picture of the prototypes and the data points used for the first two features of the state description. Each of these prototypes constitutes the mean of a Gaussian RBF with an empirically fixed standard deviation $\sigma$, $\phi_i(s) = \exp\left(-\frac{d}{\sigma^2}\right)$, where $d = \|s - p_i\|^2$ is the squared distance of a state $s$ to the prototype $p_i$. Therefore, each function $\phi_i(s)$ gives an estimation of the similarity of the present state $s$ to a specific dynamic situation, represented by the prototype $p_i$, similar to those extracted from the real trajectories of the set of examples $\varsigma$. Note that the state features consist of distances and velocities, which is roughly a description of the dynamic state of the individual or agent. With this setting, the feature expectation vector $f_\varsigma$ describes a behavior in terms of the distance of the states generated by the real trajectories to the prototypes. The same occurs with the expectation vector $f_\pi$ with respect to the learned behavior.
The gradient ascent process will adjust the reward function to make both feature expectation vectors similar, which also means a similarity between the dynamic situations (states) learned and those represented by the examples. The assumption underlying this set-up is that the generation of similar dynamic situations in a task implies similar behaviors.
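The feature functions and the linear reward built on them can be sketched as follows. The prototypes below are hypothetical two-dimensional points standing in for the K-means centers, and `sigma`, `rbf_features`, and `linear_reward` are illustrative names; the clustering step itself is not shown.

```python
import math
from typing import List, Sequence


def rbf_features(state: Sequence[float],
                 prototypes: Sequence[Sequence[float]],
                 sigma: float) -> List[float]:
    """phi_i(s) = exp(-d / sigma**2), with d the squared distance to prototype i."""
    features = []
    for prototype in prototypes:
        d = sum((x - p) ** 2 for x, p in zip(state, prototype))
        features.append(math.exp(-d / sigma ** 2))
    return features


def linear_reward(theta: Sequence[float], features: Sequence[float]) -> float:
    """Reward as a linear combination of the feature values."""
    return sum(t * f for t, f in zip(theta, features))


# Hypothetical prototypes; in the paper they come from K-means over real states.
PROTOTYPES = [(0.0, 0.0), (1.0, 0.0)]
```

A state that coincides with a prototype yields a feature value of 1 for it, and the value decays toward 0 as the state moves away, so the reward weights in `theta` decide which prototypical situations are encouraged.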

Experimental Results
The layout of our virtual 3D environment is shown in Figure 3 (right side). There are two walls that create a maze. The embodied agent is placed on the right of the image and the goal is to reach the place labeled with a red triangle on the left. We have carried out the same experiment in a laboratory (with the same dimensions as those of the 3D environment) with real pedestrians (see Figure 3, left side). Several trajectories have been sampled from pedestrians in the laboratory; they constitute the set of examples denoted $\varsigma$ in Section 2.2. These real sampled trajectories are shown in the 3D environment in Figure 3 (right side).

As a baseline, we first try to solve the navigation problem using the standard RL setting. This problem is similar to that considered in [6] with the difference that, here, we consider only one agent inside the maze. The algorithm used is Sarsa(λ) with tile coding as the function approximator. The learning parameters are configured with the values shown in Table 2. The simplicity of the reward signal in this experiment generates undesirable trajectories that try to reach the target using the shortest path, avoiding the corridors of the maze, as shown in the sequence of images in Figure 4.

Table 2 (excerpt)
Parameter                      Description
...                            Grids used in tile coding
reward-goal = 100.0            Reward when reaching the goal
Number of episodes = 20,000    Number of episodes (trials) of the learning process

Instead of designing more complex rewards to get the desired behavior, we use the examples in $\varsigma$ to set up an IRL process that learns similar paths. The parameter configuration for the IRL process is shown in Table 3. The parameters for the RL processes included inside the IRL scheme are the same as in Table 2. Table 4 shows the computational time for the execution of a complete scheme with 100 iterations. The execution was carried out using an SGI ALTIX Ultraviolet 1000 with Nehalem-EX cores (2.67 GHz).

Figure 5 shows the learning curve of the RL process for the final iteration of the scheme described in Algorithm 1. Note that the agent carries out many episodes before the number of successful episodes increases. The agent does not reach the goal consistently (that is, through a sequence of state-actions adequately rewarded) until about episode 14,000. Once the $Q_\theta(s, a)$ values begin to converge to those that define the current policy $\pi^{soft}_\theta$, the probability of using a successful state-action sequence increases. It is likely that the learning process could be finished at this point. The learned behavior after the IRL process is shown in the image sequence of Figure 6. The sequence shows that the behavior learned with the IRL framework is in accordance with the behavior of the real pedestrians, which consists of following the corridor between the walls.

The same experiment has been reproduced using different numbers of prototypes, specifically 64 and 256 prototypes. The three configurations give rational trajectories in over 80% of the simulations in some of the iterations of the scheme. The fact that one iteration provides good results does not mean that the immediately following ones will maintain or improve them. The variations in the reward function due to the gradient ascent process create oscillations in the learning results.

A Procrustes analysis has been carried out to determine the similarity of the shapes of the curves of the discounted feature expectation vectors $f_\varsigma$ and $f_\pi$. It consists of applying shape-preserving transformations to the coordinates of the curves in order to get the best fit (matching) between them. Note that the goal of the gradient ascent process described in Algorithm 1 is to minimize the difference between the feature vectors. Therefore, this similarity in shape is a measure of the quality of the optimization process. The results displayed in Figure 7 show that both curves are similar, which supports the effectiveness of the method.
The Procrustes distance $P$ is the sum of the squared differences of the transformed coordinates. For the curves in our experiment, the distance is $P_{f_\varsigma - f_\pi} = 10^{-5}$. As a reference value, the Procrustes distance of the real feature vector $f_\varsigma$ with respect to a random vector with values in the same range is $P_{random} = 0.0015$.

Figure 7. Curves for vectors $f_\varsigma$ and the last iteration $f_\pi$ after being pre-processed by Procrustes analysis. Note the similarity of both curves.
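The kind of computation behind a Procrustes distance can be sketched for planar curves as below. This is a simplified illustration under stated assumptions, not the procedure used in the experiments: curves are lists of 2D points with the same length, both are centered and scaled to unit norm, only rotations (no reflections) and a uniform scale are allowed, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _standardize(points: Sequence[Point]) -> List[Point]:
    """Center the curve at the origin and scale it to unit Frobenius norm."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    centered = [(p[0] - cx, p[1] - cy) for p in points]
    norm = math.sqrt(sum(x * x + y * y for x, y in centered))
    return [(x / norm, y / norm) for x, y in centered]


def procrustes_distance(curve_a: Sequence[Point], curve_b: Sequence[Point]) -> float:
    """Sum of squared differences after the best rotation and scaling of curve_b.

    For standardized 2D curves the optimum has a closed form:
    distance = 1 - (dot**2 + cross**2).
    """
    a = _standardize(curve_a)
    b = _standardize(curve_b)
    dot = sum(x * u + y * v for (x, y), (u, v) in zip(a, b))
    cross = sum(y * u - x * v for (x, y), (u, v) in zip(a, b))
    return max(0.0, 1.0 - (dot * dot + cross * cross))
```

The distance is zero when one curve is a translated, rotated, and uniformly scaled copy of the other, and it grows toward one as the shapes differ, which is why small values indicate a good match between the two feature expectation curves.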

Conclusions and Future Work
The results of the experiment demonstrate the effectiveness of the proposed approach using the IRL framework for pedestrian navigation. In particular, the experiment empirically confirms that the vector of Gaussian RBF feature functions is an adequate description of the dynamics of the agent, in the sense that similar vectors imply similar dynamics. Nevertheless, the proposed IRL framework is computationally expensive. We have used one hundred iterations (each one including an RL learning process) to adjust the reward to this problem. It is debatable whether such computational effort is necessary for this simple case. As demonstrated in [6], a more sophisticated reward can solve this problem in the RL context. Nevertheless, the use of expert examples is an alternative in complex cases where the design of the reward is not evident. This opens the door to new ways of 'editing' and improving behaviors obtained with RL-based simulation frameworks, relying on examples provided by experts or on real behaviors, as in this case. More importantly, the examples make the calibration of the learning process stronger, in the sense that the learned behaviors imitate the real ones, which improves the trustworthiness of the simulations.
Future work points towards the use of this technique in multi-agent simulations. Moreover, a comparison with other optimization algorithms in IRL is necessary to configure an effective framework for IRL-based navigational simulations.

Acknowledgments:
The authors want to thank Vicente Boluda for collecting the data of the trajectories in the laboratory.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

RL     Reinforcement Learning
IRL    Inverse Reinforcement Learning
MDP    Markov Decision Process