Adjustable and Adaptive Control for an Unstable Mobile Robot Using Imitation Learning with Trajectory Optimization

Abstract: In this contribution, we develop a feedback controller in the form of a parametric function for a mobile inverted pendulum. The control both stabilizes the system and drives it to target positions with target orientations. A design of the controller based only on a cost function is difficult for this task, which is why we choose to train the controller using imitation learning on optimized trajectories. In contrast to popular approaches like policy gradient methods, this approach allows us to shape the behavior of the system by including equality constraints. When transferring the parametric controller from simulation to the real mobile inverted pendulum, the control performance is degraded due to the reality gap. A robust control design can reduce the degradation. However, for the framework of imitation learning on optimized trajectories, methods that explicitly consider robustness do not yet exist to the knowledge of the authors. We tackle this research gap by presenting a method to design a robust controller in the form of a recurrent neural network, to improve the transferability of the trained controller to the real system. As a last step, we make the behavior of the parametric controller adjustable to allow for the fine tuning of the behavior of the real system. We design the controller for our system and show in the application that the recurrent neural network has increased performance compared to a static neural network without robustness considerations.


Introduction
The control of mobile and unstable systems is, in most cases, divided into a stabilizing and a maneuvering part, e.g., [1]. This breakdown of the problem into two separate tasks makes analytic control designs manageable; however, the final performance of the system will be limited compared to holistic approaches. General holistic approaches are difficult to derive analytically for nonlinear systems. Thus, methods based on optimization and learning come into play, which train a control law in the form of a parametric function based on a cost function. Most notably, methods that emerged from the field of reinforcement learning, e.g., policy gradient methods like PPO [2], are used, as well as methods from the field of gradient-free optimization. For systems with relatively slow dynamics, nonlinear model predictive control can be used [3]; it is, however, not suited for fast systems due to the continuous online optimization with non-deterministic computing time. In this work, we design a parametric controller for the position and orientation control of a mobile inverted pendulum (MIP), without the need to compute trajectories online.
The control laws derived with the aforementioned approaches are generally trained in a simulation environment and often do not transfer well to the real system due to model inaccuracies, also called the reality gap. In order to overcome this challenge, different approaches have been developed. Some methods update or learn a highly precise model online, in order to reduce the reality gap [4,5]. Other approaches use adversarial attacks during the training of the controller in order to enforce robustness [6,7]. Other works focus on training a controller on a multitude of varying model dynamics, also called model ensembles [8-10]. The current research on overcoming the reality gap for this type of control law is mostly focused on methods from the domain of reinforcement learning. In this work, we tackle the reality gap problem for a less common class of design methods.
We use a method that is related to imitation learning [11] and has been referred to as a form of explicit model predictive control [12]. The method was also mentioned in [13], where a more complex and expensive optimization problem was then derived. In a second step, we extend the method to adapt online to model uncertainties by using a dynamic control law in the form of a recurrent neural network and by randomizing model parameters during training. Different from Peng et al. [9] who also use a dynamic control law, our method is based on the imitation learning framework instead of reinforcement learning, which allows us to shape the behavior of the system using equality constraints. With the objective of the application on the real system, we also make the behavior of the final control law adjustable by an external user via a trivial extension of the method.

Related Work
This section gives an overview of methods and research in two different areas related to our approach. The first area deals with the learning of parametric controllers that have a similar control performance in simulation and in the application on a real system. The second area is concerned with learning parametric controllers, based on a combination of imitation learning and trajectory optimization.

Training Robust Parametric Controllers
Most parametric controllers are trained in simulation due to the high cost of real-world data. The gap between the simulation model and reality can lead to poor performance or even failure of the controller when applied to the real system. Many articles have been published that either tackle the problem of reducing the reality gap or train a controller that can perform well despite the reality gap.
In order to improve the accuracy of the simulation model and as such reduce the reality gap, data concerning interactions with the real system have to be considered. Abbeel et al. [14] use an iteratively updated model to compute an improvement direction for the control parameters using a policy-gradient method. The step-size of the update is, however, determined on the real system. This is beneficial, as determining a good step size requires less interaction with the real system than estimating a gradient direction. The model is retrained after each update of the control parameters with new data. In Deisenroth and Rasmussen [4] a model in the form of a Gaussian Process is iteratively updated and a control improvement is done offline after each update, using gradient descent on the accumulated costs. In Golemo et al. [5], the model error is approximated using a recurrent neural network, trained on data from the real system, and a controller is then derived in simulation using the policy-gradient algorithm PPO [2].
Recently, more and more works focus on designing controllers that are robust with regard to model errors, rather than further increasing the model accuracy. The methods can be roughly classified into methods that use a multitude of model parameters for training and methods that use a disturbing entity, called the adversary. A set of different model parameters is used in Mordatch et al. [15] to compute robust trajectories for a humanoid robot. As such, this work does not fall completely in the scope of robust parametric control, but rather robust trajectory generation. A non-parametric controller was used to deal with small system deviations online. Rajeswaran et al. [8] propose to train robust controllers using randomized model parameters and a policy-gradient algorithm, while using only a subset of the worst trajectories to approximate the policy-gradient. This can be seen as a relaxation of a min-max formulation, common in nonlinear robust control [16]. An adversarial approach is introduced in Pinto et al. [7] to train a robust controller. Both the controller and the adversary are parametric functions that are trained using a policy-gradient algorithm. The approach also approximates the min-max problem, with the robust controller minimizing the total costs, and the adversary maximizing them again through bounded adversary actions. A different approach is presented in Yu et al. [17]. Instead of approximating a min-max solution, the parametric controller is trained with uncertain model parameters as additional inputs using the policy-gradient algorithm TRPO [18]. As the parameters of the real system are unknown during the application, an estimator/observer is trained in a second step, and the union of estimator and controller is able to adapt to the real dynamics by guessing the current system parameters during application. Pattanaik et al. [6] fool the policy during training by perturbing the perceived state. In this case, the perturbations, or adversarial attacks, do not result from a second controller but are either sampled randomly or computed based on the action-value function. Muratore et al. [10,19] use randomized model parameters to estimate the simulation optimization bias and prevent the learned policy from overfitting to the simulator. While the focus of Bousmalis et al. [20] lies on their vision component, the model parameters of the simulated robot arm are also randomized in order to improve the robustness. Peng et al. [9] train a recurrent neural network using reinforcement learning, while randomizing the model parameters. While the recurrent neural network only sees the current states and needs to learn a hidden state representation, the critic, i.e., the action value function, is omniscient in the sense that it has the model parameters as additional input. Chebotar et al. [21] also randomize model parameters, but focus their contribution on adapting the distribution of the model parameters iteratively by also including interactions with the real system. The policy is again learned using the policy-gradient algorithm PPO [2].
When looking at existing literature in this field, it is noticeable that the parametric controllers are solely trained using policy-gradient algorithms. At present, the most widely used methods to train a parametric controller stem from the field of reinforcement learning. However, alternatives exist, and every method has its own strengths and weaknesses. Some methods from the field of gradient-free optimization, most notably evolution strategies, have proven to be competitive or even superior in certain cases [22,23]. Reinforcement learning algorithms learn by exploring in action space and by increasing the probability of better-than-average actions, while evolution strategies explore in parameter space and increase the probability of better-than-average parameters [24]. Yet another class of algorithms to learn parametric controllers, which is the focus of our contribution, is a combination of trajectory optimization and supervised learning. This can be seen as a special case of imitation learning. This last class of algorithms proved to be the best suited for the task at hand for reasons mentioned later, but, to the best of our knowledge, methods focusing on robustness and transferability have not yet been investigated for it.

Imitation Learning with Trajectory Optimization
As the name imitation learning suggests, the field is concerned with learning a parametric control law that imitates the behavior of a teacher, based on demonstrations on how to solve the task. In most of the current literature concerned with imitation learning, the expert is considered to be a human. A survey on applied methods with a human expert can be found in [11]. Since we will not have a human expert, we only consider, with a few exceptions, related works where the teacher is not a human but an optimizer. The approach of training a controller by combining trajectory optimization with supervised learning is far from new, yet no umbrella name for these methods has been established in literature. In this paper, we refer to these methods as imitation learning with trajectory optimization. The following literature contains works from both the control and machine learning communities.
An early application of this method is found in [25], where a neural network is trained as a controller for a wheeled robot. The task is to follow a desired path while avoiding obstacles, with results also shown in the real application. An implementation with results in simulation for a chemical process can be found in Åkesson et al. [26]. An application of the same method to the control of a semi-active suspension system is shown in [12], where the authors report improved control performance with regard to their previous analytic control design.
Ross et al. [27] tackle a fundamental problem of imitation learning: the states visited by the expert during the demonstration and later used for supervised learning and the states that the controller will visit during execution are not distributed equally. Therefore, they propose an iterative approach where the controller samples states during execution and the expert provides actions for those states, rather than showing his own solution to the task. The training data obtained in this way is aggregated in a growing data-set. The method, called DAGGER for data-set aggregation, was proposed for use with human experts. He et al. [28] use DAGGER with an artificial expert. In each iteration their expert generates actions that are suboptimal, yet close to the current controller, with the intention of creating actions that can be learned faster than the optimal action. They show small improvements with regards to an expert that always generates optimal actions. Laskey et al. [29] propose adding noise to an artificial expert's demonstrations with the aim of learning a policy that is able to recover from errors. The power of the noise is optimized such that the distribution of expert trajectories approaches the distribution of trajectories produced by the parametric controller in the loop. The approach is called DART for "disturbances for augmenting robot trajectories".
Mordatch and Todorov [13] combine the trajectory optimization step with the supervised learning step by combining their objectives into a single optimization problem: minimizing costs of states and actions (trajectory optimization) and learning a parametric controller that can reproduce similar actions (supervised learning step). They then solve the optimization problem by iterating between regularized trajectory optimization and supervised learning. On simulation tasks, they show a decrease in the expected costs for their method, compared to a separation of trajectory optimization and supervised learning. In a later work [30], Mordatch et al. show the performance for more complex simulations and add a small number of recurrent states as well as sampled noise to their optimization setting.
Levine and Koltun [31,32] formulate a similar idea as a stochastic optimization problem by minimizing the Kullback-Leibler divergence between the distribution of actions produced by the controller and the actions of a distribution with high probability of low costs. They name their approach guided policy search. Guided policy search is also used by Zhang et al. [33] for the obstacle avoidance of a simulated quadrotor. However, they use a simulation of a model predictive controller (MPC) instead of optimizing the trajectory as a whole, and the supervised learning step uses only a subset of the states that the MPC was allowed to use. They extend their algorithm for the simulated quadrotor problem in Kahn et al. [34] to use an adaptive MPC during data generation and call their approach PLATO. As such, the final controller works on states that are available during real flights instead of the augmented state representation available to the MPC. Guided policy search also allows Levine et al. [35] to learn a convolutional neural network controller for a physical robot arm, using only pixel data as control inputs.
We refrain from using the previously mentioned methods, coupling supervised learning and trajectory optimization into a single objective, due to the increase in computation time and complexity and our focus on transferability to the physical system.

Design of an Adaptive and Adjustable Controller Using Imitation Learning
In this section, we present a new approach for training a parametric controller offline that is robust to model uncertainties by adapting to the system's behavior during execution. The approach uses randomized dynamics and a controller with internal states, e.g., a recurrent neural network. The idea has similarities to [9]; however, we use imitation learning with trajectory optimization instead of reinforcement learning. Different from [9], the approach that is introduced in this section can use time-dependent cost functions and equality constraints to define the behavior of the system. The inclusion of equality constraints is essential for the control design of our system, as is explained later in Section 5.1. While the final controller has recurrent states to account for model uncertainties similar to [9], the training method and therefore the admissible problem setting differ fundamentally. Moreover, the presented approach consists of a sequence of well established and understood sub-problems, whereas the success of reinforcement learning algorithms depends on hyper-parameters that can be difficult to tune [36].
Training a controller using supervised learning on optimized trajectories does not work for controllers with recurrent states. We address the problem by using ideas similar to DAGGER [27] and DART [29]. The approach uses an intermediate policy with model parameters as input, similar to the universal policy in [17], which is necessary to reduce computation time to an acceptable level. The intermediate policy is called the oracle network and acts as a teacher for the final recurrent controller. The oracle network itself can only be used in simulation, whereas the recurrent controller is suitable for both simulation and application.
The approach is divided into three consecutive optimizations and does not require the real system in the loop. The three main parts are:
1. Trajectory optimization with randomized model parameters.
2. Training of the oracle network on the optimized trajectories.
3. Training of the robust recurrent network with the oracle network as teacher.
We also show a trivial extension of the approach, in order to make the final controller adjustable by the operator. In the case of, e.g., a neural network controller, the parameters of the controller are not interpretable, in contrast to, e.g., PID controllers. By making the controller adjustable, we do not have to repeat the whole optimization if only a small change in the behavior is required.
In the following, we assume a dynamics model that is discretized in time and continuous in the state and action space. We denote the system state at time $t$ by $x_t \in \mathbb{R}^n$ and the control signal, or action, by $u_t \in \mathbb{R}^m$. We also assume for simplicity that the dynamics model is deterministic, as is most often the case for models that are derived from laws of physics. The dynamics also depend on constant but possibly uncertain parameters $p$, e.g., friction coefficients or moments of inertia. The nonlinear state space model is of the form
$$x_{t+1} = f(x_t, u_t, p). \quad (1)$$

Trajectory Optimization
The first part of our method consists of creating and storing optimal trajectories starting from relevant initial states. We also limit ourselves to trajectories over a fixed time horizon $T$. A distribution $D_{x_0}(x)$ over relevant starting states $x_0$ is defined by the control engineer. If a policy is directly derived using supervised learning on the trajectories, it is necessary to include starting states in $D_{x_0}(x)$ with velocities that are diverging from the desired state or trajectory. Otherwise, the learned policy will not be able to recover once diverging, e.g., because of a perturbation or simply because of modeling errors. In order to have a robust policy at the end, we also want to define a distribution over uncertain model parameters $D_p(p)$. Moreover, the control engineer has to define the time horizon $T$ over which the trajectories are optimized and provide a cost function $c(x_t, u_t)$ and end costs $c_T(x_T)$ defining the behavior and goal of the control.
Let $X = [x_0, x_1, \dots, x_T]$ and $U = [u_0, u_1, \dots, u_{T-1}]$ be the states and control inputs of a trajectory. An optimal trajectory for a given starting state $x_0 \sim D_{x_0}(x)$ and parameter set $p \sim D_p(p)$ is calculated by solving the optimization problem in Equation (2). Equations (2d) and (2e) represent equality and inequality constraints, which can be used to consider, e.g., bounded control inputs or to enforce convergence to a specific terminal state $x_{T,\mathrm{ref}}$. If the control signals are not smooth, the inequalities can also be used to enforce smooth signals. This is advised, as the control will be approximated by a smooth function approximator later.
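For orientation, one form of the trajectory optimization problem that is consistent with the description above can be sketched as follows. The dynamics function $f$ and the constraint functions $h_{\mathrm{eq}}$ and $h_{\mathrm{ineq}}$ are placeholder symbols assumed here for illustration only, and the exact arrangement of the individual lines in Equation (2) may differ.

```latex
\begin{align*}
\min_{X,\,U}\quad & c_T(x_T) + \sum_{t=0}^{T-1} c(x_t, u_t) \\
\text{s.t.}\quad & x_{t+1} = f(x_t, u_t, p), \qquad t = 0, \dots, T-1, \\
& x_0 \ \text{given}, \\
& h_{\mathrm{eq}}(X, U) = 0, \\
& h_{\mathrm{ineq}}(X, U) \le 0 .
\end{align*}
```

In this sketch, a terminal condition such as $x_T = x_{T,\mathrm{ref}}$ would enter through $h_{\mathrm{eq}}$, while bounds on the control inputs would enter through $h_{\mathrm{ineq}}$.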
The problem in Equation (2a) can be solved by either indirect or direct methods [37]. We choose a direct method called multiple shooting [38], as the method works naturally with discretized dynamics and direct methods are known to converge better than indirect methods for initializations that are far from the optimal solution. On the other hand, direct methods are known to produce less accurate solutions. For direct multiple shooting, all equality constraints, i.e., Equations (2b) and (2d) are handled using Lagrange multipliers. Either sequential quadratic programming or interior point methods, also called barrier methods, can be used to solve the problem including inequality constraints [39]. We use an open-source solver IPOPT [40], which implements an interior point solver.
It is important to note that the optimal action $u^*$ for a state $x$ can be ambiguous due to the finite optimization horizon $T$, as well as the constraints (2d)-(2e). The finite horizon and the optimization constraints can cause the optimal $u^*$ to depend not only on the state $x$ but also on the time $t$, i.e., on whether the state $x$ was visited early or late in the trajectory. As an example, consider a problem in which the constraint $x_3 = 1$ forces the trajectory to return to a region of high costs. The solution to the problem is $X^* = [1, 0, 0, 1]$ and $U^* = [-1, 0, 1]$. In this artificial example, the state $x = 0$ is visited at $t \in \{1, 2\}$, and the optimal actions are $u_1 = 0$, $u_2 = 1$. Thus, in order to provide unambiguous training targets, only the first optimal state-action pair $(x_0^*, u_0^*)$ would have to be used. That would, however, be computationally very expensive, which is why we will use all state-action pairs $(X^*, U^*)$ during training. We believe that for any problem with non-negligible optimization time, the trade-off of introducing ambiguous state-action targets by using all state-action pairs of the trajectories can be justified by the reduction in computation time.
The optimization in Equation (2) is performed multiple times for different $x_0 \sim D_{x_0}(x)$, $p \sim D_p(p)$ until a sufficient coverage of the distributions is achieved. The $N$ optimal trajectories, together with the corresponding model parameters $p$, are stored for later use as tuples $(X_k^*, U_k^*, p_k)$; $k \in \{1, \dots, N\}$.
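The ambiguity can be made concrete with a few lines of Python. The sketch below replays the stated optimal trajectory of the artificial example and groups the optimal actions by state. The integrator dynamics $x_{t+1} = x_t + u_t$ are an assumption made here; they are consistent with the stated solution but are not given explicitly in the text, and the function names are hypothetical.

```python
from collections import defaultdict

# Optimal trajectory of the artificial example, as stated in the text.
X_opt = [1, 0, 0, 1]
U_opt = [-1, 0, 1]


def step(x, u):
    """Assumed integrator dynamics x_{t+1} = x_t + u_t (hypothetical)."""
    return x + u


def actions_per_state(X, U):
    """Collect every optimal action that was recorded for each visited state."""
    targets = defaultdict(list)
    for x, u in zip(X, U):  # zip stops at the last action, so x_T is skipped
        targets[x].append(u)
    return dict(targets)


# Check that the stated solution is consistent with the assumed dynamics.
assert all(step(x, u) == x_next
           for x, u, x_next in zip(X_opt, U_opt, X_opt[1:]))

targets = actions_per_state(X_opt, U_opt)
# States that carry more than one distinct target action are ambiguous.
ambiguous = {x: us for x, us in targets.items() if len(set(us)) > 1}
print(ambiguous)  # {0: [0, 1]}
```

A least-squares fit on all pairs would tend toward an average of the conflicting targets at $x = 0$, which is the price paid for using every state-action pair of a trajectory.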

Oracle Training
After a multitude of optimal trajectories has been obtained, we train a controller that uses both the state $x_t$ and the model parameters $p$ as inputs, similar to Yu et al. [17]. This controller can only be used in simulation, where the model parameters are known; guessing good parameters when applying the control to the real system would be problematic. Therefore, this controller is only an intermediate result and will be used as a teacher for the recurrent controller.
The control law is represented by a parametric function $u_t = g(x_t, p; \Theta)$. Our intermediate controller is learned using supervised learning on state-action pairs of the optimal trajectories. This is a simple regression problem, and the optimal parameters are found by solving the regression problem in Equation (4). In order to avoid overfitting to the data, it is generally advised to use cross-validation, i.e., to split the data into a dataset for training and a dataset to test the model performance.
The best way to solve (4) depends on the parametrization of $g(\cdot)$. We use fully connected neural networks in this paper and solve Equation (4) using stochastic gradient descent with the optimizer Adam [41]. We call this intermediate parametric controller the oracle, as it is provided with the full model information to generate its output.
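The supervised step can be illustrated with a deliberately small, dependency-free sketch. It is not the network used in this paper: a linear model on hand-picked features stands in for the fully connected neural network, the squared action error is assumed as the regression loss, and the training data are synthetic, with a made-up "optimal" action $u^* = -(1 + p)\,x$. What the sketch does show is the input signature $(x_t, p)$ of the oracle, a hold-out split, and minibatch updates with Adam.

```python
import random


def features(x, p):
    """Hypothetical feature map; a linear model stands in for the neural network."""
    return [x, (p - 1.0) * x]


def predict(theta, x, p):
    return sum(th * f for th, f in zip(theta, features(x, p)))


def mse(theta, data):
    return sum((predict(theta, x, p) - u) ** 2 for x, p, u in data) / len(data)


def train_oracle(train, steps=2000, batch=16, lr=0.02, seed=0):
    """Minimise the squared action error with minibatch gradients and Adam."""
    rng = random.Random(seed)
    theta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    b1, b2, eps = 0.9, 0.999, 1e-8
    for k in range(1, steps + 1):
        grad = [0.0, 0.0]
        for x, p, u in rng.sample(train, batch):
            err = predict(theta, x, p) - u
            for i, f in enumerate(features(x, p)):
                grad[i] += 2.0 * err * f / batch
        for i in range(2):
            m[i] = b1 * m[i] + (1 - b1) * grad[i]        # first moment
            v[i] = b2 * v[i] + (1 - b2) * grad[i] ** 2   # second moment
            m_hat = m[i] / (1 - b1 ** k)                 # bias correction
            v_hat = v[i] / (1 - b2 ** k)
            theta[i] -= lr * m_hat / (v_hat ** 0.5 + eps)
    return theta


# Synthetic stand-in for stored tuples (x_t, p, u_t*); the target action
# u* = -(1 + p) * x is made up for this sketch.
data_rng = random.Random(1)
samples = []
for _ in range(400):
    x, p = data_rng.uniform(-1.0, 1.0), data_rng.uniform(0.5, 1.5)
    samples.append((x, p, -(1.0 + p) * x))
train, val = samples[:300], samples[300:]   # hold-out split

theta = train_oracle(train)
print("validation MSE:", mse(theta, val))
```

Evaluating the loss on the held-out samples, rather than on the training samples, is what indicates whether the fitted oracle generalizes to states and parameters it has not seen.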

Training a Robust Recurrent Network
As previously mentioned, the control law obtained in Section 3.2 is not suited for use on a physical system, as it requires knowledge of the underlying parameters, which we assume to be uncertain. We circumvent this problem by learning a control law that adapts to the dynamics of the system. As in Peng et al. [9], the new control law will use the history of past states and actions to generate the new control signal, as given in Equation (5). As the state transitions, and thus the history $(x_t, u_{t-1}, x_{t-1}, \dots, u_0, x_0)$, are different for each parameter set $p$, a control law in the form of (5) can produce different control outputs for different $p$.
The parametrization of a function in the form of (5) would be inefficient; therefore, a common approach is to learn a representation of the past history as an internal state $h_t$, also called hidden state. A recurrent neural network is an example of such a representation. The general structure of recurrent neural networks is given in Equations (6a) and (6b). We mostly use a shorter notation with $\Theta = [\Theta_1^T, \Theta_2^T]^T$, which combines (6a) and (6b) into a single function. Unlike the system state $x_t$, the learned hidden state is generally not interpretable as a physical quantity. The training of the parameters $\Theta$ for recurrent neural networks using supervised learning is usually done using backpropagation through time. A loss between training inputs and training outputs is defined and a gradient is computed over a sequence of data. For regression problems, the mean squared error is used.
Training recurrent neural networks is computationally more expensive, as the gradient is computed by backpropagating through a sequence, which is similar to backpropagating through a deep neural network with the depth growing according to the sequence length. For long sequences, it is common to accept a bias in the gradient direction by truncating long sequences into smaller chunks [42]. Computing the loss gradient over truncated sequences is called truncated backpropagation through time (TBPTT).
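The bookkeeping behind truncation can be sketched in plain Python. The example below only shows how a long sequence is cut into chunks and how the hidden state is carried from one chunk to the next; it does not compute gradients. The scalar recurrent cell `rnn_step` and all function names are hypothetical and chosen for illustration.

```python
def truncate(sequence, chunk_len):
    """Split one long sequence into consecutive chunks of at most chunk_len steps."""
    return [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]


def rnn_step(x, h, theta):
    """Hypothetical scalar recurrent cell: returns (action, next hidden state)."""
    w_x, w_h, w_u = theta
    h_next = w_x * x + w_h * h
    return w_u * h_next, h_next


def chunk_losses(states, targets, theta, chunk_len):
    """Mean squared error per chunk; the hidden state is carried across chunks.

    In TBPTT, the gradient of each chunk loss would be propagated back only to
    the start of that chunk, treating the carried-in hidden state as a constant.
    """
    h = 0.0
    losses = []
    for xs, us in zip(truncate(states, chunk_len), truncate(targets, chunk_len)):
        loss = 0.0
        for x, u_target in zip(xs, us):
            u, h = rnn_step(x, h, theta)
            loss += (u - u_target) ** 2
        losses.append(loss / len(xs))
    return losses


chunks = truncate(list(range(10)), 4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Carrying the hidden state forward keeps the forward pass identical to the untruncated one; only the backward pass is shortened, which is the source of the gradient bias mentioned above.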
A naive training of a recurrent neural network on sequences generated during the trajectory optimization will not lead to a good control law in the closed loop. This is due to the fact that the sequences in the training data and the sequences seen during execution of the controller stem from different distributions. All training sequences are sequences included in optimal trajectories. Once the controller deviates from those trajectories, no training data is available. To address the problem, we use a mixture of ideas from DAGGER [27] and DART [29] together with our artificial expert trained according to Section 3.2. We train our recurrent neural network on sequences seen during execution of the recurrent neural network, but with actions recommended by the oracle network. Our small change to DAGGER is that we do not accumulate datasets, but learn for a limited number of iterations on the last dataset only. This is motivated by the fact that data generation using our artificial expert is cheap and early trajectories will not be relevant for the sequences that the final controller will produce. Early trajectories can thus be discarded for reasons of efficiency. Similar to DART, we also add a small noise to the data gathering to have more variance in the sequences, especially close to convergence. However, we add our noise to the learning controller instead of to the expert. That way, the distribution of sampled trajectories is always close to the trajectories of the learning controller and we do not have to minimize the Kullback-Leibler divergence between two distributions, as is done in DART. Pseudo-code for generating training data for the recurrent neural network is given in Algorithm 1. Pseudo-code of our DAGGER and DART combination, which we abbreviate in the following as disturbed oracle imitation (DOI), is shown in Algorithm 2.
A visual comparison of the three algorithms DAGGER, DART and DOI is given in Figure 1. DAGGER generates training data on the trajectory created by the controller in the loop and accumulates all data. DART generates data for a distribution of trajectories generated by an expert, with the distribution variance adapted to the trajectories of the controller. DOI generates data on trajectories close to trajectories of the controller in the loop, but does not accumulate past data.

Algorithm 1 Generating training data for the recurrent neural network
Inputs: $g$, $D_{x_0}$, $D_{p_0}$, ...
[Algorithm listing incomplete; surviving steps, inside two nested for-loops: evaluate controller; update $\Theta$ using TBPTT.]

At this point, the reason for the oracle network has to be clarified, since it is a reasonable assumption that one could just solve the optimization problem in Equation (2) to generate the training targets $\hat{u}_t$ for each $x_t$ in Algorithm 1. The reason is again the computational burden, as only the first action of the optimized trajectory would be used. Solving this many trajectory optimizations would only be feasible for small problems. Also, problems occur when a state is visited by the recurrent controller in early iterations where the trajectory optimization is infeasible due to constraints. By creating the intermediate network, all $T - 1$ actions in the optimized trajectories can be used.
For the problem in this paper presented in Section 4, the trajectory optimization is the most expensive part and $T = 500$. A direct use of the optimizer to create training targets $\hat{u}_t$ had to be interrupted due to the extremely slow progress.
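The data-generation idea described in this section (roll out the learning controller with added noise and label the visited states with the oracle's actions) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the scalar dynamics, oracle and recurrent controller are hypothetical stand-ins, and Gaussian noise on the applied action is one possible reading of "noise added to the learning controller".

```python
import random


def generate_training_data(controller, oracle, dynamics, x0, p, horizon,
                           noise_std, rng):
    """Roll out the recurrent controller with action noise and label each
    visited state with the oracle's action (the training target)."""
    x, h = x0, 0.0
    states, targets = [], []
    for _ in range(horizon):
        u, h = controller(x, h)                    # learner acts; it never sees p
        states.append(x)
        targets.append(oracle(x, p))               # oracle is queried with the true p
        u_applied = u + rng.gauss(0.0, noise_std)  # noise on the learner's action
        x = dynamics(x, u_applied, p)
    return states, targets


# Hypothetical scalar stand-ins, chosen only to make the sketch runnable.
def dynamics(x, u, p):
    return x + 0.1 * p * u


def oracle(x, p):
    return -x / p


def controller(x, h):
    h_next = 0.9 * h + 0.1 * x
    return -0.5 * x, h_next


rng = random.Random(0)
dataset = []
for _ in range(5):                      # one data set = a batch of rollouts
    x0 = rng.uniform(-1.0, 1.0)         # stand-in for x0 ~ D_x0
    p = rng.uniform(0.5, 1.5)           # stand-in for p ~ D_p
    dataset.append(generate_training_data(controller, oracle, dynamics,
                                          x0, p, 50, 0.05, rng))
```

In a full training loop, such a data set would be regenerated with the current controller in every outer iteration, used for a limited number of TBPTT updates, and then discarded rather than aggregated.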

Adding Adjustable Behavior
The training of parametric controllers can require a lot of computation time. If the controller does not perform in the desired manner on the real system, the lengthy procedure needs to be repeated and time is lost. By a trivial extension of the presented method, we show how we can make the behavior of the controller adjustable. In the case of simple controllers, e.g., PID controllers, it is common practice to tune the final behavior in the closed loop with the real system [43]. For parametric controllers, however, simply tuning the parameters is not appropriate, as the influence of the change cannot be predicted and the number of parameters is usually high.
The behavior of the controller in our approach is defined by the cost function in Equation (2a). In a similar fashion as we design a controller to work on multiple models, we can design a controller to work on multiple cost functions.
In an initial step, the cost function is augmented to depend on an additional parameter $\lambda$ in a bounded range, e.g., $\lambda \in [-1, 1]$. The influence of the parameter can be, e.g., to shift between penalizing large control signals and penalizing the state costs. The optimal trajectories are then initialized with random initial states and model parameters as well as a cost function parameter sampled from a uniform distribution covering the range. This can also be interpreted as augmenting the state $x_t$ with an additional state $\lambda$, with dynamics $\lambda_{t+1} = \lambda_t$.
The oracle network $g_\lambda(x_t, p, \lambda; \Theta)$ is then trained using supervised learning on the optimal trajectories. The recurrent controller has the state $x$ and $\lambda$ as inputs. The parameter $\lambda$ is constant during execution and set by the control engineer. It can be used to adjust the behavior without training a new controller.
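As an illustration of such a cost augmentation, the following sketch uses a quadratic stage cost whose weighting is shifted linearly by $\lambda$. The specific cost, the weights `q` and `r`, the sampling ranges and all function names are assumptions made for this example; the cost function actually used for the MIP is not implied.

```python
import random


def stage_cost(x, u, lam, q=1.0, r=1.0):
    """Hypothetical quadratic stage cost, shifted by lam in [-1, 1].

    lam = -1 puts all weight on the state cost, lam = +1 all weight on the
    control cost, and lam = 0 weights both equally.
    """
    if not -1.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [-1, 1]")
    w_state = 0.5 * (1.0 - lam)
    w_control = 0.5 * (1.0 + lam)
    return w_state * q * x ** 2 + w_control * r * u ** 2


def sample_problem(rng):
    """Draw one trajectory-optimization instance: initial state, model
    parameter and cost parameter."""
    x0 = rng.uniform(-1.0, 1.0)    # stand-in for x0 ~ D_x0
    p = rng.uniform(0.5, 1.5)      # stand-in for p ~ D_p
    lam = rng.uniform(-1.0, 1.0)   # cost parameter, uniform over its range
    return x0, p, lam


def augmented_step(x, lam, u, p, dynamics):
    """Treat lam as an extra state with trivial dynamics lam_{t+1} = lam_t."""
    return dynamics(x, u, p), lam


print(stage_cost(2.0, 3.0, 0.0))  # 6.5
```

Because $\lambda$ is sampled for every optimized trajectory, the stored data cover the whole range of behaviors, and a single trained controller can later be adjusted by changing one scalar input.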

Task and Model Description
In this section, the mobile inverted pendulum is briefly presented and the control task is clarified. The mathematical model that is used to create the controller is also provided.

System and Task
The mobile inverted pendulum (MIP) is an unstable system with a non-holonomic constraint, described by a nonlinear model. A schematic and a picture of the hardware are shown in Figure 2. The wheels are accelerated by two DC motors. The sensors include a gyroscope, an accelerometer and encoders attached to the motors. The actuation signal for the motors and the sensor signals are handled by a microcontroller, which also communicates with a Raspberry Pi. The Raspberry Pi is used to evaluate the control law, as memory and computing power of the microcontroller are insufficient for the evaluation of our neural network. Because of the challenging system properties of the MIP, it has been extensively studied in the control community. Control laws for the mobile inverted pendulum are often designed around a linearized model, and the system is only operated within small tilt angles and with target positions in close proximity of the MIP. Controllers based on linearized models can be found, e.g., in [44,45]. An example of a control design based on a nonlinear model is given in Ha et al. [46]; however, results are only given in simulation.
The task considered in this work is to design a controller for the MIP allowing it to autonomously drive to a desired position, not necessarily in close proximity, and take a desired orientation. The controller furthermore has to stabilize the tilt angle while driving and it has to be computationally cheap enough to be evaluated in real time (δt = 0.01 s) on the Raspberry Pi.
Let $x$ and $y$ be the position coordinates in an inertial coordinate system $I$ and $\gamma$ the yaw angle of the MIP. While we later want to drive to any position $x_r$, $y_r$ and orientation $\gamma_r$ in the inertial coordinate system, it is sufficient to learn a controller that can drive to a single point and orientation, e.g., $[x_r, y_r, \gamma_r] = 0$. The coordinate transform in Equation (10) allows us to place any reference point different from 0 in the inertial coordinate system $I$ into the origin of a second coordinate system $B$, and we can then use $x_B$, $y_B$ and $\gamma_B$ for control.
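A transform of this kind can be sketched as a planar rigid-body transform. Equation (10) is not reproduced in this section, so the code below is an assumption: it uses the standard translation by the reference position followed by a rotation by the reference yaw angle, and the function name `to_target_frame` as well as the wrapping of the yaw error to $(-\pi, \pi]$ are illustrative choices.

```python
import math


def to_target_frame(x, y, gamma, x_r, y_r, gamma_r):
    """Express the pose (x, y, gamma) in a frame B whose origin and
    heading coincide with the reference pose (x_r, y_r, gamma_r)."""
    dx, dy = x - x_r, y - y_r
    c, s = math.cos(gamma_r), math.sin(gamma_r)
    # Rotate the position error by -gamma_r into the target frame.
    x_b = c * dx + s * dy
    y_b = -s * dx + c * dy
    # Wrap the yaw error so that it stays in (-pi, pi].
    d_gamma = gamma - gamma_r
    gamma_b = math.atan2(math.sin(d_gamma), math.cos(d_gamma))
    return x_b, y_b, gamma_b


# A robot standing exactly at the reference pose maps to the origin of B.
at_target = to_target_frame(1.0, 2.0, 0.5, 1.0, 2.0, 0.5)
# A robot 1 m along the inertial x-axis, seen from a target rotated by 90 degrees.
rotated = to_target_frame(1.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2)
```

With such a transform, a controller trained only for the target $[0, 0, 0]$ can be fed the transformed pose to approach arbitrary reference poses.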

Mathematical Model
The mathematical model of the MIP consists of the body dynamics and the electric drive model. We adopt the model presented in Pathak et al. [47] for our rigid body dynamics. The model has seven states: $x = [x, y, \gamma, \alpha, \dot{\alpha}, v, \dot{\gamma}]^T$. The state variables consist of the position in the inertial coordinates $x$, $y$, the yaw angle $\gamma$ and its time derivative $\dot{\gamma}$, the tilt angle $\alpha$ with time derivative $\dot{\alpha}$ and the forward velocity $v$. The equations of the body dynamics model are derived in [47] and are repeated in Appendix A.
The electric drive model consists of a standard model for DC motors found, e.g., in [48], with parameters determined through identification using measurement data. We neglect the current dynamics in our motor model, as they are much faster than the rest of the system's dynamics. We also add a heuristic friction model with four parameters $c_{\mathrm{fric},1}$ to $c_{\mathrm{fric},4}$. We combine the motor torque constant $k_2$ with the resistance to reduce the number of parameters to identify, since we are only interested in the input-output relationship and not in the real physical values. Our friction and drive models are given in Equation (11).
The identified and measured parameters for drive, friction and body dynamics are given in Table 1. The parameters $c_z$, $I_{yy}$ and $c_{\mathrm{fric},1}$ are assumed uncertain for the control design and are given as a range, since they are hard to determine with high precision and the model dynamics are sensitive to changes in their values.

Application and Results
In this section, we first provide the settings and details of the approach presented in Section 3 for our application on the MIP. Then, a short analysis of the training properties and results on the robustness are given, both in simulation and on the real system. Our final recurrent controller of the form in Equation (9) is compared with different static controllers and the oracle controllers in terms of different robustness metrics. The oracle controllers are included in the simulation results only as a reference for the performance of the recurrent controllers; as mentioned earlier, they cannot be used in the real application.

Control Design Details
The first part of our approach consists of the trajectory optimization. We optimize N = 10,000 trajectories for each controller. We show, however, that the task can be solved using fewer trajectories. The large number of trajectories reduces the influence of random sampling on our results. We sample three data-sets. The first data-set, which is used to train a reference controller, has varying initial states with constant model parameters and a constant cost function. The second data-set has varying initial states and model parameters, with a constant cost function. The third data-set includes varying initial states, model parameters and the cost function parameter.
The initial states are sampled from uniform distributions. All entries of the initial state vector $x$ are perturbed, which is important to include starting states of a falling MIP and thus to obtain trajectories of the MIP recovering from falling. The initial positions are sampled within a radius of 1.1 m around the origin of the inertial coordinate system. The maximum range we consider for the control later is 1 m; all target points outside of this range are projected back onto the radius.
The optimization horizon is 501 steps from $t = 0$ to $t = T = 500$, with a discrete step size of $\delta t = 0.01$ s. Each trajectory is thus 5 s long. For the first two data-sets, we minimize the accumulated costs in Equation (12) with the cost functions specified in Equation (13).
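The position sampling and the projection of distant targets can be sketched as below. The radii of 1.1 m and 1 m are taken from the text; whether the authors sample uniformly over the disk area, as done here, is an assumption, and both function names are hypothetical.

```python
import math
import random


def sample_position(rng, radius=1.1):
    """Sample a point uniformly over a disk of the given radius.
    The square root makes the density uniform in area (an assumption)."""
    r = radius * math.sqrt(rng.random())
    phi = rng.uniform(0.0, 2.0 * math.pi)
    return r * math.cos(phi), r * math.sin(phi)


def project_to_range(x, y, max_range=1.0):
    """Scale a point back onto the circle of radius max_range if it lies outside."""
    dist = math.hypot(x, y)
    if dist <= max_range:
        return x, y
    scale = max_range / dist
    return x * scale, y * scale


rng = random.Random(0)
samples = [sample_position(rng) for _ in range(1000)]
# A target 5 m away keeps its direction but is moved to 1 m distance.
projected = project_to_range(3.0, 4.0)
```

Sampling slightly beyond the 1 m operating range means that projected targets on the boundary still lie inside the region covered by training data.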
Our cost function penalizes states that are far from the origin with the term $x_t^2 + y_t^2$ as well as high motor voltages using $c_u(u_t)$. We also penalize large angular velocities $\dot{\alpha}_t$ and $\dot{\gamma}_t$ and the driving velocity $v_t$. The cross term $\alpha_t \cdot \dot{\alpha}_t$ penalizes a falling motion but rewards rising motions. The constant coefficients weight the importance of the different control goals against each other and were hand-tuned by trial and error to produce a subjectively appealing behavior of the MIP.
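The sign behavior of the cross term can be checked with a few numbers. The sketch below only evaluates the product of tilt angle and tilt rate and assumes it enters the cost with a positive weight; the actual coefficient is part of Equation (13) and is not reproduced here, and the function name is hypothetical.

```python
def tilt_cross_term(alpha, alpha_dot):
    """Return alpha * alpha_dot, which equals 0.5 * d/dt (alpha ** 2).
    It is positive whenever the magnitude of the tilt angle grows."""
    return alpha * alpha_dot


# Tilt angle and tilt rate share the same sign: the MIP is falling.
falling_forward = tilt_cross_term(0.2, 0.5)
falling_backward = tilt_cross_term(-0.2, -0.5)
# Opposite signs: the MIP is moving back towards the upright position.
rising = tilt_cross_term(0.2, -0.5)
```

Since the term is the time derivative of $\tfrac{1}{2}\alpha_t^2$, it adds cost while the pendulum tips further over and reduces cost while it rights itself, independent of the tilt direction.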
For the described task, however, we were unable to design a cost function describing the desired behavior on its own, which is why we had to add end constraints. The end constraints force our trajectories to end in the origin x T = 0 at step T = 500. Moreover, adding end constraints increased the convergence speed of the optimization in our case.
The necessity of including end constraints for this problem is in fact the main reason we chose imitation learning with trajectory optimization over reinforcement learning approaches. Equality constraints are not possible with reinforcement learning approaches, which are based on probabilistic reasoning. Recent progress in that area has been limited to including inequality constraints [49]. For our problem, the equality and inequality constraints, previously Equations (2d) and (2e), are summarized in Equation (14).
As was mentioned earlier, we use equality constraints in Equation (14a) to force convergence of the trajectories to the desired state in finite time. The inequalities (14b) and (14c) are required since the maximum voltage that can be provided to the motors is restricted. The inequalities (14d) and (14e) are used to prohibit non-continuous control signals that would lead to larger training errors during the approximation with a smooth function approximator. The constraints restrict the second derivative of the control signals, expressed as a finite difference scheme.
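The voltage and smoothness constraints can be illustrated for a single control channel as follows. Equation (14) is not reproduced here, so the central finite difference and the bounds `u_max` and `ddu_max` are assumptions chosen for illustration, not the values used by the authors.

```python
def second_difference(u, dt):
    """Central finite-difference approximation of the second time
    derivative of a sampled control signal u."""
    return [(u[t + 1] - 2.0 * u[t] + u[t - 1]) / dt ** 2
            for t in range(1, len(u) - 1)]


def satisfies_constraints(u, dt, u_max, ddu_max):
    """Check the box constraint on the voltage and the bound on the
    second difference (both bounds are hypothetical parameters)."""
    within_voltage = all(abs(v) <= u_max for v in u)
    smooth = all(abs(d) <= ddu_max for d in second_difference(u, dt))
    return within_voltage and smooth


ramp = [0.0, 1.0, 2.0, 3.0]   # constant slope, zero curvature
jump = [0.0, 0.0, 5.0, 5.0]   # step change, large curvature
ramp_ok = satisfies_constraints(ramp, dt=1.0, u_max=6.0, ddu_max=1.0)
jump_ok = satisfies_constraints(jump, dt=1.0, u_max=6.0, ddu_max=1.0)
```

The ramp passes while the step signal is rejected, which is the kind of discontinuous control signal the constraints are meant to exclude.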
After two data-sets of 10,000 trajectories are created, two controllers are trained on this data using supervised learning. The first controller $g(x)$ is trained on the data-set with constant model parameters and only takes the state as input. The second controller is our oracle controller $g(x, p)$, which is trained on the data-set with randomized model parameters as explained in Section 3.2. We use fully connected neural networks with two hidden layers of 128 neurons each. The hidden layers use tanh nonlinearities and the output layer is linear. We use a random portion of 80% of the data as our training-set and the remaining data as our test-set. We train the neural networks on a GPU for 10,000 epochs. We did not observe overfitting during supervised learning, even when training on as few as 20 trajectories.
The recurrent neural network $r(x, h)$ uses three hidden layers: a recurrent tanh layer with 32 neurons, a static tanh layer with 64 neurons and a static tanh layer with 32 neurons. The output layer is again linear. For the data generation in Algorithm 1, we again use the trajectory length $T = 500$ and add noise with a standard deviation of 0.001. We chose the amplitude of the noise in simulation by aiming for a disturbance that leads to trajectories that are subjectively not too far from the undisturbed case, while still being large enough for the noise to be visible. During each epoch, $N_{\mathrm{traj}} = 500$ trajectories are sampled according to Algorithm 1. The training of the recurrent neural network was performed on truncated sequences of 50 time steps, and we performed $N_{\mathrm{gd}} = 50$ parameter updates per epoch using Adam [41]. We ran Algorithm 2 for $N_{\mathrm{epoch}} = 500$ epochs.
In order to adjust the behavior on the real system later, we use our third data-set to train a second recurrent controller with an additional input, $r_\lambda(x, \lambda, h)$, as presented in Section 3.4. For this controller, the optimal trajectories are generated using a modified cost function with the adjusting parameter $\lambda$, given in Equation (15). This cost function is equal to (13) for $\lambda = 0$. For $\lambda \in (0, 1]$, the velocities are penalized less and the position error is penalized more strongly, which leads to a faster transition behavior. For $\lambda \in [-1, 0)$, the velocities are penalized more strongly, leading to a slower transition behavior.
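To give a feeling for the size of such a network and for the data split, the following sketch counts the trainable parameters of a fully connected network and draws a random 80/20 split. The input dimension of 7 (the state vector) and the output dimension of 2 (one voltage per motor) are assumptions, as is splitting at the level of whole trajectories; both helper names are hypothetical.

```python
import random


def mlp_param_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def split_indices(n, train_fraction=0.8, seed=0):
    """Randomly split the indices 0..n-1 into training and test indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_fraction * n))
    return idx[:n_train], idx[n_train:]


# Assumed layout: 7 inputs, two hidden layers of 128 neurons, 2 outputs.
n_params = mlp_param_count([7, 128, 128, 2])
train_idx, test_idx = split_indices(10000)
```

Under these assumptions the static controller has 17,794 parameters, which is small enough to be evaluated within the 0.01 s control period on modest hardware.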
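Training on truncated sequences means that each sampled trajectory is cut into shorter windows before gradients are computed. How the authors select the windows is not stated, so the non-overlapping split below is only one possible reading; the function name is hypothetical.

```python
def truncated_windows(sequence, window=50):
    """Cut a trajectory into consecutive, non-overlapping windows.
    Samples that do not fill a complete window are dropped."""
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, window)]


trajectory = list(range(500))          # stand-in for 500 recorded time steps
windows = truncated_windows(trajectory)
```

For a trajectory of 500 steps this yields 10 windows of 50 steps each. In truncated backpropagation through time, the hidden state is typically carried over from one window to the next while gradients are only propagated within a window.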

Results in Simulation
For the following analyses and comparisons, we create a new test-set by optimizing 2000 trajectories that were not included in any of the previous training data. This set of trajectories contains initial states within about 1 m of the origin and random model parameters sampled from the same distribution $D_p(p)$ as for the supervised training.
To assess the robustness of a controller, we evaluate two metrics. The first metric, $J_{E,c}$, is the mean of the accumulated costs over the initial states and model parameters in the test-set. The cost function $c(x_t, u_t)$ is the same one that was used for the trajectory optimization. A lower value for $J_{E,c}$ means that the controller is closer to the optimal trajectories. The second metric, $J_{\max,c}$, is the largest amount by which the accumulated costs exceed the optimal accumulated costs for the initial states and model parameters from the test-set.
The metric J max,c is used to compare the worst case performance of the controllers. Again, smaller values are better.
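Both metrics can be computed from the accumulated costs of the closed-loop simulations and of the corresponding optimal trajectories. The defining equations are not reproduced in this section, so the sketch below follows the verbal description and takes the maximum of the per-sample cost excess as one plausible reading; the function names and the numbers are made up for illustration.

```python
def mean_accumulated_cost(costs):
    """Mean of the accumulated closed-loop costs over the test-set."""
    return sum(costs) / len(costs)


def max_above_optimal(costs, optimal_costs):
    """Largest excess of a closed-loop cost over the optimal cost for the
    same initial state and model parameters (assumed per-sample pairing)."""
    return max(c - c_opt for c, c_opt in zip(costs, optimal_costs))


# Illustrative accumulated costs for three test cases (not measured data).
controller_costs = [12.0, 10.0, 15.0]
optimal_costs = [10.0, 9.0, 11.0]
j_mean = mean_accumulated_cost(controller_costs)
j_max = max_above_optimal(controller_costs, optimal_costs)
```

In this toy example, the third test case determines the worst-case metric with an excess of 4 over the optimal cost.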
In order to also quantify the violation of the end constraint, we define the cost function $c_T(x)$, which is only evaluated using the final state at $t = T$. We use $c_T(x)$ for two metrics that quantify the mean and the highest $c_T(x)$ over all simulations with the respective controllers.
As a first analysis, the influence of the number of trajectories and the number of training epochs on the performance of the oracle network is analyzed. Figure 3 shows the expected value of the accumulated costs $J_{E,c}$ of oracle controllers trained by supervised learning on different numbers of trajectories $N$ for different numbers of epochs $N_{\mathrm{gd}}$. For a small number of trajectories $N$, the best control performance is reached after a few epochs of supervised learning and the performance starts degrading with more training. Thus, a decrease in the error during supervised learning does not necessarily lead to an increase in the closed-loop performance. Including more trajectories leads to an increased closed-loop performance. Including more than 5000 trajectories, however, did not lead to a noticeable increase in the closed-loop performance in our case. The only case that produced a controller unable to stabilize the system was the one with as few as 20 trajectories.
We use the oracle controller trained on all 10,000 trajectories to train our recurrent controller $r(x, h)$ using DOI. The mean simulation costs of the recurrent controller, computed every 5 training episodes, are depicted in Figure 4. In our case, the costs decrease exponentially with the number of training episodes.
For each type of controller, i.e., static, oracle, recurrent and adaptive recurrent controllers, we extract the controllers with the best average closed-loop performance obtained in the training on 10,000 trajectories for the following comparison. We evaluate the metrics in Equations (16); the results are listed in Table 2. The oracle network slightly outperforms the simple controller without parameter information in terms of mean costs and maximum above-optimal costs. The recurrent controller performs even better than the oracle network in terms of both mean costs and maximum above-optimal costs. For the mean accumulated costs, this is unexpected, as the recurrent controller did not have direct access to the cost function during its training. For the maximum above-optimal accumulated costs, we believe that the recurrent layer is able to average out poor actions in individual states by acting on the past history instead of only acting on the current state. The adjustable controllers do not perform as well as the non-adjustable controllers in terms of mean costs for the case $\lambda = 0$. In the adjustable case, too, the recurrent controller outperforms the static controller with regard to $J_{\max,c}$.

Control Performance in the Application
The controllers, trained in simulation, are transferred unchanged to the real system. To evaluate the control performance, we record measurements of each controller for a test trajectory and evaluate the costs in Equation (13). Our test trajectory is 220 s long and contains 10 random target locations. Every 10 s, the target location changes from the origin to one of the target locations and then back to the origin after another 10 s. The accumulated costs over our test trajectory, i.e., $\sum_t c_x(x_t)$ and $\sum_t c_u(u_t)$, for different controllers are given in Table 3, with lower values indicating a better control performance. The recurrent controllers $r(x, h)$ and $r_\lambda(x, 0, h)$, trained on various model dynamics, achieve a better performance than the static controller, with accumulated costs reduced by 20% and 22.7%, respectively. The controller $r_\lambda(x, 0.3, h)$ is also represented in the table and performs slightly worse than $r_\lambda(x, 0, h)$. This is expected, as the cost function (13a) that is used for the evaluation of the values in Table 3 is different from the costs in Equation (15) for $\lambda \neq 0$.
Measurement data for the test trajectory for the static controller $g(x)$, the recurrent controller $r(x, h)$ and the adjusted controller $r_\lambda(x, 0.3, h)$ are shown in Figures 5-7, respectively. In the application, the approach to the target position is slower than in simulation due to oscillations in the tilt angle at low velocities. The remaining position error is most pronounced for the static controller in Figure 5, e.g., when approaching the new target position after 30 s. The position error is visibly reduced using the recurrent controller $r(x, h)$, as can be seen by comparing Figure 6 with Figure 5. The performance increase is also reflected in a reduced value for $\sum_t c_x$ in Table 3.
However, even after 10 s, the target location is not reliably reached by the static and the recurrent controller, as the controllers try to reduce the angular velocity of the tilt angle rather than drive towards the target position. We therefore use the adjustable controller $r_\lambda(x, \lambda, h)$ and increase the adjusting parameter $\lambda$ to give more weight to the position and less weight to the velocities. As can be seen in Table 3, this leads to an increase in the accumulated costs compared to the controller $r(x, h)$ according to the cost function in Equation (13), caused by higher velocities and control signals. However, the behavior is subjectively better due to a faster and more accurate approach to the target location, as is seen in Figure 7. Increasing $\lambda$ also increases the speed of convergence and the accuracy of the yaw angle, as can be seen by comparing Figure 7 with Figures 5 and 6, respectively.
For a qualitative impression of the control performance of the final controller $r_\lambda(x, \lambda, h)$ with $\lambda = 0.3$, a sequence of images of the MIP during the test trajectory is shown in Figure 8 for the real application and in Figure 9 for the same controller in the simulation. For the real application, a visualisation of the recorded measurement data is provided for each timestep as well, with the target position depicted as a static green MIP. The time delay between consecutive images is 0.25 s. Comparing both, we see that the dynamic manoeuvre is performed almost identically, with the largest differences close to the position of rest. For a further qualitative impression, a video of the MIP using the adjustable controller is accessible at https://youtu.be/MwVZgRJSnXg.

Outlook of Application Specific Variations
Our controller was trained to drive the MIP towards static target positions and orientations. By continuously shifting the target position, a trajectory or a moving target can be followed by this type of controller as well. An application suited for this type of controller is, e.g., following a person based on an on-board camera. However, the MIP will drive slightly behind the target position in the case of a moving target, as it was trained to reach targets with zero velocity. This effect can be seen in the video at https://youtu.be/MwVZgRJSnXg. If more precise trajectory tracking is required, one has to include training data of trajectories that were optimized to follow non-stationary targets. Other tasks, e.g., ascending slopes or avoiding collisions, also require the addition of optimized trajectories specific to that task.

Conclusions
We developed a parametric feedback controller for the mobile inverted pendulum using imitation learning with optimized trajectories. The controller is able to stabilize the system and drive to target positions within a certain radius without the need to compute trajectories online. The optimal trajectories used in the training were generated using varying model parameters in simulation and the controller used a recurrent structure in order to adapt its behavior to the real system. In order to train a recurrent controller using imitation learning, it is necessary to have training targets on state sequences generated by the recurrent controller itself. We therefore trained an intermediate oracle controller with full information of the model parameters that acts as a teacher to the recurrent controller. We show an improvement of the robustness of the controller both in simulation and in the real application by comparing it to a controller without the recurrent structure. Finally, an additional input allows us to easily adjust the behavior of the recurrent controller in the application.
Author Contributions: Conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing-original draft preparation, visualization: C.D.; resources, writing-review and editing, supervision, project administration, funding acquisition: B.L. All authors have read and agreed to the published version of the manuscript.
Funding: This research was partially funded by the German Research Foundation DFG, SFB768.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: