#### 5.1. Control Design Details

The first part of our approach is the trajectory optimization. We optimize N = 10,000 trajectories for each controller; we show, however, that the task can be solved with fewer trajectories. The large number of trajectories reduces the influence of random sampling on our results. We sample three datasets. The first dataset, used to train a reference controller, has varying initial states with constant model parameters and a constant cost function. The second dataset has varying initial states and model parameters, with a constant cost function. The third dataset varies the initial states, the model parameters and the cost function parameter.

The initial states are sampled from uniform distributions. All entries of the initial state vector $\mathbf{x}$ are perturbed, which is important to include starting states of a falling MIP and thus to obtain trajectories of the MIP recovering from a fall. The initial positions are sampled within a radius of $1.1\,\mathrm{m}$ around the origin of the inertial coordinate system. The maximum range we consider for the control later is 1 m; all target points outside of this range are projected back onto that radius.
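This sampling and projection scheme can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the uniform-in-area disk sampling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_initial_position(r_max=1.1):
    """Sample a position uniformly over a disk of radius r_max around the origin."""
    # Uniform sampling over the disk area: radius grows with sqrt(U).
    r = r_max * np.sqrt(rng.uniform())
    phi = rng.uniform(0.0, 2.0 * np.pi)
    return np.array([r * np.cos(phi), r * np.sin(phi)])

def project_target(target, r_ctrl=1.0):
    """Project target points outside the 1 m control range back onto that radius."""
    d = np.linalg.norm(target)
    return target if d <= r_ctrl else target * (r_ctrl / d)
```

A target at distance 2 m, for instance, is mapped to the point in the same direction at exactly 1 m.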

The optimization horizon is 501 steps from $t=0$ to $t=T=500$, with a discrete step size of $\delta t=0.01\,\mathrm{s}$. Each trajectory is thus 5 s long. For the first two datasets, we minimize the accumulated costs in Equation (12) with the cost functions specified in Equation (13).

Our cost function penalizes states that are far from the origin with the term ${x}_{t}^{2}+{y}_{t}^{2}$, as well as high motor voltages via ${c}_{u}\left({\mathbf{u}}_{t}\right)$. We also penalize large angular velocities ${\dot{\alpha}}_{t}$ and ${\dot{\gamma}}_{t}$ and the driving velocity ${v}_{t}$. The cross term ${\alpha}_{t}\,{\dot{\alpha}}_{t}$ penalizes falling motions and rewards rising motions. The constant coefficients weight the different control goals against each other and were hand-tuned by trial and error to produce a subjectively appealing behavior of the MIP.
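The structure of such a stage cost can be sketched as below. The weights, the state ordering and the quadratic voltage penalty are assumptions for illustration; the paper's hand-tuned coefficients are given in Equation (13), not here.

```python
import numpy as np

# Hypothetical weights; the paper's hand-tuned coefficients are not reproduced.
W = dict(pos=1.0, v=0.1, dalpha=0.1, dgamma=0.1, cross=0.5, u=0.01)

def c_x(x):
    """Stage cost on the state; assumed ordering x = (x, y, alpha, dalpha, dgamma, v)."""
    px, py, alpha, dalpha, dgamma, v = x
    return (W["pos"] * (px**2 + py**2)      # distance from the origin
            + W["v"] * v**2                 # driving velocity
            + W["dalpha"] * dalpha**2       # tilt rate
            + W["dgamma"] * dgamma**2       # yaw rate
            + W["cross"] * alpha * dalpha)  # cross term: penalizes falling, rewards rising

def c_u(u):
    """Control cost: quadratic penalty on the motor voltages."""
    return W["u"] * float(np.dot(u, u))
```

Note the sign behavior of the cross term: with a positive tilt angle, a positive tilt rate (falling further) adds cost, while a negative tilt rate (rising) subtracts it.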

For the described task, however, we were unable to design a cost function that describes the desired behavior on its own, which is why we added end constraints. The end constraints force our trajectories to end in the origin, ${\mathbf{x}}_{T}=\mathbf{0}$, at step $T=500$. Moreover, adding end constraints increased the convergence speed of the optimization in our case.

The necessity of end constraints for this problem is in fact the main reason we chose the approach of imitation learning of optimized trajectories over reinforcement learning approaches. Equality constraints are not possible with reinforcement learning approaches, which are based on probabilistic reasoning; recent progress in that area covers inequality constraints only [49]. For our problem, the equality and inequality constraints, previously Equations (2d) and (2e), are summarized in Equation (14).

As mentioned earlier, we use the equality constraints in Equation (14a) to force convergence of the trajectories to the desired state in finite time. The inequalities (14b) and (14c) are required since the maximum voltage that can be supplied to the motors is limited. The inequalities (14d) and (14e) prohibit discontinuous control signals, which would lead to larger training errors during the approximation with a smooth function approximator; they restrict the second derivative of the control signals, expressed as a finite difference scheme.
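A check of these constraints on one control channel can be sketched with a central finite difference for the second derivative. The numerical bounds `U_MAX` and `DDU_MAX` are placeholders, not the paper's values.

```python
import numpy as np

DT = 0.01       # step size from the optimization setup
U_MAX = 12.0    # hypothetical voltage limit; the real bound is hardware-specific
DDU_MAX = 1e4   # hypothetical bound on the second derivative of the control

def constraint_violations(u):
    """Check the box constraint on u and a finite-difference bound on its curvature.

    u: array of shape (T+1,), one control channel over the horizon.
    Returns (voltage_ok, smoothness_ok).
    """
    # Central difference approximation of u'' at the interior steps.
    ddu = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / DT**2
    voltage_ok = bool(np.all(np.abs(u) <= U_MAX))
    smoothness_ok = bool(np.all(np.abs(ddu) <= DDU_MAX))
    return voltage_ok, smoothness_ok
```

A step-like jump in the signal produces a huge finite-difference curvature, so exactly the discontinuous signals described above are rejected, while smooth signals pass.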

After two datasets of 10,000 trajectories are created, two controllers are trained on this data using supervised learning. The first controller $\mathbf{g}\left(\mathbf{x}\right)$ is trained on the dataset with constant model parameters and only takes the state as input. The second controller is our oracle controller $\mathbf{g}(\mathbf{x},\mathbf{p})$ and is trained on the dataset with randomized model parameters, as explained in Section 3.2. We use fully connected neural networks with two hidden layers of 128 neurons each. The hidden layers use tanh nonlinearities; the output layer is linear. We use a random portion of $80\%$ of the data as our training set and the remaining data as our test set. We train the neural networks on a GPU over 10,000 epochs. We did not observe overfitting during supervised learning, even when training on as few as 20 trajectories.
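The forward pass of such an oracle network can be sketched in a few lines. This is a NumPy illustration with random weights, not the trained controller; the state and parameter dimensions (6 and 4) and the two-dimensional voltage output are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(n_in, n_hidden=128, n_out=2):
    """Random weights for a two-hidden-layer tanh network with a linear output."""
    sizes = [n_in, n_hidden, n_hidden, n_out]
    return [(rng.normal(0, 1 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def g(params, x, p):
    """Oracle controller g(x, p): state and model parameters in, voltages out."""
    h = np.concatenate([x, p])         # the oracle sees the model parameters
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)         # tanh hidden layers
    W, b = params[-1]
    return h @ W + b                   # linear output layer
```

The reference controller $\mathbf{g}(\mathbf{x})$ has the same architecture without the parameter input.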

The recurrent neural network $\mathbf{r}(\mathbf{x},\mathbf{h})$ uses three hidden layers: a recurrent tanh layer with 32 neurons, a static tanh layer with 64 neurons and a static tanh layer with 32 neurons. The output layer is again linear. For the data generation in Algorithm 1, we again use the trajectory length $T=500$ and add noise with standard deviation $\epsilon=0.001$. We chose the noise amplitude in simulation by aiming for a disturbance that leads to trajectories that are subjectively not too far from the undisturbed case, yet large enough that the effect of the noise remains visible. During each epoch, ${N}_{traj}=500$ trajectories are sampled according to Algorithm 1. The recurrent neural network was trained on truncated sequences of 50 time steps, and we performed ${N}_{gd}=50$ parameter updates per epoch using Adam [41]. We ran Algorithm 2 for ${N}_{epoch}=500$ epochs.
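The truncation step can be sketched as slicing each rollout into non-overlapping 50-step windows before backpropagation through time. The function below is an illustration under that assumption; whether the authors use overlapping windows or carry hidden states across windows is not stated.

```python
import numpy as np

SEQ_LEN = 50   # truncation length for RNN training

def truncate(states, controls, seq_len=SEQ_LEN):
    """Split one rollout of T+1 steps into non-overlapping seq_len windows.

    states:   (T+1, n_x) array;  controls: (T+1, n_u) array.
    Leftover steps that do not fill a full window are dropped.
    """
    n = (len(states) // seq_len) * seq_len
    xs = states[:n].reshape(-1, seq_len, states.shape[-1])
    us = controls[:n].reshape(-1, seq_len, controls.shape[-1])
    return xs, us
```

For $T=500$, a 501-step rollout yields ten 50-step training windows per trajectory.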

In order to adjust the behavior on the real system later, we use our third dataset to train a second recurrent controller with an additional input, ${\mathbf{r}}_{\lambda}(\mathbf{x},\lambda ,\mathbf{h})$, as presented in Section 3.4. For this controller, the optimal trajectories are generated using a modified cost function with the adjusting parameter $\lambda \in [-1,1]$ (Equation (15)).

This cost function is equal to Equation (13) for $\lambda =0$. For $\lambda \in (0,1]$, the velocities are penalized less and the position error is penalized more strongly, which leads to faster transition behavior. For $\lambda \in [-1,0)$, velocities are penalized more strongly, leading to slower transition behavior.
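One way to realize such a trade-off is to scale the position and velocity weights in opposite directions with $\lambda$, as sketched below. The exponential scaling and the base weights are assumptions for illustration; the paper's actual parametrization is given in Equation (15).

```python
def adjusted_weights(lam, w_pos=1.0, w_vel=0.1):
    """Opposing scaling of position and velocity weights with lam in [-1, 1].

    lam > 0: position error penalized more, velocities less (faster transitions);
    lam < 0: velocities penalized more (slower transitions);
    lam = 0: recovers the base weights of the unmodified cost function.
    """
    assert -1.0 <= lam <= 1.0
    return w_pos * 2.0**lam, w_vel * 2.0**(-lam)
```

Training on trajectories optimized for many sampled $\lambda$ values lets a single controller interpolate this behavior at run time.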

#### 5.2. Results in Simulation

For the following analyses and comparisons, we create a new test set by optimizing 2000 trajectories that were not included in any training data. This set of trajectories contains initial states within 1 m of the origin and random model parameters sampled from the same distribution ${D}_{p}\left(\mathbf{p}\right)$ as for the supervised training.

To assess the robustness of a controller, we evaluate two metrics. The first metric, ${J}_{\mathbb{E},c}$, is the mean of the accumulated costs over the initial states and model parameters in the test set.

The cost function $c({x}_{t},{u}_{t})$ is the same as that used for the trajectory optimization. A lower value of ${J}_{\mathbb{E},c}$ means that the controller is closer to the optimal trajectories.

The second metric is the optimal accumulated costs subtracted from the highest accumulated costs, with initial states and model parameters from the test set.

The metric ${J}_{max,c}$ is used to compare the worst-case performance of the controllers. Again, smaller values are better.
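Both metrics reduce to simple statistics over per-test-case accumulated costs. A minimal sketch, assuming the accumulated costs of the controller and of the corresponding optimal trajectories are already available as arrays:

```python
import numpy as np

def robustness_metrics(J_controller, J_optimal):
    """Mean cost and worst-case cost above optimal over the test set.

    J_controller: accumulated closed-loop costs per test case, shape (N,).
    J_optimal:    accumulated costs of the optimal trajectories, shape (N,).
    """
    J_mean = float(np.mean(J_controller))            # J_{E,c}: mean accumulated cost
    J_max = float(np.max(J_controller - J_optimal))  # J_{max,c}: worst case above optimal
    return J_mean, J_max
```

Subtracting the optimal costs case by case before taking the maximum ensures that $J_{max,c}$ measures the controller's excess over what is achievable, not the inherent difficulty of the hardest test case.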

In order to also quantify the violation of the end constraint, we define the cost function ${c}_{T}\left(\mathbf{x}\right)$, which is only evaluated using the final state at $t=T$. We use ${c}_{T}\left(\mathbf{x}\right)$ for two metrics that quantify the mean and the highest ${c}_{T}\left(\mathbf{x}\right)$ over all simulations with the respective controllers.

As a first analysis, we examine the influence of the number of trajectories and the number of training epochs on the performance of the oracle network.

Figure 3 shows the expected accumulated costs ${J}_{\mathbb{E},c}$ of oracle controllers trained by supervised learning on different numbers of trajectories $N$ and for different numbers of epochs ${N}_{gd}$. For a small number of trajectories $N$, the best control performance is reached after a few epochs of supervised learning, and the performance starts degrading with further training. Thus, a decrease in the error during supervised learning does not necessarily lead to an increase in the closed-loop performance. Including more trajectories improves the closed-loop performance; beyond 5000 trajectories, however, we observed no noticeable further increase in our case. The only case that produced a controller unable to stabilize the system was with as few as 20 trajectories.

We use the oracle controller trained on all 10,000 trajectories to train our recurrent controller $\mathbf{r}(\mathbf{x},\mathbf{h})$ using DOI. The mean simulation costs of the recurrent controller are depicted in Figure 4, computed every 5 training episodes. The costs decrease exponentially with the number of training episodes in our case.

For each type of controller, i.e., the static, oracle, recurrent and adaptive recurrent controllers, we extract the controllers with the best average closed-loop performance obtained in the training on 10,000 trajectories for the following comparison. We evaluate the metrics in Equations (16)–(19b) on the test set of 2000 initial states and model parameters. The values for each controller are shown in Table 2. The oracle network slightly outperforms the simple controller without parameter information in terms of mean costs and maximum above-optimal costs. The recurrent controller performs better even than the oracle network in terms of both mean costs and maximum above-optimal costs. For the mean accumulated costs this is unexpected, as the recurrent controller did not have direct access to the cost function during its training. For the maximum above-optimal accumulated costs, we believe that the recurrent layer is able to average out poor actions in individual states by acting on the past history instead of only on the current state. The adjustable controllers do not perform as well as the non-adjustable controllers in terms of mean costs for the case $\lambda =0$. For the adjustable case, however, the recurrent controller outperforms the static controller with regard to ${J}_{max,c}$.

#### 5.3. Control Performance in the Application

The controllers, trained in simulation, are transferred unchanged to the real system. To evaluate the control performance, we record measurements of each controller for a test trajectory and evaluate the costs in Equation (13). Our test trajectory is 220 s long and contains 10 random target locations. Every 10 s, the target location changes from the origin to one of the target locations and then back to the origin after another 10 s. The accumulated costs over our test trajectory, i.e., ${\sum}_{t}{c}_{x}\left({\mathbf{x}}_{t}\right)$ and ${\sum}_{t}{c}_{u}\left({\mathbf{u}}_{t}\right)$, for the different controllers are given in Table 3, with lower values indicating better control performance. The recurrent controllers $\mathbf{r}(\mathbf{x},\mathbf{h})$ and ${\mathbf{r}}_{\lambda}(\mathbf{x},0,\mathbf{h})$, trained on varied model dynamics, achieve a better performance than the static controller, with accumulated costs reduced by $20\%$ and $22.7\%$, respectively. The controller ${\mathbf{r}}_{\lambda}(\mathbf{x},0.3,\mathbf{h})$ is also included in the table and performs slightly worse than ${\mathbf{r}}_{\lambda}(\mathbf{x},0,\mathbf{h})$. This is expected, as the cost function (13a) used for the evaluation in Table 3 differs from the costs in Equation (15) for $\lambda \ne 0$.

Measurement data for the test trajectory for the static controller $\mathbf{g}\left(\mathbf{x}\right)$, the recurrent controller $\mathbf{r}(\mathbf{x},\mathbf{h})$ and the adjusted controller ${\mathbf{r}}_{\lambda}(\mathbf{x},0.3,\mathbf{h})$ are shown in Figure 5, Figure 6 and Figure 7, respectively. In the application, approaching the target position is slower than in simulation due to oscillations in the tilt angle at low velocities. The remaining position error is most pronounced for the static controller in Figure 5, e.g., when approaching the new target position after 30 s. The position error is visibly reduced using the recurrent controller $\mathbf{r}(\mathbf{x},\mathbf{h})$, as can be seen by comparing Figure 6 with Figure 5. The performance increase is also reflected in a reduced value of ${\sum}_{t}{c}_{x}$ in Table 3. However, even after 10 s, the target location is not reliably reached by the static and recurrent controllers, as they try to reduce the angular velocity of the tilt angle rather than drive towards the target position. We therefore use the adjustable controller ${\mathbf{r}}_{\lambda}(\mathbf{x},\lambda ,\mathbf{h})$ and increase the adjusting parameter $\lambda$ to give more weight to the position and less weight to the velocities. As can be seen in Table 3, this leads to an increase in the accumulated costs compared to the controller $\mathbf{r}(\mathbf{x},\mathbf{h})$ according to the cost function in Equation (13), caused by higher velocities and control signals. However, the behavior is subjectively better due to the faster and more accurate approach of the target location, as seen in Figure 7. Increasing $\lambda$ also increases the speed of convergence and the accuracy of the yaw angle, as can be seen by comparing Figure 7 with Figure 5 and Figure 6, respectively.

For a qualitative impression of the control performance of the final controller ${\mathbf{r}}_{\lambda}(\mathbf{x},\lambda ,\mathbf{h})$ with $\lambda =0.3$, a sequence of images of the MIP during the test trajectory is shown in Figure 8 for the real application and in Figure 9 for the same controller in simulation. For the real application, a visualization of the recorded measurement data is provided for each time step as well, with the target position depicted as a static green MIP. The time delay between consecutive images is $0.25$ s. Comparing both, we see that the dynamic maneuver is performed almost identically, with the largest differences close to the position of rest. For a further qualitative impression, a video of the MIP using the adjustable controller is available at https://youtu.be/MwVZgRJSnXg.