# Adjustable and Adaptive Control for an Unstable Mobile Robot Using Imitation Learning with Trajectory Optimization


## Abstract


## 1. Introduction

## 2. Related Work

#### 2.1. Training Robust Parametric Controllers

#### 2.2. Imitation Learning with Trajectory Optimization

## 3. Design of an Adaptive and Adjustable Controller Using Imitation Learning

- Trajectory optimization with randomized model parameters.
- Training an intermediate oracle network.
- Training of a controller with internal states.
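The three stages above can be wired together as a minimal pipeline sketch. The function names here are placeholders for illustration, not the paper's API: `optimize_trajectories` stands for the randomized trajectory optimization, `fit_oracle` for training the oracle $\mathbf{g}(\mathbf{x},\mathbf{p})$, and `fit_recurrent` for distilling the recurrent controller $\mathbf{r}(\mathbf{x},\mathbf{h})$.

```python
def build_controller(optimize_trajectories, fit_oracle, fit_recurrent):
    """Sketch of the three-stage design, with each stage as a callable.

    All three callables are assumptions for illustration; the paper
    realizes them via trajectory optimization, supervised oracle
    training, and DOI (Algorithm 2), respectively."""
    trajectories = optimize_trajectories()  # stage 1: randomized model parameters
    oracle = fit_oracle(trajectories)       # stage 2: intermediate oracle g(x, p)
    return fit_recurrent(oracle)            # stage 3: recurrent controller r(x, h)
```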

#### 3.1. Trajectory Optimization

#### 3.2. Oracle Training

#### 3.3. Training a Robust Recurrent Network

**Algorithm 1** Generating training data for the recurrent neural network

**Inputs:** $\mathbf{g}, D_{x0}, D_{p0}, \epsilon$

1. $\mathbf{x}_0 \sim D_{x0}(\mathbf{x})$, $\mathbf{p} \sim D_{p0}(\mathbf{p})$, $\mathbf{h}_{-1} = \mathbf{0}$
2. **for** $t = 0$ **to** $T$ **do**
3. $\quad \hat{\mathbf{u}}_t = \mathbf{g}(\mathbf{x}_t, \mathbf{p})$ ▹ Evaluate oracle
4. $\quad \mathbf{u}_t, \mathbf{h}_t = \mathbf{r}(\mathbf{x}_t, \mathbf{h}_{t-1}; \mathbf{\Theta})$ ▹ Evaluate controller
5. $\quad \mathbf{x}_{t+1} = \mathbf{f}(\mathbf{x}_t, \mathbf{u}_t; \mathbf{p}) + \mathcal{N}(\mathbf{0}, \epsilon^2 \mathbf{I})$ ▹ Disturbed dynamics
6. **end for**
7. **return** $\mathbf{X} = [\mathbf{x}_0, \dots, \mathbf{x}_T]$, $\hat{\mathbf{U}} = [\hat{\mathbf{u}}_0, \dots, \hat{\mathbf{u}}_T]$
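Algorithm 1 can be sketched as a short rollout loop. The callables `g`, `r`, and `f` stand in for the oracle, the recurrent controller, and the disturbed dynamics from the paper; their concrete signatures here are assumptions for illustration.

```python
import numpy as np

def generate_training_data(g, r, f, D_x0, D_p, theta, T=100, eps=0.01,
                           nx=2, nh=4):
    """Sketch of Algorithm 1: roll out the recurrent controller r in the
    disturbed closed loop while recording the oracle's actions as labels.

    g(x, p), r(x, h, theta) -> (u, h), f(x, u, p), D_x0(), and D_p() are
    placeholders for the paper's oracle, recurrent controller, dynamics,
    and sampling distributions."""
    x = D_x0()                    # sample initial state x_0 ~ D_x0
    p = D_p()                     # sample model parameters p ~ D_p0
    h = np.zeros(nh)              # hidden state h_{-1} = 0
    X, U_hat = [], []
    for t in range(T):
        u_hat = g(x, p)           # oracle action, used as training label
        u, h = r(x, h, theta)     # the recurrent controller drives the system
        X.append(x)
        U_hat.append(u_hat)
        # disturbed dynamics: x_{t+1} = f(x_t, u_t; p) + N(0, eps^2 I)
        x = f(x, u, p) + np.random.normal(0.0, eps, size=nx)
    return np.array(X), np.array(U_hat)
```

Note that the states are generated by the controller $\mathbf{r}$, not by the oracle, so the training data covers the states the recurrent controller actually visits.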

**Algorithm 2** Disturbed Oracle Imitation (DOI)

1. **for** $\mathrm{epoch} = 1$ **to** $N_{epoch}$ **do**
2. $\quad \mathcal{D} \leftarrow \varnothing$
3. $\quad$ **for** $\mathrm{traj} = 1$ **to** $N_{traj}$ **do**
4. $\qquad$ sample a sequence $\mathbf{X}_{\mathrm{traj}}, \hat{\mathbf{U}}_{\mathrm{traj}}$ using Algorithm 1
5. $\qquad \mathcal{D} \leftarrow \mathcal{D} \cup \left(\mathbf{X}_{\mathrm{traj}}, \hat{\mathbf{U}}_{\mathrm{traj}}\right)$
6. $\quad$ **end for**
7. $\quad$ **for** $\mathrm{gd} = 1$ **to** $N_{gd}$ **do**
8. $\qquad$ update $\mathbf{\Theta}$ using TBPTT
9. $\quad$ **end for**
10. **end for**
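The outer training loop of Algorithm 2 can be sketched as follows. `sample_sequence` and `tbptt_update` are placeholders: the former returns one $(\mathbf{X}, \hat{\mathbf{U}})$ rollout as produced by Algorithm 1, the latter performs one truncated-backpropagation-through-time gradient step on the controller parameters $\mathbf{\Theta}$ over the aggregated dataset $\mathcal{D}$.

```python
def disturbed_oracle_imitation(sample_sequence, tbptt_update, theta,
                               n_epoch=5, n_traj=8, n_gd=10):
    """Sketch of Algorithm 2 (DOI). Both callables are assumptions for
    illustration; the paper uses Algorithm 1 for rollouts and TBPTT with
    an imitation loss against the oracle labels for the update."""
    for epoch in range(n_epoch):
        D = []                                 # fresh dataset D each epoch
        for _ in range(n_traj):
            X, U_hat = sample_sequence(theta)  # rollout with current controller
            D.append((X, U_hat))
        for _ in range(n_gd):
            theta = tbptt_update(theta, D)     # minimize imitation loss vs. oracle
    return theta
```

Because the dataset is regenerated with the current parameters every epoch, the controller is always trained on states it produces itself, similar in spirit to DAGGER-style aggregation.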

#### 3.4. Adding Adjustable Behavior

## 4. Task and Model Description

#### 4.1. System and Task

#### 4.2. Mathematical Model

## 5. Application and Results

#### 5.1. Control Design Details

#### 5.2. Results in Simulation

#### 5.3. Control Performance in the Application

#### 5.4. Outlook of Application Specific Variations

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
DAGGER | Dataset aggregation |
DART | Disturbances for augmenting robot trajectories |
DOI | Disturbed oracle imitation |
MIP | Mobile inverted pendulum |
MPC | Model predictive control |
TBPTT | Truncated backpropagation through time |

## Appendix A. Rigid Body Dynamics Model


**Figure 1.** Visual comparison of DAGGER, DART, and DOI. ${\mathbf{x}}_{0}$ indicates the initial state and ${\mathbf{x}}_{r}$ a reference state with low costs. The black line indicates a trajectory sampled with the parametric controller in the loop. Blue arrows indicate the training data created for each approach, possibly generated for a distribution of trajectories, indicated by the blurred area.

**Figure 3.** Mean accumulated costs of oracle controllers $\mathbf{g}(\mathbf{x},\mathbf{p})$ trained on different numbers of trajectories over ${N}_{epoch}$ epochs. The number of trajectories used during training is given in the line label.

**Figure 4.** Mean accumulated costs of the recurrent controller $\mathbf{r}(\mathbf{x},\mathbf{h})$ trained using DOI over the number of epochs ${N}_{epoch}$.

**Figure 5.** Measurement data for an application of a static neural network controller $\mathbf{g}\left(\mathbf{x}\right)$. Units are meters for the position coordinates x and y (top plot) and radians for $\gamma$ (bottom plot).

**Figure 6.** Measurement data for an application of a recurrent neural network controller $\mathbf{r}(\mathbf{x},\mathbf{h})$. Units are meters for the position coordinates x and y (top plot) and radians for $\gamma$ (bottom plot).

**Figure 7.** Measurement data for an application of an adjustable recurrent neural network controller ${\mathbf{r}}_{\lambda}(\mathbf{x},\lambda ,\mathbf{h})$ with $\lambda =0.3$. Units are meters for the position coordinates x and y (top plot) and radians for $\gamma$ (bottom plot).

**Figure 8.** Image sequence showing a manoeuvre of the real MIP using the recurrent control structure ${\mathbf{r}}_{\lambda}(\mathbf{x},\lambda ,\mathbf{h})$ with $\lambda =0.3$. The top row shows the real system; attached below is a visualization of the measurement data (gray MIP), which also shows the target position as a green MIP.

**Figure 9.** Image sequence showing a manoeuvre of the MIP in simulation using the recurrent control structure ${\mathbf{r}}_{\lambda}(\mathbf{x},\lambda ,\mathbf{h})$ with $\lambda =0.3$. The target position is shown as a green MIP.

Variable | Value | Unit | Description |
---|---|---|---|
${M}_{\mathrm{b}}$ | 1.76 | $\mathrm{kg}$ | Mass of the body |
${M}_{\mathrm{w}}$ | 0.147 | $\mathrm{kg}$ | Mass of a wheel |
$R$ | 0.07 | $\mathrm{m}$ | Radius of the wheels |
${c}_{z}$ | $0.07 \pm 20\%$ | $\mathrm{m}$ | Height of the center of mass above the wheel axis |
$b$ | 0.09925 | $\mathrm{m}$ | Half length between wheels |
${I}_{xx}$ | 0.0191 | $\mathrm{kg\,m^2}$ | Moment of inertia, x-axis |
${I}_{yy}$ | $0.0158 \pm 20\%$ | $\mathrm{kg\,m^2}$ | Moment of inertia, y-axis |
${I}_{zz}$ | 0.0048 | $\mathrm{kg\,m^2}$ | Moment of inertia, z-axis |
${I}_{wa}$ | $3.6 \cdot 10^{-4}$ | $\mathrm{kg\,m^2}$ | Moment of inertia of a wheel, y-axis |
${I}_{wd}$ | $1.45 \cdot 10^{-3}$ | $\mathrm{kg\,m^2}$ | Moment of inertia of a wheel, z-axis |
${k}_{1}$ | 0.018 | $\mathrm{V\,s}$ | Motor constant |
${k}_{2}$ | 0.61 | $\mathrm{N\,m\,A^{-1}}$ | Motor constant |
${c}_{\mathrm{fric},1}$ | $0.24 \pm 20\%$ | $\mathrm{N\,m^{-1}}$ | Friction model constant |
${c}_{\mathrm{fric},2}$ | 2.0 | – | Friction model constant |
${c}_{\mathrm{fric},3}$ | 0.4 | – | Friction model constant |
${c}_{\mathrm{fric},4}$ | $8 \cdot 10^{-4}$ | $\mathrm{N\,s\,m^{-1}}$ | Friction model constant |
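The parameters marked $\pm 20\%$ in the table are the ones randomized during trajectory optimization. A minimal sketch of drawing one such parameter vector, assuming a uniform distribution over the $\pm 20\%$ interval (the paper's exact choice of distribution is not restated here):

```python
import numpy as np

# Nominal values of the uncertain parameters from the table above.
NOMINAL = {"c_z": 0.07, "I_yy": 0.0158, "c_fric_1": 0.24}

def sample_parameters(rng, spread=0.20):
    """Draw one randomized parameter set p by perturbing each uncertain
    parameter uniformly within +/-20% of its nominal value.

    The uniform distribution is an illustrative assumption; only the
    +/-20% range is taken from the parameter table."""
    return {name: value * rng.uniform(1.0 - spread, 1.0 + spread)
            for name, value in NOMINAL.items()}

rng = np.random.default_rng(0)
p = sample_parameters(rng)
# Every sampled value stays within the stated +/-20% band.
assert all(0.8 * NOMINAL[k] <= p[k] <= 1.2 * NOMINAL[k] for k in NOMINAL)
```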

**Table 2.** Mean and maximal accumulated costs for different controllers in simulation. Controllers printed in gray cannot be used in practice and are given only as a reference.

 | $\mathbf{g}\left(\mathbf{x}\right)$ | $\mathbf{g}(\mathbf{x},\mathbf{p})$ | $\mathbf{r}(\mathbf{x},\mathbf{h})$ | ${\mathbf{g}}_{\lambda}(\mathbf{x},0,\mathbf{p})$ | ${\mathbf{r}}_{\lambda}(\mathbf{x},0,\mathbf{h})$ | opt. |
---|---|---|---|---|---|---|
${J}_{\mathbb{E},c}$ | 181.47 | 178.33 | 176.58 | 185.81 | 185.75 | 155.163 |
${J}_{max,c}$ | 712.03 | 520.95 | 121.38 | 494.27 | 182.67 | 0 |
${J}_{\mathbb{E},{c}_{T}}$ | 0.133 | 0.102 | 0.090 | 0.070 | 0.051 | 0 |
${J}_{max,{c}_{T}}$ | 1.10 | 2.13 | 1.34 | 1.092 | 0.524 | 0 |

 | $\mathbf{g}\left(\mathbf{x}\right)$ | $\mathbf{r}(\mathbf{x},\mathbf{h})$ | ${\mathbf{r}}_{\lambda}(\mathbf{x},0,\mathbf{h})$ | ${\mathbf{r}}_{\lambda}(\mathbf{x},0.3,\mathbf{h})$ |
---|---|---|---|---|
$\sum {c}_{x}$ | 5221.86 | 3957.26 | 4062.04 | 4131.76 |
$\sum {c}_{u}$ | 1674.85 | 1566.88 | 1265.35 | 1431.55 |
$\sum c$ | 6896.71 | 5524.14 | 5327.39 | 5563.31 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dengler, C.; Lohmann, B.
Adjustable and Adaptive Control for an Unstable Mobile Robot Using Imitation Learning with Trajectory Optimization. *Robotics* **2020**, *9*, 29.
https://doi.org/10.3390/robotics9020029
