# Reinforcement Learning Approach to Design Practical Adaptive Control for a Small-Scale Intelligent Vehicle


## Abstract


## 1. Introduction

## 2. Experimental Setup

## 3. Control Strategy Based on Reinforcement Learning

#### 3.1. Problem Formulation

#### 3.2. The Key Concept of Reinforcement Learning

**System state:** In the RL algorithm, the control action is directly determined by the system state. In this study, the offset from the centerline, the vehicle yaw angle, and the vehicle velocity are selected to form a three-dimensional state space, i.e., $s\left(t\right)={\left(e\left(t\right),\phi \left(t\right),v\left(t\right)\right)}^{T}$, where $e\left(t\right)$ represents the offset from the centerline at time t, and $\phi \left(t\right)$ and $v\left(t\right)$ represent the yaw angle and the vehicle speed at time t, respectively. Figure 8 illustrates the three defined states: the values after the servo represent the set of yaw angle states, the values after the Pulse-Width Modulation (PWM) signal represent the set of vehicle speed states, and the numbers at the top of the figure represent the set of offsets from the centerline. It should be noted that only a small number of states are chosen in this study in order to (a) facilitate the training process and (b) showcase the generalization ability of RL techniques.
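To make the state definition concrete, the sketch below maps a continuous measurement $(e, \phi, v)$ to a discrete state tuple. The bin edges are illustrative assumptions invented for this example, not the paper's calibrated values.

```python
# Illustrative bin edges only; the paper's exact discretisation is not given here.
E_BINS   = [-6.0, -2.0, 2.0, 6.0]      # offset from centerline e(t), cm
PHI_BINS = [-30.0, -10.0, 10.0, 30.0]  # yaw angle phi(t), degrees
V_BINS   = [0.5, 1.0, 1.5]             # vehicle speed v(t), m/s

def bin_index(x, edges):
    """Return the index of the interval that x falls into (0 .. len(edges))."""
    return sum(x > e for e in edges)

def discretize(e, phi, v):
    """Map a continuous measurement to a discrete state tuple s(t)."""
    return (bin_index(e, E_BINS), bin_index(phi, PHI_BINS), bin_index(v, V_BINS))

print(discretize(-3.5, 12.0, 0.8))  # a slightly-left, yawing-right, slow state
```

With coarse bins like these, the Q table stays small, which is what makes on-vehicle tabular learning feasible.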

**Control action:** The core decision-making problem of the control strategy concerns the steering servo motor, which sets the vehicle steering angle, and the drive motor, which determines the vehicle speed. We choose the steering angle of the servo motor and the duty cycle of the motor as the control actions, denoted as $A\left(t\right)=\left(\mu \left(t\right),\rho \left(t\right)\right)$, where t is the time step index. $A\left(t\right)$ must be discretized in order to apply the RL-based algorithm, i.e., the entire action space is $A=\left\{{A}^{1},{A}^{2},\dots ,{A}^{n}\right\}$, where n is the degree of discretization, given by the product of the number of steering actions and the number of speed actions. In this study, n is 9 for the first two tasks; for the ultimate task, n is set to 27 with three speed control actions added.
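The product structure of the action space can be sketched as follows. The servo angles and duty cycles are illustrative assumptions; the paper does not list its exact values here.

```python
from itertools import product

# Illustrative candidate values only (not the paper's calibrated settings).
steer_angles = [-20.0, 0.0, 20.0]   # candidate servo steering angles mu(t)
duty_cycles  = [0.30, 0.40, 0.50]   # candidate motor duty cycles rho(t)

# n is the product of the two counts: 3 steering x 3 speed = 9 actions,
# matching the discretisation used for the first two tasks.
action_space = list(product(steer_angles, duty_cycles))
print(len(action_space))
```

Enlarging either list grows n multiplicatively, which is why the ultimate task's richer speed control raises n from 9 to 27.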

**Immediate reward:** The immediate reward is important in the RL algorithm because it directly influences the convergence curves and, in some cases, a fine adjustment of the reward parameters can drive the final policy to the opposite pole [29]. The agent always tries to maximize the reward it can obtain by taking the optimal action at each time step, so the immediate reward should be defined according to the optimization objective. The control objective of this work is to enable the vehicle to travel along the centerline (tasks a, b, and c) and to adapt its speed to different road conditions (task c). Keeping this objective in mind, a function of the offset from the centerline (tasks a, b, and c) and of the current vehicle speed (task c) is defined as the immediate reward, and a large penalty value is introduced for the situation in which the vehicle runs out of the boundaries. In this experiment, when the vehicle is far from the centerline and about to rush off the runway, it is commanded to reverse to the full steering angle, and this command is used as the current action to update the Q table in the current state. In the conventional RL process, the experimental platform has to be put back into the environment to continue learning after it crosses the boundaries; the present design instead lets the vehicle correct itself and continue learning when it is about to go out of bounds, which effectively reduces human operational factors. Meanwhile, there is no need to supervise the learning process at all times, which greatly improves learning efficiency. In the following, the equations for the immediate reward are given:
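A minimal sketch of the reward shaping described above. The boundary threshold, penalty magnitude, and speed weight are illustrative assumptions, not the paper's constants.

```python
# Assumed constants for illustration only (not taken from the paper).
E_MAX   = 8.0      # offset (cm) beyond which the vehicle is about to leave the runway
PENALTY = -100.0   # large penalty for running out of the boundaries

def immediate_reward(e, v, speed_weight=0.1):
    """Reward small centerline offsets; mildly reward speed; punish leaving the track."""
    if abs(e) >= E_MAX:                  # out-of-bounds situation
        return PENALTY
    return -abs(e) + speed_weight * v    # staying centered and going faster both pay off

print(immediate_reward(2.0, 1.0), immediate_reward(9.0, 1.0))
```

The large penalty dominates the shaped term, so the learned policy treats boundary avoidance as the priority and speed as a secondary objective.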

#### 3.3. The Q-Learning Algorithm

Algorithm 1. The Q-learning algorithm pseudo code.

```
Algorithm: RL: Q-learning algorithm
1. Initialize Q(s, a) arbitrarily
2. Repeat (for each episode):
3.     Initialize s
4.     Repeat (for each step of episode):
5.         Choose a from s using policy derived from Q (e.g., ε-greedy)
6.         Take action a, observe r, s'
7.         Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
8.         s ← s'
9.     Until s is terminal
```
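The update on line 7 can be exercised end to end on a toy problem. The sketch below trains tabular Q-learning on a hypothetical seven-bin centerline-offset chain; the bins, actions, and rewards are stand-ins invented for this example, not the paper's vehicle or track.

```python
import random

N_BINS, CENTER = 7, 3          # discretised offset-from-centerline bins; bin 3 is the centerline
ACTIONS = [-1, 0, +1]          # steer left / straight / right (one bin per step)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

def step(s, a):
    s2 = s + ACTIONS[a]
    if s2 < 0 or s2 >= N_BINS:           # ran off the track: large penalty, episode ends
        return s2, -100.0, True
    return s2, -abs(s2 - CENTER), False  # reward peaks (0) on the centerline

def greedy(Q, s):
    return max(range(len(ACTIONS)), key=lambda a: Q[(s, a)])

random.seed(0)
Q = {(s, a): 0.0 for s in range(N_BINS) for a in range(len(ACTIONS))}
for _ in range(500):
    s = random.randrange(N_BINS)
    for _ in range(30):
        a = random.randrange(len(ACTIONS)) if random.random() < EPS else greedy(Q, s)
        s2, r, done = step(s, a)
        # Line 7 of Algorithm 1: bootstrap from the greedy value of s'
        target = r if done else r + GAMMA * max(Q[(s2, b)] for b in range(len(ACTIONS)))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        if done:
            break
        s = s2

# The greedy policy should steer back toward the centerline from either side.
print(ACTIONS[greedy(Q, 0)], ACTIONS[greedy(Q, 6)])
```

Because the target uses max over a' regardless of the action actually taken, Q-learning is off-policy; each step costs a single table update, which is what keeps its per-step computation low on the vehicle.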

#### 3.4. The Sarsa Algorithm

Algorithm 2. The Sarsa algorithm pseudo code.

```
Algorithm: RL: Sarsa algorithm
1. Initialize Q(s, a) arbitrarily
2. Repeat (for each episode):
3.     Initialize s
4.     Choose a from s using policy derived from Q (e.g., ε-greedy)
5.     Repeat (for each step of episode):
6.         Take action a, observe r, s'
7.         Choose a' from s' using policy derived from Q (e.g., ε-greedy)
8.         Q(s, a) ← Q(s, a) + α[r + γ Q(s', a') − Q(s, a)]
9.         s ← s'; a ← a'
10.    Until s is terminal
```
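The on-policy character of Sarsa shows up in code as a single change from Q-learning: the bootstrap value is Q(s', a') for the action the behavior policy actually selects. The toy seven-bin offset chain below is an illustrative stand-in, not the paper's environment.

```python
import random

N_BINS, CENTER = 7, 3
ACTIONS = [-1, 0, +1]                    # steer left / straight / right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

def step(s, a):
    s2 = s + ACTIONS[a]
    if s2 < 0 or s2 >= N_BINS:
        return s2, -100.0, True          # out of bounds: penalty, episode ends
    return s2, -abs(s2 - CENTER), False

def policy(Q, s):                        # ε-greedy action selection
    if random.random() < EPS:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[(s, a)])

random.seed(1)
Q = {(s, a): 0.0 for s in range(N_BINS) for a in range(len(ACTIONS))}
for _ in range(500):
    s = random.randrange(N_BINS)
    a = policy(Q, s)                     # line 4: choose a before the step loop
    for _ in range(30):
        s2, r, done = step(s, a)
        if done:
            Q[(s, a)] += ALPHA * (r - Q[(s, a)])
            break
        a2 = policy(Q, s2)               # line 7: a' comes from the same ε-greedy policy
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2

best = lambda s: ACTIONS[max(range(3), key=lambda a: Q[(s, a)])]
print(best(0), best(6))
```

Because exploration noise enters the target, Sarsa's learned values account for its own occasional exploratory mistakes, which tends to make it more conservative near the track boundaries than Q-learning.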

#### 3.5. The Sarsa (λ) Algorithm

Algorithm 3. The Sarsa (λ) algorithm pseudo code.

```
Algorithm: RL: Sarsa(λ) algorithm
1. Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
2. Repeat (for each episode):
3.     E(s, a) = 0, for all s ∈ S, a ∈ A(s)
4.     Initialize S, A
5.     Repeat (for each step of episode):
6.         Take action A, observe R, S'
7.         Choose A' from S' using policy derived from Q (e.g., ε-greedy)
8.         δ ← R + γ Q(S', A') − Q(S, A)
9.         E(S, A) ← E(S, A) + 1
10.        For all s ∈ S, a ∈ A(s):
11.            Q(s, a) ← Q(s, a) + α δ E(s, a)
12.            E(s, a) ← γ λ E(s, a)
13.        S ← S'; A ← A'
14.    Until S is terminal
```
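Lines 9 to 12 are the distinctive part: one TD error δ updates every state-action pair in proportion to its eligibility trace E. The sketch below uses accumulating traces on the same illustrative seven-bin offset chain (an assumption for this example, not the paper's track); note the inner loop over the whole table, which is the source of the higher per-step cost reported later.

```python
import random

N_BINS, CENTER = 7, 3
ACTIONS = [-1, 0, +1]
ALPHA, GAMMA, EPS, LAM = 0.3, 0.9, 0.1, 0.8

def step(s, a):
    s2 = s + ACTIONS[a]
    if s2 < 0 or s2 >= N_BINS:
        return s2, -100.0, True
    return s2, -abs(s2 - CENTER), False

def policy(Q, s):
    if random.random() < EPS:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[(s, a)])

random.seed(2)
Q = {(s, a): 0.0 for s in range(N_BINS) for a in range(len(ACTIONS))}
for _ in range(500):
    E = {sa: 0.0 for sa in Q}            # line 3: traces reset at each episode
    s = random.randrange(N_BINS)
    a = policy(Q, s)
    for _ in range(30):
        s2, r, done = step(s, a)
        a2 = None if done else policy(Q, s2)
        bootstrap = 0.0 if done else GAMMA * Q[(s2, a2)]
        delta = r + bootstrap - Q[(s, a)]    # line 8
        E[(s, a)] += 1.0                     # line 9: accumulating trace
        for sa in Q:                         # lines 10-12: sweep the whole table
            Q[sa] += ALPHA * delta * E[sa]
            E[sa] *= GAMMA * LAM
        if done:
            break
        s, a = s2, a2

best = lambda s: ACTIONS[max(range(3), key=lambda a: Q[(s, a)])]
print(best(0), best(6))
```

The trace lets a single boundary penalty propagate credit back along the whole recent trajectory in one step, which is why Sarsa (λ) converges in fewer episodes than one-step methods.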

#### 3.6. Dyna-Q Algorithm

Algorithm 4. The Dyna-Q algorithm pseudo code.

```
Algorithm: RL: Dyna-Q algorithm
1. Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A(s)
2. Do forever:
3.     S ← current (nonterminal) state
4.     A ← ε-greedy(S, Q)
5.     Execute action A; observe resultant reward R and state S'
6.     Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
7.     Model(S, A) ← R, S' (assuming deterministic environment)
8.     Repeat n times:
9.         S ← random previously observed state
10.        A ← random action previously taken in S
11.        R, S' ← Model(S, A)
12.        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
```
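Dyna-Q pairs each real interaction (lines 5 to 7) with n simulated updates replayed from a learned deterministic model (lines 8 to 12). The toy seven-bin offset chain below is an illustrative assumption standing in for the vehicle and track.

```python
import random

N_BINS, CENTER, N_PLAN = 7, 3, 10        # N_PLAN is the n of line 8
ACTIONS = [-1, 0, +1]
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

def step(s, a):
    s2 = s + ACTIONS[a]
    if s2 < 0 or s2 >= N_BINS:
        return s2, -100.0, True          # out of bounds: terminal penalty
    return s2, -abs(s2 - CENTER), False

random.seed(3)
Q = {(s, a): 0.0 for s in range(N_BINS) for a in range(len(ACTIONS))}
model = {}                               # Model(S, A) -> (R, S', done)

def q_update(s, a, r, s2, done):         # lines 6 and 12 share the same update
    future = 0.0 if done else GAMMA * max(Q[(s2, b)] for b in range(len(ACTIONS)))
    Q[(s, a)] += ALPHA * (r + future - Q[(s, a)])

for _ in range(200):
    s = random.randrange(N_BINS)
    for _ in range(30):
        if random.random() < EPS:        # line 4: ε-greedy behavior
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        q_update(s, a, r, s2, done)      # line 6: direct RL update
        model[(s, a)] = (r, s2, done)    # line 7: deterministic model learning
        for _ in range(N_PLAN):          # lines 8-12: planning from remembered transitions
            ps, pa = random.choice(list(model))
            pr, ps2, pdone = model[(ps, pa)]
            q_update(ps, pa, pr, ps2, pdone)
        if done:
            break
        s = s2

best = lambda s: ACTIONS[max(range(3), key=lambda b: Q[(s, b)])]
print(best(0), best(6))
```

The planning loop squeezes extra value out of each costly real interaction, which is why Dyna-Q needs the fewest real episodes to converge while each step remains cheaper than a full Sarsa (λ) trace sweep.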

## 4. Experimental Results and Discussion

#### 4.1. Tracking Control at Constant Vehicle Velocity

#### 4.2. Tracking Control While Learning to Increment the Vehicle Speed

#### 4.3. Steering Control at Adaptive Velocity

## 5. Conclusions

- The Q-learning and Sarsa (λ) algorithms can achieve better tracking behavior than the conventional Sarsa algorithm, although all three converge within a small number of training episodes. In addition, the convergence speed and the final tracking behavior of Sarsa (λ) appear to be better than those of Q-learning by a small margin, but Q-learning outperforms Sarsa (λ) in terms of computational complexity and is therefore more suitable for the vehicle's real-time learning and control.
- The Dyna-Q method, which learns both from its model and from environment interaction, performs similarly to the Sarsa (λ) algorithm, but with a significant reduction in computational time.
- The Q-learning algorithm, with a good balance between convergence speed, computational complexity, and final control behavior, performs better than a fine-tuned PID controller in terms of adaptability and tuning efficiency. The Q-learning method can also be easily applied to control problems with more than one control action, making it more suitable for self-driving vehicle control, where the steering angle and vehicle speed need to be regulated simultaneously.

## Author Contributions

## Funding

## Data Availability

## Conflicts of Interest

## References

1. Paden, B.; Čáp, M.; Yong, S.Z.; Yershov, D.; Frazzoli, E. A Survey of Motion Planning and Control Techniques for Self-Driving Urban Vehicles. IEEE Trans. Intell. Veh. **2016**, 1, 33–55.
2. Broggi, A.; Cerri, P.; Debattisti, S.; Laghi, M.C.; Medici, P.; Molinari, D.; Panciroli, M.; Prioletti, A. PROUD—Public Road Urban Driverless-Car Test. IEEE Trans. Intell. Transp. Syst. **2015**, 16, 3508–3519.
3. Li, L.; Huang, W.; Liu, Y.; Zheng, N.; Wang, F. Intelligence Testing for Autonomous Vehicles: A New Approach. IEEE Trans. Intell. Veh. **2016**, 1, 158–166.
4. Xu, Z.; Wang, M.; Zhang, F.; Jin, S.; Zhang, J.; Zhao, X. Patavtt: A hardware-in-the-loop scaled platform for testing autonomous vehicle trajectory tracking. J. Adv. Transp. **2017**, 1–11.
5. From the Lab to the Street: Solving the Challenge of Accelerating Automated Vehicle Testing. Available online: http://www.hitachi.com/rev/archive/2018/r2018_01/trends2/index.html/ (accessed on 1 September 2019).
6. Ruz, M.L.; Garrido, J.; Vazquez, F.; Morilla, F. Interactive Tuning Tool of Proportional-Integral Controllers for First Order Plus Time Delay Processes. Symmetry **2018**, 10, 569.
7. Liu, X.; Shi, Y.; Xu, J. Parameters Tuning Approach for Proportion Integration Differentiation Controller of Magnetorheological Fluids Brake Based on Improved Fruit Fly Optimization Algorithm. Symmetry **2017**, 9, 109.
8. Chee, F.; Fernando, T.L.; Savkin, A.V.; Heeden, V.V. Expert PID Control System for Blood Glucose Control in Critically Ill Patients. IEEE Trans. Inf. Technol. Biomed. **2003**, 7, 419–425.
9. Savran, A. A multivariable predictive fuzzy PID control system. Appl. Soft Comput. **2013**, 13, 2658–2667.
10. Lopez-Franco, C.; Gomez-Avila, J.; Alanis, A.Y.; Arana-Daniel, N.; Villaseñor, C. Visual Servoing for an Autonomous Hexarotor Using a Neural Network Based PID Controller. Sensors **2017**, 17, 1865.
11. Moriyama, K.; Nakase, K.; Mutoh, A.; Inuzuka, N. The Resilience of Cooperation in a Dilemma Game Played by Reinforcement Learning Agents. In Proceedings of the IEEE International Conference on Agents (ICA), Beijing, China, 6–9 July 2017.
12. Meng, Q.; Tholley, I.; Chung, P.W.H. Robots learn to dance through interaction with humans. Neural Comput. Appl. **2014**, 24, 117–124.
13. Zhang, Z.; Zheng, L.; Li, N.; Wang, W.; Zhong, S.; Hu, K. Minimizing mean weighted tardiness in unrelated parallel machine scheduling with reinforcement learning. Comput. Oper. Res. **2012**, 39, 1315–1324.
14. Iwata, K. An Information-Theoretic Analysis of Return Maximization in Reinforcement Learning. Neural Netw. **2011**, 24, 1074–1081.
15. Jalalimanesh, A.; Haghighi, H.S.; Ahmadi, A.; Soltani, M. Simulation-based optimization of radiotherapy: Agent-based modelling and reinforcement learning. Math. Comput. Simul. **2017**, 133, 235–248.
16. Fernandez-Gauna, B.; Marques, I.; Graña, M. Undesired state-action prediction in multi-agent reinforcement learning for linked multi-component robotic system control. Inf. Sci. **2013**, 232, 309–324.
17. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018; pp. 97–113.
18. Liu, T.; Zou, Y.; Liu, D.; Sun, F. Reinforcement Learning-Based Energy Management Strategy for a Hybrid Electric Tracked Vehicle. Energies **2015**, 8, 7243–7260.
19. Sistani, M.B.N.; Hesari, S. Decreasing Induction Motor Loss Using Reinforcement Learning. J. Autom. Control Eng. **2016**, 4, 13–17.
20. Shen, H.; Tan, Y.; Lu, J.; Wu, Q.; Qiu, Q. Achieving Autonomous Power Management Using Reinforcement Learning. ACM Trans. Des. Autom. Electron. Syst. **2013**, 18, 1–24.
21. Anderlini, E.; Forehand, D.I.M.; Stansell, P.; Xiao, Q.; Abusara, M. Control of a Point Absorber using Reinforcement Learning. IEEE Trans. Sustain. Energy **2016**, 7, 1681–1690.
22. Sun, J.; Huang, G.; Sun, G.; Yu, H.; Sangaiah, A.K.; Chang, V. A Q-Learning-Based Approach for Deploying Dynamic Service Function Chains. Symmetry **2018**, 10, 646.
23. Aissani, N.; Beldjilali, B.; Trentesaux, D. Dynamic scheduling of maintenance tasks in the petroleum industry: A reinforcement approach. Eng. Appl. Artif. Intell. **2009**, 22, 1089–1103.
24. Habib, A.; Khan, M.I.; Uddin, J. Optimal Route Selection in Complex Multi-stage Supply Chain Networks using SARSA(λ). In Proceedings of the 19th International Conference on Computer and Information Technology, North South University, Dhaka, Bangladesh, 18–20 December 2016.
25. Li, Z.; Lu, Y.; Shi, Y.; Wang, Z.; Qiao, W.; Liu, Y. A Dyna-Q-Based Solution for UAV Networks Against Smart Jamming Attacks. Symmetry **2019**, 11, 617.
26. Mit-Racecar. Available online: http://www.github.com/mit-racecar/ (accessed on 28 April 2019).
27. Berkeley Autonomous Race Car. Available online: http://www.barc-project.com/ (accessed on 28 April 2019).
28. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the Game of Go without Human Knowledge. Nature **2017**, 550, 354–359.
29. Pandey, P.; Pandey, D.; Kumar, S. Reinforcement Learning by Comparing Immediate Reward. Int. J. Comput. Sci. Inf. Secur. **2010**, 8, 1–5.
30. Liu, T.; Hu, X.; Li, S.E.; Cao, D. Reinforcement Learning Optimized Look-Ahead Energy Management of a Parallel Hybrid Electric Vehicle. IEEE/ASME Trans. Mechatron. **2017**, 22, 1497–1507.

**Figure 1.** Small- and full-scale platforms for developing and testing the intelligent algorithm for self-driving vehicles.

**Figure 12.** The number of out-of-bound events and the accumulated rewards per episode for Q-learning, Sarsa, and Sarsa (λ).

**Figure 13.** (**a**) The number of out-of-bound events and (**b**) the accumulated rewards per episode for Dyna-Q and Sarsa (λ).

| | Q-Learning | Sarsa | Sarsa (λ) | Dyna-Q |
|---|---|---|---|---|
| Minimum episodes to converge | 45 | 50 | 28 | 20 |
| Final accumulated reward per episode | 9511 | 6578 | 12,204 | 12,506 |
| Single step calculation time (%) * | 100 | 100 | 1335 | 766 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hu, B.; Li, J.; Yang, J.; Bai, H.; Li, S.; Sun, Y.; Yang, X.
Reinforcement Learning Approach to Design Practical Adaptive Control for a Small-Scale Intelligent Vehicle. *Symmetry* **2019**, *11*, 1139.
https://doi.org/10.3390/sym11091139
