Toward Self-Driving Bicycles Using State-of-the-Art Deep Reinforcement Learning Algorithms

In this paper, we propose a controller for a bicycle using the DDPG (Deep Deterministic Policy Gradient) algorithm, which is a state-of-the-art deep reinforcement learning algorithm. We use a reward function and a deep neural network to build the controller. By using the proposed controller, a bicycle can not only be stably balanced but also travel to any specified location. We confirm that the controller with DDPG shows better performance than the other baselines such as Normalized Advantage Function (NAF) and Proximal Policy Optimization (PPO). For the performance evaluation, we implemented the proposed algorithm in various settings such as fixed and random speed, start location, and destination location.


Introduction
Bicycles are efficient vehicles: they are environment-friendly, affordable, and user-friendly. However, because bicycle dynamics are unstable, riders must spend significant effort practicing and must remain attentive while riding. Self-driving cars and autonomous air vehicles already exist, but self-driving bicycles are still being developed [1]. Many studies have proposed methods to improve such bicycles in terms of both their mechanisms and their controllers [2][3][4][5]. In particular, studies [2,3] focus on the physical enhancement of bicycles, and studies [4,5] develop bicycle controllers based on control theory and bicycle dynamics. However, the proposed controllers only work properly in simulation environments and fail in the real world because of the disturbances present in a real environment. A reinforcement learning-based controller can interact with its surroundings and adapt to various environments [6][7][8]. Despite these possibilities, very few studies have used reinforcement learning to develop bicycle controllers [5,9,10]. Randlov [9] built a controller based on the SARSA algorithm [11], which could not handle the highly variable state and action space. Jie Tan [5] used shallow neural network controllers whose parameters are trained with policy gradient methods [12]. However, shallow neural network controllers are limited in their ability to represent highly nonlinear environments such as bicycles. Tuyen [10] used a deep neural network to represent the bicycle controller. In his implementation, the controller is trained quickly using the deep deterministic policy gradient (DDPG) algorithm [13]. The controller allows the bicycle to balance itself perfectly but fails to lead the bicycle to a given location. In this study, we propose an improved controller that can lead the bicycle to any location. Note that a bicycle moving at a fixed speed is difficult to turn. Therefore, in this paper, we extend the existing bicycle models so that the bicycle velocity can be controlled by a neural network controller.
This paper is an extended version of the work published in [10,14]. We extend our previous work by modifying the bicycle dynamics and training the controller to adapt to the new dynamics. The contributions of this paper are as follows. First, we redefine the bicycle dynamics so that the velocity of the bicycle can be controlled adaptively. Second, we propose a learning process that uses a reward function not only for balancing the bicycle but also for leading it to a given destination. The learning process uses the DDPG algorithm.
The rest of the paper is arranged as follows. Section 2 briefly reviews background knowledge. Section 3 introduces the dynamics of the bicycle. Section 4 describes the overall learning process. Section 5 shows the results of our proposed controller. Finally, Section 6 summarizes our work and provides some future directions on this topic.

Reinforcement Learning and Policy Gradient Based Method
Reinforcement learning (RL) [11] is a subfield of machine learning that trains an agent using reward values received directly from the environment. An RL problem is formally described as a Markov decision process (MDP). It is modelled as a tuple $(S, A, P, r, \gamma)$, which consists of a state space $S$; an action space $A$; a transition function $P(s_{t+1}|s_t, a_t)$ that predicts the next state $s_{t+1}$ given the current state-action pair $(s_t, a_t)$; a reward function $r(s_t, a_t)$ that defines the immediate reward achieved at each state-action pair; and a discount factor $\gamma \in (0, 1)$. The agent collects a sequence of state-action pairs $(s_t, a_t)$ called a trajectory $\xi$ (e.g., an episode or rollout) with discounted cumulative reward given by
$$R(\xi) = \sum_{t \geq 0} \gamma^{t} r(s_t, a_t).$$
The policy of an RL problem is a function $\pi$ that maps the state space to the action space. An RL algorithm tries to find an optimal policy $\pi^{*}$ that maximizes the expected total discounted reward:
$$\pi^{*} = \arg\max_{\pi} J(\pi) = \arg\max_{\pi} \mathbb{E}_{\xi \sim \pi}\left[R(\xi)\right].$$
Policy gradient-based methods are one approach to finding the optimal policy. In these methods, the policy is parameterized by a parameter vector $\theta$ and is updated along the gradient direction of the expected total discounted reward as
$$\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta} J(\pi(\theta_{k})),$$
where $\alpha$ denotes the learning rate and $k$ is the current update number.
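As a small illustration of the two quantities defined above, the following sketch computes the discounted cumulative reward of one trajectory and applies a single gradient-ascent update to a parameter vector. The function names and the default values of `gamma` and `alpha` are assumptions made for this example and are not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward sum_t gamma^t * r_t of one trajectory.

    gamma=0.99 is an assumed default, not a value reported in the paper.
    """
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def gradient_ascent_step(theta, grad, alpha=0.01):
    """One policy-gradient update theta_{k+1} = theta_k + alpha * grad.

    `grad` stands for an estimate of the gradient of the expected return;
    how it is estimated depends on the specific algorithm.
    """
    return [t + alpha * g for t, g in zip(theta, grad)]


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(gradient_ascent_step([0.0, 1.0], [1.0, -2.0], alpha=0.1))
```

The backward accumulation avoids computing powers of the discount factor explicitly.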

Deep Deterministic Policy Gradient Algorithm
DDPG [13] is an off-policy algorithm that uses deep neural networks to represent the policy. Its main features are as follows. First, the algorithm inherits the actor-critic framework [11], so it has two components: the actor and the critic. The actor is responsible for the policy: it receives a state as input and generates an action. The critic estimates the action value function, which is used to assess the quality of the actor. Second, the algorithm uses two deep neural networks, one each for the actor and the critic. Such a framework is powerful enough to represent a highly nonlinear task such as controlling a bicycle. Third, the algorithm uses the deterministic policy gradient [15] to train the actor network:
$$\nabla_{\theta^{\mu}} J(\pi(\theta^{\mu})) = \mathbb{E}\left[\nabla_{a} Q(s, a|\theta^{Q})\big|_{a=\mu(s|\theta^{\mu})} \, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\right],$$
where $\nabla_{\theta^{\mu}} J(\pi(\theta^{\mu}))$ is the policy gradient, $\nabla_{a} Q(s, a|\theta^{Q})$ is the gradient of the action value function w.r.t. the action $a$, and $\nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})$ is the gradient of the actor w.r.t. its parameters $\theta^{\mu}$. Finally, the DDPG algorithm inherits two features from Deep Q-Learning [16]. The first is maintaining a copy of each network, i.e., copies of the actor network and the critic network. These copies improve the stability of the learning process. The second is maintaining a replay memory that stores the sample data collected while interacting with the environment. At each time step, we randomly sample a batch of data from the replay memory and use it to train the networks. The replay memory removes the correlation between consecutive data samples. Using a deterministic policy is more stable than using a stochastic policy in which the actions are drawn from a distribution.
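The replay memory described above can be sketched with the Python standard library as follows. This is a minimal illustration rather than the implementation used in the experiments; the class name `ReplayMemory` and its method names are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state) samples."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest sample once it is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes samples from different
        # time steps, which removes the correlation of consecutive samples.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=3)
    for step in range(5):
        memory.store([float(step)], [0.0], -1.0, [float(step + 1)])
    print(len(memory), memory.sample(2))
```

Copying the deque into a list on every call keeps the sketch short; a large memory would more likely use a preallocated array with a write index.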

Extended Bicycle Dynamics
Bicycle dynamics have been studied by many researchers [2,5,9,17]. The bicycle model first studied by Randlov [9] is illustrated in Figure 1. This paper not only builds on the work of Randlov [9] but also extends it so that the velocity of the bicycle is dynamic. In particular, the bicycle has a six-dimensional state $(\omega, \dot{\omega}, \ddot{\omega}, \theta, \dot{\theta}, \psi_g)$, where $\omega$, $\dot{\omega}$, $\ddot{\omega}$ are the angle, angular velocity, and angular acceleration of the bicycle relative to the vertical plane; $\theta$, $\dot{\theta}$ are the angle and angular velocity of the handlebars; and $\psi_g$ is the angle formed by the bicycle and a specified goal $g$. The bicycle states are shown in Figure 2. To control the bicycle, the agent chooses three actions. The first action is the torque ($T$) applied to the handlebar. The second action is the displacement ($d$) between the center of mass and the bicycle plane (Figure 3). The third action is the force ($F$) applied to the pedal of the bicycle. The equations of the position of the tires for the front tire: and for the back tire: where $\psi$ is the angle between the bicycle and the horizontal line, and $r_f$ and $r_b$ (Figure 4) are the turning radii of the front tire and back tire, respectively. The radii are given by: and The angular acceleration $\ddot{\omega}$ can be calculated as: where the angle $\phi$ is the total angle of tilt of the center of mass (CM) (Figure 3), and is defined as: The angular acceleration $\ddot{\theta}$ of the front tire and the handlebar is The moment of inertia has the formula: where the various moments of inertia for a tire (Figure 5) are estimated as: The velocity of the bicycle can be adjusted by learning the force applied to the pedal. Figure 6 shows how the force applied to the pedal is transmitted to the bicycle. Using the static equilibrium assumption, we can write the following torque equations: and Since $F_2 = F_3$, we can combine the above two equations to give an expression for $F_4$: The force $F_4$ determines the acceleration of the bicycle. In particular, the acceleration of the bicycle is as follows: From the acceleration, we can calculate the velocity of the bicycle: Various parameters for the bicycle dynamics are shown in Table 1.

Network Structure
Two networks are used for training, namely a critic network and an actor network, which are illustrated in Figure 7. The input of the actor network is a 6-dimensional state vector, and its output is a 3-dimensional action vector. The input of the critic network consists of both a state vector and an action vector, and its output is a Q-value (action value). The configurations of the actor network and critic network are shown in Table 2. The two hidden layers of the actor network have 300 units and 400 units, respectively, while both hidden layers of the critic network have 200 units. The action vector only joins the critic network at the second hidden layer. The parameters of the networks are initialized randomly and are optimized using the ADAM algorithm [18]. The target networks are updated using a soft-updating technique with learning rate $\tau = 0.001$. We did not try different numbers of layers or units. From our experience, however, neural networks are quite flexible and can cope with a variety of settings.
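To make the layer sizes concrete, the following pure-Python sketch runs a forward pass through an actor with hidden layers of 300 and 400 units and a critic with two hidden layers of 200 units, where the action is concatenated at the second hidden layer. The activation functions (ReLU in the hidden layers, tanh at the actor output), the initialization range, and all function and class names are assumptions for illustration; the actual configuration is the one given in Table 2.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    bound = 1.0 / math.sqrt(n_in)  # assumed initialization range
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation):
    """Fully connected layer: activation(W x + b)."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


class Actor:
    """6-dimensional state -> 300 -> 400 -> 3-dimensional action."""

    def __init__(self, rng):
        self.h1 = init_layer(6, 300, rng)
        self.h2 = init_layer(300, 400, rng)
        self.out = init_layer(400, 3, rng)

    def forward(self, state):
        h = dense(state, self.h1, relu)
        h = dense(h, self.h2, relu)
        return dense(h, self.out, math.tanh)  # assumed bounded output


class Critic:
    """State -> 200, then [hidden, action] -> 200 -> scalar Q-value."""

    def __init__(self, rng):
        self.h1 = init_layer(6, 200, rng)
        self.h2 = init_layer(200 + 3, 200, rng)  # the action joins here
        self.out = init_layer(200, 1, rng)

    def forward(self, state, action):
        h = dense(state, self.h1, relu)
        h = dense(h + list(action), self.h2, relu)
        return dense(h, self.out, identity)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    actor, critic = Actor(rng), Critic(rng)
    state = [0.1, -0.2, 0.0, 0.05, 0.0, 0.3]
    action = actor.forward(state)
    print(action, critic.forward(state, action))
```

A bounded actor output would still have to be rescaled to the physical ranges of the torque, the displacement, and the pedal force.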

Network Training
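Critic network training. The critic comprises a main critic network $Q$ parameterized by $\theta^{Q}$ and a target critic network $Q'$ parameterized by $\theta^{Q'}$. At every discrete time step, the main network is updated using a batch of samples drawn from the replay memory. In particular, $\theta^{Q}$ is optimized to minimize the loss function
$$L(\theta^{Q}) = \frac{1}{N} \sum_{i} \left(y_i - Q(s_i, a_i|\theta^{Q})\right)^2, \qquad y_i = r_i + \gamma Q'\left(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'}) \,\middle|\, \theta^{Q'}\right),$$
where $N$ is the batch size, $i$ indicates the $i$th sample in the batch, and $\theta^{Q'}$ is coupled with $\theta^{Q}$ using a soft-updating technique with learning rate $\tau$:
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}.$$
Actor network training. Similarly, the actor comprises a main actor network $\mu$ parameterized by $\theta^{\mu}$ and a target actor network $\mu'$ parameterized by $\theta^{\mu'}$. The main actor network is updated using the deterministic policy gradient theorem [15]:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s=s_i},$$
where $\nabla_{a} Q$ is the gradient of the critic w.r.t. the action $a$ and $\nabla_{\theta^{\mu}} \mu$ is the gradient of the actor w.r.t. its parameters $\theta^{\mu}$. The target parameters $\theta^{\mu'}$ are updated using a soft-updating technique with learning rate $\tau$:
$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}.$$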
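The critic target, the critic loss, and the soft update of the target parameters follow the standard DDPG formulation [13] and can be sketched on plain Python lists as below. The function names are hypothetical, parameters are flattened into lists for simplicity, and the default `gamma` is an assumption; only `tau = 0.001` is taken from the text.

```python
def td_targets(rewards, next_q_values, gamma=0.99):
    """Targets y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).

    `next_q_values` are assumed to come from the target critic evaluated
    at the target actor's action; gamma=0.99 is an assumed default.
    """
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]


def critic_loss(targets, q_values):
    """Mean squared error between the targets and the main critic's outputs."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n


def soft_update(target_params, main_params, tau=0.001):
    """Move each target parameter a fraction tau toward the main parameter."""
    return [tau * m + (1.0 - tau) * t
            for t, m in zip(target_params, main_params)]


if __name__ == "__main__":
    y = td_targets([1.0, 0.0], [2.0, 0.0], gamma=0.5)
    print(y, critic_loss(y, [1.0, 0.0]))
    print(soft_update([0.0, 0.0], [1.0, -1.0]))
```

With a small `tau`, the target networks change slowly, which keeps the regression targets of the critic nearly stationary between updates.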

Learning Process
Reinforcement learning is applied to the bicycle problem as follows. The state of the bicycle is $(\omega, \dot{\omega}, \ddot{\omega}, \theta, \dot{\theta}, \psi_g)$, where $\omega$, $\dot{\omega}$, $\ddot{\omega}$ are the angle, angular velocity, and angular acceleration of the bicycle relative to the vertical plane; $\theta$, $\dot{\theta}$ are the angle and angular velocity of the handlebars; and $\psi_g$ is the angle formed by the bicycle and a specified goal $g$. These states are sent to the controller at each time step, and the controller returns the displacement $d$, the torque $T$, and the force $F$ applied to the pedal.
Figure 8 summarizes how the DDPG algorithm trains the controller. Let $Q(s, a|\theta^{Q})$ and $Q'(s, a|\theta^{Q'})$ be the main network and target network of the critic, respectively; let $\mu(s|\theta^{\mu})$ and $\mu'(s|\theta^{\mu'})$ be the main network and the target network of the actor, respectively; and let $R$ be the experience replay. The learning process is described as follows:

DDPG Algorithm
(1) The agent observes state $s_t$ from the bicycle and feeds it to the actor network (step [1] in Figure 8) to estimate the next action $a_t$ (step [2] in Figure 8) as follows:
$$a_t = \mu(s_t|\theta^{\mu}) + \mathcal{N}_t,$$
where $\mathcal{N}_t$ is small random noise for exploring the action space.
(2) The bicycle transitions to the next state $s_{t+1}$ and returns reward $r_t$ to the agent.
(3) The sampled data $(s_t, a_t, r_t, s_{t+1})$ are then stored in the experience replay for later use (step [3] in Figure 8).
(4) From the experience replay memory, we randomly select a batch of $N$ samples and use them to train the networks (step [4] in Figure 8).
(7) Thereafter, the parameters of the target networks ($\theta^{\mu'}$ and $\theta^{Q'}$) are updated using soft-update techniques as in Equations (16) and (18).
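The interaction part of the procedure above (acting with exploration noise, storing the transition, and sampling a batch) can be sketched as a single episode loop. This is only an outline under stated assumptions: `ToyEnv` is a hypothetical one-dimensional stand-in for the bicycle simulator, the `reset`/`step` interface and the `done` flag are assumed, and `train_step` is a placeholder for the network updates.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical 1-D stand-in for the bicycle simulator (illustration only)."""

    def reset(self):
        self.x = 1.0
        return [self.x]

    def step(self, action):
        self.x += 0.1 * action[0]
        reward = -self.x ** 2
        done = abs(self.x) > 5.0
        return [self.x], reward, done


def run_episode(env, actor, noise, memory, batch_size, train_step, max_steps=400):
    """Run one episode and return the sum of the collected rewards."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        # (1) Actor output plus exploration noise.
        action = [a + n for a, n in zip(actor(state), noise())]
        # (2) The environment moves to the next state and returns a reward.
        next_state, reward, done = env.step(action)
        # (3) Store the transition in the experience replay.
        memory.append((state, action, reward, next_state))
        # (4) Sample a batch and train once enough samples are available.
        if len(memory) >= batch_size:
            train_step(random.sample(list(memory), batch_size))
        total += reward
        state = next_state
        if done:
            break
    return total


if __name__ == "__main__":
    memory = deque(maxlen=500_000)
    batches = []
    total = run_episode(ToyEnv(), lambda s: [-s[0]], lambda: [0.0],
                        memory, batch_size=4, train_step=batches.append,
                        max_steps=10)
    print(total, len(memory), len(batches))
```

In a full implementation, `train_step` would update the main critic and actor and then soft-update the target networks.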

Reward Function
The reward function is defined as follows:
$$r(s, a) = \begin{cases} \text{the last reward before falling down}, & \text{if } |\omega| > \frac{\pi}{6}, \\ -(\omega^2 + 0.1\dot{\omega}^2 + 0.01\ddot{\omega}^2) - 2.0\psi_g^2, & \text{otherwise}, \end{cases}$$
where the term $-(\omega^2 + 0.1\dot{\omega}^2 + 0.01\ddot{\omega}^2)$ is responsible for balancing the bicycle and the term $-2\psi_g^2$ is for leading the bicycle to the goal. In this reward function, the bicycle is considered to have fallen down if the angle between the bicycle and the vertical plane is greater than $\frac{\pi}{6}$ rad (30 degrees). When the bicycle falls down, the reward at this time is used until the end of the episode. The coefficients for each term are selected based on their contributions to the reward. In particular, we use a coefficient of 1.0 for $\omega^2$, which is the most important component of the balancing term. A coefficient of 2.0 is used for $\psi_g^2$ to highlight the importance of the go-to-goal term. Figure 9 shows the effects of the components on the reward value. Initially, the term $-2\psi_g^2$ is much smaller (more negative) than $-\omega^2$, $-0.1\dot{\omega}^2$, and $-0.01\ddot{\omega}^2$, which indicates that it contributes the most to the reward function. However, over the 5000 training episodes, the gap between the $-2\psi_g^2$ term and the other terms decreases and all terms tend to zero.
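One possible reading of this reward function is sketched below: before a fall the reward is the sum of the balancing and go-to-goal terms, and after a fall the last reward obtained before falling is repeated for the rest of the episode. The class name `RewardFunction`, its internal state, and the initial value of the stored reward are assumptions for illustration.

```python
import math

FALL_ANGLE = math.pi / 6  # 30 degrees, the fall threshold stated in the text


class RewardFunction:
    """Reward for one episode; create a new instance for each episode."""

    def __init__(self):
        self.fallen = False
        self.last_reward = 0.0  # assumed value if the bicycle falls at step 0

    def __call__(self, omega, omega_dot, omega_ddot, psi_g):
        if self.fallen or abs(omega) > FALL_ANGLE:
            # After a fall, keep returning the last reward seen before it.
            self.fallen = True
            return self.last_reward
        balance = -(omega ** 2 + 0.1 * omega_dot ** 2 + 0.01 * omega_ddot ** 2)
        goal = -2.0 * psi_g ** 2
        self.last_reward = balance + goal
        return self.last_reward


if __name__ == "__main__":
    reward = RewardFunction()
    print(reward(0.1, 0.0, 0.0, 0.5))  # upright: balancing term plus goal term
    print(reward(1.0, 0.0, 0.0, 0.0))  # tilt beyond pi/6: repeats the last reward
```

Keeping the fall state inside the object means the caller only has to pass the current state variables at each time step.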

Settings
The simulation environment is as follows. The operating system is Linux (Ubuntu 16.04 LTS) with 64 GB of DDR3 memory. We use PyCharm as the integrated development environment and Python as the programming language. The parameters of the algorithm are shown in Table 3, where we use the Ornstein-Uhlenbeck process [19] to explore the action space. The experience replay can hold up to 500,000 data samples. At each training step, we randomly draw a batch of 64 samples from the experience replay and use them to train the controller.
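A discretized Ornstein-Uhlenbeck process of the kind commonly used for exploration in DDPG can be sketched as follows. The parameter values `theta = 0.15`, `sigma = 0.2`, and `dt = 1.0` are widely used defaults assumed here for illustration; the values actually used are those listed in Table 3. The class name is hypothetical.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * dW."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.state = [mu] * size

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = [self.mu] * len(self.state)

    def sample(self):
        # Euler-Maruyama step: pull toward mu and add Gaussian noise.
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


if __name__ == "__main__":
    noise = OrnsteinUhlenbeckNoise(size=3, seed=0)  # one value per action
    for _ in range(3):
        print(noise.sample())
```

Because each sample depends on the previous one, successive perturbations of the action are correlated, which tends to produce smoother exploration than independent Gaussian noise.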

Baselines
NAF (Normalized Advantage Function) [20] uses a single Q neural network for the entire problem. To adapt to continuous control tasks, the Q network is decomposed into a state value term $V$ and an advantage term $A$:
$$Q(s, a|\theta^{Q}) = V(s|\theta^{V}) + A(s, a|\theta^{A}).$$
The advantage $A$ is parameterized as a quadratic function of nonlinear features of the state:
$$A(s, a|\theta^{A}) = -\frac{1}{2} (a - \mu(s|\theta^{\mu}))^{T} P(s|\theta^{P}) (a - \mu(s|\theta^{\mu})).$$
$P(s|\theta^{P})$ is a square matrix given by
$$P(s|\theta^{P}) = L(s|\theta^{P}) L(s|\theta^{P})^{T},$$
where $L(s|\theta^{P})$ is a lower-triangular matrix whose entries come from a linear output layer of a neural network. This representation of the Q-network can deal with continuous action tasks.
PPO (Proximal Policy Optimization) [21] is an on-policy algorithm. This means that the policy is learned from trajectories generated by the current policy rather than from trajectories stored in a replay memory. PPO avoids the computation required by constrained optimization. It implements the idea behind TRPO's constraint [22], which does not allow the policy to change too much, but uses a simpler formulation. The features of this algorithm can be summarized as follows. First, denote the probability ratio between the new and old policies as
$$r(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}.$$
PPO imposes the constraint by forcing $r(\theta)$ to stay within a small interval around 1, precisely $[1 - \epsilon, 1 + \epsilon]$, where $\epsilon$ is a hyperparameter:
$$J^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r(\theta)\hat{A}, \; \mathrm{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}\right)\right],$$
where $\hat{A}$ is the estimated advantage. The function $\mathrm{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon)$ clips the ratio to $[1 - \epsilon, 1 + \epsilon]$. The objective function of PPO takes the minimum of the original value and the clipped version, and therefore we lose the motivation for increasing the policy update to extremes to obtain better rewards.
When applying PPO to a network architecture with shared parameters for both the policy (actor) and value (critic) functions, in addition to the clipped objective, the objective function is augmented with an error term on the value estimation (the second term in the formula) and an entropy term (the third term in the formula) to encourage sufficient exploration:
$$J^{CLIP'}(\theta) = \mathbb{E}\left[J^{CLIP}(\theta) - c_1 \left(V_{\theta}(s) - V_{target}\right)^2 + c_2 H(s, \pi_{\theta}(\cdot))\right].$$
Here, both $c_1$ and $c_2$ are hyperparameter constants.
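The effect of the clipping in PPO can be checked numerically with the short sketch below, which evaluates the clipped surrogate for individual samples and averages it over a batch. The function names and the default `epsilon = 0.2` are assumptions for illustration, not values taken from this paper.

```python
def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(high, x))


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO term min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return min(ratio * advantage, clipped * advantage)


def ppo_objective(ratios, advantages, epsilon=0.2):
    """Batch average of the clipped surrogate (to be maximized)."""
    terms = [clipped_surrogate(r, a, epsilon)
             for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Positive advantage: a ratio above 1 + epsilon earns no extra objective.
    print(clipped_surrogate(1.5, 1.0))
    # Negative advantage: a ratio below 1 - epsilon earns no extra objective.
    print(clipped_surrogate(0.5, -1.0))
    print(ppo_objective([1.5, 0.5], [1.0, -1.0]))
```

In both printed cases the clipped value is the smaller one, so moving the ratio further from 1 in the favorable direction does not increase the objective.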

Results and Discussion
In this section, we compare the performance of a DDPG-based controller with that of the other baselines. In addition, we show the performance of the bicycle before and after considering the velocity. The results are shown below.

Simulation without Controlling the Velocity
Comparison with baselines. In the first evaluation, we compare the performance of a bicycle trained by the DDPG algorithm with that of a bicycle trained by other algorithms. In particular, we compare DDPG with the NAF (Normalized Advantage Function) algorithm and the PPO (Proximal Policy Optimization) algorithm. Both algorithms are state-of-the-art deep reinforcement learning algorithms that can deal with a continuous action space. DDPG and NAF produce deterministic policies, which are expected to have lower variance and more predictable performance than a stochastic policy such as the one learned by PPO. Figure 10 shows the performance of a bicycle that randomly starts at (50, 50) m and must reach a goal at (60, 65) m. The speed of the bicycle is fixed at 10 km/h and the displacement d is in the range from −20 cm to 20 cm. We report the performance over three runs of 5000 episodes. Each episode has 400 time steps. From the figure, we can see that the controller trained by DDPG outperforms the controllers trained by the other algorithms in terms of variance and reward values. At the beginning of the learning process, DDPG seems to have a larger variance than the other algorithms. However, at the end of the learning process, the variance of DDPG has decreased, while the variance of PPO has increased and is larger than that of the other algorithms. PPO (under development) is expected to gradually train a stable controller, but not one better than DDPG on the bicycle domain. The difficulty in tuning PPO's hyperparameters might be the reason for the reported results. In addition, PPO takes a long time to obtain the same performance as DDPG. Even though NAF can learn something on the bicycle domain, it cannot obtain a good controller for stabilizing the bicycle. Figure 11 shows the values of the 6-dimensional state along a successful trajectory. The figure shows that all of the states are stable and gradually converge to zero.
Fixed start-fixed goal. The second evaluation shows the performance of a bicycle with different values of d. The bicycle starts at (50, 50) m with a random direction and must reach a goal at (60, 65) m. The speed is fixed at 10 km/h. We train for 5000 episodes of 400 time steps. The result reported in Figure 12a shows that an agent using a large displacement to adjust the center of mass outperforms both an agent using a small displacement and an agent without displacement (only steering the handlebar). Intuitively, without d, the bicycle easily falls down when it tries to turn at a high speed. Figure 12 shows the trajectories of the back wheel of the bicycle during the learning process. During this time, the bicycle gradually reaches the goal (blue lines are early trajectories and red lines are late trajectories). However, a bicycle that starts facing away from the goal location often falls down.
Random start-fixed goal. The next evaluation is performed for a bicycle that starts at a random location and learns to reach a goal position at (150, 100) m. After around 700 episodes, the bicycle almost always reaches the goal from any starting location. The reward converges to zero and the average number of steps needed to reach the goal decreases from the initial value (4000) to around 1000 (Figure 13a). Figure 13 shows the trajectories of the bicycle during the learning process.
Random start-random goal. In the last evaluation of this section, we report the performance of a bicycle controller that is trained to start at random locations and reach a random goal. Figure 14a shows the same behavior as in the previous evaluations: the average cumulative reward converges to zero and the average number of steps decreases. Figure 14b shows 100 trajectories for a trained bicycle that starts at random locations and reaches pre-defined random goals. In all of the cases, the bicycle reaches the goal.
The results in Figure 16 indicate that the controller gradually improves: the average reward increases, while the average number of steps and the distance to the goal decrease. The speed of the bicycle decreases to 2 m/s in the first 100 steps and then increases a little before stabilizing at around 2 m/s. Intuitively, the trained controller decreases the velocity to help the bicycle remain stable while turning. However, due to the curse of dimensionality, the controller needs to explore more of the state space and action space, requiring training over millions of episodes to obtain an optimal controller.

Conclusions and Future Works
In this paper, we propose a method to control a bicycle using the DDPG algorithm and show that the bicycle can be successfully controlled. A controller with a deep neural network can rotate the bicycle handlebars, shift the center of gravity, and adjust the speed so that the bicycle can change direction without falling. The agent using the proposed neural network controller was able to reach a specified position and generate a gentle and smooth trajectory. For future work, first, we can enhance the controller by forcing the bicycle to follow a pre-defined trajectory, such as a bicycle running on a road. In addition, the DDPG algorithm requires picking a step size that falls into the right range. If it is too small, the training progress will be extremely slow. If it is too large, training tends to be overwhelmed by noise, leading to poor performance. The DDPG algorithm does not guarantee monotonic improvement of the controller's performance. Therefore, the second direction for future work is to use PPO with some modifications to obtain a more stable controller. Finally, building a real autonomous bicycle requires considering many aspects, such as the effect of a cyclist on the handlebar (T), the effect of the cyclist's foot on the pedaling force, and the hardware needed to build the bicycle. Most of these aspects have been ignored in this study to simplify the learning problem. A study of these aspects will be valuable for a full understanding of an autonomous bicycle.

Figure 1. Illustration of a bicycle.

Figure 3. The bicycle seen from behind. The thick line represents the bicycle [9].

Figure 4. The bicycle seen from above. The thick line represents the front tire [9].

Figure 5. Axis for the moments of inertia for a tire [9].

Figure 7. The two neural networks used in this paper.

Figure 8. Workflow of the deep deterministic policy gradient (DDPG) algorithm applied to the bicycle.

Figure 9. Contributions of the components to the reward function.

Figure 11. Observed states of a successful trajectory.

d is in the range from −2 cm to 2 cm.
d is in the range from −20 cm to 20 cm.
Performance of a bicycle with a fixed goal and random starting location.

Figure 14. Bicycle with random goal locations and random starting locations.

Figure 16. The performance of a bicycle starting at any place to reach a random location.

Table 2. Parameters of the actor network and the critic network.

Table 3. Parameters of the algorithm.