Online Reinforcement Learning-Based Control of an Active Suspension System Using the Actor Critic Approach

Abstract: In this paper, a controller learns to adaptively control an active suspension system using reinforcement learning without prior knowledge of the environment. The Temporal Difference (TD) advantage actor critic algorithm is used with the appropriate reward function. The actor produces the actions, and the critic criticizes the actions taken based on the new state of the system. During the training process, a simple and uniform road profile is used while maintaining constant system parameters. The controller is tested using two road profiles: the first one is similar to the one used during the training, while the other one is bumpy with an extended range. The performance of the controller is compared with the Linear Quadratic Regulator (LQR) and optimum Proportional-Integral-Derivative (PID), and the adaptiveness is tested by estimating some of the system's parameters using the Recursive Least Squares method (RLS). The results show that the controller outperforms the LQR in terms of the lower overshoot and the PID in terms of reducing the acceleration.


Introduction
Ride quality and passenger comfort are major concerns of any vehicle designer and have been well investigated over the past few decades. The vehicle suspension system plays an important role in vehicle safety, handling, and comfort. It keeps the tires in contact with the road, provides good handling and stable steering, and minimizes the vibrations and oscillations due to road irregularities, which ensures the comfort of the passengers. In general, suspension systems are divided into three types: passive, semi-active, and active. Most of today's vehicles use the passive system, which consists of a spring and a damper whose properties are fixed at the design stage and which provides acceptable performance only for a limited frequency range. Changing suspension properties such as the damping coefficient allows for good performance, but in a different frequency range. That is why sports cars have better handling than standard cars, whereas the latter have better ride quality. In a semi-active suspension system, energy is not added to the system; however, the system varies the viscous damping coefficient of the shock absorber. Therefore, the need for a system that attains the objectives over the entire working frequency range justifies the momentum for active suspension research.
In the active suspension system, an actuator is added to the spring and damper, which adds energy to the system by exerting an adaptive counter force, resulting in improved dynamic behavior. Several controllers have been applied to the active suspension problem. A discrete indirect pole assignment controller with a fuzzy logic gain scheduler was developed by [1]. A non-linear model of the active suspension was investigated by [2] using a sliding mode controller that utilizes the sky-hook damper system without the need for road profile input. There have been more recent attempts to improve the system, such as the study conducted by [3], which applied a two-loop PID to the half model in non-linear form. Compared to the passive system with the same parameters, they showed excellent improvement. However, the controller could not completely reject the disturbance. Another example of the use of the PID controller in the half model was developed by [4]. They studied three scenarios and tried three tuning methods, of which the iterative learning algorithm performed better than the other two. Reference [5] showed that fuzzy logic performed significantly better than the PID in the quarter model based on two types of road conditions. Another comparison based on the quarter model, but between a robust PI and a Linear Quadratic Regulator (LQR) controller, was conducted by [6], where they showed that the LQR outperformed the PI controller in the presence of parameter uncertainty. Another optimal controller is H∞, which was examined by [7] and showed a 93% reduction in the car's acceleration, wheel travel, and suspension travel.
Recently, machine learning has been widely investigated in control problems and applied to vehicle suspension. In many cases, neural networks are used in combination with traditional controllers including the sliding mode controller [8], PID [9], and LQR [10] to enhance the controller performance and other tasks like determining the road roughness [11], but they have rarely been used as the controller itself. A good example that motivates the use of machine learning methods as the controller is [12], where a neural network was trained by the optimal PID and surpassed it under parameter uncertainties. Another machine learning method that has recently gained momentum and shown great success in a broad range of applications including games and economics, but rarely applied to control problems, is reinforcement learning.
Despite the rare application of reinforcement learning to the active suspension problem, the earliest attempts, to the best of our knowledge, were conducted by [13][14][15]. Although there is a significant difference between their learning algorithms and the currently used algorithms, the core idea of learning by interaction and experience is the same. The three attempts shared the same idea of maximizing a cost function and learning to select the controller parameters to achieve the best performance. The learning algorithm implemented by [13] was able to achieve near optimal results compared with the Linear Quadratic Gaussian (LQG) under idealized conditions. Reference [15] had the same approach, but they introduced a new learning scheme, which allowed the controller after learning to work in certain conditions where the traditional LQG controller resulted in an unstable system. They also tested the learning method in a real vehicle, but with a semi-active suspension system installed, and they showed promising experimental results. A more recent successful attempt was the study done by [16], where they applied a stochastic real-valued reinforcement learning control to a non-linear quarter model. A similar approach to the one considered in this research was conducted by [17]; the actor critic networks were trained by the policy gradient method, and the controller was tested to some extent with the same road profile considered in this study. They compared their work with the passive suspension system and showed 62% improvement.
In this paper, an online deep reinforcement learning controller is developed using the TD advantage actor critic algorithm. The controller is applied to a quarter model active suspension system to investigate the performance of this learning-based algorithm against the Linear Quadratic Regulator (LQR) and the optimum Proportional-Integral-Derivative (PID). In order to handle possible system uncertainties, the algorithm is integrated with a Recursive Least Squares method (RLS) for online estimation of system damping coefficients.

Modeling of the Active Suspension System
The active suspension model considered in this paper is the linear quarter model shown in Figure 1. It consists of two masses supported by two dampers and two springs. The first is the sprung mass $m_s$, which represents the vehicle body and is supported by the suspension, modeled by the damper $b_s$ and the spring $k_s$. The wheel mass and the tire contact with the road are represented by the unsprung mass $m_{us}$, the damper $b_{us}$, and the stiffness $k_{us}$, respectively. The nominal values considered in this study are shown in Table 1, and the mathematical model of the active suspension is represented by the state-space model (1).

The states are defined as: the displacement of the sprung mass $x_s$, the vehicle body velocity $\dot{x}_s$, the displacement of the unsprung mass $x_{us}$, and the vertical velocity of the wheel $\dot{x}_{us}$. Other quantities can be obtained from these states, such as the suspension travel $x_s - x_{us}$ and the wheel deflection relative to the road profile $x_{us} - z_r$. $z_r$ and $\dot{z}_r$ represent the road profile and its rate of change. The actuator is positioned between the sprung and unsprung masses and generates a counter force $F_c$ to improve the system performance by reducing the acceleration and the suspension travel. The open loop response of the system can be obtained by setting $F_c = 0$, which corresponds to the performance of the passive suspension system. We assume the working range of the actuator force, also referred to as the action space later, to be between −60 N and 60 N.
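To make the model description concrete, the following is a minimal sketch of the quarter-car dynamics written directly from the textbook force balance on the two masses. The parameter values are placeholders chosen for illustration only (Table 1 is not reproduced here), and the names `QuarterCar`, `accelerations`, and `step` as well as the semi-implicit Euler integrator are assumptions of this sketch, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class QuarterCar:
    # Placeholder values for illustration only; not taken from Table 1.
    m_s: float = 2.45     # sprung mass [kg]
    m_us: float = 1.0     # unsprung mass [kg]
    k_s: float = 900.0    # suspension stiffness [N/m]
    k_us: float = 2500.0  # tire stiffness [N/m]
    b_s: float = 7.5      # suspension damping [N*s/m]
    b_us: float = 5.0     # tire damping [N*s/m]


def accelerations(p, state, z_r, z_r_dot, f_c):
    """Return (sprung, unsprung) accelerations from the force balance."""
    x_s, v_s, x_us, v_us = state
    # Force transmitted through the suspension spring and damper.
    f_susp = p.k_s * (x_s - x_us) + p.b_s * (v_s - v_us)
    # Force transmitted through the tire spring and damper.
    f_tire = p.k_us * (x_us - z_r) + p.b_us * (v_us - z_r_dot)
    a_s = (-f_susp + f_c) / p.m_s
    a_us = (f_susp - f_tire - f_c) / p.m_us
    return a_s, a_us


def step(p, state, z_r, z_r_dot, f_c, dt=1e-3):
    """Advance the state by dt using semi-implicit Euler integration."""
    f_c = max(-60.0, min(60.0, f_c))  # actuator range stated in the text
    a_s, a_us = accelerations(p, state, z_r, z_r_dot, f_c)
    x_s, v_s, x_us, v_us = state
    v_s += a_s * dt
    v_us += a_us * dt
    x_s += v_s * dt
    x_us += v_us * dt
    return (x_s, v_s, x_us, v_us)
```

Calling `step` with `f_c = 0` reproduces the passive (open loop) response; after a step change in the road height the sprung mass settles at the new road level.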

Reinforcement Learning
Reinforcement learning is one of the three main machine learning paradigms and has witnessed a significant leap in the past few years due to hardware improvements, especially after the publication of the first successful deep learning model that learned control policies for seven Atari 2600 games [18]. As shown in Figure 2, reinforcement learning works by trial and error: an agent at a certain state $s_t$ takes an action $a_t$ to interact with the environment, steps to a new state $s_{t+1}$, and gets a reward that depends on the value of the new state and therefore reflects the quality of the action taken. The target is to find a policy $\pi_\theta$ that maximizes the accumulated reward. It can be said that the agent learns by experience. This method has gained much attention due to its ability to solve complex problems that supervised learning cannot solve efficiently, and it does so without the need for huge training datasets of correct answers that are passed to the neural network through backpropagation. It can also outperform traditional controllers, which work in a bounded range and have limitations when applied to non-linear models and MIMO systems. In this paper, the agent, the environment, and the action represent the controller, the active suspension model, and the force, respectively.
Deep reinforcement learning utilizes neural networks to generate an action $a_t$ given the state $s_t$ and to estimate the value of a state $s_{t+1}$, which allows us to extend reinforcement learning to problems with continuous state and action spaces. In this paper, we investigate deep reinforcement learning in control problems by applying the TD advantage actor critic algorithm to the active suspension system.

TD Advantage Actor Critic Algorithm
The TD advantage actor critic algorithm is an online model-free algorithm that consists of two neural networks, the actor network $\pi_\theta(s)$ and the critic network $V^{\pi}_U(s)$, where $\theta$ denotes the actor network parameters (weights and biases) and $U$ denotes the critic network parameters (weights and biases). The actor network takes the current state $s_t$ as an input and generates an action $a_t$ given the current policy $\pi_\theta$. The agent executes the action in the environment, which produces a scalar reward signal $r_t$ and steps to the next state $s_{t+1}$. The critic network takes the current state as an input and produces a value for that state, $V^{\pi}_U(s_t)$; the next state is then fed to the critic network to produce its value, $V^{\pi}_U(s_{t+1})$, before the parameters are updated. The critic learns a value function, which is then used to update the actor's weights in the direction of improving the performance at every time step.
In this algorithm, the output of the actor is not the action itself; instead, it is the mean $\mu$ and the standard deviation $\sigma$, which define the probability distribution of choosing action $a_t$ given the state $s_t$. Therefore, the policy is stochastic rather than deterministic. A step by step implementation is shown in Algorithm 1.
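As an illustration of the stochastic policy described above, the sketch below draws an action from a Gaussian with mean $\mu$ and standard deviation $\sigma$ and evaluates its log-probability, which is the quantity that enters the actor loss. The function names are hypothetical, and clipping the sampled force to the ±60 N actuator range is an assumption of this sketch; the text does not say how the bound is enforced.

```python
import math
import random


def gaussian_log_prob(action, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at `action`."""
    z = (action - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def sample_action(mu, sigma, rng, low=-60.0, high=60.0):
    """Sample a force from the Gaussian policy and clip it (assumed)."""
    action = rng.gauss(mu, sigma)
    return max(low, min(high, action))
```

A larger $\sigma$ spreads the sampled forces and encourages exploration, while a smaller $\sigma$ makes the policy behave almost deterministically around $\mu$.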

Algorithm 1: TD Advantage Actor Critic.
Initialize the critic network $V^{\pi}_U(s)$ and the actor network $\pi_\theta(s)$
for episode = 1:M
    Start from the initial state $s_{t_0}$
    for i = 1:N
        Sample action $a_t \sim \pi_\theta(a_t \mid s_t) = \mathcal{N}(a_t \mid \mu, \sigma)$
        Execute action $a_t$ in the environment, obtain the reward $r$, and step to the next state $s_{t+1}$
        Calculate the temporal difference error $\delta_t = r + \gamma V^{\pi}_U(s_{t+1}) - V^{\pi}_U(s_t)$
        Update the critic parameters by minimizing $\delta_t^2$
        Update the actor parameters by minimizing the loss $-\log \mathcal{N}(a_t \mid \mu, \sigma) \cdot \delta_t$
        Set $s_t = s_{t+1}$
    end
end
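The steps of Algorithm 1 can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not the networks used in the paper: the actor is a linear mean $\mu = w s$ with a state-independent $\log\sigma$, the critic is linear in the features $[1, s^2]$, and the environment in `train` is a made-up scalar plant. All names, the clamp on $\log\sigma$, and the toy dynamics are assumptions of this sketch.

```python
import math
import random

GAMMA = 0.99  # discount factor used in the paper


def critic_value(u, s):
    """Linear critic on the features [1, s^2] (assumed form)."""
    return u[0] + u[1] * s * s


def td_update(theta, u, s, a, r, s_next, lr_actor=0.001, lr_critic=0.01):
    """Apply one TD advantage actor-critic update in place; return delta."""
    w, log_sigma = theta
    mu, sigma = w * s, math.exp(log_sigma)
    # Temporal difference error, used as the advantage estimate.
    delta = r + GAMMA * critic_value(u, s_next) - critic_value(u, s)
    # Critic: semi-gradient step that reduces delta^2.
    u[0] += lr_critic * delta
    u[1] += lr_critic * delta * s * s
    # Actor: gradient descent on -log N(a | mu, sigma) * delta.
    z = (a - mu) / sigma
    theta[0] = w + lr_actor * delta * (z / sigma) * s
    new_log_sigma = log_sigma + lr_actor * delta * (z * z - 1.0)
    theta[1] = min(1.0, max(-2.0, new_log_sigma))  # clamp is an assumption
    return delta


def train(episodes=5, steps=200, seed=0):
    """Run the loop of Algorithm 1 on a toy scalar plant (hypothetical)."""
    rng = random.Random(seed)
    theta = [0.0, math.log(0.5)]  # mu = 0 and sigma = 0.5 at the start
    u = [0.0, 0.0]
    for _ in range(episodes):
        s = rng.uniform(-1.0, 1.0)
        for _ in range(steps):
            a = rng.gauss(theta[0] * s, math.exp(theta[1]))
            s_next = max(-1.0, min(1.0, 0.9 * s + 0.1 * a))
            r = -s_next * s_next
            td_update(theta, u, s, a, r, s_next)
            s = s_next
    return theta, u
```

The update is applied after every single transition, matching the online, per-time-step scheme described in the text rather than a replay-buffer variant.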

Reward Function
The formulation of the reward function plays a crucial role in the learning process. It is used to indicate the quality of the action taken by the agent after it steps to the next state. In supervised learning, the correct answer is known and fed to the neural network through backpropagation, which, with an appropriate optimization method, tunes the network parameters in the direction of minimizing the error. In reinforcement learning, however, the agent receives a reward signal from the environment based on the next state, which determines the quality of the action taken. Positive rewards encourage the agent to accumulate as much reward as possible, whereas negative rewards incentivize the agent to reach the targeted state quickly to avoid accumulating penalties. We tested three different reward functions. During the learning process, the third reward function performed the best. The neural networks struggled to converge with the first reward function due to low and close numerical values. The second reward function performed better; however, the actor network was not able to eliminate the steady-state error of the sprung mass displacement $x_s$. The reason is that the system can reach zero $\dot{x}_s$, and therefore obtain the maximum reward, without reaching the desired $x_s$. This problem was solved in the third reward function by adding a small penalty on the force, which encourages the actor to produce zero force when $\dot{x}_s$ is zero.
In the third reward function, the vehicle body velocity is used as the indicator of the quality of the action taken; the value is squared to give gradual feedback and to benefit from the optimization properties of a convex function, thus allowing the controller to know that it is improving. The signal is amplified because the range of the suspension travel being studied is within a few centimeters, and the negative sign is applied so that the reward increases as $\dot{x}_s$ decreases, which motivates the controller to reach the desired state as fast as possible. $k_1$ and $k_2$ are chosen to be 1000 and 0.1, respectively. $k_1$ is given a high value to motivate the controller to reach zero velocity, while $k_2$ is given a much smaller value because higher values limit the working range of the controller force, which has a negative impact on the response. Other values might improve or worsen the learning process.
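The description of the third reward function can be sketched as a weighted penalty on the squared body velocity plus a small penalty on the force. The gains $k_1 = 1000$ and $k_2 = 0.1$ come from the text; the exact form of the force term is not recoverable from it, so the absolute value used below is an assumption, as is the function name.

```python
K1 = 1000.0  # weight on the squared body velocity (from the text)
K2 = 0.1     # weight on the force penalty (from the text)


def reward(body_velocity, force, k1=K1, k2=K2):
    """Negative cost: zero only when both velocity and force are zero.

    The |force| term is an assumed form of the 'small penalty on the force'.
    """
    return -k1 * body_velocity ** 2 - k2 * abs(force)
```

With these gains a body velocity of 0.1 m/s costs 10 reward units, whereas the full 60 N actuator force costs 6, which illustrates why the velocity term dominates while the force term still breaks the tie at zero velocity.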

Learning and Optimization
The critic network and actor network structures are shown in Figures 3 and 4, respectively, and the implementation of the algorithm is illustrated in Figure 5. Weights are initialized using variance scaling for Tanh and Sigmoid and Xavier for ReLU and ELU, whereas layers with the linear activation function are initialized from a uniform distribution. The actor output layers are initialized such that $\mu = 0$ and $\sigma = 0.5$. The vehicle body velocity $\dot{x}_s$ is used to represent the state of the suspension system and is the input to both networks. The loss functions of both networks are minimized using the Adam adaptive learning rate optimizer [19]. The learning rate of the critic network is chosen to be higher than that of the actor network so that the critic learns faster; it thus produces accurate values for the states, which then help the actor learn and generate better actions. The learning rates are 0.01 and 0.001, respectively, and after a marked improvement in learning, they are decreased to 0.001 and 0.0001, respectively. In many cases, the experience $(s_t, a_t, r_t, s_{t+1})$ is stored in a memory with a pre-defined size, and the networks' parameters are updated by computing the average gradient over sampled transitions at every time step. Although this approach solves the problem of correlated states, it is not considered in this paper. Computing the gradients at each time step for the single experience was faster and more efficient since the road profile changes every 1.5 s. The value of the discount factor $\gamma$ was chosen to be 0.99, and the activation functions used throughout our study are summarized in Table 2.

Table 2. List of activation functions (columns: Type, Function, Derivative).

Unfortunately, there is no standard method of choosing the number of neurons, the number of hidden layers, or the types and order of the activation functions. Therefore, we built and ran various models of the actor and critic multiple times until satisfactory performance was achieved. The best performing neural networks in this study are summarized in Table 3. The algorithm was implemented in Python 3.7 and trained using TensorFlow libraries.
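For reference, the activation functions named in this section (linear, Tanh, Sigmoid, ReLU, and ELU) can be written with their derivatives as below. These are the standard textbook definitions and may differ in detail from the entries of Table 2; the ELU scale `alpha = 1.0` and the convention that the ReLU derivative is 0 at the origin are assumptions of this sketch.

```python
import math


def linear(x):
    return x


def d_linear(x):
    return 1.0


def tanh(x):
    return math.tanh(x)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def d_relu(x):
    # Not differentiable at 0; returning 0 there is a convention.
    return 1.0 if x > 0 else 0.0


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def d_elu(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)


# Each entry pairs a function with its derivative.
ACTIVATIONS = {
    "linear": (linear, d_linear),
    "tanh": (tanh, d_tanh),
    "sigmoid": (sigmoid, d_sigmoid),
    "relu": (relu, d_relu),
    "elu": (elu, d_elu),
}
```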

Online Estimation
In real-life applications, the parameters of the active suspension vary with time. Therefore, for a more realistic investigation, the trained controller is tested without accurate knowledge of the car body damping $b_s$ and the tire damping $b_{us}$. These parameters are estimated iteratively using recursive least squares with exponential forgetting, which works well with time-varying parameters [20] because it assumes that the true values of the parameters change with time. Given the initial values $P_0$ and $\hat{\theta}(t-1)$, the recursive method satisfies Equations (6)-(8):

$$\hat{\theta}(t) = \hat{\theta}(t-1) + K(t)\left(y(t) - \varphi^T(t)\hat{\theta}(t-1)\right) \qquad (6)$$
$$K(t) = P(t-1)\varphi(t)\left(\lambda I + \varphi^T(t)P(t-1)\varphi(t)\right)^{-1} \qquad (7)$$
$$P(t) = \left(I - K(t)\varphi^T(t)\right)P(t-1)/\lambda \qquad (8)$$

where $\hat{\theta}(t)$ is a vector that contains the estimated parameters $b_s$ and $b_{us}$, and $K(t)$ is a weighting vector that indicates how the correction and the previous estimate should be combined. $y(t)$ is the measured value; in this paper, it is the true value with added noise drawn from the standard normal distribution. $\varphi$ is the regressor. $P(t)$ is a matrix defined only when the matrix $\Phi^T(t)\Phi(t)$ is nonsingular, where $P(t) = \left(\Phi^T(t)\Phi(t)\right)^{-1}$. $I$ is the identity matrix of size $n \times n$, where $n$ is the number of parameters to be determined, and $\lambda$ is a constant chosen to be 0.90. For fast convergence and to avoid singularities, the matrix $P(t)$ is initialized as the identity matrix multiplied by 1000.

We use the unsprung mass acceleration $\ddot{x}_{us}$ as the output $y(t)$ in the online estimation process, since it includes all the parameters to be estimated. From (1), an expression for $\ddot{x}_{us}$ in terms of $b_s$ and $b_{us}$ can be obtained. Only the coefficients of the parameters to be determined are retained, as the other terms disappear when the error $\left(y(t) - \varphi^T(t)\hat{\theta}(t-1)\right)$ is calculated. In the resulting regression, the term $y(t)$ is the measured output and $x(t)$ contains the measured parameters, whereas the term in $\varphi(t)$ gives the estimated output and $\hat{\theta}(t)$ contains the estimated parameters.
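The recursive least squares update with exponential forgetting can be sketched with NumPy as follows. The gain expression follows Equation (7) from the text, and the estimate and covariance updates use the standard textbook form of the method. The function name `rls_step`, the scalar-output assumption, and the random regressors in the usage example are illustrative assumptions rather than the regressor built from the suspension model.

```python
import numpy as np


def rls_step(theta, P, phi, y, lam=0.90):
    """One RLS update with forgetting factor `lam` for a scalar output.

    theta: (n, 1) current estimate, P: (n, n) matrix, phi: (n,) regressor.
    """
    phi = np.asarray(phi, dtype=float).reshape(-1, 1)
    # Gain: K = P phi (lam + phi^T P phi)^-1, scalar denominator here.
    K = P @ phi / (lam + (phi.T @ P @ phi).item())
    # Prediction error using the previous estimate.
    err = y - (phi.T @ theta).item()
    theta = theta + K * err
    # Update and inflate P by 1/lam so old data is forgotten.
    P = (np.eye(theta.shape[0]) - K @ phi.T) @ P / lam
    return theta, P


def estimate(true_theta, steps=50, seed=0):
    """Usage example: recover fixed parameters from noiseless data."""
    rng = np.random.default_rng(seed)
    n = len(true_theta)
    theta = np.zeros((n, 1))
    P = 1000.0 * np.eye(n)  # initialization stated in the text
    for _ in range(steps):
        phi = rng.normal(size=n)
        y = float(phi @ np.asarray(true_theta, dtype=float))
        theta, P = rls_step(theta, P, phi, y)
    return theta.ravel()
```

A forgetting factor of 0.90 weights recent samples heavily, which suits parameters that change often but makes the estimate more sensitive to measurement noise than values closer to 1.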

Results and Discussion
A simple road profile was used in the learning process, where $z_r$ is generated by a square wave with an amplitude of 0.02 m and a period of 3 s, as shown in Figure 6a,b. The performance of the controller after training was compared with the optimum PID obtained from [12] and with the Linear Quadratic Regulator (LQR) from the Quanser laboratory guide [21].

Two scenarios were studied to build confidence and test the trained controller. In the first scenario, the controller was compared with the optimum PID and LQR on the same road profile used during the training process under ideal conditions. In the second scenario, a new bumpy road profile was used, and parameter estimation was added to the simulation.
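The training road profile described above can be generated with a few lines of Python. The amplitude of 0.02 m and the period of 3 s come from the text; whether the wave switches between 0 and the amplitude or between symmetric positive and negative levels is not stated, so the 0-to-amplitude form below is an assumption, as is the function name.

```python
def road_profile(t, amplitude=0.02, period=3.0):
    """Square-wave road height z_r(t) in meters (0-to-amplitude assumed)."""
    # High for the first half of each period, low for the second half.
    return amplitude if (t % period) < period / 2.0 else 0.0
```

With these values the road height changes every 1.5 s, consistent with the update interval mentioned in the learning section.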

Scenario 2
The range of the road profile was extended, as shown in Figure 8a, and online parameter estimation was included. $b_s$ and $b_{us}$ were assumed to change frequently, at a rate of 0.5 Hz. $b_s$ varied between 4 and 9 N·s/m, and $b_{us}$ varied between 3 and 7 N·s/m. Noise was added to the parameters to simulate noisy measurements. Figure 9a,b shows that the system successfully estimated the parameters in less than 100 ms. Despite the changing parameters and the bumpy road profile, the actor network maintained excellent performance and provided 6.14% lower overall acceleration compared to the optimum PID. The average acceleration values obtained were 0.6861 m/s² for the actor network and 0.7310 m/s² for the optimum PID.
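The quoted 6.14% figure follows directly from the two average accelerations reported above, taking the optimum PID as the baseline. The short check below only reproduces that arithmetic; the variable names are illustrative.

```python
actor_acc = 0.6861  # average acceleration of the actor network [m/s^2]
pid_acc = 0.7310    # average acceleration of the optimum PID [m/s^2]

# Relative reduction with respect to the PID baseline, in percent.
reduction_pct = 100.0 * (pid_acc - actor_acc) / pid_acc
```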

Conclusions
In this paper, online reinforcement learning with the TD advantage actor critic algorithm was used to train an active suspension system controller. The structure of the neural networks was obtained by trial and error. Three different reward functions were studied, and the implemented one used the vehicle body velocity and the produced force as an indication of the quality of the action taken.
The results showed that reinforcement learning can obtain near optimal results under parameter uncertainty while the uncertain parameters are estimated using the RLS method with a forgetting factor.
The results encourage further studies by testing other algorithms like the Deep Deterministic Policy Gradient (DDPG) and Asynchronous Advantage Actor Critic (A3C). In addition, a full model suspension system will provide a better understanding of the controller capabilities. Moreover, the adaptiveness and the ability to continuously learn the complex dynamics under disturbances and uncertainties motivate the use of deep reinforcement learning in non-linear models.