Deep Reinforcement Learning for Autonomous Dynamic Skid Steer Vehicle Trajectory Tracking

Abstract: Designing controllers for skid-steered wheeled robots is complex due to the interaction of the tires with the ground and the wheel slip inherent to the skid-steer driving mechanism, which lead to nonlinear dynamics. Motivated by the recent success of reinforcement learning algorithms for mobile robot control, the Deep Deterministic Policy Gradient (DDPG) algorithm, designed for continuous control problems, was implemented. The approach handled the complex dynamics of the vehicle model and leveraged the generalizability of deep neural networks. Reinforcement learning was used to gather information and train the agent in an unsupervised manner. The performance of the trained policy was demonstrated on a six degrees of freedom dynamic model simulation with ground force interactions. The system met the requirement of staying within half the vehicle width of the reference paths.


Introduction
Mobile robots play an important role in modern daily life and in various industries such as construction, logistics and transportation. Their influence has become even more evident with advances in autonomous driving research. On-road vehicles operate in a much more structured environment than off-road vehicles, for which the environment is complex, unstructured and mechanically taxing. Skid-steered robots have several advantages when operating in these tough environments, maintaining greater traction on rough terrain. The same design can be used for both tracked and wheeled robots. Since there are no steered wheels, there is no need for a steering mechanism: all the wheels on each side drive in the same direction, so only two motors are sufficient for driving and steering the robot.
Skid-steered robots, whether tracked or wheeled, have complex dynamics arising from their interaction with the road or contact surface and from the joints that constrain their motion. These effects make it difficult to model such vehicles with simplistic formulations that assume no slip or rely on purely kinematic descriptions of their motion.
One of the fundamental requirements for an autonomous system is the ability to follow a trajectory provided by a path planner in a reliable manner. Due to limitations of existing control schemes, the objective was to develop a controller that adapts to this complexity while ensuring the vehicle follows a trajectory. Most previous approaches to skid-steer path following use kinematic models of the robot, which do not account for wheel slip. For example, Ref. [1] used a terrain-dependent kinematic model whose parameters were experimentally evaluated; this may not be feasible for vehicles operating in uncertain and dynamic environments. Ref. [2] used a slip-aware model predictive component to correct the control signals generated by a pure pursuit path follower; this approach relied on linearized dynamics for estimating slip, which is not ideal for some operating conditions of a field robot. In this work, a 6 Degrees of Freedom (DoF) dynamic model of the vehicle with nonlinear ground contact forces was employed. This improved the accuracy of the vehicle motion modeling and enabled better training and tuning of the reinforcement learning agent.
The remainder of the paper is organized as follows. Section 2 provides a literature review. Section 3 provides a description of the nonlinear vehicle model. Sections 4 and 5 discuss the structure of the Deep Deterministic Policy Gradient and reinforcement learning neural network setup and training. Section 6 provides the results of the trained policy as evaluated on various trajectories in simulation to assess performance measurements and errors. Lastly, Section 7 concludes the paper.

Background
In recent years, deep learning has been successfully used in computer vision, natural language processing and speech recognition. Deep learning developed from artificial neural networks (ANNs), whose structure is explained in Figure 7. ANNs were originally inspired by biological neuronal networks. Deep neural networks have intermediate layers of neurons called hidden layers, which help a deep network form abstract representations of the problem domain. This is evident in the feature representations of the various layers of a trained deep neural network, in contrast to traditional methods where features are designed by domain experts. Despite their success in many fields, deep neural networks require huge sets of labeled training data, which is prohibitively expensive or impractical for problems involving real-world robotic control due to safety concerns. Reinforcement learning, by contrast, involves agents that learn to perform a task by interacting with the environment without expert operation and guidance. Feedback to the agent is provided through a reward function, from which it learns the decision-making policies needed to perform the task. The advances made in reinforcement learning and deep learning in the past few years to solve classic Atari games [3] inspired the research community to design algorithms addressing vehicle control [4], motion planning [5] and navigation [6] tasks.
In [7], the dynamics of a skid-steered vehicle were used for the path following task, but the controller needed to be tuned in a supervised manner. Using the reinforcement learning paradigm allowed the designer to train the controllers in an unsupervised manner through the use of a reward system, removing the need for large databases or hand tuning of controller parameters. Reinforcement learning has been successfully applied in control problems, developing policies for low-level motor control [8], robot tracking [9] and aircraft control [10]. The success of deep learning and reinforcement learning in control problems led to the exploration of these approaches for path tracking of skid-steered vehicles. Robotics control problems have continuous variables as outputs, and Deep Deterministic Policy Gradients (DDPG) [4] is one of the reinforcement learning algorithms designed to accommodate continuous control actions. DDPG was applied to design a controller for the trajectory following task of a skid-steered robot. To the best of the authors' knowledge, this is the first time a reinforcement learning algorithm, as demonstrated in Figure 1, has been used to train a Deep Neural Network based controller for trajectory following of a dynamic skid-steer vehicle system in simulation and in a real-world application.

Vehicle Model Description
Skid steering is a driving mechanism implemented on vehicles with either tracks or wheels, which uses a differential drive concept. Common skid-steered vehicles are tracked tanks and bulldozers. This driving method engages each side of the tracks or wheels separately, and turning is accomplished by generating differential velocity on opposite sides of the vehicle. The wheels or tracks are non-steerable. For example, a right turn can be achieved by turning the left wheels forward while the right wheels turn backward or remain stationary, and vice versa for a left turn. Another advantage of this mechanism is that if such velocities are applied continuously, the robot can theoretically complete a full 360° turn with zero radius, provided there is no wheel slip. Skid-steered robots have complicated dynamics due to the nature of the interaction of the wheels with the contact surface, leading to nonlinear forces acting on the vehicle. Much of the work performed in this area uses simplified kinematic models or dynamic models that cannot incorporate ground contact forces in a detailed manner. The dynamic model developed in the simulation environment was based on a commercial skid-steer robot called the Jackal, made by Clearpath Robotics. The Jackal, as seen in Figure 2, is a small, entry-level field robotics research platform. It has an onboard computer, GPS and an IMU along with Robot Operating System (ROS) integration. This facilitates communication, algorithm development and deployment of the final controllers. It has a long battery life and a rugged aluminum chassis which enables all-terrain operation.
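The differential-drive concept above can be sketched in a few lines. This is an illustrative example, not from the paper: it assumes an idealized no-slip mapping from a body-frame command (forward velocity v, angular velocity omega) to left/right wheel speeds, and the track width value is a hypothetical parameter.

```python
# Idealized skid-steer (differential drive) command mapping, no-slip assumption.
def wheel_speeds(v, omega, track_width=0.37):
    """Return (v_left, v_right) side speeds for a body-frame command."""
    v_left = v - omega * track_width / 2.0
    v_right = v + omega * track_width / 2.0
    return v_left, v_right

# A zero-radius turn: zero forward speed yields equal and opposite wheel speeds.
left, right = wheel_speeds(0.0, 2.0, track_width=0.37)
```

With v = 0, the two sides spin in opposite directions, which is the zero-radius turn described in the text.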
The dynamic model of the skid-steered vehicle has a wheelbase and four wheels attached to it with 6 DoF. The forward dynamics is calculated using 6 DoF rigid body dynamics, with torques and forces generated by normal forces and friction forces acting on the wheels as well as the internal motor torques.
In this model, the vehicle was treated as a rigid body that interacted with a compliant ground. This compliant ground was modeled as a uniform distribution of an infinite number of nonlinear spring-damper pairs. Further, these rigid body-ground interactions were represented as a set of discrete contact points, each of which caused the ground to deflect spherically. Ground forces were calculated by modeling the tires as pneumatic soft tires. The wheel was approximated with a set of springs of given spring constant under compression from ground forces. Skid-steered vehicles have fixed tire orientations, which means that during a turn each tire has a different turn radius. Assuming a no-slip condition would geometrically constrain the vehicle and movement would be impossible; therefore, a skid-steer vehicle relies on slipping to complete turns. A detailed dynamic model was required to calculate the reaction forces and account for the effect of slipping. The tire-terrain model proposed by [11] required the slip ratio, slip angle, sinkage and tire velocity at each tire to compute the reaction forces in the x, y, and z directions as well as the reaction torque. Tire slip ratio was defined as s = (Rω − µ_x)/(Rω), where µ_x is the forward velocity of the tire, R is the radius of the tire, and ω is the rotational tire velocity. Slip angle is defined as α = tan⁻¹(µ_y/µ_x), where µ_y is the lateral velocity of the tire. Sinkage is the depth the tire sinks, called z_r. The model detailed in [11] works by integrating the shear and normal stresses caused by the soil displacement over the surface of the tire.
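The slip quantities above can be computed directly from the tire velocities. This is a hedged sketch: the sign and normalization conventions follow a common driving-slip choice and may differ from those in [11].

```python
import math

def slip_ratio(mu_x, R, omega):
    """Longitudinal slip ratio s = (R*omega - mu_x) / (R*omega), driving convention."""
    return (R * omega - mu_x) / (R * omega)

def slip_angle(mu_x, mu_y):
    """Lateral slip angle from forward (mu_x) and lateral (mu_y) tire velocities."""
    return math.atan2(mu_y, abs(mu_x))

# Wheel surface speed R*omega = 1.0 m/s, ground speed 0.9 m/s -> 10% slip.
s = slip_ratio(mu_x=0.9, R=0.1, omega=10.0)
# Purely forward motion -> zero slip angle.
alpha = slip_angle(mu_x=1.0, mu_y=0.0)
```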
z_u refers to the unloading height and z_e is the sinkage of the flexible tire. p_g is the tire pressure, which is assumed constant, and b is the tire width. k_c, k_φ, k_u are soil parameters.
where θ_f, θ_r, θ_c are tire angles corresponding to z_f, z_r, z_c, and R is the radius of the tire. Equations from [11] demonstrate how the shear and normal stresses can be calculated from the deformation of the soil in these three zones. The result is a nonlinear relationship between the input parameters and the reaction forces, which can be observed in Figures 3-6. The relative modulus of elasticity between the wheel(s) and the ground, E*, was computed using (8), the standard contact relation 1/E* = (1 − ν_w²)/E_w + (1 − ν_g²)/E_g, in which the wheel(s) and ground had moduli of elasticity E_w and E_g and Poisson ratios ν_w and ν_g, respectively. The stiffness and damping coefficients were defined in (9), where r was the radius of the sphere and α was a constant.
The spatial velocity vector for F_1, the main body reference frame, was given by (10) as v_1 = [ωᵀ vᵀ]ᵀ, comprising the angular velocity ω and the translational velocity v.
Equation (11) gave the velocity of each wheel as v_i = ⁱX_1 v_1 + S_i ω_i, where the motion transformation from F_1 to F_i was given by ⁱX_1, the subspace matrix of each wheel was S_i, and the angular velocity of wheel i was ω_i.

Deep Deterministic Policy Gradients
The DDPG is an actor-critic reinforcement learning algorithm with separate neural networks for the actor and critic functions. In this section, multi-layered neural networks, as shown in Figure 7, are introduced, including the policy function µ and the critic function Q.

Fully Connected Layer
A fully connected layer can be represented by y = ε(wᵀx + b), where wᵀ is the weights matrix and b is the bias; x is the input to the layer and y is the output. The Rectified Linear Unit (ReLU) was chosen as the activation function ε. The entire network was made of multiple fully connected layers, and it can be shown from the universal approximation theorem [12] that such a network can approximate any continuous function on a compact domain to arbitrary accuracy, given sufficient width.
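A minimal sketch of the fully connected layer y = ReLU(wᵀx + b) described above; the weight and bias values here are arbitrary illustrative numbers, not from the paper.

```python
import numpy as np

def fc_layer(x, w, b):
    """Fully connected layer with ReLU activation: y = max(0, w^T x + b)."""
    z = w.T @ x + b
    return np.maximum(z, 0.0)   # ReLU zeroes out negative pre-activations

w = np.array([[1.0, -1.0], [0.5, 0.5]])   # maps 2 inputs to 2 outputs
b = np.array([0.0, -1.0])
y = fc_layer(np.array([2.0, 1.0]), w, b)  # pre-activations [2.5, -2.5]
```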

Activation Function
The simplest activation function is referred to as linear activation, where no transform is applied at all. A network comprised of only linear activation functions is quite easy to train but cannot learn complex mapping functions. Nonlinear activation functions are preferred as they allow the nodes to learn more complex structures in the data. Traditionally, two widely used nonlinear activation functions are the sigmoid and hyperbolic tangent activation functions.
The nonlinear activation functions, such as sigmoid and tanh, suffer from the problem of vanishing gradients: layers deep in large networks using these activation functions do not receive useful gradient information. The error is back-propagated through the network and used to update the weights, but the amount of error decreases dramatically with each additional layer through which the gradients are propagated, given the derivative of the chosen activation function. This is called the vanishing gradient problem, and it prevents deep (multi-layered) networks from learning effectively. ReLU activation helps deal with the vanishing gradient problem and enables the development and training of deep neural networks. The ReLU function is a piecewise linear function that outputs the input directly if it is positive and zero otherwise, as seen in Figure 8. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
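The vanishing gradient effect can be seen numerically. This sketch (not from the paper) contrasts the two cases: the sigmoid derivative is at most 0.25, so the product of derivatives over many layers shrinks toward zero, while ReLU passes a gradient of 1 for positive pre-activations.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), maximized at z = 0 (value 0.25)."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

n_layers = 20
# Best case for sigmoid (z = 0 at every layer) still decays geometrically.
sig_product = sigmoid_grad(0.0) ** n_layers
# ReLU with positive pre-activations passes the gradient unchanged.
relu_product = 1.0 ** n_layers
```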

Control Policy
The control policy is represented by a_t = µ(s_t), where s_t is the state at time t and a_t is the action calculated by the policy function µ at time t. The long-term reward function was calculated as R(s_t, a_t) = Σ_{k=0}^{∞} γ^k r(s_{t+k}, a_{t+k}), where γ is the discount factor (0 < γ < 1), which was used to discount future rewards with respect to the current reward. The problem was to find the policy µ that maximized R(s_t, a_t). To solve this problem, the critic function was defined as Q(s_t, a_t) = E[R(s_t, a_t)]. An iterative approach was used to calculate the critic function, discretizing the above equation to obtain Q(s_t, a_t) = r(s_t, a_t) + γQ(s_{t+1}, µ(s_{t+1})), with the optimal critic function denoted Q̂(s_t, a_t). The critic function Q(s_t, a_t) was replaced with a neural network Q(s_t, a_t|ω). In order to obtain an optimal critic function, the loss function was defined with the target y_t = r(s_t, a_t) + γQ̂(s_{t+1}, µ(s_{t+1})) as Loss = (y_t − Q(s_t, a_t|ω))²/2 (20). The optimal critic function Q̂(s_t, a_t) was found by minimizing the value of the Loss function.
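The bootstrapped target and squared Bellman loss above can be illustrated with concrete numbers. This is a hedged sketch: the Q values are stand-in scalars, not outputs of a trained network.

```python
# One critic-update step in miniature: target y_t = r_t + gamma * Q_next,
# loss = (y_t - Q_pred)^2 / 2, matching the DDPG critic loss.
gamma = 0.95
r_t = -0.2      # reward (here a penalty) received at time t
q_next = -1.0   # target critic's value for (s_{t+1}, mu(s_{t+1}))
q_pred = -1.5   # current critic's prediction Q(s_t, a_t | omega)

y_t = r_t + gamma * q_next
loss = 0.5 * (y_t - q_pred) ** 2
```

Minimizing this loss pulls the critic's prediction toward the one-step bootstrapped target.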
Thus, the policy gradient algorithm [13] was applied by sampling a batch of (s t , a t ) and computing the average Loss function.
The gradient of the Loss was given by ∇_ω Loss = −(y_t − Q(s_t, a_t|ω)) ∇_ω Q(s_t, a_t|ω). The gradient descent method was used to update the weights as ω ← ω − α∇_ω Loss, where α is the learning rate, which can be tuned according to need.

Actor Function
After obtaining the critic function, it was used to update the actor function by propagating the gradients back to the actor. Then µ(s t ) was replaced using neural network µ(s t |θ) and substituting a t = µ(s t |θ) into Q(s t , a t |ω).
The gradient was given by the chain rule as ∇_θ J = ∇_a Q(s_t, a|ω)|_{a=µ(s_t|θ)} ∇_θ µ(s_t|θ). Similar to the critic function, the gradient from the above equation can be used to update the actor function weights.
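The actor's chain-rule update can be demonstrated with a toy problem (an illustrative sketch, not the paper's networks): a linear actor a = θ·s and an analytic critic Q(s, a) = −(a − 0.5s)², chosen so that both gradient factors have closed forms and the optimal θ is known to be 0.5.

```python
def actor_update(theta, s, lr=0.1):
    """One deterministic policy-gradient step: grad_theta J = dQ/da * dmu/dtheta."""
    a = theta * s                    # mu(s | theta)
    dq_da = -2.0 * (a - 0.5 * s)     # grad_a Q evaluated at a = mu(s | theta)
    dmu_dtheta = s                   # grad_theta mu for the linear actor
    return theta + lr * dq_da * dmu_dtheta   # gradient *ascent* on Q

theta = 0.0
for _ in range(200):
    theta = actor_update(theta, s=1.0)
# theta converges to the Q-maximizing parameter 0.5
```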
In order to update the actor and critic functions in a smooth fashion, the DDPG [4] algorithm creates target copies of the actor and critic networks, Q′(s_t, a_t|ω′) and µ′(s_t|θ′), and updates them gradually. The stopping criterion can be decided by a set reward value or a number of episodes during training.
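The gradual target update is usually a soft blend of the online and target weights. This sketch assumes the common convention ω′ ← τω + (1 − τ)ω′ with a small τ (the value 0.005 is illustrative, not from the paper).

```python
import numpy as np

def soft_update(target, source, tau=0.005):
    """Soft target-network update: move target weights slightly toward source."""
    return tau * source + (1.0 - tau) * target

omega_target = np.zeros(3)   # target network weights (stand-in)
omega_online = np.ones(3)    # online network weights (stand-in)
for _ in range(1000):
    omega_target = soft_update(omega_target, omega_online)
# target weights drift slowly toward the online weights, stabilizing training
```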

Reinforcement Learning Setup and Training
Reinforcement learning consists of an environment and an agent. The agent performs an action in the environment, which leads to a change in the state. The environment then supplies the next state and a scalar quantity called the reward. The reward is computed and provided as feedback to the agent. In this case, state information derived from the vehicle position was used as the input to the agent, which output angular velocity. Both the inputs and outputs were continuous variables.

DDPG Agent Structure
The DDPG agent consists of two component neural networks called the actor and the critic. The actor network is the controller that maps the current state of the environment to angular velocity. The critic network takes the state of the environment and the corresponding actions predicted by the actor and outputs the Q value, or value of the corresponding state-action pair.

Actor
The actor network, provided in Figure 9, was made of two intermediate layers with 128 and 64 neurons, respectively, and a final output layer with 1 neuron (2 neurons if linear velocity control was enabled). ReLU activation was used for the outputs of the intermediate layers, and the final layer used hyperbolic tangent (tanh) activation to limit the action output of the controller.
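The actor architecture described above can be sketched as a plain forward pass. This is a hedged illustration: the random weight initialization is a placeholder, not the trained network from the paper.

```python
import numpy as np

# Two ReLU hidden layers (128, 64) and one tanh output bounding the action.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((128, 10)) * 0.1, np.zeros(128)
W2, b2 = rng.standard_normal((64, 128)) * 0.1, np.zeros(64)
W3, b3 = rng.standard_normal((1, 64)) * 0.1, np.zeros(1)

def actor(state):
    h1 = np.maximum(W1 @ state + b1, 0.0)   # ReLU hidden layer, 128 units
    h2 = np.maximum(W2 @ h1 + b2, 0.0)      # ReLU hidden layer, 64 units
    return np.tanh(W3 @ h2 + b3)            # tanh output, bounded in (-1, 1)

action = actor(np.ones(10))   # 10-element state observation
```

The tanh output keeps the commanded angular velocity within fixed bounds regardless of the state magnitude.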

Critic
The critic network, as illustrated in Figure 10, received the state of the environment together with the action predicted by the actor and output the corresponding Q value.

State
Information regarding the state of the vehicle with respect to the trajectory was supplied to the actor and critic neural networks to enable them to act upon the state. State observations consisted of a 10-element array: the distance error, the derivative of the distance error and the heading error (states 1-3); look-ahead information (states 4-9); and the angular velocity (state 10). The states are described in the following subsections.
State information was calculated from the vehicle position, heading and waypoint positions. The information was acquired from the reference trajectory and sensors on-board the vehicle. State information provided to the agent was broken down into three categories: error information, predictive action information, and angular velocity. Figure 11 provides a visual description of the state variables supplied to the agent.

State Information-States 1-3
The error information consisted of distance error, derivative of distance error and heading error values, so that the agent could learn to associate these errors with action outputs. The derivative of heading error was not supplied to the agent because of the discontinuities in the derivative near the waypoints. These discontinuities introduced instability in the training process, so the agent was supplied with angular velocity. This provided similar information without the spikes and stabilized the training.

State Information-States 4-9
The predictive action information was supplied to the agent to encourage predictive turning behaviors near a sharp turn to reduce overshoot and hence minimize overall tracking error. A look-ahead point was defined as a point located at a specified constant distance down the path from the point on the path perpendicular to the vehicle's current position. The angle to the look-ahead point was the angle between the vehicle's current heading and the line drawn between the vehicle and the look-ahead point. The intention of the look-ahead point was to provide the controller with information describing the test course that lay ahead of the vehicle's current position. Three look-ahead points were used with specified distances of 1, 2 and 3 m.
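The look-ahead angle described above can be sketched as follows. This is an illustrative example, not the paper's implementation: the path geometry is simplified to a straight segment, and only the angle computation is shown.

```python
import math

def lookahead_angle(vehicle_xy, heading, point_xy):
    """Angle between the vehicle's heading and the line to a look-ahead point."""
    dx = point_xy[0] - vehicle_xy[0]
    dy = point_xy[1] - vehicle_xy[1]
    angle = math.atan2(dy, dx) - heading
    # Wrap the result into [-pi, pi] so the agent sees a continuous signal.
    return math.atan2(math.sin(angle), math.cos(angle))

# Vehicle 0.5 m left of a straight path along the x-axis, heading along it;
# look-ahead point 1 m ahead on the path -> a small negative (rightward) angle.
ang = lookahead_angle((0.0, 0.5), 0.0, (1.0, 0.0))
```

Evaluating this for points 1, 2 and 3 m down the path yields the three predictive states supplied to the agent.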

State Information-State 10
There was a difference between commanded angular velocity (ω c ) and actual angular velocity (ω a ). Even though the agent had the information about predicted output angular velocity, the internal motor controllers took a finite amount of time to converge to the new angular velocity. This introduced a time delay between the predicted and actual output velocities leading to a difference in predictions about vehicle motion. The goal of supplying this state information was to create a smarter controller that learned to map its target angular velocity to its actual angular velocity which was then mapped to desired actions in the environment.
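The lag between commanded and actual angular velocity can be approximated by a first-order model, d(ω_a)/dt = (ω_c − ω_a)/τ. This is an illustrative sketch of the delay described above, not the paper's motor model; the time constant τ is a hypothetical value.

```python
def step(omega_a, omega_c, tau=0.2, dt=0.01):
    """One Euler step of the first-order lag between command and response."""
    return omega_a + dt * (omega_c - omega_a) / tau

omega_a, omega_c = 0.0, 1.0   # actual starts at rest; command is 1 rad/s
for _ in range(200):          # simulate 2 s
    omega_a = step(omega_a, omega_c)
# after several time constants the actual velocity approaches the command
```

Supplying ω_a as a state lets the agent account for this convergence time when choosing its next command.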

Rewards
The reward/penalty functions were used to evaluate performance and provide feedback during training. The outputs of these functions were summed to a scalar reward value and supplied to the agent (penalty functions are simply negative reward functions). This reward value, in addition to state information, enabled the DDPG algorithm to evaluate the effectiveness of actions and train the neural networks.

Distance Error Penalty
The Distance Error (DE) Penalty function was very important for the training of agents because it directly corresponded to the primary objective of reducing distance error. As the agent strayed further away from the current trajectory, it accrued more DE Penalty which forced the agent to begin creating a policy that would avoid this behavior.

Heading Error Penalty
As heading error (HE) increased, so did the HE Penalty, which allowed the agent to create a policy that minimized heading error. In addition, there was a relationship between the magnitudes of the HE Penalty and the distance error: the HE Penalty was scaled down as distance error increased. This relationship was helpful when the vehicle found itself off track, since the vehicle needed to accommodate some heading error as it found its way back to the track and continued its mission.

Steering Penalty
The steering penalty was implemented to reduce unnecessary turning: a small penalty was added that was proportional to the absolute value of the angular velocity. An added benefit of the steering penalty was that it helped combat circular behavior. When the vehicle was close to waypoints and the errors were small, the steering penalty dominated the distance and heading error penalties; as the errors grew, the larger distance and heading error penalties overpowered the smaller steering penalty, causing the vehicle to follow the next path segment. Like the HE Penalty, the Steering Penalty was related to the distance error and was scaled down as distance error increased.
The reward scalar supplied to the agent was the sum of the values produced by Equations (27)-(29). Modifying the values of the constants changed what the controller perceived as good and bad behavior, which affected the behaviors acquired during the training process. Penalties were scaled by the gains (K_d, K_h, K_ω). E_d and E_h were the distance error and the heading error, respectively. The magnitudes of the gains in proportion to one another enabled the Deep Neural Network (DNN) controller to prioritize behaviors. Penalty shapes (S_d, S_h) determined the rate of increase in the penalties as their corresponding errors increased. Figures 12 and 13 demonstrate how varying the penalty values affects the unscaled distributions according to the corresponding error.
As simulations were terminated when the distance error surpassed 1 m, the distribution of the distance error penalty was bounded (from 0 to 1). The distribution of the heading error penalty was bounded between 0 and π because the absolute value of the heading error could not exceed π (−π < HE < π). The distance error relationship coefficients were used to prioritize minimizing the distance error rather than directly chasing the next waypoint in a pure-pursuit-like manner. If the vehicle overshot near a turn, it would prioritize getting back to the track and then continue its mission. These relationships were intended to discourage turning when the DE was small but to allow the vehicle to return to the track when the DE was large by not penalizing HE when the vehicle was far from the track. An exponential function was used in Equations (28) and (29) to determine the relationship between the DE and the other reward/penalty functions. When DE = 0, the exponential function was equal to 1 and the reward/penalty function was unaffected. However, as the absolute value of the DE strayed from 0, the exponential function increased, with a corresponding decrease in the reward/penalty. The DE relationship coefficients (R_h, R_ω) dictated the magnitude of the decrease in the reward/penalty with respect to the DE.
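The penalty structure described above can be sketched as follows. This is a hedged reconstruction, not a reproduction of the paper's Equations (27)-(29): the functional forms follow the textual description (shaped error penalties, exponential down-scaling by distance error), and all gain, shape and relationship-coefficient values are placeholders.

```python
import math

def reward(E_d, E_h, omega, K_d=1.0, K_h=0.5, K_w=0.1,
           S_d=2.0, S_h=2.0, R_h=3.0, R_w=3.0):
    """Sum of distance, heading and steering penalties (all values illustrative)."""
    p_dist = -K_d * abs(E_d) ** S_d                                # DE penalty
    p_head = -K_h * abs(E_h) ** S_h * math.exp(-R_h * abs(E_d))    # scaled down as DE grows
    p_steer = -K_w * abs(omega) * math.exp(-R_w * abs(E_d))        # steering penalty
    return p_dist + p_head + p_steer

on_track = reward(E_d=0.0, E_h=0.2, omega=0.5)
off_track = reward(E_d=0.8, E_h=0.2, omega=0.5)
```

Note how the same heading error and turn rate contribute almost nothing when the vehicle is far off track, leaving the distance error term to dominate, as the text describes.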

Training and Reinforcement Learning Setup
The MATLAB Reinforcement Learning Toolbox [14] was used to implement the DDPG algorithm. Once the dynamic model of the robot was built in Simulink, the models were converted into an environment compatible with the toolbox. The network architectures for the actor and critic were specified using the Deep Learning Toolbox as described in Section 4.1. The DDPG agent was constructed from the actor and critic networks after specifying the learning rates and losses for the individual networks. A mean squared error loss function was implemented.
The Reinforcement Learning (RL) toolbox was used to specify and monitor the training process, with built-in functionality to save and load trained agents. The training was conducted for 1000 episodes. Each episode was terminated when the agent's distance error from the reference trajectory exceeded the threshold value of 1 m or when the agent had successfully completed the waypoint trajectory. Learning rates of 0.02 and 0.01 were chosen for the critic and the actor, respectively, based on experimentation; various values were tried to improve the training time while maintaining the stability of the learning process. A discount factor of 0.95 was used to discount future rewards as described in Section 4. The replay buffer size was set to 10^5 and the minibatch size used for training was set to 32.
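The replay buffer configuration above (capacity 10^5, minibatches of 32) can be sketched outside MATLAB as follows; this is an illustrative Python analogue with dummy transitions, not the toolbox's internal implementation.

```python
import random
from collections import deque

# Bounded replay buffer: oldest transitions are evicted once capacity is hit.
buffer = deque(maxlen=100_000)
for t in range(200):   # fill with placeholder (s, a, r, s') transitions
    buffer.append((f"s{t}", f"a{t}", -0.1, f"s{t + 1}"))

# Each update samples a minibatch uniformly, breaking temporal correlation.
random.seed(0)
minibatch = random.sample(list(buffer), 32)
```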
The trained agent was saved in the workspace and the policy was extracted from the agent for evaluation. In order to deploy the trained policy on the Jackal, a feed-forward function was implemented to take the inputs coming from the robot and compute the control outputs using the weights and biases of the policy. The communication with the vehicle hardware was performed through the Robot Operating System (ROS) using various ROS messages and topics.

Results
The trained policy was evaluated on various trajectories in simulation to identify the performance measurements and errors. The test trajectories were based on the system level test plan described in [15]. The objective of the controller was to stay within a distance of half the vehicle width (0.22 m) from the trajectory.
In Figure 14, the vehicle started with an initial distance error of 1 m from the reference trajectory and zero heading error. It was able to quickly converge back to the reference trajectory within a distance of 1 m along the path. Figures 15-17 demonstrate that the vehicle was able to track trajectories with sharp turns of varying angles while maintaining the original objective of staying within a distance of 0.22 m. In Figure 17, it can be observed that it took a distance of two vehicle lengths to reacquire the trajectory due to the large turning angle of 120°. Figures 18-20 demonstrate that the vehicle was able to complete relatively complex trajectories with straight line segments combined with consecutive left and right turns while staying within the error bounds specified for the task. A finite phase lag was observed in Figures 18 and 20, and a consistent undershoot can be observed in the second half of the figure-eight path in Figure 20. These behaviors were due to the vehicle turning early near the end of a path segment to account for future path segment data provided through the look-ahead points. Table 1 provides a summary of the RMS distance errors for the various trajectories discussed in this section. The performance of the controller deteriorated with larger turn angles, as it needed to deviate from the trajectory to accommodate the next segment, as can be seen in Figure 17 with a 120° turn; this is expected for a steering controller that does not control linear velocity. Even though the vehicle was able to track the trajectory, the performance could be further improved with more tuning and training.

Conclusions and Future Work
A continuous control reinforcement learning algorithm was implemented to train a path tracking controller for the dynamic model of a skid-steered vehicle. The controller was able to adapt effectively to the complex nonlinear forces and torques acting at the wheel-ground interface. The trained controller's performance was demonstrated on a series of complex trajectories while maintaining the objective of keeping the vehicle within half the vehicle width of the desired trajectory. The convergence of vehicle motion towards the desired trajectory from a variety of initial conditions was also demonstrated.
The model based development approach enables the extension of this work to other vehicles of varying size and complexity without major changes to the development tools and framework. Using the code generation capabilities of MATLAB, the trained agents can be deployed directly onto the real vehicle without spending significant amounts of time coding the controllers in other programming languages.
The performance of the trajectory tracking controller can be improved by adding linear velocity control in addition to angular velocity. This would help the system better navigate sharper turns and ensure the convergence of the vehicle towards the trajectory in case of initial off-track errors.
This approach was successful, but it has some limitations, including the need for an accurate simulation model for training and long training times for the reinforcement learning algorithm to explore and learn useful behaviors. Even though the model takes into account the effect of nonlinear forces and torques, some aspects of the model were lacking when compared to the real-world robot. More importantly, a neural network path tracking controller does not guarantee asymptotic or global asymptotic stability over the operational range. This makes it difficult to deploy such controllers in safety-critical applications where autonomous robots operate alongside or near human operators.
There are other control schemes designed to satisfy stability conditions, such as nonlinear human operator models and classical state-space optimal controllers. Many state-space nonlinear controllers suffer from instability outside their linearized operating region. A logical step for future work would be to combine some of these techniques to bound the exploration process of the RL agent with constraints, or to leverage the advantages of deep neural networks for tuning controller models that have stability guarantees.
Author Contributions: S.S. implemented the reinforcement learning algorithm on the skid-steer vehicle and tested the algorithm. W.R.N. conceptualized the test plan and research that would be implemented while providing supervision and direction for the research. D.N. and A.S. provided project supervision, direction, and funding. All authors have read and agreed to the published version of the manuscript.