Sim-to-Real Application of Reinforcement Learning Agents for Autonomous, Real Vehicle Drifting

Abstract: Enhancing the safety of passengers by venturing beyond the limits of a human driver is one of the main ideas behind autonomous vehicles. While drifting is mostly seen in motorsports as an advanced driving technique, it could offer many possibilities for improving traffic safety by avoiding accidents in extreme traffic situations. The purpose of the research presented in this article is to provide a machine learning-based solution to autonomous drifting as a proof of concept for vehicle control at the limits of handling. To achieve this, reinforcement learning (RL) agents were trained for the task in a MATLAB/Simulink-based simulation environment, using the state-of-the-art Soft Actor-Critic (SAC) algorithm. The trained agents were tested in reality at the ZalaZONE proving ground on a series production sports car with zero-shot transfer. Based on the test results, the simulation environment was improved through domain randomization, until the agent could perform the task both in simulation and in reality on a real test car.


Introduction
In the rapidly evolving landscape of transportation, autonomous vehicles stand out as one of the most anticipated and researched fields in recent years. In the area of autonomous vehicle control (AVC), advancements in technology will soon allow the application of automated driving systems (ADS) that are capable of acquiring specific human driving skills. These include, of course, lower SAE (Society of Automotive Engineers) levels [1] like cruise control, lane keeping, and basic parking [2], and also higher SAE levels like L2+ and L3 vehicle automation. However, motion control beyond the handling limit is still mostly unresolved, and its solutions primarily exist only in virtual reality.
Vehicle stability refers to a vehicle's ability to maintain equilibrium or resume its original state after experiencing a disturbance such as a wind gust, an uneven road surface, or a sudden small change in the steering [3]. Stability analysis can be performed in many ways [4]. In most cases, it refers to estimating a stability region where the vehicle has the above-described ability to maintain equilibrium. It can also be approached by analyzing tire slip angles and tire force saturation [5]. This can also be an important aspect of vehicle stability control, especially when considering the tire's radial, circumferential, and lateral stiffness. Based on related studies [6], these properties also affect how the torque input of the vehicle is transferred to the road, and thus the ability to cause and prevent sliding and drifting.
Current state-of-the-art Advanced Driver-Assistance Systems (ADAS) follow the proven principle of avoiding driving scenarios that might cause a road vehicle to behave in ways that are unknown to most human drivers. There are several well-known technical approaches to this, like ABS (Anti-lock Braking System), ASR/TCS (Anti-Slip Regulation/Traction Control System), and ESC (Electronic Stability Control). While it is undeniable that these systems provide safer and more effective vehicle control [7], it has not been proven that they are the most effective way to reduce the number of control loss-related accidents. A good example of this would be a collision-avoidance scenario where an object enters the vehicle's collision path far inside its braking zone. In such cases, there is a chance that performing an evasive steering maneuver could result in losing stability, which could endanger the passengers or other road users. Another example would be aquaplaning or prolonged driving on a low-traction surface at a high speed. In [8], the authors address multiple scenarios where a maneuver at the handling limit significantly reduces fatality and collision risk. In addition to this reasoning, it is also an essential scope of AVC to enhance road vehicle transportation beyond human limits, especially considering the rapidly evolving field of overactuated vehicle control [9,10], which is an overly complex task for most human drivers. This makes this area of research important not just for safety-critical systems, but also for motorsports, space exploration, and military sciences.
Drifting can be considered one of the most basic maneuvers of unstable vehicle control. It is a cornering motion mostly characterized by a high sideslip angle, saturated tire forces, and counter-steering. As such, it is an excellent example of unstable vehicle motion, and its predictable control can improve autonomous vehicles' capabilities significantly. Drifting can be defined formally in many different ways. The research presented in this article is focused on steady-state drifting, which sets a goal to reach and maintain a vehicle state that can be considered as an unstable drift equilibrium (for definition, see Section 2.2).
The current literature on maneuvering autonomous vehicles beyond the limit of handling presents many different approaches to the problem. First, it is important to highlight the applications of linear feedback controllers on two-wheeled vehicle models, which successfully utilized the model-based calculation of equilibrium points. Concerning the selected actuators, in [11], the longitudinal and lateral motion control was decoupled using two independent controllers. In [12,13], the authors successfully used a Linear Quadratic Multiple Input Multiple Output (LQ MIMO) controller. Positive results were also obtained using a four-wheel vehicle model [14]. In addition, some works utilize the added control of braking [14][15][16]. Trajectory-following drift controllers are also present in the literature [17].
MPC (Model Predictive Control) has proven successful for both stabilization and track-following drift tasks [18,19]. It also performed well when simulating changing road conditions, thus indicating a good adaptive property. Robust control-based [20] and multilayer methods [21] are also present in the literature. Among the results that consider path planning and following, it is worth mentioning the hybrid application of linear control and MPC for implementing a parking maneuver to be carried out with drift [22]. This construction worked both in simulation and with a small RC car. As a more advanced objective, drift-based collision avoidance has also been considered with different hierarchical controllers [23,24].
The application of reinforcement learning to solve autonomous drifting also shows a promising direction. Among the potential advantages of reinforcement learning is primarily the ability to generalize under constantly changing driving conditions (such as varying road surfaces) and the absence of the need to use previous data for training. It also provides an intuitive way to define objectives for the driving agent through its reward function without specifying low-level control demands or planning an exact trajectory if the use case in question does not require it.
In general, it can be stated that deep reinforcement learning (DRL) has been the most popular approach in autonomous vehicle control [25]. The main reason is that, in theory, a discrete agent operating in a continuous environment performs the task insufficiently or in too piecemeal a fashion, usually due to the incorrect construction of the finite space representation. Despite this, the application of a discrete agent is within the scope of this paper, along with a DRL-based contender. The results produced by these methods in simulation have been promising so far [26][27][28].
In addition to the previous works of the authors, other works, like [29], used the Probabilistic Inference for Learning Control (PILCO) [30] model-based policy-search algorithm to achieve steady-state drifting in simulation and on a small-scale remote-controlled vehicle. Strengthening these results, ref. [31] performed similar work using PILCO and deep Q-learning (DQL). Also, ref. [32] defined drifting as a track-following task in the CARLA [33] simulator, where the exact goal was to achieve large sideslip angles at high speeds. The agent was trained on several predetermined courses, and then evaluated on a course unknown to it, with several different types of vehicles and road condition settings. In [34], the authors proposed a similar solution and approach. The training in these two cases was carried out with the Soft Actor-Critic (SAC) algorithm, and [35] also experimented with this task with the TD3 algorithm [36], which also proved to work well. Further developing the results mentioned so far, ref. [37] successfully developed an agent that can even drift on arbitrary trajectories in simulation. Also, as with control methods, drift parking has been attempted with RL too [38]. In addition to these works, an autonomous racing task, including control at handling limits, was also attempted with a model-based policy search [39].

Problem Statement
This paper presents novel results on applying reinforcement learning agents for autonomous drifting on a full-scale, real test vehicle. This is the same car that was used in [13,19]. To the authors' best knowledge, this is the first time such an agent was successfully deployed to perform autonomous drifting on a real vehicle.
The exact goal of the agent is to perform a steady-state drifting maneuver with a real test car and maintain it for arbitrary durations, even with GNSS (Global Navigation Satellite System) and/or sensor noise and delay. The car is accelerated to v_x = 28 km/h by a controller that is independent of the agent, and, when this speed is attained, the control of the vehicle is entirely released to the RL agent. Then, its objective is to reach a defined medium-speed target drift state and maintain it with a minimal mean error while handling the actuators smoothly enough to not lose stability from sudden actuator inputs, even with small disturbances. This goal was formulated intuitively without using prior data from expert drivers. This problem definition is possible because an RL agent is used for the solution. The considered traction conditions for this application range between grip coefficients of µ = 0.6 and µ = 0.95, and the tests were performed in such conditions. Learning to drift for any traction coefficient in this range effectively requires an adaptive ability, which is one of the reasons to use an RL agent. Successfully performing this maneuver would prove that RL agents can learn to control vehicles beyond handling limits. This would indicate that it is possible to train such agents for collision avoidance in situations where only such an aggressive maneuver could ensure the safety of the passengers.
In the following section, the paper presents the vehicle model structure and drift equilibrium calculation background, which were implemented in the MATLAB/Simulink simulation environment for training the agents. In Section 3, the RL algorithm used for training (SAC) is described, along with the agent's specific structure and parameters. Section 4 discusses the applied sim-to-real methodology, the test vehicle's characteristics, and the hardware used for the real tests. Section 5 presents the results achieved. A discussion of the analysis, acknowledgment, and references closes the paper.

Vehicle Dynamics
To have a proper understanding and a correct implementation for the problem of steady-state drifting, addressing the vehicle's dynamics is essential. In general, drifting requires a vehicle with sufficient rear torque and traction. Typically, most of these are rear-wheel drive (RWD) vehicles, so the chosen model is also RWD, but, at the same time, drift-like behavior is not impossible with front-wheel or four-wheel drive vehicles either. The vehicle model features presented throughout this section are validated based on previous works [13,19].
Because training a learning agent in a simulation environment requires significant computation resources, it is beneficial to have a simple but still realistic vehicle environment. In the case of drifting, ignoring pitch and roll dynamics does not seem to prevent engineering agents that can perform drifting (for example, [13] proves this), and, at the same time, involving them would significantly increase the dimensionality of the problem. Therefore, the vehicle models presented in this paper only consider three degrees of freedom: translation in the x and y directions and rotation around the z axis (yaw).
The Newton-Euler equations of motion of the vehicle in the body frame's coordinate system can be described as follows [40]:

$$m(\dot{v}_x - r v_y) = F_x \tag{1}$$
$$m(\dot{v}_y + r v_x) = F_y \tag{2}$$
$$I_z \dot{r} = M_z \tag{3}$$

where v_x and v_y are the velocities in the x and y directions, respectively, r is the rotation speed of the vehicle around the z-axis (yaw rate), m is the vehicle's total mass, and I_z is the yaw moment of inertia. Based on these, the derivatives can be expressed as

$$\dot{v}_x = \frac{F_x}{m} + r v_y \tag{4}$$
$$\dot{v}_y = \frac{F_y}{m} - r v_x \tag{5}$$
$$\dot{r} = \frac{M_z}{I_z} \tag{6}$$

The force components in (1)-(3) can also be derived as the total effect of the forces acting on the wheels:

$$F_x = F_{xf}\cos\delta - F_{yf}\sin\delta + F_{xr} \tag{7}$$
$$F_y = F_{xf}\sin\delta + F_{yf}\cos\delta + F_{yr} \tag{8}$$
$$M_z = a\,(F_{xf}\sin\delta + F_{yf}\cos\delta) - b\,F_{yr} \tag{9}$$

where a and b are the distances of the front and rear axles from the center of gravity (l = a + b is the wheelbase), respectively, and δ is the front (steered) wheel angle. Since the chosen vehicle has rear-wheel drive, the longitudinal force acting on the steered front wheel is zero: F_xf = 0. The rear longitudinal force F_xr and the front wheel angle δ are the input parameters of the vehicle. The specific constant values (based on the properties of the test vehicle) for the vehicle model's equations are described in Table A1. Figure 1 illustrates the vehicle model.
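As an illustration of how such a three-degree-of-freedom model can be evaluated numerically, the following minimal Python sketch computes the state derivatives of a planar single-track model from the tire forces and the steering angle. It assumes the standard single-track force balance; the function name, argument order, and the example parameter values (including the yaw inertia of 3000 kg m²) are placeholders chosen for illustration and are not taken from Table A1 or from the authors' Simulink model.

```python
import math


def single_track_derivatives(v_x, v_y, r, F_xr, delta, F_yf, F_yr,
                             m, I_z, a, b, F_xf=0.0):
    """Return (dv_x, dv_y, dr) of a planar single-track model (body frame).

    F_xf defaults to zero, which corresponds to a rear-wheel-drive vehicle.
    """
    # Resultant body-frame forces and yaw moment from the wheel forces.
    F_x = F_xf * math.cos(delta) - F_yf * math.sin(delta) + F_xr
    F_y = F_xf * math.sin(delta) + F_yf * math.cos(delta) + F_yr
    M_z = a * (F_xf * math.sin(delta) + F_yf * math.cos(delta)) - b * F_yr

    # Newton-Euler equations solved for the derivatives.
    dv_x = F_x / m + r * v_y
    dv_y = F_y / m - r * v_x
    dr = M_z / I_z
    return dv_x, dv_y, dr


# Example with placeholder parameters: straight driving with 1820 N of
# rear traction force on a 1820 kg vehicle.
dv_x, dv_y, dr = single_track_derivatives(
    v_x=10.0, v_y=0.0, r=0.0, F_xr=1820.0, delta=0.0,
    F_yf=0.0, F_yr=0.0, m=1820.0, I_z=3000.0, a=1.32, b=1.37)
print(dv_x, dv_y, dr)
```

In a training loop, a function of this kind would be called by a numerical integrator at every simulation step, with the lateral forces supplied by the tire model.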

Tire Modeling
The tire model is one of the most important parts of the simulation model's structure. Without a proper tire model, performing accurate simulations for drifting is not possible. The most critical expectation from the tire model is that the saturation of the lateral forces acting on the tires can be well described, which is an essential feature of drifting. For example, in [11][12][13][26][27][28], the implementation of a lateral slip brush tire model, proposed by [41], worked well for designing controllers for drifting. However, modeling longitudinal slip on the rear tire in addition to lateral slip supports an RL agent greatly, especially in the initiation phase of a drift maneuver. On this note, a combined slip tire model was implemented, based on the work of Pacejka [42]. There are also other state-of-the-art tire models that would be worth considering, like the TMeasy model [43], the elliptical tire model [44], and the Mooney-Rivlin material model [45,46].
The general form of the Pacejka model stands as the following:

$$y(x) = D \sin\{C \arctan[Bx - E(Bx - \arctan(Bx))]\}, \quad Y(X) = y(x) + S_v, \quad x = X + S_h \tag{10}$$

where y(x) (the output) is the longitudinal/lateral force component, x (the input) is the slip angle or slip ratio, D is the peak factor, C is the shape factor, B is the stiffness factor, E is the curvature factor, and S_v and S_h are the vertical and horizontal shifts, respectively. The shifts account for the camber thrust, conicity, and rolling resistance. Given the slip ratio S_x, the peak slip ratio S_xp, the slip angle α, and the peak slip angle α_p, the normalized slip parameters are determined: Afterward, the resultant slip of the tire and the modified slip ratio and slip angle are determined: From these, the longitudinal and lateral force components are given: where F_x0/F_y0 are the longitudinal/lateral force components, calculated using the modified slip ratio/angle and Equation (10).
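To make the role of the B, C, D, and E coefficients concrete, the following short Python sketch evaluates the widely used Magic Formula curve for a single slip input. The coefficient values in the example are illustrative placeholders, not the identified parameters of the test vehicle, and the combined-slip normalization described above is not included.

```python
import math


def magic_formula(x, B, C, D, E, S_h=0.0, S_v=0.0):
    """Evaluate a Magic Formula curve for slip input x.

    B: stiffness factor, C: shape factor, D: peak factor,
    E: curvature factor, S_h / S_v: horizontal / vertical shifts.
    """
    Bx = B * (x + S_h)
    return D * math.sin(C * math.atan(Bx - E * (Bx - math.atan(Bx)))) + S_v


# Placeholder coefficients: the force saturates near D as the slip grows.
for slip in (0.0, 0.05, 0.1, 0.2, 0.4):
    print(slip, magic_formula(slip, B=10.0, C=1.9, D=1.0, E=0.97))
```

The saturation visible at larger slip values is the property that matters for drifting, since the rear tires operate beyond the peak of this curve during the maneuver.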
In the model, the implemented wheel model is based on [47] and summarized as where J is the wheel inertia, T_w is the wheel torque, T_b is the brake torque, F_rr is the rolling resistance, r_w is the wheel radius, and ω_w is the wheel speed. Given these, the rear longitudinal slip κ can be calculated as while the lateral slip angles α_f, α_r and the sideslip angle β are formulated as The above equations complete the implemented vehicle model, whose parameters are collected in Table A1.

Drift Equilibrium Calculation
As mentioned, this paper focuses on autonomous drifting and has an objective defined as reaching a desired drift equilibrium state. Based on the work in [48], drift can be described as vehicle equilibrium states that involve rear lateral slip saturation, which is how reaching handling limits is described, based on [41]. As such, writing the steady-state (equilibrium) conditions and combining them with Equations (4)-(18) formulates an algebraic equation system with five variables (v_x, v_y, r, F_xr, δ).
The system can be solved by fixing some of these five variables to constant values. In the presented research, given that δ = −10° and v_x = 10 m/s, the following drift equilibrium was used for the objective: This leads to a left-directional circular drift motion with a sideslip angle of β = −18.6382°.

The Reinforcement Learning Model
The fundamental concept of reinforcement learning revolves around the continuous cycle of interactions between an agent and the environment. In this learning paradigm, the agent's primary objective is to maximize the cumulative reward it receives from the environment. This is achieved by the agent making decisions (performing actions) based on the current state of the environment, which then responds with feedback in the form of rewards. The agent discovers the optimal actions (its optimal policy) by exploring the action space.
For the previously introduced vehicle model, the following state and action spaces are defined: The state S ∈ 𝒮 contains the model's velocities and their derivatives. The action A ∈ 𝒜 represents the model's input variables in the form of the accelerator pedal position acc_ped and the steering wheel angle δ_steer, limited to a physically achievable range. The conversion of these variables into an (F_xr, δ) input pair is discussed in Section 4.3.
The state transition function P : [𝒮, 𝒜] → 𝒮 maps according to the underlying vehicle dynamics. It can be proven that the environment satisfies the Markov property:

The reward function is built from the RMSE between the current state and the target drift state, with an added term that penalizes the agent in proportion to the difference between its current and previous actuator request (action): The first term in (24) ensures that the agent reaches and maintains the target drift state with a minimal mean error, where S_t^(i) is the value of the ith position in the current state vector and S_drift^(i) is the same for the target state vector. The second term encourages the agent to smooth its driving policy during the drift's initiation and stabilization phases, where A_t^(i) is the value of the ith position in the issued action vector and A_{t−1}^(i) is the same but for the previous action vector.
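The structure of such a reward, a state-error term plus an action-smoothness penalty, can be sketched in a few lines of Python. This is only an illustration of the idea: the sign convention (negative cost as reward), the weights `w_state` and `w_action`, and the use of a root-mean-square action difference are assumptions made here, and the exact form and scaling of the terms in the original reward are not reproduced.

```python
import math


def drift_reward(state, target, action, prev_action,
                 w_state=1.0, w_action=0.1):
    """Hypothetical drift reward: negative weighted sum of two penalties."""
    # RMSE between the current state vector and the target drift state.
    rmse = math.sqrt(
        sum((s - t) ** 2 for s, t in zip(state, target)) / len(state))
    # Penalty on the change of the actuator request between two steps.
    action_change = math.sqrt(
        sum((a - p) ** 2 for a, p in zip(action, prev_action)) / len(action))
    return -(w_state * rmse + w_action * action_change)


# Placeholder vectors: (v_x, v_y, r) and (acc_ped, delta_steer).
print(drift_reward([10.0, -3.0, 0.8], [10.0, -3.4, 0.9],
                   [0.6, -0.2], [0.5, -0.2]))
```

With this shape, the reward is largest (zero) when the vehicle sits exactly at the target state and the action does not change, so both tracking accuracy and smooth actuation are rewarded.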
The agents were trained in MATLAB while connected to the Simulink environment that contained the vehicle model introduced in Section 2. The training algorithm used was the state-of-the-art Soft Actor-Critic (SAC) algorithm [49], specifically designed for continuous state and action spaces.
The agent is separated into two parts (see Figure 2). The actor represents the agent's policy, a mapping from the state space to the action space. This is a stochastic function, meaning, technically, that the action is generated from a normal distribution A ∼ N(µ_S, σ_S), where the distribution's parameters are given by the policy function (µ_S, σ_S) = π(S). This stochasticity helps the agent to explore the action space. The measure of this is the distribution's differential entropy. The algorithm tunes the weight (temperature) of the entropy term adaptively, based on the received rewards: less reward means more exploration is needed, while higher observed rewards in the long run cause reduced entropy.
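The stochastic actor and its entropy measure can be illustrated with a minimal Python sketch that samples an action from a diagonal Gaussian and evaluates the closed-form differential entropy of that distribution. The mean and standard deviation values below are placeholders standing in for the outputs of a policy network; practical SAC implementations typically also squash and scale the sampled action to the actuator range, which is omitted here.

```python
import math
import random


def sample_action(mu, sigma, rng):
    """Draw one action from a diagonal Gaussian policy N(mu, sigma)."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]


def gaussian_entropy(sigma):
    """Differential entropy of a diagonal Gaussian with std devs sigma."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s ** 2) for s in sigma)


rng = random.Random(0)
mu, sigma = [0.5, -0.1], [0.2, 0.05]  # placeholder policy outputs
print(sample_action(mu, sigma, rng))
print(gaussian_entropy(sigma))
```

A wider distribution (larger standard deviations) has a higher entropy and therefore explores more, which is the quantity the temperature parameter trades off against the reward.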
The critic is the value function, whose purpose is to approximate the expected cumulative reward for a given state-action pair. It calculates the TD (Temporal Difference) error that contributes to the actor's training. For more information regarding the algorithm, please refer to [49].
Both the actor and the critic use neural networks to approximate their respective functions. For the neural network structures used for the agent and the exact training parameters, see Appendix B.

Sim-To-Real Structure & Methodology

Test Vehicle Setup
As a vehicle platform for the agent's sim-to-real application and testing, a series production sports car (Figure 3) was used after suitable modification to enable self-driving. It was powered by a 3.0-liter twin-turbocharged straight-six engine that produced 302 kW (411 hp) between 5230 and 7000 RPM and 550 Nm torque between 2350 and 5230 RPM. The car had rear-wheel drive with an electronically controlled differential lock, which is essential for drifting [51]. The vehicle had a 7-speed dual-clutch transmission, and the 0-100 km/h acceleration time was 4.2 s. Performance sport tires were installed in sizes 245/35 ZR19 at the front and 265/35 ZR19 at the rear axle. The longitudinal position of the vehicle body Center of Gravity (CoG) was determined by measuring the individual axle weights with all the measurement equipment and two operators on board, who are currently needed to ensure the safety and handling of the measurement system. This resulted in 925 kg on the front and 895 kg on the rear axle, which gives a CoG position 1.32 m behind the front axle with the 2.69 m long wheelbase.
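The CoG position quoted above follows from a static moment balance about the front axle: the distance behind the front axle equals the wheelbase multiplied by the rear axle's share of the total mass. The short Python check below reproduces this arithmetic from the measured axle loads; the function name is chosen here for illustration.

```python
def cog_distance_from_front_axle(front_axle_kg, rear_axle_kg, wheelbase_m):
    """Longitudinal CoG distance behind the front axle (static balance)."""
    return wheelbase_m * rear_axle_kg / (front_axle_kg + rear_axle_kg)


a = cog_distance_from_front_axle(925.0, 895.0, 2.69)
b = 2.69 - a  # remaining distance from the CoG to the rear axle
print(round(a, 2), round(b, 2))
```

The result, about 1.32 m to the front axle and 1.37 m to the rear axle, corresponds to the distances a and b used in the single-track vehicle model.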
For the agent implementation, data acquisition, and controlling the actuators, dSpace microAutoBoxII real-time hardware was used, connected to a Raspberry Pi 4 Model B (Figure 4). The latter was needed because the real-time hardware does not support neural network inference in the form of a Simulink model block; thus, an outside control unit was required. The neural network inputs (state) and outputs (action) were transferred between the two pieces of hardware through a CAN (Controller Area Network) bus with a 50 ms sample time. This is identical to the sample time of the agent during training (see Table A2). A control model, developed in a MATLAB/Simulink environment, received the action signal from the Raspberry unit, and then sent it forward to the vehicle's actuator units [52]. After C code generation, it ran with a 1 ms sample time on the rapid prototyping system. To enable steer-by-wire, a steering robot was installed in place of the original steering wheel. It was used in angle-control mode and communicated via CAN bus with the microAutoBoxII.
For the longitudinal motion control of the car, the accelerator pedal was removed, and its signal was emulated by the microAutoBoxII. The engine torque, i.e., the traction force, can be controlled in this way.
For the self-localization of the vehicle, a GNSS system was used, with GSM RTK correction and dual antennas. All the relevant states (including sideslip angle, yaw rate, longitudinal velocity, etc.) of the vehicle can be calculated from the measured high-frequency position and heading signals.
For safety reasons, signals from the vehicle CAN were received by a non-intrusive contactless sensor and were converted to FMS-Standard messages by a special CAN Gateway (FMS Gateway). These CAN signals were used during model identification and real-time control as well.

Model Parameter Identification
The vehicle model parameters described in Table A1 were identified with measurements to reproduce the car's behavior as accurately as possible. The above-described tire models need the front and rear tires' cornering stiffness and friction factors. The values were determined with a ramp steer maneuver (ISO 13674-2) [53]. After selecting the neutral gear, the steering wheel was subjected to a slow and constant velocity ramp input.
A homogeneous, flat, dry asphalt surface, the 300 m diameter Dynamic Platform of the ZalaZONE Automotive Proving Ground [54], was used for parameter identification and agent testing. The tire forces were calculated from the lateral dynamics equations of the one-track model with the steady-state assumption ṙ = v̇_y = 0. Fitting the tire model to the measurement data gives a friction coefficient of 1 for both tires and a cornering stiffness of 300,000 N/rad for the front and 500,000 N/rad for the rear tires. These values yield a positive understeer gradient, which means that the car understeers.
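The sign of the understeer gradient can be checked with the common single-track expression K = m_f/C_f − m_r/C_r, where m_f and m_r are the masses carried by the front and rear axles and C_f and C_r are the cornering stiffnesses. The Python sketch below applies it to the axle loads measured on the test vehicle (925 kg and 895 kg) and the fitted stiffness values; treating the fitted stiffnesses as axle-level values is an assumption made here for illustration.

```python
def understeer_gradient(m_front, m_rear, c_front, c_rear):
    """Single-track understeer gradient in rad per (m/s^2)."""
    return m_front / c_front - m_rear / c_rear


K = understeer_gradient(925.0, 895.0, 300_000.0, 500_000.0)
print(K)  # a positive value indicates understeer
```

Under these assumptions K is roughly 1.3e-3 rad/(m/s²), i.e., positive, which agrees with the understeer behavior reported for the car.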
One of the control inputs from the agent is the accelerator pedal position acc_ped, which the control model can emulate as an analog signal for the vehicle. However, the vehicle model used for training requires the longitudinal force F_xr as the actuator parameter. Therefore, the relationship between the two quantities must be analyzed to implement it in the simulation environment. The traction force was estimated during a test maneuver from the longitudinal vehicle acceleration (while estimating the air drag and rolling resistance) and the actual engine torque signal received via CAN from the engine ECU. The results were collected in the second gear, which was used during drifting. The reason for the gear selection and limitation was to provide the agent with the best possible transmission characteristics for performing the targeted medium-speed (v_x = 10 m/s) drift maneuver without making the action selection more complex. The build-up of the engine torque follows the demand of the accelerator pedal with a considerable lag, which adds a challenge to the control task. On the other hand, the engine torque signal gives relatively accurate feedback on the traction force. The steering robot can realize a given steering wheel position, but, from the control point of view, it is the roadwheel position that matters. Therefore, the steering ratio was measured for the whole steering range. Additionally, the equivalent roadwheel angle of the bicycle model was also calculated.

The Applied Sim-to-Real Methodology
One of the biggest challenges in today's sim-to-real RL advancement is the application of reinforcement learning for tasks where the target (application) environment cannot be modeled in simulation with high precision, like between a vehicle model and a real test vehicle. However, there are some methods in the current state of the art [55] for robotics applications where the above issue is present.
In this paper, the applied sim-to-real method is domain randomization, which means that some critical parameters of the vehicle are randomized in the simulation during training to increase the robustness of the agent with respect to these specific parameters. These are usually parameters that could change depending on unmeasurable outside circumstances (meaning they could be seen as stochastic), or whose measurements are noisy or include an unknown offset. Through repeated testing and measurement of the performance of trained agents on the test vehicle, the following domain randomization solution was constructed.
The agent was trained on a range of traction coefficients between µ = 0.6 and µ = 0.95 to ensure good performance under different weather, location, and tire wear conditions. Even though the drift target (see Equation (20)) was calculated for µ = 0.95, the agent was able to learn to stabilize the vehicle in an equilibrium relatively close to the target state for differing traction coefficients within the above range.
High-frequency white noise was added to the received state variables to model the measurement noise from the GNSS signal (mirroring the measurement uncertainties and value fluctuation/delay coming from the sensor). Also, to model CAN communication, sensor, and actuator delays, the input and output signals were further augmented with a random time delay between 0.5 ms and 20 ms.
The engine and powertrain dynamics were modeled with a 1-D lookup table with six breakpoints between 0% and 100% to convert the accelerator pedal value to engine torque between 0 Nm and 550 Nm. The values of the breakpoints were randomized between training episodes in a range based on measurement data from the test vehicle (to substitute for the complex but not completely modeled characteristics of the engine). Also, linear transfer functions were added to the input signals to imitate the real transfer behavior between the agent's demand and the current state of the actuators.
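The randomization steps described above can be summarized in a small Python sketch that draws a new parameter set at the start of each training episode. The traction coefficient range, the delay range, the six breakpoints, and the 0-550 Nm torque range come from the text; the evenly spaced breakpoints, the linear nominal pedal map, and the ±10% scatter are hypothetical choices made only for illustration, and the sensor noise and actuator transfer functions are left out.

```python
import random


def interp_1d(x, xs, ys):
    """Piecewise-linear 1-D lookup with clamping at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])


def sample_episode_params(rng, scatter=0.1):
    """Draw one randomized parameter set for a training episode."""
    pedal_pct = [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]   # assumed spacing
    nominal = [550.0 * p / 100.0 for p in pedal_pct]   # assumed linear map
    torque = [min(550.0, max(0.0, t * (1.0 + rng.uniform(-scatter, scatter))))
              for t in nominal]
    return {
        "mu": rng.uniform(0.6, 0.95),           # traction coefficient
        "delay_s": rng.uniform(0.0005, 0.020),  # signal delay, 0.5-20 ms
        "pedal_pct": pedal_pct,
        "torque_nm": torque,
    }


params = sample_episode_params(random.Random(42))
print(params["mu"], params["delay_s"])
print(interp_1d(50.0, params["pedal_pct"], params["torque_nm"]))
```

Resampling these values every episode exposes the agent to a family of slightly different vehicles, which is what makes the learned policy tolerant of the mismatch between the simulation and the real car.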

Results
The following results show the trained agent's performance in simulation and with the real test vehicle. Figure 5 shows a simulation run (blue) and a measurement (yellow) from the moment the agent takes over until the 10 s termination mark. For the simulation results shown, the randomized values (e.g., noise, engine characteristics) were set to their mean values, except for the traction coefficient, which was µ = 0.95, in accordance with the definition of the target state. The real test results shown (referred to as measurement) were taken on a dry asphalt track. The simulation starts with the vehicle accelerating to v_x = 28 km/h. In reality, the vehicle was accelerated to v_x = 28 km/h using a PID (Proportional Integral Derivative) controller before turning on the RL agent. The left side shows the vehicle's state variables v_x, v_y, r along with the sideslip angle β. The right side consists of the calculated reward signal, the actuator signals, and a secondary performance measure called Isdrift. The definition of this measure is This measure indicates whether the agent successfully performs a left-directional drift with a sideslip angle between 10° and 35°. On the graphs of the pedal and the steering wheel, the blue and yellow curves show the agent's demand during the simulation and the test, respectively, and the red and purple curves show the current state of the actuators in the same manner. The actual state of the pedal is determined from the current engine torque percentage, measured by the vehicle's sensors during measurement and calculated during simulation. Figure 5 clearly shows that the largest differences are in the pedal diagram, reflecting that the very complex behavior of the engine is not completely captured in the simulation, while the steering wheel signals are close to each other, reflecting accurate modeling. Figure 6 shows the motion trajectory of the simulation results from Figure 5 (blue), and Figure 7 shows the motion trajectory of the exact measurement from Figure 5. The red part of the curve represents the (X, Y) coordinates of the vehicle (recorded from the GNSS signals) when Isdrift = 0, and the green part indicates Isdrift = 1. The blue arrows point in the direction of the vehicle's heading angle along the motion trajectory. The black dots mark every 0.25 s of time since the start of the measurement, identically to the vertical grid lines in Figure 5.
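A drift indicator of this kind can be sketched as a simple threshold test on the sideslip angle. The Python example below assumes the common definition β = atan2(v_y, v_x) and the sign convention suggested by the target state, in which a left-directional drift has a negative sideslip angle, so the 10° to 35° band is applied to the magnitude of a negative β. The function names are illustrative and this is not the original definition of the measure.

```python
import math


def sideslip_deg(v_x, v_y):
    """Sideslip angle in degrees from body-frame velocities."""
    return math.degrees(math.atan2(v_y, v_x))


def is_drift(beta_deg, lo=10.0, hi=35.0):
    """Return 1 for a left-directional drift (negative beta), else 0."""
    return 1 if -hi <= beta_deg <= -lo else 0


print(is_drift(-18.6382))                  # target sideslip angle
print(is_drift(sideslip_deg(10.0, -1.0)))  # small sideslip, no drift
```

Applied sample by sample to the recorded v_x and v_y signals, such a function yields a binary trace that can be used to color the trajectory plots.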

Discussion
As a reminder for the discussion of the results, the goal of the RL agent was to perform a drift with a sideslip angle of approximately β = −20° at a speed of around v_x = 10 m/s. This goal was formulated with a drift equilibrium target that was defined in Equation (20).
The simulation results show that the agent was successfully trained in the virtual environment. It performed the desired drift smoothly and efficiently, initiating and stabilizing the maneuver within 3 s. In addition, the applied domain randomization technique helped the agent to learn enough information to perform drifting during the real test as well.
Figure 7 illustrates well the evolution of this drift on the real vehicle. On the bottom right side, the car starts with a normal left turn without drifting, as indicated by the blue arrows pointing to the right of the red trajectory line. At (x = 12.5 m; y = 19 m), the arrow first changes its direction to the left side of the trajectory curve, demonstrating that the car started the drift. The same moment is evident in Figure 5, where the yellow Isdrift line changes its value from 0 to 1 at 2.2 s, similarly to the color change of the trajectory in Figure 7. This shows that the proposed reinforcement learning-based agent could initiate the drift state on a real vehicle. After this critical, drift-starting point, the arrows in Figure 7 are continuously on the left side of the remaining trajectory (in Figure 5, the yellow (real) Isdrift indicator is continuously 1 until the end of the complete test episode), showing that the reinforcement learning-based agent could maintain continuous (stabilized) drifting on a real vehicle.
In Figure 5, some discrepancies between the simulation and the real data can be seen. These are due to the limitations of the current sim-to-real domain randomization and vehicle model, which do not fully represent the real conditions. The lateral velocity and the yaw moment reach the target values more slowly in the real environment, indicating that the engine dynamics could be modeled more accurately to bring the simulated and the real environment closer to each other. This also affects the consistency of the sideslip angle and the time required to reach the drift domain. A comparison of Figures 6 and 7 also shows that this results in a smaller drift circle.
Considering expert human driver performance, based on [14], the agent handles the actuators and produces a motion trajectory similar to that of the human driver. Furthermore, the agent achieves a more consistent sideslip angle than the expert driver, which is essential for applying such techniques to collision avoidance.
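One possible way to quantify the "consistency" of the sideslip angle is sketched below. It assumes that consistency is measured over the samples where the drift indicator is 1, using the mean, the standard deviation, and the root-mean-square deviation from the β = −20° target. These metrics and the function name are assumptions made for illustration; the paper does not state which measure was used for the comparison with the expert driver.

```python
import math
import statistics


def sideslip_consistency(betas_deg, drift_flags, target_deg=-20.0):
    """Summarize the sideslip angle over the drift phase only.

    betas_deg   : sideslip angle samples in degrees
    drift_flags : 0/1 drift indicator per sample (same length)
    target_deg  : desired sideslip angle in degrees
    The choice of metrics is an assumption of this sketch.
    """
    active = [b for b, flag in zip(betas_deg, drift_flags) if flag == 1]
    if len(active) < 2:
        raise ValueError("need at least two samples in the drift phase")
    squared_errors = [(b - target_deg) ** 2 for b in active]
    return {
        "mean_deg": statistics.fmean(active),
        "std_deg": statistics.pstdev(active),
        "rmse_to_target_deg": math.sqrt(statistics.fmean(squared_errors)),
    }
```

Applying the same function to the agent's and the human driver's recordings would give directly comparable numbers, with a lower standard deviation indicating a steadier drift.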

Conclusions
The paper's goal was to show that it is possible to train RL agents to perform steady-state drifting, a challenging maneuver, on a real vehicle. The motivation is to provide the groundwork for applying these agents to more complex tasks involving extreme maneuvers performed at the limits of handling, like collision avoidance. The presented methodology and results show that RL agents trained purely in simulation can perform drifting on real vehicles. This indicates that, with further improvements, these agents can be trained to perform driving tasks involving collision avoidance and stability control.

Future Research
Considering future work, performance may improve with a remodeling of the complex engine dynamics, so that the characteristics of the pedal actuator match the real world more closely. Extending the domain randomization further on the selected elements (and involving new ones, like the tire characteristics) would make the agent even more robust.
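The idea of domain randomization can be illustrated with a minimal Python sketch: at the start of every training episode, a fresh set of simulator parameters is drawn around nominal values, so that the agent cannot overfit to one exact vehicle model. The parameter names, nominal values, and relative ranges below are hypothetical and are not taken from the paper, whose environment was built in MATLAB/Simulink.

```python
import random
from dataclasses import dataclass


@dataclass
class VehicleParams:
    """Hypothetical simulator parameters, for illustration only."""
    mass_kg: float
    friction_mu: float
    pedal_delay_s: float


# Assumed nominal values and relative spreads (not the paper's values).
NOMINAL = VehicleParams(mass_kg=1500.0, friction_mu=1.0, pedal_delay_s=0.10)
RELATIVE_SPREAD = {"mass_kg": 0.10, "friction_mu": 0.20, "pedal_delay_s": 0.50}


def sample_params(rng, nominal=NOMINAL, spread=RELATIVE_SPREAD):
    """Draw one randomized parameter set, uniformly around the nominal values."""
    values = {}
    for name, rel in spread.items():
        base = getattr(nominal, name)
        values[name] = rng.uniform(base * (1.0 - rel), base * (1.0 + rel))
    return VehicleParams(**values)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled episodes are repeatable
    for episode in range(3):
        print(episode, sample_params(rng))
```

Adding a new randomized element, such as a tire characteristic, would then amount to adding one field and one spread entry, provided the simulator exposes the corresponding parameter.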
Another idea would be to take further measurements with the current agents, and then use the recorded data to further train the agent with mixed, "supervised-reinforcement" learning. This might be the more beneficial approach for the current use case, although it is unknown how much measurement data would be required to maximize the performance. Next, reinforcement learning could be based directly on these hybrid real-simulated measurements or only on real measured data, going beyond the currently proposed sim-to-real case.
Experimenting in low-traction (µ < 0.6) environments is also an important direction for future research, especially considering the possible robustness of these RL agents in such cases. Designing an agent that can control stability and perform extreme maneuvers under widely varying grip conditions will be essential to improve road safety. To assist these capabilities, road traction coefficient estimation can also be considered [56].
Venturing even further, the main vision is to extend the agent's capabilities to more complex drift tasks, like trajectory following and collision avoidance. Overactuated vehicle control (like active suspension and decoupled drive control) is also in the scope of future research [16].

Figure 1. The implemented two-wheel dynamic vehicle model in its coordinate frame.

Figure 2. The SAC reinforcement learning framework [50]. The black arrows represent the general actor-critic framework. The red arrows introduce the entropy term.

Figure 3. The author's test vehicle while performing the drift maneuvers presented in this paper.

Figure 4. The test vehicle's setup with the connected hardware frame.

Figure 5. Comparison of the agent's performance between the simulation and the real vehicle (measurement).

Figure 6. The motion trajectory and the direction of the vehicle's heading when the agent is performing in simulation. This is the desired outcome of the drift objective defined in Equation (24).

Figure 7. The measured motion trajectory and the direction of the vehicle's heading produced by the agent performing on the real test vehicle (Figure 1).