Autonomous Driving in Roundabout Maneuvers Using Reinforcement Learning with Q-Learning

Abstract: Navigating roundabouts is a complex driving scenario for both manual and autonomous vehicles. This paper proposes an approach based on the use of the Q-learning algorithm to train an autonomous vehicle agent to learn how to appropriately navigate roundabouts. The proposed learning algorithm is implemented using the CARLA simulation environment. Several simulations are performed to train the algorithm in two scenarios: navigating a roundabout with and without surrounding traffic. The results illustrate that the Q-learning-algorithm-based vehicle agent is able to learn smooth and efficient driving to perform maneuvers within roundabouts.


Introduction
One of the most challenging problems for autonomous vehicles is complex maneuvering, such as driving in roundabouts in urban and nonurban environments. Roundabouts are a special case of intersection, where a circular traffic flow is established for a change of direction. To navigate a roundabout successfully, it is necessary to understand the choice of entry and exit lanes, how to apply priority rules, how to interpret the intentions of other drivers, and the existing traffic itself. Correct action selection in this particular scenario therefore requires a global understanding of the roundabout driving situation in order to obtain the best results.
One approach to understanding driving in roundabouts is through artificial intelligence and data mining techniques such as machine learning. For example, in [1] the authors presented rules of behavior to address a roundabout with an autonomous vehicle, modeling the behavior of a human driver through factors such as the speed of the vehicle, the angle of the wheel, the diameter of the roundabout, etc. The authors of [2] presented an adaptive tactical behavior planner (ATBP) for an autonomous vehicle, capable of planning behaviors similar to those of human drivers when navigating a roundabout. Roundabout safety under shared traffic was studied in [3] through models based on speed and traffic. The authors of [4] presented learning techniques to obtain the behaviors of human drivers when approaching a roundabout without posted signs, in order to obtain behavioral profiles applicable to autonomous vehicles. Applying machine learning techniques such as the support vector machine (SVM), the authors of [5] presented a prediction model to obtain the vehicle's intention to enter or exit a roundabout; in that study, data from the vehicle's global positioning system (GPS) and multiple sensors were used to obtain the direction of the vehicle's path. Another approach using machine learning techniques is presented in [6], where the authors designed a roundabout driving classification using hidden Markov models (HMMs) trained with naturalistic driving data, while the authors of [7] proposed a sequential adaptive reinforcement learning approach for roundabout driving.

Reinforcement Learning Background
When training a machine learning model, three types of learning can be used, depending on the task to be performed: supervised learning, used mainly for classification and prediction tasks; unsupervised learning, which is suitable for clustering and finding relationships among attributes of data; and reinforcement learning, which creates models to learn patterns by trial and error. The latter is the algorithm used in the present work.
Reinforcement learning (RL), according to [10], is a machine learning technique that defines how a set of sequential decisions leads to the achievement of a goal. It is considered a trial-and-error method, in which the environment indicates the usefulness of the result. The authors of [11] consider RL a branch of artificial intelligence in which an agent learns a control strategy while interacting with the environment. In the same way, the authors of [12] consider that RL is capable of learning and making decisions by interacting repeatedly with its environment. Currently, several machine learning algorithms use the reinforcement learning paradigm as the basis for their implementation, such as adaptive heuristic critic (AHC) [13], Q-learning [14], Markov decision processes [15], and deep Q-learning [16]. RL is currently applied in several areas, such as computer networks [17], traffic control [18], robotics [19], object recognition [20], facial recognition [21], and autonomous driving [22], among the most prominent areas of research.
As far as RL algorithms applied to autonomous driving, many studies can be found in the literature. For example, in [23] a successful application of RL is described for autonomous helicopter flight, while in [24], the authors describe an experiment in RL to direct a real robot car based on data collected directly from real experiments. In [25], the ability of a system to learn behaviors to allow safe navigation for autonomous driving at intersections is explored.
RL is also used in simulated environments, and multiple examples can be found. For example, in [11,26,27] various RL methods are proposed in which autonomous vehicles learn to make decisions while interacting with simulated traffic. In [28,29] the authors present different methods for deep end-to-end RL to navigate autonomous vehicles, and in [30] a speed control system is designed using RL.
In this paper, the system that was developed is based on the RL paradigm. The system's framework consists of an agent (the autonomous vehicle) that interacts with a driving simulation environment, a finite state space (S), a set of available actions (A), and a reward function (R), where the agent's main task is to find the policy π: S × A → [0, 1]. According to [31], the general framework is based on interactions between agents, where the environment is characterized by a set of states in which the agents can take actions from the action set. The agent interacts with the environment and transitions from state x(t) = x to x(t+1) = x′ by selecting an action a(t) = a. The interactions between the agent and the environment come from the agent's observations about the state of the environment: the agent selects an action and then receives feedback, or reward, from the environment according to the selected action. That is, when the agent observes the state of the environment x(t) at time t, it selects an action and makes a transition to state x(t+1) at time t + 1. Subsequently, the environment issues a reward r(t+1) to the agent. Figure 1 shows the general agent-environment interaction system.
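As an illustration of this interaction cycle, the following minimal Python sketch shows the generic agent-environment loop described above. The RoundaboutEnv and Agent interfaces and their methods are placeholders introduced for the example, not part of the CARLA API or of the implementation reported in this paper.

```python
# Minimal sketch of the agent-environment interaction loop (hypothetical interfaces).
class RoundaboutEnv:
    def reset(self):
        """Return the initial state x(0)."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        raise NotImplementedError


class Agent:
    def select_action(self, state):
        """Choose an action a(t) for the observed state x(t)."""
        raise NotImplementedError

    def update(self, state, action, reward, next_state):
        """Learn from the feedback r(t+1) issued by the environment."""
        raise NotImplementedError


def run_episode(env, agent, max_steps=100):
    state = env.reset()
    for _ in range(max_steps):
        action = agent.select_action(state)          # agent observes x(t) and picks a(t)
        next_state, reward, done = env.step(action)  # environment returns x(t+1) and r(t+1)
        agent.update(state, action, reward, next_state)
        state = next_state
        if done:
            break
```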

CARLA Simulation Environment
CARLA is a simulation environment for autonomous driving systems. It consists of open-source code and protocols that provide digital assets, such as urban layouts, buildings, and vehicles, used to corroborate and evaluate decisions and design approaches in driving simulations. It follows the client-server paradigm. CARLA supports the configuration of various sensor sets and provides signals to train driving strategies, such as GPS coordinates, acceleration, steering wheel angle, etc. It also provides information related to distance travelled, collisions, and the occurrence of infractions, such as drifting into the opposite lane or onto the sidewalk. The environment model consists of a simulated autonomous driving system where driving in a roundabout is defined. Following the direct perception approach, dimensional video data (red, green, blue (RGB) camera, semantic segmentation camera, depth camera, and object vision detection) and a set of naturalistic driving data are processed into meaningful data about roads. The agent model consists of an autonomous vehicle that interacts with the simulated environment through different actions (acceleration, braking, determining the roundabout diameter, adjusting the vehicle speed, determining the roundabout center, starting and ending the route, and determining the deviation angle from the center of the lane). It learns from the feedback resulting from the GPS position of the vehicle at state x(t) at time t in the simulated environment, using the Q-learning algorithm and a reward system. Figure 2 shows the architecture of our system, and Figure 3 provides maps and some views of the simulation environment used. It includes various urban scenarios, including a roundabout. The map covers an area of 600 m × 600 m, containing a total of around 5 km of road.
The autonomous vehicle is controlled by different types of commands in the simulation environment: (1) Steering: the steering wheel angle is represented by a real number between −40° and +40°, which correspond to full left and full right, respectively. (2, 3) Throttle, brake: these are represented by real numbers between 0 and 1. (4) Hand brake: a Boolean value is used to indicate whether the hand brake is activated or not.
The data acquisition system includes (1) an RGB camera, equipped with semantic segmentation of the simulation environment and including 3D location and orientation with respect to the car's coordinate system; (2) a semantic segmentation pseudo-sensor camera, providing support for experiments of perception; and (3) a sensor providing GPS position and information on roads, lane markings, traffic signs, sidewalks, fences, poles, walls, buildings, vegetation, vehicles, pedestrians, etc. In addition to the observations and actions, information such as the diameter and center of the roundabout, the start and end of the route established for the roundabout, and the angle of deviation from the center of the lane are recorded.
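For reference, a minimal sketch of how such an agent could connect to a CARLA server, spawn a vehicle with an RGB camera, and issue control commands is shown below. It uses the standard CARLA Python client API, but the host, port, blueprint choice, sensor placement, and output path are assumptions rather than the exact configuration used in this work, and the attribute names may vary slightly between CARLA versions. Note that carla.VehicleControl expects a normalized steer value in [−1, 1], which the simulator maps to the physical wheel-angle range (e.g., the ±40° described above).

```python
import carla  # CARLA Python client API

# Connect to a running CARLA server (host and port are the CARLA defaults; adjust as needed).
client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()
blueprint_library = world.get_blueprint_library()

# Spawn a vehicle at one of the map's predefined spawn points.
vehicle_bp = blueprint_library.filter('vehicle.*')[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)

# Attach an RGB camera to the vehicle (placement is illustrative).
camera_bp = blueprint_library.find('sensor.camera.rgb')
camera_transform = carla.Transform(carla.Location(x=1.5, z=2.4))
camera = world.spawn_actor(camera_bp, camera_transform, attach_to=vehicle)
camera.listen(lambda image: image.save_to_disk('_out/%06d.png' % image.frame))

# Issue a control command; steer is normalized to [-1, 1].
vehicle.apply_control(carla.VehicleControl(throttle=0.5, steer=0.1,
                                           brake=0.0, hand_brake=False))

# Read back the vehicle's position, used as the GPS-like feedback for the agent.
location = vehicle.get_transform().location
print(location.x, location.y)
```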

Machine Learning Model
This section describes a CARLA environment for planning the behavior of a vehicle to navigate a roundabout using the Q-learning algorithm. In addition, a naturalistic driving dataset is used to provide contextual information. This section explains the concepts of a roundabout scenario and the application of the Q-learning algorithm in this context as well as the reward policy.

Roundabout Scenario
Roundabouts present specific challenges in the complexity of driving behavior, in terms of high variance in the number of lanes and increased uncertainty in perception due to the road geometry. It is crucial that autonomous vehicles exhibit natural behavior on roundabouts for the safety and smooth flow of shared traffic between them and manual vehicles [32]. The approach followed in this paper is to formulate the driving task within a roundabout as a Markov decision process (MDP) problem. The experiments performed were based on simulated and real data.
The roundabout shape used in the experiment is shown in Figure 4, where the exits are marked as A for the first one, B for the second, and so forth. Typical vehicle behavior in a roundabout consists of the following steps: approach, cross the roundabout, and exit. The possible driving paths are drawn in different colors with their centerlines. In the CARLA framework, a route is defined by the tuple {Start_point, End_point}.
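Purely as an illustration of this route representation, the snippet below encodes one hypothetical route per exit as a {Start_point, End_point} pair; the waypoint coordinates are invented for the example and do not correspond to the actual map locations used in the experiments.

```python
# Hypothetical route definitions as {Start_point, End_point} tuples (coordinates are illustrative only).
from collections import namedtuple

Waypoint = namedtuple("Waypoint", ["x", "y"])  # 2D map coordinates in meters

routes = {
    "A": (Waypoint(  5.0, -60.0), Waypoint( 60.0,   5.0)),  # enter, take the first exit
    "B": (Waypoint(  5.0, -60.0), Waypoint(  5.0,  60.0)),  # enter, take the second exit
    "C": (Waypoint(  5.0, -60.0), Waypoint(-60.0,   5.0)),  # enter, take the third exit
}

start_point, end_point = routes["A"]
```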

The final objective is to determine the behavior strategy so that the autonomous vehicle will enter the roundabout and navigate the exits (A, B, C) correctly and safely, regardless of whether or not there are other vehicles on the road.

Q-Learning Algorithm
Based on RL, an MDP was modeled for the task of planning the behavior of a vehicle to safely navigate a roundabout on paths A, B, and C. The adaptive model uses a Q-learning algorithm [14], a commonly used algorithm to solve Markov decision processes.
In Q-learning, the action value function Qπ(s, a) is the expected return E[R_t | s_t = s, a_t = a] for a state-action pair following a policy π, where R_t is the reward, s_t the state, and a_t the action. Given an optimal value function Qπ(s, a), the optimal policy can be inferred by selecting the action with maximum value max_a Qπ(s, a) at every time step. The method is based on finding a function Q(s, a) as an estimate of the value function (Q-value). The function Q(s, a) represents the utility of taking action a in state s. Given the function Q(s, a), the optimal policy is the one that selects, for each state, the action associated with the highest expected accumulated value (Q-value). The function Q(s, a) is updated using the following equation for adjusting temporal differences: Q(s_t, a_t) = Q(s_t, a_t) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)).
This equation adjusts Q(s, a) based on the current and predicted reward if all subsequent decisions were optimal. In this sense, the function Q(s, a) converges toward the optimal values of the function. The machine learning model can use the Q-values to evaluate each decision that is possible in each state. The decision that returns the highest Q-value is the optimum. The whole procedure of model operation based on the Q-learning algorithm for this paper is shown in Algorithm 1 in Appendix A.
The Q-value derived from the performance of an action is the sum of the immediate reward provided by the environment and the maximum value of Q for the new state reached. The transition to the next state is defined by the transition function T and is affected by the parameter γ, referred to as the discount factor. Formally, Q(s_t, a_t) = r_{t+1} + γ max_a Q(s_{t+1}, a), and in practice the values of Q are updated using the following: Q(s_t, a_t) = Q(s_t, a_t) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)), with 0 ≤ α, γ ≤ 1. The learning rate is set using the parameter α. For example, if α = 1, the new value of Q(s, a) does not take into account the previous history of the value of Q, but will be the direct reward added to the maximum value of Q for the new state, corrected by the γ factor.
In the presented algorithm, the values of the Q function are modified and organized as a table with information about the new states and actions being explored. Thus, each row corresponds to a different state, and each column stores information about the value of the actions. Specifically, element (i, j) of the table represents the value of performing action a_j from state s_i. Table 1 shows a Q-table obtained by running the algorithm given in Appendix A over the acquired states, for example, when driving through exit A. In problems such as the one presented in this paper, the number of possible states can be overwhelming, making it very expensive to collect all the experiments and their updates, and making the problem unmanageable from a computational point of view.
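A minimal tabular sketch of this update, assuming the continuous state space has already been discretized into a finite number of indices, is given below; the state and action counts and the hyperparameter values are illustrative only and are not the values used in the paper.

```python
import numpy as np

# Illustrative sizes: a discretized state space and a small discrete action set.
N_STATES, N_ACTIONS = 100, 5
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.7   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

Q = np.zeros((N_STATES, N_ACTIONS))     # rows = states s_i, columns = actions a_j


def select_action(state):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit the Q-table."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))      # exploration: random action
    return int(np.argmax(Q[state]))              # exploitation: best known action


def td_update(state, action, reward, next_state):
    """Temporal-difference update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])
```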

State-Space Training Model
In the Markov decision process context, the state space is derived from the vehicle's GPS position with respect to the center of the lane and the roundabout, as described below. To deal with traffic, the simulated vehicle uses a perception system that detects and tracks other participants through different sensors, as explained in Section 3. The vehicle's trajectory control within a roundabout is achieved through the speed and the wheel angle. The concept of defining the trajectory of the autonomous vehicle is based on the optimal control approach presented in [33] and successfully implemented in [34]. The viability of planned trajectories is guaranteed by imposing realistic predictions on the speed [1], acceleration, and exit of the roundabout.
The training model follows a learning approach in which 70% of the actions are applied randomly (exploration) and the remaining 30% are based on actions already learned (exploitation). The Q-learning algorithm is therefore used, based on the experience generated during the exploration of the environment. Each model is first trained without any other vehicles, with the goal of learning the optimal policy, and is then retrained with other vehicles using random initialization. The training model is based on the vehicle's GPS positions with respect to the center of the lane, and the model is copied to the main network every 10,000 iterations. The deviation from the center of the lane, as depicted in Figure 5, is obtained from these positions; after defining the parameters, the angle of deviation from the center of the lane is calculated in degrees.
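Since the exact deviation-angle formula is not reproduced in this text, the following sketch shows one plausible way such an angle could be computed from two consecutive lane-centerline points and the vehicle heading; it is an illustrative assumption, not the paper's equation.

```python
import math

def deviation_angle_deg(lane_center_prev, lane_center_next, vehicle_heading_deg):
    """Hypothetical deviation-angle computation (the paper's exact formula is not reproduced here).

    The lane-centerline direction is estimated from two consecutive centerline points (x, y),
    and the deviation is the signed difference between the vehicle heading and that direction.
    """
    lane_dir = math.degrees(math.atan2(lane_center_next[1] - lane_center_prev[1],
                                       lane_center_next[0] - lane_center_prev[0]))
    # Wrap the signed difference into the [-180, 180) range; the left/right sign
    # convention depends on the coordinate frame used by the simulator.
    return (vehicle_heading_deg - lane_dir + 180.0) % 360.0 - 180.0
```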

Reward Policy
In Q-learning, the reward policy acts as an objective function from an optimization problem point of view. For the current scenario, human behavior was taken into account, where the objective is to navigate safely and efficiently in a roundabout without impeding the flow of traffic. A system of double rewards was used, depending on the current state of the vehicle.
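As an illustration only, a double-reward scheme of the kind described could look like the following sketch, with one term for lane keeping and one for the traffic and goal state; the thresholds, reward magnitudes, and exact conditions are invented for the example and are not the values used in the paper.

```python
def reward(deviation_deg, collided, reached_exit, other_vehicle_ahead, speed_mps):
    """Hypothetical two-part reward: one term for lane keeping, one for the traffic/goal state.

    All thresholds and magnitudes below are illustrative assumptions, not the paper's values.
    """
    # Part 1: lane-keeping reward, decreasing with the deviation from the lane center.
    lane_reward = 1.0 - min(abs(deviation_deg) / 40.0, 1.0)

    # Part 2: state-dependent reward for traffic interaction and task completion.
    if collided:
        state_reward = -10.0                      # terminal penalty
    elif reached_exit:
        state_reward = 10.0                       # correct exit reached
    elif other_vehicle_ahead and speed_mps > 1.0:
        state_reward = -1.0                       # should yield to traffic inside the roundabout
    else:
        state_reward = 0.0

    return lane_reward + state_reward
```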


Experimental Results
In this section, two experiments are described: the first one deals with entry and exit A without traffic, whereas the second one deals with entry and exits B/C with traffic. In the simulation the following nomenclature was used.
The symbolic description ∆ was used as follows: ∆ = {e, vm, mr, desv, dmr}. The target metric vector contains the exit (e), the average speed (vm) over a given distance within the defined route, the efficiency of the reward system (mr), the average deviation (desv), and the average distance traveled (dmr) for each attempt by the vehicle to satisfactorily exit the roundabout. The action-state cycle is repeated until the vehicle reaches the correct exit. If the vehicle ends up crashing or crossing one of the bounding lanes while in the action-state cycle, the training program is interrupted and the vehicle starts again. In the traffic scenario (B, C), the action-state cycle is discretized through the reward function, which depends on the positions of other vehicles inside the roundabout. This discretization is carried out using nearest-neighbor techniques, which provide a simple way to divide the state space into regions. The vehicle updates the trajectory information every 60 seconds, sends it to the training program, and takes the appropriate action specified by the training program. Figure 6 shows the two scenarios established in the experiment, where the test vehicle is in red.
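A minimal sketch of this kind of nearest-neighbor discretization is shown below, assuming a fixed set of reference points that partition the roundabout into regions; the reference points and their number are illustrative assumptions, not the partition used in the experiments.

```python
import numpy as np

# Hypothetical reference points (x, y) that partition the roundabout area into regions.
reference_points = np.array([
    [  0.0, -20.0],   # entry segment
    [ 20.0,   0.0],   # first quarter (toward exit A)
    [  0.0,  20.0],   # second quarter (toward exit B)
    [-20.0,   0.0],   # third quarter (toward exit C)
])

def discretize_state(vehicle_xy):
    """Map a continuous GPS position to the index of its nearest reference point (region)."""
    distances = np.linalg.norm(reference_points - np.asarray(vehicle_xy), axis=1)
    return int(np.argmin(distances))

region = discretize_state((18.5, 2.0))   # -> 1, the region toward exit A in this example
```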
The behavior was tested in the same scenario in which the vehicle was trained. The number of participating vehicles and their routes were fixed for a given experiment, but their initial positions varied randomly. The training dataset consisted of 96 hours of driving the vehicle manually within the simulation environment. The route included four roundabouts with three exits, A, B, and C. During the learning phase, the trained dataset was evaluated after 100 iterations using the trajectory examples. The agent was trained for a total of 10,000 episodes, each lasting 100 samples or until a collision occurred. For training, roundabouts with no traffic were considered, and the speed of the vehicle was set according to the predictive model obtained in [1].
For the two scenarios based on the simulated environment's recorded trajectories for the observed vehicle, the vehicle decision distribution is a ∈ ∆. The proposed framework was evaluated using the performance metrics previously cited to enter and exit roundabouts with and without traffic. The speed obtained from the predictive model [1] was adjusted in the trained model to the different segments of a roundabout, where convergence of the vehicle's behavior was observed within the simulation environment. During data collection, the drivers involved in the experiment met the following conditions: (1) they used routes with roundabouts of different diameters; (2) they used single and multiple lanes; and (3) they used the same vehicle equipped for testing. An important feature of the naturalistic driving data used for the simulated environment is that the RL algorithm could learn decision-making when arriving at a roundabout, and human behavior seems more promising when it comes to a real-life changing environment.
To apply reinforcement learning and obtain an optimal solution through this methodology, the reward function considered was bounded between two limit values in its measurement: (−, +). In the results, values with (−) correspond to the vehicle leaving on the left side of the state space and vary linearly between the limit values, and values with (+) correspond to when it leaves on the right side. Figures 7 and 8 show the return of the approach metrics for the simulations without traffic and with traffic, as well as the metric vectors obtained for each situation. As can be seen in the graphs, the metrics converge toward the value (mr) during the training phase, with the exception of the traffic simulation data, which diverge at a given point. This divergence is the result of the discretization of the reward function at the moment when the test vehicle must stop to yield to another vehicle inside the roundabout. Another significant aspect is the average speed (vm) in both experiments: in the case of traffic, the speed is reduced compared to the case without traffic, as is the distance traveled (dmr) by roundabout typology. Figure 9 shows the trajectory for exit A followed by the vehicle according to the simulation results, where the red line represents the path that the vehicle must follow and the blue line is the path obtained after the reinforcement learning algorithm is applied.

Summary and Discussion
In this paper, a framework for reinforcement learning for autonomous driving in roundabout scenarios is proposed. The problem is tackled as a Markov decision process, where the behavioral planning of the vehicle for safely navigating roundabouts uses the Q-learning algorithm. The approach was implemented using the CARLA simulation environment. Simulations carried out in this work used a set of naturalistic driving data from [1], including environmental information, as well as machine learning models for predicting steering angle and vehicle speed. The main contribution of this paper is the design of a tangible learning technique for sequential and automatic decision-making of autonomous vehicles through examples in a simulated environment. In the experiments carried out, the behavior process benefited from a guided policy in automatic decision-making in terms of tangible learning as determined by the implemented Q-learning algorithm. The resulting behavior after the iterative adaptation of the Q-value function allowed the autonomous vehicle to choose the appropriate actions between the start and end of the defined scenarios through GPS positioning in the reward function. The proposed method was evaluated in a challenging roundabout scenario with and without traffic by discretizing the reward function in a high-definition driving simulator. The results, in comparison with other learning methods, show that the autonomous vehicle had improved directionality with respect to the direction of other vehicles, adapted the average speed in a more realistic way in an environment with traffic, and improved the deviation of the vehicle's steering angle without hitting obstacles.
For future work, roundabouts with several exits and shapes as well as other scenarios will be considered. It would also be interesting to simulate the proposed framework using Simulation of Urban Mobility (SUMO) [35], including simulating complete roundabouts (exit D). Another line of work is to compare the results obtained in this paper with the application of the deep Q-learning algorithm. Finally, collecting more trajectories for analysis of the training and adaptation phases is also desirable.
Author Contributions: L.G.C. was responsible for the design and implementation of the reinforcement learning algorithm, the learning model, and validation. E.P. was responsible for selecting the machine learning algorithms and the general data mining process. J.F.A. was responsible for designing naturalistic driving data. N.A. was responsible for drafting the paper. All authors contributed to writing and reviewing the final manuscript.

Appendix A
The whole procedure of the model operation based on the Q-learning algorithm used in this paper is shown in this section (Algorithm 1).
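Since Algorithm 1 itself is not reproduced in this text, the following sketch illustrates the general structure such a procedure would have, combining the epsilon-greedy action selection (70% exploration), the temporal-difference update, and the episode termination conditions described in the paper; the env interface, hyperparameter values, and function names are assumptions introduced for the example.

```python
import numpy as np

def train(env, n_states, n_actions, episodes=10000, max_steps=100,
          alpha=0.1, gamma=0.9, epsilon=0.7, rng=None):
    """Sketch of a tabular Q-learning training loop for the roundabout task.

    `env` is assumed to expose reset() -> state and step(action) -> (next_state, reward, done),
    where `done` becomes True on a collision, a lane violation, or reaching the target exit.
    """
    rng = rng or np.random.default_rng()
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        state = env.reset()                       # vehicle placed at the roundabout entry
        for _ in range(max_steps):
            # Epsilon-greedy selection: explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, done = env.step(action)

            # Temporal-difference update of the Q-table.
            td_target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (td_target - Q[state, action])

            state = next_state
            if done:                              # collision, lane violation, or exit reached
                break
    return Q
```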