The Design of Performance Guaranteed Autonomous Vehicle Control for Optimal Motion in Unsignalized Intersections

Abstract: The design of the motion of autonomous vehicles in unsignalized intersections, with consideration of multiple criteria and safety constraints, is a challenging problem comprising several tasks. In this paper, a learning-based control solution with guarantees for collision avoidance is proposed. The design problem is formulated in a novel way through the division of the control problem, which reduces complexity and enables real-time computation. First, an environment model for the intersection was created based on constrained quadratic optimization, with which guarantees on collision avoidance can be provided. A robust cruise controller for the autonomous vehicle was also designed. Second, the environment model was used in the training process, which was based on a reinforcement learning method. The goal of the training was to improve the economy of autonomous vehicles while guaranteeing collision avoidance. The effectiveness of the method is demonstrated through simulation examples of unsignalized intersection scenarios with varying numbers of vehicles.


Introduction and Motivation
The handling of intersection scenarios is an important challenge in the research field of autonomous vehicles. From the control design perspective, several crucial tasks of autonomous vehicle control in intersection scenarios must be formulated. For example, the ordering of autonomous vehicles in intersections has an impact on the energy consumption of the vehicles, traveling time, emissions and traveling comfort (due to the acceleration/deceleration maneuvers, for example) [1]. This leads to a multi-objective optimization task, which generally has a Pareto-optimal solution. Moreover, the solution of the optimization problem regarding autonomous vehicles in intersections can require lengthy computations. This poses the challenge of minimizing the computational time, e.g., by finding approximations of the optimal solution. From an application point of view, the accurate detection of a vehicle's position in an intersection can be difficult. There exist camera-based vehicle localization strategies [2], solutions with LiDAR and camera fusion [3] and GPS and stereovision-based visual odometry solutions [4]. The detection and prediction of human behavior are also difficult and complex problems; see [5,6]. Therefore, considering the position errors caused by human participants, robust control strategies are needed with which collision avoidance can be guaranteed.
Due to its special challenges, the control of autonomous vehicles in intersections can be handled in the control system architecture as a special feature of the path planning algorithm of future vehicles. Its main differences from the general path planning problem are as follows. First, in intersections the autonomous vehicles can interact. Since the other vehicles are also in motion, the handling of intersection scenarios is a complex problem, which requires sensing, decision and control functionalities. In intersections the risk of collision is increased, e.g., compared to the motion of an autonomous vehicle along a highway. Second, various types of intersections can be found on the road; they can differ in the number of lanes, the directions and the signalization. A challenge for autonomous vehicles and intelligent transportation systems is the handling of vehicle interactions in unsignalized intersections [7]. In these types of intersections, the experience of human drivers plays a role in the avoidance of conflicts; e.g., the speed of the surrounding vehicles and the intentions of the other drivers must be estimated. In a general path planning algorithm, the waypoints, the rules and the information on the road and on the local traffic conditions can be considered to be given, and thus, less estimation is required. Third, in urban areas autonomous vehicles must travel through many intersections during their routes. Since these intersections are connected to each other, the motion of the vehicles in the intersections has an impact on the global performance of the traffic system, i.e., on the average speed or on the traffic density; see, e.g., [8]. Thus, during the design of the autonomous vehicle's motion, time minimization is especially important at intersections. Moreover, the inefficient motion of autonomous vehicles in intersections, e.g., a high number of stop-and-go maneuvers, can lead to increased energy consumption.
Thus, during the design of the autonomous vehicle's motion in intersections, a balance between time minimization and energy consumption minimization must be found. Although this trade-off also arises in the general path planning problem, it is a greater problem for intersections. Fourth, a special problem of vehicle control in intersections is the architecture of the control system. Since in intersections a high number of vehicles may have to be controlled and the presence of human-driven vehicles must be handled, an optimal solution of the motion problem can be found through the coordination of the autonomous vehicles in a centralized control system [9]. Nevertheless, centralized control can lead to additional security, legal and privacy problems. Therefore, it is desirable to develop distributed control structures; e.g., in [10] the problem is formulated as distributed optimal control for a system of multiple autonomous vehicles and then solved using a nonlinear model predictive control (MPC) technique. Similarly, Kloock et al. [11] investigated intersection control of multiple vehicles using a distributed MPC approach, where priorities need to be assigned to the vehicles. Thus, the challenge of distributed control also differs from the general path planning and tracking problems, where the coordination of the vehicles is limited, or where it is carried out at the level of the transportation system.
The high number of challenges in the field of autonomous vehicle control in intersections has motivated increased research activity, i.e., several publications with different approaches. One of the most important groups of methods is based on the MPC technique; see, e.g., [12][13][14][15]. Although these methods can provide appropriate results, an increase in the number of vehicles can make real-time computation difficult. A possible solution to the problem of increasing computational effort is the approximation of the optimal solution with neural networks; see, e.g., [16]. However, in this case it may be difficult to provide guarantees on collision avoidance for all intersection scenarios [17]. The quadratic programming method offers the possibility of real-time implementation compared to convex optimization using space coordinates; see [18]. Collision avoidance is guaranteed through the definition of regions in the intersection with special rules, which reduces the complexity of the problem but increases the conservativeness of the solution. Due to the ordering problem of vehicles in an intersection, the control problem can be formed as a mixed-integer linear programming (MILP) task; see, e.g., a method for coordination in [19]. The goal of the control is to find an arrival schedule of the vehicles which ensures safety while reducing the number of stops and intersection delays. Muller et al. [20] present a centralized MILP-based approach for intersection control of highly automated vehicles in an urban environment, with which the minimum vehicle delay at the intersection can be achieved. Another solution in the predictive control framework is to examine all ordering combinations of the vehicles [21], which can also be difficult for a high number of vehicles. Thus, although MPC-based methods can provide effective solutions to the control problem of autonomous vehicles in intersections, a high number of vehicles can lead to numerical difficulties.
A further group of solutions is based on minimax formulations of the optimal control problem, together with the coordination of the vehicles. For example, Tian et al. [22] proposed a game-theoretic approach to modeling vehicle interactions, in particular for urban traffic environments with unsignalized intersections. In general, these are traffic models with heterogeneous control, and they are used for virtual testing, evaluation and calibration of autonomous vehicle control systems. A framework for intersection control based on queuing theory was presented in [23]. Through this framework a capacity-optimal, slot-based intersection management system for the two-road crossing configuration was presented. A slot preassigning method for connected vehicles was also proposed in [24]. The connection of intersections can also be handled from a multi-agent viewpoint [25,26]. This solution provides the ability to quickly reverse individual lanes in response to rapidly changing traffic conditions. A further multi-agent solution based on a heuristic optimization algorithm was presented in [27]. The objective of that research was to reduce the total time delay for the entire intersection, while collisions were prevented. Approaches in the field of trajectory planning are possible ways to formulate the problem of autonomous vehicle motion in intersections. Current methods and future challenges in path planning are summarized in [28]. Some similarities between path planning and the problem of autonomous vehicles in intersections have motivated the use of Bézier-based methods. For example, in [29] a Bézier curve optimization method was proposed to cope with motion constraints, considering autonomous vehicles equipped with all the necessary sensors for obstacle detection. In this way, the obstacle avoidance problem was transformed into an optimization problem under equality constraints.
Another method using Bézier curves is found in [30], which presents a novel method, named Hybrid Planning, to resolve the obstacle avoidance and overtaking problems. An MPC-based solution to track the lane centerline while avoiding collisions with obstacles is found in [31]. Further relevant works on the topic of path planning are as follows. Hu et al. [32] presented a real-time dynamic path planning method, whose result is an optimal path with appropriate acceleration and velocity profiles. A graph-based planning method to generate an actionable set of multiple drivable trajectories for race vehicles is found in [33]. For road vehicles, Werling et al. [34] proposed an optimal trajectory generation method, with which velocity maintenance, merging, following, stopping and a reactive collision avoidance functionality were achieved. In [35] a real-time safe stop trajectory planning algorithm based on selection from a precomputed set of trajectories was developed. Several papers deal with the selection of the velocity of autonomous vehicles, which has relevance to the problem of intersection scenarios. For example, Herrmann et al. [36] proposed an optimization-based velocity planner using multi-parametric sequential quadratic programming, with which a spatially and temporally varying friction coefficient and a race-focused energy strategy for the road can be handled. In [37] a solution to the mathematical problem behind trajectory design problems was provided; i.e., a general-purpose solver for convex quadratic programs based on the alternating direction method of multipliers was proposed. Moreover, in [38] a real-time velocity optimization algorithm for fixed paths was developed and implemented.
Another important group of efficient solutions to the problem contains the learning-based approaches, especially reinforcement learning (RL) [39][40][41][42]. The advantage of this method is its model-free property, which can provide solutions to the problem of constraint formulation. Moreover, the training of the control agent is carried out through a high number of epochs, which can lead to improved performance. A deep RL method was applied to unsignalized intersections in [39]. It also provided an analysis exploring the ability of the system to learn active sensing behaviors and safe navigation in the case of occlusions. Tran and Bae [43] also presented a deep RL-based model that considers the effectiveness of leading autonomous vehicles in mixed-autonomy traffic at an unsignalized intersection. Yudin et al. [44] offered a new approach to training the intelligent agent that simulates the behavior of an unmanned vehicle, based on the integration of reinforcement learning and computer vision. Using full visual information about the road intersection obtained from aerial photographs, they automatically detected the relative positions of all road agents with various deep neural network architectures. In spite of these promising achievements, the resulting neural-network-based agents cannot provide a guarantee of collision avoidance for the vehicles.
This short review of the existing literature shows that several efficient solutions exist, but there are remaining challenges in achieving a reliable control method. The goal of recent research activities has been to find a control design method with which many autonomous vehicles can be handled, while the energy consumption and time minimization requirements, along with safety, i.e., the avoidance of collisions, are ensured. Accordingly, the purpose of this paper is to provide a novel solution to these problems through the following concept. The increase in the number of vehicles is handled through separation; i.e., the control is designed for individual autonomous vehicles, considering limited information on the environment. In the solution the performance requirements are prioritized, and the specifications are guaranteed through different methods, such as optimal control and RL-based techniques. In this way the advantages of each method, such as providing guarantees and optimizing the performance level, can be achieved.
This paper focuses on the optimal design of the longitudinal control of an autonomous vehicle. Its elements and the related problems regarding the paper can be summarized as follows.
• The objective may contain multiple criteria, the most important of which are the minimization of traveling time, energy consumption and the maximization of comfort. The variables of the optimization task are the control inputs of the vehicles, e.g., longitudinal acceleration commands or traction/braking forces.
• The dynamics of each vehicle provide constraints in the optimization problem.
• Furthermore, the safe motion of the vehicles, i.e., collision avoidance, must be guaranteed, which leads to constraints for vehicles with intersecting routes.
• The velocity of the vehicles must be kept in a predefined range, considering the speed limit.
• The control input of the vehicles must also be limited due to the physical limits of the driveline, the braking system and the tire-road contact.
In this paper the optimization problem of the vehicle's motion is divided according to the performance aspects. First, a quadratic optimization problem is proposed, whose role is to guarantee collision avoidance and the limitation of the velocities of the vehicles. The constraint of collision avoidance is formed through a linear approximation of the quadratic constraint, which leads to high computational efficiency. The quadratic optimization is solved at every time step during the motion of the vehicles. Second, the advantages of reinforcement learning in the improvement of the economy performance, e.g., the minimization of the control input, are exploited. In the training process the previously formed quadratic optimization task is applied as a part of the environment for learning. Similarly, during the operation of the control system, the trained neural network and the quadratic optimization task operate together. The effectiveness of the resulting control system is analyzed through simulation examples with various numbers of vehicles.
Thus, the contributions of the paper are as follows. The control design problem for autonomous vehicles in intersections is formulated in a novel way through the division of the problem, which reduces the complexity of the control problem for real-time computation. Moreover, an environment model for reinforcement learning is created, with which collision avoidance can be guaranteed. An important benefit of the method is that the safety performance level of the resulting control structure is guaranteed; i.e., a collision of vehicles cannot be caused by the learning-based control agent. From an application point of view, an advantage of the method is that it is able to handle various situations at the intersection; i.e., the method is independent of the number of vehicles. Although this paper focuses on autonomous vehicle control, the proposed control design method may be used in further applications which have similarities to intersections. For example, in the field of microfluidics research, a current problem is to find modeling and control methods for the handling of T-junctions [45]. In [46] computational models for microfluidic bubble logic systems with T-junctions were proposed. Learning methods have also been used for control purposes in this context; see, e.g., [47]. Thus, the contributions of the proposed control system may have relevance to further applications.
The paper is organized as follows. In Section 2 the model formulation for handling intersection scenarios is proposed. The cruise control design for the autonomous vehicle with robustness issues is found in Section 3. The application of reinforcement learning for the improvement of economy performances is presented in Section 4. In Section 5 the effectiveness of the proposed method is shown through simulation examples. Finally, the conclusions and the future challenges are provided in Section 6.

Model Formulation for Handling Intersection Scenarios
The goal of this section is to formulate the collision-free motion of the vehicles in intersection scenarios. The formulation includes the longitudinal motion model of the vehicles and the constraints on their motion to avoid collisions.
The longitudinal motion of the vehicles is formulated through the simplified kinematics of the vehicles:

v_i(k + 1) = v_i(k) + T a_i(k),  (1)
s_i(k + 1) = s_i(k) + T v_i(k) + (T²/2) a_i(k),  (2)

where the index i represents the id of the vehicle, n is the number of vehicles, v_i is the longitudinal velocity and s_i is the longitudinal displacement. a_i represents the longitudinal acceleration of the vehicle, which is handled as a control input command, and T is the time step of the discrete motion model. The longitudinal displacement is related to the center point of the intersection, and thus, it is defined as s_i = 0 for all i at the center point. The longitudinal displacement of an approaching vehicle has a negative value, and the displacement of a vehicle moving away has a positive value. In the environment model of the intersection, the autonomous ego vehicle is numbered 1. Thus, for example, in the case of n = 5, the environment incorporates the ego vehicle and four surrounding vehicles. The control input of the autonomous vehicle is separated into two elements, as introduced in Section 1:

a_1(k) = a_K(k) + ∆(k),  (3)

where a_K(k) is the control input command of the robust longitudinal controller and ∆(k) is the additional input from the supervisor in the model. The purpose of the supervisor in the collision-free motion model is to select ∆(k) for the autonomous ego vehicle with the following objectives and constraints.

• The objective of the selection is to minimize the difference between the current control input a_1(k) and the output of the learning-based controller a_L(k). The aim of the minimization is to preserve the performance level of the learning-based controller, if possible.
• Some of the vehicles may have routes that intersect the route of the ego vehicle at the intersection. Thus, the intervention through a_1(k) must provide motion for the autonomous vehicle with which a safe distance s_safe between the autonomous vehicle and the surrounding vehicles with intersecting routes can be guaranteed. Since the number of vehicles n can be high, only a limited number of surrounding vehicles can be simultaneously considered; i.e., n_s represents the number of vehicles which are incorporated into the design process. This is a constraint in the selection of ∆.
• The velocity of the vehicles must be inside a predefined range. The upper bound of the range is defined by the speed limit v_max or by the curvature of the intersection. The lower bound is represented by the stopping of the vehicle. Thus, it is necessary to select ∆ to keep the velocity inside the range. This is also a constraint in the selection process.
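The discrete motion model used in the constraints above can be sketched as a single update step; the constant-acceleration (zero-order-hold) discretization, the time step value and the variable names are our assumptions for illustration.

```python
T = 0.05  # time step of the discrete motion model [s]

def step(s, v, a, T=T):
    """One discrete-time update of displacement s and velocity v under
    the commanded acceleration a (held constant over the step)."""
    v_next = v + T * a
    s_next = s + T * v + 0.5 * T * T * a
    return s_next, v_next

# Example: a vehicle 20 m before the intersection center point (s = -20 m),
# moving at 10 m/s toward it and accelerating at 1 m/s^2.
s, v = step(-20.0, 10.0, 1.0)
```

Note that with this discretization the displacement at k + 1 depends on the commanded acceleration at k, which is what allows the collision constraints below to be expressed in terms of ∆.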
The selection process of ∆ for the autonomous ego vehicle is formed as the following optimization problem:

min_{∆(k) ∈ D} (a_1(k) − a_L(k))²,  (4a)
subject to
s_1²(k + 1) + s_j²(k + 1) ≥ s_safe², for all j,  (4b)
0 ≤ v_1(k + 1) ≤ v_max,  (4c)
a_min ≤ a_1(k) ≤ a_max,  (4d)

where j represents the surrounding vehicles whose motion can be in conflict with the autonomous vehicle, i.e., whose routes intersect its route, and D represents the domain of the optimization variable. In the optimization problem the kinematics of the vehicle motion (1) is considered through the formulation of the constraints, and the separation of the control input (3) is involved in the objective function.
The objective (4a) is transformed using (3) as

min_∆ ∆²(k) + 2 f ∆(k) + f²,  (5)

where f = a_K(k) − a_L(k). Since the constant term f² does not depend on ∆, it is omitted in the rest of the optimization problem. The constraint for collision avoidance (4b) is formed to keep the distance s_safe between the vehicles. The distance is measured in the sense of the longitudinal displacements of the vehicles along their routes. Geometrically, the quadratic constraints (4b) represent that the trajectories of the autonomous vehicle s_1 and the related surrounding vehicles must be out of a circle, whose radius is defined by s_safe; see Figure 1a. For example, if there are two surrounding vehicles and the autonomous vehicle, the trajectories must be out of a sphere, whose radius is represented by s_safe; see Figure 1b. Although the circle perfectly describes the region of the state-space that must be avoided, it leads to a quadratic constraint in the optimization problem (4). Since the optimization problem must be solved at each time step k during the motion of the vehicles, the quadratic constraint makes the solution of the optimization more complex, and it can make a real-time solution more difficult. This motivates the transformation of the quadratic constraints into linear constraints. In this paper it is performed by approximating the avoidable circular regions with avoidable half-planes.
The method of the approximation is illustrated in Figure 2. First, the tangent lines to the circle from the actual state (s_1(k), s_j(k))ᵀ are assigned. The avoidable region is determined by the area between the tangent lines; i.e., the trajectories must stay out of it. Second, two linear inequality constraints are specified, which represent that the trajectory of the state must be out of the avoidable region:

s_T1,1(k) s_1(k + 1) + s_T1,j(k) s_j(k + 1) ≥ s_safe²,  (6a)
or
s_T2,1(k) s_1(k + 1) + s_T2,j(k) s_j(k + 1) ≥ s_safe²,  (6b)

where (s_T1,1(k), s_T1,j(k))ᵀ and (s_T2,1(k), s_T2,j(k))ᵀ are the tangent points on the circle at time step k. Note that the linear constraints result in an outer approximation of the avoidable set.
There are regions of the state-space which cannot be reached due to the linear constraints even though they are out of the circle. The longitudinal displacement of the autonomous vehicle at k + 1 in (6) is transformed to express the linear constraints in terms of ∆. The transformation is based on the motion equation (1) and the relation of the actuation separation (3). Moreover, the motions of the surrounding vehicles for k + 1 are predicted with their actual velocity v_j(k) as

s_j(k + 1) = s_j(k) + T v_j(k),  (7)

which can be substituted into (6). It leads to the linear constraints

s_T1,1(k) (s_1(k) + T v_1(k) + (T²/2)(a_K(k) + ∆(k))) + s_T1,j(k) (s_j(k) + T v_j(k)) ≥ s_safe²,  (8a)
or
s_T2,1(k) (s_1(k) + T v_1(k) + (T²/2)(a_K(k) + ∆(k))) + s_T2,j(k) (s_j(k) + T v_j(k)) ≥ s_safe².  (8b)

Figure 2 illustrates that the reachable set for the state (s_1(k + 1), s_j(k + 1))ᵀ is non-convex, which means that (8) formulates disjunctive inequalities. However, each constraint in (8) alone leads to a convex reachable set, which means that the optimization problem can be divided, as is proposed below. Another constraint in the optimization (4) is on the velocity of the autonomous vehicle; see (4c). In the case of this constraint, v_1(k + 1) is expressed in terms of ∆ using the motion equation (1) and the relation of the actuation division (3). The linear inequality constraint is formed as

0 ≤ v_1(k) + T (a_K(k) + ∆(k)) ≤ v_max,  (9)

which leads to

−v_1(k)/T − a_K(k) ≤ ∆(k) ≤ (v_max − v_1(k))/T − a_K(k),  (10)

where v_max is the maximum velocity of the autonomous vehicle. In an urban environment it is determined by the velocity regulations, and in the case of cornering maneuvers at intersections by the curvature, in order to avoid skidding. The last constraint in the optimization problem (4) is the limitation of the resulting optimization variable; see (4d). The value of ∆ is limited by the bounds of a_1(k), namely a_min and a_max, which represent full braking and full throttle. Since a_1(k) is also influenced by a_K(k), the constraints on ∆(k) are formed as

a_min ≤ a_K(k) + ∆(k) ≤ a_max,  (11)

which leads to the constraint

a_min − a_K(k) ≤ ∆(k) ≤ a_max − a_K(k).  (12)

The optimization task (4) using (5), (8), (10) and (12) is thus reformulated as

min_∆ ∆²(k) + 2 f ∆(k) subject to (8), (10) and (12).  (13)

The quadratic optimization in (13) contains disjunctive inequalities. This means that the optimization task can be reformulated as a mixed-integer optimization problem for its solution [48].
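The tangent-point construction and the disjunctive ("or") feasibility check described above can be sketched as follows; the helper names and numerical values are illustrative, and the circle is centered at the origin of the (s_1, s_j) plane as in the text.

```python
import math

def tangent_points(p, r):
    """Tangent points on the safety circle |x| = r (centered at the origin of
    the (s_1, s_j) plane) of the two tangent lines drawn from the external state p."""
    px, py = p
    d2 = px * px + py * py
    assert d2 > r * r, "the state must lie outside the safety circle"
    c = r * r / d2
    s = r * math.sqrt(d2 - r * r) / d2
    # rotate p by +/- the tangent angle and scale onto the circle
    t1 = (c * px - s * py, c * py + s * px)
    t2 = (c * px + s * py, c * py - s * px)
    return t1, t2

def admissible(x, p, r):
    """A candidate next state x is admissible if it lies beyond at least one
    tangent line, i.e., t . x >= r^2 holds for t1 or t2 (disjunctive
    constraints); this is the outer approximation of the circular set."""
    t1, t2 = tangent_points(p, r)
    return (t1[0] * x[0] + t1[1] * x[1] >= r * r) or (t2[0] * x[0] + t2[1] * x[1] >= r * r)
```

Each tangent line {x : t · x = r²} passes through the current state p (since t · p = r²) and touches the circle at t, so the wedge between the two lines contains the circle; staying beyond either line keeps the state outside it.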
In practice, the optimization problem (13) can be solved as a set of quadratic optimization problems which have only "and" conditions between the constraints. Thus, in the case of n_s surrounding vehicles, 2^{n_s} distinct constrained optimization problems can be formed. Since the objective function of each problem is the same, the global minimum can be found through a comparison of their solutions. Finally, the solution is yielded by the ∆(k) which leads to the minimum value of the objective function, considering all of the optimization tasks. The optimization problem is continuously solved during the motion of the autonomous vehicle.
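The enumeration of the 2^{n_s} "and"-condition subproblems can be sketched as follows. Since ∆ is scalar and each linear constraint on it is an interval bound, each subproblem reduces to clamping the unconstrained minimizer of the quadratic objective; the function names and the bounds are hypothetical.

```python
from itertools import product

def solve_branch(f, lo, hi, bounds):
    """Minimize (delta + f)^2 over [lo, hi] intersected with the per-vehicle
    intervals in `bounds`; returns (cost, delta), or None if the branch is empty."""
    for b_lo, b_hi in bounds:
        lo, hi = max(lo, b_lo), min(hi, b_hi)
    if lo > hi:
        return None  # infeasible combination of constraint sides
    delta = min(max(-f, lo), hi)  # clamp the unconstrained minimizer -f
    return (delta + f) ** 2, delta

def solve_disjunctive(f, lo, hi, branches_per_vehicle):
    """Pick one interval per vehicle from each disjunction (2^{n_s} branches in
    total) and keep the feasible branch with the smallest objective value."""
    best = None
    for choice in product(*branches_per_vehicle):
        res = solve_branch(f, lo, hi, list(choice))
        if res is not None and (best is None or res[0] < best[0]):
            best = res
    return best
```

In this sketch, the box [lo, hi] would collect the velocity and input bounds of (10) and (12), while each per-vehicle pair of intervals would come from the two sides of the disjunction in (8).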

The Design of Robust Cruise Control
An intervention in the longitudinal dynamics of the autonomous vehicle at an intersection has great importance for achieving the required motion profile for time step k + 1. The control input of the vehicle is composed of two elements, a_K(k) and ∆(k); see (3). The computation of ∆(k) has already been proposed in Section 2. In this section the design of the control intervention a_K(k) is proposed.
The most important requirements for the controller are formed as follows.
• The control system must guarantee safe longitudinal motion for the vehicle, even if a_L is degraded or in a fault scenario. The safe motion is guaranteed through a_K and ∆.
• The longitudinal controller must have robust characteristics, because a(k) = a_K(k) if ∆(k) = 0. Robustness must also be guaranteed in extreme vehicle dynamic scenarios: for example, if a_K(k) requests full throttle but it is overridden by ∆(k) to full braking, or, similarly, if a_K(k) requests full braking but it is overridden by ∆(k) to full throttle.
• The control signal a_K(k) must be computed to avoid the saturation of the control actuation a(k). Since a_K(k) is computed prior to ∆(k), the result of (13) is not considered in the computation of a_K(k). Thus, the role of the cruise controller is to guarantee a_min ≤ a_K(k) ≤ a_max.
The design of the longitudinal controller is based on the simplified vehicle model (1) with one state, the longitudinal velocity v_1(k). The controller is formed as a gain P for velocity tracking, whose input is the velocity error; see Figure 3. The reference signal for the tracking is v_ref. The idea behind the formulation of the control design as a tracking problem is as follows. The vehicle must adapt to the environment, which determines the achievable velocity of the vehicle, e.g., through maximum speed limits. The aim of the cruise control is to consider the regulations and the motion of the other vehicles to achieve safe motion. The maximum speed limit information is provided by v_ref(k), and information on the motion of the vehicles in the environment is included in the computation of ∆(k). Thus, through the selection of v_ref(k) and ∆(k), safe motion is guaranteed. The selection of P is determined by the robustness requirement on the controlled system. In the presented structure, ∆(k) is handled as a disturbance, and the performance of the system is the minimization of the tracking error |v_ref(k) − v_1(k)|.
Thus, for robustness, it is necessary to limit the impact of the disturbance ∆(k) on the performance. The robustness criterion is based on the small-gain theorem [49]. As a consequence, the H∞ norm of the transfer function from ∆(k) to the velocity error must be smaller than 1, which guarantees disturbance attenuation.
In the computation of the transfer function, the maximum value ∆_max = max |∆(k)| is also considered. The maximum is determined by the extreme vehicle dynamic scenario in which ∆(k) implies full throttle instead of full braking, or full braking instead of full throttle. Thus, the value of ∆_max is |a_min| + |a_max|. Based on the closed loop formed by the velocity dynamics (1) and the proportional controller a_K(k) = P (v_ref(k) − v_1(k)), the transfer function from ∆(k) to the velocity error is

G(z) = −T / (z − 1 + T P).  (14)

The criterion for the selection of P is to guarantee ||G(z)||_∞ < 1.
The constraint on the control input a 1 (k) is influenced by the limitation of the reference signal. The reference velocity in Figure 3 represents the maximum speed limit v max (k) for the vehicle on the given road section. Nevertheless, it is necessary to provide a reference velocity for the vehicle, with which the constraint on a(k) is kept.
Since a_K(k) = P (v_ref(k) − v_1(k)), the constraint a_min ≤ a_K(k) ≤ a_max is kept if the reference signal is limited as

v_1(k) + a_min/P ≤ v_ref(k) ≤ v_1(k) + a_max/P.  (15)

This expression represents that the reference velocity can differ from the actual velocity only by a limited value, in which the value of P has a role. Finally, the method of the selection of P is as follows. It is recommended to select the value of P as high as possible, which results in fast operation of the tracking control. However, the value of P is constrained by the robustness criterion

||G(z)||_∞ < 1.  (16)

The output of the robust longitudinal controller is a_K(k), which is used in the computation of the control input (3) and in the computation of ∆(k) in (13).
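As a numerical check of the small-gain condition, the H∞ norm of the disturbance-to-error transfer function can be approximated by sampling the unit circle; the first-order transfer function below is our reconstruction from the kinematic model and the proportional control law, not a quoted formula.

```python
import cmath
import math

# For the tracking loop implied by the velocity dynamics and the proportional
# law a_K(k) = P*(v_ref - v_1(k)), the error dynamics give (our reconstruction)
#   G(z) = -T / (z - 1 + T*P),
# the transfer function from the disturbance Delta to the velocity error.
def hinf_norm(T, P, n=20000):
    """Approximate ||G||_inf by sampling |G(z)| on the unit circle."""
    best = 0.0
    for k in range(n):
        z = cmath.exp(2j * math.pi * k / n)
        best = max(best, abs(-T / (z - 1.0 + T * P)))
    return best

# With T = 0.05 s, the peak gain occurs at z = 1, where |G| = 1/P,
# so the small-gain condition ||G||_inf < 1 holds for P > 1.
```

Under these assumptions, the robustness criterion directly bounds the admissible gain from below, while the reference limitation bounds how aggressively a large gain may be exercised.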

Design of Motion Profile Using Reinforcement Learning
In the previous sections a control strategy was formed with which the safe motion of the autonomous vehicle in an intersection can be guaranteed. Nevertheless, in the design problem of the vehicle motion there are economy aspects that should be considered during the design process; i.e., it is recommended to minimize the control energy of the vehicles. In this section the consideration of the economy performance in the design process is proposed.
The goal of the design process is to find an agent which is able to generate the control signal a_L(k). The agent is trained through a reinforcement learning process. The structure of the design process is illustrated in Figure 4. The model for the learning process contains the optimization task (13) together with the result of the robust control design (16). The model guarantees the avoidance of collisions in the intersection for every a_L(k) signal. Thus, during the training process of the agent, the safety performance is guaranteed in every episode, and similarly, the economy performance is improved. The output of the motion model is the reward r(k), which is composed of a(k) and s_1(k) as follows:

r(k) = Q_1 a²(k) + Q_2 (s_1(k) − s_1(0)),  (17)

where the design parameters Q_1 and Q_2 scale the importance of each term in r(k). The reward contains the control input of the vehicle a(k), which represents the economy performance of the vehicle. If the reward contained only the a(k) term, it could result in unacceptably slow motion of the autonomous vehicle, because a(k) = 0 would be the best choice for the maximization of the reward. Thus, the traveled distance of the autonomous vehicle (s_1(k) − s_1(0)) is also incorporated in the reward, where s_1(0) is the initial position of the autonomous vehicle. The observation for the agent includes the position of the autonomous vehicle s_1(k) and of the n_s considered surrounding vehicles s_j(k), together with their velocities v_1(k) and v_j(k). The goal of the reinforcement learning process is to maximize the reward (17) during the episodes. In this paper the training process is carried out through the deep deterministic policy gradient (DDPG) method, which is a model-free, online, off-policy reinforcement learning method [50]. A DDPG agent is an actor-critic reinforcement learning agent that computes an optimal policy maximizing the long-term reward.
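A minimal sketch of such a reward, assuming a quadratic input penalty plus a progress term and using the weights reported in Section 5; the exact functional form is our assumption.

```python
Q1, Q2 = -0.1, 10.0  # weights reported in the simulation section

def reward(a, s1, s1_0):
    """Penalize control effort (economy) and reward the distance traveled
    since the start of the episode (progress)."""
    return Q1 * a * a + Q2 * (s1 - s1_0)

# An effort-only reward (Q2 = 0) would be maximized by a = 0, i.e., by not
# moving at all, which is why the progress term is needed.
```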
In this method, actor and critic approximators are used. Both approximators use the observations, which are represented by S. The purpose of the actor approximator µ(S) is to find the action A, i.e., a_L(k), which maximizes the long-term future reward. The role of the critic Q(S, A) is to estimate the expected value of the long-term future reward for the task.
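The interplay of the two approximators can be sketched as follows. The linear placeholder weights and the soft-update rate τ are hypothetical (DDPG conventionally maintains slowly-tracking target copies of both networks); only the roles of µ(S) and Q(S, A) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear approximators for illustration only.
W_actor  = rng.normal(size=(1, 8))        # mu(S): observation -> action
W_critic = rng.normal(size=(1, 8 + 1))    # Q(S, A): (observation, action) -> value

def mu(S, W=W_actor):
    """Actor: deterministic policy, action bounded to [-1, 1] by tanh."""
    return np.tanh(W @ S)

def Q(S, A, W=W_critic):
    """Critic: estimate of the long-term future reward of taking A in S."""
    return W @ np.concatenate([S, A])

def soft_update(W_target, W, tau=0.005):
    """Soft update of a target network, standard in DDPG; tau is assumed."""
    return (1.0 - tau) * W_target + tau * W

S = rng.normal(size=8)          # observation: positions and velocities
A = mu(S)                       # action proposed by the actor
value = Q(S, A)                 # critic's appraisal of the state-action pair
```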
The result of the training process is an agent whose output is a_L(k). In the control process of the autonomous vehicle, the agent works together with the control strategy (13). As a result, collision avoidance of the vehicles in the intersection is guaranteed, while the economy performance of the vehicles is improved.

Simulation Results
Four simulation examples are shown to illustrate the effectiveness of the designed controller. The first and second examples show simplified scenarios, in which three vehicles are at the intersection. The goal of the simulations in scenario 3 is to show that the proposed control system is able to adapt to various traffic environments, depending on the initial velocities and positions of the vehicles. The fourth example has increased complexity, in which seven vehicles are incorporated. In these simulations, the three closest vehicles are incorporated in the constraints of the optimization tasks.
The RL-based agent was trained through 500 simulation scenarios, in which the number of vehicles and their initial velocities and positions were selected randomly; i.e., the vehicle number varied between 1 and 7, the initial positions were selected between −40 m and −20 m and the initial velocities varied between 10 km/h and 50 km/h. In the learning process, the following neural network structures were trained. The actor network had eight neurons in the input layer; three fully connected layers with 48 neurons and Rectified Linear Unit (ReLU) activation functions in each layer; and one neuron with a hyperbolic tangent function in the output layer. The critic network has the same structure. For the structure and parameter selection of the neural networks, the k-fold cross-validation technique [51] and a hidden-layer number optimization process [52] can be used. The sampling time in each episode is selected as T = 0.05 s. The terms in the reward function are weighted with the design parameters Q_1 = −0.1 and Q_2 = 10. The training process was performed with the Matlab 2020a Reinforcement Learning Toolbox [53] on an Intel i7 CPU.
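The stated actor structure can be sketched as a forward pass. Only the layer sizes and activations come from the text; the random placeholder weights and the output scaling a_max = 3 m/s^2 (based on the acceleration values appearing in the simulations) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def layer(n_in, n_out):
    # Small random placeholder weights; in reality these are trained by DDPG.
    return rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out)

# 8 inputs -> three fully connected ReLU layers of 48 neurons -> 1 tanh output
sizes = [8, 48, 48, 48]
hidden = [layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
out_W, out_b = layer(48, 1)

def actor_forward(obs, a_max=3.0):
    """Map the 8-dimensional observation (positions and velocities of the
    ego and the n_s = 3 closest vehicles) to a bounded acceleration a_L(k)."""
    x = np.asarray(obs, dtype=float)
    for W, b in hidden:
        x = np.maximum(W @ x + b, 0.0)      # fully connected + ReLU
    return a_max * np.tanh(out_W @ x + out_b)[0]
```

The critic network repeats the same 48-neuron, three-layer structure, with the action appended to its input.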
For the simplified examples, the first and second scenarios are illustrated in Figure 5. In these cases, vehicle 1 is the ego vehicle; vehicle 2 and vehicle 3 are the surrounding vehicles on the perpendicular road section. Both surrounding vehicles cross the route of vehicle 1, which results in a conflict situation. In the first scenario the following initial conditions were set: s_1 = −40 m, s_2 = −36 m, s_3 = −41 m, and all of the vehicles had an initial velocity of 50 km/h. In the second scenario, vehicle 3 had different settings, i.e., s_3 = −48 m and v_3 = 30 km/h, while the other vehicles had the same initial conditions. The safe distance was set to s_safe = 8 m. The results of the simplified scenarios are found in Figure 6. Figure 6a,b shows that the autonomous ego vehicle is able to guarantee keeping s_safe. Nevertheless, in the two scenarios, the ordering of the vehicles in the intersection is different. Since in scenario 1 the surrounding vehicles have higher velocities, the autonomous vehicle decides to reduce its velocity to 18 km/h and allows the surrounding vehicles with the right of way to enter the intersection. In scenario 2 the autonomous vehicle has enough time to go through the intersection before vehicle 3 enters it. Thus, in this case v_1 can be increased after vehicle 2 leaves the intersection; see Figure 6d. The differences in the motion of vehicle 1 also influence the trajectories in Figure 6a,b. Moreover, the control input signals for each scenario are found in Figure 6e,f. In the case of scenario 2, the acceleration of the ego vehicle is increased to 3 m/s^2 at 2.6 s to avoid a collision with vehicle 2, while in scenario 1 the acceleration is increased after 3 s and by a smaller value. The next simulation examples show that the proposed algorithm provides safe motion for the autonomous vehicle in various scenarios.
The intersection scenario is illustrated in Figure 7, where vehicle 1 represents the autonomous vehicle and the other vehicles are the surrounding vehicles. In the next four simulations the initial positions and velocity values of the vehicles are varied, and thus, varying orderings in the intersection result. In all of the simulations, vehicle 1 has conflicts with the surrounding vehicles because their routes cross. The goal of the simulations is to show that the autonomous vehicle is able to adapt to various traffic scenarios, and that the proposed optimization (13) together with the reinforcement learning is efficient. The initial position and velocity values of the scenarios are listed in Table 1.

Figure 8 shows the results of each scenario in the third simulation setup, related to Figure 7. Figure 8a illustrates the trajectories of scenario 3a, where vehicle 1 could go through the intersection first. Thus, the velocity of vehicle 1 was set to the constant maximum of 50 km/h, with which the avoidance of the collision was guaranteed; see Figure 8b and the related control inputs in Figure 8c. In scenario 3b the autonomous vehicle was further from the intersection at the beginning of the simulation; see Table 1. Thus, it was not possible to go through the intersection first, and therefore, vehicle 1 was the third in line; see the trajectories in Figure 8d. v_1 was slightly reduced before 2 s to guarantee the safe distance. Figure 8f shows that u was close to u_L during the entire simulation, which means that the learning process and the optimization (13) are efficient. In scenario 3c, the vehicle had a reduced velocity relative to the surrounding vehicles, and thus, it was fourth at the intersection (see Figure 8g). While guaranteeing the safe distance, the velocity was reduced, and at 5 s the maximum velocity was achieved; see Figure 8h. Although u is close to u_L, it is necessary to modify it to avoid the collision, e.g., between 1.5 s and 3.8 s (Figure 8i). Figure 8j shows the trajectories of scenario 3d, where vehicle 1 is the last in the ordering. The reason is that its initial velocity is smaller than in scenario 3c; see Figure 8k. Since priority must be guaranteed for vehicle 5, vehicle 1 must accelerate with a reduced value between 2 and 6 s; see the velocity profile in Figure 8k and the acceleration command in Figure 8l. Nevertheless, u can be selected to be close to u_L, which provides an improved level of economy performance, while the safety performance is simultaneously guaranteed.
Finally, the operation of the designed autonomous vehicle control system in a complex urban simulation scenario is analyzed. In the example, seven vehicles took part, i.e., the ego vehicle and six surrounding vehicles. The illustration of the fourth scenario is found in Figure 9. In this scenario the autonomous ego vehicle is illustrated as vehicle 1. In the example, all of the other vehicles are in conflict with vehicle 1, which means that the signals of vehicles 2-7 are used during the design of the control input a. Some scenes of scenario 4 can be found in Figure 10. Moreover, the video of the entire scenario can be seen at https://youtu.be/VGqcoX-9YTY. Figure 10a shows that vehicle 2 and vehicle 3 had already left the intersection and vehicle 4 was in the intersection performing a left turn. Since vehicles 2-4 were closer to the intersection than vehicle 1, the ego vehicle had to reduce its velocity to give way; see Figure 11b. This was achieved by the actuation; see the acceleration command u in Figure 11c. Moreover, after vehicle 4 left the intersection, vehicle 5 arrived at 50 km/h; see Figure 10b. Due to the reduced velocity of vehicle 1, it was not possible to go through between vehicle 4 and vehicle 5, and thus, vehicle 1 stopped. Figure 10c shows that when vehicle 5 left the intersection, the velocity of vehicle 1 was increased to go through the intersection before vehicle 6 and vehicle 7 arrived (Figure 10d). Figure 11 gives an insight into the operation of the control system. Figure 11a illustrates the trajectories of the vehicles and the avoidable region. The illustration shows that all of the trajectories are outside of the avoidable region, and thus, the safety constraint on the motion of the autonomous vehicle can be guaranteed.
Figure 11b shows the distances between vehicle 1 and the closest n_s = 3 vehicles, where d_1 is the distance between vehicle 1 and the closest vehicle and d_3 is the distance between vehicle 1 and the third-closest vehicle. Note that the vehicles in n_s vary during the simulation; e.g., directly before t = 4 s, distance d_1 is related to s_1-s_4, d_2 to s_1-s_5 and d_3 to s_1-s_6, while directly after t = 4 s, d_1 is related to s_1-s_5, d_2 to s_1-s_6 and d_3 to s_1-s_7. If there are fewer than three close vehicles, virtual vehicles with constant 100 m positions are defined; see, e.g., d_3 after 5.5 s. Figure 11b shows that all of the distances were above s_safe = 8 m during the entire simulation, independently of the selection of the closest vehicle. This was yielded by the control input u, which is illustrated in Figure 11d. Figure 11d also shows the further inputs u_K and u_L. Since the role of u_K is to provide velocity tracking, its value was constantly 3 m/s^2, because v_1 < 50 km/h during the entire simulation, except at the initial instant. The role of u_L is to provide vehicle motion with which the reward function is maximized, i.e., energy consumption is minimized and the traveled distance of the vehicle is maximized. However, neither of the control inputs u_K and u_L is acceptable in itself, because the safety constraints would not be guaranteed. Thus, u_K is modified through ∆, which results in the control input signal u. The maximum deceleration of the vehicle before 5.5 s guaranteed the avoidance of a collision with vehicles 2-5, and the maximum acceleration between 5.5 s and 8 s guaranteed the avoidance of a collision with vehicles 6 and 7. After 8 s, the conflict between the vehicles ceased, and thus, vehicle 1 could move with u = u_L to reduce the objective in (13) to zero. As a result, the velocity of vehicle 1 was slightly increased after 8 s; see Figure 11c.
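The bookkeeping of the n_s = 3 closest vehicles with the 100 m virtual-vehicle padding can be sketched as follows; the function and variable names are illustrative, and distances are taken as absolute position differences along the routes.

```python
def closest_distances(s_ego, s_others, n_s=3, virtual_pos=100.0):
    """Distances d_1 <= d_2 <= ... <= d_{n_s} between the ego vehicle and
    the n_s closest surrounding vehicles. If fewer than n_s vehicles are
    present, virtual vehicles at a constant 100 m position fill the list."""
    positions = list(s_others) + [virtual_pos] * max(0, n_s - len(s_others))
    dists = sorted(abs(s_ego - s) for s in positions)
    return dists[:n_s]
```

Because the set is re-sorted at every step, the membership of the closest-vehicle set changes over time, e.g., from {s_4, s_5, s_6} to {s_5, s_6, s_7} around t = 4 s as vehicle 4 leaves the intersection.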
Moreover, Figure 11c shows that the difference between v_ref and v_1 is very small, which illustrates the effectiveness of the designed robust cruise control. The simulation examples have shown that the proposed control algorithm is able to guarantee the safe motion of the autonomous vehicle; i.e., collisions in the intersection are avoided. The contribution of the complex scenario is that the autonomous vehicle is able to handle the presence of an increased number of vehicles in the intersection.

Conclusions
In this paper a motion control design method for autonomous vehicles in intersection scenarios has been proposed. The effectiveness of the proposed method has been presented through various simulation examples. The scenarios show that collisions between the autonomous vehicle and the surrounding vehicles are avoided, and moreover, the autonomous vehicle can adapt to the current traffic situation. Consequently, the designed control strategy is able to guarantee the safe and economical motion of the autonomous vehicle.
A challenge for future work is to analyze traffic scenarios in which several autonomous vehicles with the same control strategy are incorporated. It is necessary to examine how such concurrent situations can be handled, e.g., with the extension of the reward function in the reinforcement learning process. This requires training the same agent for each vehicle, but in the learning process the interactions of the vehicles must be considered. In this way, the proposed method can be an alternative to distributed control solutions in the field of intersection scenarios. Moreover, a further challenge arises from the aspect of the analysis, namely the design of the velocity profile for the autonomous vehicle if it must pass through several consecutive intersections. This means that the training of the RL-based agent must be carried out on an extended traffic network. In this case, further global traffic-level performances can be incorporated in the reward function, e.g., the maximization of the traffic outflow on the network.