A Hybrid Deep Reinforcement Learning and Optimal Control Architecture for Autonomous Highway Driving

: Autonomous vehicles in highway driving scenarios are expected to become a reality in the next few years. Decision-making and motion planning algorithms, which allow autonomous vehicles to predict and tackle unpredictable road trafﬁc situations, play a crucial role. Indeed, ﬁnding the optimal driving decision in all the different driving scenarios is a challenging task due to the large and complex variability of highway trafﬁc scenarios. In this context, the aim of this work is to design an effective hybrid two-layer path planning architecture that, by exploiting the powerful tools offered by the emerging Deep Reinforcement Learning (DRL) in combination with model-based approaches, lets the autonomous vehicles properly behave in different highway trafﬁc conditions and, accordingly, to determine the lateral and longitudinal control commands. Speciﬁcally, the DRL-based high-level planner is responsible for training the vehicle to choose tactical behaviors according to the surrounding environment, while the low-level control converts these choices into the lateral and longitudinal vehicle control actions to be imposed through an optimization problem based on Nonlinear Model Predictive Control (NMPC) approach, thus enforcing continuous constraints. The effectiveness of the proposed hierarchical architecture is hence evaluated via an integrated vehicular platform that combines the MATLAB environment with the SUMO (Simulation of Urban MObility) trafﬁc simulator. The exhaustive simulation analysis, carried out on different non-trivial highway trafﬁc scenarios, conﬁrms the capability of the proposed strategy in driving the autonomous vehicles in different trafﬁc scenarios.


Introduction
Autonomous vehicles (AVs) are expected to transform our understanding of transportation systems. AVs are going to hit the road in the next few years, bringing a variety of positive social impacts such as increased safety for passengers and pedestrians, reduced traffic congestion on highways, and environmental pollution, just to name a few [1][2][3][4][5]. The main feature of AVs is their ability to be "smart", i.e., to make autonomous decisions. This is possible by leveraging information coming from exteroceptive and proprioceptive sensors, such as IMU, GNSS, cameras, and LiDAR, as well as the ones that are beyond line-of-sight and field-of-view, i.e., the information shared among cars and/or infrastructure through the V2X communication paradigm [6].
To perform this task, control architectures for AVs are hierarchical and composed of four main modules: perception, decision-making, planning, and control [7]. Within this technological paradigm, particular attention must be paid to the decision-making module since it performs the same functions as the human brain in adapting the behavior of the AV to the current traffic situation [8]. Since the daily driving experience shows a wide variety of traffic situations and possible interactions among users [9], the open challenge is to ensure that AVs could effectively take, in real-time, safe and efficient driving decisions. Therefore, given the highly uncertain and potentially dangerous dynamic traffic scenarios, • We propose a novel and innovative hybrid hierarchical decision-making and motion planning control architecture for autonomous vehicles driving in unknown and uncertain non-trivial highway scenarios, which combines the DRL theory for the decisionmaking layer with the optimal NMPC control theory for the motion-planning task; • The proposed hybrid architecture solves the problem of autonomous highway driving in a unified fashion by simplifying the overall decision process. Indeed, the decisionmaking task is demanded to a high-level layer, which only selects the proper route to follow, while the lower-layer computes an optimized and smooth driving profile compliant with vehicle dynamics. Therefore, different from other solutions dealing with both tasks via fully-AI methods, our approach requires fewer data for the training; • Differently from the hybrid architectures proposed for autonomous highway driving (see refs. [19,20]), our approach is able to explicitly take into account the practical and unavoidable AV state/control constraints; • Unlike the automotive technical literature, where vehicle behavior is modeled according to car-following scenarios and lane-change maneuvers via IDM, MOBIL, or their combination, here the motion planning layer is designed by exploiting NMPC theory, which enables the prediction of vehicle future behavior. This allows optimizing safety and comfort requirements, also taking input and state constraints into account; • The proposed hierarchical architecture is validated in a high-detailed co-simulation platform and a comparison analysis w.r.t. to well-known benchmark IDM solution is provided, in order to better disclose the benefits of the proposed solution.
Finally, the rest of the paper is organized as follows. Section 2 presents the related technical literature. The proposed hybrid DRL and optimal control-based hierarchical decision-making and motion planning architecture is designed in Section 3. Section 4 discloses the DRL agent training methodologies adopted for the design of the decisionmaking module as well as the training results. Virtual testing simulations are reported in Section 5, while the conclusions are drawn in Section 6.

Related Works
Several related works have addressed the problem of designing the decision-making and motion planning control architecture. Regarding motion planning-based methods, there are approaches inspired by the robot motion planning algorithms such as A [24] and Artificial Potential Field (APF [25]). However, these kinds of solutions usually do not fully take into account the controlled vehicle dynamics, thus generating planned trajectories that could be potentially unfeasible. To overcome this crucial issue, some alternative approaches have been recently investigated, ranging from the on-line Rapidly-exploring Random Trees [26] to the predictive occupancy map [27] to the use of Model Predictive Controllers (MPC) [28]. Risk assessment-based methods, instead, propose hierarchical solutions that adopt a two-step approach for guaranteeing safety, i.e., the risk of the current driving state is first evaluated and then a suitable control action is formulated according to this assessment [29]. Note that, in order to perform the first step, some deterministic safety metrics (such as Time To Collision (TTC), Time To Brake (TTB), or Time Headway (THW)) are commonly used, while only fewer works assess risk via probabilistic methods (interested readers can be referred to [16] for an exhaustive literature overview). The third class of decision-making solutions exploits machine learning theory to deduce the correct policy to follow based on the shared data. When leveraging supervised learning approaches, human experience is exploited to assess the driving policy (behavior cloning), but the difficulty of collecting enough labeled data from skilled drivers makes these solutions hard to use [30]. To overcome this problem, given its capacity to solve decision-making problems in an unsupervised fashion, the alternative Reinforcement Learning (RL) method has recently become one of the most promising solutions. In addition, RL methods can learn the optimal, or near-optimal, policy by directly interacting with the environment without the need to make restrictive assumptions about it, such as its partial knowledge [31].
More recently, following the major breakthroughs of Deep Learning (DL) [32], and by leveraging the capability of neural networks to serve as universal function approximators, the brand new concept of Deep Reinforcement Learning (DRL) has been raised [33,34]. DRL approaches can be mainly categorized into model-based, which require partial information about the system [35], and model-free, which do not make any assumptiosn about the environment [36]. Due to their advantages, researchers have been also started to apply the DRL theory to solve various autonomous driving tasks [16]. For example, in ref. [37] authors propose an RL-based strategy to train an agent to learn an automated lane change behavior in order to improve the collision avoidance task in unforeseen scenarios, while in ref. [38] a hierarchical architecture is exploited to learn a sequential collision-free decision strategy for AVs. Again, a hierarchical architecture is suggested in ref. [17], where the decision-maker implements a kernel-based least-squares policy iteration algorithm while the lower layer is designed via a dual heuristic programming algorithm to address the motion planning problem. Conversely, ref. [16] proposes a novel lane change decisionmaking framework based on DRL in order to let the AVs learn a risk-aware driving decision strategy with the minimum expected risk. Instead, ref. [20] presents a safe deep reinforcement learning system for automated highway driving where both rule-based and learning-based approaches are exploited for safety assurance.
However, any RL methodology that tries to solve both decision-making and motionplanning simultaneously may require a large amount of training data [19]. In addition, the large amount of data required by RL strategies is mainly used to maximize a longterm reward, defined as a function of different aspects, e.g., comfort, efficiency, and even safety. Thus, it implies that only sub-optimal performance can be achieved for the safety aspect, and consequently, they are not comparable with the ones achievable via optimal control methods [39]. To address this issue, some hybrid control approaches combining RL methods and model-based strategies have been proposed. Along this line, ref. [19] proposes a hybrid RL and PID control for solving the problem of autonomous highway driving, where RL is used for the decision-making task and the PID ensures the actuation of the selected decision. In order to improve the performance and safety of these hybrid architectures, MPC approaches have been proposed in combination with RL [40]. For instance, ref. [39] suggests this solution to allow an AV to autonomously merge a highway from an on-ramp. However, in this framework, to the best of our knowledge, there are no works combining RL with NMPC and applying this solution to solve the problem of autonomous highway driving.
Finally, focusing on the design of the model-based motion planning layer, it is worth noting that the technical literature widely exploits the IDM and MOBIL benchmark models, along with their modifications [22]. However, these classical approaches are characterized by some drawbacks and limitations. Specifically, the IDM can be used to reproduce the carfollowing behavior easily, but it does not take into account the mechanical characteristics of the AV [41], while the MOBIL model, relying on a threshold definition for safety decision criteria, is not able to define a single threshold to handle various traffic conditions [42]. Conversely, a combined architecture composed of IDM and MOBIL only relies on the observations at the current time instant, hence neglecting the possible predictions of future traffic conditions [43]. Differently, the proposed NMPC-based motion planning layer is able to optimize safety and comfort requirements while taking into account state and control constraints.

Hybrid Hierarchical Decision-Making and Motion Planning Control Architecture for Highway Autonomous Driving
Consider an AV, named Ego, which, whil travelling in an unknown and uncertain threelane highway environment, has to make smart decisions about its motion by leveraging exteroceptive and proprioceptive sensors, as well as from the information shared with other road entities (such as other cars and/or infrastructure) via V2X communication. Due to the variety and complexity characterizing this traffic scenario, the problem addressed in this work, as depicted in Figure 1, is how to provide the Ego vehicle with the abilities to make proper driving decisions (e.g., lane keeping or lane changing) when traveling in the presence of different traffic flow conditions, while improving efficiency and preserving safety. To solve this problem, we propose the hybrid hierarchical architecture depicted in Figure 2, where, at the higher layer, a double deep Q-learning approach is exploited for solving the real-time decision-making problem, while the lower-layer aims at finding an optimal trajectory via the solution of an Optimal Control Problem (OCP), which also takes into account state and input constraints. Details about the design of both layers are provided in the following sections.

Decision-Making Layer
The decision-making problem is here solved via Deep Reinforcement Learning theory. In our case study, the agent and the environment are represented by the Ego and the surrounding vehicles, respectively. The problem is modelled as a Markov Decision Process (MDP), i.e., the tuple <S, A, T, R, γ>, where S is the set of states, A is the set of actions, T : S × A → S is the state transition function, R : S × A × S → R is the reward function, and γ ∈ [0, 1] is the discount factor. The objective of the training process is to find a policy that maximizes the following discounted cumulative reward: where r t+k is the reward at the time instant t + k and γ is the discount factor. To this aim, the action value function, or Q-function, representing the relation between state-action (s t , a t ) and reward R t , is defined as: being π the control action policy. The function (2) can be rewritten in the following recursive form: It follows that the optimal control action with respect to the policy π is the one maximizing the Q-function (2), i.e.,: The solution of the above maximization problem involves the use of adaptive neural estimators (approximating the action-value function Q(s, a)) whose weights θ are updated at each training iteration according to the following loss function [44]: where, according to the Double Deep Q-Learning (DDQN) approach, In the appraised case study, the state vector is selected as s(t) ∈ R 19 . Its components are the Ego vehicle speed v(t) and all the data gathered by the surrounding vehicles. These latter are the longitudinal/lateral distances and the relative longitudinal velocities of both the vehicles ahead and the ones on the rear that occupy the three closest lanes, i.e., the three vehicles in front of the Ego vehicle and the three on the rear, hence yielding a total of 18 measurements. If a lane is free, dummy values, corresponding to a vehicle at the limit of the sensor detection range, are used.
The action vector is selected as a(t) ∈ R 5 . Specifically, the Ego vehicle can choose among the following possible five actions: (1) performing the lane changing to the left; (2) performing the lane keeping; (3) performing lane changing to the right; (4) performing an acceleration maneuver in order to decrease the distance w.r.t. the vehicle ahead; (5) braking to the increase the distance w.r.t. the vehicle ahead. A summary of the state and action vectors can be found in Table 1 where j = 1, 2, . . . , 6.
Regarding the choice of a suitable reward function, here the idea is to ensure the proper tracking of the reference behavior by the Ego vehicle while encouraging speed gain and discouraging potentially dangerous and aggressive behaviors. To this end, the following reward function has been designed: where /v is a velocity-dependent positive reward driving the agent to follow the desired speed v des ; r lc is a constant term that penalizes useless lane changes maneuvers and, in this case, is set to 1; r ttc is a constant term that penalizes the reward when the Time-To-Collision (TTC) with respect to vehicles ahead goes below the safety threshold of 2[s], i.e., r ttc = 5; r coll penalizes possible collisions with the vehicles in the surrounding. Note that the last two terms in Equation (7) are responsible for steering the agent towards safe behavior. On the one hand, when a collision is detected, the r coll = 10 is assigned, and, hence, the episode is terminated. On the other hand, inspired by human-like behavior, r ttc aims to improve collision avoidance performance by penalizing the reward when a possible dangerous situation occurs. The output of the decision-making layer is the best suitable action to be performed by the AV according to the actual traffic conditions. The chosen action a(t) is, then, fed as input to the motion planning layer and converted into the reference behavior for the NMPC, as detailed in the next section. Changing Lane to the right a 4 Acceleration manoeuvre a 5 Braking manoeuvre

Motion Planning Layer
Based on the driving decisions, which are on-line computed by the DRL agent and dynamically adapted to the current traffic scenario, the motion planning layer finds the optimal longitudinal and lateral trajectories to be imposed on the AV. More specifically, with the aim of reducing the complexity of the optimization problems and decreasing the total execution time, the proper AV motion is derived via the design of two NMPC, i.e., one for the longitudinal dynamics and one for the lateral dynamics. To this end, we first introduce the Ego vehicle's lateral and longitudinal dynamics.
The Ego lateral dynamics are described by the kinematic bicycle model in a rear-axle centered frame, recast in the curvilinear Frenet frame [45], i.e., where s(t) m is the arc length on the curvilinear reference path; e y (t) m and e ψ (t) [rad] are the lateral and angular errors with respect to the reference path, respectively; v(t) m/s is the vehicle speed in the C.o.G. reference frame; δ(t) rad is the steering angle while the input u 1 (t) rad/s is the steering rate command; the parameters l f m and l r m are the distances between the center of gravity and the front/rear axles, respectively; ρ(s) 1/m is the curvature of the reference path as a function of the arc length. It is worth noting that, although a more accurate dynamical model could also be adopted, here we chose a kinematic model due to its well-known performance when mimicking driving conditions far from handling limits, in addition to being more suitable for embedded applications due to its lower computational complexity [46]. The Ego vehicle longitudinal dynamics are, instead, described by the well-known car-following model [45], under the assumption that the leading vehicle acceleration is zero, i.e., where d(t) [m] and ∆v(t) = v lead (t) − v(t) [m/s] are the distance and the relative longitudinal velocity with respect to the leading vehicle, respectively; a(t) m/s 2 is the Ego vehicle acceleration; τ s is the power-train time constant; and u 2 (t) m/s 2 is the control input providing the desired acceleration to be imposed. Note that, in practice, the Ego vehicle position, speed, and acceleration are always estimated based on the combination of GNSS information with the one provided by the IMU [47] by using, for instance, the Kalman filter [48], the parallel adaptive Kalman filter [49], or vision-based techniques [50]. By defining the following state vectors The reference angular error is, instead, always null since, at steady state, the vehicle must be always aligned to the lane. Thus, the goal of the lateral motion planner is to regulate the vector e 1 (t) to zero-thus yielding e y (t) → e y re f and e ψ (t) → 0-while minimizing, at the same time, the steering angle variations, i.e., the control input u 1 (t) in (17), as well as increasing the driving comfort. To solve this problem, the lateral NMPC-based path planner is designed as the solution, at each sample time T s , of the following Optimal Control Problems (OCP): where T indicates the prediction horizon; Q 1 = diag(q 1 , q 2 ); and r 1 are the tracking and control effort weights, respectively; the subscripts max and min represent the maximum and minimum admissible state/input values for the related variable. Moreover, given the longitudinal dynamics as in (17), we define the vector and v re f are the reference spacing policy and the reference Ego vehicle speed, respectively. The desired distance is set as d re f (t) = d 0 + T H v(t), where d 0 m is the required safe distance at a standstill and T H s is the headway time. This last parameter is on-line computed and dynamically adapted by the DRL agent. In particular, T H is augmented or reduced by step increments ∆T H , if the distance to the leading vehicle must be increased or decreased, respectively. In any case, the heading time cannot go below the minimum value of 0.1 s, which guarantees that the leading distance reference is always positive and that longitudinal collisions are avoided. The reference velocity is set according to the highway environment, i.e., v re f = 33 m/s. Hence, the aim of the longitudinal planner is to achieve the convergence of the error vector e 2 towards zero, thus yielding d(t) → d re f (t), ∆v(t) → 0 and v(t) → v re f -while minimizing at the same time the control effort u 2 (t), i.e. minimizing the vehicle jerk, as well as increasing the driving comfort. To this end, the longitudinal path planner is designed as the solution of the following OCP: where Q 2 = diag(q 3 , q 4 , q 5 ) and r 2 are the tracking and control effort weights, respectively. In the special case where a leading vehicle does not exist, the Ego vehicle control speed objective is taken into account by setting to zero the elements q 3 and q 4 . Finally, it is worthy to remark that the simultaneous accomplishment of the control objectives ∆v(t) → 0 and (v(t) − v re f ) → 0 is in general not always possible. Nonetheless, the NMPC is designed for dealing with this trade-off and to converge toward the best suitable solution according to the choices made for the different weights, penalizing the deviations of the actual measurement from their desired values within the performance index as in (24). In both the OCP, the control horizon T c , i.e., the time interval in which control command can be chosen, is set as less or equal to the prediction horizon T in order to achieve an acceptable trade-off between control performances and solving time [51]. Note that, according to common practice in the technical literature, the functional cost weights are chosen through a trial-and-error procedure. All the weighting factors, along with all the other NMPC parameters and constraints, are summarized in Table 2.

Remark 1.
Note that, the proposed hybrid architecture is designed under the assumption that perception results are accurate and no adversarial attacks are present on the V2X communication networks. Indeed, in practice, the performance of the proposed solution heavily relies on: (i) the accuracy of the perception results, which often suffer from partial perception problems, i.e., potential dangers obscured by obstacles may be ignored; (ii) the deployment of robust measures to counter adversarial attacks. However, how to combine these issues into the design of control architecture for intelligent vehicles, still an open challenge in the field [52], is beyond the scope of this work.

Scenario Description and DQN Training Phase
For the training of the DRL agent at the decision-making layer, we exploit a proper co-simulation platform which integrates in a unified framework MATLAB/Simulink and SUMO platforms. According to Figure 2, MATLAB/Simulink is responsible for the main training loop and the control of the Ego vehicle motion, while SUMO emulates the highway driving environment, manages the setting up of the road structure, and is responsible for the control of the other vehicles movements within a traffic flow.
As a road scenario, we consider a straight, endless, three-lane highway with three traffic flows, each of them differ in terms of desired velocity and assertiveness. The vehicular traffic flows are randomly generated by SUMO at each sample time with a at regrandom starting speed, random position, and starting lane. Each surrounding vehicle in the traffic scenarios is driven by the Intelligent Driver Model (IDM), which regulates the longitudinal movements, and by the Minimizing Overall Braking Induced by Lane Changes (MOBIL) for the lane change initiations. Additionally, the traffic vehicles can choose randomly to perform lane changes, both on the left and right side. For them, the system parameters, such as the maximum velocity, are randomly chosen. This is to ensure a diverse traffic scenario in the training and evaluation phase [19].
The training of the DDQN agent is summarized in Algorithms 1 and 2 as pseudo-code. At each simulation step, an action is chosen and translated into references for the NMPCs. The input signals resulting from the optimization problem solutions are then fed to the vehicle dynamics to obtain a new vehicle position to be evaluated in the highway network. The new state observations and the reward are obtained from SUMO. This procedure is repeated iteratively at each time step. The DRL agent has been trained for N episode = 60,000 and each episode is terminated after N steps = 500, or earlier if a collision occurs. At each sample time, the exploration constant is decreased as an exponential function as: where decay = 2.3026 × 10 −6 is the decay rate. Note thats set as the minimum exploration value is set as end = 0.1. The DRL agent hyper-parameters, chosen according to the state-of-art [53], are reported along with all the training parameters in Table 3.
Regarding the optimization procedures, due to its ability to generate highly efficient Sequence Quadratic Programming (SQP) code, we chose FORCES Pro [54] as a solver. Moreover, it can be exploited in real-time applications and presents a user-friendly MATLAB interface. Equations (8) and (13) are discretized and integrated with an explicit fourth-order Runge-Kutta algorithm. The sampling time is set as T s = 0.2 s, and both the control horizon and prediction horizon are set to T c = T = 4.0 s, for a total of twenty integration and control steps.

Algorithm 1: DDQN main training loop algorithm.
Initialize the Replay Buffer M replay ; Randomly Initialize primary and target network weights θ 0 andθ 0 ; for episode = 1 → N episode do s 0 ← Reset Environment; for timestep = 1 → N steps do if random value ≤ then a t ← random action else a t = arg max a Q(s t , a, θ) s t+1 , r t ← Step Environment(a t ); store tuple (s t , a t , r t , s t+1 ); sample mini-batch M batch from replay memory; θ i+1 ← SGD on loss L(θ i ) from (5); update with -decay from (30); each N update copy weights θ i toθ i .

Algorithm 2: StepEnvironment(a t ).
if a t = a 1 then e y re f = e y re f + L w else if a t = a 3 then e y re f = e y re f − L w else if a t = a 4 then (18); Integrate vehicle dynamics from (8), (13); s t+1 ← evaluate a SUMO step; r t ← get reward from (7);

Training Results
In this study, we report the results of the training phase for the DDQN agent and the overall decision-making layer. During the training phase, the agent parameters are saved and stored every 300 episodes, while in the testing phase, each saved agent's performance is evaluated on 100 episodes of 500 maximum steps by setting the exploration factor to zero. The total normalized return per episode is collected, and the mean and the variance are used as performance indexes. Figure 3 shows the result in terms of mean (red line) and variance (shaded area) for each saved agent. It can be observed how, as the training phase proceeds, the overall performance of the DRL agent increases. Despite that, due to the high variability of the road scenarios, high variance can be observed in the resulting normalized return. Moreover, even though the first agent achieves a non-zero total reward because it does not perform any lane changes, the NMPC is still capable of keeping its lane and avoiding frontal collisions. According to the reward definition in (7), we point out that a maximum reward is achieved only if the speed v(t) is kept at the constant desired value v des , without ever changing from the starting lane. Note that, these goals can be achieved only if the starting lane is always empty, and since this scenario is devoid of interest, it is therefore excluded from the training and testing phase. The last agent exhibits better performance, achieving a collision rate of around 4.0%.
Finally, we highlight that the training process is performed via an Intel ® Core TM i7-11900K 3.2 GHz processor, 64 GB 4.400 MHz DDR5 of RAM, GPU NVIDIA ® GeForce RTX TM 3080, and a Windows 11 system. The time required for the training phase, with regards to the hyper-parameters in Table 3, is approximately 20 h. It is worth noting that the training phase represents the highest computational burden required by the decisionmaking layer, which is, however, offline executed. Conversely, the one required to make the driving decision is very low during the deployment phase, since only the forward propagation in deep NN is involved [55].

Virtual Testing Simulation
Leveraging the above-mentioned co-simulation platform, the proposed control decisionmaking and motion planning control architectures are evaluated in non-trivial highway scenarios. For illustrative purposes, we first consider the single-lane changing scenario shown in Figure 4, where the Ego vehicle has to follow the leading one until t = 100 s, when a lane-changing maneuver to the left free lane is required; afterward, the reference speed is set as v re f = 33 m/s. When modeling the leading vehicle in SUMO, in order to better grasp the dynamics of manually driven vehicles, we also consider the parameter called driver imperfection, which is modeled as additive noise on the acceleration command. In addition, to better show the advantages of the proposed solution, we also compare its performance with respect to the one obtained by one of the most popular state-of-the-art solutions. Specifically, the comparison analysis only involves the motion planning layer and not the decision-making one, since the performance achievable via deep learning methods is strictly correlated to the platform adopted for its design [56]. Accordingly, for the sake of fairness, we compare the NMPC with regards to the IDM solution (widely used as a benchmark for testing new control strategies [23]), both running based on the results provided by the DQN-based decision-making layer. Simulation results, related to Ego vehicle longitudinal dynamics until the time instant t = 100 s, are disclosed in Figure 5, where the relative distance and the relative speed of the Ego vehicle w.r.t. the leading one are depicted. Herein it can be observed how the ego vehicle reaches and maintains the desired spacing policy with regards to the preceding vehicle d re f = 30.6 m (see Figure 5a), while the relative speed converges towards zero (see Figure 5b). Comparison results with regards to IDM disclose the improved performances achieved by the NMPC controllers since, despite the faster convergence exhibited by these optimal techniques, the time history of the acceleration shows lower peaks, thus providing the highest driving comfort experience. This is also confirmed by Figure 6, where the acceleration trends for both NMPC and IDM are reported over the overall simulation time, i.e., t ∈ [0, 200] s. From this figure, it is possible to appreciate that, when leveraging the NMPC, lower values of the acceleration peak can be obtained. It results in a more comfortable driving experience. In fact, while the IDM reaches the saturation value of a max = 2.4 m/s 2 , the resulting acceleration computed by the NMPC case is always below the value of a = 1.2 m/s 2 . Moreover, it is worthy to point out that the presence of noise, arising from the implementation of driver model for the leading vehicle in SUMO, clearly affects the IDM performances, while NMPC shows enhanced resilience. The vehicle speed v(t) and lateral position e y (t) profiles are, instead, depicted in Figures 7 and 8, respectively. Specifically, simulation results in Figure 7 confirm that the ego vehicle keeps the leading speed, namely v lead = 25 m/s, until the time instant of 100 s when the lane-change maneuver to the left is performed; then, its speed reaches the desired reference value of v re f = 33 m/s. Once more, the NMPC is able to provide a more comfortable driving experience with respect to the IDM since the desired set-point is reached with smoother behavior (see also Figure 6). Moreover, during the lane changing maneuver, the Ego vehicle's lateral position e y (t) is equal to zero before the time instant t = 100 s and, afterwards, it is correctly regulated to e y re f = L w = 3.6 m (the left lane center-line) (see Figure 8).
In order to provide more insights on the behaviour and performance achieved by the NMPC, Figures 9 and 10 show the time history of the lateral and longitudinal control inputs, u 1 (t) and u 2 (t), respectively. Results show that, as expected, the dynamics of these input variables are always smooth and well-bounded, thus backing up the comfort results shown in the previous figures, and that the NMPC enables us to tune the right trade-off between aggressiveness and comfort in a straightforward manner; this also ensures that the input constraints are never violated. Finally, to evaluate the real-time performance of the proposed hybrid control architecture, we evaluate its computational load via the Simulink Profiler Tool [2]. By leveraging the same hardware used in the training phase, the simulation profile reports that the required total computational times are 3.75 s and 19.05 s for the DQN agent and NMPC controller, respectively, w.r.t. an overall simulation time of 200 s.  As a more exhaustive validation scenario, we consider a three-lane highway scenario characterized by heavy traffic conditions, i.e., a lane count of 1500 vehicles per hour. The vehicles along the lanes travel at different random constant speeds. However, under heavy-density traffic conditions, they vary their velocities according to the traffic flow and perform lane-changing maneuvres both on the right and on the left lane, when necessary. In this highly uncertain scenario, the proposed hybrid control architecture drives the Ego vehicle in performing safe lane-changing maneuvers to the right and left according to the encountered traffic scenario. An exemplary visualization of this appraised scenario is shown in Figure 11, where the ego vehicle performs a lane change maneuver to the right in order to gain speed. Detailed results about this complex scenario, reported at https://youtu.be/Srj7TXdrKhY (accessed on 6 March 2023), confirm the ability of the proposed hybrid architecture in safely driving the Ego vehicle under highly-varying traffic conditions, as well as its effectiveness in facing the uncertainties arising from unexpected and sudden lane changing maneuvers performed by the surrounding vehicles. Figure 11. Sequence of extracted frames from an exemplary scenario. The ego vehicle (in red) changes lane to the right so to overtake the front vehicle and gain speed. The time initiation is appropriate to avoid a collision with rear vehicles.

Conclusions
In this paper, a novel hybrid hierarchical decision-making and motion planning control architecture for Autonomous Vehicles, driving in uncertain and unknown highway environments has been addressed. The high-level decision-making, designed via the emerging Deep Reinforcement Learning theory, has allowed the AV to learn how to adapt, in real-time, its driving behavior according to the current surrounding highway environment and, hence, how to take the proper longitudinal and lateral autonomous decisions to avoid collisions and improve travel safety. The decision-making process has been carried out by exploiting a proper reward function that penalizes hazardous and non-safe maneuvers. The training of the DDQN agent has been performed by leveraging a high-fidelity cosimulation platform, and the results have confirmed the correct learning process in this highly uncertain driving scenario, whose features are difficult to capture via classical modelbased approaches. These optimal driving policies represent the reference behavior for the motion planning layer, where optimal controllers based on NMPC have been responsible for the optimized driving behaviors, also enforcing constraints in the state/control of the ego vehicle. The validation of the designed architecture, carried out by also considering an high-density traffic flow, has disclosed how the vehicle, thanks to the high decision-level was able to select, according to the current highway driving situations, the proper safe maneuver (i.e., lane changing, overtaking of acceleration/braking maneuvers) to engage, while the low-level determines the optimal control commands to be imposed on the AV (explicitly considering its dynamic constraints) for ensuring a safe tracking of the selected driving mode. Summarizing, the virtual simulations in non-trivial highway traffic scenarios have confirmed the effectiveness of the proposed hybrid architecture and its advantages in combining model-based and artificial intelligence-based methods. Future work will involve the extension of the approach to different road geometries and urban environments.