Deep, Consistent Behavioral Decision Making with Planning Features for Autonomous Vehicles

: Autonomous driving promises to be the main trend in the future intelligent transportation systems due to its potentiality for energy saving, and trafﬁc and safety improvements. However, traditional autonomous vehicles’ behavioral decisions face consistency issues between behavioral decision and trajectory planning and shows a strong dependence on the human experience. In this paper, we present a planning-feature-based deep behavior decision method (PFBD) for autonomous driving in complex, dynamic trafﬁc. We used a deep reinforcement learning (DRL) learning framework with the twin delayed deep deterministic policy gradient algorithm (TD3) to exploit the optimal policy. We took into account the features of topological routes in the decision making of autonomous vehicles, through which consistency between decision making and path planning layers can be guaranteed. Speciﬁcally, the features of a route extracted from path planning space are shared as the input states for the behavioral decision. The actor-network learns a near-optimal policy from the feasible and safe candidate emulated routes. Simulation tests on three typical scenarios have been performed to demonstrate the performance of the learning policy, including the comparison with a traditional rule-based expert algorithm and the comparison with the policy considering partial information of a contour. The results show that the proposed approach can achieve better decisions. Real-time test on an HQ3 (HongQi the third ) autonomous vehicle also validated the effectiveness of PFBD.


Introduction
Autonomous driving (AD) has been vastly investigated in different domains for several decades and has a wide variety of applications, especially in intelligent transportation systems. AD provides more economic, comfortable and safer service than human drivers with fewer collision risks. The behavior decision and trajectory planning system is a key component of AD.
Presently, as a mature, rule-based decision and planning method, finite state machines (FSMs) [1,2] have been applied in most autonomous vehicles. Nonetheless, FSMs still have problems with guaranteeing decisions' accuracy, especially in complex scenarios, due to the difficulty in determining the boundary condition of different states and going through all states with limited experience. To make the vehicles more intelligent and more acceptable for human beings, learning methods have been attractive in recent years to enhance decisions in AD. Among them, neural networks related approaches [3][4][5] have been used to navigate autonomous vehicles due to their strong capacity of mapping inputs to control signals. The raw pixels from cameras [3,4] were used to train neural networks for keeping a vehicle in a lane, which proves that the task of lane following can be learned from the only limited training signal (steering) in a supervised manner. Fusing multiple inputs the motion planner, firstly, evaluates whether the gaps are enough for the ego agent to merge in. If space is wide enough, the maneuver is executed with a path generated; otherwise, the command will be ignored for safety's sake. The final conservative strategy must be a compromise between an optimal policy and a safe one. A policy with safety ensured can be implemented in autonomous vehicles. Generating a policy by combining path planning features can be one feasible choice. If given the traffic participants and environments, a system can provide all possible maneuvers with corresponding reasonable paths; then, the decision policy can be reinforced with learning methods by selecting the best maneuver. Proceeding with previous work, the synchronous maneuver searching and trajectory planning algorithm (SMSTP) [23], we propose a new planning-features-based behavior decision method whose architecture is shown in Figure 2. SMSTP outputs no-collision trajectories which represent the safe policy in yellow of Figure 1, while the target of our proposed PFBD approach was to learn the optimal policy in green. A contour [23] is one transitable spatio-temporal area for autonomous vehicles in the synchronous maneuver searching and trajectory planning algorithm (SMSTP) algorithm.
In the SMSTP algorithm, three key parameters, the contour to drive, the expected pose and velocity at predicting horizon in the contour, are selected empirically. To learn an optimal policy, we creatively propose to use the DRL algorithm to learn the aforementioned three key parameters instead. The SMSTP algorithm provides a basic policy with a safety guarantee, while PFBD aims at optimizing the behavioral decision towards the optimum. Once the three key parameters are decided, a no-collision trajectory can be optimized numerically.
To the authors' knowledge, the deep combination of trajectory planning and behavioral decisions for autonomous vehicles is proposed for the first time here. In brief, we summarize the contributions of the proposed approach in this paper as follows:

•
We came up with a compact semantic framework of the feature representation which includes planning space's states, lanes' states, ego vehicle's states and task's states. The states selected from planning space are necessary and representative for making behavioral decisions in various traffic scenarios.

•
The proposed PFBD method selects a route generated from the planning space with a safety guarantee. Besides, it learns better policy through exploring the environment instead of struggling with the rules.

•
The proposed method learns a policy through deep reinforcement learning. The semantic training data can be sampled from both simulations and real traffic. The simulated samples help the policy generalize the unknown scenarios.

•
We applied the method to autonomous driving in situations, such as lane changing, zebra passing and ramp-merging. Both simulation and real-time tests were performed. The PFBD method achieved competitive performance compared to rule-based methods.

Reinforcement Learning Basics
RL originates from the trial-and-error search with delayed rewards, and it is dedicated to learning the optimal policy, which maximizes a long-term reward in the subsequent decisions. The agent learns a control policy by continuously interacting with its environment, and the policy maps states to actions from interactions.
The agent is in a state s t ∈ S of itself and its environment at the time t ∈ [0, 1, 2, . . . ], and it has a possible action a t ∈ A s t ⊆ A on state s t , where S and A are the sets of all the environment states and available actions, respectively. An agent in a new state s t+1 observes an immediate reward r t = r(s t , a t ) ∈ R after executing the action a t . A new state transition tuple (s t , a t , r t , s t+1 ) is then sampled to train for a control strategy, which is defined by a policy π mapping the states to the actions π : S → P(A). The agent aims at maximizing the long-term discounted return defined as: On the policy π, the state-action value function: describes how good the policy is after taking an action a t in state s t . The policy π is often acquired by recursively solving the Bellman equations: The driving policy in a realistic traffic situation can be viewed as a deterministic one. The policy then can be described using a function µ : S ← A, and the expectation of action can be avoided: Deep Reinforcement Learning For a value-only based RL algorithm, such as deep Q-learning, it adopts a deep neural network (DQN) instead of a table to estimate the value to each state-action pair. DQN uses a greedy µ(s) = arg max a Q(s, a) in discrete action spaces. Approximated by parameter θ Q , one agent optimizes the function by minimizing the loss: where the term y t = r(s t , a t ) + γQ(s t+1 , µ(s t+1 )|θ Q ) is called the target, as the loss tries to make the value function be more like this target. The long-term accumulation of reward leads to high variance. Besides, DQN only works well with few actions; thus, failing in continuous action spaces. DDPG [19], a model-free off-policy algorithm, can be used for environments with continuous action spaces. DDPG has two pairs of actor-critic networks with the same structure. The critic network approximates the state-action value function Q(s, a|θ) by parameter θ, which updates by minimizing (4). The actor network approximates the agent's policy µ(s|ω) by parameter ω, which deterministically maps the observed state s to an certain action a. The actor network is updated by the derivative of (3) respectively to the actor parameters: where N is the number of state transitions. Two main technologies are employed by DDPG to enable a stable learning process. The first one is the replay buffer, which is the set E of previous experiences in the form of state transitions (s t , a t , r t and s t+1 ). The learning policy may shift towards and even over-fit to training samples. Therefore, old experiences are indispensable for robustness. To balance the learning process, new experiences are given a higher probability than older ones of being sampled in a minibatch. The other technology is the target network originally used in DQN. The target term in (4) depends on the same parameters to be learned while the network is being trained. This problem makes loss optimization unstable. The solution is to use another pair of networks Q (s, a|θ ) and µ (s|ω ) with a time delay. A second network, called the target network, lags behind the first one, which is called the evaluation network. The target network updates by a smooth average: where τ 1. To address the over-approximation of actor-critic networks, the twin-delayed deep deterministic policy gradient algorithm (TD3) [24] based on DDPG employs three technologies. The delayed policy update changes the target networks slowly every d steps, and this makes networks steadier by avoiding over accumulating the rewards. The policy smoothing supposes similar states should have similar actions by adding noise to the target actions in target Q networks: where σ is a parameter for noise, c is the boundary for clipping. Clipped double Q-Learning uses two randomly initialized critic networks as estimates and always takes the minimum one to update the target:

Vehicle Models
The vehicle uses a kinematic bicycle model for forward motion and lateral movement: where x and y are the vehicle's lateral and longitudinal positions; v and a are the velocity and acceleration; and θ and δ are the vehicle's heading and steering angles, respectively. L is the wheelbase and ∇T is the time step size. Since this paper focuses on the decision level, the dynamic of other traffic vehicles is assumed to be controlled by fixed acceleration speed. To simulate generally, different behavior modes of traffic vehicles are used; namely: aggressive mode, regular mode and modest mode. The behavior modes are characterized by different acceleration speeds, safe distances to follow a front vehicle, safe distances to overtake a front vehicle and safe distances to avoid near back vehicles. Each participant vehicle emerges randomly around the ego agent with random target speed. These simulated vehicles are also enabled to change lanes within a random time interval following general traffic rules in their own behavior modes. These settings enable the vehicle to explore more potential types of traffic scenarios, which are prerequisites to learn the optimum policy.

Planning Space
In real traffic flows, an experienced driver not only observes the nearest vehicle in each lane but also looks far ahead in a defensive manner. Looking far ahead helps drivers notice potential hazards and drive smoothly. Therefore, the traffic vehicles in front of the ego agent's planning scope should be considered.
In previous work [23], the horizon for decision and planning was fixed as T h . The planning scope in Figure 3 shows the range between the end pose (X d ) of agent decreasing to zero velocity, and the end pose (X a ) of an agent accelerating to the maximum velocity allowed within T h . Vehicles inside such a scope influence the agent's behavior and the states for decision.

States
The agent's behavioral decision depends on all relevant traffic entities. An entity can be a traffic participant, a pedestrian, a car or a bicycle with pose and velocity information. The lanes and signs, such as stop signs and speed limit signs with particular instructive information, are also entities. However, to generate a safe decision based on the topological routes, entities' information must be translated into planning features with specific meanings, as summarized in Table 1.
Features representing these entities are crucial for the agent's decision, which are pre-processed as following.

States (of) Features Meaning
Contour(s c ) the planning scope's front-end velocity e v (•) agent's absolute velocity e a agent's acceleration speed e h (•) agent's heading relative to lane's direction tk d the degree of task (mandatory, discretionary) tk l the lane-error to target lane of task tk p the right of way in task Task (continuous,(s c )) t p the position of special task t v the target velocity to follow at t p − the preserved states for future use

States of Contours in Planning Space
To make a safe decision, the features for available planning space are extracted from results of SMSTP algorithm, which guarantee safe topological routes in Figure 3. Each contour corresponding one route has three features, as shown in Figure 4. Two of them are the rear-end and front-end poses of a contour, c rd and c f d , respectively. One feature is the recommended velocity c f v at the front end. If the contour is split by a vehicle at the front end, then this feature is related to the velocity of that vehicle; otherwise, it depends on the maximum velocity that ego agent can reach at time T h .

States of Tasks
Driving along a road without following traffic rules is relatively simple. However, autonomous vehicles of this kind are not acceptable in realistic situations. An agent's capability includes both going into a target lane and reaching the given position at a specific speed. The former is guided by task features in a discrete form, while the latter is instructed by target features in a continuous form.
Three separate task-level features are employed here. The task-level tk d means to which degree the agent should follow a given task. When the agent's task is to leave the arterial road in scenario (c) of Figure 5, the mandatory task-level requires the agent to arrive at the rightmost lane before coming to the separation point. The discretionary task-level means that the agent can select any available lane to maintain a high cruise speed in scenario (c). The other one, lane-error tk l , is the relative distance to the target lane. Lane-error represents how many lanes there are between the current lane and the target lane, which is the main punishment for task requirements. Another key feature is the prioritization tk p . When two vehicles' distances to intersections have no significant difference, a vehicle with the right of way is encouraged to go in priority. Otherwise, it should go after vehicles in other roads.
Two continuous target features, the target pose t p to reach and the target velocity t v to follow, guide the agent to arrive at the position with given speed. Taking intersection cross, for example, a recommended speed can be 25 km/h for the sake of safety and traffic rules, if the agent traverses at higher speeds, some punishment will be given. These two features also represent speed limits from traffic signs. Including four preserved ones, the state of task s tk has nine features.
The final state for MDP is s t =< s t c , s t l , s t e , s t tk > with a total of 57 features.

Actions
The action a t =< a t l , a t d , a t v > uses three sub-actions jointly to decide an agent's behavior and optimizing parameters in a select contour. The first discrete action a t l ∈ [−1, where o max and o min are the maximal and minimal expected values of inputs. In fact, due to the values of action a t d and action, a t v can hardly be 1 or −1 through hyperbolic sine function. We enlarge both o max and o min a bit and truncate the outputs to expected ranges.
Action a t l selects the target lane, and it has several contours in the corresponding planning scope. The undetermined contour for the trajectory optimizer depends onā t d . Ifā t d locates inside one contour of the planning scope in Figure 4, then this contour is selected. Otherwise, either the nearest or the furthest contour is selected andā t d is adjusted to either c 0 rd or c i f d (i depends on the number of contours in the lane). In the trajectory optimizer layer of the SMSTP, the chosen contour, adjustedā t d andā t v , determine the final trajectory in predicted horizon T h .

Rewards Design
The ego agent employs an action and gets a reward r t at time t, which evaluates the current state and action. r t guides the policy to learn proper actions for higher rewards and reflects to what degree the agent drives as we expect.
A well-designed reward represents the desired policy for complex driving tasks. Therefore, r t is a compromise among agent's safety, traveling efficiency, and task achievement in linear combination form: where r t s is the safety penalty, r t e is the reward for agent's driving efficiency, r t t is the penalty for deviation from task's target and v i is the velocity of the i's traffic participant. As frequent lane changes can cause danger in traffic flow, a small penalty is set in r t s of (14) to discourage unnecessary lane changes. A reasonable agent drives as fast as possible within the speed limit of a lane. However, being a cooperative member of the traffic flow is much safer. Hence, r t e in (15) leads the agent to drive at the average speed of traffic, and to pursue the maximal speed of the lane, whether there are traffic participants or not. r t t provides an agent the ability to perform different tasks, which depends on the requirements. For example, −5tk l in (16) punishes the agent for deviating from the target lane when the agent should be in the target lane. −|e v − t v | punishes the agent for being unable to follow given speed limitations. To achieve other tasks, r t t should adjust the penalty term accordingly. Each term in r t t is not less than −5.

Neural Networks
The actor and critic networks are presented in Figure 6. A state s t enters the actor network, which has five neurons as the output layer in green. A softmax function takes the first three neurons as input and outputs the discrete action a t l . Two continuous actions a t d and a t v are the last two neurons adding some noises. The critic network takes a state s t as input, and concatenates the first hidden layer with the output layer of actor network as the input of second hidden layer. A scalar value is output as the action-value (Q-value).
The actor network has three mutual hidden layers, and each layer has 256, 512 and 128 neurons, respectively. The private layers connecting discrete and continuous actions both have 64 neurons. The critic network has three hidden layers, and each has 256, 128 and 64 neurons, individually. Rectified linear unit (ReLu) activation functions are used in the hidden layers. Meanwhile the output layer of actor network uses tanh functions to constrain the value of actions.

Algorithm Procedure
The procedure of PFBD method for autonomous vehicle safe decision is Algorithm 1. The algorithm includes two main parts. The first part is for sampling and evaluation. Experiences of state transitions are sampled before the policy is trained. Algorithm 1, firstly, initializes the actor-critic networks and sets a task for agent along with the penalty function for different tasks. After initializing the state s 0 in each episode, action a t is generated according to sampling mode among random sampling and -random sampling. a t shifts agent to next state s t+1 , and a reward r t follows. When the number of state transitions (s t , a t , r t , s t+1 ) reaches N max , the transitions are stored as a group into "Experience" on the local disk. If a policy has been trained, we set sampling mode to "Evaluation" to evaluate its performance.
For the training part, N env groups of samples in Experience are loaded from disk. The newer stored samples are given a higher priority to balance the shifting of policy. Then, the parameters of evaluation and target networks are updated in sequence.

Algorithm 1 Planning features based behavior decision algorithm (PFBD)
Initialization: Initialize evaluation actor-critic networks' parameters θ 1 , θ 2 , ω randomly Initialize target actor-critic networks' parameters θ 1 ← θ 1 , θ 2 ← θ 2 , ω ← ω 1: for task k = 1 to N k do 2: Set a Task and sampling mode, load the traffic map, reset traffic participants 3: Initialize the agent's state s 0 4: for step t = 0 to T do 5: generate action a 0 according to sampling mode 6: switch ( sampling mode ) 7: case Random sampling: randomize the output layer of the actor network 8: case -random sampling: randomize the output layer of actor network with probability 9: case Evaluation: use the trained policy (forward through actor network) 10: end switch 11: Employ action a t to next state s t+1 based on SMSTP algorithm, calculate the reward r t

13:
if n s == N max then 14: Store N max transitions as a group into Experience, n s = 0 15: end if 16: end for 17: for epoch ep = 1 to N epoch do 18: Load N env groups of Experience from disk with priority, randomly sample a minibatch of N minibatch transitions 19: Update critic in evaluation network by minimizing (4), target is updated by (9) 20: Update actor in evaluation network following (5) 21: Update the target networks following (6) 22: end for 23: end for

Simulation and Experiment
Since the proposed approach focuses on behavioral decision and planning with only abstract entities for autonomous driving, we tested the method with our vehicle simulation platform instead of real traffic. A schematic view of sampling and evaluation architecture is depicted in Figure 2

Traffic Participants
The other traffic participants and their behaviors were set as anticipated in Section 2.1. The participants changed their behavior modes at a probability p = 0.005. (If five traffic participants exist, the probability of at least one changing its behavior in one second is 0.2217.) The changing of behavior mode ensures abundant experiences of different scenes. A vehicle was reset when it got out of the agent's range.
The participants employ an expert system to make behavioral decisions and to execute longitudinal control. The expert system considers the distances and velocities of the nearest vehicles in the surrounding lane. The decision system has been well tested in real traffic experiments [25].

Hyper-Parameters for TD3
A minibatch of 64 state transitions was randomly selected from 32 groups of experiences to train the policy every two steps. The optimization method for TD3 is the Adam algorithm [26], and the learning rates for actor and critic networks are 10 −2 and 10 −4 , respectively. The discount for future reward was set to 0.99. We used deep learning framework, libtorch (Pytorch for C++) [27], to train the agents.

Performance Index
To evaluate the capabilities of PFBD method, the following metrics are introduced.

•
Collision Rate [%]: the collision rate is defined as where CR is the number of crashes that happened, V is the traffic volume in the agent's sensible range, T ≥ 1 is the time in hours the agent drives and L is the total distance traveled. This is the most critical metric for evaluating the agent's performance.

•
Average velocity (m/s): the average velocity represents how fast the agent drives and to what degree the expected velocity is matched. • Average distance between collisions (km): the distance between two collisions is to avoid the influence of agent's velocity and duration time.

•
Rule-violation rate (%): the violation rate indicates the agent's capability of achieving given tasks, such as decreasing when a zebra is coming, keeping in specific lanes and leaving the ramp's accelerating lane before coming to its end.

Simulation and Evaluation
In the simulation, we trained four different agents with four corresponding polices. Three agents' policies, P A , P B and P C were trained individually on three typical scenarios (see Section 4.1), respectively. To validate the versatility of PFBD, one more agent's policy P T was trained on all three scenarios. Besides, we compared the performance of PFBD method with the rule-based expert system and an untrained random policy.
An episode is reached in the following three cases: • Case 1: the agent is made to unavoidably collide with a vehicle in 1.5 s. • Case 2: the agent has successfully made more than 20 decisions. • Case 1: the agent updates for 600 steps without collisions.
The last two rules prevent an episode with too many running steps.
To generate the desired behavior, the reward function for scenarios B and C was modified accordingly. For scenario B with zebras on the road, r t t in (16) is added with a new term: where v zebra is the recommended velocity to pass a zebra area and d zebra is the distance to a zebra. For task entering the main road in scenario C, r t t in (16) is added with: where d ramp is the distance to ramp's end. At the first stage, we sampled more than one million state transitions in each scenario randomly. Then, the agents, P A , P B and P C were trained from samples in each scenario according to Algorithm 1, respectively. P T was trained on samples from all three scenarios. Then, we used the trained agents to sample with a probability randomly selecting the actions. The agents did not stop being trained until the loss stopped decreasing for 10 epochs. Figure 8 shows the decision and planning result of trained agents in the simulation environment. Figure 8a,b illustrates that the agent takes over the front vehicle (ID = 9) and follows a relatively quicker vehicle (ID = 13). Figure 8c-e shows that the agent merges into the middle lane timely before the ramp(right lane) ends.

Performance of PFBD on Different Scenarios
The statistical results for four agents are listed in Table 2. Generally, the trained agents achieved the designed maneuvers. The collision rate and average distance without collision show that the agent can learn proper policy to avoid collisions. Scenario A is easier for the agent to drive in since the agent achieves the highest average velocity and lowest collision rate. For scenarios B and C, the agent has to slow down for the zebra or limit its speed in the ramp. P T can also learn proper behaviors on all three scenarios without an evident decrease in performance. Note that the average distance without collision is relatively short in the simulation. The main cause is that a traffic participant sometimes emerges too close to the agent since it is randomly reset. For the abundance of sample distribution, the nearest distance to the agent is 20 m. When the agent is much quicker than the simulated vehicle in the front, a collision can be hard to avoid. The rule violation is also difficult to define clearly. In our simulation, for a successful result in the zebra scenario, the duration of the agent's velocity being smaller than the recommended velocity has to be kept up for at least two seconds.

Comparison with the Expert's Policy
DRL-based PFBD method can explore optimal policy via interacting with environments. Nevertheless, the rule-based expert' policy often struggles in dynamic situations with heavy traffic. Therefore, the performance of the trained policy was compared to the expert's policy. The agents P A , P B and P C ran for at least 10 5 iterations and at least 2500 maneuvers executed. The expert's policy described in Section 4.1.2 and a random policy (untrained) were both evaluated.
One critical concern is that the reward of lane-keeping must be recorded in a balanced manner. We did not record a reward in every iteration. On the contrary, either the agent kept in a lane for every 40 steps or a non-lane-keeping maneuver was selected; the reward of lane-keeping was recorded at one time. Figure 9 depicts the collision times after the agents ran for 10 5 iterations in three testing scenarios. It shows that the learned agents for all three scenarios significantly decreased the collision times after training. Besides, the learned agents achieved competitive performance, as did the conservative traditional expert. In fact, a large number of the collisions derived from a too-close-in-front vehicle after resetting. Although the low-level SMSTP algorithm can plan all non-collision topological paths, whether the trajectory can be well executed is undetermined (the agent cannot decelerate quickly enough). Therefore, collision times depend on the performance of a policy. The simulation results verify that the agents can learn proper policy after training.
For the rewards of the learned agents, as shown in Figure 10, most immediate rewards (in blue points) achieved zero punishments in scenario A and B, while the expert's immediate rewards (in red points) seldom got no punishment. A statistical distribution of immediate rewards is listed in Figure 11. Both agent P A and agent P B , more than 40% of the time, received zero punishments. This implies that the agents successfully found an optimal policy in the corresponding states. In contrast, the expert's policy mostly had a small immediate reward, as the expert behaved conservatively and got trapped in dynamic traffic at a relatively low speed most of the time. In the ramp merging scenario, the agent and expert both had a scattered distribution of immediate rewards, which reflected the difficulty in the ramp-merging task. Both policies had to compromise for a successful merging maneuver.

Comparison with standard DRL
To validate whether the states of contours help to make a better decision, we compared the proposed PBFD with a standard DRL where all the parameters were selected as the same as those in the PBFD except for the dynamic features from the planning layer being not concerned in the standard DRL. To achieve this, we specifically set, in the DRL, the planning scope's front-end distance c f d to zero with small noise. With the above design, we trained a new agent C A in scenario A, whose behaviors are displayed together with those of the PBFD and the expert in Figure 12.
The results show that agent C A had almost the same collision times as P A in 10 5 running steps. It can also be seen in Figure 12a,c that agent C A had about a 6% chance of achieving zero punishment, which is better than expert's policy (in Figure 11), which could hardly get any zero punishment outcomes. However, agent C A behaved worse than agent P A , as 40% of immediate rewards of agent P A were zero. This means that agent P A with full knowledge of contour states has larger probabilities of making optimal behavioral decisions. Thus, the creative design of contour is vital to learn better behavior decisions.

Real-time Experiment On HQ3 Autonomous Vehicle
To validate the capacity of the proposed PFBD algorithm for autonomous vehicle decision making, the real-time experiment on HQ3 autonomous vehicle was conducted in the suburbs with regular traffic. HQ3 [25] was equipped with necessary sensors, including cameras, LIDAR, radars, inertial measurement unit (IMU), etc.
To use the learned agents in real-time test, we kept the same states in both simulation and the real test, and we ignored those states that did not exist in simulations, such as an intersection with turn commands. The two main tests in the suburbs were keeping in the target lane near a zebra (corresponding to Scenario A and B) and merging onto the main road from a side road (corresponding to Scenario C). The agent's maximum speed was limited to 60 km/h. The road network avoided untrained tasks, such left turning or right turning that did not exist in simulation. Figure 13 shows the agent's behavioral decision and trajectory planning results for the two separate maneuvers. The blue short lines indicate the speeds and directions of the detected surrounding vehicles on the road. In the first test, HQ3 successfully merged to the middle lane. Besides, as the zebra area was coming (see Figure 13c, the planned trajectory led HQ3 to slow down (see green trajectory of Figure 13i). In the second test, our agent tried to merge into the left main road (see Figure 13d-f). HQ3 changed to the left curve lane before the road network of the side road coming to end. As in the real-time test shown, the trajectory planner based on the learned agent can successfully generate a smooth and safe trajectory for HQ3.

Conclusions
In this paper, a novel planning features-based, deep behavior decision method trained with TD3 was proposed to select an optimal maneuver for autonomous vehicles in dynamic traffic. The PFBD uses rule-based topological trajectories as basic guarantees and learns optimal policy with DRL, which helps to bring learning methods to the autonomous vehicle's core module, behavioral decisions and trajectory planning. The behavioral decision problem was modeled as an MDP process, and the training algorithm was proposed. Due to the creative design of input states and actions, the behavior decision is inherently associated with drivable routes in the topological planning space. Thus, the policy's decision is consistent with trajectory planning, and the path is a safer one. The evaluation results have demonstrated that the proposed method can learn proper behaviors. Besides, comparisons to the expert's policy and the standard DRL validate the effectiveness of PFBD and show its capability to find optimal strategies with larger possibilities. The real-time test also validated the effectiveness of the proposed PFBD.
Since the agent's performance is competitive compared to the rule-based expert's policy, the future work will extend the current discrete behavioral decisions to consecutive ones, which shall evaluate the present behaviors and potential behaviors in every time step. Therefore, some emergent situations could be tackled immediately rather than executing the ongoing behavior to the end. In the meanwhile, more traffic scenarios can be used to test the generalization performance of the proposed method.