Human-like Decision Making for Autonomous Vehicles at the Intersection Using Inverse Reinforcement Learning

With the rapid development of autonomous driving technology, self-driven and human-driven vehicles will share roads in the future, and complex information exchange among vehicles will be required. Autonomous vehicles therefore need to behave as similarly to human drivers as possible, so that their behavior can be effectively understood by the drivers of other vehicles and is more in line with human cognition of driving behavior. To this end, this paper studies the evaluation function of human drivers using inverse reinforcement learning, so that the learned behavior better imitates that of human drivers. In addition, this paper proposes a semi-Markov model that extracts the intentions of surrounding relevant vehicles and divides them into defensive and cooperative, leading the vehicle to adopt a reasonable response in different types of driving scenarios.


Introduction
With the development of autonomous driving technologies, autonomous vehicles are considered an important part of the future transportation system, in terms of environmental perception, decision making, path planning and motion control. Google's unmanned vehicles have traveled millions of kilometers in automatic driving mode, and its subsidiary Waymo has planned to enter the driverless taxi industry. In addition, Tesla and Uber also have great potential in the field of autonomous driving. Tesla has developed its own automatic driving system, called Autopilot, and applied it to the Model S vehicle to achieve automatic driving functionality [1]. In China, many IT giants have also joined the venture, such as Baidu, Alibaba and Huawei, injecting new vitality into the research and development of driverless cars with their technology in the field of artificial intelligence. Baidu Apollo's automatic driving test mileage has exceeded 25 million kilometers. Baidu made its unmanned vehicle service available to the public in Shougang Park in May 2021, taking the lead in entering a new stage of unmanned driving. Subsequently, an unmanned driving test was conducted on open roads in Yizhuang, and Baidu's driverless transport service has gradually expanded from the closed park to people's daily travel needs. However, on public roads with complex traffic conditions, the overemphasis on the "correctness" of the perception-control mapping process [2] often causes driving behavior [3] with a certain social intelligence to be overlooked, which may lead to serious safety problems. For example, in 2016, Google's driverless car collided with a municipal bus because the driverless strategy misjudged the bus's intention to give way [4]. This judgment obviously does not conform to human driving behavior and does not capture the social aspect of human drivers' decision making [5]. 
That is, current driverless strategies do not learn driving behavior and decision-making on complex issues from expert demonstrations. Inverse Reinforcement Learning (IRL) can deal with the problem of learning driving style from expert demonstrations [35]. The IRL method was later developed into maximum entropy IRL [36], with which the driving trajectories of expert examples on the highway can be learned [37]. These methods obtain the weights of the reward function through feature matching between the optimal trajectory under the reward function and the expert demonstration data. Levine and Koltun [38] introduced an approach that facilitates the use of locally optimal demonstrations for inverse reinforcement learning. Naumann et al. [39] investigated to what extent cost functions are suitable for interpreting and imitating human driving behavior, focusing on how cost functions differ from each other in different driving scenarios. Huang et al. [40] used a driving model based on an internal reward function to simulate the human decision-making mechanism and inferred the reward function from real human driving data.
In [41], an extensive literature survey is conducted, covering many works related to end-to-end and IRL methods, and the respective characteristics of the two methods are listed. The end-to-end method has the ability to self-optimize, does not require prior knowledge, and has a black-box nature. End-to-end methods mostly use behavioral cloning, which is simpler than other methods. However, the end-to-end method also has shortcomings, such as insufficient usable datasets [42], poor generalization ability, inability to take task safety issues into account, and lack of standard evaluation methods. The IRL method uses expert demonstration data to train the system, has good generalization ability, and the required human driver data is easy to collect. However, the IRL method also has certain shortcomings: the computational complexity of the reward function is high, the reward function has a certain ambiguity, and the amount of data required is relatively large.
At present, work on human-like driving has developed to a certain extent, but the scenes involved in these works are mostly straight roads and rarely include intersections. The intersection is a main component of traffic and also the road section where traffic accidents occur most frequently, so research on intersection scenarios is very important. This paper uses an inverse reinforcement learning algorithm to imitate the driving behavior of human drivers in the intersection scenario, so as to improve the understandability of the behavior of autonomous vehicles when passing through an intersection.
The rest of this paper is organized as follows: Section 2 introduces the decision-making model in an intersection scenario and the related content of inverse reinforcement learning. Section 3 includes the verification in a Carla simulation environment. Section 4 carries out the experiment in an actual traffic scenario. Section 5 summarizes the methods used in this paper and points out the shortcomings.

Human-like Decision-Making Process at an Intersection
Driverless decision-making systems need algorithms that can deal with complex situations. In particular, urban scenarios, such as intersections with multiple pedestrians, traffic lights, cars and bicycles, are a huge challenge, even for human drivers. In order to make the autonomous vehicle run safely and reliably in such a complex and realistic environment, various technical challenges need to be considered and verified repeatedly in real and virtual scenarios.
As shown in Figure 1, studies of the decision-making process of human drivers, in dynamic and complex environments, contribute to building a decision-making model for uncertainty. Human generated data are used to optimize the model and adjust the parameters, so that the output behavior of the model is humanlike. Furthermore, the formulation of reasonable evaluation indexes and training models can improve the performance of agents. Finally, the simulation experiment can provide a safe test environment. Based on simulation experiments, the model can be continuously optimized and the generalization ability of the decision model can be improved.

Scenario Evaluation
The main purpose of this step is to determine whether there are safety issues between the test vehicle and the oncoming vehicle. Evaluating the driving track of the oncoming vehicle amounts to checking whether it overlaps the globally planned trajectory of the test vehicle. For example, as shown in Figure 2, the upper-left vehicle is the test vehicle (ego vehicle), the lower-right vehicle is the oncoming vehicle, and the overlap of the red track and the green track is the potential collision area. If there is an overlapping area, the oncoming vehicle is a vehicle with potential collision risk. The remaining vehicles are classified as irrelevant and are not evaluated, because such vehicles are considered to pose no potential threat to the test vehicle. This check is computationally heavy, especially when many interacting vehicles are present at the intersection, so effective screening of the relevant vehicles improves calculation efficiency. The possibility of collision is then evaluated in a reliable way.
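The screening step described above can be sketched as a simple trajectory-overlap test. This is a minimal illustration, not the paper's implementation: the point-list trajectory representation and the 2 m proximity radius are assumptions made here for the example.

```python
from itertools import product

def paths_overlap(ego_path, other_path, radius=2.0):
    """Check whether any point on the ego vehicle's planned trajectory comes
    within `radius` metres of any point on the other vehicle's track, i.e.
    whether a potential collision area exists (radius is an assumed value)."""
    for (ex, ey), (ox, oy) in product(ego_path, other_path):
        if (ex - ox) ** 2 + (ey - oy) ** 2 <= radius ** 2:
            return True  # overlapping area -> potential collision risk
    return False

def relevant_vehicles(ego_path, tracks):
    """Keep only vehicles whose track overlaps the ego trajectory; the rest
    are classified as irrelevant and skipped in later evaluation."""
    return [vid for vid, path in tracks.items() if paths_overlap(ego_path, path)]

# Example: a straight ego path, one crossing vehicle, one distant vehicle.
ego = [(0.0, float(y)) for y in range(30)]
tracks = {
    "crossing": [(float(x), 15.0) for x in range(-15, 15)],  # cuts the ego path
    "far": [(50.0, float(y)) for y in range(30)],            # never comes close
}
print(relevant_vehicles(ego, tracks))  # only "crossing" survives the screening
```

The brute-force pairwise check is quadratic in the number of trajectory points, which is exactly the cost the screening step is meant to amortize: irrelevant vehicles are discarded once, before any finer collision evaluation.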


Time to Collision (TTC)
This paper uses TTC to quantitatively divide the current scenario. TTC is mainly used as the standard to evaluate the potential threat in a collision mitigation system, because it represents the relationship between spatial proximity and velocity difference. First, the time at which other vehicles enter the area of possible conflict is used to make a preliminary division of the moving vehicles. A time threshold for the possible conflict area is set, and the surrounding vehicles are divided into dangerous, radical and safe situations. Different driving strategies are implemented according to the respective classification. Noh [43] likewise used time to collision (TTC) as the standard to evaluate the potential threat at unsignalized T-intersections [44][45][46].
As shown in Table 1, the TTC correlates strongly with the actual braking time, more so than the vehicle speed or the distance of the vehicle from the conflict point alone. To verify this hypothesis, a linear fitting analysis is performed, as shown in Figure 3. Considering the relationship between the braking time of the real vehicle and the ratio of the distance to the virtual conflict point over the vehicle speed, the following relationship is obtained:

TTC = d / v,

where d is the distance between the test vehicle and the virtual collision point and v is the speed of the test vehicle. According to the international standard ISO 22179:2009, a threshold value of 3 s is assigned. Specifically, when the time to the future collision, as calculated from the current state of the vehicle, is greater than 3 s, the current situation is regarded as safe, whereas when it is less than or equal to 3 s, the current situation is regarded as dangerous.
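The TTC computation and the 3 s threshold above can be sketched in a few lines. This is a minimal sketch of the classification rule; the handling of a non-approaching vehicle (infinite TTC) is an assumption added here for completeness.

```python
def time_to_collision(d, v):
    """TTC approximated as d / v, where d is the distance (m) from the test
    vehicle to the virtual collision point and v its speed (m/s)."""
    if v <= 0.0:
        return float("inf")  # not approaching the conflict point (assumed convention)
    return d / v

def classify(ttc, threshold=3.0):
    """ISO 22179-inspired split: TTC > 3 s is regarded as safe,
    TTC <= 3 s as dangerous."""
    return "safe" if ttc > threshold else "dangerous"

# 24 m away at 10 m/s -> TTC = 2.4 s -> dangerous; 45 m away -> 4.5 s -> safe.
print(classify(time_to_collision(24.0, 10.0)))
print(classify(time_to_collision(45.0, 10.0)))
```

The paper additionally distinguishes a "radical" situation between safe and dangerous; the sketch keeps only the two-way split that the 3 s threshold defines.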

Semi-Markov Model
The problem formulation in this study is similar to the MAXQ method [47]. In this paper, the mathematical expression of time abstraction is constructed according to the Semi-Markov Decision Process (SMDP) [48]. The SMDP is an abstract form of the Markov Decision Process (MDP). Contrary to an MDP, in which each operation takes a single time step, operations in the SMDP framework can last for a variable period of time. In order to better describe various forms of behavior with obvious differences, a semi-Markov model is used [49]. The SMDP framework proposed in this paper is shown in Algorithm 1.

Algorithm 1 SMDP Framework
In the planning window t:
    If the scenario is classified as dangerous:
        Execute the defensive strategy
    Else:
        Execute the general strategy

In this paper, a hierarchical strategy gradient method is proposed, which uses the time-abstraction formulation of the Semi-Markov Decision Process (SMDP) to solve the strategy. This method is applied to the scenario of an autonomous vehicle at an intersection. The vehicle can exhibit two different types of behavior (defensive behavior and cooperative behavior), while its sub-actions (such as acceleration and deceleration) follow the corresponding behavior. The real-data verification and simulation results in this paper show that the proposed method leads to correct decisions in scenarios of interaction with other vehicles at an intersection, and the produced behavior is similar to real human responses. On the contrary, the plain MDP solution method does not achieve good experimental results and also exhibits a large variance. In addition, the layered strategy method can be reused in similar modules in the future to cover more scenarios [49].
When solving for a strategy, this paper adopts policy search within a set planning window. At an intersection, the length of the planning window differs with the environment: in dangerous scenarios the planning window is set to 3 s, while in non-hazardous scenarios it is 8 s. The choice of prediction step also affects the quality of the final result: if the vehicle is in a dangerous situation, the prediction step is set to 0.5 s; in a non-dangerous state, it is 1 s. In the longitudinal decision-making process, based on this model and on the current motion state of the autonomous vehicle and the predicted states of the surrounding vehicles, the candidate strategy set P is generated under the given prediction horizon H and prediction step size. Each candidate strategy is evaluated by the evaluation function, and the one with the largest reward value is output as the future driving action of the vehicle.
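The generate-and-evaluate loop above can be sketched as exhaustive search over discrete acceleration sequences. The acceleration grid, horizon length and the toy reward used in the example are illustrative assumptions; the paper's learned evaluation function would be plugged in where `reward` appears.

```python
from itertools import product

def candidate_strategies(accels=(-2.0, -1.5, -1.0, -0.5, 0.0), horizon_steps=4):
    """Enumerate the candidate strategy set P: every sequence of discrete
    accelerations over the prediction horizon (grid values are assumed)."""
    return list(product(accels, repeat=horizon_steps))

def rollout(v0, strategy, dt):
    """Predict the speed profile produced by one candidate strategy."""
    speeds, v = [], v0
    for a in strategy:
        v = max(0.0, v + a * dt)  # speed cannot go negative
        speeds.append(v)
    return speeds

def best_strategy(v0, dt, reward):
    """Evaluate every candidate with the evaluation function and output the
    one with the largest reward value."""
    return max(candidate_strategies(), key=lambda s: reward(rollout(v0, s, dt)))

# Toy reward preferring to stay near a 5 m/s reference speed (assumed).
reward = lambda speeds: -sum((v - 5.0) ** 2 for v in speeds)
print(best_strategy(5.0, 0.5, reward))  # holding speed is optimal here
```

With 5 accelerations and 4 steps the set P has 625 candidates, which is why the paper shortens the planning window and coarsens the prediction step in dangerous scenarios: the search must finish within the reaction-time budget.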

Maximum Entropy Inverse Reinforcement Learning
The reward function of traditional reinforcement learning is often set by hand, which involves a certain arbitrariness. Inverse reinforcement learning [50] assumes that the strategy of the expert examples is optimal or close to optimal, and then learns the reward function from the expert examples, imitating the driving strategy of the expert; this is a human-like learning method.
The inverse reinforcement learning algorithm used in this paper is based on human suboptimal behavior. This paper adopts probabilistic graphical modeling to resolve the ambiguity of path selection and, based on the principle of maximum entropy, recovers the reward function [36].
In practical problems, because the vehicle state is a continuous space, the plain maximum entropy method leads to an extremely large solution space and non-convergence. To address this, a probability-based inverse optimal control algorithm is introduced in this paper. An IRL algorithm recovers the unknown reward function of a Markov Decision Process (MDP) from expert demonstrations of the corresponding strategy. This reward function can then be used for apprenticeship learning, reinforcement learning, and so on, to extend the expert's behavior to new situations or to infer the expert's behavioral goals.
Considering a continuous state sequence s = (s_1, ..., s_T)^T, a continuous action sequence a = (a_1, ..., a_T)^T and discrete time, this kind of task is characterized by a dynamics function F, defined here as F(s_{t-1}, a_t) = s_t. The reward function is defined as r(s_t, a_t). Given the initial state s_0, the optimal action sequence is determined by a* = argmax_a Σ_t r(s_t, a_t).
In this study, the maximum entropy model, closely related to the linearly solvable Markov model, is used for modeling. According to this model, the probability of selecting an action sequence a is proportional to the exponential of the rewards encountered along the path:

P(a | s_0) = exp(Σ_t r(s_t, a_t)) / Z, (1)

where Z is the partition function. An approximation of Equation (1), given below, allows efficient learning in high-dimensional continuous domains. In addition to breaking the exponential dependence on dimensionality, this approximation requires only examples that are approximately locally optimal, instead of example trajectories that are globally optimal.
It is assumed that experts perform local optimization, rather than global planning, when selecting the action sequence a. Compared to the global optimality requirement, local optimality is a much weaker assumption. Using r(a) to denote the sum of rewards along the path, Equation (1) can be written as

P(a | s_0) = exp(r(a)) / ∫ exp(r(ã)) dã. (2)

This probability is approximated by the second-order Taylor expansion of the reward near the demonstrated action:

r(ã) ≈ r(a) + (ã − a)^T (∂r/∂a) + (1/2)(ã − a)^T (∂²r/∂a²)(ã − a).

Denoting the gradient ∂r/∂a as g and the Hessian ∂²r/∂a² as H, the approximation to Equation (2) becomes a Gaussian integral that can be evaluated in closed form, and the approximate log likelihood is obtained as

L = (1/2) g^T H^{-1} g + (1/2) log |−H| − (d_a T / 2) log 2π,

where d_a is the action dimensionality. The gradient of L with respect to the reward parameters θ is

∂L/∂θ = h^T J − (1/2) h^T (∂H/∂θ) h + (1/2) tr(H^{-1} ∂H/∂θ),

where h = H^{-1} g, J is the Jacobian of g with respect to θ, and the trace term, written elementwise, involves [H^{-1}]_{ttij}, the ij-th entry of the tt block of H^{-1} [38].
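The approximate log likelihood L = (1/2) gᵀH⁻¹g + (1/2) log|−H| − (dT/2) log 2π can be evaluated numerically once g and H are available. This is a sketch under the assumption that g and H have already been computed from some parameterized reward; it only implements the closed-form expression, not the full IRL training loop.

```python
import numpy as np

def approx_log_likelihood(g, H):
    """Approximate IRL log likelihood under the local-optimality assumption:
    L = 1/2 g^T H^{-1} g + 1/2 log|-H| - (n/2) log(2*pi),
    where n = len(g) and H must be negative definite for the
    Gaussian integral to converge."""
    h = np.linalg.solve(H, g)             # h = H^{-1} g without forming the inverse
    sign, logdet = np.linalg.slogdet(-H)  # stable evaluation of log |-H|
    assert sign > 0, "H must be negative definite"
    return 0.5 * g @ h + 0.5 * logdet - 0.5 * len(g) * np.log(2 * np.pi)

# Sanity check: at an exact local optimum (g = 0) with unit curvature H = -I,
# the likelihood reduces to the Gaussian normalizer alone.
print(approx_log_likelihood(np.zeros(2), -np.eye(2)))  # -log(2*pi)
```

Maximizing this quantity over the reward parameters pushes the demonstrated action toward a local optimum of the learned reward (driving g toward zero) while the log-determinant term rewards sharp curvature, which is what makes locally optimal demonstrations sufficient.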

Design of Evaluation Function
In Markov modeling, the formulation of the evaluation function affects the generated strategy. This paper uses a Markov model to generate driving behavior and trains it according to the behavioral characteristics of human drivers.
At intersections, people execute different driving tasks in different environments. When the environment is relatively safe, people choose to reach their destination more quickly; when it is relatively dangerous, people first consider safety and will rather sacrifice speed to ensure it. The standard expected-utility model has two main limitations: (1) it assumes that human beings are risk neutral in their utility function; (2) it assumes that humans do not distinguish scenarios with known outcome probabilities from scenarios with unknown outcomes. Experimental results show that it is not a good model of human behavior in risk scenarios: human beings are not only risk averse but also ambiguity averse. To model this risk aversion in decision-making behavior at intersections, a hierarchical decision model is adopted (Algorithm 2). The upper layer performs the division into dangerous and safe scenarios, while the lower layer uses different evaluation functions for safe and dangerous scenarios. Inverse reinforcement learning in continuous state space is used to learn the human evaluation function from human driving data. The evaluation function is assumed to be linear in the following features:

Timeliness: based on the deviation between the desired reference speed v_ref and the current vehicle speed v_0.

Comfort: based on the longitudinal jerk and the steering wheel angle of the vehicle.

Security: based on the distance d (m) between two vehicles and the number n_s of surrounding interacting vehicles.
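A linear evaluation function over these feature groups can be sketched as follows. The concrete functional forms of the three features below are illustrative assumptions, not the learned ones from the paper; only the linear-in-weights structure is taken from the text.

```python
def features(v0, v_ref, jerk, steer, d, n_s):
    """Illustrative timeliness / comfort / security features
    (functional forms are assumptions made for this sketch)."""
    timeliness = -(v_ref - v0) ** 2        # penalise deviation from desired speed
    comfort = -(jerk ** 2 + steer ** 2)    # penalise longitudinal jerk and steering
    security = -n_s / max(d, 1e-6)         # penalise proximity with many interacting vehicles
    return (timeliness, comfort, security)

def evaluate(weights, feats):
    """Linear evaluation function: a weighted sum of the features. The weights
    are what inverse reinforcement learning recovers from human driving data."""
    return sum(w * f for w, f in zip(weights, feats))

# At the reference speed with no jerk/steering, only the security term remains.
print(evaluate((1.0, 1.0, 1.0), features(5.0, 5.0, 0.0, 0.0, 10.0, 2)))  # -0.2
```

Because the function is linear in the weights, the hierarchical scheme simply learns two weight vectors, one for the safe-scenario evaluation function and one for the dangerous-scenario one, and the upper layer selects which to apply.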

Scenario Design
This paper uses the simulation platform Carla to build an urban road intersection setup and arrange the relevant traffic elements, simulating the driving behavior of human drivers in the traffic scenario of an urban intersection. Figure 4 shows the initial interface of Carla. A preset intersection on the Carla simulation platform is selected as the experimental scenario (Figure 5): the other vehicle turns left at the intersection while the test vehicle goes straight, forming a conflict area between the two vehicles at the intersection. Since this paper mainly studies the longitudinal decision-making process of the vehicle, the test vehicle adopts the straight route while the other vehicle adopts the left-turn route, reducing the influence of the lateral decision-making component. The trajectory of the other vehicle is designed according to the acceleration and speed distribution characteristics of a human driver acting in a real scenario. The longitudinal control of the vehicle is manual, via the accelerator pedal. For this scenario, two types of accidents that occur frequently at intersections are selected: one is when the other vehicle stops near the conflict area at the intersection and suddenly starts moving; the second is when the other vehicle decelerates sharply at the intersection.

Experimental Design
The setup is designed as a conflict scenario, where the test vehicle travels straight and the other one turns left. At the beginning, both vehicles stand 30 m away from the intersection. The moment the two vehicles cross the intersection is regarded as the end of the test.
Track design of surrounding vehicles: in order to make the trajectory of the surrounding vehicles closer to the driving behavior of real drivers, three driving trajectories with distinct characteristics were extracted from real vehicle data and used as the trajectory input of the opposite vehicle. Based on the speed data of the opposite vehicle, three different speed curves are selected. As shown in Figure 6, the blue curve represents approaching at a near-constant speed and passing mildly through the intersection, the black curve represents accelerating radically while passing through the intersection, and the red curve represents rapid deceleration and avoidance. In order to extract human sensitivity to danger, we recruited 7 participants (2 women and 5 men), all of whom had at least 2 years of driving experience. The collected driver information is shown in Table 2. We conducted experiments with a driving simulator. Driver inputs were obtained from the steering wheel and pedals of the driving simulator, and a total of 74 groups of valid data were collected, of which 56 groups were selected as the test set and 18 groups as the verification set.
According to the verification on the test set (Table 3), the coefficient of the collision feature in the dangerous area is significantly higher than in the safe and in-between environments. This result shows that, when the risk increases, people strongly prefer to act in favor of safety above anything else. In addition, in dangerous scenarios, people's urge for high speed is relatively lower: at an intersection, people often slow down to ensure safety, so the experimental results are in line with this reality. In the process of learning human behavior, the root mean square (RMS) error of the risk-sensitive [51] learning system (Figure 7), based on risk preference, is 2.52 km/h, while the error is controlled at about 10% for driving behavior with an average speed of 17 km/h.
The RMS error of the risk-neutral [52,53] learning system is 3.96 km/h, significantly higher than that of the risk-sensitive learning system, and its error exceeds the acceptable range of 15%. The experimental results show that the risk-sensitive learning system fits the behavior of human drivers better than the risk-neutral one, verifying the hypothesis that drivers exhibit risk-sensitive behavior at intersections.
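The RMS comparison above can be reproduced with a small helper. The speed profiles in the example are hypothetical; only the metric itself is taken from the text.

```python
import math

def rms_error(predicted, actual):
    """Root mean square error between a predicted speed profile and the
    corresponding human driving profile (both in km/h)."""
    assert len(predicted) == len(actual), "profiles must be sampled identically"
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

# Hypothetical profiles: a model predicting a constant 17 km/h against a
# human profile that dips to 14 km/h and recovers to 20 km/h.
print(rms_error([17.0, 17.0], [14.0, 20.0]))  # 3.0
```

Dividing the RMS value by the mean speed of the human profile gives the relative error figure the paper quotes (about 10% for the risk-sensitive system at an average of 17 km/h).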

Scenario 1: The Opposite Vehicle Not Giving Way
When there are vehicles passing by in front, humans usually choose the defensive strategy; that is, stop in front of the traffic flow and wait for all vehicles to pass. In this experiment, the effectiveness of the SMDP model is verified by comparing it to conventional strategies. In Figures 8-10, the gray curves are the average of the conventional predicted trajectories, the red line is the average of the human driving trajectories, the blue line is the average of the defensive behavior trajectories, and the shaded area shows the standard error. The ego vehicle adopts discrete action output, with a step size of 0.5 s and an output acceleration range of [−2, 0] m/s², at an acceleration interval of 0.5 m/s².
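A minimal sketch of this discrete action setup follows; the constant-acceleration rollout and the 8 m/s initial speed are illustrative assumptions, not the paper's full SMDP dynamics:

```python
import numpy as np

DT = 0.5  # decision step size (s)
# Discrete accelerations: [-2, 0] m/s^2 at an interval of 0.5 m/s^2.
ACTIONS = np.arange(-2.0, 0.0 + 1e-9, 0.5)

def step(speed, accel, dt=DT):
    """Advance the ego speed one decision step under a chosen discrete
    acceleration; speed cannot become negative (the vehicle stops)."""
    return max(0.0, speed + accel * dt)

# Illustrative rollout: brake at -2 m/s^2 from 8 m/s until stopped
# in front of the traffic flow.
speed = 8.0
profile = [speed]
while speed > 0.0:
    speed = step(speed, -2.0)
    profile.append(speed)
print(profile)
```

With a 0.5 s step, the hardest braking action removes 1 m/s of speed per decision, so the defensive stop from 8 m/s takes eight steps (4 s).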

As shown in Figure 8, the predicted vehicle speed curve is roughly consistent with the real scenario, in line with the common sense of human behavior: in the traffic-flow scenario, the vehicle slows down gradually and stops in front of the traffic flow. As shown in Figures 9 and 10, when the opposite vehicle adopts an acceleration of 0.25 m/s² or keeps a constant speed, the predicted vehicle speed curve is also roughly consistent with the real scenario, following the common sense of human behavior.

Scenario 2: The Opposite Vehicle Giving Way
In the design of Scenario 2, the other vehicle suddenly decelerates (−0.5 m/s²) at a distance from the intersection, but it is impossible to judge whether it can stop before reaching the conflict area, which means that its intention is unclear. In this case, both defensive and cooperative strategies are executed. In Figures 11 and 12, the gray trajectories in the experimental plots are the average of the conventional prediction trajectories, the red trajectory is the average of the human driving trajectories, the blue trajectory is the average of the cooperative behavior trajectories, and the shaded area shows the standard error. As shown in Figure 11b, the curve fitting has a large error, while the curves of the cooperative behavior and the conventional behavior coincide. Whether compared to the human driver or to the predicted curve, the error is relatively large. The reason is that, when the deceleration is 0.5 m/s², the driver easily switches between defensive and cooperative behavior while determining the intention of the other vehicle; the behavior strategy proves ambiguous, resulting in a large deviation of the final result. Therefore, the prediction curve error for the driver of the vehicle is large.
As shown in Figure 12, the other vehicle suddenly decelerates sharply with a deceleration of −1 m/s². According to the estimation of uniform acceleration motion, the other vehicle can stop before reaching the conflict area, which shows that it has a clear intention. In this case, the defensive behavior is executed at the same time. The comparison of the deceleration curves at −1 m/s² and −0.5 m/s² demonstrates that, when the deceleration is −1 m/s², the cooperative behavior fits the human driver data more accurately. This may occur because the intention of the other vehicle is clearer when the deceleration is −1 m/s², so the hierarchical behavior can more accurately fit the human driving behavior. As shown in Figure 12b, the curve fitting error is small, while the gap between the curve of cooperative behavior and the curve of conventional behavior is large. The speed curve predicted for the vehicle is roughly consistent with the real scenario: when facing the scenario of the other vehicle giving way, the vehicle chooses to accelerate when approaching the conflict area.
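The intention estimate above (can the other vehicle stop before the conflict area under uniform deceleration?) can be sketched as follows; the function names, the intention labels, and the example numbers are illustrative assumptions:

```python
def stopping_distance(speed, decel):
    """Distance needed to stop under uniform deceleration
    (speed in m/s, decel as a positive magnitude in m/s^2)."""
    return speed ** 2 / (2.0 * decel)

def classify_intention(speed, decel, dist_to_conflict):
    """Classify the opposite vehicle's intention: 'yielding' if it can
    stop before the conflict area, 'unclear' if it is braking but may
    not stop in time, 'not_yielding' if it is not braking at all."""
    if decel <= 0.0:
        return "not_yielding"
    if stopping_distance(speed, decel) <= dist_to_conflict:
        return "yielding"
    return "unclear"

# Opposite vehicle at 8 m/s, 40 m from the conflict area:
# braking at 0.5 m/s^2 needs 64 m (may not stop in time),
# braking at 1.0 m/s^2 needs 32 m (stops in time).
print(classify_intention(8.0, 0.5, 40.0), classify_intention(8.0, 1.0, 40.0))
```

This mirrors the observation in the two sub-cases: the gentle −0.5 m/s² braking leaves the intention ambiguous, while the sharp −1 m/s² braking makes it clear.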

Real Vehicle Scenario Verification
A BYD Tang is used as the mobile data acquisition platform, equipped with the sensor layout shown in Figure 13. This specific platform is also used in [54].
The sensors loaded on the platform are as follows: a Velodyne HDL-32E lidar, an OxTS Inertial+ GNSS/INS suite, Mako cameras and an industrial computer.

The driving action of the vehicle, as generated by the semi-Markov model, is compared to that of the human driver in a real scenario, in order to verify the effectiveness of the model. This experiment sets two characteristic scenarios for verification: one is the scenario of multiple traffic flows ahead, and the other is the scenario of the opposite vehicle turning left and avoiding the test vehicle.

Scenario I
When a large amount of traffic flows ahead as the test vehicle drives along, human drivers usually choose the defensive strategy, that is, stop in front of the traffic flow and wait for it to pass. During the experiment, conventional strategies are used for comparison, to verify the effectiveness of the model. The black curve in Figure 14 is the speed curve of the actual vehicle, while the red curve is the predicted speed curve. The test vehicle adopts discrete action output, with a step of 0.5 s and an output acceleration range of [−2, 0] m/s², at an acceleration interval of 0.5 m/s².

As shown in Figure 14, the predicted speed curve of the vehicle is roughly consistent with the real scenario. When facing the traffic-flow scenario, the vehicle chooses to slow down and stop in front of the traffic flow. Compared to conventional behavior, defensive driving behavior shows three characteristics: (1) Defensive behavior, similarly to human behavior, has a turning point in the middle of deceleration. The turning points of the falling speed curves are roughly consistent, indicating that humans follow a rapid deceleration process when approaching the conflict area; this shows that the learned defensive behavior is based on the defensive characteristics of human driving. (2) Defensive behavior achieves the defense effect by setting a safe area. The conventional behavior, by comparison, still does not slow down to 0 when approaching the conflict area, which poses a great risk of collision within the conflict area; this indicates that setting a safe area can indeed achieve the effect of ensuring safety.
(3) As shown in Figure 15, the blue vehicle is the test vehicle (ego vehicle) and the green vehicle is the opposite vehicle. Compared with defensive behavior, conventional behavior does not learn such behavior through the reward function; it finally comes quite close to the other vehicle and is in an extremely dangerous situation.

Scenario II
When there is a vehicle waiting to pass on the left, human drivers usually choose the cooperative strategy, accelerating to pass.
As shown in Figure 16, the predicted speed curve of the vehicle is roughly consistent with the real scenario. When facing the scenario of avoiding the opposite vehicle, the test vehicle chooses to accelerate when approaching the conflict area. Compared to conventional driving behavior, cooperative driving behavior exhibits three characteristics: (1) Similar to human behavior, cooperative behavior has a turning point in the middle of acceleration. The turning points of the rising speed curves are roughly consistent, indicating that humans follow a rapid acceleration process when approaching the conflict area; this reflects the human response to the other vehicle's avoidance behavior and shows that the learned cooperative behavior is based on the cooperative characteristics of human driving. (2) Cooperative behavior, in fact, means leaving the conflict area after 3 s. Conventional behavior is not based on the reward function; on the contrary, in order to avoid being too close to the obstacle, the vehicle stays in a state of deceleration and gives way to the opposite vehicle, so it does not try to understand the intention of the opposite vehicle. (3) As shown in Figure 17, the blue vehicle is the test vehicle (ego vehicle) and the green vehicle is the opposite vehicle. The conventional behavior, compared to cooperative behavior, is not developed using the reward function constraining the vehicle to leave the conflict area for 3 s; it finally ends up quite far away from the opposite vehicle. In an environment with large traffic flow, stopping arbitrarily at the intersection, in a way that other drivers do not understand, puts the vehicle in an extremely dangerous situation.
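The scenario-dependent choice between the two behaviors can be sketched as a small dispatch table; the intention labels and candidate tuples are illustrative names, not the paper's exact SMDP options (the paper evaluates both behaviors when the intention is unclear):

```python
# Hypothetical mapping from the estimated intention of the opposite
# vehicle to the candidate high-level behaviors of the ego vehicle.
STRATEGY = {
    "not_yielding": ("defensive",),                # stop and wait
    "unclear":      ("defensive", "cooperative"),  # both are evaluated
    "yielding":     ("cooperative",),              # accelerate and pass
}

def candidate_behaviors(intention):
    """Return the behaviors considered for an intention; unknown
    intentions fall back to the defensive behavior."""
    return STRATEGY.get(intention, ("defensive",))

print(candidate_behaviors("yielding"), candidate_behaviors("unclear"))
```

Defaulting unknown intentions to the defensive behavior matches the safety preference learned from the human data.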
To verify the effectiveness of our method, one of the state-of-the-art IRL methods, presented in [40], is selected for comparison. To measure the similarity between human-like systems and human drivers, Human Likeness (HL) is defined by [40]: HL is the L2 norm between the position at the end of the ground-truth trajectory and that of the closest prediction among the three most likely trajectories.

$$\mathrm{HL} = \min_{i \in \{1,2,3\}} \left\| \xi_i(L) - \xi_{gt}(L) \right\|_2$$

where ξ_i (i = 1, 2, 3) are the selected trajectories with the highest probabilities, ξ_gt is the ground-truth trajectory driven by the human driver, and L is the end of the time horizon. Applying this concept to this paper, the human likeness becomes:

$$\mathrm{HL} = \left\| \zeta(L) - \xi_{gt}(L) \right\|_2$$

where ζ is the trajectory generated by the proposed method.

Another indicator, the root mean square error (RMSE), can be used to measure the stability of different methods; a higher RMSE means that the method is less stable. The RMSE is calculated by:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^2}$$

where d_i (i = 1, 2, 3, ...) is the deviation of the i-th measurement from the true value and n is the number of measurements.
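The two metrics can be sketched directly from their definitions; the trajectory data below are illustrative (x, y) positions, not the paper's measurements:

```python
import numpy as np

def human_likeness(candidates, ground_truth):
    """HL: L2 distance between the end position of the ground-truth
    trajectory and the closest end position among the candidate
    trajectories (the three most likely predictions in [40])."""
    end_gt = np.asarray(ground_truth, dtype=float)[-1]
    return min(float(np.linalg.norm(np.asarray(c, dtype=float)[-1] - end_gt))
               for c in candidates)

def rmse(deviations):
    """Root mean square error of a set of deviations d_i."""
    d = np.asarray(deviations, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

# Illustrative (x, y) trajectories in meters.
gt = [(0, 0), (5, 0), (10, 1)]
preds = [[(0, 0), (5, 0), (12, 1)],
         [(0, 0), (4, 0), (10, 3)]]
print(human_likeness(preds, gt), rmse([2.0, 2.0, 2.5]))
```

For the proposed method, which produces a single trajectory ζ, the same helper applies with a one-element candidate list, `human_likeness([zeta], gt)`.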
As can be seen in Figure 18a, the HL values of our method are very low, mostly around 2, except for the fourth set of data. The fourth set is a scenario in which the opposite vehicle decelerates to a stop and gives way, while the ego vehicle accelerates through the intersection; since there is no restriction on the speed of the ego vehicle other than the speed limit, the error is relatively large. As shown in Figure 18b, the RMSE of the proposed method (2.356) is 12.1% lower than the RMSE of the general modeling (2.681) in [40], which shows that our proposed method is closer to human behavior and has better stability. In addition, the scenario in [40] is a straight road, where only interactions with vehicles in the same direction are considered, while this paper focuses on the intersection, with vehicles moving in different directions; decision making at the intersection is much more complicated than that in [40].

Conclusions
This paper obtains human driving data based on the "BYD Tang" autonomous driving platform of the Intelligent Vehicle Research Institute of Beijing Institute of Technology. It analyzes the driving characteristics of human drivers through actual driving data and addresses the key decision-making problems faced by autonomous vehicles at intersections. The contributions of this paper are listed below:

(1) A human-like behavior decision-making method is proposed for intersection scenarios, developed based on inverse reinforcement learning and a semi-Markov model. The proposed method learns well from demonstration trajectories of human experts, generating human-like trajectories; (2) The proposed model combines the driving intentions of other vehicles and the ego vehicle, and uses the semi-Markov model to map intentions directly to driving strategies, which adapts to the intention ambiguity of other vehicles in real scenes. Meanwhile, it abstracts the interaction between other vehicles and the ego vehicle into defensive behavior and cooperative behavior, taking different behaviors in different scenarios; (3) Compared to traditional methods, the proposed method learns well from human demonstration data and is more in line with human driving behavior. The analysis results show that the method has good stability in addition to accuracy.
Although this work has achieved some positive results, there are still areas that can be further improved. First, the discretization of the action space may introduce errors in the behavior results. Second, this paper focuses on the test vehicle and its surrounding vehicles, without considering other traffic participants, such as pedestrians and non-motorized vehicles. In future studies, other traffic participants will be considered, to establish a more complete scenario model. At the same time, sampling methods will be considered, so as to reduce the error caused by discretization.