Reinforcement Learning-Based Vehicle Control in Mixed-Traffic Environments with Driving Style-Aware Trajectory Prediction

Zhang, Xiaopeng; Wang, Lin; Zhang, Yipeng; Feng, Zewei

doi:10.3390/su172410889

Open AccessArticle

Reinforcement Learning-Based Vehicle Control in Mixed-Traffic Environments with Driving Style-Aware Trajectory Prediction

by

Xiaopeng Zhang

^*,

Lin Wang

,

Yipeng Zhang

and

Zewei Feng

The State Key Laboratory of Intelligent Transportation System, The Research Institute of Highway Ministry of Transport, Beijing 100088, China

^*

Author to whom correspondence should be addressed.

Sustainability 2025, 17(24), 10889; https://doi.org/10.3390/su172410889 (registering DOI)

Submission received: 21 October 2025 / Revised: 25 November 2025 / Accepted: 25 November 2025 / Published: 5 December 2025

(This article belongs to the Special Issue Intelligent Transportation Systems for Sustainable Transportation Management)

Download

Browse Figures

Versions Notes

Abstract

The heterogeneity of human driving styles in mixed-traffic environments manifests as divergent decision-making behaviors in complex scenarios like highway merging. By accurately recognizing these driving styles and predicting corresponding trajectories, autonomous vehicles can enhance safety, improve traffic efficiency, and concurrently achieve fuel savings in highway merging scenarios. This paper proposes a novel framework wherein a clustering algorithm first establishes statistical priors of driving styles. These priors are then integrated into a Model Predictive Control (MPC) model that leverages Bayesian inference to generate a probability-aware trajectory prediction. Finally, this predicted trajectory is embedded as a component of the state input to a reinforcement learning agent, which is trained using an Actor–Critic architecture to learn the optimal control policy. Experimental results validate the significant superiority of the proposed framework. Under the most challenging high-density traffic scenarios, our method boosts the evaluation reward by 11.26% and the average speed by 10.08% compared to the baseline Multi-Agent Proximal Policy Optimization (MAPPO) algorithm. This advantage also persists in low-density scenarios, where a steady 10.60% improvement in evaluation reward is achieved. These findings confirm that the proposed integrated approach provides an effective decision-making solution for autonomous vehicles, capable of substantially enhancing interaction safety and traffic efficiency in emerging mixed-traffic environments.

Keywords:

driving style recognition; multi-agent reinforcement learning; Bayesian inference; uncertainty perception; eco-driving

1. Introduction

Due to the limitations in the development of autonomous driving technology, the traffic system is expected to remain in a long-term mixed state of human driving and autonomous driving. Since human driving exhibits heterogeneity in driving behavior, in response to complex traffic scenarios, drivers of varying driving styles are likely to exhibit divergent behavioral patterns. This discrepancy poses a significant challenge for autonomous driving systems. Autonomous vehicles are typically trained on standardized datasets, whereas the behavior of human drivers in real-world conditions is inherently uncertain and stochastic. Consequently, their actions often exhibit significant deviations from the normative patterns encapsulated in the training data. In mixed-traffic environments, this gap can impair an autonomous vehicle’s ability to timely recognize a human driver’s style and anticipate their maneuvers. Such a failure in anticipation may lead to traffic accidents or force the autonomous vehicle into making sudden evasive maneuvers, which in turn compromises its safety, degrades overall traffic efficiency, and increases energy consumption.

Human drivers do not merely react passively to their environment; they proactively manage uncertainty. For instance, when their line of sight is obstructed, a driver’s decision to decelerate is not solely to avoid a known obstacle but is, more fundamentally, an act of information seeking to rule out latent, unobserved risks [1]. This behavior exemplifies a proactive strategy for uncertainty resolution that necessitates precise modeling. This trend signifies a paradigm shift within the field: from modeling mere behavioral kinematics to modeling the cognitive agent—the driver. Researchers have begun to classify driver behaviors into distinct styles (e.g., “aggressive”, “cautious” or “eco-friendly”) [2], with the goal of predicting a vehicle’s next action based on the driver’s intrinsic profile [3]. For an autonomous vehicle, this approach is critical for generating robust predictions in novel or unseen scenarios, as a driver’s inherent disposition is often a reliable predictor of their future reactions [4].

Driving style heterogeneity not only influences safety and traffic efficiency but also exerts a measurable impact on fuel consumption and CO₂ emissions [5]. Aggressive drivers, who frequently accelerate and brake, create sharp fluctuations in power demand, thereby increasing instantaneous fuel consumption and emission rates. In contrast, smoother and more anticipatory driving behaviors contribute to energy savings and emission reduction [6]. Consequently, enabling autonomous vehicles to recognize and classify human driving styles allows decision-making systems to jointly optimize safety, efficiency, and environmental sustainability through carbon-emission-aware behavioral control [7].

In mixed-traffic environments, a fundamental challenge lies in interpreting and adapting to diverse human driving styles that embody different risk preferences, cognitive traits, and interaction patterns. Existing research on autonomous driving and driver behavior modelling has made significant advances, but several critical gaps remain. Behavioural representation remains constrained: most methods rely heavily on low-dimensional kinematic features and neglect richer attributes such as perception, intention changes, and driver risk-preference. For example, Fang et al. [8] note that current behavioural-intention prediction datasets are still limited to simple labels and fail to capture latent cognitive and risk dimensions.

The central challenge is achieving a dynamic balance between fuel economy and emission constraints while maintaining safe and efficient traffic flow. Prior efforts, such as the Dynamic Programming (DP)–based fuel-optimal trajectory planning of Wang et al. [9], and the emission-aware Model Predictive Control (MPC) strategy developed by Bakibillah A. S. M. et al. [10], demonstrate the feasibility of jointly optimizing fuel consumption and emission outcomes. While MPC offers strong physical priors and safety guarantees, deep learning models often operate independently, leading to state representations that omit physical constraints. Norouzi et al. [11] review ML-MPC integration in automotive systems and highlight that embedding MPC within neural training remains underexplored. Similarly, Mensing F. et al. [12] and Deng J. et al. [13] formulated eco-driving control strategies that incorporate emission penalties or multi-objective constraints to enhance energy efficiency. These studies suggest a clear trend: autonomous driving systems evolve toward unified optimization frameworks capable of harmonizing safety, efficiency, and environmental sustainability in complex human–machine traffic ecosystems.

At the decision-optimization level, recent studies attempt to extract the underlying behavior-generation mechanisms directly from human driving data and transfer them into autonomous driving frameworks, primarily through inverse reinforcement learning (IRL). For instance, Qiu et al. [14] utilized maximum-entropy IRL to analyze car-following behaviors in the NGSIM dataset, successfully capturing cut-in tendencies while balancing multiple driving objectives. in dynamic and safety-critical driving scenarios, decision-making frameworks often face an imbalanced multi-objective optimisation problem: safety and efficiency objectives tend to be treated in isolation rather than coordinated within a unified structure. Finally, although driver behaviour strongly influences fuel consumption and CO₂ emissions, many current models overlook the behaviour–emission coupling, undermining the sustainability of learned driving strategies. Xing et al. [15] demonstrate an initial attempt at “energy-aware deep learning for driving behaviour” but highlight that most behaviour models still neglect emission outcomes.

Despite notable progress in driving-style modeling and interactive decision-making, several key limitations persist:

Limited behavioral representation: Existing models rely mainly on low-dimensional kinematic features and fail to capture richer behavioral attributes such as perception, intention, and risk preference.
Insufficient integration of MPC and learning-based methods: Model Predictive Control (MPC) and data-driven approaches are often treated separately. The lack of MPC embedding within neural training leads to state representations that omit physical priors, thereby reducing prediction accuracy and decision reliability.
Imbalanced multi-objective optimization: In dynamic and safety-critical scenarios, decision-making frameworks struggle to balance safety and efficiency, often optimizing one objective at the expense of the other due to the absence of a coherent coordination mechanism.
Neglect of environmental factors: Driving behavior strongly influences fuel consumption and CO₂ emissions, yet current models rarely account for behavior–emission coupling, undermining the sustainability of learned strategies.

To address the above issues, this study proposes an overall framework consisting of four parts, as illustrated in Figure 1: traffic scenario selection and data analysis, Bayesian–MPC predictive control, Actor–Critic reinforcement learning decision-making and comprehensive evaluation. The details are as follows:

Data-driven driving style analysis: In the offline phase, K-means++ clustering is applied to naturalistic driving datasets to extract representative behavioral patterns. Two typical driving styles—aggressive and cautious—are identified based on key motion and interaction features. These clusters serve as interpretable prior distributions for subsequent Bayesian reasoning, providing a structured behavioral foundation that captures inter-driver diversity and supports knowledge transfer to the online learning stage.
Bayesian–MPC prediction and control: During online interaction, the Bayesian inference module dynamically updates the posterior probabilities of each driving style using real-time indicators such as speed, acceleration, and headway distance. This probabilistic reasoning quantifies the behavioral uncertainty of surrounding vehicles. The posterior probabilities are then integrated into a Model Predictive Control (MPC) framework that combines behavioral uncertainty with physical vehicle dynamics, enabling foresighted trajectory prediction and robust control adjustments under varying traffic conditions.
Actor–Critic decision-making: The Actor–Critic reinforcement learning module takes the Bayesian–MPC outputs as adaptive inputs or weighting factors, guiding the decision process under uncertain multi-agent interactions. The actor network generates behavior-aware control actions, while the critic evaluates their long-term values by balancing safety, comfort, and efficiency. This probabilistic conditioning allows the policy to dynamically adapt to the most probable driver style, ensuring global optimality, robustness, and interpretability of decision-making.
Evaluation and validation: The effectiveness of the proposed framework is verified using Post-Encroachment Time (PET), average speed, reward, and collision rate.

The remainder of this paper is organized as follows: Section 2 focuses on the methods and fundamental concepts of driving style recognition and trajectory prediction. Section 3 provides a detailed description of the multi-agent reinforcement learning model constructed in this study. Section 4 presents the experiments, results, and analysis. Finally, Section 5 concludes the paper.

2. Basic Methods

To capture these latent behavioral categories, this paper adopts the K-means++ clustering algorithm to extract representative driving patterns from the NGSIM US-101 trajectory dataset. As an unsupervised learning method, K-means++ offers two advantages that align with the target of driving-style abstraction. It avoids subjective labeling bias by allowing natural grouping of driving behaviors based on intrinsic feature similarity. Its optimized initialization enhances cluster robustness, improving convergence stability in large-scale trajectory data.

2.1. Driving Style Extraction Based on the K-Means ++ Algorithm

This study utilizes the publicly available NGSIM US-101 dataset, which provides continuous vehicle trajectory data—including position, velocity, acceleration, and headway—for driving behavior analysis, as shown in Table 1.

Wu et al. [16] combined K-means and D-S evidence theory decision methods to perform driving-style cluster analysis. Dörr [17] and Aljaafreh et al. [18] used fuzzy logic to design a driving style recognition system and realized online recognition of driving styles. Although K-means clustering has been widely used for driving style analysis, its performance remains sensitive to the initialization of cluster centers, leading to unstable and sometimes suboptimal clustering results [19]. To improve clustering robustness, this study adopts the K-means++ algorithm, which optimizes centroid initialization by selecting the first centroid

c_{1}

randomly and choosing subsequent centroids based on a distance-weighted probability distribution. The specific parameters and their descriptions are presented in Table 2.

After initialization, each sample is assigned to the nearest centroid

(c_{1}, c_{2}, \dots, c_{k})

according to the minimum Euclidean distance criterion, expressed as:

d_{i j} = | | z_{i} - c_{j} | |

(1)

d_{i j} (z_{i}, c_{j}) < d_{i m} (z_{m}, c_{m})

(2)

where

d_{i j} (z_{i}, c_{j})

denotes the Euclidean distance between sample

z_{i}

and cluster

c_{j}

.

c_{j}^{*} = \frac{\sum_{z_{i} \in s_{i}} z_{i}}{| s_{j} |}

(3)

Calculate the quality center

c_{j}^{*}

of each cluster

s_{j}

as the new cluster center. This process is repeated iteratively until centroids

(c_{1}, c_{2}, \dots, c_{k})

no longer change or the maximum number of iterations is reached, at which point the algorithm converges.

The Davies–Bouldin Index (DBI) was employed to evaluate the average similarity among clusters. A smaller DBI indicates stronger separability among clusters and improved clustering performance. When DBI approaches zero, overlap between clusters is minimized and boundaries are more distinct, indicating an ideal cluster structure.

D B I = \frac{1}{k} \sum_{i = 1}^{k} \max_{j \neq i} \frac{S_{i} + S_{j}}{M_{i j}}

(4)

where

S_{i}

represents the average intra-cluster dispersion of cluster

i

, and

M_{i j}

represents the distance between the centroids of clusters

i

and

j

.

Additionally, the Calinski–Harabasz (CH) index was used to assess clustering quality. Its principle is based on the ratio between inter-cluster dispersion and intra-cluster compactness. A higher CH index implies greater separation between clusters and stronger cohesion within clusters, reflecting superior clustering performance. The calculation is given by:

C H = \frac{T r (B_{k}) / (k - 1)}{T r (W_{k}) / (n - k)}

(5)

where

T r (B_{k})

is the trace of the between-cluster covariance matrix,

T r (W_{k})

is the trace of the within-cluster covariance matrix,

k

is the number of clusters, and

n

is the total number of samples. The covariance matrices are computed as:

W_{k} = \sum_{q = 1}^{k} \sum_{x \in c_{q}} {(x - c_{q}) (x - c_{q})}^{T}

(6)

B_{k} = \sum_{q = 1}^{k} n_{q} {(c_{q} - c_{e}) (c_{q} - c_{e})}^{T}

(7)

where

n_{q}

is the number of samples in cluster

q

,

c_{q}

is the centroid of cluster

q

, and

c_{e}

is the global centroid of all samples.

2.2. Behavior Prediction Based on Bayesian Model

Traditional MPC-based methods have been extended to adaptively tune their weight matrices under changing driving conditions. For instance, Chang et al. [20] introduced a fuzzy rule-based mechanism for real-time MPC weight optimization, improving both tracking accuracy and ride comfort. Similarly, Pang et al. [21] employed a fuzzy inference system to adapt MPC parameters to dynamic environments, while Tian et al. [22] demonstrated that curvature-aware weight adjustment enhanced stability in high-speed scenarios. Building on these, Liu et al. [23] integrated risk assessment into MPC to achieve adaptive shared control, and Liang et al. [24] proposed a multi-MPC coordination strategy for extreme situations. These methods remain primarily rule-driven, relying on handcrafted adaptation mechanisms that limit their scalability and generalization.

To better address uncertainty in surrounding vehicle behaviors, this study employs a Bayesian inference framework to provide probabilistic state predictions for the MPC module [25]. Velocity and time headway are selected as key behavioral features, and the posterior distribution

P (k | x_{t})

is computed using Bayes’ theorem based on historical observations, enabling short-term prediction of driving tendencies. These probabilistic predictions are then incorporated into MPC, which leverages its ability to handle dynamic constraints and optimize future trajectories, thereby achieving more anticipatory and stable motion planning [26].

P (k | x_{t}) = \frac{π_{k} N (x_{t} | μ_{k}, \sum_{k})}{\sum_{j = 1}^{K} π_{j} N (x_{t} | μ_{j}, \sum_{j})}

(8)

where

μ_{k}

and

\sum_{k}

denote the mean and variance of key behavioral variables corresponding to style.

Building on the posterior distribution of driving styles, MPC was further incorporated. By leveraging Bayesian inference to predict the future trajectories of all vehicles within the observation range, the ego vehicle can make proactive decisions and perform path planning in advance, thereby avoiding emergency braking or sudden lane changes and ensuring driving safety. As shown in Figure 2 the specific process is described as follows:

The MPC fuses sensor observation data and applies filtering to obtain the current state estimation of the ego vehicle and its surrounding vehicles.
The current state is then fed into the MPC prediction model, which performs receding-horizon simulation over a finite horizon $H$ , generating candidate control sequences. These sequences include multiple potential trajectories of the ego vehicle as well as probabilistic motion distributions of surrounding vehicles [27].
Candidate sequences that would lead to collisions or violate dynamic constraints are eliminated. The remaining safe sequences are evaluated through an objective function.
The optimal control sequence that minimizes the cost function is solved and transmitted as control commands to the vehicle for execution. This iterative process is repeated in real time, thereby achieving dynamic optimization.

2.3. Driving Style-Aware Trajectory Prediction and MPC Optimization

The decision planning of autonomous vehicles relies on accurate prediction of the future behaviors of surrounding vehicles. Since the primary influencing factor of vehicle behavior is the current kinematic state rather than complex long-term driver intentions, an efficient prediction can be achieved using a simplified kinematic model. Let the ego vehicle’s state at time

t

be

s_{t}^{i} = {[x_{t}^{i}, y_{t}^{i}, φ_{t}^{i}, v_{t}^{i}]}^{T}

, where

{(x}_{t}^{i}, y_{t}^{i})

denote position,

v_{t}^{i}

velocity, and

φ_{t}^{i}

the heading angle [28]. The optimal control sequence over a horizon

H

is denoted by:

U^{*} = {u_{t}, u_{t + 1}, \dots, u_{t + H - 1}}

(9)

where MPC optimization generates the optimal trajectory

X^{*} = {x_{t + 1 | t}, \dots, x_{t + H | t}}

.The specific values and definitions of the variables are shown in Table 3:

Based on a simplified constant yaw rate and constant acceleration model, the kinematic state evolves as:

\{\begin{matrix} φ_{t + k - 1}^{i} = φ_{t + k}^{i} + {\dot{φ}}_{t}^{i} Δ t \\ v_{t + k + 1}^{i} = v_{t + k}^{i} + {\dot{a}}_{b}^{i} Δ t \\ x_{t + k + 1}^{i} = x_{t + k}^{i} + {\dot{v}}_{t + k}^{i} \cdot c o s (φ_{t + k}^{i}) Δ t \\ y_{t + k + 1}^{i} = y_{t + k}^{i} + {\dot{v}}_{t + k}^{i} \cdot s i n (φ_{t + k}^{i}) Δ t \end{matrix}

(10)

For

k = 0,1, \dots, H - 1

,where

{\dot{φ}}_{t}^{i}

is the currently observed yaw rate and

{\dot{a}}_{b}^{i}

is the acceleration input. The trajectory point sequence is

{\hat{T}}_{b}^{i} = {s_{t + 1}^{i}, s_{t + 2}^{i}, \dots, s_{t + H}^{i}}

.

However, driving style significantly affects the distribution of acceleration. Drivers with different styles select distinct values for mean acceleration, headway, and lane-changing frequency, which in turn influence trajectory prediction [29]. To address this, a Bayesian network is employed to identify the driving styles of vehicles within the observation

a_{t}^{i}

range in real time. By utilizing a set of short-term behavioral features as observational evidence and combining them with the posterior probability distribution

P (k | x_{t})

of the background vehicle’s driving style, each driving style corresponds to a specific acceleration adjustment strategy and a parameterized driver model with its parameter set

Θ_{j}

, thereby generating a style-specific acceleration

a_{j}^{i} = f (s_{t}^{i}, Θ_{j})

. For each driving-style cluster we estimate a style-specific parameter set

Θ_{j}

by fitting a linearized car-following function

a_{j}^{i} = f (s_{t}^{i}, Θ_{j}) = θ_{j, 1} v_{i, t} + θ_{j, 2} Δ v_{i, t} + θ_{j, 3} h_{i, t} + θ_{j, 4}

(11)

where

Θ_{j} = {θ_{j, 1}, θ_{j, 2}, θ_{j, 3}, θ_{j, 4}}

is the parameter set characterizing driving style j. Consequently, the future acceleration of a vehicle is no longer constant but rather an expected acceleration

a_{j}^{i}

based on driving style probability:

E [a_{t}^{i}] = \sum_{j = 1}^{M} P (k | x_{t}) \cdot a_{j}^{i}

(12)

The expected acceleration is introduced into the kinematic model to obtain probability-aware trajectory prediction:

\{\begin{matrix} v_{t + k + 1}^{i} = v_{t + k}^{i} + E [a_{t}^{i}] Δ t \\ x_{t + k + 1}^{i} = x_{t + k}^{i} + {\dot{v}}_{t + k}^{i} \cdot \cos (φ_{t + k}^{i}) Δ t \\ y_{t + k + 1}^{i} = y_{t + k}^{i} + {\dot{v}}_{t + k}^{i} \cdot \sin (φ_{t + k}^{i}) Δ t \end{matrix}

(13)

The trajectory point sequence has been updated to

{\hat{T}}_{p}^{i} = {s_{t + 1}^{i}, s_{t + 2}^{i}, \dots, s_{t + H}^{i}}

, which means that the MPC now sees the most likely predicted trajectory that integrates the uncertainty of the opponent’s driving style.

Ultimately, the optimization objective function of MPC will be calculated based on this more accurate predicted trajectory:

J (U) = \sum_{k = 1}^{H} {| | x_{t + k | t} - x_{t + k}^{r e f} | |}_{Q} + \sum_{k = 0}^{H - 1} {| | u_{t + k} | |}_{R} + \sum_{k = 1}^{H} J_{o b s} (x_{t + k | t}, {\hat{T}}_{p}^{i})

(14)

Among them, the collision cost

J_{o b s}

depends on the probability-aware predicted trajectory

{\hat{T}}_{p}^{i}

, thereby more accurately assessing future collision risks.

2.4. Carbon Emission Modeling

Building upon driving style recognition and trajectory prediction, the dynamic behavioral characteristics of a vehicle—particularly its speed and acceleration—have a direct impact on both energy consumption and carbon emissions. Different driving styles can alter the patterns of speed fluctuation and acceleration distribution, thereby shifting the engine operating point and influencing the energy conversion efficiency, which ultimately leads to notable differences in emission levels [30]. To quantitatively characterize this coupling relationship between driving behavior and emissions, it is necessary to develop a carbon emission model that reflects the dynamic features of driver behavior.

Vehicle carbon emissions are primarily influenced by operational and environmental factors, such as driving speed and travel distance. For quantitatively assess the environmental impact of autonomous driving decisions and incorporate sustainability objectives into the optimization process, this study adopts the VT-Micro Microscopic Emission Model proposed by Ahn et al. [31] as the core computational module for energy consumption and carbon emission analysis. The VT-Micro model uses instantaneous speed and acceleration as inputs and establishes a nonlinear mapping between instantaneous fuel consumption rate and kinematic states through polynomial regression equations derived from extensive real-world vehicle experiments.

In this way, the model accurately captures the influence of driving behavior on energy use and emissions without requiring detailed engine parameters. Its functional form can be expressed as:

\log (\dot{E} (t)) = \sum_{i = 0}^{n} \sum_{j = 0}^{m} β_{i j} {v (t)}^{i} {a (t)}^{j}

(15)

where

\dot{E} (t)

denotes the instantaneous emission rate, and

β_{i j}

represents empirically calibrated regression coefficients. Based on the parameter ranges inferred from relevant literature [32,33], where

β_{s p e e d}

typically falls within 0.02–0.05 g/m and

β_{a c c}

within 0.005–0.03 g/m, we select

β_{s p e e d} = 0.035 g / m

and

β_{a c c} = 0.015 g / m

in accordance with our experimental environment and traffic settings. By discrete integration, the total emission over a driving cycle can be computed as:

E_{t o t a l} = \sum_{t = 0}^{T - 1} \dot{E} (t) Δ t

(16)

By integrating the VT-Micro model into the decision loop, the autonomous driving system can directly link behavioral choice to its corresponding emission consequence at the microscopic level [5]. This enables the controller to explicitly avoid high-emission maneuvers such as aggressive acceleration and oscillatory car-following, thereby achieving smoother trajectories with lower fuel consumption and CO₂ output. In this way, carbon-aware decision-making becomes an inherent outcome of the control process rather than a post-evaluation metric, providing a practical pathway for autonomous driving to enhance both safety and environmental sustainability [34].

3. Materials and Methods

The driving scenario considered in this study is a freeway straight road, where cooperative interactions among vehicles are modeled as a Partially Observable Markov Decision Process (POMDP). In this framework, each Connected and Autonomous Vehicle (CAV) can only access local observations of the environment and makes decisions based on reinforcement learning policies. The POMDP can be represented as a tuple:

P O M D P = (S, O, A, P, R, γ, N)

(17)

where

S

is the state space set, describing the global traffic environment;

O = O_{1}, O_{2}, {\dots, O}_{N}

is the local observation set, where each agent perceives only part of the environment.;

A = A_{1}, A_{2}, {\dots, A}_{N}

is the action space set, representing the set of available actions for all agents.

P (s^{'} | s, a)

denotes the joint state transition probability of moving from state

s

to

s^{'}

executing action

a

;

R = {R_{i}}

is the set of reward functions, where

R_{i}

represents the reward by agent

i

receives after executing action

a_{i}^{t}

in the global state.

γ

is the discount factor that weighs the importance of future rewards relative to current ones; and

N

is the number of autonomous vehicles. Each agent is equipped with an independent Actor network and Critic network.

3.1. State Space

The state

s_{i}

of connected autonomous vehicle

i

is defined as a matrix of dimensions

N_{i} \times W

, where

N_{i}

is the number of vehicles within the perception range and

W

is the number of vehicle state features. Each agent’s local state includes the following features:

E x i s t e n c e

: whether nearby vehicles are observed.

{P o s i t i o n}_{x}

: Longitudinal position of surrounding vehicles relative to the ego vehicle.

{P o s i t i o n}_{y}

: Lateral position of surrounding vehicles relative to the ego vehicle.

V_{x}

: Longitudinal velocity of surrounding vehicles relative to the ego vehicle.

V_{y}

: Lateral velocity of surrounding vehicles relative to the ego vehicle.

H e a d i n g

: Ego vehicle’s heading angle.

A g g

: The probability that the vehicle belongs to the aggressive driving style

C a u

: The probability that the vehicle belongs to the cautious driving style.

The ego vehicle’s perception range is set to [−100 m, 100 m]. Each vehicle can observe up to eight surrounding vehicles, including the leading and following vehicles in its own lane, as well as those in adjacent lanes. Therefore, the decision-making process of each autonomous vehicle is modeled as a POMDP.

3.2. Action Space

In the driving decision-making of connected autonomous vehicles, the action space

a^{t} = {{a}_{1}^{t}, a_{2}^{t}, \dots, a_{N}^{t}}

represents the set of feasible actions at each time step, including lane change (left/right), cruising, acceleration, and deceleration. The action set is defined as:

a_{i}^{t} \in A = {L a n e_l e f t, L a n e_r i g h t, C r u i s i n g, A c c e l e r a t e, D e c e l e r a t e}

(18)

All vehicles’ actions collectively form a high-dimensional joint action space

A = {A_{1}, A_{2}, \dots, A_{N}}

. In the decision-making process, reinforcement learning algorithms first learn high-level driving policies, while the low-level controller converts decision signals into steering and longitudinal control commands to drive the vehicle.

3.3. Reward

The reward function serves to assess an agent’s feedback when executing actions in specific states, thereby enabling each CAV to optimize its behavior for safe and efficient highway travel while proactively anticipating the driving styles of surrounding vehicles [35]. Ultimately, the goal is to improve overall traffic flow efficiency under the constraints of safety and comfort.

Consistent with prior work, the collision penalty was assigned the highest weight to ensure that safety dominates the learning process and prevents agents from exploiting unsafe high-efficiency behaviors [36,37]. The speed-related weight is higher for the aggressive driving style and lower for the cautious one, reflecting established distinctions between efficiency-oriented and risk-averse behaviors [38]. Comfort-related terms were given moderate weights, as suggested in previous studies indicating that comfort should influence decisions without overriding safety. Headway-related weights were set above comfort to maintain adequate following distance, consistent with multi-objective car-following models emphasizing Time-to-Collision (TTC) and spacing stability [39]. A moderately weighted eco-driving term can be incorporated into the model to account for energy-related objectives without compromising safety or traffic performance [40]. Accordingly, the reward function was formulated, and the associated weights were assigned following these principles, as presented in Equation (19) and Table 4.

\{\begin{matrix} R_{θ} = ω_{1} R_{c} {+ ω}_{2} R_{s} + ω_{3} R_{c o m} + ω_{4} R_{h d} + R_{e c o} \\ R_{c} = \{\begin{matrix} - 1 if collision \\ 0 others \end{matrix} \\ R_{s} = \{\begin{matrix} \frac{v_{i} - v_{m i n}}{v_{m a x} - v_{m i n}} if v_{m i n} < v_{i} < v_{m a x} \\ 0 others \end{matrix} \\ R_{c o m} = {(\frac{Δ a}{a_{m a x}})}^{2} \\ R_{h d} = \{\begin{matrix} - \log \frac{d_{h e a d w a y}}{t_{h} v_{t}}, v > 0 \\ 0, v = 0 \end{matrix} \\ R_{e c o} = - \frac{\dot{m} (v, a)}{m_{m a x}} \end{matrix}

(19)

where

d_{h e a d w a y}

is the current distance to the preceding vehicle, and

t_{h}

is the minimum safe headway threshold,

\dot{m} (v, a)

is the instantaneous fuel consumption rate depending on vehiclevelocity

v

and acceleration

a

,

m_{m a x}

is the maximum fuel consumption rate.

Accordingly,

R_{c}

denotes the collision penalty term,

R_{s}

is the speed-efficiency reward,

R_{c o m}

represents the comfort term,

R_{h d}

corresponds to headway maintenance, and

R_{e c o}

denotes the eco-driving term based on the VT-Micro fuel consumption model.

3.4. Actor–Critic Network

The Actor–Critic architecture is employed to improve learning efficiency and stability in complex continuous control tasks. In this framework, the Actor network determines the action to be executed based on the current state, while the Critic network evaluates the action-value, facilitating more effective policy optimization. The Actor learns a policy

π_{θ} (a | s)

, parameterized by

θ

, which outputs a probability distribution over possibleactions

a_{t}

given the current state

s_{t}

, as detailed in Figure 3.

After the Actor learns the policy

π

, the Critic is responsible for learning the action-value function to evaluate the long-term value

Q_{π} (s^{'}, a^{'})

of executing a specific action in the current state. The agent selects actions according to the Actor’s policy, interacts with the environment, observes the next state, and obtains rewards. The Critic defines the action-value function

Q_{π} (s, a)

as:

Q_{π} (s, a) = r + γ Q_{π} (s^{'}, a^{'})

(20)

The core task of the Critic is to evaluate the performance of the current policy by minimizing the estimation error of the value function. The advantage function is defined as:

A_{π} (s, a) = Q_{π} (s, a) - V_{π} (s)

(21)

V_{π} (s) = \sum_{a} π (a | s) Q_{π} (s, a)

(22)

Here,

V_{π} (s)

represents the expected cumulative discounted reward that an agent can obtain when following policy

π

and executing all possible actions in state

s

. The discount factor

γ

is applied to balance the importance of future rewards relative to immediate rewards. The advantage function

A_{π} (s, a)

measures the extent to which action

a

in state

s

outperforms or underperforms the average action under the current policy. A positive

A_{π} (s, a)

indicates superiority, while a negative value indicates inferiority. Therefore, the advantage function effectively identifies high-value actions, reduces estimation variance, and promotes efficient policy optimization.

3.5. MAPPO Algorithm

The MAPPO algorithm [41] is a policy gradient (PG)-based method that introduces a novel surrogate objective function to achieve mini-batch updates, effectively mitigating the sensitivity of traditional PG algorithms to learning rates and the difficulty of setting step sizes. The algorithm is derived from Trust Region Policy Optimization (TRPO) and incorporates a clipping mechanism to construct the surrogate objective function. We employ an Actor network

π_{θ}

to approximate the policy function and a Critic network

V_{φ}

to approximate the value function, where

θ

and

φ

are the network parameters [42].

L^{C L I P} (θ) = E_{t} [m i n (ρ_{t} (θ), c l i p (ρ_{t} (θ), 1 - ϵ, 1 + ϵ)) {\hat{A}}_{t}]

(23)

where

ρ_{t} (θ)

denotes the ratio between the new policy and the old policy, and clipping is applied to prevent instability caused by large policy updates.

{\hat{A}}_{t}

represents the Generalized Advantage Estimator (GAE), which calculates the advantage of a specific action relative to the average policy performance. Its formula is:

{\hat{A}}_{t} \approx r (s, a) + γ V_{φ} (s_{t + 1}) - V_{φ} (s_{t})

(24)

The Critic network parameters

φ

are updated by minimizing the following loss function:

L^{V F} (φ) = E {[V_{φ} (s) - y (t)]}^{2}

(25)

where

y (t) = r_{t} + γ V_{φ} (s_{t + 1})

denotes the target value. The overall structure of the MAPPO algorithm is illustrated in Figure 4.

4. Results

4.1. Driving Style Clustering and Analysis

The CH index and DBI were used to evaluate clustering quality. Based on these metrics, driving styles were effectively divided into two categories—aggressive and cautious. As shown in Table 5, when K = 2, CH = 234.94 and DBI = 0.48. When K = 3, CH = 214.31 and DBI = 0.87, with blurred cluster boundaries and unstable centroids, failing to improve style separability and reducing interpretability. Therefore, K = 2 was selected as it provided the clearest and most statistically distinct classification.

Figure 5 illustrates the clustering results of the driving style experiment using the K-means++ algorithm. Each point represents a sample of vehicle behavior, with colors indicating two primary driving style groups: aggressive (red) and cautious (blue). The scatter plots reveal clear separation patterns in the relationships among speed, acceleration, and time headway. The clustering centers provide quantitative insights into the driving styles:

Aggressive driving style: speed = 46.83 km/h, acceleration =0.45 m/s², and time headway = 1.53 s.
Conservative driving style: speed = 41.80 km/h, acceleration = 0.39 m/s², and time headway = 2.46 s.

From the clustering centers, the differences between driving styles are evident. Aggressive drivers maintain higher speeds (=46.83 km/h), slightly higher acceleration (=0.45 m/s²), and shorter time headways (=1.53 s), indicating a preference for efficiency, faster travel, and closer following. Conservative drivers travel at lower speeds (=41.80 km/h) with smoother acceleration (=0.39 m/s²) and longer headways (=2.46 s), reflecting a focus on safety, stability, and comfort.

To further characterize the behavioral differences across driving styles, this study employs Gaussian distributions to model key variables, including vehicle speed, acceleration, and headway. The probability density function is in the form of:

f (x | μ, σ^{2}) = \frac{1}{\sqrt{2 π σ^{2}}} e^{- \frac{{(x - μ)}^{2}}{2 σ^{2}}}

(26)

Gaussian distributions can capture both the central tendency and the stochastic variability of these variables. For aggressive driving behavior, vehicle speed follows a Gaussian distribution with a mean of 46.83 km/h and a variance of 5.54. Acceleration has a mean of 0.45 m/s² and a variance of 0.38, while headway exhibits a mean of 1.53 s and a variance of 0.36. These statistics indicate that aggressive drivers generally maintain higher speeds, respond with more rapid acceleration, and keep shorter but more consistent following distances. In contrast, conservative driving behavior is characterized by a vehicle speed mean of 41.80 km/h with a variance of 6.95, an acceleration means of 0.39 m/s² with a variance of 0.40, and a headway mean of 2.46 s with a variance of 0.73. These features suggest that conservative drivers tend to maintain lower speeds, exhibit smoother acceleration patterns, and keep longer yet more variable following distances.

Overall, the clustering centers highlight coherent behavioral patterns: aggressive drivers prioritize speed and dynamic maneuvers under acceptable risk, while conservative drivers adopt more cautious, stable driving strategies. These distinctions capture the heterogeneity of driving styles and provide a basis for assigning prior probabilities in subsequent Bayesian inference of driver behavior.

4.2. Scene Verification

4.2.1. Experimental Design

We designed a 1 km three-lane highway scenario, where the perception range of each autonomous vehicle was set to 100 m. According to different traffic densities, two groups of experimental conditions were established. The configurations of Connected and Autonomous Vehicles (CAVs) and Human-Driven Vehicles (HDVs) in each scenario were as follows. The highway traffic simulation scenario is illustrated in Figure 6.

Low-density scenario: 6–10 CAVs and 6–10 HDVs with identical driving styles were randomly generated on the road, with vehicles traveling at random speeds.
High-density scenario: 12–16 CAVs and 12–16 HDVs with identical driving styles were randomly generated on the road, with vehicles traveling at random speed [43].

During training, all algorithms were trained for a total of 1 million steps. Model evaluation was performed every 200 training episodes, with ten independent evaluation runs executed at each evaluation point; the reported average reward is computed over these 200-episode intervals. In the simulation environment, initial vehicle speeds were sampled uniformly in the range 25–27 m/s, and a random perturbation in [−1.5, 1.5] m was added to initial vehicle positions to improve realism. The decision-making frequency was set to 5 Hz. Experiments were conducted on a Windows 10 workstation equipped with an NVIDIA A100 (40 GB) GPU (Nvidia, Santa Clara, CA, USA) and an Intel(R) Core (TM) Ultra 9 285K processor (Intel, Santa Clara, CA, USA); the implementation uses Python 3.8, while other network parameters are listed in Table 6.

4.2.2. Risk Assessment

To quantify potential interaction risks, this study adopts the PET as the core metric. PET measures the time interval between two vehicles passing the same conflict point in succession, where a smaller PET indicates a higher collision likelihood [44]. Compared with TTC, PET is more robust under trajectory uncertainty and better suited for assessing conflicts in complex multi-vehicle interactions. By integrating PET into the prediction–decision loop, autonomous vehicles can evaluate short-term collision risks in advance, enabling proactive and safer maneuver planning. When the Euclidean distance between two vehicles is less than their average vehicle length, it is determined that there is a conflict risk, and the PET calculation is triggered:

‖p_{i} - p_{j}‖ < \frac{L_{i} + L_{j}}{2}

(27)

where

p_{i}

and

p_{j}

represent the positions of the two vehicles, respectively,

L_{i}

and

L_{j}

represent their lengths. Subsequently, the PET is calculated based on the time difference of the two vehicles passing through the conflict point:

P E T = t_{i} - t_{j}

(28)

where

t_{i}

and

t_{j}

respectively represent the time when vehicles

i

and

j

pass through the conflict point.

In Figure 7a, in the low-density scenario, the proposed method achieved an average PET of 1.4 s, outperforming both Multi-Agent Advanced Actor–Critic (MAA2C, 1.2 s) and Multi-Agent Actor–Critic using Kronecker-Factored Trust Region (MAACKTR, 0.8 s). Since PET directly reflects the temporal safety margin at potential conflict points, its improvement provides quantitative evidence that our model more effectively suppresses short-term collision risk. The larger PET indicates that the ego vehicle preserves greater safety margins during interactions, highlighting the advantage of using PET as a risk-assessment metric.

By incorporating driving-style recognition and trajectory prediction, the proposed method enables the ego vehicle to anticipate surrounding maneuvers earlier, thereby reducing motion uncertainty and avoiding last-moment evasive actions [45,46]. This leads to smoother interaction strategies and more proactive risk mitigation, which ultimately enlarges the safety buffer and further validates the effectiveness of the enhanced prediction–decision framework.

4.2.3. Emissions Assessment

To quantitatively evaluate the carbon emission performance under different autonomous driving decision strategies, this study establishes a set of microscopic-level emission evaluation metrics based on an instantaneous emission rate model. Let the distance time steps be

t = 0,1 \dots T - 1

with a sampling interval of

Δ t

, and denote the instantaneous emission rate as

\dot{E} (t)

. The total emissions for a single trip can be expressed as:

E_{t o t a l} = \sum_{t = 0}^{T - 1} \dot{E} (t) Δ t

(29)

The total distance traveled during the trip is computed as:

D_{t o t a l} = \sum_{t = 0}^{T - 1} v (t) Δ t

(30)

From the total emissions and total distance, the per-kilometer emission is calculated as:

E_{p e r k m} = \frac{E_{t o t a l}}{D_{t o t a l} / 1000}

(31)

This set of metrics allows for a consistent comparison of different autonomous driving strategies, linking instantaneous vehicle behavior to aggregate emission performance before further graphical and statistical analysis.

Figure 8 shows the variations in average emission rate (mg/s) and the corresponding carbon emission per kilometer (g/km), reflecting both instantaneous emission intensity and overall trip efficiency. Under the proposed method, emissions decrease from 126.44 mg/s to 120.0 mg/s (8.01 g/km) in low-density traffic, and from 120.96 mg/s to 110.82 mg/s (9.11 g/km) in high-density traffic, corresponding to reductions of 5.1% and 8.4%, respectively. These results indicate that the strategy effectively lowers carbon emissions by smoothing acceleration and regulating longitudinal control, with greater benefits under high-density conditions where frequent stop–go maneuvers amplify emission fluctuations.

4.2.4. Bayesian Inference Assessment

To evaluate the effectiveness of the Bayesian inference model in short-term prediction tasks, six sets of experiments were conducted with prediction horizons of 1 s, 2 s, and 3 s. The model integrates both ego-vehicle states and surrounding vehicle dynamics, including relative speed, distance, and lane position. By employing a Bayesian inference module, the method explicitly models the uncertainty of the predicted trajectory distribution during the inference phase, enabling probabilistic forecasting of future positions.

The prediction accuracy is quantitatively assessed using the Root Mean Square Error (RMSE) and Final Displacement Error (FDE) [47], defined as follows:

R M S E = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {| | {\hat{p}}_{i} - p_{i} | |}^{2}}

(32)

F D E = {| | \hat{p}}_{T} - p_{T} | |

(33)

where

\hat{p}

and

p_{i}

denote the predicted and ground-truth positions at time step

i

, and

T

is the final prediction horizon.

As shown in Figure 9, both RMSE and FDE increase gradually as the prediction horizon extends from 1 s to 3 s, reflecting the accumulation of predictive uncertainty over time. Under low-density conditions, the mean RMSE increases from 0.673 m (1 s) to 1.352 m (3 s), while FDE grows from 0.867 m to 1.850 m. This corresponds to an overall growth factor of 2.06× in RMSE, indicating a moderate accumulation of position deviation with time.

In contrast, under high-density conditions, the mean RMSE rises from 0.783 m to 1.868 m, and FDE from 0.923 m to 2.216 m, yielding a larger growth factor of 2.58×. This suggests that in dense traffic, not only are absolute errors higher, but the rate of error accumulation is also significantly faster. Such behavior highlights the nonlinear amplification effect of vehicle-to-vehicle interactions and trajectory uncertainty when surrounding vehicles’ motion becomes strongly coupled.

Although the RMSE values clearly increase with the prediction horizon, the magnitude of growth between RMSE-3s and RMSE-1s suggests that long-term errors are not merely an accumulation of short-term uncertainty. Instead, the larger deviation observed beyond 3 s likely stems from discrete driving decisions such as lane-changing and merging conflicts, which introduce abrupt trajectory deviations that cannot be captured by continuous kinematic prediction alone. This indicates that beyond short-range horizons, the stochasticity of interactive behaviors becomes the dominant source of prediction error.

Overall, even within a 3 s forecast horizon, the proposed Bayesian inference model maintains the mean RMSE and FDE within approximately 1.5–2.0 m across different densities, demonstrating robust generalization and credible short-term forecasting accuracy in complex multi-agent environments.

4.2.5. Ablation Study

To assess the effectiveness of the proposed method, we performed an ablation study in which the MPC module was removed from the framework. By comparing this ablated variant with the complete method, we evaluate the isolated contribution of MPC to overall performance in terms of efficiency, safety, and robustness across different traffic scenarios.

Overall Performance: As shown in Figure 10a, the average reward curves during training reveal that our method exhibits a relatively gradual improvement in the early stages, but after approximately 8000 episodes, the average reward rapidly increases to 74.25 and finally stabilizes at 83.30. In contrast, the baseline MAPPO converges to only 75.31, representing an improvement of about 10.6%. As shown in Figure 10b, in high-density scenarios, our method achieves an average reward of 58.28, exceeding the baseline MAPPO’s 52.38 by 11.26%. Moreover, our method demonstrates superior performance stability. In the low-density scenario, our method achieved an average speed of 25.52 m/s, compared with 23.03 m/s for baseline MAPPO (an improvement of 10.81%). In the high-density scenario, although average speeds of both methods declined to around 22 m/s due to congestion, our method exhibited a smaller standard deviation, indicating more stable driving performance.

These results clearly indicate that the integration of the Bayesian–MPC prediction module significantly improves agent performance, with reward values enhanced by approximately 10% compared with baseline MAPPO. Notably, under higher traffic density and increased environmental complexity, our method continues to maintain strong performance, demonstrating superior robustness and adaptability.

Safety Analysis: To assess safety, we conducted 30 randomized scenario tests across different traffic densities and measured the collision rate, defined as the proportion of collision steps relative to the total steps. Results are shown in Table 7. Our method achieved collision rates of 0.00 (low density) and 0.01 (high density), while baseline MAPPO achieved 0.00 under low density but rose to 0.03 in high density. We also evaluated robustness in heterogeneous traffic scenarios containing HVs.

In summary, simulation experiments and analysis confirm that our method effectively improves driving safety across multiple traffic scenarios and demonstrates strong generalization ability in high-density and complex environments.

4.2.6. Algorithm Comparison

In this subsection, we employ two representative multi-agent reinforcement learning algorithms, MAACKTR and MAA2C, as comparative baselines to evaluate the proposed framework. The comparison across different traffic scenarios enables a comprehensive assessment of safety, efficiency, comfort, and carbon-emission performance.

The proposed method exhibits clear advantages in safety, efficiency, and robustness across both low- and high-density traffic conditions. As shown in Figure 11, performance improvements are consistent when compared with MAACKTR and MAA2C. In low-density scenarios, our method achieves a high evaluation reward of 83.30, outperforming MAACKTR and MAA2C by 40.9% and 31.4%, respectively, while maintaining a zero-collision rate. The average speed increases to 25.52 m/s, representing over 10% improvement relative to both baselines.

In high-density scenarios, our method similarly attains the highest evaluation reward (58.28), exceeding MAACKTR by 7.11% and MAA2C by 16.62%, while reducing the collision rate to 0.01. The average speed also increases to 22.72 m/s, corresponding to 10.08–15.15% gains over MAACKTR and MAA2C. These results indicate that our approach demonstrates stronger foresight and more stable control under dense interactions.

Beyond quantitative improvements, the learned policies reveal interpretable behavior patterns: style-aware priors guide smoother acceleration, more consistent following distances, and reduced unnecessary lane changes, contributing to enhanced safety and driving comfort. Supported by the comparative metrics summarized in Table 7, these findings confirm that the proposed Bayesian style-aware framework not only delivers superior performance but also translates into safer, more efficient, and more human-aligned driving strategies in mixed-traffic environments.

By leveraging driving style recognition and the rolling optimization mechanism of MPC, the proposed method can proactively identify potential conflicts and avoid high-risk situations, leading to substantial safety improvements even in complex, high-density environments. Beyond safety, the style-aware reward function effectively constrains acceleration fluctuations and jerk, ensuring smoother motion profiles and enhanced ride comfort despite tighter control constraints. Furthermore, the integration of multi-step prediction and behavioral priors enables the policy to maintain stable performance under diverse interaction patterns, thereby enhancing robustness and generalization capability.

Collectively, these results confirm that the proposed approach achieves a balanced advancement in safety, comfort, and robustness by unifying predictive optimization with adaptive behavior modeling.

5. Conclusions

This paper presents an integrated autonomous driving decision-making framework that unifies Bayesian driving style inference, risk-aware trajectory prediction, and multi-agent cooperative reinforcement learning within a closed-loop “Identification–Prediction–Decision” architecture. By embedding Bayesian trajectory inference into the MPC module, the framework enhances multi-step motion prediction under uncertainty, enabling vehicles to anticipate potential conflicts and maintain safe, smooth trajectories. The recognition of heterogeneous driving styles provides high-level semantic priors that guide adaptive policy updates, improving decision robustness in complex mixed-traffic environments. Furthermore, through smoothing acceleration profiles and suppressing abrupt maneuvers, the proposed method effectively balances safety and comfort while reducing carbon emissions. This integration of predictive reasoning, behavioral semantics, and ecological awareness demonstrates a comprehensive advancement in safety, robustness, and sustainability for autonomous driving systems.

Despite the promising results, the prediction accuracy, particularly RMSE, tends to degrade over longer horizons, indicating the need for more advanced sequential models such as LSTM for future trajectory prediction [44]. Additionally, the current study mainly considers simple lane-change scenarios, and future work will extend to more complex settings involving intersecting lanes and denser traffic interactions.

Author Contributions

Methodology, X.Z.; project administration, L.W. and Y.Z.; supervision, L.W., X.Z. and Z.F.; writing—original draft, X.Z.; writing—review and editing, Y.Z.; software, Z.F. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, grant number 2022YFB4300400.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed at the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Engström, J.; Wei, R.; McDonald, A.D.; Garcia, A.; O’Kelly, M.; Johnson, L. Resolving uncertainty on the fly: Modeling adaptive driving behavior as active inference. Front. Neurorobot. 2024, 18, 1341750. [Google Scholar] [CrossRef]
Azadani, M.N.; Boukerche, A. Driving behavior analysis guidelines for intelligent transportation systems. IEEE Trans. Intell. Transp. Syst. 2021, 23, 6027–6045. [Google Scholar] [CrossRef]
Raiyn, J.; Weidl, G. Predicting autonomous driving behavior through human factor considerations in safety-critical events. Smart Cities 2024, 7, 460–474. [Google Scholar] [CrossRef]
Chu, H.; Zhuang, H.; Wang, W.; Na, X.; Guo, L.; Zhang, J.; Gao, B.; Chen, H. A review of driving style recognition methods from short-term and long-term perspectives. IEEE Trans. Intell. Veh. 2023, 8, 4599–4612. [Google Scholar] [CrossRef]
Guo, L.; Sun, M.; Hu, Y.; Chen, H. Optimization of Fuel Economy and Emissions Through Coordinated Energy Management for Connected Diesel Vehicles. IEEE Trans. Intell. Veh. 2023, 8, 3593–3604. [Google Scholar] [CrossRef]
Singh, M.; Dubey, R.K. Deep Learning Model Based CO₂ Emissions Prediction Using Vehicle Telematics Sensors Data. IEEE Trans. Intell. Veh. 2023, 8, 768–777. [Google Scholar] [CrossRef]
Sun, Y.; Hu, Y.; Zhang, H.; Chen, H.; Wang, F.-Y. Transfer Methods for Vehicle Carbon Emission Models Based on the Parallel Transportation System. IEEE Trans. Intell. Transp. Syst. 2024, 25, 11664–11674. [Google Scholar] [CrossRef]
Fang, J.; Wang, F.; Xue, J.; Chua, T.-S. Behavioral intention prediction in driving scenes: A survey. IEEE Trans. Intell. Transp. Syst. 2024, 25, 8334–8355. [Google Scholar] [CrossRef]
Wang, J.; Yu, Q.; Li, S.; Ning, D.; Keqiang, L.I. Eco speed optimization based on real-time information of road gradient. J. Automot. Saf. Energy 2014, 5, 257. [Google Scholar]
Bakibillah, A.S.M.; Kamal, M.A.S.; Tan, C.P.; Hayakawa, T.; Imura, J. Eco-driving on hilly roads using model predictive control. In Proceedings of the 2018 Joint 7th International Conference on Informatics, Electronics & Vision (ICIEV) and 2018 2nd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Kitakyushu, Japan, 25–29 June 2018; pp. 476–480. [Google Scholar]
Norouzi, A.; Heidarifar, H.; Borhan, H.; Shahbakhti, M.; Koch, C.R. Integrating machine learning and model predictive control for automotive applications: A review and future directions. Eng. Appl. Artif. Intell. 2023, 120, 105878. [Google Scholar] [CrossRef]
Mensing, F.; Bideaux, E.; Trigui, R.; Ribet, J.; Jeanneret, B. Eco-driving: An economic or ecologic driving style. Transp. Res. Part C Emerg. Technol. 2014, 38, 110–121. [Google Scholar] [CrossRef]
Deng, J.; Polterauer, P.; del Re, L. Emission aware eco-driving on country roads. In Proceedings of the Dynamic Systems and Control Conference, Park City, UT, USA, 8–11 October 2019; American Society of Mechanical Engineers: New York, NY, USA, 2019; Volume 59148, p. V001T04A001. [Google Scholar]
Qiu, X.; Pan, Y.; Zhu, M.; Yang, L.; Zheng, X. Driving Style-aware Car-following Considering Cut-in Tendencies of Adjacent Vehicles with Inverse Reinforcement Learning. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Island, Republic of Korea, 2–5 June 2024; pp. 1329–1336. [Google Scholar] [CrossRef]
Xing, Y.; Lv, C.; Mo, X.; Hu, Z.; Huang, C.; Hang, P. Toward safe and smart mobility: Energy-aware deep learning for driving behavior analysis and prediction of connected vehicles. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4267–4280. [Google Scholar] [CrossRef]
Wu, Z.X.; He, Y.T.; Yu, L.J.; Fu, L.; Chen, P. Research on driving style recognition algorithm based on big data. Automob. Technol. 2018, 2018, 10–15. [Google Scholar]
Dörr, D.; Grabengiesser, D.; Gauterin, F. Online driving style recognition using fuzzy logic. In Proceedings of the IEEE 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), Qingdao, China, 8–11 October 2014; pp. 1021–1026. [Google Scholar] [CrossRef]
Aljaafreh, A.; Alshabatat, N.; Al-Din, S.M.N. Driving style recognition using fuzzy logic. In Proceedings of the 2012 IEEE International Conference on Vehicular Electronics and Safety (ICVES 2012), Istanbul, Turkey, 24–27 July 2012; pp. 460–463. [Google Scholar]
Liang, K.; Zhao, Z.; Li, W.; Zhou, J.; Yan, D. Comprehensive Identification of Driving Style Based on Vehicle’s Driving Cycle Recognition. IEEE Trans. Veh. Technol. 2023, 72, 312–326. [Google Scholar] [CrossRef]
Chang, G.; Suqin, Q. An adaptive MPC trajectory tracking algorithm for autonomous vehicles. In Proceedings of the 2021 17th International Conference on Computational Intelligence and Security (CIS), Chengdu, China, 19–22 November 2021; pp. 197–201. [Google Scholar]
Pang, F.; Luo, M.; Xu, X.; Tan, Z. Path tracking control of an omni-directional service robot based on model predictive control of adaptive neural-fuzzy inference system. Appl. Sci. 2021, 11, 838. [Google Scholar] [CrossRef]
Tian, Y.; Yao, Q.; Hang, P.; Wang, S. Adaptive coordinated path tracking control strategy for autonomous vehicles with direct yaw moment control. Chin. J. Mech. Eng. 2022, 35, 1. [Google Scholar] [CrossRef]
Liu, J.; Guo, H.; Song, L.; Dai, Q.; Chen, H. Driver-automation shared steering control for highly automated vehicles. Sci. China Inf. Sci. 2020, 63, 1–16. [Google Scholar] [CrossRef]
Liang, Y.; Li, Y.N.; Khajepour, A.; Ni, J.; Zheng, L. Holistic adaptive multi-model predictive control for the path following of 4WID autonomous vehicles. IEEE Trans. Veh. Technol. 2020, 70, 69–81. [Google Scholar] [CrossRef]
Huang, Y.; Jafari, M.A. Risk-aware vehicle motion planning using bayesian LSTM-based model predictive control. arXiv 2023, arXiv:2301.06201. [Google Scholar]
Li, L.; Zhao, W.; Wang, C. Cooperative merging strategy considering stochastic driving style at on-ramps: A bayesian game approach. Automot. Innov. 2024, 7, 312–334. [Google Scholar] [CrossRef]
Li, J.; Dai, B.; Li, X.; Xu, X.; Liu, D. A dynamic Bayesian network for vehicle maneuver prediction in highway driving scenarios: Framework and verification. Electronics 2019, 8, 40. [Google Scholar] [CrossRef]
Wang, W.; Xi, J.; Li, X. Statistical Pattern Recognition for Driving Styles Based on Bayesian Probability and Kernel Density Estimation. IEEE Trans. Intell. Transp. Syst. 2021, 22, 2765–2776. [Google Scholar]
Shen, C.; Zhang, L.; Shi, B.; Ma, X.; Li, Y.; Hu, H. Human-Like Behavior Strategy for Autonomous Vehicles Considering Driving Styles; No. 2024-01-7046; SAE Technical Paper; SAE: Warrendale, PA, USA, 2024. [Google Scholar]
Huang, C.; Salehi, R.; Stefanopoulou, A.G. Intelligent Cruise Control of Diesel Powered Vehicles Addressing the Fuel Consumption Versus Emissions Trade-Off. In Proceedings of the 2018 Annual American Control Conference (ACC), Milwaukee, WI, USA, 27–29 June 2018. [Google Scholar]
Ahn, K. Microscopic Fuel Consumption and Emission Modeling. Ph.D. Thesis, Virginia Tech, Blacksburg, VA, USA, 1998. [Google Scholar]
Oduro, S.D.; Metia, S.; Duc, H.; Hong, G.; Ha, Q. Multivariate adaptive regression splines models for vehicular emission prediction. Vis. Eng. 2015, 3, 13. [Google Scholar] [CrossRef]
Le Cornec, C.M.; Molden, N.; van Reeuwijk, M.; Stettler, M.E. Modelling of instantaneous emissions from diesel vehicles with a special focus on NOx: Insights from machine learning techniques. Sci. Total Environ. 2020, 737, 139625. [Google Scholar] [CrossRef]
Incremona, G.P.; Polterauer, P. Design of a Switching Nonlinear MPC for Emission Aware Ecodriving. IEEE Trans. Intell. Veh. 2023, 8, 469–480. [Google Scholar] [CrossRef]
He, X.; Lou, B.; Yang, H.; Lv, C. Robust decision making for autonomous vehicles at highway on-ramps: A constrained adversarial reinforcement learning approach. IEEE Trans. Intell. Transp. Syst. 2022, 24, 4103–4113. [Google Scholar] [CrossRef]
Shalev-Shwartz, S.; Shammah, S.; Shashua, A. On a formal model of safe and scalable self-driving cars. arXiv 2017, arXiv:1708.06374. [Google Scholar]
Kahn, G.; Villaflor, A.; Pong, V.; Abbeel, P.; Levine, S. Uncertainty-aware reinforcement learning for collision avoidance. arXiv 2017, arXiv:1702.01182. [Google Scholar] [CrossRef]
Chen, J.; Zhao, C.; Jiang, S.; Zhang, X.; Li, Z.; Du, Y. Safe, efficient, and comfortable autonomous driving based on cooperative vehicle infrastructure system. Int. J. Environ. Res. Public Health 2023, 20, 893. [Google Scholar] [CrossRef]
ElSamadisy, O.; Shi, T.; Smirnov, I.; Abdulhai, B. Safe, efficient, and comfortable reinforcement-learning-based car-following for AVs with an analytic safety guarantee and dynamic target speed. Transp. Res. Rec. 2024, 2678, 643–661. [Google Scholar] [CrossRef]
Fan, Y.; Peng, J.; Wu, J.; Zhou, J.; Yu, S.; Ma, C. Eco-driving Strategy for Series Hybrid Electric Vehicle based-on Multi-Objective Deep Reinforcement Learning. IEEE Trans. Transp. Electrif. 2025, 11, 12381–12392. [Google Scholar]
Guo, Y.; Liu, J.; Yu, R.; Hang, P.; Sun, J. MAPPO-PIS: A Multi-Agent Proximal Policy Optimization Method with Prior Intent Sharing for CAVs’ Cooperative Decision-Making. arXiv 2024, arXiv:2408.06656. [Google Scholar]
Hu, H.; Lu, Z.; Wang, Q.; Zheng, C. End-to-end automated lane-change maneuvering considering driving style using a deep deterministic policy gradient algorithm. Sensors 2020, 20, 5443. [Google Scholar] [CrossRef]
Gan, Q.; Li, B.; Xiong, Z.; Li, Z.; Liu, Y. Multi-Vehicle Cooperative Decision-Making in Merging Area Based on Deep Multi-Agent Reinforcement Learning. Sustainability 2024, 16, 9646. [Google Scholar] [CrossRef]
Wu, S.; Tian, D.; Duan, X.; Zhou, J.; Zhao, D.; Cao, D. Continuous decision-making in lane changing and overtaking maneuvers for unmanned vehicles: A risk-aware reinforcement learning approach with task decomposition. IEEE Trans. Intell. Veh. 2024, 9, 4657–4674. [Google Scholar]
Lewis, F.L.; Vrabie, D.; Vamvoudakis, K.G. Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst. Mag. 2012, 32, 76–105. [Google Scholar]
Huang, X.; McGill, S.G.; Williams, B.C.; Fletcher, L.; Rosman, G. Uncertainty-aware driver trajectory prediction at urban intersections. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019. [Google Scholar]
Chen, X.; Zhang, H.; Zhao, F.; Hu, Y.; Tan, C.; Yang, J. Intention-Aware Vehicle Trajectory Prediction Based on Spatial-Temporal Dynamic Attention Network for Internet of Vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19471–19483. [Google Scholar] [CrossRef]

Figure 1. Integrated Framework for Driving Style Recognition and Decision-Making.

Figure 2. MPC schematic diagram.

Figure 3. Flow of Actor–Critic algorithm.

Figure 4. Framework of MAPPO.

Figure 5. Two-dimensional Feature Distribution Based on K-means++ Clustering.

Figure 6. Highway traffic simulation scenario.

Figure 7. Average PET Values of Different Algorithms under Two Modes. (a) PET in Low Density; (b) PET in High Density.

Figure 8. Comparison of emission rate and reduction under different strategies (a) Comparison chart of average emission rate and emission reduction rate (b) Comparison chart of emissions per unit mileage and emission reduction rate.

Figure 9. Comparison of Trajectory Prediction Performance Using RMSE and FDE Metrics (a) Low-Density Traffic Scenario; (b) High-Density Traffic Scenario.

Figure 10. Experimental Results Comparing with the MAPPO Model Reward Function (a) Low-Density Traffic Scenario; (b) High-Density Traffic Scenario.

Figure 11. Comparison of Reward Functions Comparison of the Reward Functions between the Proposed Method, MAA2C, and MAACKTR under Two Modes (a) Low-Density Traffic Scenario; (b) High-Density Traffic Scenario.

Table 1. Parameter Descriptions of the NGSIM US-101 Dataset.

Variable	Parameter Description
$V e h i c l e_I D$	Vehicle ID
$X$	Lateral Position
$Y$	Longitudinal Position
$V_V e l$	Instantaneous Velocity
$V_A c c$	Instantaneous Acceleration
$T i m e_h d w y$	Time headway

Table 2. Parameters and Descriptions of the K-means++ Clustering Algorithm.

Variable	Parameter Description
$c$	Cluster centroid
$k$	Total number of clusters
$z_{i}$	The i-th sample point
$s_{j}$	Set of samples belonging to the j-th cluster
d_ij	$Euclidean distance between sample z_{i}$ $and centroid c_{j}$
$D B I$	Davies–Bouldin Index
$M_{i j}$	$Similarity measure between clusters i$ $and j$
$C H$	Calinski–Harabasz Index
$T r (B_{k})$	$Trace of between - cluster covariance matrix B_{k}$
$T r (W_{k})$	$Trace of between - cluster covariance matrix W_{k}$
n	Total number of samples
B_k	Between-cluster covariance matrix
W_k	Within-cluster covariance matrix

Table 3. Parameters and variables of the model.

Variables	Definition
$s_{t}^{i}$	State vector of the observed vehicle
$x_{t}$	Lateral position of the vehicle
$y_{t}$	Longitudinal position of the vehicle
$v_{t}$	Instantaneous velocity of the vehicle
$φ_{t}$	Heading angle of the vehicle
$U^{*}$	Optimal control sequence over the future H steps
$X^{*}$	Optimal trajectory generated by MPC optimization
${\dot{φ}}_{t}^{i}$	Observed yaw rate
${\dot{a}}_{b}^{i}$	Baseline acceleration
${\hat{T}}_{b}^{i}$	Trajectory point sequence
$e_{t}^{i}$	A set of short-term behavioral features as observational evidence
$P (k \| x_{t})$	Posterior probability distribution of driving styles
$Θ_{j}$	Style-specific acceleration adjustment strategy and parameterized driver model
$a_{j}^{i}$	Style-specific acceleration
${\hat{T}}_{p}^{i}$	Updated trajectory point sequence
$J_{o b s}$	Collision cost
$J (U)$	Optimization objective function of MPC

Table 4. Reward Function Weights (

ω

) for Different Driving Styles.

Table 4. Reward Function Weights (

ω

) for Different Driving Styles.

Driving Style	$ω_{1}$	$ω_{2}$	$ω_{3}$	$ω_{4}$	$ω_{5}$
Aggressive	1.0	2.0	0.4	0.6	0.3
Cautious	0.7	4.0	1.0	1.2	0.6

Table 5. The various indices under different numbers of clusters.

K	2	3	4	5	6	7
CH Index	234.94	214.31	186.03	143.76	148.91	150.60
DBI	0.48	0.87	0.93	1.17	0.99	1.01

Table 6. Network Parameters.

Type	Parameter	Value
Training Settings	Experience Replay Pool	5000 episodes
	Batch processing	128 episodes
	Discount factor	0.99
Network Settings	Optimizer	RMS Prop optimizer
Network Settings	Learning rate	0.0001

Table 7. Comparison of Test Performance between the Proposed Method and Two State-of-the-Art Benchmark Methods.

Scene Mode	Indicator	Our Method	MAPPO	MAACKTR	MAA2C
Low-density scenarios	Evaluation reward	83.30	75.31 (10.6% ↑)	59.10 (40.9% ↑)	63.69 (31.4% ↑)
	Collision rate	0.00	0.00	0.01	0.01
	Average speed (m/s)	25.52	23.03 (10.81% ↑)	23.11 (10.43% ↑)	23.18 (10.09% ↑)
High-density scenarios	Evaluation reward	58.28	52.38 (11.26% ↑)	54.41 (7.11% ↑)	48.97 (16.62% ↑)
	Collision rate	0.01	0.03	0.02	0.01
	Average speed (m/s)	22.72	20.64 (10.08% ↑)	19.73 (15.15% ↑)	19.86 (14.40% ↑)

The upward arrow “↑” indicates the percentage improvement of Our Method over the corresponding baseline method.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, X.; Wang, L.; Zhang, Y.; Feng, Z. Reinforcement Learning-Based Vehicle Control in Mixed-Traffic Environments with Driving Style-Aware Trajectory Prediction. Sustainability 2025, 17, 10889. https://doi.org/10.3390/su172410889

AMA Style

Zhang X, Wang L, Zhang Y, Feng Z. Reinforcement Learning-Based Vehicle Control in Mixed-Traffic Environments with Driving Style-Aware Trajectory Prediction. Sustainability. 2025; 17(24):10889. https://doi.org/10.3390/su172410889

Chicago/Turabian Style

Zhang, Xiaopeng, Lin Wang, Yipeng Zhang, and Zewei Feng. 2025. "Reinforcement Learning-Based Vehicle Control in Mixed-Traffic Environments with Driving Style-Aware Trajectory Prediction" Sustainability 17, no. 24: 10889. https://doi.org/10.3390/su172410889

APA Style

Zhang, X., Wang, L., Zhang, Y., & Feng, Z. (2025). Reinforcement Learning-Based Vehicle Control in Mixed-Traffic Environments with Driving Style-Aware Trajectory Prediction. Sustainability, 17(24), 10889. https://doi.org/10.3390/su172410889

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

Reinforcement Learning-Based Vehicle Control in Mixed-Traffic Environments with Driving Style-Aware Trajectory Prediction

Abstract

1. Introduction

2. Basic Methods

2.1. Driving Style Extraction Based on the K-Means ++ Algorithm

2.2. Behavior Prediction Based on Bayesian Model

2.3. Driving Style-Aware Trajectory Prediction and MPC Optimization

2.4. Carbon Emission Modeling

3. Materials and Methods

3.1. State Space

3.2. Action Space

3.3. Reward

3.4. Actor–Critic Network

3.5. MAPPO Algorithm

4. Results

4.1. Driving Style Clustering and Analysis

4.2. Scene Verification

4.2.1. Experimental Design

4.2.2. Risk Assessment

4.2.3. Emissions Assessment

4.2.4. Bayesian Inference Assessment

4.2.5. Ablation Study

4.2.6. Algorithm Comparison

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI