Enhancing Mixed Traffic Stability with TD3-Driven Bilateral Control in Autonomous Vehicle Chains

Liu, Kan; Jiao, Pengpeng; Hong, Weiqi; Chen, Yue

doi:10.3390/su17114790


School of Civil and Transportation Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(11), 4790; https://doi.org/10.3390/su17114790

Submission received: 12 March 2025 / Revised: 14 May 2025 / Accepted: 16 May 2025 / Published: 23 May 2025

(This article belongs to the Special Issue Sustainable Intelligent Transportation: Cooperative Systems and Vehicle Automation)


Abstract

This study presents a TD3-driven Bilateral Control Model (TD3-BCM) aimed at improving the stability of mixed traffic flows in autonomous vehicle (AV) chains. By integrating deep reinforcement learning, TD3-BCM optimizes control strategies to reduce traffic oscillations, smooth speed and acceleration fluctuations, and enhance overall system performance. Stability analysis shows that TD3-BCM effectively suppresses traffic fluctuations, with system stability improving from 1.132 to 1.182 as AV penetration increases. At an AV penetration rate of 40%, TD3-BCM surpasses both Cooperative Adaptive Cruise Control (CACC) and traditional Bilateral Control Model (BCM) approaches in terms of traffic efficiency, safety, and energy use—raising trailing vehicle speed by 12.6%, shortening average headway by 19.0%, increasing Time-to-Collision (TTC) by 87.3%, and lowering fuel consumption by 14.8%. When AV penetration reaches 70%, fuel savings rise to 19.7%, accompanied by further improvements in both traffic stability and safety. TD3-BCM provides a scalable and sustainable solution for intelligent transportation systems, particularly in high-penetration AV environments, by significantly enhancing stability, operational efficiency, and road safety.

Keywords: TD3 algorithm; bilateral control model (BCM); mixed traffic dynamics; sustainable transportation; cooperative system; vehicle automation; energy management

1. Introduction

With the rapid advancement of autonomous driving technologies, road transportation systems are gradually transitioning from traditional human-driven dominance to increasingly integrated human–machine co-driving paradigms. Although fully autonomous driving remains the long-term goal, autonomous vehicles (AVs) and human-driven vehicles (HDVs) are expected to coexist for the foreseeable future, giving rise to highly heterogeneous and dynamically complex mixed traffic environments [1,2,3]. Within the broader goal of developing low-carbon, safe, and efficient transportation systems, a central challenge lies in effectively suppressing traffic disturbances while enhancing both operational stability and traffic flow efficiency under mixed traffic conditions [4,5].

Traffic disturbances are defined as deviations from steady-state conditions in the vehicle–road system, typically caused by individual driver behavior or external environmental fluctuations. Such disturbances typically manifest as spatiotemporal variations in vehicle speed, headway, and flow rate. Research in this area focuses on the mechanisms of generation, propagation, and attenuation of these disturbances, along with their combined impacts on traffic efficiency, safety, energy consumption, and associated carbon emissions [6,7]. Among these, traffic oscillations represent the most prevalent and disruptive form of disturbance in mixed traffic flows. These oscillations are typically characterized by periodic fluctuations in vehicle speed and spacing that propagate along vehicle platoons and amplify over time, resulting in reduced traffic efficiency, higher energy consumption, and increased safety risks [8,9,10].

Traffic stability refers to a system’s capacity to attenuate traffic disturbances and can be evaluated from three complementary perspectives: (1) Theoretical chain stability, derived from the linearization or frequency-domain analysis of car-following models, evaluates whether the AV penetration rate surpasses the critical threshold necessary to ensure system-wide stability [11,12]; (2) Cumulative damping ratio, which quantifies the decay of disturbances along the vehicle string based on acceleration energy ratios—where sustained values below 1 and a downstream-decreasing trend signify effective suppression [13,14]; (3) Traffic dynamics analysis, which employs spatiotemporal velocity maps and headway trajectories to visually assess the attenuation of traffic oscillations and validate the effectiveness of control strategies [10,15]. These three evaluation dimensions are complementary and collectively establish a robust framework for assessing the steady-state performance of control strategies in heterogeneous mixed traffic environments [16,17]. Prior research has demonstrated that optimizing traffic control strategies not only enhances system stability, reduces carbon emissions, and improves throughput and safety prediction, but also fosters the sustainable development of traffic systems across environmental, operational, and safety dimensions [18,19].

Traditional control strategies—such as classical car-following models based on unidirectional feedback, adaptive cruise control (ACC), and cooperative adaptive cruise control (CACC)—have improved individual vehicle responsiveness. However, they remain inadequate for suppressing multi-source disturbances and addressing behavioral heterogeneity in mixed traffic environments [20,21]. Bilateral control models (BCMs), which incorporate feedback from both leading and following vehicles, exhibit greater theoretical potential for mitigating disturbance propagation and enhancing chain stability [22]. However, most existing studies on BCM are confined to fully autonomous settings and do not address the heterogeneous dynamics inherent in AV–HDV coexistence.

In recent years, reinforcement learning (RL)—particularly its deep learning extension, deep reinforcement learning (DRL)—has demonstrated considerable promise in traffic control, owing to its strengths in policy optimization and adaptive learning within high-dimensional, uncertain environments [23,24]. Among DRL algorithms, the Twin Delayed Deep Deterministic Policy Gradient (TD3) has gained significant attention for its enhanced stability and suitability in continuous control tasks. In our previous work, we integrated TD3 into a bilateral control model to suppress traffic oscillations in fully autonomous environments, achieving notable improvements in both system stability and operational efficiency [25]. However, that framework was limited to homogeneous AV scenarios and did not address the behavioral heterogeneity and interaction complexity present in mixed traffic conditions—an issue this study seeks to resolve.

To this end, we propose a TD3-driven Bilateral Control Model (TD3-BCM) specifically designed for heterogeneous mixed traffic flows involving both autonomous vehicles (AVs) and human-driven vehicles (HDVs). This framework integrates the structural stability of BCM with the adaptive policy learning capabilities of TD3, and constructs a customized state space, action mechanism, and multi-objective reward function tailored to the dynamics of mixed traffic. The model aims to maintain system stability while simultaneously enhancing control efficiency and operational safety. The main contributions of this study are summarized as follows:

Theoretical Contribution: This study extends the existing TD3-BCM framework to heterogeneous mixed traffic scenarios involving both AVs and HDVs. By introducing AV penetration rate as a key parameter, we derive a chain stability criterion and identify a critical threshold $ϕ_{min}$ necessary for maintaining stable operation. Based on this, we develop a multi-dimensional evaluation framework that incorporates stability, traffic efficiency, safety, and energy conservation—thereby enriching the theoretical foundation of mixed traffic flow control.
Methodological Contribution: We construct a TD3-BCM control model suitable for heterogeneous vehicle platoons by embedding AV–HDV interaction states into the state–action–reward formulation of the DRL framework. Compared to previous BCM-DRL models designed for homogeneous AV scenarios, our approach demonstrates superior policy generalization and disturbance suppression, making it better suited for dynamic control tasks in real-world AV–HDV mixed environments.
Empirical Validation: Leveraging reconstructed NGSIM I-80 trajectory data and extensive numerical simulations, we evaluate the performance of TD3-BCM across varying AV penetration rates. The results indicate that, compared to conventional unidirectional CACC and fixed-gain BCM models, TD3-BCM significantly reduces the amplitude and frequency of traffic oscillations, improves speed stability and throughput, and concurrently lowers fuel consumption and collision risk. These findings confirm the model’s adaptability and practical relevance in future mixed traffic scenarios.

The remainder of this paper is organized as follows: Section 2 surveys prior research, covering the formation and sustainability impacts of traffic oscillations, the evolution of control models from classical car-following to bilateral control, and reinforcement-learning techniques for mixed-traffic management. Section 3 introduces the proposed methodology. It first revisits the traditional bilateral control model, then details the TD3-driven Bilateral Control Model—including the actor–critic architecture, target-policy smoothing, and clipped double Q-learning—and finally explains the state/action/reward formulation together with the training and evaluation workflow. Section 4 describes the simulation setup and reports the quantitative results. The stability study comprises a theoretical chain-stability analysis, cumulative damping-ratio verification, and traffic-dynamics visualization, whereas the performance study presents efficiency, safety, and energy-saving metrics. Section 5 concludes the paper by summarizing the key findings, discussing their implications for mixed-traffic systems, and outlining future research directions.

2. Related Work

2.1. Traffic Oscillations and Sustainability Impacts

Traffic oscillations—often observed as “stop-and-go waves”—constitute a prevalent form of instability in both traditional and mixed traffic flows. These oscillations typically manifest as periodic waves of acceleration and deceleration that can emerge spontaneously, even without explicit external stimuli. Early studies primarily attributed their formation to physical bottlenecks, such as lane drops or merges [26,27]. However, the landmark ring-road experiment conducted by Sugiyama et al. [28] demonstrated that traffic instabilities can also arise purely from driver reaction delays and behavioral heterogeneity, even in geometrically unconstrained conditions. At the system level, traffic oscillations significantly undermine the sustainability of transportation networks. From an environmental perspective, frequent acceleration and braking cycles increase fuel consumption and greenhouse gas emissions [29,30]. From a safety perspective, oscillations reduce the controllability of vehicle spacing and speed, thereby increasing collision risk [31]. Economically, they diminish road throughput and degrade commute reliability and travel time efficiency. Collectively, these adverse effects contradict the overarching goals of modern urban transportation systems—namely, safety, operational efficiency, and environmental sustainability [32].

In recent years, increasing attention has been paid to the intrinsic coupling between traffic stability and broader sustainability objectives. Studies have shown that suppressing oscillations not only improves driving comfort but also significantly reduces energy consumption and enhances key safety and environmental indicators, such as Time-To-Collision (TTC) and Fuel Economy Index (FEI) [25,33]. Therefore, traffic stability—understood as the system’s capacity to dampen perturbations in speed and spacing—should be regarded not only as a control objective but also as a foundational prerequisite for sustainable transportation development. With the rapid advancement of autonomous driving technologies, the coexistence of autonomous vehicles (AVs) and human-driven vehicles (HDVs) in mixed traffic environments is expected to persist over the long term. This coexistence introduces novel disturbance patterns and complex control challenges, which expose the limitations of traditional traffic control strategies in maintaining system stability and adaptability. While mesoscopic and macroscopic control methods—such as large-scale traffic management [34] and local coordination [35]—have shown partial success in traffic stabilization, they typically lack integration with microscopic vehicle dynamics. As a result, their ability to mitigate oscillations at the individual vehicle level is limited. Therefore, more refined and personalized control strategies are necessary to effectively address oscillation issues in mixed traffic environments.

2.2. Evolution of Control Models: From Car-Following Model to Bilateral Control Model

Research on traffic stability initially centered on car-following models (CFMs), in which each vehicle dynamically adjusts its acceleration based on the state of the vehicle ahead. Classical models such as the General Motors (GM) model and the Intelligent Driver Model (IDM) have laid the theoretical foundation for longitudinal control modeling [36,37]. Building upon these, adaptive cruise control (ACC) employs onboard sensors to dynamically regulate vehicle speed, thereby enhancing individual vehicle responsiveness [26]. Cooperative adaptive cruise control (CACC) further introduces vehicle-to-vehicle (V2V) communication to acquire upstream information and improve disturbance anticipation [38]. However, these control strategies rely on unidirectional feedback structures and lack the capability to respond to downstream disturbances. As a result, they often fail to ensure system-level stability in high-density or behaviorally heterogeneous mixed traffic environments and may exacerbate traffic oscillations.

To address these structural limitations, Horn et al. proposed the Bilateral Control Model (BCM) [39], which incorporates bidirectional state feedback from both leading and following vehicles to construct a local damping mechanism that effectively absorbs disturbances at their source. The control structure satisfies the damped wave equation and exhibits favorable string stability under various boundary conditions [40]. Wang et al. further introduced a chain stability criterion and validated the robust convergence properties of BCM across multiple disturbance frequencies [41]. The model was subsequently extended to a multi-node BCM framework, which integrates information from multiple neighboring vehicles and utilizes Taylor expansion and least-squares optimization to enhance low-frequency disturbance suppression and system responsiveness [42]. At the spectral analysis level, Wang and Horn developed an eigenvalue-based framework to analyze string stability, revealing the system’s stability characteristics in the frequency domain [43]. Furthermore, BCM has been adapted to scenarios involving the coexistence of autonomous vehicles (AVs) and human-driven vehicles (HDVs), where collaborative mechanisms and stability boundaries under different control structures were systematically explored [44]. Despite the structural advantages of BCM in theoretical modeling, most existing studies are based on idealized assumptions of fully autonomous environments and fixed control parameters. These models lack the adaptability to dynamic disturbances caused by perception delays, behavioral heterogeneity, and environmental uncertainty, which are prevalent in real-world mixed traffic systems.

2.3. Reinforcement Learning in Mixed Traffic Control

Reinforcement learning (RL) and its deep extension, deep reinforcement learning (DRL), have demonstrated remarkable capabilities in handling nonlinear, high-dimensional, and dynamic control problems. Through agent–environment interaction and trial-and-error learning, DRL exhibits superior adaptability and generalization compared to traditional rule-based or model-driven approaches. As a result, DRL has been widely adopted in the domain of intelligent traffic control.

Early applications of reinforcement learning (RL) in transportation began with Q-learning [45], which, despite its conceptual simplicity, exhibited poor scalability in large state spaces. To address this limitation, Deep Q-Networks (DQN) were introduced to approximate the Q-value function using neural networks, significantly improving learning efficiency in discrete action domains [46]. However, DQN is not suitable for continuous control tasks such as vehicle longitudinal regulation. To overcome this issue, the Deep Deterministic Policy Gradient (DDPG) algorithm was proposed, incorporating an actor–critic architecture to extend deep RL into continuous action spaces [47]. Shi et al. [22] pioneered the integration of bidirectional feedback into the DRL framework by developing a BCM-DRL model based on DDPG. By embedding both leading and following vehicle states into the state representation, their model achieved trajectory-tracking performance that surpassed that of human drivers in homogeneous AV platoons, demonstrating the potential of combining DRL with BCM architectures. However, DDPG often suffers from overestimation bias and unstable convergence during training.

The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm enhances DDPG by introducing twin Q-networks, delayed policy updates, and target policy smoothing. These refinements significantly improve policy stability and learning robustness, making TD3 a key technique in real-time continuous control scenarios such as automated vehicle regulation. Liu et al. [25] proposed the BCM-DRL control framework, which integrates the TD3 algorithm into the bilateral control model as a replacement for conventional DDPG. Their study was conducted under a homogeneous autonomous vehicle (AV) setting, focusing on traffic oscillation suppression. The model incorporates an AV-centered state space, a continuous action mechanism, and a multi-objective reward function, significantly improving the system’s responsiveness to disturbances and enhancing string stability. Although the framework has not yet been extended to heterogeneous mixed traffic environments, it provides a solid foundation for future exploration of AV–HDV cooperative control strategies.

Furthermore, comparative studies between DRL and traditional approaches such as Model Predictive Control (MPC) [48,49] suggest that while DRL offers superior adaptability in uncertain or nonlinear traffic environments, it still faces challenges in convergence and training reliability. For example, Li et al. [50] combined the Helly model with MPC to develop a jam-absorption strategy, yet this approach depends on precise parameter calibration and assumes linear dynamics, limiting its generalizability. In contrast, DRL’s robust adaptability—particularly when coupled with bidirectional feedback from BCM—holds promise for more flexible, scalable, and effective control strategies that can suppress oscillations and optimize flow in mixed traffic conditions.

3. Methodology

This study proposes a TD3-driven Bilateral Control Model (TD3-BCM) that integrates deep reinforcement learning with bidirectional feedback mechanisms to address the complex, dynamic nature of mixed traffic environments. By introducing a carefully designed reward function emphasizing traffic-flow stability, the model optimizes the traditional Bilateral Control Model (BCM). It adaptively adjusts control strategies under various penetration rates of autonomous and human-driven vehicles (HDVs), significantly enhancing disturbance suppression, convergence stability, and control precision, thereby improving overall traffic-flow robustness. Section 3.1 introduces the theoretical foundation and key principles of traditional BCM. Building on this, Section 3.2 presents the proposed TD3-BCM, detailing how Twin Delayed Deep Deterministic Policy Gradient (TD3) reinforces the model’s robustness in complex traffic scenarios. Section 3.3 discusses the training process and optimization steps for TD3-BCM.

3.1. Introduction to Traditional Bilateral Control Models

The Bilateral Control Model (BCM), initially proposed by Horn in 2013 [39], is based on the principle of balancing a vehicle’s position and velocity relative to both its leading and following neighbors, as illustrated in Figure 1.

The control law for a BCM vehicle n can be expressed as Equation (1):

\begin{aligned}
a_{n} (t) = {} & k_{d} (d_{ahead} - d_{behind}) \\
& + k_{v} ((v_{lead} - v_{mid, n} (t)) - (v_{mid, n} (t) - v_{follow})) \\
& + k_{c} (v_{mid, n} (t) - v_{desired}) \\
& + k_{a} (a_{lead} - a_{mid, n})
\end{aligned}

(1)

where

$a_{n} (t)$ : Acceleration of vehicle n at time t.
$d_{ahead}, d_{behind}$ : Distances between vehicle n and its leading and following vehicles, respectively.
$v_{lead}, v_{mid, n}, v_{follow}$ : Speeds of the leading, current, and following vehicles.
$v_{desired}$ : Target cruising speed of vehicle n.
$a_{lead}, a_{mid, n}$ : Accelerations of the leading vehicle and vehicle n, respectively.

The parameters $k_{d}$, $k_{v}$, $k_{c}$, and $k_{a}$ represent the control gains associated with distance error, relative velocity, speed tracking, and acceleration tracking, respectively. Based on prior empirical and simulation studies [39,51], their recommended ranges are set as follows: $k_{d} \in [0.1, 1.5]$, $k_{v} \in [0.1, 1.0]$, $k_{c} \in [0.1, 0.8]$, and $k_{a} \in [0.1, 0.5]$. These parameter intervals are selected to balance system responsiveness with damping performance under representative traffic conditions. For baseline simulations, the initial values are selected as $k_{d} = 0.8$, $k_{v} = 0.5$, $k_{c} = 0.5$, and $k_{a} = 0.2$, which have been validated in previous BCM implementations to provide a favorable trade-off between oscillation suppression and responsiveness [25]. Unless otherwise specified, these parameter settings assume idealized conditions of perception and communication.
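As a minimal sketch, Equation (1) can be evaluated term by term with the baseline gains quoted above. The function and class names are hypothetical, and the signs follow Equation (1) exactly as printed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BCMGains:
    """Baseline gains quoted in the text; the class itself is illustrative."""
    k_d: float = 0.8  # distance-error gain
    k_v: float = 0.5  # relative-velocity gain
    k_c: float = 0.5  # speed-tracking gain
    k_a: float = 0.2  # acceleration-tracking gain


def bcm_acceleration(d_ahead, d_behind, v_lead, v_mid, v_follow,
                     v_desired, a_lead, a_mid, gains=BCMGains()):
    """Evaluate the four terms of Equation (1) for one middle vehicle."""
    spacing_term = gains.k_d * (d_ahead - d_behind)
    velocity_term = gains.k_v * ((v_lead - v_mid) - (v_mid - v_follow))
    cruise_term = gains.k_c * (v_mid - v_desired)  # sign as printed in Eq. (1)
    accel_term = gains.k_a * (a_lead - a_mid)
    return spacing_term + velocity_term + cruise_term + accel_term


# A vehicle centered between its neighbors at the common speed needs no correction.
balanced = bcm_acceleration(20.0, 20.0, 15.0, 15.0, 15.0, 15.0, 0.0, 0.0)
# A 2 m larger gap ahead than behind pulls the vehicle forward.
gap_ahead = bcm_acceleration(22.0, 20.0, 15.0, 15.0, 15.0, 15.0, 0.0, 0.0)
```

With all speeds equal and zero accelerations, only the spacing term is active, so the second call returns $k_{d}$ times the 2 m spacing imbalance.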

In a BCM chain, the rear-most vehicle typically operates under a unilateral car-following model, such as an Adaptive Cruise Control (ACC) system, since it lacks a trailing vehicle. The ACC model is described by Equation (2):

a_{n} (t) = k_{1} (d_{desired} (t) - d_{n} (t)) + k_{2} (v_{desired} (t) - v_{n} (t))

(2)

where

$a_{n} (t)$ : Acceleration of the n-th vehicle at time t.
$k_{1}, k_{2}$ : Control gain parameters determining the vehicle’s response to errors in distance and speed. Suggested ranges: $k_{1} \in [0.3, 1.5]$ and $k_{2} \in [0.5, 2.0]$ , balancing stability and responsiveness [2,26].
$d_{desired} (t)$ : Desired distance, defined as $d_{0} + η v_{n} (t)$ , with $d_{0} \approx 2 m$ and $η \approx 1.5 s$ [36].
$d_{n} (t)$ : Actual distance between the n-th vehicle and the preceding vehicle.
$v_{desired} (t)$ : Desired speed, typically set as the target speed or road speed limit.
$v_{n} (t)$ : Actual speed of the n-th vehicle at time t.
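The ACC law of Equation (2) can be sketched in the same way. The gain values `k1=0.5` and `k2=1.0` are assumptions picked from inside the suggested ranges, the function names are hypothetical, and the error signs are kept as printed in Equation (2).

```python
def desired_gap(v_n, d0=2.0, eta=1.5):
    """Desired distance d0 + eta * v_n, with d0 in metres and eta in seconds."""
    return d0 + eta * v_n


def acc_acceleration(d_n, v_n, v_desired, k1=0.5, k2=1.0, d0=2.0, eta=1.5):
    """Evaluate Equation (2) for the rear-most vehicle of the chain."""
    distance_error = desired_gap(v_n, d0, eta) - d_n  # sign as printed in Eq. (2)
    speed_error = v_desired - v_n
    return k1 * distance_error + k2 * speed_error


# At 10 m/s the desired gap is 2 + 1.5 * 10 = 17 m.
steady = acc_acceleration(d_n=17.0, v_n=10.0, v_desired=10.0)
# Same gap, but the target speed is 2 m/s higher than the current speed.
speed_up = acc_acceleration(d_n=17.0, v_n=10.0, v_desired=12.0)
```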

Although BCM improves upon unidirectional feedback models, it relies on fixed control parameters that limit adaptability in non-stationary or nonlinear traffic dynamics [39,40]. To overcome these constraints, we propose a deep reinforcement learning extension—TD3-Bilateral Control Model (TD3-BCM)—that leverages real-time feedback from traffic conditions for enhanced stability, reduced energy consumption, and more effective oscillation mitigation than conventional BCM approaches.

3.2. TD3-Driven Bilateral Control Model

This section introduces the TD3-driven Bilateral Control Model (TD3-BCM), which integrates the Twin Delayed Deep Deterministic Policy Gradient (TD3) into the bilateral control framework to enhance adaptability and stability in mixed traffic. The TD3-BCM framework comprises two key components. The first component explains the foundational TD3 mechanisms—namely, the actor–critic structure, target policy smoothing, and clipped double Q-learning. The second component introduces the TD3-BCM formulation, including the state and action spaces, the multi-objective reward function, and the policy update strategy. By integrating these algorithmic foundations with a tailored formulation for real-time vehicle interactions, TD3-BCM effectively captures behavioral heterogeneity and enables stable and efficient control of mixed traffic flow.

3.2.1. Core Mechanisms of TD3

In mixed traffic flows, the core mechanisms of TD3 (the actor–critic architecture, target policy smoothing, and clipped double Q-learning) serve as the algorithmic backbone for handling the uncertainties and real-time demands introduced by AVs and HDVs sharing the same lanes. By balancing stability and robustness in policy learning, these mechanisms help minimize acceleration fluctuations and improve inter-vehicle coordination. This, in turn, enhances system-level efficiency and safety, thereby supporting the real-time application of bilateral control strategies in complex mixed traffic conditions.

(a): Actor-Critic Architecture

The TD3 algorithm adopts an actor–critic architecture that decouples policy generation from value estimation, thereby improving learning stability and precision. Two distinct sets of neural networks are designed for the tail vehicle and middle vehicles, respectively.

Tail Vehicle (Actor 1 and Critic 1): The tail vehicle operates under a unilateral car-following model. Its state vector $s_{n} (t) = [v_{n} (t), d_{n} (t), Δ v_{n} (t)]$ is fed into Actor 1, which comprises two hidden layers with 32 neurons each. Actor 1 outputs the acceleration action $a_{n} (t)$ . Critic 1 receives the same state vector along with $a_{n} (t)$ as input and estimates the corresponding Q-value using a 32 × 32 hidden structure, as shown in Figure 2.
Middle Vehicles (Actor 2 and Critic 2): Middle vehicles adopt a bilateral control structure, extending their state vector to $s_{n} (t) = [v_{n} (t), d_{n} (t), Δ v_{n} (t), v_{n + 1} (t), d_{n + 1} (t)]$ to incorporate information from both preceding and following vehicles. Actor 2 consists of two hidden layers with 64 neurons each and outputs $a_{n} (t)$ . Critic 2 estimates the Q-value based on the same state and action input, using a 64 × 64 hidden layer architecture, as shown in Figure 3.

This architecture enables the TD3-BCM framework to dynamically adjust vehicle acceleration while ensuring that these adjustments align with the objectives of traffic stability, safety, and efficiency. The tail vehicle focuses on interactions with its leader, while the middle vehicles learn cooperative behaviors through bidirectional feedback.
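To make the layer sizes concrete, the following sketch lists the weight-matrix shapes implied by the description above. It assumes plain fully connected layers with biases, a scalar acceleration output, and critics that take the state and action concatenated as a single input; none of these details are confirmed by the text, and the helper names are hypothetical.

```python
def mlp_layer_shapes(input_dim, hidden_sizes, output_dim):
    """Return (inputs, outputs) for each fully connected layer of an MLP."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    return list(zip(sizes[:-1], sizes[1:]))


def parameter_count(shapes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


TAIL_STATE_DIM = 3    # [v_n, d_n, delta v_n]
MIDDLE_STATE_DIM = 5  # tail state plus [v_{n+1}, d_{n+1}]
ACTION_DIM = 1        # longitudinal acceleration a_n(t)

# Tail vehicle: 32 x 32 hidden layers.
actor_1 = mlp_layer_shapes(TAIL_STATE_DIM, (32, 32), ACTION_DIM)
critic_1 = mlp_layer_shapes(TAIL_STATE_DIM + ACTION_DIM, (32, 32), 1)

# Middle vehicles: 64 x 64 hidden layers.
actor_2 = mlp_layer_shapes(MIDDLE_STATE_DIM, (64, 64), ACTION_DIM)
critic_2 = mlp_layer_shapes(MIDDLE_STATE_DIM + ACTION_DIM, (64, 64), 1)

for name, shapes in [("actor_1", actor_1), ("critic_1", critic_1),
                     ("actor_2", actor_2), ("critic_2", critic_2)]:
    print(name, shapes, parameter_count(shapes))
```

Under these assumptions the middle-vehicle networks are roughly four times larger than the tail-vehicle networks, which reflects the wider hidden layers rather than the two extra state inputs.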

(b): Target Policy Smoothing

TD3 incorporates target policy smoothing to mitigate the tendency of deterministic policies to overfit to narrow peaks in the value estimate, which may impair generalization in real-world scenarios. When the critic target is computed, clipped stochastic noise is added to the action produced by the target policy [52], so that similar actions receive similar value estimates, as shown in Equation (3):

π_{target} (s_{n}^{'}) = π_{θ} (s_{n}^{'}) + clip (N (0, σ), - c, c)

(3)

where

$π_{target} (s_{n}^{'})$ : Smoothed target policy for state $s_{n}^{'}$ .
$π_{θ} (s_{n}^{'})$ : Deterministic action from the actor network.
$N (0, σ)$ : Zero-mean Gaussian noise.
$σ$ : Standard deviation of the noise, typically set to 0.2 [52].
c: Clipping threshold to limit noise within $[- c, c]$ , typically $c = 0.5$ [53].

This technique reduces the model’s dependency on deterministic actions, thereby enhancing generalization under dynamic traffic conditions.
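Equation (3) can be sketched for a scalar action as follows. The final clamp to the ±3 m/s² acceleration bounds is an assumption borrowed from the standard TD3 recipe and this model's action limits; it does not appear in Equation (3). The function name and the `rng` parameter are hypothetical.

```python
import random


def smoothed_target_action(target_action, sigma=0.2, c=0.5,
                           a_min=-3.0, a_max=3.0, rng=random):
    """Add clipped Gaussian noise to the target actor's action, as in Eq. (3)."""
    noise = rng.gauss(0.0, sigma)             # N(0, sigma)
    noise = max(-c, min(c, noise))            # clip(noise, -c, c)
    noisy_action = target_action + noise
    # Assumption: keep the result inside the feasible acceleration range.
    return max(a_min, min(a_max, noisy_action))


# Seeded generator so the sketch is reproducible.
rng = random.Random(0)
samples = [smoothed_target_action(1.0, rng=rng) for _ in range(1000)]
```

Because the noise is clipped to $[- c, c]$, every sample stays within 0.5 m/s² of the deterministic target action.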

(c): Clipped Double Q-Learning of TD3

Traditional reinforcement learning methods are prone to overestimation bias in Q-value estimation, which can undermine overall performance. TD3 addresses this issue using clipped double Q-learning [47,52]: two independent critic networks are trained simultaneously, and the minimum of their Q-values is used to form the learning target, as shown in Equation (4):

Q_{target} = r + γ min (Q_{θ_{1}} (s_{n}^{'}, a^{'}), Q_{θ_{2}} (s_{n}^{'}, a^{'}))

(4)

where

$Q_{θ_{1}}$ and $Q_{θ_{2}}$ : Two independent critic networks.
$s_{n}^{'}$ and $a^{'}$ : Next state and next action.
r: Immediate reward, normalized to $[- 1, 1]$ for stability [47,52].
$γ$ : Discount factor, balancing short- and long-term returns, typically 0.95 or 0.99 [46].

By adopting the lower Q-value from two critics, this method ensures conservative value updates, which improve training stability and policy robustness.
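A minimal sketch of the target in Equation (4) for a single transition is given below. The `done` flag for terminal transitions is an assumption that is common in practice but is not part of Equation (4); the function name is hypothetical.

```python
def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped double-Q target: r + gamma * min(Q1(s', a'), Q2(s', a'))."""
    if done:
        # Assumption: no bootstrapping beyond a terminal state.
        return reward
    return reward + gamma * min(q1_next, q2_next)


# The more pessimistic critic (1.0) determines the target, not the optimistic one (2.0).
target = td3_target(reward=0.5, q1_next=2.0, q2_next=1.0, gamma=0.9)
```

Here the target is 0.5 + 0.9 × 1.0 = 1.4, whereas bootstrapping from the larger estimate would have given 2.3.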

3.2.2. TD3-BCM Formulation

The following describes how the aforementioned TD3 mechanisms are integrated into the bilateral control framework, covering the state space, action space, reward function, and policy update procedure. The resulting workflow enables real-time adaptability to mixed traffic conditions while balancing traffic stability, safety, and efficiency. An overview of the framework is given as follows:

(a): State Space

The state space captures dynamic traffic conditions and varies for middle and tail vehicles within the chain:

Middle Vehicles: Middle vehicles adopt a bilateral control model, integrating information from both preceding and following vehicles. The state vector is defined as Equation (5):

s_{mid, n} (t) = [v_{mid, n} (t), d_{ahead, n} (t), Δ v_{n, n - 1} (t), d_{behind, n} (t), v_{n + 1} (t)]

(5)

where

$v_{mid, n} (t)$ : Current speed of the middle vehicle.
$d_{ahead, n} (t)$ : Distance to the preceding vehicle.
$Δ v_{n, n - 1} (t)$ : Relative speed with respect to the preceding vehicle.
$d_{behind, n} (t)$ : Distance to the following vehicle.
$v_{n + 1} (t)$ : Current speed of the following vehicle.

Tail Vehicles: Tail vehicles use a single-sided car-following model, focusing on interactions with the preceding vehicle. The state vector is given as Equation (6):

s_{tail, n} (t) = [v_{tail, n} (t), d_{tail, n} (t), Δ v_{tail, n} (t)]

(6)

where

$v_{tail, n} (t)$ : Current speed of the tail vehicle.
$d_{tail, n} (t)$ : Distance to the preceding vehicle.
$Δ v_{tail, n} (t)$ : Relative speed with respect to the preceding vehicle.

These definitions enable middle vehicles to assess their position and velocity relationships with both preceding and following vehicles, while tail vehicles focus solely on preceding vehicle interactions.
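The two state vectors in Equations (5) and (6) can be assembled from raw positions and speeds as sketched below. Two details are assumptions, since the text does not fix them: distances are measured between reference points with vehicle length ignored, and relative speed is taken as leader speed minus own speed. The `Vehicle` type and function names are hypothetical.

```python
from typing import NamedTuple


class Vehicle(NamedTuple):
    position: float  # longitudinal position along the road, m
    speed: float     # m/s


def middle_state(lead, ego, follow):
    """State of a middle vehicle, in the element order of Equation (5)."""
    return [
        ego.speed,                       # v_mid,n
        lead.position - ego.position,    # d_ahead,n
        lead.speed - ego.speed,          # delta v_{n,n-1} (assumed sign)
        ego.position - follow.position,  # d_behind,n
        follow.speed,                    # v_{n+1}
    ]


def tail_state(lead, ego):
    """State of the tail vehicle, in the element order of Equation (6)."""
    return [
        ego.speed,                     # v_tail,n
        lead.position - ego.position,  # d_tail,n
        lead.speed - ego.speed,        # delta v_tail,n (assumed sign)
    ]


lead = Vehicle(position=100.0, speed=20.0)
ego = Vehicle(position=80.0, speed=18.0)
follow = Vehicle(position=65.0, speed=17.0)
```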

(b): Action Space

The action space represents the longitudinal acceleration $a_{n} (t)$, a continuous variable constrained by

- 3 m / s^{2} \leq a_{n} (t) \leq 3 m / s^{2} .

This range, derived from Hu et al. [54], ensures realistic driving behavior, maintaining passenger comfort and operational feasibility.
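One common way to enforce such a bound is to let the actor emit a normalized value and scale it, as sketched here. The text does not say how the bound is imposed in TD3-BCM, so the normalized output in [-1, 1] and the function name are assumptions.

```python
A_MAX = 3.0  # m/s^2, magnitude of the acceleration bound stated above


def to_acceleration(raw_output):
    """Map a normalized actor output to an acceleration in [-A_MAX, A_MAX]."""
    # Assumption: the actor output is nominally in [-1, 1]; clamp it defensively.
    normalized = max(-1.0, min(1.0, raw_output))
    return A_MAX * normalized


commands = [to_acceleration(x) for x in (-5.0, -0.5, 0.0, 0.5, 2.0)]
```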

(c): Reward Function

The reward function balances traffic flow stability, efficiency, and safety, with additional considerations for smooth control actions. The components include the following:

Stability: Large variations in acceleration are penalized to suppress traffic oscillations, shown as Equation (7):

r_{stability, n} = - α_{1} a_{n} (t)^{2} - α_{2} | a_{n} (t) |

(7)

where $a_{n} (t)$ represents the acceleration of the n-th vehicle, $α_{1} \in [0.1, 1.0]$ penalizes squared acceleration to reduce traffic oscillations, and $α_{2} \in [0.1, 0.5]$ penalizes absolute acceleration to smooth variations, enhancing traffic stability and comfort [26,54,55].
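Equation (7) can be sketched directly. The weights `alpha1=0.5` and `alpha2=0.2` are assumptions chosen from inside the stated ranges, not values reported by the study.

```python
def stability_reward(a_n, alpha1=0.5, alpha2=0.2):
    """Equation (7): penalize squared and absolute acceleration."""
    return -alpha1 * a_n ** 2 - alpha2 * abs(a_n)


# The quadratic term dominates for hard maneuvers, the linear term for gentle ones.
gentle = stability_reward(0.2)   # -0.5 * 0.04 - 0.2 * 0.2
harsh = stability_reward(2.0)    # -0.5 * 4.0  - 0.2 * 2.0
```

The penalty is symmetric in the sign of the acceleration, so braking and accelerating by the same amount are treated equally.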

Safety: The safety reward component encourages maintaining safe driving behavior by addressing potential risks associated with insufficient time gaps, unsafe inter-vehicle distances, and excessive speeds. This component integrates three critical sub-rewards: Time Gap Penalty, Distance Penalty, and TTC-Based Collision Avoidance.

Time Gap Penalty: A penalty is applied when the time gap $T_{n} (t)$ between the vehicle and its preceding vehicle falls below the safe threshold of 0.6 s, as shown in Equation (8):

r_{time gap, n} = - β_{time gap} max (0, 0.6 - T_{n} (t))

(8)

where $T_{n} (t)$ is the time gap between the vehicle and its preceding vehicle, and $β_{time gap}$ is the penalty weight for unsafe time gaps, adjustable based on safety requirements.

Distance Penalty: When the inter-vehicle distance $d_{n} (t)$ is less than the minimum safe distance $d_{safe}$, the penalty grows as the distance approaches zero, as shown in Equation (9):

r_{distance, n} = - β_{distance} max (0, d_{safe} - d_{n} (t))

(9)

where $d_{n} (t)$ is the distance between the vehicle and its preceding vehicle; $d_{safe}$ is the minimum safe distance, typically set to 2 m [26]; and $β_{distance}$ is the penalty weight for unsafe distances.

TTC-Based Collision Avoidance: To enhance safety during training and testing, a Time-to-Collision (TTC)-based collision avoidance mechanism is implemented. The safe speed and the resulting penalty are computed using Equations (10) and (11):

v_{safe} = v_{ahead} + max (0, \frac{d_{n} (t) - d_{safe} - v_{n} (t) \cdot T_{react}}{T_{react} + \frac{v_{n} (t)}{a_{\max}}})

(10)

r_{TTC, n} = - β_{TTC} max (0, v_{n} (t) - v_{safe})

(11)

where

$v_{safe}$ : Is the computed safe speed to avoid collisions.
$v_{ahead}$ : Is the speed of the preceding vehicle.
$T_{react} = 0.4 s$ [36]: Is the reaction time.
$a_{\max} = 3 m / s^{2}$ [56]: Is the maximum deceleration.
$β_{TTC}$ : Is the penalty weight for unsafe speeds.

The total safety reward for each vehicle (middle or tail) integrates the three components, as shown in Equation (12):

r_{safety, n} = r_{time gap, n} + r_{distance, n} + r_{TTC, n}

(12)
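The safety terms in Equations (8) to (12) can be sketched as follows. This is an illustrative transcription rather than the authors' implementation: the function names are hypothetical, the three penalty weights default to 1.0 purely as placeholders (their tuned values are not given in this section), and the time gap is passed in as an input because its computation is not specified here.

```python
def safe_speed(v: float, v_ahead: float, gap: float, d_safe: float = 2.0,
               t_react: float = 0.4, a_max: float = 3.0) -> float:
    """Safe speed of Equation (10) for a vehicle with speed v >= 0."""
    margin = gap - d_safe - v * t_react
    return v_ahead + max(0.0, margin / (t_react + v / a_max))


def safety_reward(v: float, v_ahead: float, gap: float, time_gap: float,
                  beta_gap: float = 1.0, beta_dist: float = 1.0,
                  beta_ttc: float = 1.0, d_safe: float = 2.0,
                  t_gap_min: float = 0.6) -> float:
    """Sum of Equations (8), (9) and (11), as in Equation (12)."""
    r_gap = -beta_gap * max(0.0, t_gap_min - time_gap)             # Eq. (8)
    r_dist = -beta_dist * max(0.0, d_safe - gap)                   # Eq. (9)
    v_safe = safe_speed(v, v_ahead, gap, d_safe)
    r_ttc = -beta_ttc * max(0.0, v - v_safe)                       # Eq. (11)
    return r_gap + r_dist + r_ttc
```

For example, a vehicle at 10 m/s that is 20 m behind a leader travelling at 8 m/s has a safe speed of 8 + 14 / (0.4 + 10/3) = 11.75 m/s, so none of the three penalties is active and the safety reward is zero.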

Efficiency: Encourages maintaining the desired speed $v_{desired}$, as described in Equation (13):

r_{efficiency, n} = - β_{efficiency} |v_{n} (t) - v_{desired}|

(13)

where $β_{efficiency} (0.1 \sim 0.5)$ is the penalty weight for speed deviations, balancing efficiency and adaptability to traffic conditions [31,57].

Smoothness: To facilitate early-stage training and discourage excessive accelerations or inter-vehicle spacing, a logarithmic acceleration term and an inverse-spacing term are incorporated into the reward structure, as described in Equation (14):

r_{smoothy, n} = - log (1 + | a_{n} (t) |) + \frac{1}{d_{n} (t)}

(14)

This adjustment prevents excessively negative rewards caused by large accelerations or distances, improving training stability and convergence.

Overall Reward Function: The combined reward function for vehicle chains integrates stability, safety, efficiency, and smoothness components, as shown in Equation (15):

r_{n} (t) = r_{stability, n} + r_{safety, n} + r_{efficiency, n} + r_{smoothy, n}

(15)

Table 1 outlines the key reward components in the TD3-BCM framework, detailing their parameters and respective control functions. These components collectively drive the system toward stable, safe, efficient, and smooth operation.
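To make the composition in Equation (15) concrete, the sketch below evaluates the stability, efficiency, and smoothness terms of Equations (7), (13), and (14) and adds a safety reward supplied by the caller. The function names are hypothetical, the default weights are simply the lower ends of the ranges quoted above, the logarithm is assumed to be the natural logarithm, and the desired speed is left as an argument because its value is not stated in this section.

```python
import math


def stability_reward(a: float, alpha1: float = 0.1, alpha2: float = 0.1) -> float:
    """Equation (7): penalize squared and absolute acceleration."""
    return -alpha1 * a ** 2 - alpha2 * abs(a)


def efficiency_reward(v: float, v_desired: float, beta_eff: float = 0.1) -> float:
    """Equation (13): penalize deviation from the desired speed."""
    return -beta_eff * abs(v - v_desired)


def smoothness_reward(a: float, gap: float) -> float:
    """Equation (14); requires a strictly positive gap."""
    return -math.log(1.0 + abs(a)) + 1.0 / gap


def total_reward(a: float, v: float, gap: float, v_desired: float,
                 r_safety: float, alpha1: float = 0.1, alpha2: float = 0.1,
                 beta_eff: float = 0.1) -> float:
    """Equation (15): sum of the four reward components."""
    return (stability_reward(a, alpha1, alpha2) + r_safety
            + efficiency_reward(v, v_desired, beta_eff)
            + smoothness_reward(a, gap))
```

With zero acceleration, a speed equal to the desired speed, no safety violation, and a 10 m gap, only the inverse-spacing term remains and the reward equals 0.1.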

(d): Policy Update

The TD3 algorithm iteratively updates its actor (policy) and critic (value) networks from sampled states, actions, and rewards to optimize control strategies. This process evaluates the effectiveness of actions based on their impact on stability, safety, and efficiency. Techniques like target policy smoothing and clipped double Q-learning are used to mitigate overestimation and improve robustness. These iterative updates enable the control framework to adapt to dynamic traffic conditions, ensuring smoother and more efficient traffic flow.
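The two techniques named above can be illustrated with the target value that a generic TD3 critic is regressed toward. The sketch below is a plain-Python illustration for a single transition, not the networks used in this study: the target actor and the two target critics are passed in as ordinary callables, and the discount factor and noise settings are commonly used TD3 defaults assumed here for illustration. Only the action bound of 3 m/s² comes from the action space defined earlier.

```python
import random
from typing import Callable, Sequence

State = Sequence[float]


def td3_target(reward: float, next_state: State, done: bool,
               target_actor: Callable[[State], float],
               target_q1: Callable[[State, float], float],
               target_q2: Callable[[State, float], float],
               gamma: float = 0.99, noise_std: float = 0.2,
               noise_clip: float = 0.5, a_max: float = 3.0,
               rng=random) -> float:
    """Critic target for one transition in a generic TD3 update."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    next_action = max(-a_max, min(a_max, target_actor(next_state) + noise))
    # Clipped double Q-learning: take the smaller of the two target critics.
    q_next = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))
    return reward + gamma * (0.0 if done else 1.0) * q_next
```

Taking the minimum of the two critics biases the target downward, which is what counteracts the overestimation mentioned in the text.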

3.3. Training and Evaluation of TD3-BCM

3.3.1. Training and Test Data

This study employs the publicly available NGSIM I-80 dataset, recorded on the I-80 interstate highway near Emeryville, California [58]. It spans a 503-meter-long highway weaving segment (lanes 2 to 5), with data recorded at 10 Hz, capturing vehicle positions, speeds, and other traffic attributes. The data was collected between 04:00 and 04:15 pm on 13 April 2005. It represents typical highway traffic flow by focusing on same-lane vehicle movements while excluding lane-changing behavior. Although the NGSIM I-80 dataset reflects a specific traffic scenario, the dataset is reconstructed in this study to improve model adaptability. The reconstructed dataset integrates diverse traffic patterns and driving behaviors to more accurately reflect real-world acceleration and deceleration, thereby enhancing the model’s generalizability across different traffic scenarios. It emphasizes vehicle acceleration and deceleration to ensure the model accurately reflects dynamic variations in mixed traffic environments.

For training, to improve efficiency and limit random exploration, each velocity profile was constrained to 30 s (300 simulation steps). This constraint prevents excessive vehicle spacing and prolonged negative rewards, thus accelerating convergence. The data selection criteria included profiles lasting longer than 30 s, with speeds above 2 m/s and speed standard deviations greater than 3 m/s. Ultimately, 50 high-quality 30-s profiles were selected, capturing common acceleration and deceleration patterns in mixed traffic.
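The selection criteria above can be expressed as a simple filter. The sketch below is one possible reading, with hypothetical function names: it assumes the speed threshold applies to every sample of a profile, uses the population standard deviation, and keeps the first 300 samples of an accepted profile. The text does not specify any of these three choices.

```python
import statistics
from typing import List


def is_training_candidate(speeds: List[float], dt: float = 0.1,
                          min_duration: float = 30.0,
                          min_speed: float = 2.0,
                          min_std: float = 3.0) -> bool:
    """Check the duration, speed, and variability criteria for one profile."""
    min_samples = round(min_duration / dt)  # 300 samples at 10 Hz
    if len(speeds) <= min_samples:          # must last longer than 30 s
        return False
    return min(speeds) > min_speed and statistics.pstdev(speeds) > min_std


def clip_profile(speeds: List[float], dt: float = 0.1,
                 window: float = 30.0) -> List[float]:
    """Keep the first 30 s (300 steps) of an accepted profile."""
    return speeds[:round(window / dt)]
```

Comparing sample counts rather than floating-point durations avoids accepting a profile of exactly 300 samples because of rounding.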

For the test set, this study applied the processing method of Jiang et al. [59] for the highD dataset to the I-80 dataset’s vehicle speed profiles. Trajectories exhibiting traffic oscillation characteristics were selected, requiring durations longer than 30 s and speed standard deviations greater than 3 m/s. Speed gaps were filled by smoothly blending velocity profiles using acceleration transitions within ±0.3 m/s². Finally, 50 test trajectories, each 300 s long, were created and used exclusively for the lead vehicle (Leader Vehicle), ensuring the independence of the training and test datasets. Despite representing a specific traffic environment, the reconstructed dataset significantly enhances the model’s adaptability and robustness across various traffic conditions.

As shown in Figure 4, a velocity profile selected from the reconstructed test set trajectory is presented, accurately reflecting speed fluctuations consistent with real-world vehicle acceleration and deceleration patterns. Training on these realistic velocity profiles enables the TD3-BCM to better adapt to complex traffic dynamics, particularly in scenarios involving frequent vehicle starts and stops. Therefore, the reconstructed NGSIM I-80 dataset not only simulates real-world acceleration and deceleration processes but also demonstrates good generalization ability, making it applicable to a wider range of traffic scenarios.

3.3.2. Training and Test Steps

The training framework follows a two-stage process. In the first stage, a unilateral car-following model (CFM) is trained with TD3 to govern the tail vehicle, providing a behavioral baseline for subsequent training. In the second stage, middle vehicles are trained using the pre-trained tail vehicle as a reference, with parameter sharing employed to ensure consistency throughout the BCM framework. The reward functions—focusing on stability, efficiency, and safety—were refined to suppress acceleration fluctuations and maintain appropriate inter-vehicle spacing. Following training, the model was evaluated on unseen trajectory profiles to assess its adaptability and robustness.

Each vehicle updates its speed and inter-vehicle spacing in discrete time steps using

\begin{cases} v_{n} (t + 1) = v_{n} (t) + a_{n} (t) Δ t, \\ Δ v_{n} (t) = v_{n} (t) - v_{n - 1} (t), \\ d_{n} (t + 1) = d_{n} (t) + \frac{v_{n} (t) + v_{n} (t + 1)}{2} Δ t, \end{cases}

(16)

where $v_{n}$ and $a_{n}$ denote the speed and acceleration of the n-th vehicle at time t, respectively, and $Δ t$ is the simulation time step.
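A single step of Equation (16) can be sketched as below. The function name is hypothetical, and the time step of 0.1 s is inferred from the 300 simulation steps per 30 s profile. The distance variable is advanced exactly as written in Equation (16), using the vehicle's own average speed over the step.

```python
from typing import Tuple


def step_vehicle(v: float, d: float, a: float, v_ahead: float,
                 dt: float = 0.1) -> Tuple[float, float, float]:
    """One discrete update of Equation (16), transcribed term by term."""
    v_next = v + a * dt                      # speed update
    dv = v - v_ahead                         # relative speed to the predecessor
    d_next = d + 0.5 * (v + v_next) * dt     # trapezoidal distance update
    return v_next, dv, d_next
```

For a vehicle at 10 m/s accelerating at 2 m/s², one step yields a speed of 10.2 m/s and advances the distance variable by 1.01 m.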

As shown in Figure 5, the training process for the TD3-BCM framework is divided into two stages. Step 1 involves training the tail vehicle using a unilateral car-following model (CFM) to establish a baseline for individual vehicle control. In this stage, Actor 1 represents the policy network that selects actions for the tail vehicle, while Critic 1 evaluates the quality of those actions based on the reward function. Experience replay 1 stores past interactions to improve learning stability. Step 2 extends this process to the middle vehicles in the chain, leveraging the pre-trained tail vehicle model. Here, Actor 2 is responsible for selecting actions for the middle vehicles, while Critic 2 assesses their performance. Experience replay 2 facilitates policy refinement across the entire vehicle chain, ensuring consistent and efficient learning.

3.3.3. Parameters and Training Results

Proper configuration of key parameters is essential to ensuring the TD3-BCM framework performs reliably across diverse traffic conditions. Informed by prior studies and extensive experimentation, these parameters are carefully calibrated to balance convergence speed, robustness, and adaptability.

Table 2 summarizes the primary parameters and their respective roles, laying the groundwork for subsequent experimental evaluation.

The training results highlight the TD3-BCM’s effectiveness in optimizing vehicle-chain control under mixed traffic conditions. As illustrated in Figure 6a, the reward value for the tail vehicle improves rapidly during the initial training phase and achieves convergence after approximately 200 episodes. This demonstrates the model’s capacity to rapidly adapt to unilateral control tasks while maintaining safety and efficiency.

In contrast, as shown in Figure 6b, the middle vehicle—responsible for managing bilateral interactions with both leading and following vehicles—requires approximately 300 episodes to achieve convergence. This highlights the framework’s capacity to address the added complexity of bilateral control scenarios effectively.

As training progresses, the decreasing variance in the reward curves reflects enhanced stability and robustness of the model. Overall, the results confirm that the framework successfully learns and implements effective control strategies, with both vehicle types achieving reliable convergence and stable performance within 1000 training episodes.

4. Simulation Setup and Performance Evaluation

This section systematically evaluates the performance of TD3-BCM in mixed traffic environments through numerical simulations. Building on the TD3-BCM established in Section 3 and the stability and performance evaluation metrics defined in this section, the study analyzes the impact of different autonomous vehicle (AV) penetration rates on traffic stability, safety, efficiency, and energy consumption, validating the effectiveness of TD3-BCM in optimizing complex traffic conditions.

4.1. Simulation Setup

To thoroughly evaluate the performance of TD3-BCM in mixed traffic environments, this study selects five representative scenarios, as illustrated in Figure 7, to compare the proposed TD3-BCM model against the traditional BCM and unidirectional CACC models. The scenarios encompass various autonomous vehicle (AV) penetration levels and control strategies. Specifically, Scenario 1 uses three CACC-controlled AV sub-chains, each consisting of one tail vehicle and three middle vehicles, with an AV penetration rate of 40%. Scenario 2 uses three BCM-controlled AV sub-chains, each consisting of one ACC-controlled tail vehicle and three BCM-controlled middle vehicles, with an AV penetration rate of 40%. Scenario 3 uses three TD3-BCM-controlled AV sub-chains, each consisting of one TD3-controlled tail vehicle and three TD3-BCM-controlled middle vehicles, with an AV penetration rate of 40%. Additionally, Scenarios 4 and 5 increase the AV penetration rate to 50% and 70%, respectively. The configuration of AV sub-chains in these scenarios follows the design principles outlined by Li et al. [60].

The primary objective of these scenarios is to compare the performance of the CACC, BCM, and TD3-BCM models under the same AV penetration rate, and to examine the performance variations of the TD3-BCM model under different penetration rates. These experiments not only comprehensively assess the TD3-BCM model’s benefits—such as improved stability, safety, efficiency, and energy savings—but also examine how varying AV penetration rates influence the stability of mixed traffic flow, further validating the performance advantages of TD3-BCM compared to other control strategies.

The selected scenarios are representative of common dynamic patterns observed in mixed traffic, allowing for realistic simulation and comparative analysis [26,51]. Although this study focuses on these five typical scenarios, other scenarios or AV sub-chain configurations can be added in future work to further verify the adaptability and robustness of TD3-BCM under various traffic conditions. Because the selected scenarios already capture the model’s performance variations under different penetration rates, additional scenarios are not discussed in detail here.

Figure 7. Scenarios for Mixed Traffic Simulations with TD3-BCM Vehicle Chains.

All scenarios derive the lead vehicle’s velocity from real-world driving data (NGSIM I-80) in the testing dataset, introducing realistic traffic perturbations to assess the stability improvements of autonomous vehicles (AVs) in mixed traffic. Human-driven vehicles (HDVs) follow the IDM model, with parameters calibrated based on [61] to ensure realistic human driving behavior. In Scenario 1, the CACC model adopts the methodology and parameter settings from [38]. In Scenario 2, AVs utilize the traditional Bilateral Control Model (BCM) and Adaptive Cruise Control (ACC) model, both implemented based on [41]. The parameter settings for Scenarios 3–5 are derived from the optimized results of the model training process detailed in Section 3.3.3.

4.2. Stability Results

The stability analysis adopts predefined metrics to evaluate the system’s ability to suppress disturbances caused by external factors or oscillations induced by human-driven vehicles (HDVs). The analysis includes chain stability analysis, modular stability analysis, cumulative damping ratios, and traffic dynamics analysis. These metrics evaluate the disturbance propagation and the system’s stability under different autonomous vehicle (AV) penetration rates in the TD3-BCM model.

4.2.1. Theoretical Stability Analysis

The stability of a mixed-vehicle platoon depends on the autonomous vehicle (AV) penetration rate, denoted by $ϕ$. Stability is achieved when the AV penetration rate exceeds a critical threshold, $ϕ_{\min}$, derived from the system’s transfer functions [62]. This relationship is given as Equation (17):

ϕ > ϕ_{\min} = \frac{1 - G_{HDV} (s)}{G_{AV} (s) - G_{HDV} (s)}

(17)

where

$ϕ$ is the AV penetration rate.
$G_{HDV} (s) = \frac{0.5}{s + 0.5}$ is the transfer function for HDVs.
$G_{AV} (s) = \frac{1.0 + 0.2 s}{s^{2} + 0.8 s + 1.0}$ is the transfer function for AVs.

By analyzing the system across low, medium, and high frequency ranges ($ω = 0.01 \sim 10$) [62], the critical AV penetration rate required to ensure chain stability is found to be $ϕ_{\min} = 32.24\%$. This value represents the minimum proportion of AVs necessary to maintain system stability. Additionally, the AV penetration rates in the five scenarios meet the required threshold for chain stability, ensuring that the system remains stable across all scenarios.
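The two transfer functions can be evaluated numerically along $s = jω$ over the quoted frequency range. The sketch below only computes the gain magnitudes of the HDV and AV models on a logarithmic grid. It does not attempt to reproduce the 32.24% threshold, since the text does not state how the frequency-dependent expression in Equation (17) is reduced to a single value. The function names and the grid size are assumptions.

```python
from typing import List, Tuple


def g_hdv(s: complex) -> complex:
    """HDV transfer function given with Equation (17)."""
    return 0.5 / (s + 0.5)


def g_av(s: complex) -> complex:
    """AV transfer function given with Equation (17)."""
    return (1.0 + 0.2 * s) / (s * s + 0.8 * s + 1.0)


def log_grid(lo: float = 0.01, hi: float = 10.0, n: int = 50) -> List[float]:
    """Logarithmically spaced frequencies from lo to hi (n >= 2)."""
    ratio = hi / lo
    return [lo * ratio ** (i / (n - 1)) for i in range(n)]


def gain_sweep(omegas: List[float]) -> List[Tuple[float, float, float]]:
    """Return (omega, |G_HDV(j omega)|, |G_AV(j omega)|) for each frequency."""
    return [(w, abs(g_hdv(1j * w)), abs(g_av(1j * w))) for w in omegas]
```

Calling `gain_sweep(log_grid())` gives the magnitude response of both models over $ω = 0.01 \sim 10$, which is the raw material any frequency-domain stability criterion would be built on.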

Modular string stability metrics evaluate stability across sub-chains in mixed traffic systems. This framework decomposes the system into autonomous vehicle (AV) and human-driven vehicle (HDV) sub-chains, defining stability conditions for each segment and their interactions. The overall modular stability is defined as Equation (18):

S_{modular} = S_{AV} \cdot S_{HDV} \cdot S_{boundary}

(18)

where

$S_{AV}$ is the stability of the AV sub-chain, which is defined as the inverse of the AV sub-chain’s speed.
$S_{HDV}$ is the stability of the HDV sub-chain, which is defined as the inverse of the HDV sub-chain’s speed.
$S_{boundary}$ is the stability of the boundary between AV and HDV sub-chains, which is equal to the reciprocal of the speed of boundary vehicles (AV and HDV).

If $S_{modular} > 1$, the system is deemed stable, meaning the interactions between AV and HDV sub-chains do not compromise overall traffic stability [19,60].

Table 3 shows the stability metrics ($S_{AV}$, $S_{HDV}$, $S_{boundary}$, and $S_{modular}$) for the five scenarios, based on the averages of 50 test dataset results. Scenarios 1, 2, and 3 represent the CACC, BCM, and TD3-BCM models at 40% AV penetration. In Scenario 3, the TD3-BCM model outperforms the CACC and BCM models in $S_{AV}$ and $S_{HDV}$, with $S_{AV}$ increasing from 0.814 to 1.056 and $S_{HDV}$ from 0.870 to 1.076. Boundary stability ($S_{boundary}$) is also higher in TD3-BCM (1.098) compared to CACC (0.860) and BCM (1.015), indicating smoother transitions between AVs and HDVs and reduced disturbances in mixed traffic flow.

As AV penetration increases, the stability of the TD3-BCM model improves. In Scenario 4 (50% penetration), $S_{AV}$ and $S_{HDV}$ increase to 1.066 and 1.145, respectively, while $S_{boundary}$ reaches 1.153. In Scenario 5 (70% penetration), TD3-BCM shows even higher stability, particularly in modular stability ($S_{modular}$), which rises to 1.182. These results confirm that higher AV penetration improves the stability of both AV and HDV sub-chains, and the TD3-BCM model enhances the overall stability of the system.

As illustrated in Figure 8, the TD3-BCM model exhibits consistent improvements in all key stability indicators ($S_{AV}$, $S_{HDV}$, $S_{boundary}$, and $S_{modular}$) as AV penetration increases. The small standard deviations, reflected by narrow error bars, indicate stable and reliable simulation outcomes, further validating the effectiveness of TD3-BCM in improving mixed traffic stability, especially in high AV penetration scenarios.

4.2.2. Cumulative Damping Ratio

The cumulative damping ratio quantifies a vehicle’s capacity to attenuate traffic disturbances. This metric, originally proposed by Ploeg et al. [63], approximates the disturbance dissipation capability by computing the 2-norm of a finite-length acceleration sequence.

The cumulative damping ratio is calculated as follows:

D_{n} = \frac{\| a_{n} (t) \|_{2}}{\| a_{0} (t) \|_{2}} = \frac{\left( \sum_{t = 0}^{T} a_{n} (t)^{2} \right)^{\frac{1}{2}}}{\left( \sum_{t = 0}^{T} a_{0} (t)^{2} \right)^{\frac{1}{2}}}

(19)

where

$D_{n}$ represents cumulative damping ratio.
$a_{n} (t)$ represents the acceleration of the n-th vehicle at time t.
$a_{0} (t)$ represents the acceleration of the lead vehicle at time t.
T denotes the total simulation time.

The cumulative damping ratio quantifies the system’s ability to attenuate traffic disturbances over time. A value of $D_{n} > 1$ indicates amplification of disturbances, while $D_{n} < 1$ suggests effective suppression, contributing to greater system stability.
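Equation (19) reduces to a ratio of two Euclidean norms over sampled acceleration sequences. A minimal sketch, with hypothetical function names and assuming both sequences cover the same time window:

```python
import math
from typing import Sequence


def l2_norm(seq: Sequence[float]) -> float:
    """Euclidean norm of a finite sequence."""
    return math.sqrt(sum(x * x for x in seq))


def cumulative_damping_ratio(acc_n: Sequence[float],
                             acc_lead: Sequence[float]) -> float:
    """Equation (19): ratio of follower to lead-vehicle acceleration norms."""
    denom = l2_norm(acc_lead)
    if denom == 0.0:
        raise ValueError("lead vehicle acceleration is identically zero")
    return l2_norm(acc_n) / denom
```

If the n-th vehicle's accelerations are half those of the lead vehicle at every time step, the ratio is 0.5, meaning the disturbance has been halved by the time it reaches that vehicle.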

As illustrated in Figure 9, the comparison of Scenarios 1, 2, and 3 indicates that TD3-BCM performs better than the CACC and BCM models in improving traffic stability. With the increase in AV penetration, the cumulative damping ratio of TD3-BCM shows a significant decline, reaching its lowest point at the 30th vehicle, significantly enhancing overall stability. At the same time, in Scenarios 2, 3, 4, and 5, a pronounced drop is observed at the 12th and 22nd vehicles, suggesting that the AV sub-chain effectively absorbs localized disturbances. This phenomenon underscores the pivotal role of the AV sub-chain under bilateral control in enhancing system stability, with the TD3-based deep reinforcement learning model exhibiting superior performance.

Scenario 1: The ratio decreases from 0.534 at vehicle 2 to 0.338 at vehicle 30, but with noticeable fluctuations, indicating that the unidirectional feedback of CACC limits its ability to suppress disturbances, leading to persistent oscillations across the chain.
Scenario 2: The ratio drops more rapidly, reaching 0.060 at vehicle 30, demonstrating improved disturbance suppression with bidirectional feedback. However, residual oscillations persist due to the lack of adaptive adjustment, limiting overall stability.
Scenario 3: The damping ratio further reduces to 0.048 at vehicle 30, highlighting the effectiveness of TD3-BCM in dynamically mitigating disturbances and enhancing overall stability compared to traditional BCM.
Scenario 4: With an increased AV penetration rate, the ratio declines to 0.035 at vehicle 30, reflecting improved coordination among AVs and more efficient disturbance absorption across the vehicle chain.
Scenario 5: The damping ratio reaches its lowest level, below 0.030 at vehicle 30, indicating near elimination of traffic oscillations. This confirms that higher AV penetration combined with TD3-BCM maximizes stability and minimizes disturbance propagation.

Overall, the results demonstrate the superior capability of the TD3-BCM model to enhance stability under varying AV penetration rates, offering practical implications for the design of adaptive control mechanisms in mixed traffic systems.

4.2.3. Traffic Dynamics

Traffic dynamics across the five scenarios demonstrate significant stability improvements with increasing AV penetration, as evidenced by detailed analyses of position, speed, acceleration, and jerk metrics shown in Figure 10.

In Figure 10a (Vehicle Distance), the leading vehicle in all scenarios experiences external perturbations, generating traffic waves. In Scenario 1, disturbances persist throughout the vehicle chain due to the unidirectional feedback structure of CACC, resulting in pronounced fluctuations among trailing vehicles. In Scenario 2, the bidirectional control of BCM reduces disturbance propagation, but residual oscillations remain due to the lack of adaptive adjustments. In Scenarios 3–5, increasing AV penetration gradually smooths vehicle trajectories. Scenario 5 exhibits minimal positional disturbances, confirming the TD3-BCM’s capacity to absorb and suppress traffic oscillations.

Figure 10b (Speed Profiles) further supports this trend. Compared to CACC, BCM and TD3-BCM significantly reduce speed fluctuations. In Scenario 1, vehicle 30 exhibits speed oscillations ranging from 2.1 m/s to 9.7 m/s (amplitude: 7.6 m/s), demonstrating severe velocity fluctuations. In Scenario 2, the amplitude reduces to 5.7 m/s, showing improved stability. In Scenario 3, the fluctuation drops further to 3.6 m/s, while in Scenario 4 and Scenario 5, the speed variation narrows to 2.4 m/s and 1.2 m/s, respectively. These results confirm that increasing AV penetration, especially with TD3-BCM, effectively stabilizes velocity fluctuations and enhances traffic flow smoothness.

Figure 10c (Acceleration Profiles) illustrates the effects of different control strategies on acceleration variations. In Scenario 1, vehicle 30’s acceleration fluctuates between -1.75 m/s² and 1.14 m/s², indicating instability under unilateral control. In Scenario 2, the fluctuation range narrows to -1.06 m/s² to 0.89 m/s², reflecting improved stability with BCM. In Scenario 3, acceleration variations decrease to -0.58 m/s² to 0.42 m/s², while in Scenario 4 and Scenario 5, they further reduce to -0.37 m/s² to 0.25 m/s² and -0.21 m/s² to 0.18 m/s², respectively. In Scenario 5, vehicles 10–30 experience minimal acceleration variations, indicating enhanced system stability with higher AV penetration.

Figure 10d (Jerk Profiles) presents jerk profiles (the rate of change of acceleration), which directly affect driving comfort. In Scenario 1, vehicle 30’s jerk fluctuates significantly between -1.82 m/s³ and 1.94 m/s³, implying frequent abrupt acceleration and deceleration events, reducing driving comfort. In Scenario 2, jerk variation decreases to -1.23 m/s³ to 1.29 m/s³, showing some improvement. In Scenario 3, jerk variation further shrinks to -0.74 m/s³ to 0.82 m/s³, while in Scenario 4 and Scenario 5, it continues to decrease to -0.48 m/s³ to 0.57 m/s³ and -0.26 m/s³ to 0.31 m/s³, respectively. The reduction in jerk fluctuation confirms that higher AV penetration and TD3-BCM significantly improve ride comfort.

TD3-BCM consistently outperforms both CACC and BCM by suppressing traffic oscillations, stabilizing speed and acceleration variations, and enhancing driving comfort. Its effectiveness increases notably with higher AV penetration rates.

4.3. Performance Results

The performance of the mixed-vehicle chain is evaluated using three key metrics: efficiency, safety, and energy efficiency. These metrics serve as quantitative indicators to assess the impact of the TD3-BCM on traffic flow dynamics.

4.3.1. Efficiency

Driving efficiency is evaluated by analyzing inter-vehicle time gaps, which serve as a proxy for overall traffic flow performance. The comparison of the five scenarios demonstrates the impact of different control strategies on traffic efficiency, as shown in Figure 11:

In Scenario 1, using the CACC model, there is significant fluctuation in the time gaps, especially at the head of the vehicle chain, with a mean time gap of 2.64 s and a standard deviation of 1.44 s. This indicates that, due to the limitations of unidirectional control, the spacing between vehicles fluctuates considerably, leading to unstable traffic flow.
In Scenario 2, after the introduction of the BCM model with bidirectional feedback, fluctuations are reduced, and the mean time gap decreases to 2.1 s with a reduced standard deviation, indicating improved traffic stability.
In Scenario 3, with the introduction of TD3-BCM, the stability of time gaps improves further. The average time gap for vehicles 2 to 10 stabilizes around 2 s, with a standard deviation reduced to 0.34 s, ensuring smoother traffic flow.
In Scenario 4, as AV penetration increases, the mean time gap for vehicles beyond the 15th position decreases to 1.94 s, with a standard deviation below 0.2 s, highlighting the continued advantage of TD3-BCM in higher penetration scenarios, ensuring even more stable traffic flow.
In Scenario 5, with further increases in AV penetration, the mean time gap reaches its lowest level across the vehicle chain. The average time gap for mid- and tail-end vehicles remains between 1.84 s and 0.80 s, with a standard deviation falling below 0.036 s, demonstrating the superior performance of TD3-BCM in high AV penetration scenarios.

Comparing Scenarios 1, 2, and 3, it is clear that TD3-BCM outperforms both CACC and BCM in reducing time gap fluctuations and improving traffic flow stability. Particularly in Scenarios 2, 3, 4, and 5, as AV penetration increases, TD3-BCM achieves progressively smaller time gaps, indicating that TD3-BCM’s effectiveness continues to improve as AV penetration rises. It is important to note that in Scenarios 2, 3, 4, and 5, time gap oscillations are observed, which is mainly due to the fact that the time gaps in the autonomous vehicle sub-chain are smaller than those in the traditional IDM (Intelligent Driver Model) vehicle-following model. The autonomous vehicle sub-chain plays a key role in reducing time gaps and optimizing traffic flow, further enhancing the stability and efficiency of traffic.

4.3.2. Safety

Time-to-Collision (TTC) is a key metric for quantifying collision risk and assessing traffic safety. A higher TTC value indicates a longer time gap between vehicles, effectively reducing the collision risk and enhancing traffic safety [64]. It is defined as follows:

{TTC}_{n} (t) = \begin{cases} \frac{d_{n} (t)}{Δ v_{n} (t)}, & Δ v_{n} (t) < 0 \\ \infty, & Δ v_{n} (t) \geq 0 \end{cases}

(20)

where

$d_{n} (t)$ : Distance between the vehicle and its preceding vehicle at time t.
$Δ v_{n} (t)$ : Relative velocity between vehicles.
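A minimal sketch of the TTC computation is given below. To report TTC as a positive time, it divides the gap by the closing speed (follower speed minus leader speed) and returns infinity whenever the follower is not closing in. This sign convention and the function name are assumptions of the sketch.

```python
import math


def time_to_collision(gap: float, v_follower: float, v_leader: float) -> float:
    """TTC in seconds; infinite when the follower is not closing the gap."""
    closing_speed = v_follower - v_leader  # assumed convention: positive when closing
    if closing_speed <= 0.0:
        return math.inf
    return gap / closing_speed
```

A follower at 12 m/s that is 20 m behind a leader at 10 m/s has a TTC of 10 s, whereas reversing the two speeds gives an infinite TTC.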

Table 4 presents the mean TTC values and standard deviations for Scenarios 1–5 at thresholds of 1 s, 1.5 s, and 2 s (high collision risk thresholds) as well as 2.5 s and 3 s (low collision risk thresholds), reflecting the changes in collision risk and traffic flow stability under different models.

From Table 4, it can be seen that at the high collision risk thresholds of 1 s, 1.5 s, and 2 s, as well as the lower collision risk thresholds of 2.5 s and 3 s, the TTC values for the CACC (Scenario 1), BCM (Scenario 2), and TD3-BCM (Scenarios 3, 4, and 5) models are 0, indicating that at these higher collision risk thresholds, the relative safety between vehicles is high and no collision risk is detected. This is attributed to the design of minimal headway and the optimization of the TD3 reward function.

As the TTC threshold increases from 2 s to 3 s, the TTC values for all three models gradually increase, indicating that the collision response time is extended, thus improving safety. At a TTC of 1.5 s, the TTC value for the TD3-BCM (Scenario 3) model is 0.3924, while the TTC values for the CACC and BCM models are 0.2453 and 0.3186, respectively, showing that TD3-BCM has a higher collision response time, thus enhancing safety. Furthermore, the standard deviation for TD3-BCM (Scenario 3) at the 1.5-s TTC threshold is 0.5632, compared to 0.7341 for CACC (Scenario 1) and 0.6453 for BCM (Scenario 2), indicating improved system stability.

Additionally, the results for Scenarios 3, 4, and 5 show that as the penetration rate of autonomous vehicles (AVs) increases, the TTC value for TD3-BCM continues to increase, further improving the safety of the system. At a 3-s TTC threshold, the TTC value for TD3-BCM (Scenario 5) is 2.6951, significantly higher than the TTC value for Scenario 3 (TD3-BCM) at 1.8749, and also higher than the TTC value for Scenario 4 (TD3-BCM) at 2.1874, indicating that safety improves as the penetration rate increases. Meanwhile, the standard deviation decreases progressively, with the standard deviation for Scenario 5 being 0.3951, much lower than in other scenarios, indicating that as the penetration rate increases, the system stability is significantly enhanced.

4.3.3. Energy Savings

Smoother acceleration and deceleration profiles significantly reduce fuel consumption, particularly under high AV penetration conditions. Energy efficiency is assessed through a regression-based fuel consumption model [65], expressed as follows:

e_{n} (t) = exp [\sum_{i = 0}^{3} \sum_{j = 0}^{3} K_{i j} ({|v_{n} (t)|}^{i}) ({|a_{n} (t)|}^{j})]

(21)

where

$K_{i j}$ : Empirical regression coefficients derived from real-world data [66].
$a_{n} (t)$ : Acceleration of the n-th vehicle at time t.
$v_{n} (t)$ : Speed of the n-th vehicle at time t.

By analyzing the relationship between vehicle speed, acceleration, and fuel consumption, this model evaluates the impact of CACC, BCM, and TD3-BCM on energy efficiency, particularly the variation in the energy-saving effect of TD3-BCM as AV penetration gradually increases.
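Equation (21) can be evaluated directly once the 4 × 4 coefficient matrix $K_{i j}$ is available. The sketch below takes the matrix as an argument because the coefficient values are not listed in this section. The function name is hypothetical, and the matrices used in the usage example are placeholders, not the calibrated coefficients.

```python
import math
from typing import Sequence


def fuel_rate(v: float, a: float, K: Sequence[Sequence[float]]) -> float:
    """Instantaneous fuel rate from Equation (21) for a 4 x 4 matrix K."""
    exponent = sum(K[i][j] * abs(v) ** i * abs(a) ** j
                   for i in range(4) for j in range(4))
    return math.exp(exponent)


# Usage with placeholder coefficients (illustrative only).
K_zero = [[0.0] * 4 for _ in range(4)]
K_demo = [row[:] for row in K_zero]
K_demo[1][1] = 0.01  # single cross term in speed and acceleration
rate = fuel_rate(10.0, 2.0, K_demo)  # exp(0.01 * 10 * 2) = exp(0.2)
```

Because the polynomial sits inside an exponential, the predicted rate is always positive, and an all-zero matrix gives a rate of exactly 1.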

As shown in Figure 12, the variations in fuel consumption across different scenarios are as follows:

In Scenario 1, using the CACC model, fuel consumption remains the highest across the vehicle chain. The lead vehicle’s average consumption is 0.99 mL/s, with trailing vehicles showing significant fluctuations, and a standard deviation of 0.073 mL/s. This indicates substantial inefficiencies due to inconsistent driving behavior.
In Scenario 2, the introduction of bidirectional control with the BCM model reduces fluctuations, leading to a moderate decrease in fuel consumption. The fuel consumption of the mid- and tail-end vehicles stabilizes at approximately 0.88 mL/s, with a standard deviation below 0.05 mL/s, indicating an improvement in traffic efficiency.
In Scenario 3, after deploying TD3-BCM, fuel consumption further decreases and stabilizes at 0.85 mL/s, showing the advantage of TD3-BCM in optimizing fuel consumption.
In Scenario 4, with the increase in AV penetration, fuel consumption decreases further to approximately 0.82 mL/s, with a standard deviation below 0.04 mL/s. This shows that TD3-BCM continues to demonstrate better stability and lower fuel consumption in higher penetration scenarios.
In Scenario 5, with further increases in AV penetration, fuel consumption reaches its lowest level across the vehicle chain. The average consumption of mid- and tail-end vehicles remains between 0.78 mL/s and 0.80 mL/s, with a standard deviation falling below 0.036 mL/s, highlighting the superior performance of TD3-BCM in high AV penetration scenarios.

Comparing Scenarios 1, 2, and 3, the results demonstrate that TD3-BCM outperforms both CACC and BCM in reducing fuel consumption. In Scenarios 4 and 5, as AV penetration increases, TD3-BCM’s effectiveness continues to improve, further reducing fuel consumption. Additionally, in Scenarios 2, 3, 4, and 5, especially at vehicles 12 and 22, a significant reduction in time gaps is observed, indicating that the autonomous vehicle sub-chain with bilateral control plays a key role in enhancing traffic flow stability and minimizing disturbances, with TD3-BCM showing superior performance over BCM.

5. Conclusions and Discussion

5.1. Conclusions

This study investigates the coexistence of autonomous vehicles (AVs) and human-driven vehicles (HDVs) in mixed traffic environments and introduces an enhanced Bilateral Control Model, TD3-BCM, aimed at improving traffic system stability, efficiency, safety, and energy performance. The model builds on our previous research [25], extending it to mixed traffic flow and further optimizing the interaction between autonomous and human-driven vehicles. TD3-BCM enhances vehicle spacing control and collision response through a tailored TD3 network architecture and reward function.

Multiple analyses, including theoretical modeling, cumulative damping ratio assessment, and traffic dynamics evaluation, demonstrate that TD3-BCM significantly enhances system stability. With higher AV penetration rates, TD3-BCM effectively suppresses traffic fluctuations, reflected by an increase in modular stability from 1.132 to 1.182.

In terms of performance, TD3-BCM at a 40% AV penetration rate significantly outperforms both CACC (unilateral control) and traditional BCM (bilateral control) in traffic efficiency, safety, and energy efficiency. Specifically, TD3-BCM increases the average speed of trailing vehicles by 12.6%, reduces speed fluctuations by 36.2%, and shortens the average headway by 19.0%. At the same time, Time-to-Collision (TTC) improves by 87.3%, effectively reducing collision risks. Regarding energy efficiency, TD3-BCM reduces fuel consumption by 14.8% and decreases braking fluctuations by 41.5%. Moreover, as AV penetration increases, TD3-BCM exhibits progressively stronger optimization effects: traffic oscillations are nearly eliminated, fuel savings reach 19.7%, and inter-vehicle spacing is further reduced, demonstrating clear advantages in enhancing the safety and stability of mixed traffic flows.

This study makes several key contributions: (1) it introduces deep reinforcement learning (TD3) into the bilateral control framework (TD3-BCM), enabling dynamic adaptation in complex mixed traffic environments; (2) it quantifies the impact of autonomous vehicle (AV) penetration on traffic stability and establishes a critical stability threshold ($ϕ_{\min}$) for maintaining stable traffic flow; and (3) it offers a new theoretical perspective on the relationship between AV penetration and system stability. Practically, TD3-BCM improves traffic stability, reduces oscillations, and enhances fuel efficiency, making it well-suited for highways and urban roads. It also supports intelligent traffic management, optimizes road utilization, and promotes sustainable transportation, with the potential to alleviate congestion, reduce emissions, and improve safety.

5.2. Discussion

Although TD3-BCM performs well in numerical simulations, its practical applicability in real-world traffic systems warrants further investigation. First, the critical penetration rate threshold is likely influenced by infrastructure characteristics, driver behavior heterogeneity, and environmental variability, requiring calibration using broader real-world datasets. Second, this study centers on longitudinal control, whereas more complex behaviors—such as lane changes and intersection coordination—require further exploration to assess TD3-BCM’s adaptability to diverse and dynamic scenarios.

Moreover, future research should incorporate real-world vehicle testing to validate the effectiveness of TD3-BCM in various traffic environments. Such validation efforts will help refine reinforcement learning strategies, accelerating model convergence and enhancing computational efficiency. Additionally, multi-modal traffic control—integrating TD3-BCM with traffic signal coordination, lane assignment, and other optimization strategies—remains a promising direction for future research aimed at achieving comprehensive improvements in intelligent transportation systems.

This study is based on simplified traffic scenarios, and future research could explore the model’s applicability in more complex traffic environments, considering driver behaviors (e.g., lane changes) and real-world uncertainties (e.g., sensor noise, communication delays, and packet loss). Integrating these factors will help assess the robustness and scalability of TD3-BCM in addressing real-world challenges in mixed traffic systems. Furthermore, comparing the results with existing literature will provide clearer context for interpreting the findings and offer potential directions for improving the model in future iterations.

Author Contributions

Conceptualization, K.L. and Y.C.; methodology, K.L.; software, K.L. and Y.C.; validation, K.L., W.H. and Y.C.; formal analysis, K.L.; investigation, K.L.; resources, P.J.; data curation, K.L.; writing—original draft preparation, K.L.; writing—review and editing, W.H.; visualization, K.L.; supervision, P.J.; project administration, P.J.; funding acquisition, P.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Beijing Social Science Fund (21GLA010), the National Natural Science Foundation of China (52172301), the Youth Beijing Scholar Program (080), and the Beijing Xicheng District Outstanding Talent Program–Top Talent Team (202338).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

TD3	Twin Delayed Deep Deterministic Policy Gradient
BCM	Bilateral Control Model
CFM	Car-Following Model
AVs	Autonomous Vehicles
HDVs	Human-Driven Vehicles
TTC	Time-To-Collision
IDM	Intelligent Driver Model

References

[1] Chen, X.; Sun, D.; Li, Y. A future intelligent traffic system with mixed autonomous vehicles and human-driven vehicles. Inf. Sci. 2020, 529, 59–72.
[2] Guo, Q.; Ban, X.J.; Aziz, H.M.A. Mixed traffic flow of human driven vehicles and automated vehicles on dynamic transportation networks. Transp. Res. Part C Emerg. Technol. 2021, 128, 103159.
[3] Zhou, Y.; Wang, M.; Zhang, H. Study on mixed traffic of autonomous vehicles and human-driven vehicles with different cyber interaction approaches. Veh. Commun. 2022, 33, 100550.
[4] Li, Y.; Zhang, H.; Wang, M. Traffic breakdown probability estimation for mixed flow of autonomous vehicles and human driven vehicles. Sensors 2023, 23, 3486.
[5] Wang, J.; Li, K.; Wu, G. Optimizing mixed traffic flow: Longitudinal control of connected and automated vehicles to mitigate traffic oscillations. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4001–4012.
[6] Sun, M. A day-to-day dynamic model for mixed traffic flow of autonomous vehicles and inertial human-driven vehicles. Transp. Res. Part E Logist. Transp. Rev. 2023, 173, 103113.
[7] Zhang, Y.; Wang, X.; Li, Z. Car-following behavior of human-driven vehicles in mixed-flow traffic: A driving simulator study. Transp. Res. Rec. 2023, 2677, 1–12.
[8] Bang, S.; Ahn, S. Mixed traffic of connected and autonomous vehicles and human-driven vehicles: Traffic evolution and control using spring-mass-damper system. Transp. Res. Rec. 2019, 2673, 1–10.
[9] Ge, J.; Orosz, G. Connected cruise control design in mixed traffic flow consisting of human-driven and automated vehicles. Transp. Res. Part C Emerg. Technol. 2018, 95, 445–459.
[10] Wang, Y.; Jiang, Y.; Wu, Y.; Yao, Z. Cooperative driving in mixed-flow traffic of connected vehicles and human-driven vehicles: A state estimation approach. Expert Syst. Appl. 2023, 235, 121275.
[11] Yao, Z.; Luo, R.; Gu, Q.; Xu, T. Analysis of linear internal stability for mixed traffic flow of connected and automated vehicles considering multiple influencing factors. Phys. A Stat. Mech. Its Appl. 2022, 597, 127200.
[12] Li, K.; Zhang, Y.; Wang, X. A survey of lateral stability criterion and control application for autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2023, 24, 4567–4580.
[13] Pan, X.; Li, H.; Zhang, M. The impacts of connected autonomous vehicles on mixed traffic flow: A comprehensive review. Phys. A Stat. Mech. Its Appl. 2023, 635, 129454.
[14] Ding, H.; Pan, H.; Bai, H.; Zheng, X.; Chen, J. Driving strategy of connected and autonomous vehicles based on multiple preceding vehicles state estimation in mixed vehicular traffic. Phys. A Stat. Mech. Its Appl. 2022, 596, 127154.
[15] Sun, M.; Li, Y.; Wang, J. Trajectory planning and control of autonomous vehicles for static vehicle avoidance in dynamic traffic environments. IEEE Trans. Intell. Transp. Syst. 2023, 24, 1234–1245.
[16] Zhou, Y.; Wang, M.; Zhang, H. A survey on urban traffic control under mixed traffic environment with connected automated vehicles. Transp. Res. Part C Emerg. Technol. 2023, 145, 103902.
[17] Zhu, J.; Easa, S.; Gao, K. Merging control strategies of connected and autonomous vehicles at freeway on-ramps: A comprehensive review. J. Intell. Connect. Veh. 2022, 5, 15–30.
[18] Kim, S.; Lee, J.; Park, H. Active lane management and control using connected and automated vehicles in a mixed traffic environment. Transp. Res. Part C Emerg. Technol. 2022, 139, 103648.
[19] Zhao, C.; Yu, H.; Molnar, T.G. Safety-critical traffic control by connected automated vehicles. Transp. Res. Part C Emerg. Technol. 2023, 154, 104230.
[20] Ozioko, E.F.; Kunkel, J.; Stahl, F. Road Intersection Coordination Scheme for Mixed Traffic (Human-Driven and Driverless Vehicles): A Systematic Review. J. Adv. Transp. 2022, 2022, 2951999.
[21] Guo, L.; Jia, Y. Bilateral Adaptation of Longitudinal Control of Automated Vehicles and Human Drivers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 5663–5671.
[22] Shi, T.; Ai, Y.; ElSamadisy, O.; Abdulhai, B. Bilateral deep reinforcement learning approach for better-than-human car-following. In Proceedings of the IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 3986–3992.
[23] Xie, J.; Liu, Y.; Chen, N. Two-Sided Deep Reinforcement Learning for Dynamic Mobility-on-Demand Management with Mixed Autonomy. Transp. Sci. 2022, 56, 1123–1144.
[24] Poudel, B.; Li, W.; Li, S. Carl: Congestion-aware reinforcement learning for imitation-based perturbations in mixed traffic control. In Proceedings of the IEEE 14th International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Copenhagen, Denmark, 16–19 July 2024; pp. 7–14.
[25] Liu, K.; Jiao, P.; Hong, W.; Chen, Y. Bilateral Control Model for Autonomous Vehicles Based on Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2025, 26, 6216–6230.
[26] Treiber, M.; Hennecke, A.; Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 2000, 62, 1805.
[27] Kerner, B.S.; Lieu, H. The physics of traffic: Empirical freeway pattern features, engineering applications; and theory. Phys. Today 2005, 58, 54–56.
[28] Sugiyama, Y.; Fukui, M.; Kikuchi, M.; Hasebe, K.; Nakayama, A.; Nishinari, K.; Tadaki, S.; Yukawa, S. Traffic jams without bottlenecks—Experimental evidence for the physical mechanism of the formation of a jam. New J. Phys. 2008, 10, 033001.
[29] Zheng, Z.; Ahn, S.; Monsere, C.M. Impact of traffic oscillations on freeway crash occurrences. Accid. Anal. Prev. 2010, 42, 626–636.
[30] Li, X.; Cui, J.; An, S.; Parsafard, M. Stop-and-go traffic analysis: Theoretical properties, environmental impacts and oscillation mitigation. Transp. Res. Part B Methodol. 2014, 70, 319–339.
[31] Qin, Y.; Liu, M.; Hao, W. Energy-optimal car-following model for connected automated vehicles considering traffic flow stability. Energy 2024, 298, 131333.
[32] Jeon, C.M.; Amekudzi, A.; Guensler, R.L. Evaluating Transportation System Sustainability: Atlanta Metropolitan Region. Transp. Res. Rec. 2006, 1983, 10–17.
[33] Heckelmann, P.; Rinderknecht, S. Influence of an automated vehicle with predictive longitudinal control on mixed urban traffic using SUMO. World Electr. Veh. J. 2024, 15, 448.
[34] Hou, K.; Giannopoulos, G. Modeling the Deployment and Management of Large-Scale Autonomous Vehicle Circulation in Mixed Road Traffic Conditions Considering Virtual Track Theory. Future Transp. 2024, 4, 215–235.
[35] Li, P.; Liu, M.; Zhu, M.; Yao, M. Preemptive-Level-Based Cooperative Autonomous Vehicle Trajectory Optimization for Unsignalized Intersection with Mixed Traffic. Electronics 2025, 14, 71.
[36] Brackstone, M.; McDonald, M. Car-following: A historical review. Transp. Res. Part F Traffic Psychol. Behav. 1999, 2, 181–196.
[37] Kesting, A.; Treiber, M.; Helbing, D. Enhanced intelligent driver model to access the impact of driving strategies on traffic capacity. Philos. Trans. R. Soc. A 2010, 368, 4585–4605.
[38] Milanés, V.; Shladover, S.E. Cooperative adaptive cruise control: A state-of-the-art review. IEEE Trans. Intell. Veh. 2014, 1, 98–113.
[39] Horn, B.K.P. Suppressing traffic flow instabilities. In Proceedings of the 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), The Hague, The Netherlands, 6–9 October 2013; IEEE: The Hague, The Netherlands, 2013; pp. 13–20.
[40] Horn, B.K.P.; Wang, J. Wave equation based control of vehicular platoons. Transp. Res. Part B Methodol. 2018, 106, 340–360.
[41] Wang, J.; Wang, R.; Horn, B.K.P. Chain stability of a platoon with bidirectional control. Transp. Res. Part C Emerg. Technol. 2019, 100, 1–17.
[42] Wang, J.; Horn, B.K.P.; Wang, R. Multi-node bidirectional control for vehicle platooning. IEEE Trans. Intell. Transp. Syst. 2019, 20, 2262–2276.
[43] Wang, J.; Horn, B.K.P. Eigenvalue-based analysis of bidirectional platoon stability. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2211–2221.
[44] Wang, J.; Horn, B.K.P.; Wang, R. Mixed platoon stability analysis with bidirectional control. Transp. Res. Part C Emerg. Technol. 2019, 102, 1–14.
[45] Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1989.
[46] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
[47] Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.M.O.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D.P. Continuous Control with Deep Reinforcement Learning. U.S. Patent 10,776,692, 15 September 2020.
[48] Lin, Y.; McPhee, J.; Azad, N.L. Comparison of deep reinforcement learning and model predictive control for adaptive cruise control. IEEE Trans. Intell. Veh. 2020, 6, 221–231.
[49] Ernst, D.; Glavic, M.; Capitanescu, F.; Wehenkel, L. Reinforcement learning versus model predictive control: A comparison on a power system problem. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2008, 39, 517–529.
[50] Li, H.; Roncoli, C.; Ju, Y. A Helly model-based MPC control system for jam-absorption driving strategy against traffic waves in mixed traffic. Appl. Sci. 2024, 14, 1424.
[51] Wang, L.; Horn, B.K.P. On the stability analysis of mixed traffic with vehicles under car-following and bilateral control. IEEE Trans. Autom. Control 2019, 65, 3076–3083.
[52] Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
[53] Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proc. Int. Conf. Mach. Learn. 2018, 80, 1861–1870.
[54] Zhu, M.; Wang, Y.; Pu, Z.; Hu, J.; Wang, X.; Ke, R. Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving. Transp. Res. Part C Emerg. Technol. 2020, 117, 102662.
[55] Dong, J.; Wang, J.; Chen, L.; Gao, Z.; Luo, D. Effect of adaptive cruise control on mixed traffic flow: A comparison of constant time gap policy with variable time gap policy. J. Adv. Transp. 2021, 3745989.
[56] Kesting, A.; Treiber, M.; Helbing, D. General lane-changing model MOBIL for car-following models. Transp. Res. Rec. 2007, 86–94.
[57] Liu, Y.; Sun, W.; Xu, W.; Xiong, X.; Hao, L.; Qu, L. Multi-agent collaborative adaptive cruise control based on reinforcement learning. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; pp. 3388–3393.
[58] Montanino, M.; Punzo, V. Trajectory data reconstruction and simulation-based validation against macroscopic traffic patterns. Transp. Res. Part B Methodol. 2015, 80, 82–106.
[59] Jiang, L.; Xie, Y.; Evans, N.G.; Wen, X.; Li, T.; Chen, D. Reinforcement Learning based cooperative longitudinal control for reducing traffic oscillations and improving platoon stability. Transp. Res. Part C Emerg. Technol. 2022, 141, 103744.
[60] Li, Y.; Chen, S.; Ha, P.Y.J.; Dong, J.; Steinfeld, A.; Labi, S. Leveraging vehicle connectivity and autonomy to stabilize flow in mixed traffic conditions: Accounting for human-driven vehicle driver behavioral heterogeneity and perception-reaction time delay. Transp. Res. Part C Emerg. Technol. 2020, 121, 102890.
[61] He, Y.; Zhou, Q.; Wang, C.; Li, J.; Shuai, B.; Lei, L.; Xu, H. Microscopic modeling of car-following behavior: Developments and future directions. Int. J. Automot. Manuf. Mater. 2023, 2, 6.
[62] Németh, B.; Gáspár, P. LPV-based control design of vehicle platoon considering road inclinations. IFAC Proc. Vol. 2011, 44, 3837–3842.
[63] Ploeg, J.; Van De Wouw, N.; Nijmeijer, H. LP string stability of cascaded systems: Application to vehicle platooning. IEEE Trans. Control Syst. Technol. 2013, 22, 786–793.
[64] Minderhoud, M.M.; Bovy, P.H. Extended time-to-collision measures for road traffic safety assessment. Accid. Anal. Prev. 2001, 33, 89–97.
[65] Minocha, V.K.; Saini, G. Discussion of “Estimating Vehicle Fuel Consumption and Emissions Based on Instantaneous Speed and Acceleration Levels” by Kyoung Ahn, Hesham Rakha, Antonio Trani, and Michel Van Aerde. J. Transp. Eng. 2003, 129, 578–579.
[66] West, B.H.; McGill, R.N.; Hodgson, J.W.; Sluder, C.S.; Smith, D.E. Development of data-based light-duty modal emissions and fuel consumption models. SAE Trans. 1997, 106, 1274–1280.

Figure 1. Bilateral Control Framework.

Figure 2. Tail Vehicle (Actor 1 and Critic 1).

Figure 3. Middle Vehicles (Actor 2 and Critic 2).

Figure 4. A Sample Reconstructed Speed Curve.

Figure 5. Training Step Framework.

Figure 6. Sliding Reward Value Results.

Figure 8. Visualized Comparison of Stability Metrics.

Figure 9. Cumulative Damping Ratio Analysis.

Figure 10. Traffic Dynamics Across Mixed Scenarios.

Figure 11. Time Gap Results Across Scenarios.

Figure 12. Fuel Consumption Analysis Across Scenarios.

Table 1. Summary of Reward Components and Their Functions.

Component	Parameters	Functions
Stability Reward ( $r_{stability, n}$ )	$α_{1}$ , $α_{2}$	Reduces acceleration fluctuations.
Time Gap Penalty ( $r_{time\_gap, n}$ )	$β_{time\_gap}$	Penalizes insufficient time gaps.
Distance Penalty ( $r_{distance, n}$ )	$β_{distance}$	Penalizes unsafe following distances.
TTC Penalty ( $r_{TTC, n}$ )	$β_{TTC}$	Enhances collision avoidance.
Efficiency Reward ( $r_{efficiency, n}$ )	$β_{efficiency}$	Encourages target speed tracking.
Smoothness Reward ( $r_{smoothy, n}$ )	—	Improves control smoothness.

Table 2. Key Parameters and Their Configurations for the TD3-BCM Framework in Mixed Traffic Scenarios.

Parameter Name	Description	Value
$v_{n} (t)$	Speed of the vehicle at time t	Dynamic
$a_{n} (t)$	Acceleration of the vehicle	$[- 3, 3] m / s^{2}$
$d_{ahead, n} (t)$	Distance to the preceding vehicle	Dynamic
$d_{behind, n} (t)$	Distance to the following vehicle	Dynamic
$d_{safe}$	Minimum safe distance	$2 m$
$v_{desired}$	Target speed	$33.3 m / s$
$T_{safe}$	Minimum time gap ensuring safety	$0.6 s$
$k_{d}$	Distance feedback gain	$0.1 \sim 1.5 (0.8)$
$k_{v}$	Velocity feedback gain	$0.1 \sim 1.0 (0.5)$
$k_{c}$	Target velocity feedback gain	$0.1 \sim 0.8 (0.5)$
$k_{a}$	Acceleration feedback gain	$0.1 \sim 0.5 (0.2)$
$α_{1}, α_{2}$	Stability penalty weight	$0.4, 0.3$
$T_{react}$	Reaction time for vehicles	$0.4 s$
$a_{\max}$	Maximum deceleration	$3 m / s^{2}$
$β_{time\_gap}$	Penalty weight for unsafe time gaps	$0.5 \sim 1.5 (1.0)$
$β_{distance}$	Penalty weight for unsafe distances	$0.5 \sim 2.0 (1.0)$
$β_{TTC}$	Penalty weight for exceeding safe speed	$0.5 \sim 1.5 (1.0)$
$β_{efficiency}$	Penalty weight for efficiency deviations	$0.1 \sim 0.5 (0.3)$
$α_{actor}$	Actor network learning rate	$1 \times 10^{- 4}$
$α_{critic}$	Critic network learning rate	$2 \times 10^{- 4}$
$γ$	Discount factor	$0.95$
$τ$	Soft update rate	$0.001$
Replay Buffer Size	Experience replay buffer size	$20,000$ (tail), $80,000$ (mid)
Batch Size	Mini-batch size	64 (tail), 256 (mid)
$σ$	Standard deviation of target noise	$0.2$
Clipping Range	Clipping range for target noise	$[- 0.5, 0.5]$

This table provides a comprehensive summary of the key parameters and configurations used in the TD3-BCM framework, including dynamic parameters, feedback gains, and reward function weights. The values are determined based on simulation performance, prior discussions, and references to relevant studies.
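The TD3-specific entries in Table 2 (discount factor $γ$, soft update rate $τ$, target noise $σ$, and the noise clipping range) correspond to three standard TD3 mechanisms [52]: target policy smoothing, the clipped double-Q target, and soft target updates. The sketch below illustrates them in plain Python with the Table 2 values. It is not the authors' implementation: the function names and the flat parameter lists are hypothetical, and the action is assumed to be the acceleration bounded to $[-3, 3]$ m/s².

```python
import random

GAMMA = 0.95       # discount factor (Table 2)
TAU = 0.001        # soft update rate (Table 2)
SIGMA = 0.2        # standard deviation of target noise (Table 2)
NOISE_CLIP = 0.5   # clipping range for target noise (Table 2)
A_MIN, A_MAX = -3.0, 3.0  # acceleration bounds (Table 2); assumed action range


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(target_action, rng=random):
    """Target policy smoothing: add clipped Gaussian noise, then bound the action."""
    noise = clip(rng.gauss(0.0, SIGMA), -NOISE_CLIP, NOISE_CLIP)
    return clip(target_action + noise, A_MIN, A_MAX)


def td3_target(reward, done, q1_next, q2_next):
    """Clipped double-Q target: r + gamma * min(Q1', Q2'), or r if terminal."""
    return reward + (0.0 if done else GAMMA * min(q1_next, q2_next))


def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging of target parameters toward the online parameters."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

With $τ = 0.001$, each update moves a target parameter only 0.1% of the way toward its online counterpart, which is consistent with the slow, stable target tracking that TD3 relies on.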

Table 3. Stability Metrics for Different Scenarios.

Metric	Scenario 1	Scenario 2	Scenario 3	Scenario 4	Scenario 5
$S_{AV}$	0.814	0.920	1.056	1.066	1.130
$S_{HDV}$	0.870	0.967	1.076	1.145	1.121
$S_{boundary}$	0.860	1.015	1.098	1.153	1.143
$S_{modular}$	0.556	1.015	1.132	1.163	1.182
$S_{AV}$ (Std)	0.061	0.057	0.039	0.033	0.023
$S_{HDV}$ (Std)	0.085	0.068	0.067	0.056	0.043
$S_{boundary}$ (Std)	0.046	0.044	0.037	0.021	0.016
$S_{modular}$ (Std)	0.087	0.081	0.077	0.065	0.054
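The reported rise in modular stability from 1.132 to 1.182 can be expressed as a relative change. The short Python sketch below does this arithmetic with the Table 3 values, assuming (as described in Section 4.3.3) that Scenarios 3 to 5 are the TD3-BCM cases with increasing AV penetration. The helper name `relative_change_percent` is illustrative.

```python
# Mean modular stability from Table 3 for the three TD3-BCM scenarios.
S_MODULAR = {3: 1.132, 4: 1.163, 5: 1.182}
# Corresponding standard deviations from Table 3.
S_MODULAR_STD = {3: 0.077, 4: 0.065, 5: 0.054}


def relative_change_percent(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


gain = relative_change_percent(S_MODULAR[3], S_MODULAR[5])
std_drop = relative_change_percent(S_MODULAR_STD[3], S_MODULAR_STD[5])
print(f"S_modular change, Scenario 3 -> 5: {gain:.1f}%")
print(f"S_modular std change, Scenario 3 -> 5: {std_drop:.1f}%")
```

Under these assumptions, the mean of $S_{modular}$ rises by roughly 4.4% between Scenario 3 and Scenario 5, while its standard deviation falls by roughly 30%.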

Table 4. Comparison Table of TTC Indicators.

TTC Threshold	2.5 s	3 s
Scenario 1—Mean	0.2453	1.4385
Scenario 1—Std	0.7341	2.8119
Scenario 2—Mean	0.3186	1.6214
Scenario 2—Std	0.6453	2.7348
Scenario 3—Mean	0.3924	1.8749
Scenario 3—Std	0.5632	2.5231
Scenario 4—Mean	0.4657	2.1874
Scenario 4—Std	0.4712	2.3142
Scenario 5—Mean	0.5432	2.6951
Scenario 5—Std	0.3951	2.0713
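As an illustrative check of the trends in Table 4, the sketch below stores the table as a nested dictionary and tests whether the mean rises and the standard deviation falls from Scenario 1 to Scenario 5 at each threshold. The layout assumes that, in every row, the first value belongs to the 2.5 s column and the second to the 3 s column; the helper names are hypothetical.

```python
# Table 4 values: scenario -> threshold in seconds -> (mean, std).
TTC = {
    1: {2.5: (0.2453, 0.7341), 3.0: (1.4385, 2.8119)},
    2: {2.5: (0.3186, 0.6453), 3.0: (1.6214, 2.7348)},
    3: {2.5: (0.3924, 0.5632), 3.0: (1.8749, 2.5231)},
    4: {2.5: (0.4657, 0.4712), 3.0: (2.1874, 2.3142)},
    5: {2.5: (0.5432, 0.3951), 3.0: (2.6951, 2.0713)},
}


def column(threshold, stat):
    """Return one statistic ("mean" or "std") for Scenarios 1..5 at a threshold."""
    index = 0 if stat == "mean" else 1
    return [TTC[scenario][threshold][index] for scenario in sorted(TTC)]


def strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))


def strictly_decreasing(values):
    return all(a > b for a, b in zip(values, values[1:]))


for threshold in (2.5, 3.0):
    print(
        threshold,
        "mean increasing:", strictly_increasing(column(threshold, "mean")),
        "std decreasing:", strictly_decreasing(column(threshold, "std")),
    )
```

Keeping both thresholds in one structure makes it explicit which column a quoted figure comes from, for example 0.3951 (2.5 s) versus 2.0713 (3 s) for the Scenario 5 standard deviation.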

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

