Research on Speed Planning and Energy Management Strategy for Distributed-Drive Electric Vehicles Based on Deep Deterministic Policy Gradient Algorithm

Li, Ning; Lin, Yong; Huang, Zhongyuan; Hong, Yihao; Ning, Xiaobin

doi:10.3390/act15050248


by Ning Li ¹,*, Yong Lin ²,*, Zhongyuan Huang ³, Yihao Hong ⁴ and Xiaobin Ning ⁵

¹ School of Intelligent Manufacture, Taizhou University, Jiaojiang 318000, China

² Office of Scientific Research, Zhijiang College of Zhejiang University of Technology, Shaoxing 312030, China

³ School of Mechanical Engineering, Zhijiang College of Zhejiang University of Technology, Shaoxing 312030, China

⁴ Jiangxi Jingwei Hengrun Technology Co., Ltd., Nanchang 330220, China

⁵ School of Mechanical Engineering, Zhejiang University of Technology, Hangzhou 310014, China

* Authors to whom correspondence should be addressed.

Actuators 2026, 15(5), 248; https://doi.org/10.3390/act15050248

Submission received: 25 March 2026 / Revised: 17 April 2026 / Accepted: 20 April 2026 / Published: 30 April 2026

(This article belongs to the Section Control Systems)


Abstract

Fully leveraging the four-wheel independent drive characteristics of distributed-drive electric vehicles has become essential for enhancing their driving range. However, conventional regenerative braking strategies applied to such vehicles often fail to consider individual wheel slip ratios, which can easily lead to wheel lock and low energy recovery efficiency. To address these issues, this paper proposes a novel energy management method that integrates hybrid braking control with intelligent connected speed planning. A hierarchical control strategy for the hybrid braking system is first developed, explicitly accounting for the slip ratio of each wheel. The upper-level controller calculates the slip ratio for each wheel based on vehicle speed and wheel speed information and subsequently determines the braking torque distribution between the front and rear axles. The lower-level controller then allocates the motor braking torque and hydraulic braking torque to each wheel, subject to system constraints such as battery status and motor torque limits. Building on this framework, vehicle state and road information are incorporated as inputs to formulate a Markov decision process, which optimizes traffic efficiency, energy economy, and ride comfort as multiple objectives. The deep deterministic policy gradient (DDPG) algorithm is employed to achieve collaborative optimization of speed planning and energy management. Simulation results demonstrate that the proposed DDPG-based control strategy outperforms both rule-based control methods and classical dynamic programming algorithms in terms of comprehensive performance across traffic efficiency, energy consumption, and ride comfort. These findings validate its superiority in complex traffic conditions.

Keywords:

distributed drive; regenerative braking; speed planning; deep deterministic policy gradient

1. Introduction

During the operation of vehicles in complex urban traffic environments, frequent braking and deceleration occur, causing a substantial amount of mechanical energy to be converted into heat and dissipated, resulting in unnecessary energy waste [1]. Regenerative braking technology enables the recovery and reuse of kinetic energy during braking, thereby improving energy utilization efficiency and extending driving range [2,3,4]. In the past, regenerative braking technology was primarily applied to electric vehicles (EVs) with either front-axle or rear-axle centralized drive systems. In such configurations, it is not possible to account for the slip ratio of individual wheels during braking, which can lead to wheel lock-up and relatively low regenerative braking efficiency. In contrast, distributed four-wheel-drive EVs based on in-wheel motors allow independent control of each wheel’s driving force and offer advantages such as fast response and low latency [5,6]. Therefore, applying regenerative braking technology to distributed-drive electric vehicles (DDEVs) not only enables precise control of the braking force of each wheel but also allows the electric motors to participate more extensively in the braking process, thereby further enhancing regenerative braking efficiency.

Currently, research on regenerative braking technology for distributed four-wheel-drive EVs is relatively scarce and remains largely at the theoretical stage. Yang Lu et al. [7] designed a hierarchical hybrid braking control strategy based on braking-mode switching conditions to address the issue of energy recovery constraints during mode transitions in DDEVs. In this strategy, the upper-layer controller calculates the required braking torque based on the braking intensity and road adhesion coefficient and selects the corresponding braking mode. The lower-layer controller then distributes the hydraulic braking torque and motor braking torque to each wheel according to the current braking mode and corresponding control strategy. Zhu S.P. et al. [8], based on the practical requirements of electric vehicle hybrid braking systems, proposed a distributed parallel braking control strategy. This strategy adopts a hierarchical control structure, where the upper layer focuses on braking safety as its objective, and the lower layer targets energy recovery. Braking force distribution is performed by comprehensively considering constraints such as braking intensity, battery state-of-charge, and motor status, thereby verifying its braking performance and energy recovery effectiveness. To address the coordination control challenge between path tracking and regenerative braking systems in distributed-drive vehicles, Li Bo et al. [9] developed an intelligent distributed-drive coordinated control strategy adaptable to different road adhesion coefficients. The upper-layer controller employs a model predictive control method to perform rolling optimization and predictive solution of a multi-objective cost function, obtaining the desired front-wheel steering angle and total braking torque demand for the four wheels. 
The lower-layer controller then allocates hydraulic braking torque and motor braking torque to each wheel based on the demanded braking torque, thereby completing the entire hybrid braking process. To mitigate the braking instability that may arise when conventional braking force distribution strategies are applied to distributed-drive vehicles, He R. et al. [10] addressed the issue of speed asynchrony in coaxial in-wheel motors caused by variations in road surface and load during regenerative braking, proposing a circular coupling synchronization control strategy with a current compensation module. A speed controller and a compensation controller were designed based on nonsingular fast terminal sliding mode. Simulations verified that this strategy can effectively reduce the speed synchronization error between motors and enhance the stability of the braking process. Wang L. et al. [11] proposed an extension coordination torque allocation control method that evaluates vehicle stability based on extension theory, constructs an objective function balancing energy efficiency and tire stability margin, and adaptively determines the weighting coefficients. Co-simulation results show that this method significantly outperforms the energy-optimal strategy in terms of stability, while reducing energy consumption by 2.17% and 11.2% compared to the stability-priority strategy, thereby effectively balancing the two objectives. Cai G. S. et al. [12] proposed a safety region-based event-driven lateral stability control framework for DDEVs. By constructing a performance-driven driving event library, this study hierarchically coordinates active front steering, direct yaw moment control, and torque allocation under different driving events, and defines a novel stability region division criterion based on phase plane projection. 
Hardware-in-the-loop experimental results demonstrate that the proposed strategy improves yaw rate tracking accuracy by 20.81% and 27.06% under extreme conditions, respectively, while achieving energy savings of up to 26.41% compared with active front steering and direct yaw moment control strategies, thereby effectively balancing steerability, stability, and energy efficiency. To improve the regenerative braking energy efficiency of DDEVs, Techalimsakul P. et al. [13], addressing the issue of regenerative braking energy storage efficiency in pure electric vehicles, proposed a hybrid energy storage paradigm integrating supercapacitors and lithium-ion batteries. An artificial neural network was introduced for intelligent management of energy flow, and a three-phase inverter switching algorithm was employed to achieve fine regulation of braking force. The results indicated that this approach improved regenerative efficiency by 38.01%, significantly extending the vehicle’s driving range. Chen Z.Y. et al. [14], focusing on the impact of road gradient variations on regenerative braking energy recovery, proposed a regenerative braking control strategy based on the joint estimation of road gradient and vehicle mass. Neural networks and a least-squares algorithm were used for online estimation of gradient and mass, subsequently optimizing the braking force distribution between the front and rear motors and the hydraulic braking system; the simulation results demonstrated that under specific operating conditions, the energy recovery rate could be increased by up to 9.62%. Zhang X.D. et al. [15] proposed a torque distribution scheme for energy efficiency optimization; under traction conditions, traction force was distributed among the four motors with the objective of minimizing total power loss. Under braking conditions, variable braking torque distribution was performed based on the ideal braking force distribution curve and in compliance with ECE regulations. 
By combining offline optimization with online interpolation, the computational complexity was reduced. Simulation results showed that this scheme significantly improved overall vehicle efficiency and regenerative braking energy recovery. Jin L.Q. et al. [16], targeting a four-wheel-drive electric vehicle equipped with an electromechanical braking system, proposed an optimal torque distribution method for maximizing energy recovery, taking battery limitations into account. A coordinated control strategy for electromechanical composite braking was also designed; simulation results under the New European Driving Cycle (NEDC) and the Worldwide Harmonized Light Vehicles Test Cycle (WLTC) indicated that the energy recovery rate improved by 3% and 4%, respectively, while effectively suppressing braking jerk during mode transitions.

In summary, current research on hybrid regenerative braking technology for DDEVs has achieved considerable depth, yet the research objectives remain relatively singular, primarily focusing on either improving regenerative braking efficiency or ensuring vehicle braking safety. Studies addressing the integrated optimization of multiple objectives—such as traffic efficiency, energy consumption, and driving comfort—remain scarce. Furthermore, the optimization of regenerative braking efficiency is often based on rule-based deceleration strategies. Due to the lack of real-time adaptive adjustment capabilities in these optimization strategies, frequent start–stop events occur during vehicle operation, which not only compromises ride comfort but also introduces additional energy consumption, hindering the overall optimization of vehicle energy management. Reinforcement learning, however, excels in addressing multi-objective optimization problems and decision-making in uncertain and dynamic conditions, as well as in online planning. Therefore, this paper proposes a multi-objective speed planning and energy management method for signalized intersections based on reinforcement learning; leveraging the independent controllability of each wheel in distributed-drive vehicles, the method designs a composite braking system control strategy that accounts for individual wheel slip rates. A multi-objective Markov decision process model is established to balance travel efficiency, energy consumption, and ride comfort, and a DDPG algorithm is employed to achieve coordinated optimization of speed planning and energy management.

2. Establishment of a Simulation Model for a DDEV

2.1. Overall Architecture of the Hybrid Braking System for a DDEV

The hybrid braking system of a DDEV employs four in-wheel motors to independently control the torque output of each wheel. All four in-wheel motors are capable of regenerative braking and are controlled individually. Compared with conventional centralized drive hybrid braking systems, this configuration can significantly enhance energy recovery efficiency [17,18].

The overall architecture of the DDEV hybrid braking system designed in this paper is shown in Figure 1. The system primarily consists of three components: sensors, controllers, and actuators. The sensors mainly include wheel speed sensors and brake pedal sensors, which are used to acquire information such as braking signals and wheel speeds. The controllers primarily comprise the vehicle controller, hydraulic braking controller, motor controllers, and power-battery management system, which are responsible for the overall control of the hybrid braking system. The actuators mainly include four in-wheel motors, hydraulic brakes, and the power battery pack. During hybrid braking, the brake pedal opening is acquired through sensors to calculate the required braking torque. The motor controllers command the four in-wheel motors to apply torque opposing the wheel rotation, generating braking torque at each wheel; in this mode, the drive motors operate as generators. The hydraulic braking controller regulates the magnitude of hydraulic braking torque applied to the wheels based on input signals. The battery management system monitors the state of charge in real time and feeds it back to the vehicle controller, which uses it for overall control during the hybrid braking process.

2.2. Construction of the DDEV Model

2.2.1. Modeling of the In-Wheel Motor

The function of the motor in an EV is to provide torque based on the vehicle’s operating state, enabling acceleration or braking. For the modeling in this study, the motor model is simplified to a certain extent by neglecting the effects of inductance and damping, focusing solely on variations in output torque and rotational speed. The relationship between output torque and rotational speed is characterized by its external characteristic curve [19], as shown in Figure 2.

The expression for the motor output torque at different rotational speeds is given in Equation (1). When $n < n_{e}$, the motor operates in the constant-torque region, and the maximum torque it can provide is $T_{n}$; when $n \geq n_{e}$, it operates in the constant-power region, where the torque it can provide decreases as the rotational speed increases, exhibiting an inverse relationship between the two.

T_{\max} = \begin{cases} T_{n}, & n < n_{e} \\ \frac{9550 P_{n}}{n}, & n \geq n_{e} \end{cases}

(1)

where $T_{\max}$ is the maximum torque the motor can provide, N·m; $T_{n}$ represents the rated torque, N·m; $n$ is the current rotational speed of the motor, r/min; $n_{e}$ represents the rated rotational speed, r/min; and $P_{n}$ represents the rated power, kW.
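As a minimal sketch of Equation (1), the external characteristic can be written as a piecewise function. The function name and the numerical values below (20 kW rated power, 1000 r/min rated speed, 191 N·m rated torque) are illustrative assumptions, not parameters reported in this paper; they are chosen so that the two branches meet at the rated speed.

```python
def max_motor_torque(n_rpm: float, rated_torque_nm: float,
                     rated_power_kw: float, rated_speed_rpm: float) -> float:
    """Maximum available motor torque in N·m, following Equation (1)."""
    if n_rpm < rated_speed_rpm:
        # Constant-torque region: the rated torque is available.
        return rated_torque_nm
    # Constant-power region: torque falls inversely with speed.
    return 9550.0 * rated_power_kw / n_rpm


# Hypothetical motor data (assumed for illustration only).
RATED_TORQUE = 191.0   # N·m
RATED_POWER = 20.0     # kW
RATED_SPEED = 1000.0   # r/min

low_speed_torque = max_motor_torque(500.0, RATED_TORQUE, RATED_POWER, RATED_SPEED)
high_speed_torque = max_motor_torque(2000.0, RATED_TORQUE, RATED_POWER, RATED_SPEED)
```

With these assumed values, the torque available at twice the rated speed is half the rated torque, which is the inverse relationship described above.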

To account for the dynamic delay between the demanded torque and the actual torque during motor operation, a first-order inertial delay model is employed to simulate the dynamic delay characteristics of the motor. The expression is as follows [20]:

T_{m} = \min (T_{req}, T_{\max}) \frac{1}{t_{c} s + 1}

(2)

where $T_{m}$ is the actual output torque of the motor, N·m; $T_{req}$ is the demanded torque, N·m; $t_{c}$ is the time constant, which determines the response time of the motor; and $s$ is the Laplace variable.
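The first-order lag in Equation (2) can be illustrated in discrete time. The sketch below uses a forward-Euler discretization, which is an assumption made here for illustration (the paper builds the model in Simulink); the time constant, step size, and torque values are likewise hypothetical.

```python
def simulate_motor_lag(t_req: float, t_max: float, t_c: float,
                       dt: float, steps: int) -> list[float]:
    """Step response of T_m = min(T_req, T_max) / (t_c s + 1), forward Euler."""
    command = min(t_req, t_max)   # saturate the demand at the torque limit
    torque = 0.0                  # motor starts from zero output torque
    history = []
    for _ in range(steps):
        # dT_m/dt = (command - T_m) / t_c
        torque += dt / t_c * (command - torque)
        history.append(torque)
    return history


# Hypothetical values: 300 N·m demand, 191 N·m limit, 50 ms time constant.
response = simulate_motor_lag(t_req=300.0, t_max=191.0, t_c=0.05, dt=0.001, steps=1000)
```

The output rises monotonically toward the saturated command and reaches roughly 63% of it after one time constant, which is the expected behaviour of a first-order element.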

2.2.2. Modeling of the Power Battery

The Rint model is one of the commonly used equivalent models for power-battery modeling. This model is characterized by simple parameter identification and ease of implementation. Therefore, this paper models the power battery based on the Rint model. The expression is as follows [21]:

U_{b} = U_{o} - I_{b} R_{0}

(3)

E_{b} = \int_{0}^{t} |P_{b} (t)| dt = \int_{0}^{t} |U_{b} I_{b}| dt

(4)

where $U_{b}$ is the terminal voltage of the battery, V; $U_{o}$ is the open-circuit voltage of the battery, V; $I_{b}$ is the battery current, A; $R_{0}$ is the equivalent resistance, Ω; $P_{b}$ is the battery power, W; and $E_{b}$ is the energy change in the power battery, J.
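Equations (3) and (4) can be sketched numerically as follows. The sign convention (positive current for discharge), the rectangular integration rule, and all numerical values are assumptions introduced for this example; the open-circuit voltage is also held constant here, whereas in practice it varies with the state of charge.

```python
def terminal_voltage(u_oc: float, i_b: float, r0: float) -> float:
    """Equation (3): Rint terminal voltage (assumes i_b > 0 for discharge)."""
    return u_oc - i_b * r0


def energy_change(currents: list[float], u_oc: float, r0: float, dt: float) -> float:
    """Equation (4): integral of |U_b * I_b| over time, rectangular rule, in J."""
    energy = 0.0
    for i_b in currents:
        u_b = terminal_voltage(u_oc, i_b, r0)
        energy += abs(u_b * i_b) * dt
    return energy


# Hypothetical pack: 350 V open-circuit voltage, 0.1 ohm internal resistance,
# discharged at a constant 20 A for 10 s (sampled every 1 s).
e_b = energy_change([20.0] * 10, u_oc=350.0, r0=0.1, dt=1.0)
```

Because Equation (4) integrates the absolute value of the power, the same routine counts both discharged and recovered energy as positive contributions.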

2.2.3. Construction of the DDEV Model

In this study, a simulation model is constructed using CarSim v8.02 and Simulink R2023b. In CarSim, the vehicle body system, powertrain system, braking system, steering system, suspension system, and tire model are developed, while in Simulink, the in-wheel motor and control system models are built, as shown in Figure 3. During simulation, the vehicle model outputs the wheel speed signal ω_wheel, which is then converted into the reference speed ω_motor for the motor. The difference between the actual vehicle speed v_x output by the vehicle model and the target speed is fed to a PID controller, which outputs the load torque T_t for the drive motor. Based on the input reference speed ω_motor and load torque T_t, the drive motor computes and outputs the driving/braking torque T_e to the vehicle.

3. Design of the Hybrid Braking System Control Strategy for DDEVs

Leveraging the independent controllability of the four wheels in a distributed-drive electric vehicle, this paper designs a compound braking control strategy with four-wheel independent control that considers the slip ratio of each individual wheel. The strategy adopts a hierarchical structure. The upper-layer control strategy calculates the slip ratios of the four wheels $s_{fl}$, $s_{fr}$, $s_{rl}$, $s_{rr}$ based on the acquired vehicle speed $v$, braking intensity $z$, and each wheel’s rotational speed $ω_{fl}, ω_{fr}, ω_{rl}, ω_{rr}$, and compares them with a preset optimal slip ratio $s$. When the slip ratio of every wheel is below the optimal slip ratio, the control strategy switches to normal braking mode to distribute the braking torque to the front and rear axles, $T_{f}$ and $T_{r}$. When the slip ratio of any wheel exceeds the optimal slip ratio, the strategy switches to anti-lock braking mode to distribute the braking torque to the front and rear axles, $T_{f}$ and $T_{r}$. After receiving the front- and rear-axle braking torque information from the upper-layer control strategy, the lower-layer control strategy performs secondary distribution of the motor braking torque $T_{e\_front}$ and $T_{e\_rear}$ and the hydraulic braking torque $T_{m\_front}$ and $T_{m\_rear}$ for the front and rear axles, respectively. It comprehensively considers factors such as vehicle speed, braking intensity, motor braking torque, state of charge (SOC), and battery charging power as constraints for the lower-layer torque distribution to determine the braking torque to be provided by the motor. The objective of the lower-layer control strategy is to utilize motor braking as much as possible while ensuring braking safety, thereby recovering as much regenerative energy as possible, as shown in Figure 4.
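The mode selection performed by the upper layer can be sketched as follows. The paper does not state the slip ratio formula at this point, so the common braking definition (vehicle speed minus wheel circumferential speed, divided by vehicle speed) is assumed here; the wheel radius, the optimal slip ratio of 0.2, the low-speed guard, and all function names are hypothetical.

```python
def slip_ratio(v: float, omega: float, wheel_radius: float, v_min: float = 0.1) -> float:
    """Braking slip ratio (v - omega * r) / v; assumed definition, v in m/s, omega in rad/s."""
    if v < v_min:
        return 0.0  # guard against division by a near-zero vehicle speed
    return max(0.0, (v - omega * wheel_radius) / v)


def select_braking_mode(v: float, wheel_speeds: dict[str, float],
                        wheel_radius: float, s_opt: float) -> tuple[str, dict[str, float]]:
    """Return the braking mode and the per-wheel slip ratios."""
    slips = {name: slip_ratio(v, omega, wheel_radius)
             for name, omega in wheel_speeds.items()}
    # Anti-lock mode as soon as any wheel exceeds the optimal slip ratio;
    # a wheel exactly at the optimum is treated as normal braking here.
    mode = "anti_lock" if any(s > s_opt for s in slips.values()) else "normal"
    return mode, slips


# Hypothetical snapshot: 20 m/s vehicle speed, 0.3 m wheel radius.
speeds = {"fl": 65.0, "fr": 65.0, "rl": 64.0, "rr": 50.0}  # rad/s
mode, slips = select_braking_mode(20.0, speeds, wheel_radius=0.3, s_opt=0.2)
```

In this snapshot the rear-right wheel turns markedly slower than the others, so its slip ratio exceeds the assumed optimum and the anti-lock mode is selected.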

The control objective of the upper-layer control strategy is to maintain consistent slip ratios between the front and rear wheels during braking, thereby ensuring good braking stability. Although the control strategies differ between normal braking and anti-lock braking, the distribution of braking force between the front and rear axles in both cases follows the ideal braking force distribution principle [22,23]. The calculation formula is as follows:

\begin{array}{l} F_{xf} = G \cdot ϕ \cdot (l_{b} + z \cdot h_{g}) / L \\ F_{xr} = G \cdot ϕ \cdot (l_{a} - z \cdot h_{g}) / L \end{array}

(5)

where $F_{xf}$ is the braking force distributed to the front axle, N; $F_{xr}$ is the braking force distributed to the rear axle, N; $G$ is the vehicle weight, N; $z$ is the braking intensity; $ϕ$ is the road adhesion coefficient at the front and rear wheels; $h_{g}$ is the height of the vehicle’s center of mass, m; $l_{a}$ and $l_{b}$ are the distances from the center of mass to the front and rear axles, respectively, m; and $L = l_{a} + l_{b}$ is the wheelbase, m.
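A short numerical sketch of Equation (5) follows. The wheelbase is taken as the sum of the two center-of-mass distances, and the vehicle data (1500 kg mass, 1.2 m and 1.4 m axle distances, 0.5 m center-of-mass height) are assumptions chosen only to illustrate the load transfer toward the front axle.

```python
def ideal_axle_forces(weight_n: float, phi: float, z: float,
                      l_a: float, l_b: float, h_g: float) -> tuple[float, float]:
    """Front and rear axle braking forces in N, following Equation (5)."""
    wheelbase = l_a + l_b
    f_front = weight_n * phi * (l_b + z * h_g) / wheelbase
    f_rear = weight_n * phi * (l_a - z * h_g) / wheelbase
    return f_front, f_rear


# Hypothetical vehicle: 1500 kg, adhesion coefficient 0.8, braking intensity 0.5.
G = 1500.0 * 9.81
f_front, f_rear = ideal_axle_forces(G, phi=0.8, z=0.5, l_a=1.2, l_b=1.4, h_g=0.5)
total = f_front + f_rear
```

The two forces always sum to $G \cdot ϕ$, and increasing the braking intensity shifts force from the rear axle to the front axle through the $z \cdot h_{g}$ term.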

The lower-layer control strategy is primarily responsible for distributing the motor regenerative braking torque and hydraulic braking torque, comprehensively considering constraints such as the battery SOC, maximum motor torque, vehicle speed, braking intensity, and battery charging power, as follows:

(1): Battery SOC: The battery SOC should not be too high. If it exceeds a certain threshold, continued charging may lead to overcharging, which can reduce battery service life and, in severe cases, cause battery damage. Therefore, when the SOC value is greater than 95%, only hydraulic braking is employed, and the motor does not participate in braking. The constraint coefficient is defined as k₁:

k_{1} = \begin{cases} 1, & SOC \leq 95\% \\ 0, & SOC > 95\% \end{cases}

(6)

(2): Maximum motor torque: The braking torque of the in-wheel motor installed on each wheel cannot exceed the maximum motor torque. Therefore, the constraint coefficient is defined as k₂:

k_{2} = \begin{cases} 1, & T \leq T_{\max} \\ T_{\max} / T, & T > T_{\max} \end{cases}

(7)

(3): Vehicle speed: When the vehicle speed is below 5 km/h, the current generated by the motor during braking is low, resulting in low energy recovery efficiency. Moreover, motor braking is unstable at low speeds and can easily cause impact to the vehicle. Therefore, under such conditions, only hydraulic braking is performed, and the motor no longer participates. As the speed decreases from 10 km/h to 5 km/h, the motor braking torque is gradually reduced until it is completely withdrawn. The constraint coefficient k₃, with v in km/h, is defined as:

k_{3} = \begin{cases} 0, & v \leq 5 \\ 0.2 v - 1, & 5 < v < 10 \\ 1, & v \geq 10 \end{cases}

(8)

(4): Braking intensity: When the braking intensity exceeds 0.7, continued use of motor braking poses a safety risk. Therefore, motor braking is withdrawn, and only hydraulic braking is performed. The constraint coefficient is defined as k₄:

k_{4} = \begin{cases} 1, & z \leq 0.7 \\ 0, & z > 0.7 \end{cases}

(9)

(5): Battery charging power: The sum of the regenerative braking power generated by the four wheels of a DDEV should not exceed the maximum charging power of the power battery; otherwise, it may affect the service life of the power battery or even cause damage. The constraint coefficient is defined as k₅:

k_{5} = \begin{cases} 1, & P_{1} + P_{2} + P_{3} + P_{4} \leq P_{\max} \\ \frac{P_{\max}}{P_{1} + P_{2} + P_{3} + P_{4}}, & P_{1} + P_{2} + P_{3} + P_{4} > P_{\max} \end{cases}

(10)

By multiplying the aforementioned constraint coefficients with the torque distributed by the upper-layer control, the motor braking torque required for braking is obtained. The motor braking torque is given by:

\begin{cases} T_{e\_front} = k_{1} k_{2} k_{3} k_{4} k_{5} T_{f} \\ T_{e\_rear} = k_{1} k_{2} k_{3} k_{4} k_{5} T_{r} \end{cases}

(11)

The remaining portion is supplemented by the hydraulic braking torque, ultimately achieving the distribution between motor braking torque and hydraulic braking torque. The hydraulic braking torque is given by:

\begin{cases} T_{m\_front} = T_{f} - T_{e\_front} \\ T_{m\_rear} = T_{r} - T_{e\_rear} \end{cases}

(12)
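The lower-layer allocation in Equations (6)–(12) can be sketched as a product of constraint coefficients. Two points are assumptions made for this example: the low-speed coefficient is taken as 0 at or below 5 km/h, following the prose statement that the motor does not participate at low speed, and the torque $T$ in the motor-limit coefficient is treated as a separately supplied per-wheel motor torque demand. All numerical values and function names are hypothetical.

```python
def k1(soc: float) -> float:
    """SOC constraint: no regeneration above 95% (soc as a fraction)."""
    return 1.0 if soc <= 0.95 else 0.0

def k2(t_demand: float, t_max: float) -> float:
    """Motor torque limit: scale the demand down to the maximum torque."""
    return 1.0 if t_demand <= t_max else t_max / t_demand

def k3(v_kmh: float) -> float:
    """Speed constraint: linear fade-out between 10 and 5 km/h."""
    if v_kmh <= 5.0:
        return 0.0
    if v_kmh < 10.0:
        return 0.2 * v_kmh - 1.0
    return 1.0

def k4(z: float) -> float:
    """Braking intensity constraint: hydraulic only above 0.7."""
    return 1.0 if z <= 0.7 else 0.0

def k5(wheel_powers: list[float], p_max: float) -> float:
    """Charging power constraint over the four wheels."""
    total = sum(wheel_powers)
    return 1.0 if total <= p_max else p_max / total

def split_axle_torque(t_axle, soc, t_demand, t_max, v_kmh, z, wheel_powers, p_max):
    """Return (motor torque, hydraulic torque) for one axle, Eqs. (11)-(12)."""
    k = k1(soc) * k2(t_demand, t_max) * k3(v_kmh) * k4(z) * k5(wheel_powers, p_max)
    t_motor = k * t_axle
    return t_motor, t_axle - t_motor

# Hypothetical braking event at 7.5 km/h: half of the axle torque is regenerative.
t_motor, t_hydraulic = split_axle_torque(
    t_axle=400.0, soc=0.80, t_demand=100.0, t_max=191.0,
    v_kmh=7.5, z=0.3, wheel_powers=[5.0, 5.0, 5.0, 5.0], p_max=40.0)
```

Because the coefficients are multiplied, any single constraint that evaluates to zero (high SOC, very low speed, or emergency braking) hands the entire axle torque to the hydraulic brakes.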

4. Construction of a Signalized Intersection Road Model and Study of Traffic Operation Status

4.1. Construction of a Signalized Intersection Road Model

This study selects Liuhe Road in the Liuxia Subdistrict, Xihu District, Hangzhou City, Zhejiang Province, as the basis for constructing the simulation environment. It is a two-way road with a total length of approximately 5.5 km, 12 signalized intersections, and a maximum speed limit of 50 km/h.

The road model is built on the PreScan v8.5 software platform. The road is configured with three motor vehicle lanes, each with a width of 3.5 m. Twelve standard intersection structures are constructed, and the entire road is divided into 13 continuous segments, designated as Section1 to Section13. The length of each segment is set according to the actual road conditions, as shown in Table 1.

Additionally, traffic signal models are deployed at each of the 12 signalized intersections, and speed limit signs are installed on certain road segments to enhance the precision of traffic flow control in the simulation environment. The completed simulation road model of Liuhe Road is shown in Figure 5.

The phase information of the traffic signals (red and green lights) along the actual Liuhe Road was collected. Based on the collected data, the phase information for signal lights TL1~TL12 was set in the simulation model as listed in Table 2.

4.2. Study of Traffic Operation Status at Signalized Intersections

To investigate the problem of single-vehicle speed planning at intersections, the condition is defined as follows: The test vehicle travels at a speed of 12 m/s, located 200 m from the intersection, with no other vehicles ahead. The average signal cycle of the intersection is 110 s, comprising a red-phase duration of 57 s, a green-phase duration of 50 s, and a yellow-phase duration of 3 s. For safety considerations, the yellow phase is combined with the red phase.

Based on the spatiotemporal characteristics of a single vehicle traveling through an intersection, as well as the signal phase and the distance from the vehicle to the stop line, the vehicle’s driving modes are classified into four types: constant-speed passage, acceleration passage, deceleration passage, and deceleration stop. This paper does not conduct in-depth research on the constant-speed passage and deceleration stop conditions; instead, it focuses on the deceleration passage and acceleration passage conditions at intersections, as detailed below:

(1): Deceleration passage condition at a red light: When the test vehicle approaches the intersection, the traffic signal is red, with 20 s remaining in the red phase. If the vehicle maintains its current constant speed, it will arrive at the stop line before the red light ends and would need to stop and wait for the light to turn green before proceeding. To avoid stopping before the stop line, the test vehicle adopts a deceleration coasting strategy. This allows the vehicle to smoothly pass through the intersection after the red light ends and the green phase begins, thereby improving traffic efficiency, as shown in Figure 6.

(2): Acceleration passage condition at a green light: When the test vehicle approaches the intersection, the traffic signal is green, with 15 s remaining in the green phase. If the vehicle continues driving at its current speed, it will not be able to pass through the intersection before the green phase ends. Therefore, in order to pass through the intersection within the green light duration, the test vehicle needs to accelerate, enabling it to clear the intersection before the light changes, as shown in Figure 7.
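The two conditions above can be checked with simple kinematics using the stated values (12 m/s initial speed, 200 m to the stop line). The constant-acceleration profile below is a simplifying assumption introduced only for this worked example; it is not the speed profile produced by the proposed method.

```python
def arrival_time(distance_m: float, speed_mps: float) -> float:
    """Time to reach the stop line at constant speed."""
    return distance_m / speed_mps


def constant_accel_for_arrival(distance_m: float, v0_mps: float, t_target_s: float) -> float:
    """Acceleration a such that distance = v0 * t + 0.5 * a * t^2 at t = t_target."""
    return 2.0 * (distance_m - v0_mps * t_target_s) / t_target_s ** 2


# Constant-speed arrival: 200 m / 12 m/s, about 16.7 s.
t_const = arrival_time(200.0, 12.0)

# Red light with 20 s remaining: decelerate so that arrival coincides with green.
a_red = constant_accel_for_arrival(200.0, 12.0, 20.0)
v_red_end = 12.0 + a_red * 20.0

# Green light with 15 s remaining: accelerate to arrive before the phase ends.
a_green = constant_accel_for_arrival(200.0, 12.0, 15.0)
v_green_end = 12.0 + a_green * 15.0
```

At constant speed the vehicle arrives after about 16.7 s, which is earlier than the 20 s of remaining red and later than the 15 s of remaining green, consistent with the two cases described. Under the assumed constant-acceleration profile, the green-light case ends at roughly 14.7 m/s, slightly above the 50 km/h (about 13.9 m/s) limit, even though the required average speed of about 13.3 m/s is below it; a profile that accelerates earlier and then holds a speed at or below the limit would therefore be needed.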

5. Speed Planning Algorithm for DDEV

5.1. Speed Planning Algorithm Based on DDPG

5.1.1. Principles of the DDPG Algorithm

DDPG is a reinforcement learning algorithm based on policy gradients, combining deep learning with deterministic policy optimization methods to address reinforcement learning problems with continuous action spaces. Its core idea is to integrate the deterministic policy gradient method with deep neural networks, enabling effective exploration and optimization in high-dimensional state spaces and continuous action spaces [24,25,26]. The framework of its algorithm network is shown in Figure 8.

The basic procedure of the DDPG algorithm is as follows:

(1): Initialization: Initialize the Actor network and the Critic network, as well as their corresponding target networks. The parameters of the Actor and Critic networks are randomly initialized, while the parameters of the target networks are directly copied from their corresponding online networks. Initialize the experience replay buffer, which is used to store state–action–reward–next state (SARS) tuples.
(2): Exploration and learning: At each time step $t$, the agent selects an action $a_{t}$ based on the current state $s_{t}$ through the Actor network. To encourage exploration, DDPG introduces noise (typically Ornstein–Uhlenbeck noise or Gaussian noise) during action selection, achieving a balance between exploration and exploitation. Subsequently, the agent executes the action $a_{t}$, interacts with the environment, and obtains the next state $s_{t + 1}$ and the immediate reward $r_{t}$.
(3): Experience storage: The current interaction experience $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ is stored in the experience replay buffer.
(4): Sampling and training: A batch of data is randomly sampled from the experience replay buffer to update the Critic and Actor networks. The Critic network optimizes the Q-value function by minimizing the TD error, where the target Q-value is computed jointly by the target Actor network and the target Critic network. The Actor network optimizes the policy function by maximizing the Q-value output by the Critic network.
(5): Target network update: A soft-update method is adopted to smoothly update the target network parameters with a small step size $τ$: $θ_{target} \leftarrow τ \cdot θ_{eval} + (1 - τ) \cdot θ_{target}$. This update method enhances the stability of the training process and prevents drastic changes in target values.

Termination condition: When the preset number of iterations is reached or the performance metrics are satisfied, the training terminates and the learned Actor network (i.e., the final policy function) is returned. The pseudocode of the DDPG algorithm is shown in Algorithm 1.
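Three of the building blocks in the procedure above (experience storage, noisy action selection, and the soft target update) can be sketched with the Python standard library alone. The class and function names, the buffer capacity, the Gaussian noise scale, and the action bounds are illustrative assumptions; parameters are represented as plain lists of floats rather than neural network weights.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) tuples; oldest entries are dropped."""

    def __init__(self, capacity: int):
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next) -> None:
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size: int, rng: random.Random) -> list:
        # Uniform random sampling without replacement.
        return rng.sample(list(self.data), batch_size)

    def __len__(self) -> int:
        return len(self.data)


def noisy_action(mu: float, sigma: float, a_low: float, a_high: float,
                 rng: random.Random) -> float:
    """Deterministic action plus Gaussian exploration noise, clipped to the bounds."""
    return min(a_high, max(a_low, mu + rng.gauss(0.0, sigma)))


def soft_update(target: list[float], online: list[float], tau: float) -> list[float]:
    """theta_target <- tau * theta_eval + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


rng = random.Random(0)
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(step, 0.0, 1.0, step + 1)   # dummy transitions
batch = buffer.sample(2, rng)
action = noisy_action(mu=0.9, sigma=0.5, a_low=-1.0, a_high=1.0, rng=rng)
new_target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.1)
```

With a small step size, the target parameters move only a fraction of the way toward the online parameters at each update, which is what keeps the target Q-values from changing abruptly.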

Algorithm 1: Pseudocode of the DDPG algorithm

1. Input: Initialize policy parameters $θ$, action-value function parameters $φ$, and experience replay buffer $D$

2. Initialize target network parameters: $θ_{target} \leftarrow θ$, $φ_{target} \leftarrow φ$

3. For each episode:

4.     Observe state $s$, select action: $a = \mathrm{clip}(μ_{θ} (s) + ε, a_{low}, a_{high})$, where $ε \sim N$

5.     Execute action $a$ in the environment

6.     Observe next state $s'$, reward $r$, and termination flag $d$

7.     Store transition $(s, a, r, s', d)$ in the experience replay buffer $D$

8.     If the termination condition for $s'$ is met, reset the environment state

9.     If the experience replay buffer $D$ is full:

10.         For $t = 0, 1, \dots, M$:

11.             Randomly sample a mini-batch of transitions $B = \{(s, a, r, s', d)\}$ from $D$

12.             Compute the target values: $y (r, s', d) = r + γ (1 - d) Q_{φ_{target}} (s', μ_{θ_{target}} (s'))$, where $γ$ is the discount factor

13.             Update the action-value function using gradient descent: $\nabla_{φ} \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} {(Q_{φ} (s, a) - y (r, s', d))}^{2}$

14.             Update the policy using gradient ascent: $\nabla_{θ} \frac{1}{|B|} \sum_{s \in B} Q_{φ} (s, μ_{θ} (s))$

15.             Update the target network parameters (soft update with $ρ = 1 - τ$): $φ_{target} \leftarrow ρ \cdot φ_{target} + (1 - ρ) \cdot φ$, $θ_{target} \leftarrow ρ \cdot θ_{target} + (1 - ρ) \cdot θ$

16.         End For

17. End the loop until convergence
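One pass through the inner update of Algorithm 1 can be traced by hand with a deliberately small example. The sketch below replaces the neural networks with a linear actor $μ_{θ}(s) = θ s$ and a linear critic $Q_{φ}(s, a) = φ_{0} s + φ_{1} a$ so that the gradients can be written analytically; this simplification, the standard discounted TD target with discount factor `gamma`, and all learning rates and sample values are assumptions made for illustration.

```python
def actor(theta: float, s: float) -> float:
    """Linear deterministic policy (toy stand-in for the Actor network)."""
    return theta * s


def critic(phi: list[float], s: float, a: float) -> float:
    """Linear action-value function (toy stand-in for the Critic network)."""
    return phi[0] * s + phi[1] * a


def ddpg_update(theta, phi, theta_t, phi_t, batch, gamma, lr_q, lr_pi, rho):
    """One critic step, one actor step, and one soft target update."""
    n = len(batch)
    grad_phi = [0.0, 0.0]
    for s, a, r, s2, d in batch:
        # TD target from the target actor and target critic.
        y = r + gamma * (1.0 - d) * critic(phi_t, s2, actor(theta_t, s2))
        err = critic(phi, s, a) - y
        # Gradient of the mean squared error with respect to phi.
        grad_phi[0] += 2.0 * err * s / n
        grad_phi[1] += 2.0 * err * a / n
    phi = [p - lr_q * g for p, g in zip(phi, grad_phi)]          # gradient descent
    # d/dtheta Q(s, theta * s) = phi[1] * s for the linear models above.
    grad_theta = sum(phi[1] * s for s, *_ in batch) / n
    theta = theta + lr_pi * grad_theta                           # gradient ascent
    phi_t = [rho * pt + (1.0 - rho) * p for pt, p in zip(phi_t, phi)]
    theta_t = rho * theta_t + (1.0 - rho) * theta
    return theta, phi, theta_t, phi_t


# A single terminal transition (s, a, r, s', d); values are arbitrary.
batch = [(1.0, 0.5, 1.0, 0.0, 1.0)]
theta, phi, theta_t, phi_t = ddpg_update(
    theta=0.0, phi=[0.0, 0.0], theta_t=0.0, phi_t=[0.0, 0.0],
    batch=batch, gamma=0.99, lr_q=0.1, lr_pi=0.1, rho=0.9)
```

For the terminal transition used here the target reduces to the reward, so the critic parameters move toward predicting a value of 1, the actor parameter moves in the direction that increases the critic output, and the target parameters follow at one tenth of that distance.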

5.1.2. Speed Planning Algorithm Based on DDPG

Based on the research conditions described in the preceding section, the traffic-light model, vehicle model, and the definitions of state (S), action (A), and reward (R) constructed using the DDPG algorithm in this paper are as follows:

(1): Traffic-light model

At the beginning of each training episode, the current traffic-light phase is randomly selected from the 12 traffic-light phases given in Table 2, and the initial time $t_{cur_0}$ is randomly reset within the range of 0~111 s.

The recorded value of the traffic-light time $t_{rg_i}$ is defined as shown in Equation (13). If the current time exceeds a complete signal cycle, the current signal phase time is obtained by taking the modulo of the current time with respect to the signal cycle. If the current time has not yet reached a complete cycle, it is directly used as the recorded value of the signal-light time.

$$t_{rg_i} = \begin{cases} t_{cur_0} \bmod t_{cycle_i}, & (t_{cur_0} > t_{cycle_i}) \\ t_{cur_0}, & (t_{cur_0} \leq t_{cycle_i}) \end{cases}$$

(13)

where $t_{rg_i}$ is the recorded value of the current signal-light time, s; $t_{cur_0}$ is the randomly initialized time, s; and $t_{cycle_i}$ is the current signal-light cycle time, s.

The definitions of the remaining green-light time $t_{g\_remain_i}$ and remaining red-light time $t_{r\_remain_i}$ are given in Equations (14) and (15). If the recorded value falls within the green-light duration, the remaining green-light time is calculated as the green-light duration minus the recorded value, while the remaining red-light time is set to zero. Conversely, if the recorded value exceeds the green-light phase, it indicates that the current state is the red-light phase; in this case, the remaining green-light time is zero, and the remaining red-light time is calculated as the total signal cycle duration minus the current recorded value.

$$\begin{cases} t_{g\_remain_i} = t_{g\_phase_i} - t_{rg_i} \\ t_{r\_remain_i} = 0 \end{cases}, \quad (t_{rg_i} < t_{g\_phase_i})$$

(14)

$$\begin{cases} t_{g\_remain_i} = 0 \\ t_{r\_remain_i} = t_{cycle_i} - t_{rg_i} \end{cases}, \quad (t_{rg_i} \geq t_{g\_phase_i})$$

(15)

where $t_{g\_remain_i}$ is the remaining green-light time at the intersection, s; $t_{r\_remain_i}$ is the remaining red-light time at the intersection, s; and $t_{g\_phase_i}$ is the green-light phase duration at the intersection, s.
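The traffic-light bookkeeping of Equations (13) to (15) can be sketched as a single function. The sketch assumes, as Equations (14) and (15) imply, that each cycle starts with the green phase. The function name `signal_timing` and the 60 s cycle with 25 s of green in the usage lines are hypothetical values, not entries from Table 2.

```python
def signal_timing(t_cur0, t_cycle, t_g_phase):
    """Return (t_rg, remaining_green, remaining_red) in seconds."""
    # Eq. (13): wrap the initial time into the current cycle.
    t_rg = t_cur0 % t_cycle if t_cur0 > t_cycle else t_cur0
    if t_rg < t_g_phase:
        # Eq. (14): still in the green phase.
        return t_rg, t_g_phase - t_rg, 0.0
    # Eq. (15): in the red phase until the cycle ends.
    return t_rg, 0.0, t_cycle - t_rg


# Usage with a hypothetical 60 s cycle containing 25 s of green.
print(signal_timing(10, 60, 25))   # green phase
print(signal_timing(100, 60, 25))  # red phase after wrapping
```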

(2): Initial state of the vehicle

At the beginning of training, the vehicle enters the 200 m V2X communication range of the traffic light at a speed of 12 m/s. The initial state of the vehicle at this moment is defined as follows:

$$\begin{cases} x_{0} = 200 \\ v_{0} = 12 \\ a_{0} = 0 \end{cases}$$

(16)

where $x_{0}$ is the initial position of the vehicle, m; $v_{0}$ is the initial speed of the vehicle, m/s; and $a_{0}$ is the initial acceleration of the vehicle, m/s².

(3): Vehicle state space

The vehicle state space is defined as follows:

$$S_{t} = [x_{t}, v_{t}]$$

(17)

where $S_{t}$ is the current state space of the vehicle; $x_{t}$ is the current traveled distance of the vehicle, m; and $v_{t}$ is the current speed of the vehicle, m/s.

(4): Vehicle action space

The vehicle action space is defined as follows:

$$A_{t} = a_{t}$$

(18)

where $A_{t}$ is the current action space of the vehicle, and $a_{t}$ is the current acceleration of the vehicle, m/s².

(5): Training logic

The model adopts a training cycle with a step length of 1 s. In each iteration, the vehicle’s travel displacement is calculated based on the current vehicle speed and the step time. Simultaneously, the vehicle’s cumulative distance traveled, the current state of the traffic light, and its remaining duration are dynamically updated according to the logical rules of the traffic environment. The vehicle speed $v_{t}$, the travel distance per iteration $x_{t}$, and the total travel distance $d_{t}$ are computed as shown in Equations (19)–(21), where the factor 1 denotes the 1 s step length:

$$v_{t} = v_{t - 1} + a_{t}$$

(19)

$$x_{t} = 1 \cdot \frac{v_{t} + v_{t - 1}}{2}$$

(20)

$$d_{t} = \sum_{t = 0}^{n} x_{t}$$

(21)

where $v_{t - 1}$ is the vehicle speed at the previous time step, m/s; $x_{t}$ is the travel distance per iteration, m; and $d_{t}$ is the total travel distance of the vehicle at the current time step, m.
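One simulation step of Equations (19) to (21) can be written as a small update function. This is a sketch that keeps the cumulative distance as a running sum rather than re-evaluating the summation. The function name `step_kinematics` and the explicit `dt` argument are illustrative additions; the text fixes the step length at 1 s.

```python
def step_kinematics(v_prev, d_prev, a_t, dt=1.0):
    """Advance speed and distance by one step of length dt (seconds)."""
    v_t = v_prev + a_t * dt              # Eq. (19) with dt = 1 s
    x_t = dt * (v_t + v_prev) / 2.0      # Eq. (20): trapezoidal displacement
    d_t = d_prev + x_t                   # Eq. (21) as a running sum
    return v_t, x_t, d_t


# Usage: start at 12 m/s and decelerate at 2 m/s^2 for one step.
v, x, d = step_kinematics(v_prev=12.0, d_prev=0.0, a_t=-2.0)
```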

If the cumulative travel distance does not exceed 200 m, it indicates that the vehicle’s movement within the current step is feasible, and a corresponding reward is assigned for each successfully completed 1 s iteration. If, during the final iteration, the remaining distance is insufficient to support a full 1 s advance, the vehicle only travels to the traffic-light position, and this serves as the termination condition for the current training episode. The remaining travel distance within the last time step and the time taken for the final iteration are calculated using the following formulas:

$$dis_{remain} = 200 - (dis_{200} - dis)$$

(22)

$$t_{go\_green\_red} = \frac{- v_{0} + \sqrt{v_{0}^{2} + 2 a_{t} \cdot dis_{remain}}}{a_{t}}$$

(23)

where $dis_{remain}$ is the remaining travel distance within the last time step, m; $dis_{200}$ is the cumulative travel distance after completing the final step, m; $dis$ is the travel distance of the final step, m; and $t_{go\_green\_red}$ is the time taken for the final iteration, s.
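The final partial step of Equations (22) and (23) can be sketched as follows. Equation (23) divides by $a_{t}$, so the sketch adds a constant-speed fallback for $a_{t} = 0$ and returns `None` when the vehicle would stop before the stop line; both guards are assumptions added for numerical safety and are not described in the text. The argument `v_start` plays the role of $v_{0}$ in Equation (23), read here as the speed at the start of the final step.

```python
import math


def remaining_distance(dis_200, dis, total=200.0):
    """Eq. (22): distance left to the stop line at the start of the last step."""
    return total - (dis_200 - dis)


def final_step_time(v_start, a_t, dis_remain):
    """Eq. (23): time to cover dis_remain under constant acceleration a_t."""
    if abs(a_t) < 1e-9:
        # Assumed fallback: constant speed when acceleration is zero.
        return dis_remain / v_start if v_start > 0 else None
    disc = v_start ** 2 + 2.0 * a_t * dis_remain
    if disc < 0:
        # Assumed guard: the vehicle stops before reaching the stop line.
        return None
    return (-v_start + math.sqrt(disc)) / a_t


# Usage: the last full step would overshoot to 205 m after moving 11 m.
rem = remaining_distance(dis_200=205.0, dis=11.0)
t_last = final_step_time(v_start=10.0, a_t=2.0, dis_remain=rem)
```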

The time taken for the vehicle to arrive at the signalized intersection and its speed upon arrival are calculated as follows:

$$t_{cur} = t_{go\_green\_red}$$

(24)

$$v_{cur} = a_{t} \cdot t_{go\_green\_red}$$

(25)

where $t_{cur}$ is the time taken for the vehicle to arrive at the signalized intersection, s; and $v_{cur}$ is the speed of the vehicle upon arrival at the signalized intersection, m/s.

When the vehicle reaches the traffic-light intersection, the system first determines the current signal status. If the remaining time of the green light has expired or the red light is still ongoing at this moment, it is considered a violation of traffic signal rules, i.e., red-light running, and the current episode is immediately terminated while a new training episode is initiated. Conversely, if the green light still has remaining time or the red-light phase has already ended, it is deemed legal passage, and the current episode is similarly concluded to proceed to the next learning episode.

(6): Reward function and constraint conditions

The reward value $R_{1}$ for regenerative braking is defined as the difference between the battery state of charge at the current time step, $SOC_{t}$, and that at the previous time step, $SOC_{t - 1}$. This reward is designed to enhance regenerative braking efficiency and is calculated as follows:

$$R_{1} = SOC_{t} - SOC_{t - 1}$$

(26)

The reward value $R_{2}$ for the vehicle’s distance to the intersection stop line is defined as the ratio between the total travel distance of the vehicle $d_{t}$ and a constant. This reward is designed to improve traffic efficiency and is calculated as follows:

$$R_{2} = \frac{d_{t}}{200}$$

(27)

The reward value $R_{3}$ for driving comfort is defined as the absolute value of the reciprocal of the difference between the current vehicle acceleration $a_{t}$ and the vehicle acceleration at the previous time step $a_{t - 1}$. This reward is designed to improve driving comfort. When $a_{t} - a_{t - 1} = 0$, the value $R_{3} = 1$. Its calculation formula is as follows:

$$R_{3} = \begin{cases} \left| \frac{1}{a_{t} - a_{t - 1}} \right|, & a_{t} - a_{t - 1} \neq 0 \\ 1, & a_{t} - a_{t - 1} = 0 \end{cases}$$

(28)

where $R_{3}$ is the reward value for driving comfort and $a_{t - 1}$ is the vehicle acceleration at the previous time step, m/s².

Weight coefficients are assigned to the three different reward values, with higher priority given to energy consumption and traffic efficiency than to comfort. The weight for driving comfort is set as a fixed value $β_{3} = 0.2$, and the weights for energy consumption and traffic efficiency satisfy $β_{1} + β_{2} = 0.8$. By varying $β_{1}$ from 0.1 to 0.7 and applying a multi-objective weight tuning method, it is found that when $β_{1} = 0.4$, a well-balanced optimization effect can be achieved. Consequently, the weight coefficients corresponding to regenerative braking energy recovery, traffic efficiency, and driving comfort are defined as 0.4, 0.4, and 0.2, respectively. The total reward value is calculated as follows:

$$R = β_{1} R_{1} + β_{2} R_{2} + β_{3} R_{3}$$

(29)

These three reward functions correspond to distinct physical meanings and control objectives, and their numerical magnitudes differ significantly. To unify the evaluation scale, enhance training stability, and accelerate convergence, each reward is min-max normalized as shown in Equation (30):

$$x^{'} = \frac{x - \min (x)}{\max (x) - \min (x)}$$

(30)
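The reward terms of Equations (26) to (28), the normalization of Equation (30), and the weighted sum of Equation (29) can be combined in a short sketch. The section does not state the minimum and maximum used for each reward, so the `bounds` argument below is a hypothetical input that the caller must supply. The function names are also illustrative.

```python
def min_max(x, lo, hi):
    """Eq. (30): min-max normalization with caller-supplied bounds."""
    return (x - lo) / (hi - lo)


def comfort_reward(a_t, a_prev):
    """Eq. (28): reciprocal of the acceleration change, 1 when unchanged."""
    diff = a_t - a_prev
    return 1.0 if diff == 0 else abs(1.0 / diff)


def total_reward(soc_t, soc_prev, d_t, a_t, a_prev, bounds,
                 betas=(0.4, 0.4, 0.2)):
    """Eq. (29) applied to normalized rewards.

    bounds is a list of (min, max) pairs for R1, R2, R3; these values are
    assumptions, since the section does not list them.
    """
    raw = (
        soc_t - soc_prev,              # Eq. (26)
        d_t / 200.0,                   # Eq. (27)
        comfort_reward(a_t, a_prev),   # Eq. (28)
    )
    normalized = [min_max(r, lo, hi) for r, (lo, hi) in zip(raw, bounds)]
    return sum(b * n for b, n in zip(betas, normalized))


# Usage with hypothetical normalization bounds.
example_bounds = [(-0.001, 0.001), (0.0, 1.0), (0.0, 5.0)]
r = total_reward(0.5, 0.5, 100.0, 1.0, 0.0, example_bounds)
```

Note that $R_{3}$ grows without bound as the acceleration change approaches zero without reaching it, so the upper bound chosen for its normalization has a strong effect on the comfort term.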

In addition, the variables $d_{t}$, $v_{t}$, and $a_{t}$ must also satisfy the constraints specified in Equation (31):

$$\begin{cases} 0 \leq d_{t} \leq 200 \\ 0 \leq v_{t} \leq 12 \\ - 3 \leq a_{t} \leq 2 \end{cases}$$

(31)

After constructing the algorithm model, the parameters of the DDPG algorithm are configured in detail. The Actor network consists of an input layer, two fully connected layers, and an output layer. The input layer takes the state $s$ as its input. The first fully connected layer $L_{1}$ has 128 neurons, and the second fully connected layer $L_{2}$ has 200 neurons; both fully connected layers use ReLU as the activation function. The output layer outputs the action and uses Tanh as the activation function. Since the output range of Tanh is [−1, 1], its output value needs to be proportionally scaled to the final action output according to the range of the action space, as shown in Equation (32).

$$a_{f} = a_{\min} + \frac{a_{h} + 1}{2} (a_{\max} - a_{\min})$$

(32)

where $a_{h}$ is the action value output by Tanh, and $a_{f}$ is the final output acceleration.
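The affine scaling of Equation (32) is small enough to check by hand. The sketch below uses the acceleration limits of Equation (31), −3 and 2 m/s², as defaults; the function name `scale_action` is illustrative.

```python
def scale_action(a_h, a_min=-3.0, a_max=2.0):
    """Eq. (32): map a Tanh output in [-1, 1] to [a_min, a_max]."""
    return a_min + (a_h + 1.0) / 2.0 * (a_max - a_min)


# Usage: the two extremes map to the bounds, zero maps to the midpoint.
print(scale_action(-1.0), scale_action(0.0), scale_action(1.0))
```

Because the bounds are asymmetric, a Tanh output of zero corresponds to a mild deceleration of 0.5 m/s² rather than to zero acceleration.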

The Critic network consists of an input layer, two fully connected layers, and an output layer. The input layer takes the state $s$ and action $a$ as inputs. The number of neurons and the activation functions of the fully connected layers are the same as those in the Actor network. The output layer outputs the Q value.

During the training process, DDPG adds noise to the actions so that the agent explores the action space more thoroughly and is less likely to become trapped in local optima, which improves training effectiveness. Ornstein–Uhlenbeck noise is adopted for action exploration, with the noise mean set to 0, standard deviation set to 0.3, and decay rate set to 1 × 10⁻⁵. The relevant parameter settings of the DDPG algorithm are shown in Table 3.
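A discretized Ornstein–Uhlenbeck process with the stated mean (0), standard deviation (0.3), and decay rate (1 × 10⁻⁵) might look like the sketch below. The mean-reversion rate `theta=0.15`, the time step `dt`, and the choice to apply the decay multiplicatively to the standard deviation after every sample are assumptions; the section does not specify them.

```python
import math
import random


class OUNoise:
    """Temporally correlated exploration noise with a decaying scale."""

    def __init__(self, mu=0.0, sigma=0.3, decay=1e-5,
                 theta=0.15, dt=1.0, seed=None):
        self.mu, self.sigma, self.decay = mu, sigma, decay
        self.theta, self.dt = theta, dt      # theta and dt are assumed values
        self.rng = random.Random(seed)
        self.x = mu

    def sample(self):
        # Euler-Maruyama step: pull toward mu, then add Gaussian diffusion.
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        # Assumed decay rule: shrink sigma a little after every sample.
        self.sigma *= 1.0 - self.decay
        return self.x


# Usage: the noise is added to the actor output before clipping.
noise = OUNoise(seed=0)
samples = [noise.sample() for _ in range(1000)]
```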

During the training process, an episode is considered terminated under any of the following three conditions: the vehicle encounters a red light when passing through any signalized intersection, it violates the preset rules, or the maximum number of simulation steps per episode is reached. The condition for each training episode is randomly selected from the previously defined set of conditions.

5.1.3. Training Results of the DDPG Algorithm

The reward curve during the training process of the DDPG algorithm is shown in Figure 9, where the episode reward refers to the cumulative reward value of that episode, and the average reward refers to the average reward value of the 50 episodes preceding that episode. As can be seen from the figure, the reward value shows an upward trend with the increase in the number of training episodes. After 600 episodes, the average reward value has essentially converged, stabilizing at around 4500. At this point, it can be considered that the model has learned to use an appropriate strategy to control vehicle operation. The policy network obtained after the 786th episode, which corresponds to the highest average reward value during training, is selected as the final result of the DDPG algorithm.
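The average-reward curve and the choice of the best policy can be reproduced from a list of episode rewards, as in the sketch below. It follows the definition in the text, averaging the 50 episodes that precede each episode. For the first episode, which has no predecessors, the sketch falls back to that episode's own reward; this fallback and the function names are assumptions.

```python
def trailing_average(rewards, window=50):
    """Average of up to `window` episodes preceding each episode."""
    averages = []
    for i in range(len(rewards)):
        previous = rewards[max(0, i - window):i]
        # Assumed fallback for the first episode, which has no predecessors.
        averages.append(sum(previous) / len(previous) if previous else rewards[i])
    return averages


def best_episode(rewards, window=50):
    """Index of the episode with the highest trailing average reward."""
    averages = trailing_average(rewards, window)
    return max(range(len(averages)), key=averages.__getitem__)


# Usage on a toy reward history with a window of 2.
curve = trailing_average([1, 2, 3, 4], window=2)
```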

5.2. Speed Planning Algorithm Based on Dynamic Programming

To evaluate the effectiveness of the speed planning algorithm based on the DDPG algorithm, the dynamic programming (DP) algorithm is adopted as a benchmark for comparison. The DP algorithm was proposed by the American mathematician R. E. Bellman in the 1950s to solve multi-stage decision problems. It decomposes a multi-stage problem into multiple single-stage subproblems and solves them step by step, thereby obtaining the optimal solution to the overall problem [27,28]. Although the DP algorithm can achieve a globally optimal solution, it requires prior knowledge of the entire driving cycle and involves substantial computational effort, making it difficult to implement in real time. Therefore, the results obtained by DP are often used as benchmarks and references for comparing and evaluating the performance of other optimization algorithms and strategies. The solution principle of the DP algorithm is illustrated in Figure 10, with the specific steps as follows:

(1): Determine the state variable $x$ and the control variable $u$ of the problem to be solved, and discretize them within their feasible regions (boundary ranges).
(2): Determine the state transition equation $f$ and design the single-step cost function based on the control objective.
(3): Perform backward solving; starting from the final stage, traverse all state variables at each stage and the outcomes under all possible corresponding control variables. Determine the state of the previous stage based on the state transition equation and update the cost function $J$ until the initial state of the initial stage is reached.
(4): Perform forward solving; starting from the initial state of the initial stage, determine the optimal control variable for each stage by minimizing the cost function, thereby obtaining the optimal control sequence. Based on the state transition equation, obtain the state variables at each stage under the action of the optimal control sequence.
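Steps (1) to (4) can be illustrated on a deliberately tiny problem. The sketch below is a generic backward and forward pass over a discrete state grid, not the distance-domain speed planner itself. The function name `solve_dp` and the toy problem (integer states 0 to 4, controls −1, 0, 1, a quadratic control cost, and a terminal penalty for missing state 3) are hypothetical.

```python
def solve_dp(states, controls, f, stage_cost, terminal_cost, n_stages, x0):
    """Backward value iteration followed by a forward rollout."""
    cost_to_go = {x: terminal_cost(x) for x in states}
    policy = [None] * n_stages
    # Backward pass: from the final stage to the initial stage.
    for k in range(n_stages - 1, -1, -1):
        new_cost, best_u = {}, {}
        for x in states:
            new_cost[x], best_u[x] = float("inf"), None
            for u in controls:
                x_next = f(x, u)
                if x_next not in cost_to_go:
                    continue  # transition leaves the discretized grid
                c = stage_cost(x, u) + cost_to_go[x_next]
                if c < new_cost[x]:
                    new_cost[x], best_u[x] = c, u
        cost_to_go, policy[k] = new_cost, best_u
    # Forward pass: apply the stored optimal control at each stage.
    x, u_seq, x_seq = x0, [], [x0]
    for k in range(n_stages):
        u = policy[k][x]
        if u is None:
            break
        x = f(x, u)
        u_seq.append(u)
        x_seq.append(x)
    return cost_to_go[x0], u_seq, x_seq


total, u_seq, x_seq = solve_dp(
    states=range(5), controls=(-1, 0, 1),
    f=lambda x, u: x + u,
    stage_cost=lambda x, u: u * u,
    terminal_cost=lambda x: 10 * (x - 3) ** 2,
    n_stages=3, x0=0,
)
```

The nested loops over stages, states, and controls make the cost of the backward pass grow with the product of the three grid sizes, which is the computational burden that limits real-time use of DP.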

In this paper, the speed planning for passing through a signalized intersection is solved through discretization in the distance domain. Consistent with the DDPG algorithm, the vehicle travel distance $x$ and vehicle speed $v$ are selected as the state variables, denoted as $S = [x, v]$, and the vehicle acceleration $a$ is selected as the action variable, denoted as $A = a$. The system state transition equation is given as follows:

$$\begin{cases} x (k + 1) = x (k) + Δ x \\ v (k + 1) = \sqrt{v {(k)}^{2} + 2 a (k) Δ x} \end{cases}$$

(33)

where $k$ is the stage index, $N$ is the total number of stages, and $Δ x$ is the discrete step size in the distance domain.
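The distance-domain transition of Equation (33) can be sketched as a function that advances the state by one step $Δ x$. The check for a negative radicand, which would mean the vehicle stops within the step, is an assumption added so that the function can flag infeasible controls; the function name `transition` is illustrative.

```python
import math


def transition(x, v, a, dx):
    """Eq. (33): advance position by dx and update speed under acceleration a."""
    v_squared = v * v + 2.0 * a * dx
    if v_squared < 0:
        # Assumed guard: the deceleration would bring the vehicle to rest
        # before it covers dx, so this control is infeasible at this state.
        return None
    return x + dx, math.sqrt(v_squared)


# Usage: 3 m/s with 2 m/s^2 over a 4 m step gives 5 m/s.
state = transition(x=0.0, v=3.0, a=2.0, dx=4.0)
```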

Taking the vehicle’s total power demand $P_{cost}$ as the economic objective function, travel time $T_{cost}$ as the indicator of traffic efficiency, and the absolute value of acceleration $C_{cost}$ as the measure of comfort, the optimal control problem of speed planning can be formulated as follows:

$$\min J = \sum_{k = 0}^{N - 1} (ω_{1} P_{cost} (k) + ω_{2} T_{cost} (k) + ω_{3} C_{cost} (k))$$

(34)

In which

$$\begin{cases} P_{cost} (k) = \frac{P (k) - P_{\min}}{P_{\max} - P_{\min}} \\ T_{cost} (k) = \frac{Δ t (k) - Δ t_{\min}}{Δ t_{\max} - Δ t_{\min}} \\ C_{cost} (k) = \left| \frac{a (k)}{a_{\max}} \right| \end{cases}$$

(35)

where $P_{cost} (k)$, $T_{cost} (k)$, and $C_{cost} (k)$ are the cost functions of the $k$-th stage for economy, traffic efficiency, and comfort, respectively; $ω_{1}$, $ω_{2}$, and $ω_{3}$ are the weights of the respective objective functions, consistent with the values used in the DDPG algorithm; $P_{\max}$ and $P_{\min}$ are, respectively, the approximate maximum and minimum total power demand per unit distance domain statistically calculated under the current driving conditions (kWh/km); $Δ t_{\max}$ and $Δ t_{\min}$ are the maximum and minimum travel time per unit distance domain statistically obtained under the current driving conditions (s); and $a_{\max}$ is the maximum acceleration (m/s²), the value of which is consistent with that used in the DDPG algorithm.
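The per-stage cost inside the sum of Equation (34), with the normalized terms of Equation (35), can be sketched as below. The default weights (0.4, 0.4, 0.2) follow the statement that the weights match those used for DDPG. The normalization bounds are passed in by the caller because the section describes them only as statistics of the driving condition; the numbers in the usage line are hypothetical.

```python
def dp_stage_cost(p, dt, a, p_min, p_max, dt_min, dt_max, a_max,
                  weights=(0.4, 0.4, 0.2)):
    """Weighted single-stage cost from Eqs. (34) and (35)."""
    p_cost = (p - p_min) / (p_max - p_min)        # economy term
    t_cost = (dt - dt_min) / (dt_max - dt_min)    # traffic-efficiency term
    c_cost = abs(a / a_max)                       # comfort term
    w1, w2, w3 = weights
    return w1 * p_cost + w2 * t_cost + w3 * c_cost


# Usage with hypothetical bounds: every term sits at its midpoint.
j_k = dp_stage_cost(p=5.0, dt=0.5, a=1.0,
                    p_min=0.0, p_max=10.0,
                    dt_min=0.0, dt_max=1.0, a_max=2.0)
```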

5.3. Rule-Based Speed Planning Algorithm

Similarly, to evaluate the effectiveness of the DDPG-based speed planning algorithm, a rule-based speed planning model was constructed. The rule-based speed planning algorithm centers on rule logic, enabling the selection of traffic strategies under different traffic-light states by establishing a series of judgment criteria that reflect driving behavior. Its characteristics are simple computation and fast response. However, since it is based on traffic regulations, road environment, and driving experience rather than relying on complex model training or extensive data learning, it is difficult for it to adapt to complex conditions and it has poor scalability [29,30]. The workflow of rule-based speed planning is shown in Figure 11.

First, the controller, based on information provided by the V2X communication module, acquires in real time relevant parameters such as the current traffic-light state $S_{state}$ (red or green), the remaining time of the current phase $T_{remain}$, and the signal cycle $T_{cycle}$. Simultaneously, the system collects the vehicle’s current speed $v$, acceleration $a$, and the distance $d$ between the vehicle and the intersection stop line. Using this information, the controller estimates the time $t_{arrive}$ required for the vehicle to reach the intersection without accelerating or decelerating under the current state, providing a basis for subsequent traffic decisions.

Building on this, the speed planning controller makes decisions based on different traffic conditions. If the estimated time of arrival for the vehicle falls within the current signal cycle, it is considered that the vehicle meets the conditions for passage during the current cycle. At this point, the controller calculates the target time required to complete the passage, which must satisfy $t_{arrive} < T_{remain}$, and further determines whether the current vehicle speed meets this requirement. If the target speed is higher than the current speed, the controller outputs an acceleration passage strategy; otherwise, it maintains a constant-speed state to ensure the vehicle passes through the intersection smoothly.

If the estimated time of arrival for the vehicle does not fall within the current signal cycle, it is determined that the vehicle cannot maintain passage under the current state. The controller first assesses whether the vehicle can potentially pass through by appropriately decelerating. If the conditions are met, a comfortable deceleration strategy is output; if not, the system enters a buffer control state. The controller then evaluates whether a safe stop can be achieved based on the remaining distance and current kinetic energy. If a safe stop can be completed before the stop line, it further determines whether the current red light is about to end. If so, it prepares in advance for a “second acceleration” to smoothly enter the next green-light cycle; if not, it decelerates to a low speed or comes to a complete stop and enters a waiting state. If a safe stop cannot be completed before the stop line, an emergency braking strategy is executed to ensure driving safety.

Ultimately, the speed planning controller outputs key control variables such as the desired speed $v_{ref}$ and the recommended acceleration $a_{ref}$, and transmits them to the lower-layer energy management module or execution system, thereby achieving planning and control for passage through urban intersections.
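The decision flow described above can be approximated by a short rule table. The sketch below is one loose reading of the prose, not the logic of Figure 11: it returns a strategy label rather than $v_{ref}$ and $a_{ref}$, and every threshold (`v_max`, `v_min`, `a_brake_max`, `t_soon`) is a hypothetical value chosen for illustration.

```python
import math


def rule_based_strategy(light, t_remain, d, v,
                        v_max=16.0, v_min=3.0, a_brake_max=3.0, t_soon=3.0):
    """Pick a passage strategy from signal state and vehicle state."""
    t_arrive = d / v if v > 0 else math.inf   # constant-speed arrival time
    if light == "green":
        if t_arrive < t_remain:
            return "cruise"
        # Required average speed to arrive before the green phase ends.
        if t_remain > 0 and d / t_remain <= v_max:
            return "accelerate"
    else:
        if t_arrive > t_remain:
            return "cruise"                   # red ends before arrival
        # Slower average speed that arrives as the red phase ends.
        if t_remain > 0 and d / t_remain >= v_min:
            return "comfortable_deceleration"
    # Buffer control: check whether a stop before the line is possible.
    stop_distance = v * v / (2.0 * a_brake_max)
    if stop_distance > d:
        return "emergency_braking"
    if light == "red" and t_remain <= t_soon:
        return "prepare_second_acceleration"
    return "decelerate_and_wait"


# Usage: 200 m from the stop line at 12 m/s with 20 s of red remaining.
print(rule_based_strategy("red", 20.0, 200.0, 12.0))
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive the outcome is to each preset value, which reflects the limited adaptability attributed to rule-based planning in the text.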

6. Analysis of Speed Planning Results for DDEVs at Signalized Intersections

6.1. Traffic Efficiency and Comfort Analysis

6.1.1. Analysis of Red-Light Deceleration Passage Condition at Signalized Intersections

Traffic efficiency is primarily reflected in the time required for a vehicle to pass through an intersection, while ride comfort is generally measured by the root mean square value of longitudinal acceleration during travel, $a_{w}$, calculated as shown in Equation (36). The larger this value, the more drastic the change in longitudinal speed, and thus the poorer the comfort; conversely, the smaller the value, the better the comfort.

$$a_{w} = {\left[ \frac{1}{T} \int_{0}^{T} a_{w}^{2} (t) \, d t \right]}^{\frac{1}{2}}$$

(36)
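For a sampled acceleration trace, the integral in Equation (36) reduces to a sum. The sketch below assumes uniformly spaced samples and a rectangle-rule approximation of the integral; the function name `rms_acceleration` and the sample values are illustrative.

```python
import math


def rms_acceleration(samples, dt=1.0):
    """Discrete form of Eq. (36) for uniformly sampled acceleration (m/s^2)."""
    total_time = dt * len(samples)
    integral = sum(a * a for a in samples) * dt   # rectangle rule
    return math.sqrt(integral / total_time)


# Usage: alternating +1 and -1 m/s^2 has an RMS value of 1 m/s^2.
a_w = rms_acceleration([1.0, -1.0, 1.0, -1.0])
```

With uniform sampling the step length cancels, so the result equals the square root of the mean squared acceleration.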

The simulation results for the deceleration passage condition at an intersection with a red light are shown in Figure 12. Within the time range of 0 to 35 s, the DDPG algorithm adopts a smooth deceleration strategy, first reducing the speed to 8 m/s and then gradually accelerating back to the initial speed. This ensures that the vehicle arrives at and passes through the intersection exactly at the end of the 20 s red-light phase, while ensuring it does not cross the intersection during the red-light period. The rule-based planning algorithm first uniformly decelerates to 8 m/s within 20 s and then accelerates back to the initial speed, ensuring arrival and passage at the end of the 20 s red-light phase. The DP algorithm, through phased optimization, rapidly decelerates to 10 m/s within 3 s and maintains a constant speed, ensuring arrival and passage at the end of the 20 s red-light phase, before gradually returning to the initial speed. As can be seen from Figure 12a, all three control strategies successfully navigate the deceleration passage condition at the red-light intersection. Among them, the DDPG algorithm arrives at the intersection earliest at the end of the 20 s red-light phase, resulting in the shortest passage time and the highest traffic efficiency, followed by the rule-based planning algorithm and the DP algorithm.

Figure 13 shows the variation curves of vehicle acceleration over time under the deceleration passage condition at a red-light intersection for three different speed planning algorithms. Calculations indicate that the root mean square value of longitudinal acceleration $a_{w}$ for the DDPG control method is 0.243 m/s², for the rule-based control method it is 0.22 m/s², and for the DP control method it reaches 0.746 m/s². This demonstrates that the rule-based method provides the best ride comfort, followed by the DDPG algorithm, with the DP control algorithm exhibiting the poorest comfort.

6.1.2. Analysis of Green-Light Acceleration Passage Condition at Signalized Intersections

The simulation results for the acceleration passage condition at an intersection with a green light are shown in Figure 14. The DDPG algorithm, through dynamic adjustment strategies, first increases the vehicle speed to a peak of 16.2 m/s at around 10 s, then gradually decelerates back to the initial speed. The vehicle arrives at the intersection at 13 s and passes through within the green-light phase. The rule-based planning algorithm, through phased optimization, rapidly accelerates to a peak of 14.8 m/s within 2 s, then maintains a constant speed, followed by gradual uniform deceleration back to the initial speed at 15 s. The vehicle arrives at the intersection at 14 s and passes through within the green-light phase. The DP algorithm first uniformly accelerates the vehicle to a peak of 17.3 m/s within 15 s, then decelerates back to the initial speed, with the vehicle arriving at the intersection at 14 s and passing through within the green-light phase. It can be observed that all three speed planning methods successfully navigate the acceleration passage condition. Among them, the DDPG algorithm achieves the shortest passage time and the highest traffic efficiency, followed by the rule-based planning algorithm and the DP algorithm.

Figure 15 shows the variation curves of vehicle acceleration over time under the acceleration passage condition at a green-light intersection for the three different speed planning algorithms. Calculations indicate that the root mean square value of longitudinal acceleration $a_{w}$ for the DDPG method is 0.425 m/s², for the rule-based control method it is 0.361 m/s², and for the DP control method it reaches 0.765 m/s². This demonstrates that the rule-based algorithm provides the best driving comfort, followed by the DDPG algorithm, with the DP control algorithm exhibiting the poorest comfort. Notably, the value for the DP control method is significantly higher than the other methods. The reason for this, upon analysis, is the presence of large acceleration fluctuations during speed planning, which adversely affects ride comfort.

In summary, under both the deceleration passage condition at a red light and the acceleration passage condition at a green light, the DDPG algorithm achieves the highest traffic efficiency while reasonably balancing comfort. The rule-based algorithm offers the best comfort but results in lower traffic efficiency. The DP algorithm, on the other hand, exhibits both low traffic efficiency and poor comfort.

6.2. Energy Consumption Economy Analysis of Speed Planning at Signalized Intersections

To investigate the influence of different speed planning algorithms on the dynamic characteristics of motor torque and energy consumption optimization at intersections, a study was conducted on the variation of motor torque over time under the three planning algorithms. Figure 16 shows the variation curves of motor torque over time under different operating conditions, and Table 4 presents the energy consumption results for the different speed planning algorithms when passing through the intersection. It can be observed that, under the deceleration passage condition at a red light, the rule-based algorithm achieves the lowest energy consumption, with the DDPG algorithm performing similarly, while the DP algorithm exhibits the highest energy consumption. Analysis suggests that this is because the rule-based control algorithm consistently operates with uniform deceleration or acceleration, resulting in smaller torque fluctuations and thus the lowest energy consumption. The DDPG algorithm shows slightly larger torque fluctuations compared to the rule-based method, leading to marginally higher energy consumption. In contrast, the DP algorithm involves rapid acceleration and deceleration, causing significant torque fluctuations and consequently the highest energy consumption. Similarly, under the acceleration passage condition at a green light, the DDPG algorithm achieves the lowest energy consumption due to its minimal torque fluctuations. The rule-based algorithm follows, while the DP algorithm, with the largest torque fluctuations, exhibits the highest energy consumption.

By analyzing the traffic efficiency, energy consumption, and driving comfort of the three planning methods, it can be concluded that the DDPG algorithm achieves the highest traffic efficiency, maintains relatively low overall energy consumption, and effectively balances ride comfort. The rule-based method demonstrates average performance in terms of traffic efficiency and energy consumption, offers good comfort, but suffers from poor dynamic adaptability due to the rigidity of its preset logic. The DP algorithm can respond quickly in deterministic conditions; however, the presence of rapid acceleration and deceleration results in poor comfort and energy consumption, with only average traffic efficiency.

In summary, the continuous action space of the DDPG algorithm enables it to output a smooth acceleration sequence, avoiding the stepwise control of rule-based strategies and the discretized abrupt changes of DP. The deterministic policy combined with a decaying OU noise allows the thorough exploration of various acceleration and deceleration behaviors in the early training stage, while converging stably in the later stage. Moreover, the Critic network optimizes the long-term cumulative reward, allowing the agent to adaptively balance traffic efficiency, energy consumption, and comfort. In the red-light deceleration scenario, DDPG achieves the shortest passage time at the cost of slightly higher energy consumption and slightly lower comfort. In the green-light acceleration scenario, DDPG simultaneously achieves the lowest energy consumption and the shortest passage time. Therefore, the DDPG-based speed planning algorithm exhibits the best overall performance when considering traffic efficiency, energy consumption, and driving comfort.

7. Conclusions

This paper, addressing the technical characteristics of four-wheel independent control in DDEVs, proposes a collaborative optimization strategy for speed planning and energy management that integrates hybrid braking control with deep reinforcement learning. The main research conclusions are as follows:

(1): A hierarchical control framework for the hybrid braking system of DDEVs was constructed, considering individual wheel slip ratio. The upper-layer control strategy decides on different braking control modes and distributes braking torque between the front and rear axles based on wheel speed, vehicle speed, and brake pedal information. The lower-layer control strategy distributes motor braking torque and hydraulic braking torque for each wheel according to the front and rear axle braking torque information input from the upper layer, along with constraints such as battery and motor torque, achieving safe and efficient distribution of braking torque.
(2): A multi-objective Markov decision process model was established, integrating vehicle state, road geometry information, and traffic signal phase. With traffic efficiency, energy consumption economy, and ride comfort as the comprehensive reward function, an integrated intelligent speed planning and energy management strategy based on DDPG was designed. It overcomes the technical bottleneck of the difficulty in collaboratively balancing multiple objectives and achieves adaptive speed regulation in dynamic intersection conditions.
(3): Verified through simulations of typical conditions—deceleration at a red light and acceleration at a green light at signalized intersections—the proposed DDPG strategy, compared to traditional rule-based control strategies and the classic DP algorithm, achieves optimal traffic efficiency while maintaining energy consumption within an excellent range. Simultaneously, it effectively suppresses acceleration fluctuations caused by rapid acceleration and deceleration, balancing ride comfort. It achieves an optimal multi-objective balance among traffic efficiency, energy consumption, and comfort, fully validating the effectiveness and superiority of the proposed strategy.

Although this study has achieved certain results in the field of multi-objective speed planning and energy management for distributed-drive electric vehicles, the use of a simplified first-order inertial delay motor model and the Rint equivalent battery model may lead to an overestimation of regenerative energy recovery efficiency and SOC maintenance capability in practical applications. Future work will enhance the robustness of the strategy through refined modeling and real-vehicle road tests.

Furthermore, recently proposed algorithms such as Twin-Delayed DDPG (TD3) [31] and Soft Actor–Critic (SAC) [27] have shown promising theoretical properties in mitigating overestimation bias and improving training stability. In the next step, the performance of DDPG, TD3, and SAC within the proposed framework will be systematically compared. Additionally, mechanisms including adaptive entropy adjustment, resilience against cyberattacks [32], and control barrier functions for safety will be considered to further improve the generalization capability of the proposed approach.

Author Contributions

Conceptualization, Y.L. and Z.H.; software, Y.H.; writing—original draft preparation, N.L.; writing—review and editing, X.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by general scientific research projects of the Zhejiang Provincial Department of Education, grant number Y202352247.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Author Yihao Hong was employed by the company Jiangxi Jingwei Hengrun Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DDEV	Distributed-Drive Electric Vehicle
DDPG	Deep Deterministic Policy Gradient
SOC	State of Charge
DP	Dynamic Programming
EV	Electric Vehicle

References

1. Park, Y.; Park, S.; Ahn, C. Performance Potential of Regenerative Braking Energy Recovery of Autonomous Electric Vehicles. Int. J. Control. Autom. Syst. 2023, 21, 1442–1454.
2. Deepa, M.K.; Sridharan, S.; Subramanian, S.C. Energy Efficiency Improvement Framework for Regenerative Braking System in Electric Vehicles. IEEE Trans. Transp. Electrif. 2026, 12, 1994–2008.
3. Mazouzi, A.; Hadroug, N.; Hafaifa, A.; Iratni, A.; Colak, I. Particle swarm optimization of fuzzy logic-based energy management system for enhanced efficiency in fuel cell hybrid electric vehicles. Sustain. Comput. Inform. Syst. 2025, 48, 101239.
4. Li, N.; Huang, Z.Y.; Wang, C.P.; Ning, X. Particle Swarm Optimization and Fuzzy Logic Co-Optimization for Energy Efficiency Cooperative Energy Management Strategy of Hybrid Energy Storage Electric Vehicles. World Electr. Veh. J. 2026, 17, 73.
5. Xu, S.W.; Li, J.Q.; Zhang, X.P.; Song, J.; Zeng, X. Research on Composite Braking Control Strategy of Four-Wheel-Drive Electric Vehicles with Multiple Motors Based on Braking Energy Recovery Optimization. IEEE Access 2023, 11, 110151–110163.
6. Ge, S.S.; Li, Q.H.; Xie, Z.Q.; Zhang, Z. Research on torque distribution control strategy of distributed-drive electric vehicles for large-slopes with low-adhesion. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2025, 1–20.
7. Yang, L.; Tan, D. Study on hybrid brake control of distributed drive electric vehicle. Mech. Sci. Technol. Aerosp. Eng. 2021, 40, 619–626. (In Chinese)
8. Zhu, S.P.; Jiang, X.D.; Wang, Y.R.; Ye, X.Y.; Yu, G.; Xu, L.F. Research on Parallel Braking Control of Distributed Four-Wheel-Drive Electric Vehicle. Qiche Gongcheng Automot. Eng. 2020, 42, 1506–1512, 1544.
9. Li, B.; Pan, P.; Shen, H.Y.; L, J.; Li, L.; Liu, Z. Investigation of Cooperative Control Strategy for Path Tracking and Braking Energy Recovery for Intelligent Distributed-Drive Vehicles. China J. Highw. Transp. 2022, 35, 292–304. (In Chinese)
10. He, R.; Xie, Y.K. Research on the Synchronization Control Strategy of Regenerative Braking of Distributed Drive Electric Vehicles. World Electr. Veh. J. 2024, 15, 512.
11. Wang, L.; Shu, Q.X.; Zhou, D.S.; Ti, Y. Extenics Coordinated Torque Distribution Control for Distributed Drive Electric Vehicles Considering Stability and Energy Efficiency. Actuators 2025, 15, 3.
12. Cai, G.S.; Yin, G.D.; Pi, D.W.; Zhuang, W.; Feng, J.; Ren, Y.; Ding, H. Safety Region-Based Event-Driven Lateral Stability Control for DDEVs with Energy Conservation. IEEE Trans. Transp. Electrif. 2025, 11, 13976–13989.
13. Techalimsakul, P.; Keyoonwong, W. Integrated Vehicle-Following Control for Four-Wheel Independent Drive Based on Regenerative Braking System Control Mechanism for Battery Electric Vehicle Conversion Driven by PMSM 30 kW. Energies 2024, 17, 2576.
14. Chen, Z.Y.; Xiong, R.; Cai, X.; Wang, Z.; Yang, R. Regenerative Braking Control Strategy for Distributed Drive Electric Vehicles Based on Slope and Mass Co-Estimation. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14610–14619.
15. Zhang, X.D.; Göhlich, D.; Li, J.Y. Energy-Efficient Torque Allocation Design of Traction and Regenerative Braking for Distributed Drive Electric Vehicles. IEEE Trans. Veh. Technol. 2018, 67, 285–295.
16. Jin, L.Q.; Fan, J.P.; Fei, T. Coordinated control strategy of electro-mechanical composite braking for four-wheel drive electric vehicles. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2025, 239, 2838–2853.
17. Zhao, K.K.; Fan, X.B.; Huang, Z.P.; Wang, L.H.; Peng, J.X. A review of drive torque distribution control for distributed drive electric vehicles. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2025, 239, 5291–5315.
18. Hua, M.; Chen, G.Y.; Zhang, B.Y.; Huang, Y. A hierarchical energy efficiency optimization control strategy for distributed drive electric vehicles. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2019, 233, 605–621.
19. Iwaki, K.; Nakamura, K. Experimental Improvement of Speed-Torque Characteristics in Magnetic-Geared Switched Reluctance Motor. IEEE Trans. Magn. 2025, 61, 8202605.
20. Ming, X.; Wang, X.Y.; Liu, F.C.; Qu, Y.; Zhou, B.; Zhang, S.; Yu, P. Mechanical Parameter Identification of Permanent Magnet Synchronous Motor Based on Symmetry. Symmetry 2025, 17, 1929.
21. Tekin, M.; Karamangil, M.I. Comparative analysis of equivalent circuit battery models for electric vehicle battery management systems. J. Energy Storage 2024, 86, 111327.
22. Louback, E.; Kollmeyer, P.J.; Emadi, A. Braking Strategy Characterization for a Dual-Motor Battery Electric Vehicle and Regenerative Torque Limit Derivation. IEEE Access 2025, 13, 192920–192934.
23. Li, S.Q.; Yu, B.; Feng, X.Y. Research on braking energy recovery strategy of electric vehicle based on ECE regulation and I curve. Sci. Prog. 2020, 103.
24. Wahid, M.R.; Joelianto, E.; Budiman, B.A.; Dewanata, M.P.; Aziz, M. Optimizing regenerative braking in light electric vehicles using deep deterministic policy gradient reinforcement learning. Egypt. Inform. J. 2026, 33, 100893.
25. Ouyang, T.C.; Jin, S.; Xie, X.J.; Gong, Y.; Zhang, Z. Adaptive Energy Management in Dual-Motor Electric Vehicles Using Deep Deterministic Policy Gradient. IEEE Trans. Transp. Electrif. 2025, 11, 12647–12656.
26. Fan, D.Y.; Shen, H.K.; Dong, L.J. Multi-Agent Distributed Deep Deterministic Policy Gradient for Partially Observable Tracking. Actuators 2021, 10, 268.
27. Yin, Y.L.; Xiao, H.Y.; Zhan, S.; Chen, H.; Deng, C.; Li, Z.; Pan, X. Hierarchical control of hybrid electric vehicle platoon with slope-adaptive variable spacing and soft actor-critic based energy management. J. Energy Storage 2026, 152, 120623.
28. Manivannan, R. Research on IoT-based hybrid electrical vehicles energy management systems using machine learning-based algorithm. Sustain. Comput. Inform. Syst. 2024, 41, 100943.
29. Ma, Y.; Ma, Q.; Liu, Y.Q.; Gao, J.; Chen, H. Two-level optimization strategy for vehicle speed and battery thermal management in connected and automated EVs. Appl. Energy 2024, 361, 122928.
30. Pan, C.; Li, Y.; Huang, A.; Wang, J.; Liang, J. Energy-optimized adaptive cruise control strategy design at intersection for electric vehicles based on speed planning. Sci. China Technol. Sci. 2023, 66, 3504–3521.
31. Chen, S.H.; Zhang, Z.; Zhang, J.; Lu, Y.; Yu, X.; Xuan, D. Hierarchical energy management for FCHEV in car-following scenarios with speed prediction. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2026, 1–18.
32. Chen, G.D.; Zhu, P.M.; Peng, X.J.; Huang, C.; Li, H. Watermarking-Based Attack Detection for Sensor Networks with Intermittent Observation Under Stealthy Attacks. J. Syst. Sci. Complex. 2026, 1, 1–23.

Figure 1. Overall architecture of the DDEV hybrid braking system.

Figure 2. External characteristic curve of the motor.

Figure 3. CarSim vehicle dynamics model in the Simulink environment.

Figure 4. Braking force distribution strategy for the hybrid braking system of DDEVs.

Figure 5. Road model of Liuhe Road with signalized intersections.

Figure 6. Red-light deceleration passing condition.

Figure 7. Green-light acceleration passing condition.

Figure 8. Network framework of the DDPG algorithm.

Figure 9. Reward curve during the training process of the DDPG algorithm.

Figure 10. Solution principle of DP.

Figure 11. Flowchart of the speed planning control strategy.

Figure 12. Simulation results of red-light deceleration passing condition at signalized intersections.

Figure 13. Curve of acceleration variation with time under the red-light deceleration passage condition.

Figure 14. Simulation results of green-light acceleration passing condition at signalized intersections.

Figure 15. Curve of acceleration variation with time under the green-light acceleration passage condition.

Figure 16. Vehicle driving time vs. motor torque variation curves under different driving conditions.

Table 1. Length of each section of Liuhe Road.

Road Section	Length (m)
Section 1	538
Section 2	417
Section 3	697
Section 4	363
Section 5	323
Section 6	400
Section 7	377
Section 8	660
Section 9	220
Section 10	497
Section 11	465
Section 12	245
Section 13	300
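
As a small worked example, the section lengths in Table 1 can be accumulated to obtain the total route length and the distance of each section boundary from the start. This is a minimal sketch. The variable names are hypothetical, and the placement of traffic light k at the end of section k is an assumption made here for illustration (13 sections, 12 lights), not something stated in the table.

```python
from itertools import accumulate

# Section lengths in metres, copied from Table 1 (Section 1 to Section 13).
SECTION_LENGTHS_M = [538, 417, 697, 363, 323, 400, 377, 660, 220, 497, 465, 245, 300]

# Cumulative distance from the route start to the end of each section.
section_ends_m = list(accumulate(SECTION_LENGTHS_M))

# Assumption: traffic light k stands at the end of section k, so the 12 lights
# sit at the first 12 section boundaries.
light_positions_m = section_ends_m[:-1]
total_length_m = section_ends_m[-1]

print(f"Total route length: {total_length_m} m")
print(f"Assumed light positions (m): {light_positions_m}")
```

Under these assumptions the route is 5502 m long, with the first light at 538 m and the last at 5202 m.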

Table 2. Phase information of traffic signals for each section of Liuhe Road.

Traffic-Light Number	Green Phase	Red Phase	Cycle Length
TL1	50 s	60 s	110 s
TL2	54 s	57 s	111 s
TL3	70 s	40 s	110 s
TL4	43 s	67 s	110 s
TL5	65 s	45 s	110 s
TL6	34 s	22 s	56 s
TL7	56 s	53 s	109 s
TL8	57 s	51 s	108 s
TL9	66 s	42 s	108 s
TL10	42 s	68 s	110 s
TL11	47 s	63 s	110 s
TL12	70 s	38 s	108 s
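
The phase data in Table 2 are what a speed planner needs in order to know whether a light will be green on arrival. The sketch below shows one way to query the signal state at an arbitrary time. The function name `signal_state`, the `offset` parameter, and the convention that every cycle starts with green at `t = offset` are assumptions for illustration; the table itself does not give phase offsets or the order of phases.

```python
from typing import Tuple

# (green_s, red_s) per traffic light, copied from Table 2.
PHASES = {
    "TL1": (50, 60), "TL2": (54, 57), "TL3": (70, 40), "TL4": (43, 67),
    "TL5": (65, 45), "TL6": (34, 22), "TL7": (56, 53), "TL8": (57, 51),
    "TL9": (66, 42), "TL10": (42, 68), "TL11": (47, 63), "TL12": (70, 38),
}


def signal_state(light: str, t: float, offset: float = 0.0) -> Tuple[str, float]:
    """Return (colour, seconds until the next switch) for `light` at time t.

    Assumption (hypothetical): each cycle starts with green at t = offset,
    followed by red, and the cycle is the sum of the green and red phases.
    """
    green, red = PHASES[light]
    phase_time = (t - offset) % (green + red)
    if phase_time < green:
        return "green", green - phase_time
    return "red", green + red - phase_time


if __name__ == "__main__":
    # TL6 has a 56 s cycle: 34 s green followed by 22 s red.
    print(signal_state("TL6", 40.0))
```

For example, 40 s into its cycle TL6 has been red for 6 s and turns green again after a further 16 s.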

Table 3. Parameter settings of the DDPG algorithm.

Parameter Name	Numerical Value	Parameter Name	Numerical Value
Simulation step (s)	1	Exploration noise type	Ornstein–Uhlenbeck
Actor learning rate	0.001	OU noise mean	0
Soft-update coefficient	0.005	OU noise initial standard deviation	0.3
Experience replay buffer size	1,000,000	OU noise decay rate	1 × 10⁻⁵
Batch size for training	256	Actor network hidden-layer architecture	[128, 200]
Reward discount factor	0.95	Critic network hidden-layer architecture	[128, 200]
Gradient threshold	1	Hidden-layer activation function	ReLU
Maximum simulation steps per episode	2000	Actor output layer activation function	Tanh
Maximum training episodes	1000	Critic output layer activation function	Linear
Weight coefficients of reward components $β_{1}$ , $β_{2}$ , $β_{3}$	0.4, 0.4, 0.2	Target network update method	Soft update
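
To make two of the settings in Table 3 concrete, the sketch below implements Ornstein–Uhlenbeck exploration noise with a decaying standard deviation and the soft target-network update. The mean (0), initial standard deviation (0.3), decay rate (1 × 10⁻⁵), step (1 s), and soft-update coefficient (0.005) come from the table. The mean-reversion rate `theta = 0.15`, the multiplicative per-step decay rule, and the class and function names are assumptions made here, since the table does not specify them; this is not the authors' implementation.

```python
import numpy as np


class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck noise with a decaying standard deviation."""

    def __init__(self, size, mean=0.0, std=0.3, std_decay=1e-5,
                 theta=0.15, dt=1.0, seed=0):
        # theta (mean-reversion rate) is an assumed value, not from Table 3.
        self.mean, self.std, self.std_decay = mean, std, std_decay
        self.theta, self.dt = theta, dt
        self.rng = np.random.default_rng(seed)
        self.state = np.full(size, mean, dtype=float)

    def sample(self):
        # Pull the state back toward the mean, then add Gaussian diffusion.
        drift = self.theta * (self.mean - self.state) * self.dt
        diffusion = self.std * np.sqrt(self.dt) * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        # Assumed decay rule: shrink the standard deviation a little every step.
        self.std *= 1.0 - self.std_decay
        return self.state.copy()


def soft_update(target_params, online_params, tau=0.005):
    """Blend online weights into target weights: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    noise = OUNoise(size=1)
    print(noise.sample(), noise.std)
    print(soft_update([np.zeros(2)], [np.ones(2)]))
```

With a coefficient of 0.005, each update moves the target weights only 0.5% of the way toward the online weights, which keeps the critic's learning target slowly varying.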

Table 4. Comparison of energy consumption under different operating conditions.

Operating Condition	Unit	DDPG	Rule-Based Control Strategy	DP
Deceleration passage at a red light	kWh/km	0.0315	0.0309	0.0359
Acceleration passage at a green light	kWh/km	0.0446	0.0475	0.0534
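
The relative differences implied by Table 4 can be computed directly from the tabulated values. The short script below is an illustrative calculation only; the dictionary keys and the helper `saving_percent` are hypothetical names, and a positive result means that DDPG consumes less energy than the baseline.

```python
# Energy consumption in kWh/km, copied from Table 4.
ENERGY_KWH_PER_KM = {
    "red_light_deceleration": {"DDPG": 0.0315, "rule_based": 0.0309, "DP": 0.0359},
    "green_light_acceleration": {"DDPG": 0.0446, "rule_based": 0.0475, "DP": 0.0534},
}


def saving_percent(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline` (negative if higher)."""
    return 100.0 * (baseline - value) / baseline


for condition, row in ENERGY_KWH_PER_KM.items():
    for baseline in ("rule_based", "DP"):
        saving = saving_percent(row[baseline], row["DDPG"])
        print(f"{condition}: DDPG vs {baseline}: {saving:+.1f}%")
```

From these figures, DDPG uses about 6.1% less energy than the rule-based strategy and about 16.5% less than DP in the green-light condition. In the red-light condition it uses about 12.3% less than DP but about 1.9% more than the rule-based strategy.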

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, N.; Lin, Y.; Huang, Z.; Hong, Y.; Ning, X. Research on Speed Planning and Energy Management Strategy for Distributed-Drive Electric Vehicles Based on Deep Deterministic Policy Gradient Algorithm. Actuators 2026, 15, 248. https://doi.org/10.3390/act15050248

