Article

Electric Vehicle Cluster Charging Scheduling Optimization: A Forecast-Driven Multi-Objective Reinforcement Learning Method

1 School of Engineering, Xi’an Siyuan University, Xi’an 710038, China
2 Engineering Research Center on Additive Manufacturing Technology and Application in Universities of Shaanxi Province, Xi’an 710038, China
* Author to whom correspondence should be addressed.
Energies 2026, 19(3), 647; https://doi.org/10.3390/en19030647
Submission received: 10 October 2025 / Revised: 22 November 2025 / Accepted: 24 November 2025 / Published: 27 January 2026

Abstract

The widespread adoption of electric vehicles (EVs) poses significant challenges to the secure operation of distribution grids. To address issues such as increased grid load fluctuations, rising user charging costs, and rapid load surges around midnight caused by uncoordinated nighttime charging of household electric vehicles in communities, this paper first models electric vehicle charging behavior as a Markov Decision Process (MDP). By improving the state-space sampling mechanism, a continuous space mapping and a priority mechanism are designed to transform the charging scheduling problem into a continuous decision-making framework while optimizing the dynamic adjustment between state and action spaces. On this basis, to achieve synergistic load forecasting and charging scheduling decisions, a forecast-augmented deep reinforcement learning method integrating Gated Recurrent Unit and Twin Delayed Deep Deterministic Policy Gradient (GRU-TD3) is proposed. This method constructs a multi-objective reward function that comprehensively considers time-of-use electricity pricing, load stability, and user demands. The method also applies a single-objective pre-training phase and a model-specific importance-sampling strategy to improve learning efficiency and policy stability. Its effectiveness is verified through extensive comparative and ablation simulations. The results show that our method outperforms several benchmarks. Specifically, compared to the Deep Deterministic Policy Gradient (DDPG) and Particle Swarm Optimization (PSO) algorithms, it reduces user costs by 11.7% and the load standard deviation by 12.9%. In contrast to uncoordinated charging strategies, it achieves a 42.5% reduction in user costs and a 20.3% decrease in load standard deviation. Moreover, relative to single-objective cost optimization approaches, the proposed algorithm effectively suppresses short-term load growth rates and mitigates the “midnight peak” phenomenon.

1. Introduction

With the intensification of global climate change and the energy crisis, EVs have garnered increasing attention as a crucial solution for reducing dependence on fossil fuels and lowering greenhouse gas emissions. In 2020, the State Council issued the “New Energy Vehicle Industry Development Plan (2021–2035)”, emphasizing the automotive industry’s commitment to electrification, connectivity, and intelligence to promote high-quality sustainable development and accelerate the construction of an automotive powerhouse. Compared to traditional internal combustion engine vehicles, EVs are greener and cleaner. In addition, the “Handbook on New Paradigms in Smart Charging for E-Mobility” provides valuable insights in its chapter 14 on “Policies for the future: Promoting electric vehicle deployment”, highlighting the essential role of forward-looking policies [1]. Similarly, the handbook’s chapter 11 on “Demand-side management and managing electric vehicles and their optimal charging locations and scheduling in smart grids” underscores the importance of advanced management strategies [2]. These studies underscore the synergy between policy incentives and intelligent charging coordination, which collectively enhance user participation, reduce grid impact, and support the transition to sustainable e-mobility. Encouraged and supported by national industrial policies, China’s new energy vehicle industry is booming, and the number of urban EVs is rapidly increasing.
However, urban residential power load planning still follows building area unit load standards, lacking systematic management of the additional load introduced by high-power home EV charging piles. As a result, issues such as grid load fluctuations and increased user costs caused by disorderly charging behavior are becoming increasingly prominent. The phenomenon of EV owners charging upon returning home in the evening, which coincides with peak grid fluctuation periods, exacerbates load oscillations in distribution networks, further widening peak-to-valley differences and leading to “peak-on-peak” problems [3], posing threats of overload to power infrastructure. Even smart charging piles that allow scheduled charging often prioritize minimizing charging costs solely guided by time-of-use pricing, frequently resulting in a more severe “midnight peak” clustering phenomenon [4]. This causes a rapid short-term increase in community grid load, challenging the stability and security of the power system. Traditional charging strategies easily lead to congestion and waste, often exacerbating grid load fluctuations, increasing user charging costs, and causing uneven allocation of charging resources. Therefore, researching an intelligent and efficient EV charging method is particularly important.
In recent years, research on EV charging scheduling optimization has evolved across multiple dimensions. Studies are typically categorized by vehicle type, such as private EVs and electric taxis, considering their distinct management characteristics. Private EV charging primarily occurs in residential communities or during travel. Community charging, characterized by extended parking times and concentrated load, focuses on minimizing individual costs through price–time coordination [5], while grid-level optimization aims to mitigate load fluctuations via peak shaving and valley filling strategies [6]. Travel charging, given its spatial uncertainty, involves either individual route adjustments based on battery status [7] or platform-driven charging location planning [8]. For electric taxis, centralized dispatch platforms facilitate multi-factor charging strategy optimization. Traditional optimization approaches, including dynamic programming [9], day-ahead scheduling [10], and real-time scheduling [11], have been widely applied. While dynamic programming offers theoretical optimality [12], it relies heavily on accurate state-transition models; day-ahead scheduling depends on forecasted conditions but lacks real-time feedback for clustered vehicle behaviors; and real-time scheduling, though responsive, struggles with high-dimensional parametric modeling. Conventional methods often formulate charging scheduling as multi-objective optimization problems, such as maximizing multi-agent benefits or flattening load curves, solved via operational research or evolutionary algorithms. For instance, Liu et al. [13] developed a user-response-based charging/discharging model optimized by a Particle Swarm Optimization solver, while Ihekwaba et al. [14] explored control strategies for EV stations in voltage regulation. However, these methods face significant challenges: strong dependence on accurate real-time prediction data, high computational resource consumption in high-dimensional spaces, and limited efficiency in handling dynamic charging behaviors.
Reinforcement Learning (RL), as an advanced machine learning technique, learns optimal policies through interaction with the environment. It has been widely applied in fields like gaming [15], robotics, and control [16], providing new ideas for solving complex decision-making problems. The application potential of RL in the EV charging domain is significant. Firstly, the EV charging problem is highly dynamic and uncertain. Secondly, RL can comprehensively consider multiple objectives, such as charging cost, charging time, and grid load balance. By designing a reasonable reward function, RL can achieve a balance between multiple objectives to realize global optimization.
Therefore, leveraging its advantages in sequential decision-making [17], RL has brought new opportunities to research in real-time optimization for power systems. Najafi et al. [18] considered the uncertainties caused by EV integration into distribution grids, renewable energy sources, and loads, using the Q-learning algorithm for day-ahead training to enable real-time scheduling during the day without iterative computation. Ding et al. [19] used an MDP to analyze the charging behavior of EVs within each time step and guided users’ EV charging choices at each time step by constructing a reward function. To address the issue of large absolute differences between action and state dimensions in traditional Deep Q-Learning algorithms for EV charging control optimization, Ji et al. [20] proposed a deep reinforcement learning optimization method based on the Dueling Deep Q Network (DDQN), balancing user travel demand and distribution grid voltage regulation. Treating each EV as a separate optimization variable simplifies implementation and control, but this method is only suitable for scenarios with a small number of EVs, as a large number can easily lead to the “curse of dimensionality” problem. Huang et al. [21] introduced the concept of EV clusters, established upper and lower energy boundary models for EV clusters, and used the Soft Actor–Critic (SAC) algorithm for model training and real-time scheduling. Lotfy et al. [22] modeled the charging process of a single EV as an MDP and used the TD3 algorithm to optimize the charging behavior of EV clusters from the perspective of a load aggregator. The study in [23] grouped EV clusters according to distribution network nodes and charging end times, proposed charging scheduling strategies considering charging costs and insufficient charging penalties, and solved them using a deep reinforcement learning (DRL) algorithm. The above research requires building complex EV cluster charging station environments, meaning the charging optimization process needs to be implemented by distributively deployed agents, resulting in high hardware costs. Kumar et al. [24] proposed using an LSTM network to extract dynamic price features to compensate for the reinforcement learning state space.
However, directly applying deep reinforcement learning to EV charging control tasks faces the following challenges: (1) The uncertainty of the base load and user commuting behavior in the environment leads to missing components when designing the state space, making it impossible to obtain precise state transitions in the model, which affects the effectiveness of reinforcement learning training. (2) Multiple objectives need to be achieved, including minimizing charging costs while suppressing load fluctuations. Obtaining an optimal policy that balances these two objectives is very difficult. (3) There is a hidden prerequisite during the training of the reinforcement learning action model: the agent must first satisfy the vehicle owner’s charging demand while simultaneously optimizing the three objective values. In the early stages of training and when filling the experience replay buffer, the agent needs to first master the basic logic of charging through various means. (4) The state-space dimension is the product of the charging time period and the number of vehicles. In high-dimensional continuous state and action spaces, the presence of discrete time-of-use electricity prices exacerbates the sparsity of the reward function.
To solve the above problems, EV charging control is mathematically constrained and modeled based on environmental factors, constructing the sequential decision-making problem as an MDP. Additionally, we developed a model-free, real-time scheduling reinforcement learning method to learn the optimal sequential charging strategy for EVs. The main contributions of this paper are summarized as follows:
(1)
EV charging control is modeled as an MDP. Random variable distributions are used to approximate the actual occurrences of vehicle state of charge (SoC) and owner travel behavior information. Historical load data and sequence neural network models are used to predict the charging demand of a residential community. Furthermore, the set of EV charging piles is treated as an agent interacting with the environment, allocating charging power to maximize cumulative reward.
(2)
Importance sampling logic is introduced into the experience replay buffer of the TD3 reinforcement learning method. This aims to find a globally optimal charging strategy in continuous action scenarios, minimizing electricity purchasing costs while meeting the distribution grid’s requirements for long-term load fluctuation magnitude and short-term growth rate.
(3)
A GRU model is utilized to forecast 48 h base load information to determine the current charging action. Additionally, Gaussian noise is added to the actor network to ensure effective exploration by the agent.
(4)
Simulation results demonstrate that compared to the baseline method, the proposed method balances low charging costs with distribution grid load requirements.
The remainder of this article is organized as follows. In Section 2, we present the EV charging scheduling model, including system formulation and optimization objectives. In Section 3, we introduce the GRU-TD3-based scheduling method in detail. In Section 4, we provide simulation results and performance evaluation. Finally, in Section 5, we conclude this article and suggest possible future research directions.

2. EV Charging Scheduling Model

In this section, the charging scenario and the charging scheduling model underlying the algorithm adopted in this paper are described in detail.

2.1. Scenario Description

For the EV charging problem, this paper constructs a scenario as shown in Figure 1 for modeling. This scenario approximates the home EV charging task within a residential community: residents return home at dusk after completing daytime work and need to charge their EV before leaving to work the next morning, which aligns with typical urban commuting rhythms and represents the most common usage scenario in residential settings [25,26]. The community grid, after integrating private charging piles and fixed-capacity public charging piles, needs to control its impact on the overall load. The residents’ round-trip plans, vehicle status, and the grid’s predicted load are provided to the decision-making agent. The pricing signals serve as natural economic incentives that encourage charging during optimal periods, including potential solar-abundant hours when prices may be lower due to renewable generation. Under time-of-use electricity pricing, the agent selects the charging periods for each vehicle to achieve the three objectives: reducing user charging costs, flattening daily load fluctuations, and reducing short-term load growth rates during the zero-hour period.

2.2. Charging Scheduling Model

Considering the periodic characteristics of user travel plans and distribution grid load fluctuations, a 24 h period spanning day and night is selected as the total decision window. User round-trip information and EV remaining charge are sampled according to random probability distribution functions. EV attribute information includes SoC, maximum charging power, and charging efficiency. Each vehicle has different arrival and estimated departure times, leading to different behavioral time windows. The agent controls the charging power of each charging pile within its effective time window to achieve optimal control.

2.2.1. Optimization Objectives

The optimization objectives under the mathematical model can be summarized as the minimization problem shown in the following equations:
$\min f = \left\{ f_1(x), f_2(x), f_3(x) \right\}$
where $f$ is the objective to be optimized and $f_j(x)$ is the $j$-th sub-objective.
$f_1(x) = \min \sum_{i=1}^{T} \sum_{j=1}^{N} Q_i \, p_{i,j} \, \varepsilon$
where $Q_i$ represents the electricity price at time $i$, $p_{i,j}$ is the charging power of the $j$-th EV at time $i$, and $\varepsilon$ is the length of each time period.
$f_2(x) = \min \sum_{t=1}^{T} \left| P_t - P_{t-1} \right|$
where $P_t$ represents the total load, including residential and EV charging, of the community distribution grid at time $t$, and $P_{t-1}$ is the load at the previous time step.
$f_3(x) = \min \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( P_t - P_{avg} \right)^2}$
where $P_{avg}$ is the average daily community load.
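To make the three sub-objectives concrete, the following minimal NumPy sketch evaluates them for a candidate schedule. The array shapes, the reading of $f_2$ as the sum of absolute period-to-period load changes, and the reading of $f_3$ as the load standard deviation are our assumptions based on the equations above, not code from the original implementation.

```python
import numpy as np

def charging_objectives(price, p, base_load, eps=1.0):
    """Evaluate the three sub-objectives for a candidate charging schedule.

    price:     (T,) time-of-use price per period (Q_i)
    p:         (T, N) charging power of each EV in each period (p_{i,j})
    base_load: (T,) community base load
    eps:       length of one period in hours (epsilon)
    """
    total_load = base_load + p.sum(axis=1)                    # P_t: base load plus EV charging
    f1 = float(np.sum(price[:, None] * p) * eps)              # total charging cost
    f2 = float(np.sum(np.abs(np.diff(total_load))))           # sum of period-to-period load changes
    f3 = float(np.sqrt(np.mean((total_load - total_load.mean()) ** 2)))  # load standard deviation
    return f1, f2, f3
```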

2.2.2. Constraints

In reality, the EV charging problem is subject to multiple constraints, including time, power, and battery limitations. The strategy must satisfy user travel demands, adhere to EV battery specifications, and respect charging pile power ratings to avoid undercharging, overcharging, or unrealistic charging schedules. Beyond these operational necessities, this work also incorporates a fundamental consideration for battery health into its constraints.
Charging Time Constraint:
$t_{i,arr} < t < t_{i,dep}$
where $t_{i,arr}$ is the arrival time of the $i$-th vehicle, and $t_{i,dep}$ is its departure time.
Charging Power Constraint:
$0 < p_{i,j} < p_{max}$
Battery Capacity Constraint:
$E_{arr,i}^{soc} < E_{t,i}^{soc} < E_{max}$
$E_{t+1,i}^{soc} = E_{t,i}^{soc} + \eta \, p_{i,j} \, \varepsilon$
where $E_{arr,i}^{soc}$ is the remaining SoC of the $i$-th vehicle upon arrival, constrained to the range [0.1, 0.9]. This constraint not only ensures sufficient energy for user trips but also serves to maintain long-term battery health and mitigate degradation. $E_{max}$ is the vehicle’s maximum capacity, and $E_{t,i}^{soc}$ is the vehicle’s SoC at time $t$.
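As an illustration of how these constraints can be enforced step by step, the short sketch below checks a single charging decision and applies the SoC update. Treating the stored energy in kWh rather than as a SoC fraction is a simplifying assumption, and the 7 kW, 82 kWh, and 90% defaults are taken from the parameter settings in Section 4.1.1.

```python
def feasible_step(p_it, t, e_t, t_arr, t_dep, p_max=7.0, e_max=82.0, eta=0.9, eps=1.0):
    """Check one charging decision against the time, power, and battery constraints,
    and return the updated stored energy E_{t+1} = E_t + eta * p * eps (in kWh)."""
    in_window  = t_arr < t < t_dep          # charging time constraint
    power_ok   = 0.0 <= p_it <= p_max       # charging power constraint (7 kW rated power)
    e_next     = e_t + eta * p_it * eps     # SoC dynamics
    battery_ok = e_next <= e_max            # battery capacity constraint (82 kWh pack)
    return (in_window and power_ok and battery_ok), e_next
```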

3. EV Scheduling Based on GRU-TD3 Method

In this study, the MDP model is constructed for EV charging scheduling, including the integrated system state space, the charging action mapping based on the state space and agent decisions, the state transition based on the action mapping, and the reward function designed using the optimization objectives from the previous section. Subsequently, based on the MDP model and utilizing the GRU prediction model to complete the state space, we implement the TD3 algorithm in detail and introduce the general flow from interactive training to evaluation. Its structure is shown in Figure 2.

3.1. Charging Scheduling MDP Model

MDP is adopted to cast the problem as a discrete sequential decision process. The set of charging piles is modeled as a single agent that performs coordinated scheduling. For each episode, the state is initialized with date-specific load forecasts and randomly sampled vehicle attributes. Within one charging cycle, each newly arriving vehicle defines a decision step; the agent’s action is to assign a charging time window to that vehicle. The state then transitions according to the incremental load induced by the action. A reward function, capturing both charging costs and load characteristics (e.g., fluctuation and peaking), guides the agent toward improved policies. After sufficient training iterations, the learned policy yields an optimal scheduling model in terms of cost reduction and load smoothing.

3.1.1. System State

The state space of the charging model consists of three parts: predicted load, time-of-use electricity price, and vehicle information, as shown below:
$s_i = \left[ z_1^{pr}, z_2^{pr}, \ldots, z_{48}^{pr}, \; [\lambda_1, \lambda_2, \ldots, \lambda_{24}]_1, \; [\lambda_1, \lambda_2, \ldots, \lambda_{24}]_2, \; t_{i,arr}, \; t_{i,dep}, \; E_{arr,i}^{soc}, \; i \right]$
where $[z_1^{pr}, z_2^{pr}, \ldots, z_{48}^{pr}]$ is the predicted hourly base load for the next 48 h, generated by a GRU gated recurrent neural network (RNN) model using the base load data from the three days prior; $[\lambda_1, \lambda_2, \ldots, \lambda_{24}]$ represents the 24 h time-of-use electricity price for the urban residential area, repeated for two days to complete the state space since the main EV charging time is concentrated from evening to the next morning; and $[t_{i,arr}, t_{i,dep}, E_{arr,i}^{soc}, i]$ represents the vehicle attribute information for the $i$-th arriving vehicle in this episode, specifically its arrival time $t_{i,arr}$, next-day estimated departure time $t_{i,dep}$, remaining SoC $E_{arr,i}^{soc}$, and its arrival sequence number $i$ within the episode.
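For illustration, a minimal sketch of how such an observation vector could be assembled is given below; the function name and argument layout are hypothetical, but the 48 + 48 + 4 composition follows the definition above.

```python
import numpy as np

def build_state(z_pred_48h, tou_price_24h, t_arr, t_dep, soc_arr, index):
    """Assemble s_i = [48-h load forecast, two repeats of the 24-h TOU price, EV attributes]."""
    price_two_days = np.tile(tou_price_24h, 2)                     # repeat the daily tariff twice
    ev_info = np.array([t_arr, t_dep, soc_arr, index], dtype=float)
    return np.concatenate([z_pred_48h, price_two_days, ev_info])   # dimension 48 + 48 + 4 = 100
```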

3.1.2. Action Mapping

$A$ is the action space, and $a_t$ represents the action taken in state $s_t$, which in practice corresponds to the charging power allocation per unit time. Considering that different EV samples have charging time windows constrained by their arrival and departure times, and that battery capacity and charging power are also constrained, we adjust the charging task using normalization and priority assessment criteria while ensuring the actor network generates a non-linear continuous action space, ultimately fulfilling the full-charge requirement.
The policy network generates an initial action vector of dimension 48. A Sigmoid activation function is applied at the output to normalize values to the [0, 1] interval.
A sliding window bounded by the vehicle’s arrival and departure times is applied to the initial action vector. Values outside this window are set to zero, ensuring charging only occurs within the permitted time window. The logic is as follows.
(1) The non-zero values within the window are normalized so their sum equals 1. They are then multiplied by $(E_{max} - E_{arr}^{soc}) / \eta$, yielding the initial full-charge strategy. (2) The charging power values are clipped to the range $[0, p_{max}]$ to satisfy the charging power constraint. (3) If the sum of the clipped values is less than the required total energy due to clipping, the time slots where the actual charging power is less than $p_{max}$ are identified. These slots are then filled in descending order of their charging power potential up to $p_{max}$ until the residual energy is fully allocated. A sketch of this mapping is given below.
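The sketch below illustrates this mapping under the stated rules; the hourly slot indexing and the tie-breaking by current charging power are our assumptions, since the text does not fix them precisely.

```python
import numpy as np

def map_action(raw_action, t_arr, t_dep, e_required, p_max=7.0):
    """Map the actor's 48-dim sigmoid output to a feasible hourly charging schedule.

    raw_action: (48,) values in [0, 1]
    e_required: energy that must be delivered, e.g. (E_max - E_arr^soc) / eta
    """
    a = np.asarray(raw_action, dtype=float).copy()
    window = np.zeros(48, dtype=bool)
    window[int(np.ceil(t_arr)):int(np.floor(t_dep))] = True   # sliding window between arrival/departure
    a[~window] = 0.0                                           # no charging outside the window

    if a.sum() <= 0:                                           # degenerate output: spread uniformly
        a[window] = 1.0
    p = a / a.sum() * e_required                               # normalize so the total equals e_required
    p = np.clip(p, 0.0, p_max)                                 # respect the rated charging power

    residual = e_required - p.sum()                            # energy removed by clipping
    if residual > 1e-6:
        for j in np.argsort(-p):                               # slots in descending charging power
            if not (window[j] and p[j] < p_max):
                continue
            add = min(p_max - p[j], residual)
            p[j] += add
            residual -= add
            if residual <= 1e-6:
                break                                          # residual energy fully re-allocated
    # any leftover residual means the window is too short for a full charge
    return p                                                   # kW allocated to each hourly slot
```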

3.1.3. State Transition

Considering that the initial vehicle information is sampled from feature distributions, the time-of-use price is fixed, and the base load is a historical prediction, the primary impact of each action $a_t$ on the state space is on the load component. Therefore, the state transition is set as follows.
$A_i^j = \sum_{k=1}^{i} a_k^j, \quad j = 1, 2, \ldots, 48$
$s_{i+1} = \left[ z_1^{pr} + A_i^1, \; z_2^{pr} + A_i^2, \; \ldots, \; z_{48}^{pr} + A_i^{48}, \; \ldots \right]$
The next state, $s_{i+1}$, for the decision regarding the $(i+1)$-th vehicle is formed by updating the load forecast with the cumulative charging load $A_i^j$ of the first $i$ vehicles and incorporating the attributes of the next vehicle. However, a causal bias exists in this model: when making a decision for the current vehicle, the charging loads of subsequent vehicles have not yet occurred and are therefore not included in the total load forecast of the state. This could lead to myopic policies by the agent. To mitigate this issue, the vehicle’s arrival sequence number $i$ is included in the state space. This design provides the agent with contextual information about the progress of the decision-making process, enabling it to learn policies that differentiate between early and late stages of scheduling: in the early stage, when $i$ is small, the policy becomes more cautious, reserving optimization space for subsequent vehicles and avoiding over-utilization of low-price periods; in the final stage, when $i$ approaches the total vehicle count, the policy can more fully utilize the remaining capacity. Although the sequence number $i$ alone is not sufficient to fully compensate for all model bias, it serves as an efficient and necessary engineering approximation that effectively guides the policy away from the most severe myopic behaviors, as validated by the results in Section 4.3. For the last vehicle in an episode, the next state is still generated conventionally when storing training data to align the state training sets. In reality, the load at each time also depends on charging from vehicles arriving later; however, given the causality of the decision model, the sequence feature in the vehicle information helps correct the training effect.

3.1.4. Reward Function

The optimization objectives for EV charging scheduling are to minimize user charging costs and suppress load fluctuations and short-term growth rates during zero-hour periods. Therefore, the reward function should include the following aspects.
$r_i = -w_1 \phi_i - w_2 \sigma_i - w_3 \tau_i$
where $\phi_i$ is the total charging cost for the $i$-th vehicle; $\sigma_i$ is the absolute increase in the standard deviation of the load over the cycle caused by connecting the $i$-th vehicle; $\tau_i$ is the absolute increase in load at midnight caused by connecting the $i$-th vehicle; and $w_1$, $w_2$, $w_3$ are weight values. Considering the incremental nature of $\sigma_i$, it can also help suppress short-term load growth rates to some extent during non-midnight hours.
However, in practical testing, a reward function containing only the optimization objectives failed to guide the agent to learn the basic charging logic. For example, when an EV sample has a low initial SoC, its charging cost is inevitably high, resulting in a low reward, which indirectly increases the learning cost for correct strategies. Therefore, the following aspect was added to the base reward.
$r_i' = r_i + w_4 e_i$
where $e_i$ is the required charging energy for the $i$-th vehicle. This rewards the charging behavior itself, helping the agent understand the basic charging logic in the early stages of training.
The selection of weight coefficients $w_1, w_2, w_3, w_4$ in the reward function was based on multi-objective trade-offs and empirical tuning. The core principles were as follows: first, a significantly positive weight $w_4$ was set to ensure the fulfillment of the basic user charging demand, which is a prerequisite for a feasible policy; second, based on an analysis of historical data, the magnitudes of the cost $\phi_i$, load standard deviation increment $\sigma_i$, and midnight load increment $\tau_i$ were estimated, allowing the initial values of $w_1, w_2, w_3$ to normalize the contributions of the respective optimization objectives to the same order of magnitude, thereby preventing any single objective from dominating the training gradient. Fine-tuning was subsequently performed through a series of pre-training simulations: the absolute value of $w_1$ was increased if the cost was too high, and $w_2$ or $w_3$ were increased if load suppression was insufficient, thus effectively guiding the agent’s behavior. The final adopted weights successfully integrated the multiple objectives into a coherent reward signal, driving the agent to learn an effective policy that achieves a balance among cost, load stability, and user satisfaction.
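A minimal sketch of the shaped per-vehicle reward is given below. The equal default weights, the 48-h hourly load representation, and the midnight window (hours 24 to 29, following Section 4.3.2) are illustrative assumptions; the per-vehicle cost is assumed to be computed elsewhere from the time-of-use tariff.

```python
import numpy as np

def step_reward(cost_i, load_before, load_after, w=(1.0, 1.0, 1.0, 1.0),
                midnight=slice(24, 30)):
    """Shaped reward r_i' = -w1*phi_i - w2*sigma_i - w3*tau_i + w4*e_i for the i-th EV.

    load_before / load_after: (48,) hourly community load before and after this EV is scheduled.
    """
    w1, w2, w3, w4 = w
    phi   = cost_i                                                    # charging cost of this vehicle
    sigma = abs(float(np.std(load_after) - np.std(load_before)))      # increase in load std dev
    tau   = abs(float(load_after[midnight].sum() - load_before[midnight].sum()))  # midnight load increase
    e     = float(np.sum(load_after - load_before))                   # delivered charging energy (kWh)
    return -w1 * phi - w2 * sigma - w3 * tau + w4 * e
```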

3.2. GRU-TD3 Algorithm

3.2.1. GRU-Based Load Forecasting Model

As seen from previous sections, the state-space dimensionality is high and requires predicting 48 h of hourly load data. To satisfy the causality required for model implementation, a load forecasting model needs to be preset at the input of the training model. The GRU neural network, as a type of RNN, is a simplified variant of the Long Short-Term Memory (LSTM) network.
The GRU cell stores and transmits load information across different time steps. It omits the cell state of LSTM and reduces the three gates (input, forget, output) to two gates (reset gate and update gate). Compared to LSTM, it has fewer parameters and trains faster. Through its gating mechanisms, it effectively extracts and predicts features over long sequences (hundreds of steps), avoiding the gradient explosion and vanishing problems of the vanilla RNN. Specifically, the first GRU cell takes $x_t$ as input, and $h_{t-1}$ represents the output of the previous GRU cell. Then, $h_t$, containing prior load information, is passed to the next cell. The second GRU cell uses $h_t$ along with the current time step’s load $x_{t+1}$ to compute the current cell’s output $h_{t+1}$. After repeating this process to the last GRU cell, the output $z^{pr}$ of the GRU layer is used as the estimated value for the sequential base load at the current time step. By concatenating the load estimate $z^{pr}$ with the time-of-use price and sampled vehicle information, the state observation is formed. GRU performs exceptionally well in processing strongly periodic sequential data. Additionally, leveraging its ability to extract features from high-dimensional information helps reduce the learning space dimensionality and improves the learning efficiency of reinforcement learning, making it suitable for predicting distribution grid base load data.
In this paper, we utilize two years of historical base load data from a community to train the GRU prediction model. Five consecutive days of historical data form one training sample: the input is the historical load data at 15 min intervals for the first three days (input dimension 288), and the output is the predicted hourly load for the next two days (output dimension 48). The model is trained by comparing its predictions with the actual historical data. Finally, the trained model is integrated into the reinforcement learning training model for initializing the state space.
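A compact PyTorch sketch consistent with this setup (288-point input, 48-point output, 256 hidden units as reported in Section 4.1.1) is shown below; treating the input as a univariate sequence and reading the forecast from the final hidden state are our assumptions rather than details reported in the paper.

```python
import torch
import torch.nn as nn

class GRUForecaster(nn.Module):
    """Predict the next 48 hourly loads from the previous three days at 15-min resolution (288 points)."""
    def __init__(self, hidden_size=256):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 48)

    def forward(self, x):                    # x: (batch, 288) historical load values
        seq = x.unsqueeze(-1)                # -> (batch, 288, 1): one feature per time step
        _, h_n = self.gru(seq)               # h_n: (1, batch, hidden), final hidden state
        return self.head(h_n.squeeze(0))     # -> (batch, 48) predicted hourly load

model = GRUForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
# Typical training step, given x_batch: (B, 288) past load and y_batch: (B, 48) actual hourly load:
#   loss = loss_fn(model(x_batch), y_batch); optimizer.zero_grad(); loss.backward(); optimizer.step()
```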

3.2.2. TD3-Based EV Scheduling Method

The TD3 algorithm is an improvement over the DDPG algorithm within the actor–critic framework. It overcomes the application limitations of DQN while satisfying the requirements for continuous state and action spaces.
(1)
Clipped Double Q-Learning:
TD3 incorporates the idea of double Q-learning. On top of the AC architecture, it employs two different critic networks. The minimum of their output Q values is used to calculate the residual loss (TD error) for updating the critics. This avoids issues with overestimation bias and instability caused by a single critic.
$L(\theta) = \left[ r_t + \gamma \min_{n=1,2} Q_n^{target}\left(s_{t+1}, a_{t+1} \mid \theta_n^{Q'}\right) - \min_{n=1,2} Q_n^{\pi}\left(s_t, a_t \mid \theta_n^{Q}\right) \right]^2$
Furthermore, to enhance overall model stability, a gradient clipping strategy is introduced during back-propagation, clipping the gradient update values of the critic networks to avoid the exploding gradient problem in deep networks.
(2)
Experience Replay with Importance Sampling:
As an off-policy reinforcement learning algorithm, TD3 can learn from historical experiences collected by previous exploration policies. These experiences are stored in a replay buffer. Each sample contains the current state $s_t$, current action $a_t$, reward $r_t$, and the next state $s_{t+1}$.
Regarding the selection of training samples from the buffer, to balance exploration and final result stability, we adjust the importance sampling principle for the TD3 model. When sampling a batch from the buffer, we consider the importance weight of each sample, using the normalized TD error to represent the sampling probability of each sample.
Simultaneously, when calculating the policy gradient or critic loss for a batch, we weight each sample’s contribution to guide the critic network to pay more attention to samples with higher errors.
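A minimal proportional prioritized-replay sketch along these lines is shown below; the list-based storage and the specific weight normalization are illustrative choices, with $\alpha$ and $\beta$ named after the importance-sampling parameters of Section 4.1.1.

```python
import numpy as np

class PrioritizedReplay:
    """Proportional prioritized replay: sampling probability ~ |TD error|^alpha,
    with importance weights (N * P)^(-beta) used to scale each sample's loss."""
    def __init__(self, capacity=100_000, alpha=0.5, beta=0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.data, self.prios = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:          # drop the oldest transition when full
            self.data.pop(0); self.prios.pop(0)
        self.data.append(transition)
        self.prios.append(abs(td_error) + 1e-6)

    def sample(self, batch_size):
        prios = np.asarray(self.prios) ** self.alpha
        probs = prios / prios.sum()                  # normalized TD error -> sampling probability
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()                     # normalize weights for numerical stability
        batch = [self.data[i] for i in idx]
        return batch, idx, weights

    def update_priorities(self, idx, td_errors):     # refresh priorities after each update
        for i, e in zip(idx, td_errors):
            self.prios[i] = abs(float(e)) + 1e-6
```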
(3)
Policy Update:
TD3 improves upon traditional DDPG’s target update mechanism: instead of periodically hard-updating the target networks by copying the policy networks, it uses a soft update, gradually changing the target network weights towards the policy network weights using an update coefficient τ .
Additionally, TD3 delays the actor policy network updates relative to the critic value network updates. This allows the critic to be more accurate before guiding the actor update, improving the objectivity of action selection.
(4)
Network Architecture:
The TD3 model, incorporating the above improvements, ultimately consists of six neural networks grouped into three pairs: one actor network pair (consisting of an actor network and a target actor network) and two critic network pairs (each pair consisting of a critic network and a target critic network). Within the actor network, batch normalization layers are interspersed within the fully connected stack structure to mitigate training convergence issues caused by high sample variance.
To balance the exploration capability, we employ an $\epsilon$-controlled noise method, adding noise to the output of the actor network to enhance its exploration of the action space.
$\hat{\pi}_\eta(o_t) = \pi_\eta(o_t) + (1 - \epsilon)\,\mathcal{N}(0, 1)$
where $\epsilon$ is the exploration coefficient, initially set to 0 and increased per step over time, allowing exploration to diminish as training progresses for stable final performance.
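A small sketch of this exploration rule is given below; clipping the noisy action back to the sigmoid range and the per-step growth rate are added assumptions.

```python
import numpy as np

def noisy_action(actor_output, eps):
    """Add Gaussian exploration noise scaled by (1 - eps); eps grows from 0 during training."""
    noise = (1.0 - eps) * np.random.normal(0.0, 1.0, size=actor_output.shape)
    return np.clip(actor_output + noise, 0.0, 1.0)   # keep within the sigmoid output range

# a simple per-step schedule (the growth rate is an illustrative value)
# eps = min(1.0, eps + 1e-4)
```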

3.3. Algorithm Flow

In this section, the algorithm flow is detailed, which consists of three parts: Initialization, Training, and Testing. The specific logic is described below.

3.3.1. Initialization

At the start of each episode, initialize the basic information for the episode’s state space: Input the historical hourly load data from the three days prior to the episode into the pre-trained GRU base prediction model, generating the predicted 48 h base load $[z_1^{pr}, z_2^{pr}, \ldots, z_{48}^{pr}]$ for this episode. Repeat the region’s 24 h time-of-use electricity price twice to obtain $[\lambda_1, \lambda_2, \ldots, \lambda_{24}]_1, [\lambda_1, \lambda_2, \ldots, \lambda_{24}]_2$. Using travel flow data, generate vehicle information for the episode by sampling from random distributions; this includes the vehicle arrival time $t_{i,arr}$, estimated departure time $t_{i,dep}$, remaining SoC $E_{arr,i}^{soc}$, and the vehicle’s arrival sequence number $i$ within the episode. Concatenate the four parts to form the initial state information $o_t$ for the decision regarding the $i$-th vehicle ($o_t$ is distinguished from $s_t$ because the load part of $s_i$ changes as vehicle charging power is connected; for the first vehicle, $s_1 = o_t$). Store these in order in the buffer so that the initial state information of each of the $k$ steps in this episode can be obtained, where $k$ is the total number of steps in the episode.
To satisfy the TD3 algorithm’s experience replay mechanism, the replay buffer needs to be initialized. This involves running a number of episodes without training the main TD3 networks. During these episodes, actions are chosen using a sufficiently exploratory random policy. The interaction process per step in these episodes is as follows: after initializing the episode state, the agent outputs an action $a_t$ according to the random policy; the EV charges according to $a_t$ until full; the state transitions from $s_t$ to $s_{t+1}$ according to the transition rules; the reward $r_t$ is calculated; finally, the transition $[s_t, a_t, r_t, s_{t+1}]$ is stored in the experience replay buffer. As the number of episodes increases, the number of transitions stored in the experience replay buffer grows.
During actual training episodes, the agent inputs the state $s_t$ into the actor network $\pi_\eta(o_t)$ and adds Gaussian noise to its output to obtain the executed action $a_t$.

3.3.2. Training

As shown in Figure 2, the six neural networks used for training the EV charging scheduling policy are: the policy actor network $\pi_\eta$, the target actor network $\pi_{\hat{\eta}}$, critic network 1 $Q_{\theta_1}$, target critic network 1 $Q_{\hat{\theta}_1}$, critic network 2 $Q_{\theta_2}$, and target critic network 2 $Q_{\hat{\theta}_2}$.
At the start of training, the actor and critic networks are automatically initialized through functions. The target networks are initially set to the same weights as their corresponding policy networks ($\hat{\eta} = \eta$ and $\hat{\theta} = \theta$). During training, we first sample a batch of transitions from the prioritized replay buffer. The target actor network $\pi_{\hat{\eta}}$ outputs the target action $a_{t+1}$ according to the next state $s_{t+1}$.
$a_{t+1} = \pi_{\hat{\eta}}(s_{t+1})$
Both critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ take $s_t$ and $a_t$ as input, and the smaller of the two Q-value estimates is used:
$Q_{eval} = \min\left( Q_{\theta_1}(s_t, a_t), \; Q_{\theta_2}(s_t, a_t) \right)$
The target action $a_{t+1}$ and state $s_{t+1}$ are then input to both target critic networks $Q_{\hat{\theta}_1}$ and $Q_{\hat{\theta}_2}$. The target Q-value $Q_{tar}$ is computed by taking the minimum of their outputs, multiplying by the discount factor $\gamma$, and adding the immediate reward $r_t$:
$Q_{tar} = \min\left( r_t + \gamma Q_{\hat{\theta}_1}(s_{t+1}, a_{t+1}), \; r_t + \gamma Q_{\hat{\theta}_2}(s_{t+1}, a_{t+1}) \right)$
The critic loss function is defined as the mean squared error (MSE) between the predicted Q-value and the target Q-value over the sampled batch:
$L(\theta) = \sum_{i=1}^{M} \left( Q_{tar} - Q_{eval} \right)^2$
Similar to other AC structures, the gradient for updating the actor network parameters is approximated using the policy gradient theorem:
$\nabla_\eta J(\eta) = \sum_{i=1}^{M} \nabla_\eta \pi_\eta(s_i) \, \nabla_a Q_\theta(s_i, a) \big|_{a = \pi_\eta(s_i)}$
The network parameters are then updated using their respective gradients:
$\theta \leftarrow \theta - \rho_1 \nabla_\theta L(\theta)$
$\eta \leftarrow \eta + \rho_2 \nabla_\eta J(\eta)$
where $\rho_1$ and $\rho_2$ are the learning rates for the critic and actor networks, respectively.
The target networks are updated via soft updates:
$\hat{\theta} \leftarrow \tau \theta + (1 - \tau) \hat{\theta}$
$\hat{\eta} \leftarrow \tau \eta + (1 - \tau) \hat{\eta}$
where $\tau$ is the soft-update factor.
This process is repeated for a fixed number of episodes or steps. The pseudo-code presented in Algorithm 1 describes the process of updating the neural network parameters using transitions obtained during interaction.
In each round, small batches of data are first sampled from the experience replay buffer. Secondly, the Q values and target Q values are calculated using Formulas (17) and (18). Then, the gradients of the loss function and the policy are obtained using Formulas (19) and (20), respectively. After that, the parameters of the neural networks are updated through back-propagation of the loss function and policy gradient, as indicated by Formulas (21) and (22). Finally, the parameters of the target critic network and target actor network are updated using Formulas (23) and (24). By continuously updating the parameters of the six neural networks, an optimal charging control strategy for electric vehicles is obtained.
Algorithm 1: GRU-TD3 Initial Training
Inputs: horizon T; episodes N; replay buffer 𝔇; batch size M; discount γ; target rate τ; exploration coefficient ε; prioritized replay exponents α, β; step sizes ρ1, ρ2
Outputs: trained actor parameters η
1: Initialize the greedy coefficient $\varepsilon$ and importance-sampling coefficients $\alpha$, $\beta$; initialize the actor and critic network parameters $\eta$, $\theta$
2: Set the target parameters $\hat{\eta} \leftarrow \eta$, $\hat{\theta} \leftarrow \theta$
3: For episode = 1 to N do
4:     Use the GRU to predict the 48-h load $z^{pr}$ from the previous 3-day history
5:     Initialize the episode state with the information of 200 EVs, $z^{pr}$, and the time-of-use price
6:     For t = 1 to T do
7:         Obtain the policy output $\pi_\eta(s_t)$; add noise $\hat{\pi}_\eta = \pi_\eta(s_t) + (1 - \varepsilon) \cdot \mathcal{N}(0, 1)$
8:         Execute action $a_t$ to obtain $r_t$, $s_{t+1}$
9:         Store $\langle s_t, a_t, r_t, s_{t+1} \rangle$ into the replay buffer
10:    End
11:    Importance-sample a batch of size M
12:    Compute the evaluated Q-value $Q_{eval} = Q_\theta(s_t, a_t)$
13:    Generate the next action $a_{t+1} = \pi_{\hat{\eta}}(s_{t+1})$ by the target actor
14:    Compute the target Q-value via the target critics to obtain $Q_{tar}$
15:    Critic residual loss: $L(\theta) = \sum_{i=1}^{M} (Q_{tar} - Q_{eval})^2$
16:    Policy gradient: $\nabla_\eta J(\eta) = \sum_{i=1}^{M} \nabla_\eta \pi_\eta(s_i) \, \nabla_a Q_\theta(s_i, a) \big|_{a = \pi_\eta(s_i)}$
17:    Parameter update: $\theta \leftarrow \theta - \rho_1 \nabla_\theta L(\theta)$, $\eta \leftarrow \eta + \rho_2 \nabla_\eta J(\eta)$
18:    Soft-update targets: $\hat{\theta} \leftarrow \tau \theta + (1 - \tau) \hat{\theta}$, $\hat{\eta} \leftarrow \tau \eta + (1 - \tau) \hat{\eta}$
19:    Schedule coefficients: $\varepsilon \leftarrow \varepsilon + \zeta_\varepsilon$, $\beta \leftarrow \beta + \zeta_\beta$
20: End
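The following condensed PyTorch sketch mirrors one update step of Algorithm 1 (clipped double Q-learning, importance-weighted critic loss, gradient clipping, delayed policy update, and soft target updates). The module and optimizer objects, the single optimizer shared by both critics, and the two-step policy delay are assumptions for illustration; this is a sketch of the technique, not the authors' implementation.

```python
import torch
import torch.nn as nn

def td3_update(actor, actor_tgt, critic1, critic2, critic1_tgt, critic2_tgt,
               opt_actor, opt_critic, batch, weights, step,
               gamma=0.95, tau=0.001, policy_delay=2, grad_clip=1.0):
    """One GRU-TD3 style update on an importance-weighted batch (sketch of Algorithm 1)."""
    s, a, r, s_next = batch                                   # tensors with leading batch dim M
    w = weights.unsqueeze(-1)                                 # (M, 1) importance-sampling weights

    with torch.no_grad():
        a_next = actor_tgt(s_next)                            # target action a_{t+1}
        q_next = torch.min(critic1_tgt(s_next, a_next),
                           critic2_tgt(s_next, a_next))       # clipped double Q
        q_tar = r.unsqueeze(-1) + gamma * q_next              # Q_tar

    q1, q2 = critic1(s, a), critic2(s, a)
    td_err = q_tar - torch.min(q1, q2)                        # used to refresh buffer priorities
    critic_loss = (w * ((q_tar - q1) ** 2 + (q_tar - q2) ** 2)).mean()

    opt_critic.zero_grad()                                    # opt_critic covers both critics (assumption)
    critic_loss.backward()
    nn.utils.clip_grad_norm_(list(critic1.parameters()) + list(critic2.parameters()), grad_clip)
    opt_critic.step()

    if step % policy_delay == 0:                              # delayed policy update
        actor_loss = -(w * critic1(s, actor(s))).mean()       # ascend Q along the actor's actions
        opt_actor.zero_grad()
        actor_loss.backward()
        opt_actor.step()

        for net, tgt in [(actor, actor_tgt), (critic1, critic1_tgt), (critic2, critic2_tgt)]:
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)     # soft target update

    return td_err.detach().abs().squeeze(-1)                  # new |TD errors| for the replay buffer
```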

3.3.3. Testing

The process of EV charging control strategy evaluation is detailed in Algorithm 2. After training is completed, the parameters of the trained policy actor network $\pi_\eta$ are saved and loaded into an identical network structure within a testing program. The trained actor network still takes the observed state from the environment as input and generates actions to guide EV charging time allocation. Unlike during training, the critic networks are not used. Within an episode, only forward interactions with the environment are needed: the agent interacts sequentially, producing $\langle s_t, a_t, r_t, s_{t+1} \rangle$ transitions until the episode ends. Finally, at the end of the episode, the overall scheduling effectiveness is evaluated based on metrics including GRU prediction model accuracy, total/average charging cost, daily load fluctuation variance, and short-term load fluctuation speed, demonstrating the reasonableness and effectiveness of the charging control strategy. Results are averaged over multiple test episodes.
Algorithm 2: EV Charging Control Strategy Evaluation
Inputs: trained actor parameters; horizon T; number of days N; GRU prediction network
Outputs: mean values of the three evaluation metrics; average 24-h load curve
1: Load the trained actor parameters $\eta$
2: For episode = 1 to N do
3:     Use the GRU prediction network with the previous three-day history to obtain $z^{pr}$
4:     Initialize the episode state with the information of 200 EVs, $z^{pr}$, and the time-of-use price
5:     For t = 1 to T do
6:         Obtain the policy output $\pi_\eta(s_t)$
7:         Map it to the feasible action $a_t$ according to the mapping relationship and priority
8:         Execute $a_t$, observe $r_t$, compute $s_{t+1}$, and record the load trajectory
9:     End
10:    Calculate the standard deviation of the load, average charging cost, and short-term load growth rate
11: End
12: Report the mean values of the three evaluation metrics over N episodes and plot the average 24-h load variation curve

4. Simulation Results and Evaluation

This chapter provides a systematic performance evaluation and result analysis of the proposed GRU-TD3 method. It begins by detailing the simulation setup, including datasets, model parameters, and comparison benchmarks. This is followed by an analysis of the GRU load forecasting model’s performance. Subsequently, the comprehensive effectiveness of the charging scheduling strategy is evaluated, encompassing the training process and multi-strategy comparisons. Finally, the impact of key parameters on algorithm performance is explored through parameter ablation simulations.

4.1. Simulation Setup

To comprehensively evaluate the performance of the GRU-TD3 method, this simulation utilizes real-world datasets and establishes reasonable model parameters and comparison benchmarks. The implementation is developed in Python 3.9 using the PyTorch 1.13 framework, ensuring reproducibility and efficient GPU-based training.

4.1.1. Dataset and Parameters

The simulation setup of the GRU-TD3 method is based on the open-source load dataset from the 2016 China Society for Electrical Engineering Cup competition. This dataset adequately reflects the peak-valley characteristics of residential electricity consumption, making it highly suitable for electric vehicle charging behavior research. It primarily contains historical residential electricity load data and does not include other potential influencing factors. In terms of model configuration, the GRU prediction module takes historical load data from the previous 72 h as input and outputs the load forecast for the next 48 h. It has 256 hidden units, with a training set to test set ratio of 7:3.
The TD3 scheduling module employs a four-layer fully connected network structure. The actor network has 512 neurons in its hidden layers with batch normalization, while the critic network has 256 neurons in its hidden layers. Key training parameters include actor network learning rate of 0.001, critic network learning rate of 0.002, discount factor γ = 0.95, soft-update rate τ = 0.001, and importance-sampling parameters α = 0.5 and β = 0.4 with a β growth rate of 0.0003.
The electric vehicle parameters are set considering practical application scenarios. The daily number of charging vehicles is 200, with a rated charging power of 7 kW, battery capacity of 82 kWh, and charging efficiency of 90%. Vehicle behavior simulation is based on real data statistics: arrival time follows a normal distribution N(17.47, 3.41²) constrained to the interval [12, 24], departure time follows N(31.92, 3.24²) constrained to [36, 48], and initial SoC follows N(0.4, 0.1²) constrained to [0.1, 0.9]. The electricity price parameters adopt Guangzhou’s residential time-of-use pricing system, with specific values shown in Figure 3.
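For reproducibility of the sampling step, the sketch below draws the daily fleet attributes from the stated truncated normal distributions; the rejection-sampling implementation of the truncation is our assumption, while the distribution parameters are taken from the text above.

```python
import numpy as np

def sample_ev_fleet(n=200, seed=0):
    """Sample daily EV attributes from the truncated normal distributions of Section 4.1.1."""
    rng = np.random.default_rng(seed)

    def trunc_normal(mean, std, low, high, size):
        x = rng.normal(mean, std, size)
        bad = (x < low) | (x > high)
        while np.any(bad):                       # re-draw any out-of-range samples
            x[bad] = rng.normal(mean, std, bad.sum())
            bad = (x < low) | (x > high)
        return x

    t_arr = trunc_normal(17.47, 3.41, 12.0, 24.0, n)   # arrival time (h)
    t_dep = trunc_normal(31.92, 3.24, 36.0, 48.0, n)   # next-day estimated departure time (h)
    soc0  = trunc_normal(0.40, 0.10, 0.1, 0.9, n)      # initial SoC fraction
    return t_arr, t_dep, soc0
```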

4.1.2. Benchmark Setup

To ensure a comprehensive and fair comparison, the baselines are divided into two groups:
(1)
Algorithmic comparisons
This group focuses on the optimization paradigm under identical objectives and constraints.
  • GRU-TD3 (proposed): Forecast-augmented DRL model combining GRU prediction and TD3 decision-making for adaptive and stable scheduling.
  • DDPG: Classical DRL algorithm for continuous control, used to verify the improvement brought by the proposed GRU-TD3.
  • PSO: Representative model-based optimization method; each particle encodes EV charging power, minimizing a weighted sum of cost and load variance.
(2)
Multi-strategy scheduling comparison
This group reflects practical rule-based scheduling strategies.
  • Base Load: No electric vehicle charging load.
  • Uncoordinated Charging: Vehicles charge immediately at maximum power, significantly increasing load pressure during peak hours.
  • Cost-Only Optimization: Solely considers charging cost, leading to a pronounced “midnight peak” phenomenon.

4.2. Load Forecasting Results and Validation

The accuracy of the GRU load forecasting model is crucial for the overall scheduling performance. Figure 4 shows the prediction performance of the GRU model on the test set, including the comparison between predicted and actual values and the standard deviation range. It can be observed that the GRU model effectively captures the trends and periodic characteristics of load changes.
Figure 5 further shows the convergence curve of the mean squared error loss during training, indicating that the GRU model reaches a stable state after sufficient training. The Mean Absolute Percentage Error of 4.56% on the test set indicates that this forecasting model has high accuracy and can provide reliable state input for the subsequent reinforcement learning scheduling.

4.3. Scheduling Result Analysis

4.3.1. Scheduling Algorithms Results Comparison

Since the PSO method is a population-based heuristic without an iterative learning process comparable to DRL, its convergence curve is not directly comparable. Therefore, this subsection focuses on analyzing the training performance and convergence behavior of the two reinforcement learning algorithms. Figure 6 and Figure 7 show the change in the rewards during the training process of the DDPG and GRU-TD3 algorithms. It can be seen that in the first 1200 training episodes, the DDPG agent lacked positive feedback guidance for charging behavior, leading to slow learning progress. Although improvement occurred between episodes 1300–1500, due to the limitations of the DDPG algorithm, it ultimately converged to a discrete local optimum and failed to effectively explore the state-space boundaries. In contrast, Figure 7 shows that, after adding positive feedback for charging quantity, adopting the double Q-learning mechanism, and importance sampling, the agent continuously learned the charging process from the start of training, gradually converging after about 500 episodes, oscillating within the average cost range of 24.8 ¥ to 25.5 ¥. Both the training speed and optimization results are superior to the DDPG algorithm.
Figure 8 presents the optimized load curves obtained by DDPG, GRU-TD3, and PSO under identical conditions. The GRU-TD3 algorithm achieves the most stable and balanced load profile, effectively mitigating both midday and midnight peaks. The DDPG method improves load uniformity in several intervals but still shows local fluctuations due to limited exploration capability. In contrast, PSO produces the highest peak-to-valley difference and overall load oscillation, reflecting its difficulty in achieving global coordination under dynamic constraints. Quantitatively, the average charging cost per vehicle obtained by GRU-TD3 is 24.8 ¥, compared with 27.6 ¥ for DDPG and 29.1 ¥ for PSO. The average load standard deviation is 730.5 kW for GRU-TD3, 838.4 kW for DDPG, and 914.0 kW for PSO. These results indicate that GRU-TD3 outperforms both DDPG and PSO in terms of economic efficiency and load stability, demonstrating stronger adaptability and convergence performance in multi-period scheduling optimization.

4.3.2. Multi-Strategy Scheduling Results Comparison

Figure 9 presents the results of the second group of simulations, focusing on the comparison among different scheduling strategies introduced. Table 1 quantifies the key performance indicators for the four scenarios. Analysis shows that compared to uncoordinated charging, the GRU-TD3 method reduces user charging costs by 42.5% (from 42.79 ¥ to 24.41 ¥) and reduces the daily load standard deviation by 20.3% (from 915.6 kW to 729.8 kW). Compared to the cost-only optimization strategy, GRU-TD3 significantly improves load fluctuation while maintaining a similar cost level.
Figure 10 further compares the short-term fluctuation characteristics of the four scenarios through histograms of the load change rate over 24 h. It can be seen that the GRU-TD3 method effectively suppresses the load growth rate during the midnight period (hours 24–29), while the cost-only strategy exhibits a very high short-term fluctuation rate during this period, verifying the proposed method’s effectiveness in solving the “midnight peak” problem.

4.4. Parameter Ablation Simulation

To optimize algorithm performance, ablation simulations were conducted on key training parameters. Figure 10 shows the convergence characteristics of the TD3 algorithm under the initial parameter settings ( α = 0.6 , β = 0.0001 , ρ 1 = 0.001 , ρ 2 = 0.002 ). Although convergence was achieved, obvious oscillations were present, and there was room for improvement in convergence speed.
We implemented the following parameter adjustments: (1) Exploration Noise Decay: Changed the action noise decay function from linear to exponential to enhance stability in the later stages of training. (2) Importance-Sampling Parameters: Slightly decreased α (from 0.6 to 0.5) and increased the β growth rate (from 0.0001 to 0.0003). (3) Network Learning Rates: Increased the AC network learning rates (actor network from 0.001 to 0.003, critic network from 0.002 to 0.003).
Figure 11 shows the convergence situation after parameter adjustment ( α = 0.5 , β = 0.0003 , ρ 1 = 0.003 , ρ 2 = 0.003 ): the convergence speed significantly improved (requiring only about 200 episodes), the reward function value stabilized around 650, and the oscillation amplitude was effectively suppressed. The average charging cost in a single simulation was 24.41 ¥, close to the optimal 24.40 ¥. However, simulations found that higher learning rates made the convergence process more sensitive to the initial random exploration conditions. Although curve 2 (Figure 11) shows the best performance, this represents a successful result after multiple attempts, with a high frequency of convergence failures in actual training. The reason for this phenomenon is that the aggressive learning rates amplify the impact of the low-quality sample data generated in the early exploration stage on the network parameters. Although the exponentially decayed exploration noise helps stability in the later stages, it may prematurely restrict the policy’s exploration scope in the early stages of training. When combined with high learning rates, it can easily cause the policy to fall into a local optimum or lead to divergent Q-value estimates. The adjusted importance-sampling parameters, lower α and higher β growth rate, alleviated this issue to some extent by reducing the intensity of prioritized sampling based on TD error, leading to a more uniform sample distribution, but they could not completely offset the inherent risk brought by the high learning rates. Therefore, from the perspective of training robustness, the conservative parameter settings from Section 4.1.1 are still recommended. Although convergence requires about 500 episodes, the failure rate in the early exploration stage of training is significantly reduced, making the overall training process more reliable.

5. Conclusions

To address the challenges of grid load fluctuations and rising user costs caused by uncoordinated EV charging in residential communities, this study formulates the EV charging scheduling problem as an MDP. By analyzing the constraints and random distribution characteristics of environmental parameters, a complete mathematical model with defined boundaries is established. The GRU neural network is innovatively employed to predict the base load, effectively enriching the state space. On the algorithmic front, the TD3 method is adopted and enhanced with key improvements: dual critic networks are introduced to enhance stability, an optimized importance-sampling mechanism accelerates convergence, and a phased reward function combined with sample filling logic guides the agent to progressively master charging strategies, enabling effective multi-objective optimization.
Extensive simulations validate the effectiveness of the proposed GRU-TD3 method. Compared to the DDPG and PSO algorithms, the proposed approach reduces user costs by 11.7% and decreases the load standard deviation by 12.9%. When compared to uncoordinated charging strategies, it achieves a 42.5% reduction in user costs and a 20.3% decrease in load standard deviation. The simulation results demonstrate that the method not only maintains near-optimal charging costs but also effectively achieves load “peak shaving and valley filling,” significantly mitigating the “midnight peak” phenomenon. Through ablation studies, the impacts of key parameters, such as reward weights, network learning rates, and importance-sampling factors, on algorithmic convergence and optimization performance are further analyzed.
Future work will extend the proposed framework toward hierarchical architectures, multi-factor load forecasting, and more practical constraint modeling to enhance scalability and real-world applicability. These improvements are expected to further strengthen the method’s generalization capability and stability under diverse operating conditions.

Author Contributions

Conceptualization, Y.Z. and X.J.; methodology, Y.Z. and X.J.; software, Y.Z., S.T. and P.W.; validation, Y.Z. and X.J.; formal analysis, S.T. and Y.L.; investigation, X.J. and P.W.; resources, X.J., S.T. and Y.W.; data curation, S.T. and Y.L.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z. and X.J.; visualization, S.T. and Y.L.; supervision, P.W. and Y.W.; project administration, Y.Z. and Y.W.; funding acquisition, Y.Z. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research Project of Xi’an Siyuan University (No. XASYZD-B2309) and the Key Project of the University Engineering Research Center, Shaanxi Provincial Department of Education (No. 24JR138).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following nomenclature is used in this manuscript.
$E_{arr,i}^{soc}$: remaining state of charge of the $i$-th vehicle upon arrival
$Q$: Q-value function
$R_t$: reward function
$a_t$: action at time $t$
$p_{i,j}$: charging power of the $j$-th vehicle at time $i$
$r_t$: reward at time $t$
$s_t$: state at time $t$
$t_{i,arr}$: arrival time of the $i$-th vehicle
$t_{i,dep}$: estimated departure time of the $i$-th vehicle
$\gamma$: discount factor
$\epsilon$: exploration coefficient
$\eta$: charging efficiency
$\rho_1$: learning rate of the actor network
$\rho_2$: learning rate of the critic network
$\tau$: soft-update factor

References

  1. Salman, M.; Arslan, M.; Khan, S.A.; Fahad, S.; Imran, M.; Ullah, S. Chapter 14—Policies for the future: Promoting electric vehicle deployment. In Handbook on New Paradigms in Smart Charging for E-Mobility; Kumar, A., Bansal, R.C., Kumar, P., HE, X., Eds.; Elsevier: Amsterdam, The Netherlands, 2025; pp. 481–507. [Google Scholar]
  2. Salman, M.; Arslan, M.; Khan, S.A.; Fahad, S.; Imran, M.; Ullah, S. Chapter 11—Demand-side management and managing electric vehicles and their optimal charging locations and scheduling in smart grids. In Handbook on New Paradigms in Smart Charging for E-Mobility; Kumar, A., Bansal, R.C., Kumar, P., HE, X., Eds.; Elsevier: Amsterdam, The Netherlands, 2025; pp. 375–403. [Google Scholar]
  3. Nottrott, A.; Kleissl, J.; Washom, B. Storage Dispatch Optimization for Grid-Connected Combined Photovoltaic-Battery Storage Systems. In Proceedings of the 2012 IEEE Power and Energy Society General Meeting, San Diego, CA, USA, 22–26 July 2012; pp. 1–7. [Google Scholar]
  4. Koyanagi, F.; Uriu, Y. A Strategy of Load Leveling by Charging and Discharging Time Control of Electric Vehicles. IEEE Trans. Power Syst. 2002, 13, 1179–1184. [Google Scholar] [CrossRef]
  5. Flath, C.M.; Ilg, J.P.; Gottwalt, S.; Schmeck, H.; Weinhardt, C. Improving Electric Vehicle Charging Coordination Through Area Pricing. Transp. Sci. 2014, 48, 619–634. [Google Scholar] [CrossRef]
  6. Leemput, N.; Geth, F.; Claessens, B.; Van Roy, J.; Ponnette, R.; Driesen, J. A Case Study of Coordinated Electric Vehicle Charging for Peak Shaving on A Low Voltage Grid. In Proceedings of the 2012 3rd IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe), Berlin, Germany, 14–17 October 2012; pp. 1–7. [Google Scholar]
  7. Li, C.; Zhu, Y.; Lee, K.Y. Route Optimization of Electric Vehicles Based on Reinsertion Genetic Algorithm. IEEE Trans. Transp. Electrif. 2023, 9, 3753–3768. [Google Scholar] [CrossRef]
  8. Korkas, C.D.; Baldi, S.; Yuan, S.; Kosmatopoulos, E.B. An Adaptive Learning-Based Approach for Nearly Optimal Dynamic Charging of Electric Vehicle Fleets. IEEE Trans. Intell. Transp. Syst. 2017, 19, 2066–2075. [Google Scholar] [CrossRef]
  9. Zhang, L.; Li, Y. Optimal Management for Parking-Lot Electric Vehicle Charging by Two-Stage Approximate Dynamic Programming. IEEE Trans. Smart Grid 2015, 8, 1722–1730. [Google Scholar] [CrossRef]
  10. Yang, L.; Zhang, J.; Poor, H.V. Risk-Aware Day-Ahead Scheduling and Real-time Dispatch for Electric Vehicle Charging. IEEE Trans. Smart Grid 2017, 5, 693–702. [Google Scholar] [CrossRef]
  11. Frendo, O.; Gaertner, N.; Stuckenschmidt, H. Real-Time Smart Charging Based on Precomputed Schedules. IEEE Trans. Smart Grid 2019, 10, 6921–6932. [Google Scholar] [CrossRef]
  12. Sarabi, S.; Kefsi, L. Electric Vehicle Charging Strategy Based on A Dynamic Programming Algorithm. In Proceedings of the IEEE International Conference on Intelligent Energy and Power Systems (IEPS), Kyiv, Ukraine, 2–6 June 2014; pp. 1–5. [Google Scholar]
  13. Liu, Z.F.; Zhang, W.; Ji, X.; Li, K. Optimal Planning of Charging Station for Electric Vehicle Based on Particle Swarm Optimization. In Proceedings of the IEEE PES Innovative Smart Grid Technologies, Tianjin, China, 21–24 May 2012; pp. 1–5. [Google Scholar]
  14. Ihekwaba, A.; Kim, C. Analysis of Electric Vehicle Charging Impact on Grid Voltage Regulation. In Proceedings of the 2017 North American Power Symposium (NAPS), Morgantown, WV, USA, 17–19 September 2017; pp. 1–6. [Google Scholar]
  15. Lample, G.; Chaplot, D.S. Playing FPS Games with Deep Reinforcement Learning. In AAAI’17: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; AAAI Press: San Francisco, CA, USA, 2017; Volume 31. [Google Scholar]
  16. Waltz, M.; Fu, K.S. A Heuristic Approach to Reinforcement Learning Control Systems. IEEE Trans. Autom. Control 1965, 10, 390–398. [Google Scholar] [CrossRef]
  17. Keneshloo, Y.; Shi, T.; Ramakrishnan, N.; Reddy, C.K. Deep Reinforcement Learning for Sequence-to-Sequence Models. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 2469–2489. [Google Scholar] [CrossRef] [PubMed]
  18. Najafi, S.; Livani, H. Robust Day-Ahead Voltage Support and Building Demand Response Scheduling Under Gaussian Mixture Model Uncertainty. IEEE Trans. Ind. Appl. 2025, in press. [CrossRef]
  19. Ding, T.; Zeng, Z.; Bai, J.; Qin, B.; Yang, Y.; Shahidehpour, M. Optimal Electric Vehicle Charging Strategy With Markov Decision Process and Reinforcement Learning Technique. IEEE Trans. Ind. Appl. 2020, 56, 5811–5823. [Google Scholar] [CrossRef]
  20. Ji, Y.; Wang, Y.; Zhao, H.; Gui, G.; Gacanin, H.; Sari, H.; Adachi, F. Multi-Agent Reinforcement Learning Resources Allocation Method Using Dueling Double Deep Q-Network in Vehicular Networks. IEEE Trans. Veh. Technol. 2023, 72, 13447–13460. [Google Scholar] [CrossRef]
  21. Huang, J.; Zhou, X. Optimizing EV Charging Station Placement in New South Wales: A Soft Actor-Critic Reinforcement Learning Approach. In Proceedings of the 5th International Conference on Computer Engineering and Application (ICCEA), Hangzhou, China, 12–14 April 2024; pp. 1790–1794. [Google Scholar]
  22. Lotfy, A.; Chaoui, H.; Kandidayeni, M.; Boulon, L. Enhancing Energy Management Strategy for Battery Electric Vehicles: Incorporating Cell Balancing and Multi-Agent Twin Delayed Deep Deterministic Policy Gradient Architecture. IEEE Trans. Veh. Technol. 2024, 73, 16593–16607. [Google Scholar] [CrossRef]
  23. Bi, X.; Gao, D.; Yang, M. A Reinforcement Learning-Based Routing Protocol for Clustered EV-VANET. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1769–1773. [Google Scholar]
  24. Suresh Kumar, S.; Margala, M.; Siva Shankar, S.; Chakrabarti, P. A Novel Weight-Optimized LSTM for Dynamic Pricing Solutions in E-commerce Platforms Based on Customer Buying Behaviour. Soft Comput. 2023, 6, 1–13. [Google Scholar] [CrossRef]
  25. Cao, J.; Crozier, C.; McCulloch, M.; Fan, Z. Optimal Design and Operation of a Low Carbon Community Based Multi-Energy Systems Considering EV Integration. IEEE Trans. Sustain. Energy 2018, 10, 1217–1226. [Google Scholar] [CrossRef]
  26. Marasciuolo, F.; Orozco, C.; Dicorato, M.; Borghetti, A.; Forte, G. Chance-Constrained Calculation of the Reserve Service Provided by EV Charging Station Clusters in Energy Communities. IEEE Trans. Ind. Appl. 2023, 59, 4700–4709. [Google Scholar] [CrossRef]
Figure 1. Structure of charging scenario.
Figure 2. GRU-TD3 Algorithm Structure.
Figure 3. Time-of-Use price.
Figure 4. GRU-Based 48-Hour Load Forecasting Performance.
Figure 5. Training Convergence of GRU: MSE Loss Curve.
Figure 6. DDPG training reward curve.
Figure 7. GRU-TD3 Algorithm Reward (α = 0.6, β = 0.0001, ρ_1 = 0.001, ρ_2 = 0.002).
Figure 8. Comparison of load optimization results among DDPG, GRU-TD3, and PSO algorithms.
Figure 9. Load curve comparison results under four scheduling strategies.
Figure 10. Hourly Variation in Load.
Figure 11. TD3 training curves (α = 0.5, β = 0.0003, ρ_1 = 0.003, ρ_2 = 0.003).
Table 1. Charging Results Comparison.
Scene                | EV Charging Cost (¥/Vehicle) | Daily Load Standard Deviation
Basic Load           | –                            | 784.8
Disorderly Charging  | 42.79                        | 915.6
Cost-priority        | 24.40                        | 788.4
Proposed Method      | 24.41                        | 729.8
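As a quick consistency check on the rounded values in Table 1: relative to disorderly charging, the proposed method lowers the daily load standard deviation by (915.6 − 729.8)/915.6 ≈ 20.3%, matching the reduction reported above, while the per-vehicle cost drops by (42.79 − 24.41)/42.79 ≈ 43%, in line with the reported 42.5% (the small gap presumably reflects rounding of the tabulated costs).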
