Deep Reinforcement Learning for Charging Scheduling of Electric Vehicles Considering Distribution Network Voltage Stability

The rapid development of electric vehicle (EV) technology and the consequent charging demand have brought challenges to the stable operation of distribution networks (DNs). The collaborative optimization of EV charging scheduling and DN voltage control is intractable because the uncertainties of both the EVs and the DN need to be considered. In this paper, we propose a deep reinforcement learning (DRL) approach to coordinate EV charging scheduling and distribution network voltage control. The DRL-based strategy contains two layers: the upper layer aims to reduce the operating costs of power generation by distributed generators and power consumption by EVs, while the lower layer controls the Volt/Var devices to maintain the voltage stability of the distribution network. We model the coordinated EV charging scheduling and voltage control problem in the distribution network as a Markov decision process (MDP). The model considers the uncertainties of the charging process caused by the charging behavior of EV users, as well as the uncertainties of the uncontrollable load, dynamic electricity prices, and renewable energy generation. Since the model has a dynamic state space and mixed action outputs, the deep deterministic policy gradient (DDPG) framework is adopted to train the two-layer agent, and the policy network is designed to output both discrete and continuous control actions. Simulation and numerical results on the IEEE 33-bus test system demonstrate the effectiveness of the proposed method in collaborative EV charging scheduling and distribution network voltage stabilization.


Introduction
In recent years, electric vehicle (EV) technology has developed rapidly, driven by breakthroughs in battery technology [1,2]. As a substitute for fossil fuel vehicles, EVs have received extensive attention due to their environmentally friendly characteristics [2][3][4][5]. EVs reduce traffic pollution emissions and have lower charging costs than refueling, and have therefore been widely accepted and deployed [6][7][8][9]. However, large numbers of EVs connected to the power grid will bring challenges, such as frequency excursions and voltage fluctuations [10]. The voltage stability of the distribution network (DN) is an issue that needs particular attention. The uncontrolled charging of EVs affects the voltage stability of the distribution network [11], and when EVs are connected to the power grid in vehicle-to-grid (V2G) mode, the situation deteriorates further. In addition, the access of distributed generators (DGs) to the power system changes the direction of power flow; the active power injected upstream by DGs causes voltage rise and interferes with Volt/Var control (VVC) equipment [12]. The intermittency, randomness, and fluctuation of renewable energy sources (RESs) can also cause voltage fluctuations in the distribution network [13].
VVC is used to improve the voltage stability of the distribution network. In traditional VVC practice, voltage-regulating devices, such as on-load tap changers (OLTCs), voltage regulators (VRs), and switchable capacitor banks (SCBs), are leveraged to mitigate voltage violations [14]. In [15], a support vector regression-based model predictive control (MPC) method was proposed to optimize the voltage of a distribution network. For renewable energy connected through inverters, multistage methods have been widely used for the VVC of distribution networks [16,17]. To coordinate VVC equipment and distributed power supplies, neural networks and learning-based methods are widely used. An artificial neural network (ANN)-based approach was introduced for the VVC of distributed energy resources at the grid edge [18]. The authors of [19] proposed a safe off-policy deep reinforcement learning algorithm for VVC in a power distribution system. A model-free approach based on constrained safe deep reinforcement learning was proposed in [12] to solve the problem of the optimal operation of distribution networks. Although the aforementioned research has made progress in VVC, these studies neither considered the effect of EV charging on voltage stability nor the possibility of EVs participating in voltage control.
EVs can participate in the voltage regulation of distribution networks: on the one hand, the charging behavior of EVs responds to the electricity price; on the other hand, EVs can operate in V2G mode [20]. Adjusting the charging or discharging power of EVs helps stabilize the voltage of the distribution network [21][22][23][24]. Researchers have performed much work on the charging scheduling of EVs. The authors of [25] proposed an improved binary particle swarm optimization (PSO) approach to solve the controlled charging of EVs with the objectives of reducing the charging cost for EV users and relieving peak-power pressure on the distribution network. To avoid the limitations of deterministic methods in terms of models and parameters and their inability to handle real-time uncertainty, deep reinforcement learning is widely used in the charging scheduling problem for EVs. References [3,10] proposed model-free approaches based on deep reinforcement learning and safe deep reinforcement learning, respectively, for the charging scheduling of household EVs. Both consider system uncertainty and do not need an accurate model, but they only study the charging scheduling of home EVs. When EVs are charged at a larger scale, the charging process is managed by an aggregator or central controller. However, the charging process of EVs is highly uncertain, which requires the estimation and prediction of the EV charging demand. Artificial intelligence approaches are currently of interest due to their advantages in dealing with high-dimensional data and nonlinear problems. A Q-learning-based prediction method was proposed in [26] for forecasting the charging load of EVs under different charging scenarios. The authors of [27] proposed a demand modeling approach based on 3D convolutional generative adversarial networks.
Reference [28] designed a deep learning-based forecasting and classification network to study the long-term and short-term characteristics of the charging behaviors of plug-in EVs. To solve the problem of EV cluster charging, [29] proposed a hybrid approach to reduce the power loss and improve the voltage profile in the distribution system; both the vehicle-to-grid and grid-to-vehicle operational modes of EVs were considered in that work. However, the above research only studies EV charging from the perspective of demand response (DR). The capacity and access location of the EV charging load affect the power flow distribution of the distribution network, and disordered EV charging reduces its voltage stability. The authors of [30] proposed an evolutionary curriculum learning (ECL)-based multiagent deep reinforcement learning (MADRL) approach for optimizing transformer loss of life while considering the various charging demands of different EV owners; this work only focuses on the life of the transformer and does not directly control the voltage. Reference [20] proposed a three-layer hierarchical voltage control strategy for distribution networks considering the customized charging navigation of EVs. Although the hourly OLTC schedule is given a day ahead, the voltage is controlled at the minute level, and such frequent voltage regulation will reduce the lifetime of the OLTC.
The above analysis shows that current research is more concerned with either the VVC of the DN or the DR of EVs; few studies consider both and perform coordinated optimization, and those that do fail to consider the actual system comprehensively. The collaborative optimization of EVs, schedulable DGs, and VVC devices in a DN system faces several challenges. First, the charging goals of EV users and the goal of maintaining voltage stability in the distribution network conflict with each other. Second, the distribution network exhibits strong uncertainty and nonlinearity, and the charging process of EVs is highly uncertain due to arrival times, departure times, and electricity prices. Third, the system contains many homogeneous devices controlled by a mix of discrete and continuous actions.
To address these challenges, we formulate a collaborative EV charging scheduling and voltage control strategy based on DRL to comprehensively schedule the charging of EVs and control the voltage of the distribution network. We establish an MDP model for the charging scheduling of EVs and the voltage control problem of distribution networks. The state variables of the system account for the uncertainties of the EV charging process, nodal loads, RES generation, and the electricity price for interacting with the main grid. The purpose is to realize automatic voltage regulation and reduced EV charging costs through the collaborative control of VVC devices, EVs, and controllable DGs. The design of the reward function comprehensively considers the charging target of EVs and the voltage control objective of the DN. In contrast to the control strategies in the literature above, which are layered according to time scale, the proposed control strategy synergistically considers the scheduling of EVs and the voltage control of the DN. The collaborative scheduling control strategy consists of two layers: the upper layer manages the charging of EVs, and the lower layer regulates the voltage control equipment. The control strategy is output by a designed deep neural network (DNN) and trained using the model-free deep deterministic policy gradient (DDPG) method. A signal rounding block is placed after the output layer of the DNN to obtain the discrete control signals of the VVC devices. The main contributions of this work are:

•	A time-independent two-layer coordinated EV charging and voltage control framework is proposed to minimize EV charging costs and stabilize the voltage of distribution networks.

•	An MDP with unknown transition probabilities is established to solve the EV charging problem while considering the voltage stability of the DN. The reward function is designed to balance the EV charging target and the voltage stability target.

•	The model-free DDPG algorithm is introduced to solve the coordinated optimization problem. A DNN-based policy network is designed to output hybrid continuous scheduling signals and discrete control signals.
The rest of the paper is organized as follows: Section 2 presents the MDP model and introduces the collaborative scheduling control strategy. Section 3 encompasses the simulation experiments and analysis. Section 4 gives the conclusions.

Modelling of DN System
In this paper, we propose a collaborative EV charging and voltage control framework for a DN system. As shown in Figure 1, EVs, controllable DGs, and RESs are distributed in the DN system. EVs are controlled by smart terminals that manage the charging process and pay for charging. The central controller (CC) collects information on the operating status of the system through two-way real-time communication and, based on this information, outputs signals to control the controllable units through our proposed collaborative EV charging and voltage control strategy. In our formulation, the total operating horizon of the DN is divided into T time slots, and the subscript t represents a specific time slot. In the DN system, the subscript i ∈ Ω n is used to represent the nodes, where Ω n is the set of all nodes; the subscript ij ∈ Ω b is used to represent the branches, where Ω b is the set of all branches. We then perform the detailed modelling of EVs, controllable DGs, and voltage control devices in the DN system. The operational constraints of the DN are subsequently given, and the MDP model is finally established. It is worth noting that our model does not require knowledge of the topology of the DN, the specific line parameters, or the distribution and fluctuations of the load. The scheduling strategy is learned only from the observed system state.

EVs
The control variable of an EV is its charging power and, in V2G mode, its discharging power. For the EV connected to node i, the control variable is expressed as P EV i,t ; in V2G mode, a positive P EV i,t denotes charging and a negative value denotes discharging. Under the intelligent charging strategy, the charging power constraints are: where P EV i,ch,max and P EV i,dis,max are the maximum charging power and maximum discharging power of the EV, respectively.
The SOC represents the state of charge of the battery, which should meet the following constraints during any scheduling period t [31]: where E EV i is the capacity of the EV battery, η EV i,ch and η EV i,dis are the charging and discharging efficiencies, respectively, and SoC EV i,max and SoC EV i,min represent the maximum and minimum SOC, respectively.
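The SOC dynamics these constraints govern can be sketched as follows; this is a minimal Python sketch under common assumptions, and the function and variable names are ours, not the paper's notation:

```python
def next_soc(soc, p_ev, dt, e_cap, eta_ch, eta_dis):
    """Advance the battery SOC by one slot of length dt (hours).

    p_ev > 0 charges (efficiency eta_ch applied); p_ev < 0 discharges,
    with losses modelled by dividing by eta_dis. All names are
    illustrative, not the paper's notation.
    """
    if p_ev >= 0:
        delta = eta_ch * p_ev * dt / e_cap
    else:
        delta = p_ev * dt / (eta_dis * e_cap)
    # Clamp to the feasible SOC band (here [0, 1] for simplicity)
    return min(max(soc + delta, 0.0), 1.0)
```

In a scheduling environment, this update would be applied to every connected EV at the end of each time slot before computing the reward.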

Controllable DGs
The control variables for controllable DGs are their active and reactive power outputs. The active and reactive power outputs of the DG at node i are denoted by P DG i,t and Q DG i,t , respectively, and are subject to the following constraints:

OLTCs and VRs
The OLTC/VR is controlled by changing its tap position. The control variable of the OLTC/VR on branch ij is expressed as l VR ij,t , which varies within an integer range: where l VR ij,max represents the maximum adjustable tap position of the OLTC/VR.

SCBs
The SCB adjusts the amount of reactive power it provides by regulating the number of operating units. The number of operating units of the SCB at node i is expressed as n SCB i,t , which takes values in an integer range: where n SCB i,max is the maximum number of units that can be connected in operation.

Operational Constraints of the DN
The operational constraints of the distribution network are as follows: Equation (8) constrains the complex power that the substation can withstand, with P S t , Q S t , and S S being the active power, reactive power, and maximum apparent power of the substation, respectively. Equations (9) and (10) constrain the node voltage V i,t and branch current I ij,t , respectively.

MDP Model
The challenge of modeling the collaborative problem of EV charging scheduling and voltage control in the distribution network is how to deal with the various uncertainties in the system. It is also necessary to balance the charging target of EV users and the voltage control target of the distribution network. Therefore, we establish an MDP model for the collaborative optimization problem of EV charging scheduling and distribution network voltage control. The state variables, action variables, and reward function of the system are designed accordingly.

State
The state variable of the system at any time t is defined as: where P D i,t and Q D i,t represent the active and reactive power demand of node i, respectively, and the subscripts from (t − T + 1) to t indicate the information for the past T time periods; R S t−T+1 , · · · , R S t represents the historical electricity prices of the past T slots; P WT i,t and P PV i,t represent the wind turbine (WT) and PV power outputs at node i, respectively; and for the EV connected to node i, SoC EV i,t represents its state of charge at time slot t, t EV i,arr and t EV i,dep represent the arrival (start of charging) time and departure time, respectively, and SoC EV i,dep represents the expected state of charge of the EV at departure time t EV i,dep . The dimension of the state space explodes when the system is large. In the distribution network studied, we assume that the WT and PV are uncontrollable power sources. To reduce the complexity of the state space, the active power demand of node i can be replaced by the net load demand of the node: The simplified system state variable is represented as follows:
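The net-load substitution and the assembly of the simplified state vector can be sketched as follows; the names and layout are illustrative only, and the paper's exact state definition is given by its equations:

```python
def net_load(p_demand, p_wt, p_pv):
    # Net active demand of a node after subtracting the uncontrollable
    # WT and PV outputs, as in the state-space simplification above
    return p_demand - p_wt - p_pv

def build_state(p_net, q_d, price, soc, t_arr, t_dep, soc_dep):
    # Flatten per-node quantities and per-EV charging information into a
    # single state vector; ordering and contents are an assumption here
    return (list(p_net) + list(q_d) + [price] + list(soc)
            + list(t_arr) + list(t_dep) + list(soc_dep))
```

Replacing the separate demand, WT, and PV entries with one net-load entry per node shrinks the input dimension of the policy network without discarding the information the agent needs.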

Action
The control variables of the system include the active and reactive power of the DGs, the charging power of the EVs, the tap position of the OLTC, and the number of operating SCB units.
where the tap position of the OLTC and the number of SCB units are discrete actions and the rest are continuous actions.

Reward Function
The design of the reward function takes into consideration the operating cost of the system and the voltage violations of the distribution network. The operating costs of the system include the cost of the DN interacting with the main grid through the substation, the generation costs of the DGs, and the charging costs of EVs.
where α represents the weight coefficient, which indicates that the reward function is the tradeoff value between the operating cost and the voltage stability target.
The first term C G t calculates the cost of purchasing electricity from the main grid through the substation.
The second term C DG i,t calculates the generation cost of the DG at node i.
where a i , b i , and c i are the generation cost coefficients.
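A minimal sketch of this generation cost term, assuming the conventional quadratic form for the coefficients a_i, b_i, and c_i (the paper's exact expression is given by its cost equation):

```python
def dg_cost(p_dg, a, b, c):
    # Conventional quadratic generation cost curve a*P^2 + b*P + c;
    # the quadratic form is a common assumption for coefficients
    # named a_i, b_i, c_i, not a statement of the paper's equation
    return a * p_dg ** 2 + b * p_dg + c
```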
The third term C EV i,t calculates the charging cost of EVs.
We assume that all EVs in the DN system are connected in V2G mode to take full advantage of the flexibility of EVs in regulating the voltage of the DN, and that their charging is settled according to the trading tariff of the substation.
The fourth term C V i,t calculates the penalty of voltage violation, corresponding to the node voltage constraint given in Equation (9).
where σ is the penalty coefficient of voltage deviation.
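One plausible realization of this penalty term, assuming a linear penalty on out-of-band voltages; the function name, default band, and default σ are illustrative assumptions, not the paper's values:

```python
def voltage_penalty(voltages, v_min=0.95, v_max=1.05, sigma=100.0):
    # Sigma times the total per-unit voltage deviation outside the
    # allowed band [v_min, v_max]; the linear form is an assumption
    penalty = 0.0
    for v in voltages:
        if v > v_max:
            penalty += sigma * (v - v_max)
        elif v < v_min:
            penalty += sigma * (v_min - v)
    return penalty
```

During training, the voltages would come from the power flow solution of the simulated DN at each time slot.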

Objective
Define J(π) as the expected cumulative discounted return over the scheduling cycle: where γ ∈ [0, 1] is the discount factor and τ represents the system trajectory under policy π.
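The discounted return over one scheduling cycle can be computed with a simple backward recursion (a minimal sketch; the reward values in the test are illustrative):

```python
def discounted_return(rewards, gamma):
    # Accumulate G = r_1 + gamma*r_2 + gamma^2*r_3 + ... by iterating
    # backwards over the per-slot rewards of one scheduling cycle
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```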

Deep Reinforcement Learning Solution
In this section, a DRL-based approach is introduced for collaborative EV charging scheduling and distribution network voltage control. The problem in Section 2.1 is first transformed into the RL framework. Then we design a DNN to handle the output of mixed discrete and continuous actions and train the DNN by DDPG [32]. The agent of DDPG consists of two layers. All dispatching signals and control signals are output by the upper layer of the agent, and the lower layer obtains a complete reward value by controlling the controllable units in the distribution network.

DRL Based Approach
Under the reinforcement learning framework, the learning agent interacts with the constructed MDP environment model. The optimization problem is transformed into a standard reinforcement learning framework with the following objective: where Q π * is the optimal action-value function. The action-value function Q π (s t , a t ) describes the expected return for taking action a t in state s t and then following policy π, and it is used in many reinforcement learning algorithms. It is denoted by Equation (22).
The optimal action-value function Q π * can be derived by solving the Bellman equation recursively. Then, we can obtain the optimal policy π * , which means the optimal action a t ∼ π * can be obtained. This optimization problem can be described by Equations (23) and (24).
The Bellman equation is difficult to solve for complex problems. To address this, value-based methods use a look-up table or a deep neural network to estimate the optimal action-value function Q π * and update it iteratively. The approximation is usually described as a function Q(s, a|θ Q ) of the parameters θ Q , and the parameters are optimized with the objective of minimizing a loss function based on temporal-difference learning.
where B is the batch size of the samples drawn from the replay buffer and y t is the target value: Reinforcement learning methods that use an approximation function to estimate the value function are known as value-based RL methods. However, they have some disadvantages in practical applications, especially when dealing with continuous action spaces, where a good scheduling strategy cannot be obtained. Therefore, we use policy-based reinforcement learning methods, which directly approximate the policy and optimize the policy function through gradient ascent until a convergent policy is obtained.
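The target value and temporal-difference loss described above can be sketched as follows; this is a minimal illustration with scalar Q-values, whereas in the actual algorithm these values come from the target critic network:

```python
def td_target(reward, next_q, gamma, done=False):
    # Bootstrapped target y_t = r_t + gamma * Q'(s_{t+1}, a_{t+1});
    # the bootstrap term is dropped at the end of an episode
    return reward if done else reward + gamma * next_q

def critic_loss(q_values, targets):
    # Mean squared temporal-difference error over a minibatch of size B
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
```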
The deep deterministic policy gradient (DDPG) [32] algorithm is introduced to solve the complex coordinated EV charging and voltage control problem, which has high-dimensional and continuous action spaces, using only low-dimensional observations. The DDPG algorithm is a policy-based DRL algorithm with an actor-critic architecture. Both the actor and the critic contain two neural networks: the actor consists of two DNNs with parameters θ µ and θ µ′ , and the critic consists of two multilayer perceptrons (MLPs) with parameters θ Q and θ Q′ , respectively. The construction of the DDPG algorithm is shown in Figure 2. Similar to standard reinforcement learning, DDPG has a learning agent that interacts with the distribution network environment in discrete timesteps. The input of the DDPG agent is the system state s t at time step t and the output is the action a t . We assume the studied DN environment is fully observed. DDPG uses experience replay to ensure independence between the samples used for target value updating when training the neural networks. After each interaction of the agent with the environment, we obtain a sample containing s t , a t , r t , and s t+1 and store it in the replay buffer. The agent continues to interact with the environment until the set condition is met; then, B samples are randomly drawn from the replay buffer to minimize the loss (Equation (25)) of the critic network and to calculate the gradient (Equation (27)) of the actor network, softly updating the parameters of the critic and actor networks, respectively.
The DDPG algorithm combines the success of the actor-critic approach and DQN [33] using dual networks on top of the deterministic policy gradient (DPG) algorithm.
The DPG algorithm is based on the actor-critic structure, which consists of an actor and a critic. The critic Q(s, a) is learned using the Bellman equation as in Q-learning. According to Equation (26), the update rule for the parameters of the critic is given by Equation (27).
The actor is a parameterized function µ(s|θ µ ) that specifies the current policy by deterministically mapping states to actions. The parameters of the actor network are updated based on the policy gradient method. Policy gradient algorithms apply gradient ascent to update the policy parameters and rely on the sampled sequence of decisions made while interacting with the environment. The actor is updated by applying the chain rule to the expected return J from the start distribution. The update rule for the parameters of the actor is given by Equation (28): where J is the expected return from the start distribution, µ is the deterministic target policy, θ is the parameter of the function approximator, ρ β is the discounted state visitation distribution for a different stochastic behavior policy β, and s i is the state of the ith sample in the minibatch drawn from the replay buffer.

Design of the Parameterized Policy Network
The standard DDPG uses a multilayer perceptron (MLP) to approximate the policy and output continuous action values. We design a DNN to approximate the coordinated policy. Figure 3 illustrates the architecture of the designed policy network. The status information on the system's renewable energy output P WT t , P PV t , load demand P L t , real-time LMP price R S t , and EV state of charge SoC t is fed into the network, which outputs the defined continuous action vector. To ensure the stability and convergence of the learning process, all input state data are normalized according to their respective min-max values. An RNN could be used as the policy network when the state variables contain information from the past T time periods. In our model, the state variables only contain information from the current moment to reduce the dimensionality of the state space, so we choose a DNN as the policy network to extract the feature information of the system state variables. The final layer of the network uses tanh as the activation function and outputs continuous values in the range [−1, 1]. To output discrete control actions, we add an integral block behind the output layer to produce the discrete control signals for the OLTC and SCB. In this way, the mixed discrete-continuous action output a t is obtained. The min-max block behind the output layer in the figure represents the limit on the range of output continuous actions, corresponding to the constraints on the DGs and EVs in Section 2. To alleviate the problem of vanishing or exploding gradients, a rectified linear unit (ReLU) is used as the activation function of each neuron in the hidden layers. The details of the architecture of the policy network of the proposed DDPG structure are provided in the
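The min-max and integral (rounding) blocks described above can be sketched as follows; this is a minimal illustration, and the names and ranges are assumptions rather than the paper's exact design:

```python
def scale_action(u, lo, hi):
    # Min-max block: map a tanh output u in [-1, 1] onto the feasible
    # range [lo, hi] of a continuous device (DG or EV power)
    return lo + (u + 1.0) * (hi - lo) / 2.0

def round_action(u, max_pos):
    # Integral (rounding) block: map u in [-1, 1] onto an integer tap
    # position or SCB unit count in [-max_pos, max_pos]
    return int(round(scale_action(u, -float(max_pos), float(max_pos))))
```

Applying these two blocks to the tanh outputs yields the mixed discrete-continuous action vector in a single forward pass, without a separate discrete policy head.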

Practical Implementation
The scheduling process for the DN can be summarized as the offline training and online scheduling processes in Figure 4. The coordinated EV charging and voltage control strategy in the agent contains two layers. The upper layer is the dispatching layer, which outputs the control signals of all dispatchable units according to the system status. The lower layer is the voltage control layer, which is responsible for receiving these control signals and controlling the dispatchable units in the DN system. The parameters (weights and biases) of the agent's initial policy are random, and the policy network cannot yet output the optimal action. Therefore, the policy network of the agent needs to be trained offline using historical environmental data before it can operate practically. The parameters of the DNN are updated through iterative interaction with the environment and the accumulation of experience. With this approach, the agent can gradually optimize the network parameters to more accurately approach the optimal collaborative strategy.
The agent is trained in a centralized mode using historical system data and then run in online mode. During the training process, the voltage layer calculates the penalty of voltage fluctuation in the reward function by running a simulated distribution network system. The pseudocode for the training procedure of the DRL-based method is presented in Algorithm 1.

Practices Implementation
The scheduling process for DN can be summarized as the offline training and online scheduling process in Figure 4. The coordinate EV charging and voltage control strategy in the agent contains two layers. The upper layer is the dispatching layer, which outputs the control signals of all dispatchable units according to the system status. The lower layer is the voltage control layer, which is the response for receiving these control signals and controlling the dispatchable units in the DN system. The parameters (weights and biases) of the initial policy of the agent are random and the policy network cannot output the optimal action. Therefore, the policy network of the agent needs to be trained offline using historical environmental data before it can operate practically. The parameters of the DNN are updated through iterative interaction with the environment and the accumulation of experience. With this approach, the agent can gradually optimize the network parameters to more accurately approach the optimal collaborative strategy.
The agent is trained in a centralized mode using historical system data and then run in online mode. During training, the voltage control layer computes the voltage-fluctuation penalty in the reward function by running a simulated distribution network system. The pseudocode for the training procedure of the DRL-based method is presented in Algorithm 1.

Algorithm 1 DDPG-based Learning Algorithm
1  Initialize weights θ^Q and θ^μ of the critic network Q(s, a|θ^Q) and the actor network μ(s|θ^μ)
2  Initialize weights θ^Q′ ← θ^Q, θ^μ′ ← θ^μ of the target networks Q′ and μ′
3  Initialize the experience replay buffer R
4  for episode = 1, 2, ..., M do
5      Receive the initial observation state s_1
6      for t = 1, 2, ..., T do
7          Choose a_t = μ(s_t|θ^μ) and run the simulation using pandapower
8          Observe the reward r_t and the next state s_{t+1}
9          Store the transition (s_t, a_t, r_t, s_{t+1}) in R
10         Sample a random minibatch of B transitions (s_i, a_i, r_i, s_{i+1}) from R
11         Set y_i = r_i + γ · Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′) according to Equation (26)
12         Update the critic network parameters by minimizing the loss, see Equation (27)
13         Update the actor policy using the sampled policy gradient, see Equation (28)
14         Softly update the target networks using the updated critic and actor network parameters
15     end for
16 end for

In Algorithm 1, all network parameters (weights and biases) of the DDPG are initialized before training starts. At the beginning of each episode, the environment is reset to obtain the initial state of the system. The policy network under the current parameters then interacts with the environment for T time steps. During the interaction, the current state, the action, the immediate reward, and the observed next state are combined into one sample, which is stored in the replay buffer. Next, a random batch of samples from the replay buffer is used to update the parameters of the actor and critic networks of the DDPG. After offline training, the trained network parameters are preserved for online operation. In practical operation, the preserved network parameters are loaded, and the system state is fed in to output the control signals of the collaborative strategy.
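The core updates of Algorithm 1 (the TD target of Equation (26) and the soft target-network update of line 14) can be sketched as follows. This is a minimal NumPy illustration, not the paper's TensorFlow implementation; the tiny linear "networks", their dimensions, and the values of γ and τ are placeholders.

```python
import numpy as np

GAMMA, TAU = 0.99, 0.005  # discount factor and soft-update rate (illustrative values)

def td_targets(r, s_next, target_actor, target_critic):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), cf. Equation (26)."""
    return r + GAMMA * target_critic(s_next, target_actor(s_next))

def soft_update(target_params, online_params, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter array."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online_params, target_params)]

# Tiny illustrative "networks": fixed linear maps instead of trained DNNs.
rng = np.random.default_rng(0)
W_actor = rng.normal(size=(3,))            # maps a 3-dim state to a scalar action
W_critic = rng.normal(size=(4,))           # maps (state, action) to a scalar Q-value
actor = lambda s: np.tanh(s @ W_actor)     # tanh keeps actions in [-1, 1]
critic = lambda s, a: np.concatenate([s, a[:, None]], axis=1) @ W_critic

batch_s_next = rng.normal(size=(5, 3))
batch_r = rng.normal(size=(5,))
y = td_targets(batch_r, batch_s_next, actor, critic)  # one target per sampled transition
assert y.shape == (5,)

# A small tau moves the target network only slightly toward the online weights.
new_target = soft_update([np.ones(3)], [np.zeros(3)], tau=0.005)
assert np.allclose(new_target[0], 0.995)
```

The soft update is what stabilizes DDPG training: the targets in Equation (26) change slowly even while the online critic is updated at every step.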
The agent outputs control signals to the system only through the dispatching layer, and the voltage control layer no longer calculates the voltage-fluctuation penalty. The pseudocode for the practical running process of the algorithm is presented in Algorithm 2.

Algorithm 2 Practical Running Process
1 Input: system state s_t
2 Output: EV charging/discharging schedule and voltage control signals
3 for t = 1, 2, ..., T do
4     Obtain historical information and the EV charging demand
5     Build the observation state s_t according to Equation (13)
6     Choose the action a_t according to Equation (24) using the policy trained by Algorithm 1
7     Output the EV charging/discharging schedule and voltage control signals
8 end for
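The online loop of Algorithm 2 reduces to repeatedly mapping the observed state through the frozen actor. The sketch below uses stand-in functions for the trained policy and the state builder (their names, the state dimension, and the linear policy are illustrative assumptions, not the paper's code):

```python
import numpy as np

def run_online(policy, build_state, T=24):
    """Practical running loop of Algorithm 2: the trained, frozen policy maps
    each observed state to dispatch and voltage-control signals; no reward or
    voltage penalty is computed online."""
    schedule = []
    for t in range(T):
        s_t = build_state(t)      # observation state, cf. Equation (13)
        a_t = policy(s_t)         # trained actor output, cf. Equation (24)
        schedule.append(a_t)      # EV charging/discharging + Volt/Var signals
    return schedule

# Illustrative stand-ins: a fixed linear policy over a 4-dim state.
rng = np.random.default_rng(1)
W = rng.normal(size=(4,))
policy = lambda s: float(np.tanh(s @ W))
build_state = lambda t: rng.normal(size=4)

signals = run_online(policy, build_state, T=12)
assert len(signals) == 12 and all(-1.0 <= a <= 1.0 for a in signals)
```

Because no learning happens online, each step costs only one forward pass, which is what makes the one-step test times in Table 6 small compared to training.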

Results and Discussion
In this section, we present the details of the simulation experiments used to test the proposed method and demonstrate its effectiveness through an analysis of the simulation results. The simulations are trained and tested on a personal computer with an NVIDIA RTX-3070 GPU and an Intel(R) Core(TM) i7-10700K CPU. The code is written in Python 3.7.8, the reinforcement learning algorithm is implemented using the deep learning package TensorFlow 1.14.0 [34], and the distribution network environment is realized using pandapower 2.10.1 [35].

IEEE-33 Node System and Parameter Settings
The performance of the proposed learning method is evaluated on a modified IEEE-33 node system [36]. Figure 5 shows the topology of the test feeder system. An OLTC is installed at bus 0 to connect to the external grid; it has 11 tap positions with an adjustment range of −10% to 10% (2% per tap). Two SCBs with a capacity of 400 kVar each are connected at nodes 17 and 30, each containing four units. Controllable DGs are connected at nodes 17 and 32, and a WT and a PV unit are installed at nodes 21 and 24, respectively. The detailed parameter settings of the DGs and RESs are presented in Table 2.
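Since the policy network emits continuous values in [−1, 1] (tanh output) while the OLTC and SCBs accept only discrete commands, a rounding map of the following form can recover the device settings. This is an illustrative discretization consistent with the device parameters above (11 taps at 2% per tap, four capacitor units per SCB); the exact mapping used in the paper may differ.

```python
import numpy as np

def to_tap(u):
    """Map a continuous actor output u in [-1, 1] to one of the 11 OLTC tap
    positions (-5..+5, i.e. -10% to +10% in 2% steps)."""
    return int(np.clip(round(u * 5), -5, 5))

def tap_ratio(tap):
    """Voltage ratio implied by a tap position: 1 + 0.02 * tap."""
    return 1.0 + 0.02 * tap

def to_scb_units(u, n_units=4):
    """Map u in [-1, 1] to the number of switched-in capacitor units (0..n_units)."""
    return int(np.clip(round((u + 1.0) / 2.0 * n_units), 0, n_units))

# Extremes of the actor output hit the extremes of the device ranges.
assert to_tap(1.0) == 5 and abs(tap_ratio(5) - 1.10) < 1e-12
assert to_tap(-1.0) == -5 and abs(tap_ratio(-5) - 0.90) < 1e-12
assert to_scb_units(-1.0) == 0 and to_scb_units(1.0) == 4
```

Rounding after tanh keeps the policy network fully differentiable during training while still producing valid discrete Volt/Var commands at execution time.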
To reflect the complexity of the system, we evenly distributed the EV charging stations throughout the test system. As shown in Figure 5, EV charging stations are set up at nodes 8, 13, 19, 22, and 29, each of which can be connected to a different number of EVs. The Nissan Leaf is taken as the typical EV prototype: the maximum charge/discharge power of each vehicle is set to 6 kW and the battery capacity to 24 kWh. The charging and discharging efficiencies of the EVs are set to 0.98 and 0.95, respectively. As suggested by [3,37], the EV arrival time, departure time, and battery SOC at arrival obey truncated normal distributions. These distributions and the specific parameter settings are presented in Table 3. The safe range for the SOC of an EV battery is [0.2, 1.0]. The objective is to minimize the total operating cost of the system and the fluctuation of the node voltages. The safe range of nodal voltages is set between 0.95 p.u. and 1.05 p.u.

The time-series data of the California Independent System Operator (CAISO) [38] are used to simulate the electricity prices, load demand, and RES generation in the distribution network system. We downloaded data for 2019 and 2020 and used these two years as the training set and test set, respectively. To make the load data meet the requirements of the considered system, the downloaded load data are first normalized and then multiplied by the node base load power of the standard IEEE-33 node system. The downloaded wind turbine and photovoltaic output data are processed in the same way.
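The two preprocessing steps above, sampling truncated normal EV parameters and scaling normalized CAISO profiles by node base loads, can be sketched as below. The truncation bounds [7, 11] for the arrival time and the 100 kW node base load are assumed for illustration only; the actual boundaries are those of Table 3 and the IEEE-33 system data.

```python
import numpy as np

def truncated_normal(mean, std, low, high, size, rng):
    """Sample from N(mean, std^2) truncated to [low, high] by rejection sampling."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draw = rng.normal(mean, std, size=size)
        ok = draw[(draw >= low) & (draw <= high)]   # keep only in-bounds draws
        take = min(size - filled, ok.size)
        out[filled:filled + take] = ok[:take]
        filled += take
    return out

rng = np.random.default_rng(42)
# Hypothetical parameters in the spirit of Table 3: arrival time ~ N(9, 1^2) on [7, 11].
t_arr = truncated_normal(9.0, 1.0, 7.0, 11.0, size=1000, rng=rng)
assert t_arr.min() >= 7.0 and t_arr.max() <= 11.0

# Scaling a normalized CAISO load profile by an assumed node base load of 100 kW:
caiso_load = rng.uniform(0.4, 1.0, size=24)     # placeholder for the downloaded profile
normalized = caiso_load / caiso_load.max()      # step 1: normalize to [0, 1]
node_load_kw = normalized * 100.0               # step 2: multiply by the node base load
assert node_load_kw.max() <= 100.0 + 1e-9
```

Rejection sampling terminates quickly here because [7, 11] covers roughly 95% of the untruncated distribution's mass (±2σ).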
To verify the effectiveness and scalability of the coordinated strategy, two simulation cases are designed based on the load capacity that the system can handle: Case 1 contains 5 EVs and Case 2 contains 50 EVs. Electric vehicles are characterized by spending more time parked than driving, and EV users prefer to charge at night when electricity prices are lower. Therefore, we set up fewer EVs in the simulation scenario of Case 1 and more EVs in that of Case 2. In Case 1, the EVs in the system are charged during the daytime, with a charging scheduling window from 8:00 to 19:00, for a total of 12 h. In Case 2, the EVs are charged during the nighttime, with a charging window from 20:00 to 7:00, for a total of 12 h. The expected charge level is required to be no less than 80% to alleviate range anxiety.

Table 3. Parameter setting of EVs.

Variable          Distribution
Arrival time      t_arr^EV ∼ N(9, 1²)
The proposed method is compared with several benchmark approaches, including the DRL-based DQN [33], soft actor-critic (SAC) [39], and proximal policy optimization (PPO) [40] methods. The policy network of DQN contains three hidden ReLU layers with 64 neurons each. SAC has the same policy network structure as DDPG, both containing three ReLU layers with 256, 128, and 64 neurons, respectively, and an output layer with tanh as the activation function, using the same approach to obtain hybrid actions. Eleven levels of optional actions are set in the action space of DQN, while SAC and PPO output actions in the same way as DDPG. Additional algorithm parameters are given in Table 4. All algorithms use the same voltage-fluctuation penalty parameter to ensure a fair comparison. Figure 6 shows the results of the system nodal voltages using the DDPG algorithm.
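The shared actor architecture described above (three ReLU hidden layers of 256, 128, and 64 neurons, tanh output) can be written as a plain forward pass. This NumPy sketch only illustrates the layer structure; the state and action dimensions are placeholders, and the paper's actual networks are trained TensorFlow models.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one dense layer (illustrative init)."""
    return rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)

def policy_forward(s, params):
    """Actor structure from the text: three ReLU hidden layers (256, 128, 64)
    followed by a tanh output layer producing actions in [-1, 1]; discrete
    device commands are recovered afterwards by rounding the relevant outputs."""
    h = s
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return np.tanh(h @ W + b)

rng = np.random.default_rng(7)
state_dim, action_dim = 40, 8            # illustrative sizes, not the paper's dimensions
sizes = [state_dim, 256, 128, 64, action_dim]
params = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]

a = policy_forward(rng.normal(size=state_dim), params)
assert a.shape == (action_dim,) and np.all(np.abs(a) <= 1.0)
```

The single tanh output head is what lets one network drive both continuous EV charging powers and, after rounding, the discrete OLTC/SCB commands.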
Figure 6a,c shows the results for the node voltages with voltage control in Case 1 and Case 2, respectively, and Figure 6b,d shows the results without voltage control in Case 1 and Case 2, respectively. Comparing Figure 6a,c with Figure 6b,d vertically, it can be observed that the system voltage decreases as the EV load in the system increases. The comparison shows that controlling the voltage while scheduling the EVs keeps the voltage at each node within the safe range, indicating that the proposed coordinated control strategy can safely regulate the node voltages of the system.

Figure 7 compares the average cumulative voltage violation (CVV) of the proposed DDPG method and the comparison methods over five independent training runs with different random seeds. As shown in Figure 7a, in Case 1, DDPG learns a safe and stable voltage control strategy after 500 training episodes, whereas SAC converges after 1000 episodes, and DQN and PPO converge after more than 2500 episodes. In Case 2, DDPG converges to a lower average CVV after 1000 training episodes. Both PPO and SAC require 1500 training episodes to converge, and DQN, although converging after 2000 training episodes, still exhibits large fluctuations, which are related to the discretization of its action output. The comparison algorithms perform worse in voltage control, all exhibiting higher voltage violation values.

The control results of the coordinated control strategy for the OLTC and SCBs in Case 1 and Case 2 are given in Figure 8. Combined with the node voltage results in Figure 6a,c, both the OLTC and the SCBs can be controlled to keep the voltage at each node of the system within the range of [0.95, 1.05], thus avoiding voltage drops and voltage limit violations.
Figure 9 compares the cumulative operating cost curves of the proposed method with several other EV scheduling strategies, including the constant power charging (CPC) strategy, the TOU incentive strategy, and a PSO-based scheduling strategy without voltage control (NVC). In the constant power charging strategy, the charging power of the EV is set to the maximum charging power of the battery.
The TOU price used in the simulation is given in Table 5 [20]. As shown in Figure 9a, the cumulative operating costs for DDPG, CPC, TOU, and NVC in Case 1 are USD 0.974M, USD 1.219M, USD 1.096M, and USD 0.722M, respectively. Compared with CPC and TOU, DDPG reduces the operating cost by 20.1% and 11.1%, respectively. As shown in Figure 9b, the cumulative operating costs for DDPG, CPC, TOU, and NVC in Case 2 are USD 9.35M, USD 13.468M, USD 10.272M, and USD 7.309M, respectively. Compared with CPC and TOU, DDPG reduces the operating cost by 30.58% and 8.98%, respectively. Although the scheduling strategy without voltage control has the lowest cumulative operating cost, it cannot guarantee the stability of the system voltage. From the above analysis, the following conclusion can be drawn: the proposed DDPG approach ensures system voltage stability at the expense of some economy.
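The reported percentages follow directly from the cumulative costs as relative reductions, (baseline − DDPG) / baseline. A quick recomputation from the figures quoted above:

```python
# Relative cost reductions recomputed from the cumulative operating costs
# (in USD millions) reported for Case 1 and Case 2.
def reduction(baseline, ddpg):
    """Percentage reduction of DDPG relative to a baseline strategy."""
    return round((baseline - ddpg) / baseline * 100, 2)

# Case 1: DDPG 0.974, CPC 1.219, TOU 1.096
assert reduction(1.219, 0.974) == 20.1    # vs. CPC: 20.1%
assert reduction(1.096, 0.974) == 11.13   # vs. TOU: 11.1% in the text (rounded)

# Case 2: DDPG 9.35, CPC 13.468, TOU 10.272
assert reduction(13.468, 9.35) == 30.58   # vs. CPC: 30.58%
assert reduction(10.272, 9.35) == 8.98    # vs. TOU: 8.98%
```

The NVC strategy is excluded from the comparison because its lower cost comes from ignoring the voltage constraint altogether.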
Figure 10 compares the cumulative rewards of the proposed DDPG method and the comparison methods over five independent training runs with different random seeds. Figure 10a shows the convergence curves of the different DRL methods in Case 1. From Figure 10a, it can be seen that DDPG is able to learn an economical EV charging strategy after 500 episodes of training, whereas SAC converges after 1500 episodes and DQN requires more training to converge.
The cumulative returns of DDPG, DQN, SAC, and PPO during training converge to USD 1.2135M, USD 2.5023M, USD 1.8798M, and USD 1.4392M, respectively. Compared to DQN, SAC, and PPO, DDPG reduces the cost by 51.5%, 35.4%, and 15.7%, respectively. Figure 10b shows the convergence curves of the different DRL methods in Case 2. DDPG in Figure 10b converges after 1000 training episodes, while PPO, SAC, and DQN all require 2000 training episodes to converge. The cumulative returns of DDPG, DQN, SAC, and PPO are USD 3.2181M, USD 4.6912M, USD 4.4152M, and USD 3.7642M, respectively. Compared to DQN, SAC, and PPO, DDPG reduces the cost by 31.4%, 27.11%, and 14.51%, respectively. Additionally, as can be seen in Figure 10, DDPG and PPO achieve higher returns than SAC and DQN. Compared to PPO, DDPG has a faster convergence rate and more stable convergence results. Combined with the training results in Figure 7, we can conclude that DDPG learns a safe and economical coordinated control strategy faster and more stably than the other popular DRL methods. The average running times of training and testing (one step) for all algorithms are listed in Table 6. DQN has the longest training time among the compared methods because, unlike the methods that output continuous actions, it relies on a discrete action space; as the number of controllable units in the system increases, the action space and hence the training time of DQN grow further.
The scheduling results for the DGs and EVs in Case 1 are given in Figure 11. The purple star symbol on the SOC curve in Figure 11b indicates the SOC value at the start of EV charging. From the SOC curves, the following conclusion can be drawn: the proposed charging strategy charges the EVs with the goal of reducing charging costs, and the reward function is designed to balance the EV charging target with the DN voltage control target.

Simulation Comparison of Cost Reduction Performance
Based on the SOC values at the end of the charging process, the battery of an EV is not always fully charged; this is because the voltage stability of the DN is taken into account when scheduling EV charging. The voltage constraint of the DN prevents the EVs from being completely filled, but the desired charge level can still be achieved.


Conclusions
In this paper, we proposed a DRL approach based on DDPG for the coordinated EV charging and voltage control problem in distribution networks. The proposed two-layer scheduling control strategy enables the agent to learn an economical scheduling strategy while maintaining the voltage stability of the distribution network. The proposed method is data-driven and does not rely on explicit models of the uncertainties in the system. The designed policy network can directly generate hybrid continuous scheduling and discrete control signals. The simulation experiments are conducted on a modified IEEE-33 node system, and real-world power system data are used for training and testing. Two simulation cases with different scenarios are designed to verify the effectiveness and scalability of the proposed approach. The simulation results demonstrate that the proposed approach can successfully learn an effective policy to charge EVs in a cost-efficient way while considering voltage stability. The numerical results demonstrate the effectiveness of the DDPG approach, which significantly reduces the operating cost of the system in both the Case 1 and Case 2 scenarios and converges faster than the other DRL methods used for comparison. The comparison results show that the proposed approach strikes a good balance between the charging demand of the EVs and the voltage stability of the distribution network.
The charging scheduling of EVs is a complex process, and more physical characteristics should be considered. In future work, the impacts of battery degradation and V2G operation on the EV charging process should be carefully considered in order to establish a more realistic environment.