Transmission Network Expansion Planning Considering Wind Power and Load Uncertainties Based on Multi-Agent DDQN

Abstract: This paper presents a multi-agent Double Deep Q Network (DDQN) method based on deep reinforcement learning for solving the transmission network expansion planning (TNEP) problem of a high-penetration renewable energy source (RES) system under uncertainty. First, a K-means algorithm that enhances the extraction quality of the uncertain characteristics of variable wind and load power is proposed; its clustering objective function considers the accumulation and change rate of operation data. Then, based on the typical scenarios, we build a bi-level TNEP model that includes comprehensive cost, electrical betweenness, wind curtailment, and load shedding to evaluate the stability and economy of the network. Finally, we propose a multi-agent DDQN that predicts the construction value of each line through interaction with the TNEP model and then optimizes the line construction sequence. This training mechanism is more traceable and interpretable than heuristic-based methods. Moreover, the experience reuse characteristic of the multi-agent DDQN allows it to be applied to multi-scenario TNEP tasks without repeated training. Simulation results obtained on the modified IEEE 24-bus system and the New England 39-bus system verify the effectiveness of the proposed method.


Introduction
Although countries have actively implemented Nationally Determined Contributions (NDCs) to alleviate climate deterioration in recent years, global greenhouse gas emissions are still in the process of continuous growth, and there has not yet been a peak phenomenon. In order to control the future temperature rise within 1.5 • C, the United Nations Environment Programme advocates that countries around the world should reduce the emissions to fill the gap between the current greenhouse gas emissions level and the Paris Agreement provisions [1]. The transformation of energy structure is regarded as the primary way for emissions reduction by all countries. Many countries have formulated plans to build a high-penetration renewable energy source (RES) system, which fully releases the high environmental and economic value of renewable energy by replacing fossil energy [2,3]. There are two main challenges in the RES system construction. One is to solve the time and space uncertainties caused by the intermittency of renewable energy [4], and the other is to optimize the network structure for large-scale renewable energy integration [5]. The transmission network expansion planning (TNEP) is the crucial task of power system construction, which determines the basic structure and system characteristic. Therefore, the characteristics of system with high-penetration of RES should be fully considered in the TNEP task on the basis of ensuring system stability and economy.

The main contributions of this paper are as follows:
• A K-means algorithm that enhances the extraction quality of the uncertain characteristics of variable wind and load power is proposed. The proposed method considers the accumulation and change rate of operation data.
• A calculation method of wind curtailment and load shedding that reduces the computational complexity while retaining the uncertainty of the system is proposed. The calculation is based on the typical uncertain scenarios extracted from operation data.
• A bi-level TNEP model considering the system stability and economy is constructed; this model includes the comprehensive cost, wind curtailment, load shedding, and electrical betweenness.
• A multi-agent DDQN is proposed based on the bi-level model, which is a high-performance and interpretable machine learning method for the TNEP task.
This paper is organized as follows: Section 2 constructs the bi-level TNEP model that considers the wind power and load uncertainties. Section 3 builds the multi-agent DDQN for the TNEP task based on deep reinforcement learning. Section 4 applies the multi-agent DDQN to TNEP tasks on the modified IEEE 24-bus system and the New England 39-bus system.

TNEP Bi-Level Model Based on Typical Scenarios of Wind Power and Load Uncertainties
This section constructs a TNEP bi-level model considering the system stability and economy based on uncertain scenarios. First, the typical uncertain scenarios of wind power and load data are extracted with the improved K-Means algorithm. Second, based on the extracted results, the TNEP bi-level model is constructed. This model comprehensively evaluates the economy and stability of the transmission network under high-penetration wind power and variable load.

Improved K-Means Algorithm Based on Characteristics of Wind Power and Load Operation Data
The output power of a wind farm is closely related to the regional weather, and load-side behavior also makes the input power of the load variable. A system with high-penetration wind farms and variable load injection faces high uncertainties at the source side and load side, which greatly affect the stability of the RES system. In the system to be expanded, there are many combinations of wind farm output power and load input power recorded in the historical operation data, and it is unrealistic to consider every scenario in the TNEP task. Therefore, this paper uses the improved K-Means algorithm to extract typical scenarios, which saves considerable calculation time for the TNEP task while preserving the system uncertainty.
The K-Means algorithm is an intuitive and efficient clustering method based on the distance between data samples, and its performance remains excellent when it is applied to the classification of large data sets.
When the K-Means algorithm is used to extract typical scenarios from operation data, the K value should be given first to determine K cluster centers. Then, through iterative optimization of the K cluster centers, the sum of the distances between the classified samples and each cluster center is minimized. The traditional sum of squared errors (SSE) is
$$SSE = \sum_{n=1}^{K} \sum_{x \in \delta_n} \| x - C_n \|^2,$$
where $x$ is the operation data; $\delta_n$ is the set of operation data assigned to cluster $n$; $C_n$ is the data of cluster center $n$. However, when the traditional SSE is used for the clustering task, its morphology-based clustering objective cannot fully reflect the fluctuation characteristics and accumulation of operation data, which are critical for the TNEP task. Therefore, this paper proposes to adopt accumulation and change rate as indicators to measure the data uncertainty, and then uses these two indicators as the clustering objective to improve the quality of the clustered data.
The accumulation of operation data $D_{cumulative}$ is
$$D_{cumulative} = \sum_{h=1}^{24} d_h, \tag{2}$$
where $d_h$ is the value of the operation data at the $h$-th hour, and the change rate of operation data $D_{change}$ is
$$D_{change} = \sum_{h=2}^{24} \frac{|d_h - d_{h-1}|}{d_{rated}}, \tag{3}$$
where $d_{rated}$ is the rated value of the operation data; $d_{h-1}$ is the value of the operation data at the $(h-1)$-th hour. Based on (2) and (3), the clustering objective function $SSE_{new}$ of the improved K-Means algorithm is
$$SSE_{new} = \sum_{n=1}^{K} \sum_{x \in \delta_n} \left[ (D_{cumulative,x} - C_{cumulative,n})^2 + (D_{change,x} - C_{change,n})^2 \right], \tag{4}$$
where $D_{cumulative,x}$ is the accumulation of operation data $x$; $C_{cumulative,n}$ is the accumulation of cluster center $n$; $D_{change,x}$ is the change rate of operation data $x$; $C_{change,n}$ is the change rate of cluster center $n$.
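As an illustration of the clustering objective above, the following sketch (function and variable names are our own, not from the paper) maps each daily profile to the accumulation and change-rate indicators of (2) and (3), then runs plain Lloyd-style K-Means in that indicator space:

```python
import numpy as np

def uncertainty_features(profiles, d_rated):
    """Map each 24-h profile to (accumulation, change rate)."""
    cumulative = profiles.sum(axis=1)                                  # Eq. (2)
    change = np.abs(np.diff(profiles, axis=1)).sum(axis=1) / d_rated   # Eq. (3)
    return np.column_stack([cumulative, change])

def kmeans_sse_new(features, k, iters=50):
    """Plain Lloyd iterations that minimise SSE_new in the indicator space."""
    # Deterministic initialisation: k evenly spaced samples as seeds.
    centers = features[np.linspace(0, len(features) - 1, k).astype(int)].copy()
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        dist = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for n in range(k):
            if (labels == n).any():
                centers[n] = features[labels == n].mean(axis=0)
    sse_new = float(((features - centers[labels]) ** 2).sum())
    return labels, centers, sse_new

# Two synthetic daily wind modes: a flat profile and a strongly fluctuating one.
flat = np.tile(np.full(24, 50.0), (10, 1))
spiky = np.tile(np.where(np.arange(24) % 2 == 0, 90.0, 10.0), (10, 1))
feats = uncertainty_features(np.vstack([flat, spiky]), d_rated=100.0)
labels, _, _ = kmeans_sse_new(feats, k=2)
```

Both synthetic modes carry the same daily energy, so a purely morphological objective could merge them; the change-rate indicator is what separates them here, which is exactly the motivation stated above.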

Bi-Level Multi-Objective TNEP Model
The TNEP task of the RES system is a multi-objective problem. It needs to coordinate economy and stability. In addition, the uncertainties of the system under large-scale wind power and variable load should also be considered. Therefore, this section constructs a bi-level model based on the nature of the transmission network evaluation index, and each layer model is composed of objective function and constraints.

Upper-Level Objective Function
The upper-level model uses the comprehensive cost to evaluate the economy of the system. The comprehensive cost is composed of the construction cost, the network loss cost, and the operation and maintenance cost. The construction cost $f_1$ of the TNEP task is formed by the uniform annual investment of the transmission lines:
$$f_1 = \frac{r_d (1+r_d)^y}{(1+r_d)^y - 1} \sum_{l=1}^{N_L} \lambda_l X_l,$$
where $r_d$ is the capital discount rate of the line; $y$ is the life expectancy of the line; $N_L$ is the total number of lines; $\lambda_l$ is the line construction state; $X_l$ is the construction investment of line $l$.
The transmission network loss refers to the power lost in the form of heat during power transmission. The transmission network loss cost $f_2$ is
$$f_2 = p_{loss} \sum_{l=1}^{N_L} \frac{P_l^2 + Q_l^2}{U_l^2} R_l,$$
where $P_l$ is the active power of line $l$ in AC rectangular form; $Q_l$ is the reactive power of line $l$ in AC rectangular form; $U_l$ is the voltage of line $l$ in AC rectangular form; $R_l$ is the resistance of line $l$; $p_{loss}$ is the unit electricity price of network loss. The operation and maintenance cost of the transmission network should consider the line and transformer equipment. However, the transformer operation and maintenance cost is related to the load rate, and the parameters and transmission power of each transformer in the IEEE RTS-24 bus system are almost equal. Therefore, the transformer operation and maintenance cost has little effect on the scheme choice. The system operation and maintenance cost $f_3$ is
$$f_3 = \sum_{l=1}^{N_L} \eta_l \lambda_l X_l,$$
where $\eta_l$ is the line operation and maintenance cost coefficient. The calculation of the upper-level objective function is based on the AC power flow method, which describes the power flow characteristics more accurately than the DC power flow method used in traditional TNEP methods. The upper-level objective function $f_{upper}(\cdot)$ is
$$f_{upper}(\cdot) = \min (f_1 + f_2 + f_3).$$
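For instance, the uniform annual investment of $f_1$ reduces to a standard annuity factor applied to the constructed lines; a minimal sketch (names are ours, and the annuity form is a reconstruction consistent with the symbol definitions above):

```python
def construction_cost(r_d, y, investments, built):
    """f1: annuity factor times the investment of every constructed line.

    r_d: capital discount rate; y: line life expectancy (years);
    investments: X_l per candidate line; built: lambda_l in {0, 1}.
    """
    annuity = r_d * (1 + r_d) ** y / ((1 + r_d) ** y - 1)
    return annuity * sum(x for x, lam in zip(investments, built) if lam)

# Three candidate lines, two of them constructed.
f1 = construction_cost(r_d=0.05, y=25,
                       investments=[120.0, 80.0, 60.0],
                       built=[1, 0, 1])
```

With a 5% discount rate over a 25-year life, the annuity factor is roughly 0.071, so each constructed line contributes about 7.1% of its investment per year.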

Upper-Level Constraints
The upper-level constraints are mainly composed of power transmission and equipment operation constraints. The AC power flow balance constraints are
$$P_j^{g} - P_{wind,j}^{curt} - P_j^{load} + P_{load,j}^{shed} = V_j \sum_{k} V_k (G_{jk} \cos\theta_{jk} + B_{jk} \sin\theta_{jk}),$$
$$Q_j^{g} - Q_{wind,j}^{curt} - Q_j^{load} + Q_{load,j}^{shed} + Q_j^{c} = V_j \sum_{k} V_k (G_{jk} \sin\theta_{jk} - B_{jk} \cos\theta_{jk}),$$
where $P_j^g$ is the generator rated active power output of node $j$; $P_{wind,j}^{curt}$ is the wind curtailment active power of node $j$; $V_j$ is the voltage of node $j$; $B_{jk}$ is the susceptance between node $j$ and node $k$; $G_{jk}$ is the conductance between node $j$ and node $k$; $\theta_{jk}$ is the phase angle difference between node $j$ and node $k$; $P_j^{load}$ is the load active power input of node $j$; $P_{load,j}^{shed}$ is the load active power shedding of node $j$; $Q_j^g$ is the generator reactive power output of node $j$; $Q_j^{load}$ is the load reactive power input of node $j$; $Q_{wind,j}^{curt}$ is the wind curtailment reactive power of node $j$; $Q_{load,j}^{shed}$ is the load reactive power shedding of node $j$; $Q_j^c$ is the reactive power of the reactive power compensation device at node $j$.
The voltage amplitude and phase angle constraints are
$$V_j^{min} \le V_j \le V_j^{max}, \quad \theta_j^{min} \le \theta_j \le \theta_j^{max},$$
where $V_j^{max}$ and $V_j^{min}$ are the maximum and minimum voltage of node $j$, and $\theta_j^{max}$ and $\theta_j^{min}$ are the maximum and minimum phase angle of node $j$. Because the wind farm has a strong reactive power regulation capability, this paper does not impose a special constraint on it. The wind power and general generator output constraints are
$$P_j^{g,min} \le P_j^{g} \le P_j^{g,max},$$
and the line power flow constraint is
$$|F_l| \le F_l^{max},$$
where $F_l$ is the power flow of line $l$; $F_l^{max}$ is the maximum power flow transmission of line $l$.
The wind curtailment constraint is
$$0 \le P_{wind,j}^{curt} \le (1 - \mu_{wind,j}) P_{wind,j}, \quad P_{wind,j}^{curt} \ge P_{wind,j}^{curt,lower},$$
where $\mu_{wind,j}$ is the minimum output ratio; $P_{wind,j}^{curt,lower}$ is the wind active power curtailment of the lower-level model.
The load shedding constraint is
$$0 \le P_{load,j}^{shed} \le (1 - \mu_{load,j}) P_j^{load}, \quad P_{load,j}^{shed} \ge P_{load,j}^{shed,lower},$$
where $\mu_{load,j}$ is the minimum load ratio; $P_{load,j}^{shed,lower}$ is the load active power shedding of the lower-level model.

Lower-Level Objective Function
The lower-level model is based on the typical uncertain scenarios. It evaluates the renewable energy consumption of the system through load shedding and wind curtailment calculations, and the system stability is evaluated by the improved electrical betweenness. Based on the bi-level model structure, the upper-level model obtains a Pareto set composed of the more economical lines, and the lower-level model only needs to calculate the schemes in this set. Then, the upper-level model further optimizes the TNEP scheme after receiving the calculation results of the lower-level model. This mechanism satisfies the constraints between the two levels and improves the computational efficiency.
The transmission network is a real-time balance system, but high-penetration wind power and variable load will affect this balance. Hence, when the wind farm output power is greater than the maximum absorbable power of the system, or the adjacent lines of the wind farm do not have enough capacity to transmit the power flow, the excess wind power needs to be curtailed to ensure the balance. On the contrary, when the system load power exceeds the sum of the power of the wind farms and the general generator sets, or the adjacent lines of the load node are blocked, the excess load will be shed. Figure 1 is the schematic diagram of wind curtailment and load shedding.

We set priority wind power output conditions to ensure the maximum wind power consumption. Therefore, the wind curtailment is determined by the load and the minimum output of the general generator sets. The wind curtailment of each hour $P_{wind,h}^{curt}$ is calculated by
$$P_{wind,h}^{curt} = \max\left(0,\; P_h^{wind} + P_h^{min} - P_h^{load}\right),$$
where $N_{wind}$ is the total number of wind farms; $P_h^{wind}$ is the sum of the output of the wind farms at the $h$-th hour; $P_h^{min}$ is the sum of the minimum output of the general generator sets at the $h$-th hour; $P_h^{load}$ is the sum of the load input at the $h$-th hour. The total wind curtailment of each scenario $P_{wind}^{curt}$ is
$$P_{wind}^{curt} = \sum_{h} P_{wind,h}^{curt}.$$
The load shedding of each hour $P_{load,h}^{shed}$ is
$$P_{load,h}^{shed} = \max\left(0,\; P_h^{load} - P_h^{wind} - \sum_{j} P_{j,h}^{max}\right),$$
where $P_{j,h}^{max}$ is the maximum output of general generator set $j$ at the $h$-th hour. The total load shedding of each scenario $P_{load}^{shed}$ is
$$P_{load}^{shed} = \sum_{h} P_{load,h}^{shed}.$$
The wind curtailment and load shedding can evaluate the operation economy of the system structure under uncertain scenarios. However, the high-penetration wind power and variable load may cause a line with excessive power flow to be cut off, which will lead to large-scale power flow transfer and may even cause a cascading failure. We propose to apply the improved electrical betweenness to measure the system power flow balance in uncertain scenarios, and use it to evaluate the system stability.
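The hourly balance rules described above can be sketched as follows (the max-form is our reading of the wind-priority dispatch, and all names are illustrative):

```python
def hourly_curtailment_and_shedding(p_wind, p_gen_min, p_gen_max, p_load):
    """Wind-priority balance for one hour.

    Curtailment: wind that cannot be absorbed once conventional units are
    already at their minimum output. Shedding: load exceeding wind plus the
    maximum conventional output."""
    curtail = max(0.0, p_wind + p_gen_min - p_load)
    shed = max(0.0, p_load - p_wind - p_gen_max)
    return curtail, shed

# Windy, low-load hour: surplus wind must be curtailed.
print(hourly_curtailment_and_shedding(120.0, 30.0, 80.0, 100.0))  # (50.0, 0.0)
# Calm, high-load hour: 50 MW of load has to be shed.
print(hourly_curtailment_and_shedding(20.0, 30.0, 80.0, 150.0))   # (0.0, 50.0)
```

Summing the two quantities over the hours of each typical scenario gives the per-scenario totals used by the lower-level model.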
The stability evaluation of the transmission network based on the electrical betweenness integrates the power flow characteristics into the topology analysis. This method uses the electrical betweenness to indicate the transmission power of each line in multiple scenarios; a large electrical betweenness means that the line is more important in the power flow transmission, and when it is forced to be cut off, the system will be severely affected. Therefore, it is necessary to balance the power flow transmission by constructing new lines to improve the system's ability to withstand the uncertainties of wind power and variable load. The calculation of the electrical betweenness first requires the system to be divided into combinations of a single generator set and a single load. Then, each combination is required to transmit unit power over the complete line structure. The active power flow $P_{l,unit}$ and reactive power flow $Q_{l,unit}$ of line $l$ under unit power transmission are
$$P_{l,unit} = V_{j,unit}^2 (G_{j0} + G_{jk}) - V_{j,unit} V_{k,unit} (G_{jk} \cos\theta_{jk} + B_{jk} \sin\theta_{jk}),$$
$$Q_{l,unit} = -V_{j,unit}^2 (B_{j0} + B_{jk}) - V_{j,unit} V_{k,unit} (G_{jk} \sin\theta_{jk} - B_{jk} \cos\theta_{jk}),$$
where $V_{j,unit}$ and $V_{k,unit}$ are the voltages of node $j$ and node $k$ under unit power transmission; $G_{j0}$ and $B_{j0}$ are the conductance and susceptance between node $j$ and the ground point; $B_{jk}$ is the susceptance between node $j$ and node $k$ under unit power transmission. Second, the coefficient $\omega$ of the unit power flow is determined by the smaller value of the generator set and the load in the selected combination. The unit power coefficient is calculated by
$$\omega = \begin{cases} \min(P_{wind}, P_{load}^{vari}) & \text{if the combination contains a wind generator and a variable load} \\ \min(P_{wind}, P_{load}^{const}) & \text{if the combination contains a wind generator and a constant load} \\ \min(P_g, P_{load}^{vari}) & \text{if the combination contains a general generator and a variable load} \\ \min(P_g, P_{load}^{const}) & \text{if the combination contains a general generator and a constant load,} \end{cases}$$
where $P_{load}^{vari}$ and $P_{load}^{const}$ are the power of the variable load and the constant load. Third, all combinations in the system should be traversed, and the electrical betweenness of each line can be obtained from the sum of the unit power flow distributions.
The electrical betweenness (EB) of line $l$ is
$$EB_l = \sum_{s \in \Phi} \omega_s \left( P_{l,unit,s} + Q_{l,unit,s} \right), \tag{26}$$
where $\Phi$ is the combination set; $s$ is a combination of one source and one load.
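Summing the weighted unit power flows over all source-load combinations, as in (26), can be sketched as follows (the data structures are our own):

```python
def electrical_betweenness(unit_flows, weights):
    """EB_l = sum_s omega_s * (P_l,unit,s + Q_l,unit,s) for every line l.

    unit_flows[s] is a list of (P, Q) pairs, one per line, obtained from the
    unit-power flow calculation for combination s; weights[s] is omega_s."""
    n_lines = len(next(iter(unit_flows.values())))
    eb = [0.0] * n_lines
    for s, flows in unit_flows.items():
        for l, (p, q) in enumerate(flows):
            eb[l] += weights[s] * (p + q)
    return eb

# Two source-load combinations over three lines (made-up unit flows).
flows = {"s1": [(0.5, 0.25), (0.25, 0.0), (0.125, 0.0)],
         "s2": [(0.25, 0.0), (0.5, 0.125), (0.25, 0.125)]}
eb = electrical_betweenness(flows, weights={"s1": 100.0, "s2": 50.0})
print(eb)  # [87.5, 56.25, 31.25]
```

Line 1 carries the most weighted unit power and would therefore be the most critical line if forced out of service.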
Equation (26) can compare the power flow of each line in the system, but it is difficult to intuitively evaluate the power flow balance of the whole system from it. Therefore, this paper proposes to use the Wasserstein distance to measure the uniformity of the electrical betweenness distribution.
The Wasserstein distance measures the similarity of two distributions by calculating the distance between them. In this paper, the Wasserstein distance between the electrical betweenness distribution and the absolutely balanced power flow distribution $EB_{balance}$ is used to evaluate the power flow balance of the transmission network. The power flow Wasserstein distance $Wass(EB)$ is
$$Wass(EB) = \inf_{\pi \in \Pi(EB,\, EB_{balance})} \mathbb{E}_{(a,b) \sim \pi} \left\| a - b \right\|,$$
where $\inf$ denotes the infimum; $\Pi(EB, EB_{balance})$ represents the set of all possible joint probability distributions of $EB_{l,h}$ and $EB_{balance}$; $\|A\|$ is any norm of $A$. The lower-level objective function needs to improve the system's renewable energy consumption capacity while ensuring that the system has a small improved electrical betweenness. Therefore, the lower-level objective function $f_{lower}(\cdot)$ minimizes the wind curtailment, the load shedding, and the power flow Wasserstein distance.
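For equal-size one-dimensional samples, the W1 distance reduces to the mean gap between sorted values, which is enough to gauge how far an EB distribution is from a perfectly balanced one (a sketch; the paper's norm choice may differ):

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 between two equal-size empirical distributions: for sorted
    samples, the optimal coupling is the identity pairing."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

eb = np.array([4.0, 1.0, 2.0, 5.0])        # electrical betweenness per line
eb_balance = np.full_like(eb, eb.mean())   # absolutely balanced flow: 3.0 each
print(wasserstein_1d(eb, eb_balance))      # 1.5  (lower = more uniform flow)
```

A network whose lines all carry similar weighted flow yields a distance near zero, so minimizing this term in the lower-level objective pushes the plan toward balanced power flow.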

Lower-Level Constraints
The lower-level constraints are similar to the upper-level constraints. The solution of this model is to find a transmission network structure that meets the various constraints and maximizes the performance of the objective functions. The method based on deep reinforcement learning determines the construction value of each line through Markov decisions to realize the TNEP task. Methods based on heuristic learning achieve the optimization goal by iterating over the overall transmission network structure, while commercial optimizers such as CPLEX solve the task model based on mathematical programming.

Multi-Agent DDQN for Transmission Network Expansion Planning
This section proposes the multi-agent DDQN based on deep reinforcement learning for solving the bi-level TNEP model. First, the task environment of TNEP is constructed based on the Markov Chain model. Second, an improved multi-agent DDQN is proposed according to the characteristics of the task model, which realizes the coordinated solution of the upper-level and lower-level models. Finally, we provide the training process of the improved multi-agent DDQN for the TNEP task.

Task Environment of Transmission Network Expansion Planning Based on the Markov Chain Model
The TNEP scheme is determined by the current system requirements and the established network structure. When each line is constructed, the system structure is transformed into a new state, and the operation state is also improved. This work can be abstracted into a Markov serialized decision process, and the schematic diagram of the Markov Chain model for the TNEP task is shown in Figure 2.

The Markov Chain model provides a way to solve the task through sequential decisions. Reinforcement learning uses this mechanism to build a task environment and an agent for task solving. The task environment provides the agent with the current task state and the executable actions. The agent chooses actions according to a certain strategy. The task environment changes the task state according to the selected action, then calculates the reward of the action and transmits it to the agent. Therefore, the task environment $\zeta_{TNEP}$ can be expressed in state space as
$$\zeta_{TNEP} = \{S, A, R, \gamma\},$$
where $S$ is the set of task states; $A$ is the set of executable actions; $R$ is the set of action rewards; $\gamma$ is the discount factor. In Figure 2, the current system structure is considered as the initial state $S_t$, and each line construction is considered as an action $A_t$. The probability of transition from state $S_t$ to state $S_{t+1}$ is $p(S_t, S_{t+1})$. When a line is constructed and the state changes to $S_t$, the system operational improvement is considered as a reward $R_t$. The state value and the state-action pair value are defined as $v(S_t)$ and $q(S_t, A_t)$. When state $S_t$ transforms into the state $S_N$, the cumulative reward is $G(S_t)$.

The Markov decision assumes that the generation of a new state is only related to the current state, and the state transition probability $p(S_t, S_{t+1})$ is
$$p(S_t, S_{t+1}) = p(S_{t+1} \mid S_t).$$
When the state $S_{t-1}$ changes to the state $S_t$, the reward $R_t$ is
$$R_t = \mathbb{E}[R_{t+1} \mid S_t],$$
where $\mathbb{E}[R_{t+1} \mid S_t]$ means the expectation of all action rewards in the state $S_t$.

The transition from the state $S_t$ to the state $S_{t+1}$ is triggered by action $A_t$. The probability that action $A_t$ is selected under the state $S_t$ is defined as $p(A_t \mid S_t)$. If the action selection is based on the policy $\pi$, the probability of action selection can be written as $\pi(A_t \mid S_t)$. The state $S_t$ has an influence on all subsequent states, but the farther from $S_t$, the smaller the influence. The reward obtained by $S_t$ in the subsequent states also has this characteristic. Therefore, the cumulative reward $G(S_t)$ of each path can be defined as
$$G(S_t) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots,$$
where $\gamma$ is the discount factor. The state value $v(S_t)$ is expressed by the expectation of the cumulative reward obtained by each path from $S_t$ based on policy $\pi$. The action value of action $A_t$ under the state $S_t$ based on policy $\pi$ is defined as $q_\pi(S_t, A_t)$. Therefore, $v(S_t)$ can be written as the weighted sum of action values. That is
$$v_\pi(S_t) = \sum_{A_t \in \psi} \pi(A_t \mid S_t)\, q_\pi(S_t, A_t), \tag{46}$$
where $\psi$ is the action set of state $S_t$. The relation between $q_\pi(S_t, A_t)$ and $v(S_{t+1})$ is
$$q_\pi(S_t, A_t) = R_t + \gamma \sum_{S_{t+1} \in \Lambda} p(S_{t+1} \mid S_t)\, v_\pi(S_{t+1}), \tag{47}$$
where $\Lambda$ is the state set of state $S_{t+1}$.
Both (46) and (47) are Bellman equations, and the value function can be calculated iteratively through dynamic programming. If the policy π is determined, the transition probability p(S_{t+1}|S_t) and the action selection probability π(A_t|S_t) are known. Therefore, we only need to optimize the value function to increase the cumulative reward, and the TNEP task can be solved according to the optimal state sequence.
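The discounted cumulative reward underlying (46) and (47) can be computed with a simple backward recursion (a sketch with illustrative names):

```python
def discounted_return(rewards, gamma):
    """G(S_t) = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...,
    evaluated backwards so each step is g = r + gamma * g."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The backward form is exactly the Bellman recursion applied along one path, which is why rewards far from $S_t$ contribute less, as noted above.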

Multi-Agent DDQN Structure
DDQN is a value-based deep reinforcement learning method. Its crucial objective is to construct and train an accurate value function for the value prediction of state-action pairs. The DDQN agent takes actions based on the ε-greedy strategy and the value function to change the task state. Meanwhile, it modifies the value function through the reward of the task state changes. The principle diagram of DDQN is shown in Figure 3.

DDQN contains two value functions, $Q_{eval}$ and $Q_{target}$, with the same initial parameters, based on deep neural networks. This paper uses the TensorFlow platform to build the deep neural networks, and the parameters of DDQN are shown in Table 1. $Q_{eval}$ is used to select the optimal value action. Through the value prediction of the action $A_t$-state $S_t$ pair, the best action $A^{max}(S_t; \omega_{eval})$ is
$$A^{max}(S_t; \omega_{eval}) = \arg\max_{A} Q_{eval}(S_t, A; \omega_{eval}),$$
where $D(a; b)$ denotes the variable $D$ with respect to the variable $a$ and the parameter $b$; $\omega_{eval}$ is the parameters of the deep neural network $Q_{eval}$. $Q_{target}$ is used to predict the best action value. The best action value $Q^{max}(S_t; \omega_{target})$ is
$$Q^{max}(S_t; \omega_{target}) = Q_{target}(S_t, A^{max}(S_t; \omega_{eval}); \omega_{target}),$$
where $\omega_{target}$ is the parameters of the deep neural network $Q_{target}$. According to (46), the target value $q(S_t, A_t)$ of the action $A_t$-state $S_t$ pair can be obtained as
$$q(S_t, A_t) = R_{t+1} + \gamma\, Q^{max}(S_{t+1}; \omega_{target}).$$
The value function loss $Q_{loss}$ of $Q_{eval}$ is
$$Q_{loss} = \left[ q(S_t, A_t) - Q_{eval}(S_t, A_t; \omega_{eval}) \right]^2,$$
and after each action is executed, the $Q_{eval}$ update regulation is
$$Q_{eval}^{t+1} = Q_{eval}^{t} + \alpha \left[ q(S_t, A_t) - Q_{eval}^{t}(S_t, A_t; \omega_{eval}) \right],$$
where $\alpha$ is the learning rate of DDQN; $Q_{eval}^{t}$ and $Q_{eval}^{t+1}$ are the $Q_{eval}$ states at the $t$-th time and the $(t+1)$-th time, and the parameters $\omega_{eval}$ of $Q_{eval}$ are copied into $Q_{target}$ periodically. This delayed update mechanism ensures stable iteration of the parameters.

Based on the bi-level model of TNEP, we build two sets of DDQN agents. One set is used to search for the economical transmission network; the upper-level model reward $R_{upper}$ is determined by the upper-level objective relative to a baseline, where $V_{base,upper}$ is the reward baseline of the upper-level agent. The other set is used to optimize the wind curtailment, load shedding, and improved electrical betweenness of the transmission network; the lower-level model reward $R_{lower}$ is determined by the lower-level objective relative to a baseline, where $V_{base,lower}$ is the reward baseline of the lower-level agent.
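The decoupling of action selection ($Q_{eval}$) from action evaluation ($Q_{target}$) is the core of the DDQN target; a minimal sketch over one transition (array values are made up):

```python
import numpy as np

def ddqn_target(reward, q_eval_next, q_target_next, gamma):
    """q(S_t, A_t): Q_eval picks the argmax action for S_{t+1},
    Q_target supplies that action's value."""
    a_max = int(np.argmax(q_eval_next))       # A_max(S_{t+1}; w_eval)
    return reward + gamma * q_target_next[a_max]

# Q_eval prefers action 1, so Q_target's value for action 1 is used.
q = ddqn_target(2.0, np.array([0.1, 0.9]), np.array([1.0, 0.5]), gamma=0.5)
print(q)  # 2.25
```

Note that a plain DQN would instead take max(q_target_next) = 1.0 here; using the action chosen by $Q_{eval}$ is what reduces the overestimation bias.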
In the optimization of the bi-level model, we stipulate that the upper-level agent stores the top three economical lines to form a Pareto set. The constitution rule of the upper-level solution set $Pareto\{A_{t,upper}\}$ is given in (55), where $Prob$ is the random probability of the ε-greedy strategy. The optimization scope of the lower-level DDQN agent should be within (55), and the lower-level action $A_{t,lower}$ is chosen from this set accordingly. When each selected action $A_{t,lower}$ is executed, $Q_{eval,upper}$ and $Q_{eval,lower}$ are updated based on (56). The multi-agent DDQN flow chart of the TNEP task is shown in Figure 4, where $Q_{eval,lower}$ and $Q_{eval,upper}$ are used to select the optimal value actions of the lower-level agent and the upper-level agent, respectively; $Q_{target,lower}$ and $Q_{target,upper}$ are used to predict the best action values of the lower-level agent and the upper-level agent, respectively; $R_{t+1,upper}$ and $R_{t+1,lower}$ are the rewards of state $S_{t+1}$ of the upper-level agent and the lower-level agent, respectively.
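The ε-greedy constitution of the upper-level Pareto set can be sketched as follows (the top-3 rule follows the text; function names and the RNG handling are our own):

```python
import random

def upper_level_pareto(q_values, epsilon, rng):
    """Keep the three highest-value candidate lines with probability
    1 - epsilon; otherwise explore with three random candidates."""
    lines = list(range(len(q_values)))
    if rng.random() < epsilon:          # Prob < epsilon: exploration branch
        return rng.sample(lines, 3)
    return sorted(lines, key=lambda l: q_values[l], reverse=True)[:3]

rng = random.Random(0)
pareto = upper_level_pareto([0.2, 0.9, 0.1, 0.7, 0.8], epsilon=0.0, rng=rng)
print(sorted(pareto))  # [1, 3, 4] - the three most economical candidates
```

The lower-level agent then restricts its own action choice to this three-line set, which is how the upper level bounds the lower level's search space.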

Simulation and Verification
In this section, we apply the multi-agent DDQN to solve the multi-scenario TNEP tasks of a system with high penetration of RES. The planning scheme and solution process of the multi-agent DDQN are compared with those of DQN, particle swarm optimization (PSO), and branch-and-bound (B&B) on the modified IEEE RTS-24 bus system and the modified New England 39-bus system.

Modified IEEE RTS-24 Bus System with High-Penetration RES
The IEEE RTS-24 bus system is widely used to evaluate the performance of planning algorithms. This model contains 24 generator or load buses. The initial network consists of 38 lines with two rated voltages, the north area is 220 kV and the south area is 110 kV. The load model contains 17 buses with a maximum of 2850 MW. The generation model contains 32 generator sets, and the range of the output is 12-400 MW.

Based on the IEEE RTS-24 bus system, a modified IEEE RTS-24 bus system is constructed, in which the types of some generator sets and loads are changed so that the system exhibits the characteristics of renewable energy and variable load. The changes are listed in Table 2, and the distribution of generator sets and loads is shown in Figure 5. As Figure 5 shows, the modified system contains 54.1% renewable energy and 79.6% variable load, which simulates the high-penetration renewable energy and variable load characteristics of an RES system. We then use the improved K-means algorithm to extract the typical scenarios. Its performance depends not only on the setting of the clustering objective function but also on the value of K.
We apply the improved K-means algorithm to cluster the variable wind power and load operation data of the HRP-38 test system in reference [31], and determine the best K value from the SSE curve using the elbow method. The results are shown in Figure 6. The SSEs of wind power and variable load decrease quickly until K = 4; if K increases further, the change rate of the SSE drops, so K = 4 can be regarded as the elbow point. Therefore, this paper adopts K = 4. The clustering results are shown in Figure 7.
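The elbow procedure above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the synthetic profiles, the `profile_features` map (appending cumulative energy and mean change rate to each profile, mirroring the two terms of the modified clustering objective), and all parameter values are assumptions for demonstration.

```python
import numpy as np

def kmeans_sse(data, k, n_iter=30, n_restart=5):
    """Plain K-means; returns the best (lowest) SSE over several restarts."""
    best = np.inf
    for seed in range(n_restart):
        rng = np.random.default_rng(seed)
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each profile to its nearest center.
            d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            for j in range(k):  # recompute centers, skip empty clusters
                if np.any(labels == j):
                    centers[j] = data[labels == j].mean(axis=0)
        best = min(best, float(((data - centers[labels]) ** 2).sum()))
    return best

def profile_features(profiles):
    """Append cumulative energy and mean change rate to each daily profile,
    reflecting the two terms of the modified clustering objective."""
    cumulation = profiles.sum(axis=1, keepdims=True)
    change_rate = np.abs(np.diff(profiles, axis=1)).mean(axis=1, keepdims=True)
    return np.hstack([profiles, cumulation, change_rate])

# Synthetic wind profiles drawn around four distinct modes (per-unit output).
rng = np.random.default_rng(1)
modes = [np.full(24, 0.2), np.full(24, 0.8),
         np.linspace(0.1, 0.9, 24), np.linspace(0.9, 0.1, 24)]
data = profile_features(np.vstack([m + rng.normal(0, 0.03, (50, 24))
                                   for m in modes]))
sse = {k: kmeans_sse(data, k) for k in range(1, 8)}
# The SSE drop flattens after K = 4, the elbow point for four true modes.
```

Because the data here are generated from four well-separated modes, the SSE falls sharply up to K = 4 and only marginally afterwards, which is exactly the elbow pattern read off Figure 6.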
As shown in Figure 7b, the cumulative energy of load modes 1, 3, and 4 differs, and the fluctuation characteristics of mode 2 are more distinctive. The improved K-means algorithm thus compresses the wind power and variable load operation data, which greatly improves the efficiency of solving the TNEP task.
The scenario generation rule is to randomly select the wind farm and load status in each hour from the extracted modes as the system status, and then generate 384 typical scenarios. Some typical scenarios are listed in Table 3. They cover both normal and extreme operating conditions. This data-driven scenario generation method reduces computational complexity while preserving the uncertain characteristics.
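The generation rule can be expressed compactly. The mode profiles below are hypothetical placeholders (the paper draws them from the K = 4 clustering results); only the hour-by-hour random mode selection reflects the rule stated above.

```python
import numpy as np

def generate_scenarios(wind_modes, load_modes, n_scenarios, seed=0):
    """For each scenario and each hour, independently draw one wind mode and
    one load mode and take their hourly values as the system status."""
    rng = np.random.default_rng(seed)
    hours = wind_modes.shape[1]
    out = np.empty((n_scenarios, 2, hours))
    for s in range(n_scenarios):
        w = rng.integers(len(wind_modes), size=hours)   # wind mode per hour
        l = rng.integers(len(load_modes), size=hours)   # load mode per hour
        out[s, 0] = wind_modes[w, np.arange(hours)]
        out[s, 1] = load_modes[l, np.arange(hours)]
    return out

# Four hypothetical extracted modes (per-unit) standing in for Figure 7.
wind_modes = np.array([np.full(24, 0.2), np.full(24, 0.8),
                       np.linspace(0.1, 0.9, 24), np.linspace(0.9, 0.1, 24)])
base = 0.5 + 0.4 * np.sin(np.linspace(0, np.pi, 24))
load_modes = np.array([s * base for s in (0.85, 0.95, 1.05, 1.15)])
scenarios = generate_scenarios(wind_modes, load_modes, n_scenarios=384)
# scenarios.shape == (384, 2, 24): 384 typical wind/load day profiles
```

Because every hour is sampled independently from the clustered modes, rare combinations (e.g., low wind with peak load in the same hour) appear among the 384 scenarios, which is how the extreme conditions mentioned above are covered.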

TNEP for Multi-Level Renewable Energy Penetration Scenarios in Modified IEEE RTS-24 Bus System
All the programs are developed using TensorFlow 1.14 and Python 3.7. The system configuration is an i9-9900K at 3.6 GHz, 32 GB of memory, and a 2080 Ti graphics card. DQN and PSO are used for comparison, and the parameters of the multi-agent DDQN and DQN are listed in Table 4. The TNEP schemes of the four methods are shown in Tables 5-8, and the transmission network structures are shown in Figure 8.
Tables 5-8 show that all four methods optimize the stability and economy of the transmission network by constructing lines. The deep reinforcement learning based methods contain line sequence information, whereas the PSO method only optimizes the structure of the complete transmission network. Among the four schemes, the scheme obtained by multi-agent DDQN has the lowest construction cost at USD 6.95 M. Because the scheme obtained by DQN includes line 9-12, which contains a set of transformers, its construction cost is the highest at USD 10.58 M. In addition, all four schemes change the system structure and the distribution of power flow through new line construction, which decreases the network loss cost; the objective function of the upper-level model is to minimize the comprehensive cost.
The improved electrical betweenness measures system stability by calculating the power flow balance of each line. The scheme obtained by multi-agent DDQN reduces the improved electrical betweenness from 0.007154 to 0.004457, yielding a lower probability of cascading failures than the other three methods. In addition, a system with high penetration of RES needs to ensure as little wind curtailment and load shedding as possible under uncertain scenarios. Both multi-agent DDQN and DQN improve the power support capability by building a tighter transmission network structure.
However, the scheme obtained by PSO over-searches for a structure with better comprehensive cost performance and neglects the optimization of the improved electrical betweenness, wind curtailment, and load shedding. This is because heuristic learning-based methods easily fall into local optima when solving the TNEP task, which is a complex non-convex non-deterministic polynomial (NP) problem. The scheme obtained by B&B still shows a certain gap with the schemes obtained by the other three methods. The deep reinforcement learning based methods coordinate the optimization of the lower-level model indicators well, which avoids the impact of the local optima problem to a certain extent and confirms the superiority of this type of method in this task.
Although the sums of wind curtailment and load shedding are nearly equal for the schemes obtained by DQN and multi-agent DDQN, the improved electrical betweenness of multi-agent DDQN is significantly better. The dual DDQN agent structure improves the optimization capability of each layer, thus forming a better solution method for the TNEP task. Figure 9 shows the comprehensive cost changes of the schemes obtained by multi-agent DDQN and DQN. The construction costs of lines 13-15, 9-12, and 7-9 in the DQN scheme are relatively high, which makes its construction cost rise rapidly. The scheme obtained by multi-agent DDQN chooses lines with lower construction cost, so the investment during scheme implementation can proceed more smoothly. Moreover, the sum of the network loss cost and the operation and maintenance cost of the scheme obtained by multi-agent DDQN decreases faster, which makes the transition process more economical.
Figure 10 shows the distribution of line power flow for the two methods. The initial power flow is quite uneven: the initial network contains three lines carrying more than 300 MW and two lines carrying more than 400 MW. The scheme obtained by multi-agent DDQN transfers part of the power flow to underloaded lines, which improves their utilization rate and reduces the probability of cascading failures caused by overloaded lines. For the lines with power greater than 200 MW in Figure 10b, the multi-agent DDQN controls the line power flow close to 250 MW. The scheme obtained by DQN reduces most of the line power flows to below 250 MW, but it contains two lines that carry much more than 250 MW. For the lines with power lower than 200 MW, multi-agent DDQN optimizes the power flow distribution in the range of 150~200 MW better than the DQN method.
DQN is better at increasing the power flow of underloaded lines, while multi-agent DDQN tends to limit the power flow of overloaded lines. Since the lines with high power flow often determine the stability of the system, the scheme obtained by multi-agent DDQN provides higher system stability.
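The flow-range comparison read off Figure 10 amounts to binning line flows. The flow values below are purely hypothetical placeholders chosen to reproduce the qualitative pattern described in the text (five initially overloaded lines, none after expansion); only the binning logic is the point.

```python
import numpy as np

# Hypothetical line flows (MW) before and after expansion, for illustration.
initial = np.array([420, 410, 330, 310, 305, 260, 240, 180, 150, 120, 90, 60])
planned = np.array([255, 250, 245, 240, 230, 225, 200, 190, 185, 175, 170, 160])

def flow_histogram(flows, edges=(0, 150, 200, 250, 300, 1000)):
    """Count lines per flow range, mirroring the Figure 10 style comparison."""
    counts, _ = np.histogram(flows, bins=edges)
    return dict(zip([f"{a}-{b} MW" for a, b in zip(edges, edges[1:])], counts))

before, after = flow_histogram(initial), flow_histogram(planned)
```

With these placeholder numbers, the heavily loaded bins empty out after expansion while the 150~200 MW and 200~250 MW bins fill, matching the flow-balancing behaviour attributed to the multi-agent DDQN scheme.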
Figure 11 shows the changes in wind curtailment and load shedding during the construction of the two schemes. The initial network structure sheds a mass of load under uncertain scenarios. When the output of the wind farms is reduced, the regional power balance ability is weakened, and power support from generators and wind farms in the other regions of the system is needed to cover the regional power shortage. However, when congestion occurs in the transmission lines connected to the power shortage area, this system-wide power support is difficult to achieve, which forces load shedding. Therefore, it is necessary to build new lines to eliminate the congestion. The schemes obtained by multi-agent DDQN and DQN both reduce the load shedding to the same level through new line construction, with multi-agent DDQN decreasing slightly faster. The wind curtailment of the multi-agent DDQN scheme is lower than that of the DQN scheme, under which more wind power is curtailed during the construction process.
Multi-agent DDQN not only controls the sum of wind curtailment and load shedding as well as DQN, but also obtains a scheme with high power flow balance. This demonstrates that the DDQN agent set built on the lower-level model performs better than the DQN agent for lower-level optimization. The dual DDQN agent structure thus realizes hierarchical prediction of line value: one agent set searches for lines with high economy, and the other searches for lines that can improve the renewable energy consumption capacity and the stability of the system. This structure improves the accuracy of line value prediction and contributes to forming a better TNEP scheme. Figures 12 and 13 show the indicators (such as wind curtailment and load shedding) of the 7000 network structures constructed by the multi-agent DDQN agent over 200 training episodes. Before 1000 steps, the distribution of the indicators is concentrated in poor regions, or only some indicators of a network perform well, because the multi-agent DDQN does not yet have enough data for neural network training and its value prediction carries a large error. Between 1000 and 2500 steps, the indicators of the transmission network gradually improve: the neural network receives sufficient training, and the prediction error of the line construction value gradually decreases. After 2500 steps, the indicators are distributed between the best and the worst. This is because the multi-agent DDQN adopts an ε-greedy strategy, which lets the agent choose a line randomly, without using the prediction of the neural network, with a certain probability. This training mechanism prevents the local optima problem and yields a more accurate value function.
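The ε-greedy selection described above can be sketched as follows. The Q-values, the mask of already-built lines, and the helper name `epsilon_greedy_line` are illustrative assumptions; in the paper the action space is the set of candidate corridors and the Q-values come from the trained networks.

```python
import numpy as np

def epsilon_greedy_line(q_values, built_mask, epsilon, rng):
    """With probability epsilon, pick a random candidate line (exploration);
    otherwise pick the line with the highest predicted construction value
    (exploitation). Lines already built are excluded in both branches."""
    candidates = np.flatnonzero(~built_mask)
    if rng.random() < epsilon:
        return int(rng.choice(candidates))     # random exploratory choice
    q = q_values.copy()
    q[built_mask] = -np.inf                    # never rebuild an existing line
    return int(q.argmax())

rng = np.random.default_rng(0)
q = np.array([1.2, 3.4, 0.7, 2.9])            # predicted line values
mask = np.array([False, True, False, False])  # line 1 is already built
greedy = epsilon_greedy_line(q, mask, epsilon=0.0, rng=rng)  # picks line 3
```

With ε = 0 the agent always takes line 3 (the best unbuilt line, since line 1 is masked); with ε > 0 it occasionally departs from the network's prediction, which is the mechanism credited above with escaping local optima.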
Figure 14 shows the sum of value prediction of multi-agent DDQN and DQN.
The value prediction represents the agent's estimate of the construction value of each line, and the sum of the value predictions of the multi-agent DDQN is lower than that of the DQN. This is because the multi-agent DDQN agent adopts a dual neural network structure, which makes the selection of the optimal line A_max independent of the value prediction Q_max of that line. This structure avoids the influence of accidental overestimation to a certain extent and improves the accuracy of line value prediction, which enhances the ability of multi-agent DDQN to solve the TNEP task.
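The decoupling of A_max and Q_max is the standard Double DQN target. As a minimal numeric sketch (the toy Q-values are assumptions), it shows why DDQN's value estimates stay lower when the target network accidentally inflates one action:

```python
import numpy as np

def dqn_target(r, gamma, q_next_target):
    """Vanilla DQN: the target network both selects and evaluates the next
    action, so a noisy overestimate is chosen and propagated."""
    return r + gamma * q_next_target.max()

def ddqn_target(r, gamma, q_next_online, q_next_target):
    """Double DQN: the online network selects the best next action (A_max),
    the target network evaluates it (Q_max), decoupling the two roles."""
    a_max = int(q_next_online.argmax())
    return r + gamma * q_next_target[a_max]

# Toy case: the target network accidentally overestimates action 2.
q_online = np.array([1.0, 2.0, 1.5])
q_target = np.array([1.1, 1.8, 9.0])
y_dqn = dqn_target(0.5, 0.9, q_target)                # driven by the outlier
y_ddqn = ddqn_target(0.5, 0.9, q_online, q_target)    # evaluates action 1
```

Here DQN's target is inflated by the spurious 9.0, while DDQN evaluates the action the online network actually prefers, giving the lower and more accurate value prediction reported for the multi-agent DDQN in Figure 14.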

TNEP under Unavoidable Interference in Modified IEEE RTS-24 Bus System
During the implementation of a TNEP scheme, unavoidable interference or unconsidered factors may make a certain line impossible to construct. When this happens, the heuristic learning-based method requires retraining because the planning conditions change. In contrast, the experience obtained by the reinforcement learning based methods is the judgment of each line's construction value, which is not affected by changes in conditions; therefore, multi-agent DDQN can solve new TNEP tasks without redundant training. We change the fourth line of the scheme obtained by multi-agent DDQN from 20-22 to line 5-6 to simulate a TNEP task under unavoidable interference. Similarly, the fourth line of the scheme obtained by DQN is changed from 9-12 to line 13-14. The schemes obtained by multi-agent DDQN and DQN are shown in Tables 9 and 10. After the interference, the improved electrical betweenness of the scheme obtained by multi-agent DDQN immediately deteriorates, and the wind curtailment and load shedding are hardly improved. This shows that line 5-6 destroys the system power flow balance, so this scheme is greatly affected by the interference. On the contrary, although the improved electrical betweenness of the scheme obtained by DQN deteriorates slightly after the interference, line 13-14 reduces the wind curtailment and load shedding. The continued construction of multi-agent DDQN after the unfavorable interference is more positive: the multi-agent DDQN agent reuses its training experience to judge the performance of the current transmission network structure and the construction value of each line, and then selects three high-value lines to form the new scheme. Although the new scheme obtained by multi-agent DDQN increases the comprehensive cost under the unfavorable interference, it still performs well in the improved electrical betweenness, wind curtailment, and load shedding.
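The experience-reuse step can be sketched as a replanning routine: the trained value predictions are kept, the blocked line is excluded, and the plan is refilled greedily. The corridor indices, Q-values, and the helper name `replan` are illustrative assumptions, not the paper's data.

```python
import numpy as np

def replan(q_values, planned, blocked, n_lines):
    """Repair a plan when one line becomes unavailable: drop the blocked
    line, then greedily refill the plan from the remaining highest-value
    candidates using the already-trained value predictions (no retraining)."""
    keep = [l for l in planned if l != blocked]
    q = q_values.copy()
    q[keep + [blocked]] = -np.inf      # exclude kept and blocked lines
    while len(keep) < n_lines:
        best = int(q.argmax())         # next most valuable candidate line
        keep.append(best)
        q[best] = -np.inf
    return keep

# Hypothetical predicted construction values for 8 candidate corridors.
q = np.array([0.2, 0.9, 0.4, 0.8, 0.1, 0.7, 0.3, 0.6])
plan = replan(q, planned=[1, 3, 5], blocked=5, n_lines=3)
```

When corridor 5 becomes unavailable, the routine keeps corridors 1 and 3 and substitutes the next-best candidate (corridor 7), which is the pattern of behaviour described above: the agent's stored line-value judgments are reused directly instead of retraining from scratch.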
The comprehensive cost of the new scheme obtained by DQN is USD 0.45 M higher than that of the new scheme obtained by multi-agent DDQN. In addition, the improved electrical betweenness of the new scheme obtained by DQN is also larger than that of the new scheme obtained by multi-agent DDQN, which means the new DQN scheme provides lower system reliability. Even the excellent control of wind curtailment and load shedding in the original DQN scheme is weakened. In summary, both methods can complete the planning task by reusing training experience under unavoidable interference, but the value prediction of multi-agent DDQN is more accurate, which makes it better suited to such TNEP tasks.

TNEP in Modified New England 39-Bus System
This article extends the application to the more complex modified New England 39-bus system to further evaluate the performance of the proposed method. Consistent with the changes in the modified IEEE 24-bus system, we increase the load to 1.1 times, the capacity of the conventional generator sets to 1.2 times, and the capacity of the wind farms to 1.4 times their original values. The node settings of the wind farms and variable loads are shown in Table 11, the schemes of the four methods are shown in Tables 12-15, and the network structure is shown in Figure 15. The generator sets of the modified New England 39-bus system are located at the edge of the system, while the load nodes are evenly distributed.