1. Introduction
With the progress of the green-energy, low-carbon transformation, a large amount of distributed generation (DG) requires wide access to the distribution network. The future Active Distribution Network (ADN) is expected to transition from the traditional unidirectional radial configuration to a bidirectional active network, and its control mode will shift from centralized dispatch to hierarchical and clustered control, constituting a hybrid AC–DC distribution network with medium- and low-voltage interaction and multi-station interconnection. However, the strong randomness and uncertainty of DG output, typically represented by photovoltaic (PV) generation, lead to prominent problems such as voltage violations and power imbalance in the distribution network, highlighting the challenge of renewable energy consumption [1]. Therefore, fully utilizing adjustable resources to implement autonomous operation at the distribution-area level and collaborative optimization control across multiple distribution areas has become one of the effective means of addressing these issues.
Currently, optimal regulation methods for ADNs can be classified into two main categories: model-driven [2,3] and data-driven approaches [4]. The model-driven category typically employs mathematical programming methods and heuristic algorithms to formulate the optimal regulation of the ADN as a nonlinear optimization problem that can be solved explicitly. However, the limited coverage of communication facilities in medium- and low-voltage distribution networks [5], along with frequent changes in line-switching states, results in issues such as incomplete real-time measurement data [6] and fluctuations in network topology. Consequently, traditional model-driven approaches, which rely on comprehensive measurement data, are becoming less applicable.
In recent years, as power grids have undergone intelligent upgrades, data-driven methods, particularly Deep Reinforcement Learning (DRL), have emerged as innovative solutions for the real-time optimization and regulation of distribution networks [6,7,8]. These approaches enhance decision-making in dynamic environments by leveraging large datasets and adaptive learning techniques.
DRL is essentially an interaction-based learning framework that models control resources as intelligent agents within the distribution network environment. Through continuous interaction, the system learns from the large amounts of operational data generated, and neural networks are used to efficiently develop regulation strategies from the acquired data. As the number of distribution network nodes and the penetration of PV systems increase, the expansion of control equipment results in a growing number of agents, which can lead to suboptimal synergy among them [9]. Additionally, node voltage violations tend to occur in only a limited number of regions, so it is unnecessary to mobilize all resources to address localized voltage fluctuations. To tackle these challenges, existing research has explored collaborative optimization of distribution network zones through clustering techniques [10,11], aiming to promote the consumption of DG resources in close proximity. For instance, [12] proposes a dynamic clustering method based on multi-scenario hierarchical clustering, which segments the distribution network into clusters; intelligent agents are deployed and trained within each region, facilitating zonal in situ optimization and improving overall system performance.
However, the aforementioned DRL methods primarily focus on fixed network topologies and do not account for the dynamic changes that occur in actual distribution networks due to load transfers and the opening and closing of switches. These changes inevitably affect the cluster division and the dynamic assignment of regulation units to clusters, i.e., which regulation units belong to the same cluster and share regional information. In traditional DRL, the input dimension of each agent's network is fixed, since it consists of the power flow data of all agents within the same cluster at the distribution network nodes. When the cluster changes, the input dimension of the agent's network changes as well, rendering traditional methods unsuitable. To address this issue, researchers have concentrated on two main areas: policy transfer learning and topological feature learning.
In the realm of policy transfer learning, [13] utilized the migration capability of neural networks to transfer the parameters of the policy network from an agent model pre-trained under a specific topological structure. However, this policy transfer method acts merely as a remedial measure after a topology change has occurred, and it assumes that the state space of the agent remains unchanged, which introduces certain constraints.
On the other hand, topological feature learning approaches, as discussed in [14,15,16,17], characterize the regional topology of the distribution network using adjacency matrices or one-dimensional vectors; neural networks then learn the numerical variations of these matrices or vectors to identify anomalies in the network topology. Nevertheless, constructing these matrices or vectors still requires a global view of the distribution network's topology, or at least of the entire cluster, which places high demands on the quality of the real-time communication infrastructure.
In summary, this paper presents a data-driven cooperative optimal regulation method designed for multiple area clusters, aimed at enhancing the adaptability of algorithms to changes in distribution network topology. The approach begins by segmenting the distribution network into clusters based on electrical distance modularity and power balance degree. This segmentation allows for the decomposition of the overall distribution network regulation problem into a series of intra-cluster autonomy issues.
Next, the optimization problem for the distribution network is modeled using sequential decision-making principles grounded in the Markov Decision Process (MDP). To address the dynamic nature of cluster boundaries resulting from topology changes, the paper introduces an attention-based Critic network. This network is designed to accommodate dynamic observations, enabling the inputs to the Critic network to be scalable in response to changing conditions.
Finally, the effectiveness and superiority of the proposed algorithms are rigorously evaluated and compared using the IEEE 33-bus test system.
2. Dynamic Clustering of Distribution Networks
The cluster segmentation metric serves as a foundational element for the aggregation of distribution areas, ensuring that clusters are formed in a way that promotes high electrical coupling within the clusters while maintaining approximate decoupling between them. In this paper, two primary indices are utilized for cluster division: electrical distance modularity and power balance.
Electrical Distance Modularity Metric: The modularity is used to measure the strength of the community structure of a network; the larger the modularity value, the better the result of the cluster division [18]. The modularity function is defined as

ρ = (1/(2s)) Σ_{i,j} [A_ij - (k_i k_j)/(2s)] δ(c_i, c_j)

where A_ij is the element of the adjacency matrix in the modularity metric; k_i is the sum of the weights of all edges connected to node i; s is the sum of the weights of all edges in the network; c_i denotes the cluster where station i is located; δ(c_i, c_j) is a 0–1 function, and δ(c_i, c_j) = 1 means that node i and node j are classified into the same cluster. The formula for the electrical-distance-based weight A_ij is given in [19].
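For illustration, the modularity of a given partition can be evaluated with the short Python sketch below; it assumes a dense NumPy weight matrix built from the electrical-distance-based edge weights and an integer cluster label per node (function and variable names are illustrative, not taken from the paper).

```python
import numpy as np

def modularity(A: np.ndarray, labels: np.ndarray) -> float:
    """Electrical-distance modularity of a partition.

    A      -- symmetric weight matrix (edge weights from electrical distance)
    labels -- labels[i] is the cluster index of node i
    """
    s = A.sum() / 2.0                               # total weight of all edges
    k = A.sum(axis=1)                               # weighted degree of each node
    same = labels[:, None] == labels[None, :]       # delta(c_i, c_j)
    expected = np.outer(k, k) / (2.0 * s)           # null-model term k_i k_j / (2s)
    return float(((A - expected) * same).sum() / (2.0 * s))

# toy example: two obvious communities
A = np.array([[0, 1, 1, 0.0],
              [1, 0, 1, 0.0],
              [1, 1, 0, 0.1],
              [0, 0, 0.1, 0]], dtype=float)
print(modularity(A, np.array([0, 0, 0, 1])))
```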
Power Balance Degree Metric: In order to achieve full DG consumption and reduce the power flowing out of the cluster, a power balance degree metric e describing the balance between power supply and demand within the cluster is defined as

e = (φ_P + φ_Q)/2

where φ_Q is the total reactive power balance index of the system and φ_P is the total active power balance index of the system. The active power balance index is used to describe the balance between active power demand and supply within the cluster and can be expressed as

φ_P = (1/c) Σ_{s=1}^{c} [1 - (1/T) Σ_{t=1}^{T} |P_s(t)| / max_t |P_s(t)|]

where P_s(t) is the net power characteristic of cluster s at time t, T is the duration of the typical time-varying scenario, and c is the number of clusters. The reactive power balance index is given by

φ_{Q,s} = min(Q_s^max / Q_s^dem, 1),  φ_Q = (1/c) Σ_{s=1}^{c} φ_{Q,s}

where Q_s^max and Q_s^dem are the maximum reactive power output and reactive power demand of cluster s, respectively, and φ_{Q,s} is the reactive power balance degree of cluster s. Combining the modularity metric and the power balance metric, the cluster composite division metric is expressed as

γ = a1 ρ + a2 e

where a1 and a2 are the weighting coefficients of the two cluster indicators.
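Purely as a numerical illustration, the sketch below evaluates the balance indices in the reconstructed form given above and combines them with the modularity; the exact normalization and weighting used in the paper may differ, and all array and function names are assumptions.

```python
import numpy as np

def power_balance_degree(P_net, Q_sup, Q_dem):
    """Composite power balance degree e of a cluster division.

    P_net -- array of shape (c, T): net active power of each cluster over time
    Q_sup -- array of shape (c,):   maximum reactive power output of each cluster
    Q_dem -- array of shape (c,):   reactive power demand of each cluster
    """
    peak = np.abs(P_net).max(axis=1)
    peak = np.where(peak == 0, 1.0, peak)                      # avoid division by zero
    phi_P = np.mean(1.0 - np.abs(P_net).mean(axis=1) / peak)   # active power balance index
    phi_Q = np.mean(np.minimum(Q_sup / np.maximum(Q_dem, 1e-9), 1.0))  # reactive index
    return 0.5 * (phi_P + phi_Q)

def composite_index(rho, e, a1=0.5, a2=0.5):
    """Cluster composite division metric: weighted sum of modularity and balance degree."""
    return a1 * rho + a2 * e
```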
The distribution network is clustered into multiple area clusters by the Smart Local Moving (SLM) algorithm, which iteratively optimizes the cluster configuration to maximize modularity. Initially, each station area within the distribution network is treated as a distinct cluster. Subsequently, a sequential assignment process is employed, wherein each station area is evaluated for potential reassignment to a cluster containing other substation nodes. The modularity metric is calculated before and after each potential reassignment, and the node exhibiting the largest modularity increment is recorded. This process is repeated until the modularity converges, indicating a local optimum. The resulting clusters are then aggregated to form sub-networks, which are subsequently treated as nodes in a higher-level clustering iteration. This hierarchical clustering process continues until the modularity ceases to increase, thereby yielding an optimal division of the distribution network into multiple area clusters.
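The local-moving phase of this procedure can be sketched as follows; this is a simplified, single-level illustration of the modularity-maximizing node moves rather than a full SLM implementation, and it reuses the hypothetical modularity helper from the earlier sketch.

```python
import numpy as np

def local_moving(A, labels, modularity_fn, max_rounds=20):
    """One level of SLM-style local moving: repeatedly move single nodes to the
    neighbouring cluster that yields the largest modularity gain."""
    n = len(labels)
    for _ in range(max_rounds):
        improved = False
        for i in range(n):
            base = modularity_fn(A, labels)
            best_gain, best_label = 0.0, labels[i]
            neighbour_labels = {labels[j] for j in range(n) if A[i, j] > 0 and j != i}
            for cand in neighbour_labels:
                if cand == labels[i]:
                    continue
                old = labels[i]
                labels[i] = cand                          # tentative reassignment
                gain = modularity_fn(A, labels) - base
                if gain > best_gain:
                    best_gain, best_label = gain, cand
                labels[i] = old                           # undo the tentative move
            if best_label != labels[i]:
                labels[i] = best_label                    # keep the best move
                improved = True
        if not improved:                                  # local optimum reached
            break
    return labels
```

In the full procedure, the resulting clusters would then be aggregated into super-nodes and the same moves repeated on the coarsened network until the modularity no longer increases.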
4. Optimal Model Solving Based on Deep Reinforcement Learning
4.1. Dynamic Clustering Markov Decision Process for Multiple Distribution Areas
The co-optimization of multiple areas within a distribution network involves decision-making that is informed by both the current and prior operating states, making it a sequential decision-making problem. Given the weak electrical coupling between clusters after the clustering process, autonomous regulation is facilitated within each cluster. Consequently, this paper employs a multi-agent Markov decision model to model the clusters of the various station areas, treating each station area as an individual agent and the distribution network as the training environment. The principal components of the model are delineated as follows:
The observation space is defined as the set of observational variables that inform the decision-making of the distribution area agent i, including the power and voltage measurements at the grid-connected node of the distribution area:

o_{i,t} = [P_{i,t}, Q_{i,t}, V_{i,t}]

where P_{i,t} and Q_{i,t} are the active and reactive power of the distribution network node connected to station area i at time t, and V_{i,t} is the voltage magnitude of that node.
The observations of the agents of station areas belonging to the same cluster constitute the environmental state of the distribution network, and the state space is represented as

s_{m,t} = { o_{i,t} | i ∈ Ω_m }

where Ω_m is the set of all station nodes in cluster m.
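To make the dimensionality issue concrete, the sketch below (with hypothetical field and function names) assembles the per-agent observation and the cluster-level state; the state dimension is proportional to the number of stations in the cluster and therefore changes whenever the cluster boundary changes.

```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class Observation:
    p: float   # active power at the grid-connected node
    q: float   # reactive power at the grid-connected node
    v: float   # voltage magnitude at the grid-connected node

    def as_array(self) -> np.ndarray:
        return np.array([self.p, self.q, self.v], dtype=np.float32)

def cluster_state(observations: Dict[int, Observation], members: List[int]) -> np.ndarray:
    """Concatenate the observations of all station agents in one cluster.

    The resulting dimension is 3 * len(members), so it grows or shrinks whenever
    the cluster boundary changes after a topology reconfiguration.
    """
    return np.concatenate([observations[i].as_array() for i in members])
```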
As shown in Figure 2, after the topology changes, the state space of each cluster agent also changes with the cluster boundary. Taking the three station agents i1, i2, and i3 in topology A as an example, the state space of each agent is composed of the observations of the station areas within its own cluster. When the topology changes from A to B, the cluster memberships of the three station agents change, and their state spaces expand or contract accordingly.
The action space represents the decision quantities of the station area agent based on its observation o_{i,t}, specifically

a_{i,t} = [P^A_{i,t}, Q^A_{i,t}]

where P^A_{i,t} and Q^A_{i,t} are the active and reactive power of station area i interacting with the feeder layer, which are realized by the station-side PV and energy storage inverters. P^A_{i,t} > 0 denotes active power emitted by the station area, P^A_{i,t} < 0 denotes active power absorbed, and similarly for Q^A_{i,t}.
After the action a_{i,t} is executed, the power of the distribution network node connecting station area i becomes

P'_{i,t} = P_{i,t} - P^A_{i,t},  Q'_{i,t} = Q_{i,t} - Q^A_{i,t}
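A minimal sketch of how an action could be applied to the connecting node is given below, assuming the sign convention reconstructed above (positive station output reduces the net power drawn at the node) and illustrative inverter limit parameters.

```python
import numpy as np

def apply_station_action(p_node, q_node, p_act, q_act, p_limit, q_limit):
    """Apply a station-area action to its grid-connected node.

    p_node, q_node   -- node active/reactive power before the action
    p_act,  q_act    -- action: power exchanged with the feeder layer
                        (positive = emitted by the station, negative = absorbed)
    p_limit, q_limit -- inverter capability limits used to clip the action
    """
    p_act = float(np.clip(p_act, -p_limit, p_limit))
    q_act = float(np.clip(q_act, -q_limit, q_limit))
    return p_node - p_act, q_node - q_act
```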
The reward function is used to measure the value of an observation–action pair of agent i. Rewards include an objective function reward and a constraint reward; the larger the reward, the closer the current strategy is to the optimization objective. Since the relationship between station areas is one of collaborative cooperation, the reward function is expressed as the negative of the objective function combined with a penalty term:

r_{i,t} = -f_t - σ Σ_{j∈Ω_m} g_j(a_{j,t})

where f_t is the value of the optimization objective at time t, g_j(a_{j,t}) represents the energy storage capacity constraint that is violated by the action variables of agent j during decision-making, σ is the weight coefficient for balancing the objective reward and the constraint reward, and λ1, λ2, λ3, and λ4 are the ESS penalty function coefficients.
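The reward logic can be sketched as follows; the objective value and the ESS penalty terms are represented by generic placeholders (objective_value, soc_list, sigma, lam), since the exact penalty coefficients and objective definition of the paper are not reproduced here.

```python
def reward(objective_value, soc_list, soc_min, soc_max, sigma=1.0, lam=10.0):
    """Negative objective plus a penalty for violated ESS capacity constraints.

    objective_value -- value of the optimization objective (e.g. loss + voltage deviation)
    soc_list        -- states of charge of the ESS units controlled by agents in the cluster
    soc_min/soc_max -- allowed ESS capacity range
    sigma, lam      -- objective/constraint weighting and penalty coefficient
    """
    penalty = 0.0
    for soc in soc_list:
        penalty += lam * max(0.0, soc_min - soc) + lam * max(0.0, soc - soc_max)
    return -objective_value - sigma * penalty
```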
After executing an action based on the current observation, the agent obtains the immediate reward and enters the next state, where it obtains the observation of the next moment. This process is a state transition; the whole process starts from the initial moment and transitions across time sections until the power flow has no feasible solution or a scheduling cycle ends.
4.2. Model Solving Based on the Actor–Critic Framework
The decision variables of the distribution area agents are the active and reactive power outputs of the DG and ESS, both of which are continuously adjustable, so the MADDPG algorithm is chosen to handle the continuous action space in multi-agent reinforcement learning. For each agent i, the MADDPG algorithm maintains four neural networks: the Actor network μ_i(o_{i,t}; θ_i), which executes an action a_{i,t} based on the observation o_{i,t}, and its target Actor network; and the Critic network Q_i(s_t, a_t; ω_i), used to evaluate the value of actions, and its target Critic network, where s_t and a_t denote the sets of observations and actions of agent i and the other agents in the same cluster at time t, and θ_i and ω_i are the corresponding network parameters. The target networks do not participate in training; the parameters of the Actor and Critic networks are periodically copied to the corresponding target networks to stabilize the training process. The output of the Actor network is the action a_{i,t}, denoted as

a_{i,t} = clip(μ_i(o_{i,t}; θ_i) + ε, a_min, a_max)

where ε is Gaussian exploration noise, a_max and a_min are the upper and lower action limits, i.e., the upper and lower limits of the interaction power between the station area and the distribution network, and clip(·) limits the noisy output so that the executed action is a value between the upper and lower action limits.
The Critic network outputs the value assessment for s_t and a_t, and the network parameters ω_i are updated by gradient descent, with the update formula detailed in Section 4.3. The Actor network is trained using gradient ascent to maximize the value Q_i(s_t, a_t; ω_i) of the current action a_{i,t}.
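A minimal PyTorch sketch of the Actor side, including the Gaussian exploration noise and the clipping to the action limits, is shown below; the layer sizes and noise scale are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mapping a local observation to a station-area action."""
    def __init__(self, obs_dim: int, act_dim: int, a_min: float, a_max: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Tanh(),   # output in [-1, 1]
        )
        self.a_min, self.a_max = a_min, a_max

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # rescale the tanh output to the physical action range
        half_span = 0.5 * (self.a_max - self.a_min)
        centre = 0.5 * (self.a_max + self.a_min)
        return centre + half_span * self.net(obs)

    def act(self, obs: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
        """Exploration action: add Gaussian noise, then clip to the action limits."""
        with torch.no_grad():
            a = self.forward(obs)
            a = a + noise_std * torch.randn_like(a)
        return a.clamp(self.a_min, self.a_max)
```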
4.3. Attention Critic Network Considering Dynamic State Inputs Under Cluster Changes
The Critic network of the traditional MADDPG algorithm consists of a fully connected neural network, where the number of neurons in the input layer corresponds to the fixed dimension of the joint observation–action input (s_t, a_t). Once the Critic network is constructed, its input dimension cannot be modified, which limits its application to scenarios with a fixed number of agents under a fixed topology. When the topology changes, the cluster boundaries also expand or contract, bringing the problem of dynamic changes in the state space of the agents described in Section 4.1. To address this problem, this paper proposes an attention encoder Critic network (AECN), which maps the dynamic observation inputs to a fixed-dimensional space through encoding, ensuring the dynamic scalability of the network; the network structure is shown in Figure 3.
The observation–action vector of the distribution substation area agent is encoded into the initial input of the value network, with an initial set of attention weights assigned. The attention network continuously adjusts the attention weight coefficients assigned to each agent during training, enabling the value network to dynamically assess each substation’s contribution to the optimization objective. Based on the adjusted attention weights, the policy network is then guided in making strategy corrections. As shown in
Figure 3, for any agent i, the input to its AECN can be divided into two parts: one part is the environmental information e_i, which is the encoding of the agent's own observation–action combination (o_{i,t}, a_{i,t}) of the distribution network environment, and the other part is the attention information x_i, which encodes the observation–action combinations of the other agents within the cluster:

e_j = g_j(o_{j,t}, a_{j,t})
β_{ij} = (W_q e_i)^T (W_k e_j)
α_{ij} = exp(β_{ij}) / Σ_{l∈Ω_m, l≠i} exp(β_{il})
x_i = Σ_{j∈Ω_m, j≠i} α_{ij} h(V e_j)

In the equations, e_j represents the encoding of agent j, where W_k and W_q are the parameters for mapping the encoding to the key and to the query, respectively; β_{ij} denotes the relevance coefficient that measures the correlation between the observation–action combinations of agent i and agent j; and α_{ij} is the attention weight coefficient obtained by normalizing β_{ij}, representing the importance of the actions of the other agents, as observed by agent i, to the task at hand.
The Critic network of agent i concatenates the two input parts, resulting in an output Q_i(s_t, a_t), which can be expressed as

Q_i(s_t, a_t) = f_i(e_i, x_i)

In the equation, f_i represents a fully connected neural network, V is the linear transformation matrix applied to the encodings e_j in x_i, and h denotes the activation function.
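The attention-based encoding reconstructed above can be sketched in PyTorch as follows; embedding sizes, layer widths, and the ReLU activation are assumptions, but the key point is that the input to the value head has a fixed dimension regardless of how many other agents are in the cluster.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    """Critic whose input size is independent of the number of agents in the cluster."""
    def __init__(self, obs_act_dim: int, embed_dim: int = 32):
        super().__init__()
        self.encoder = nn.Linear(obs_act_dim, embed_dim)        # g(.): per-agent encoding
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)  # query projection
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)  # key projection
        self.V = nn.Linear(embed_dim, embed_dim, bias=False)    # value transformation
        self.head = nn.Sequential(                              # f(.): value network
            nn.Linear(2 * embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, own_oa: torch.Tensor, other_oa: torch.Tensor) -> torch.Tensor:
        """own_oa: (obs_act_dim,); other_oa: (n_others, obs_act_dim), n_others may vary."""
        e_i = self.encoder(own_oa)                    # environmental information
        e_j = self.encoder(other_oa)                  # encodings of the other agents
        scores = self.W_k(e_j) @ self.W_q(e_i)        # relevance coefficients beta_ij
        alpha = F.softmax(scores, dim=0)              # attention weights alpha_ij
        x_i = (alpha.unsqueeze(-1) * torch.relu(self.V(e_j))).sum(dim=0)  # attention info
        return self.head(torch.cat([e_i, x_i], dim=-1))
```

Because the other agents' encodings are reduced to a single weighted sum, the same Critic can be reused unchanged when a topology change adds or removes stations from the cluster.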
The Critic network update is implemented by minimizing a joint regression function, with the loss function given by

L(ω_i) = E[(Q_i(s_t, a_t; ω_i) - y_t)^2],  y_t = r_{i,t} + γ Q'_i(s_{t+1}, a_{t+1}; ω'_i)

In the equation, E represents the expectation, Q'_i denotes the output of the target Critic network, γ is the discount factor, and a_{t+1} is the set of actions produced by the target Actor networks in the next state.
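Under the reconstructed loss above, one Critic update step could look like the following sketch; the transition fields and helper objects are assumptions, and the next-step observation–action pairs are taken to already contain the target Actor actions.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, optimizer, transitions, gamma=0.95):
    """One gradient step on the mean-squared Bellman error over sampled transitions.

    Each transition is a dict with: own_oa, other_oa, reward, next_own_oa, next_other_oa,
    where the next observation-action pairs already use the target Actor actions.
    """
    losses = []
    for tr in transitions:
        with torch.no_grad():
            y = tr["reward"] + gamma * target_critic(tr["next_own_oa"], tr["next_other_oa"])
        q = critic(tr["own_oa"], tr["other_oa"])
        losses.append(F.mse_loss(q, y))
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```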
The algorithm training process is shown in Algorithm 1. First, the neural network parameters and the experience replay buffer are initialized. Then, the distribution network environment, based on the actions output by the Actor network, performs power flow simulations to compute the node voltage amplitude and active network loss values. The immediate reward is calculated using the reward function in Equation (22). Following this, the system transitions to the next state based on load and substation output. The resulting data from the interaction is stored in the experience replay buffer, completing one interaction cycle. When the data in the experience replay buffer reaches a certain amount, batch data are sampled to train the Critic and Actor networks. The two networks assist each other in training based on the gradient update formulas until convergence. After training is completed, only the Actor network is deployed to the substation. During the online decision phase, control commands are output in real-time through a feedforward pass of the neural network, enabling millisecond-level decision-making for control instructions.
Algorithm 1 Multi-Agent Reinforcement Learning-Based Coordinated Control Algorithm for Multiple Area Clusters in Distribution Networks
Initialize: the parameters θ_i and ω_i of the Actor and Critic networks of all agents in the area
Initialize: the experience replay buffer, the agents' learning rates, and the exploration coefficient
for episode in 1, 2, 3, … do
    Initialize the distribution network environment
    for t in 1, 2, 3, …, T do
        The area agents obtain local observations o_{i,t}
        The agents' Actor networks select actions a_{i,t} based on the observations at time t, forming the state s_t at the current time t
        The action commands interact with the distribution network environment; power flow calculations yield the immediate reward R_t and the transition to the next state s_{t+1}
        Store the interaction experience tuple (s_t, a_t, R_t, s_{t+1}) into the experience replay buffer
        Once the data in the replay buffer reaches the specified threshold, the Critic network starts training
        Randomly sample from the experience pool and update the action value according to Equation (29)
        Update the Actor network to output the decision command
        Softly update the target Actor and Critic networks until the networks converge
    end for
end for
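For reference, the interaction loop of Algorithm 1 can be condensed into the following Python sketch; the environment API (reset, step), the agent methods, and the buffer handling are hypothetical stand-ins for the components described in the text.

```python
import random
from collections import deque

def train(env, agents, episodes=200, horizon=96, batch_size=64, buffer_size=100_000):
    """Interaction loop following Algorithm 1: collect experience, then train
    the Critic and Actor networks once the replay buffer is large enough."""
    buffer = deque(maxlen=buffer_size)
    for episode in range(episodes):
        obs = env.reset()                                  # initialise the environment
        for t in range(horizon):
            actions = {i: ag.act(obs[i]) for i, ag in agents.items()}
            next_obs, rewards, done = env.step(actions)    # power flow + reward
            buffer.append((obs, actions, rewards, next_obs))
            obs = next_obs
            if len(buffer) >= batch_size:                  # threshold reached
                batch = random.sample(buffer, batch_size)
                for ag in agents.values():
                    ag.update_critic(batch)                # per Equation (29)
                    ag.update_actor(batch)
                    ag.soft_update_targets()
            if done:                                       # infeasible power flow
                break
```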