1. Introduction
With the rapid growth of the global economy and population, energy demand is rising, and the impact of carbon emissions on the environment is becoming increasingly significant. To address these environmental challenges, several regions, including the United States [1], Europe [2], and China [3], have introduced policies and plans to reduce carbon emissions. The global power industry is accelerating its transition towards low-carbon development, with sustainable energy gradually replacing traditional energy sources and becoming a crucial component of energy structure adjustment. An important direction in the ongoing decarbonization of the power industry is the use of active distribution networks (ADNs) as the primary carrier for distributed energy resources such as wind and photovoltaic generation. These networks facilitate the development and utilization of renewable energy sources, reduce the consumption of fossil fuels, and contribute to cleaner, lower-carbon operation [4,5].
The problem of low-carbon dispatch in ADNs begins with an accurate representation of the carbon emissions arising from their operation. Equivalent-transformation approaches are currently the most widely studied in power system control problems. Xu et al. [6] used the new energy in a distribution network as its carbon emission reduction indicator. Chen et al. [7] fit the carbon emissions of a unit and used conversion coefficients to convert the carbon emissions into economic costs. To describe the economic cost of carbon emissions in more detail, Qing et al. [8] took into account carbon emission taxes, carbon trading, and the penalty cost of wind and solar curtailment. Chen et al. [9] took the economic benefits from the purchase and sale of carbon emission allowances as the objective function in the context of carbon trading. However, the results of such equivalent conversion methods depend heavily on the reasonableness of the conversion coefficients and the economic model, and they ignore the impact of temporal and spatial variation on the carbon emissions of distributed equipment. Since electricity production is driven by electricity consumption, some studies have argued that the demand side can be regarded as the main cause of carbon emissions [10]. A number of countries have introduced carbon market or carbon tax policies which specify that, in addition to power producers, electricity consumers are also responsible for carbon emissions. Under these policies, consumers are expected to pay for the carbon emissions generated in producing the electricity they consume, calculated by multiplying their electricity consumption by the grid carbon emission factor. Such policies therefore require carbon emission intensities that are differentiated in time and space, so that consumers can be charged according to their true level of carbon emission responsibility. Carbon emission flow evaluation results can provide consumers with timely and regionally differentiated carbon emission signals to support the implementation and settlement of consumer-side charges. In addition, with appropriate incentive programs, this carbon emission distribution information can guide active carbon emission reduction on the demand side, further promoting low-carbon operation and planning of the power system. By treating the carbon emission flow (CEF) as a virtual network flow, Kang et al. [11,12] attached the CEF to the active power flow of the power system, analyzed and calculated the carbon emissions generated in each link of energy production, transmission, and consumption, and tracked the carbon distribution in a power network. However, to simplify the calculation, this model neglects line losses, whereas the line loss rate in ADNs is usually higher than that of the main network; line losses are therefore not negligible, and a CEF model adapted to ADNs is needed.
The large number of sustainable energy sources, such as wind power and photovoltaics, connected to the grid has transformed the traditional distribution network into a network that actively controls bidirectional power flow, i.e., an ADN. At the same time, distributed sustainable energy output is characterized by randomness, volatility, and intermittency [13], and its high penetration affects the power balance, power quality, and supply reliability of the distribution network [14]. Distributed energy storage systems are crucial for addressing the challenges of intermittent new energy output and the poor temporal and spatial matching of distribution network loads, and reasonable and effective energy storage control strategies enable efficient operation of distribution networks [15]. To characterize sustainable energy uncertainty, the study in [16] established predictive probabilistic models for wind and photovoltaic power and an optimal dispatch method based on uncertainty boundaries; the study in [17] likewise predicts new energy output and loads on dispatch days using historical and weather data. However, both probabilistic models and model-free prediction methods inevitably accumulate errors and are applicable only to specific scenarios. For the stochastic sequential decision-making problem with continuous decision variables that arises in optimal dispatch of distribution networks under sustainable energy uncertainty [18], deep reinforcement learning, as a model-free method, does not need to predict the uncertainty in advance and can handle high-dimensional, complex state spaces [19]. Cao et al. [20] used a proximal policy optimization algorithm to solve a Markov decision problem for the dispatch of renewable energy and energy storage devices in distribution networks and to assess the economics of distribution network operation. Hosseini et al. [21] optimized the reinforcement learning exploration process and imposed penalties on infeasible solutions to ensure the feasibility and optimality of the training results. Xing et al. [22] used a graph reinforcement learning approach to solve the real-time optimal dispatch problem of an active distribution network, combining the topological characteristics of the distribution network and using a graph attention mechanism to extract and aggregate node and line information in each training round to improve the data fusion capability of the agent. Lu et al. [23] used a dual-DQN algorithm to construct a multi-agent system with the goal of minimizing network loss and voltage deviation, in which the two agents provide each other with rewards and strategies to reach the optimal policy. However, the aforementioned studies focus on the economic and security aspects of distribution network operation; research on methods that address the low-carbon operation of distribution networks remains scarce.
In summary, this paper proposes a low-carbon dispatch method for ADNs. The CEF theory is adapted to the characteristics of ADNs, yielding a carbon emission flow calculation method that accounts for network losses. On this basis, a control framework for gas units and energy storage devices is established for low-carbon dispatch of ADNs and, under a set of constraints that guarantee safe and stable operation, a deep reinforcement learning agent is trained and applied to minimize carbon emissions. The main innovations and contributions of this paper are as follows:
- (1)
Based on the CEF theory, the dynamic carbon emission intensity calculation model of gas units, energy storage equipment, and a lossy ADN is established to realize the carbon emission measurement of each link. On this basis, a low-carbon dispatch model is proposed for distributed sustainable energy access scenarios, taking into account operational safety and low-carbon benefits.
- (2)
The ADN low-carbon dispatch problem is modeled as a Markov decision process that accounts for the uncertainties caused by changes in the main-network carbon emission intensity, load variation, and distributed sustainable energy generation. The SAC deep reinforcement learning algorithm is improved by adopting a Gaussian-distribution reward function, which effectively improves the stability of the agent's training process and the algorithm's performance.
5. Solving Low-Carbon Dispatch Model for ADN Based on Improved SAC
5.1. Markov Decision Process Framework
The low-carbon dispatch problem is a stochastic sequential decision-making problem that requires continuous decisions on the operation of gas units and energy storage devices under uncertainty in new energy output, nodal loads, and the carbon potential of the main network in the distribution network.
We use the Markov decision process (MDP) as the mathematical framework for modeling this stochastic sequential decision-making problem and, in line with the characteristics of the distribution network carbon potential calculation, formulate the dispatch model as a finite-horizon MDP defined by four elements (S, A, p, r):
- (1)
S is the state space that supports the action decisions of all the gas units and energy storage devices in the distribution network. The variables in the state space are all continuous.
The observation state st contains the carbon potential eGrid(t) of the main grid, the nodal load vector, the wind turbine power vector, the PV power vector, the carbon potential of the node to which each gas unit is connected, and the carbon potential of the node to which each energy storage device is connected, together with the gas unit power at time t − 1 and the energy storage state of charge SOCM(t − 1). Except for the main-network carbon potential, each of these quantities is a vector containing multiple state variables.
- (2)
A is the action space. In each time period t, the agent outputs the optimal action according to changes in the environment. The action at comprises the gas unit outputs and the operating states of the energy storage devices: its components denote the active power of the KM gas units at time t and the charging or discharging power of the M energy storage devices.
- (3)
p is the state transfer probability, which denotes the probability density of the current state
st to move to the next state
st+1 under action
at. The transition process from
st to
st+1 can be expressed as
where
(
t) and
(
t) are the action values in the current state and
wt denotes the environmental randomness.
- (4)
r denotes the reward returned by the environment for taking action at in each state transition, where ET(t) is the single-step carbon emission cost within the dispatch cycle.
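To make this MDP formulation more concrete, the following minimal Python sketch shows how a dispatch environment could expose the state, action, and reward interfaces described above. All class and member names (e.g., ADNDispatchEnv, _run_power_and_carbon_flow) are hypothetical placeholders, and the power flow and carbon emission flow models of the earlier sections are not reproduced here.

```python
import numpy as np

class ADNDispatchEnv:
    """Illustrative ADN low-carbon dispatch environment (hypothetical names).

    State  s_t: main-grid carbon potential, nodal loads, WT/PV power,
                carbon potentials at gas-unit and storage nodes,
                previous gas-unit power and storage SOC.
    Action a_t: active power of the gas units and charging/discharging
                power of the storage devices (continuous values).
    Reward r_t: negative single-step carbon emission cost plus penalties.
    """

    def __init__(self, n_gas_units=3, n_storage=2, horizon=96):
        self.n_gas = n_gas_units
        self.n_ess = n_storage
        self.horizon = horizon          # one dispatch cycle at 15-min resolution
        self.t = 0

    def _observe(self, data):
        # Assemble the observation vector s_t from measured quantities.
        return np.concatenate([
            [data["e_grid"]],           # main-grid carbon potential
            data["loads"],              # nodal load vector
            data["wt_power"],           # wind turbine power vector
            data["pv_power"],           # PV power vector
            data["e_gas_nodes"],        # carbon potential at gas-unit nodes
            data["e_ess_nodes"],        # carbon potential at storage nodes
            data["p_gas_prev"],         # gas-unit power at t - 1
            data["soc_prev"],           # storage state of charge at t - 1
        ])

    def step(self, action):
        p_gas = action[: self.n_gas]    # gas-unit active power set-points
        p_ess = action[self.n_gas:]     # storage charging(+)/discharging(-) power

        # Power flow and carbon emission flow of the lossy ADN
        # (placeholder for the models of the preceding sections).
        data = self._run_power_and_carbon_flow(p_gas, p_ess)

        # Baseline reward from the single-step emission cost E_T(t) and penalty;
        # the improved Gaussian reward of Section 5.2 would replace this line.
        reward = -(data["emission_cost"] + data["penalty"])

        self.t += 1
        done = self.t >= self.horizon
        return self._observe(data), reward, done

    def _run_power_and_carbon_flow(self, p_gas, p_ess):
        raise NotImplementedError("replace with the ADN power/carbon flow models")
```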
5.2. Improvement of Soft Actor–Critic Algorithm
In actual operation, environmental stochasticity makes it difficult to establish an accurate model, so we propose an improved Soft Actor–Critic (SAC) algorithm to solve the above MDP problem [29]. The overall structure is shown in Figure 4.
SAC uses one Actor Network and two Critic Networks to model strategies and value functions:
- (1)
Actor Network
The Actor Network decides the action to be taken in a given state. It takes the current state as input, outputs the action and the corresponding policy entropy, and learns the optimal policy π* by maximizing the entropy objective shown in (29), where E represents the expectation, π* and π represent the optimal and current policies, respectively, and pπ is the state–action probability distribution under policy π; the entropy term is used to increase policy diversity, and the weight of the entropy term in the overall reward is adjusted by the temperature factor α.
To address the uncertainty of sustainable energy sources and loads in the operating environment of the ADN, we improve the reward function of the SAC algorithm to improve the robustness of the model. In the reward function, strategies that do not satisfy the constraints are penalized through the penalty term δ of Equation (30), where δ0 is the penalty term for the distribution network power flow, security, and power balance constraints; δDN is the penalty term for the allowable voltage deviation of the distribution network; δMT is the penalty term for the maximum output power of the gas-fired units; δESS is the penalty term for the state-of-charge limits of the energy storage equipment; and η is the penalty coefficient.
At the same time, the large difference in magnitude between the cost objective and the penalty terms affects the quality of the solution. We therefore build the reward function from a Gaussian distribution for each item: first, the target value ET and the penalty term δ of the current strategy are calculated according to Equations (16) and (30); then the corresponding standard deviations σ1 and σ2 are selected according to the value ranges; and since the target values are all positive, the mean of each Gaussian distribution is taken to be 0, yielding the improved reward function, where λ1 and λ2 are adjustable weight coefficients used to modulate the contributions of the two Gaussian terms.
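As a small illustration, a reward built from two zero-mean Gaussian terms of the kind described above could look like the following sketch; the weights and standard deviations shown are placeholder values, and the exact functional form used in this paper may differ.

```python
import numpy as np

def gaussian_reward(emission_cost, penalty,
                    sigma1=50.0, sigma2=10.0, lam1=1.0, lam2=1.0):
    """Reward from two zero-mean Gaussian terms (illustrative form).

    emission_cost: single-step carbon emission cost E_T(t), non-negative
    penalty:       aggregated constraint penalty delta, non-negative
    sigma1/sigma2: standard deviations chosen from the respective value ranges
    lam1/lam2:     adjustable weight coefficients lambda_1 and lambda_2
    """
    # Each term peaks at 1 when its argument is 0 (no emissions / no violation)
    # and decays smoothly as the cost or penalty grows, keeping the two
    # objectives on a comparable scale.
    r_emission = np.exp(-emission_cost ** 2 / (2.0 * sigma1 ** 2))
    r_penalty = np.exp(-penalty ** 2 / (2.0 * sigma2 ** 2))
    return lam1 * r_emission + lam2 * r_penalty

# A feasible, low-emission schedule yields a reward close to lam1 + lam2.
print(gaussian_reward(emission_cost=12.0, penalty=0.0))
```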
Then, the optimization objective function and the update of the temperature parameter α in the Actor Network are expressed as in Equations (35) and (36), where D is the experience replay pool, πϕ is the Actor Network, ϕ denotes its network parameters, and H0 is the entropy target determined by the dimension of the action matrix.
- (2)
Critic Networks
The core task of the Critic Networks is to learn and approximate a state–action value function Q in order to evaluate the expected return of taking a given action in a given state under the current policy. SAC uses two independent Critic Networks for this purpose, which reduces the overestimation bias in the Q-value estimate and improves the stability of the algorithm. The target Q-value y is computed from the current reward and the Q-value of the next state, as in Equation (37), where γ is the discount factor, which measures the importance of future rewards; the target Q networks are used to compute the Q-value of the next state; α is the temperature parameter; and the logarithmic probability term reflects the probability with which the policy network chooses action at+1 in the next state.
The loss function of each Critic Network i is obtained by minimizing the mean square error between the target Q-value and the predicted Q-value, as in Equation (38), where D is the experience replay pool storing previous state-transition samples and θi denotes the parameters of Critic Network i. The agent training process is shown in Algorithm 1.
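As a concrete illustration of the target value of Equation (37) and the Critic loss of Equation (38), the following PyTorch sketch implements a generic SAC critic update; the network interfaces, batch format, and hyperparameter values are assumptions rather than the exact implementation of this paper.

```python
import torch
import torch.nn.functional as F

def critic_update(q1, q2, target_q1, target_q2, policy, log_alpha,
                  batch, q_optimizer, gamma=0.99):
    """Generic SAC critic update (illustrative).

    batch = (states, actions, rewards, next_states, dones);
    policy(next_states) is assumed to return the next action and its
    log-probability under the current policy.
    """
    states, actions, rewards, next_states, dones = batch
    alpha = log_alpha.exp()

    # Target value of Equation (37):
    # y = r + gamma * (min_i Q'_i(s_{t+1}, a_{t+1}) - alpha * log pi(a_{t+1}|s_{t+1}))
    with torch.no_grad():
        next_actions, next_log_pi = policy(next_states)
        q_next = torch.min(target_q1(next_states, next_actions),
                           target_q2(next_states, next_actions))
        y = rewards + gamma * (1.0 - dones) * (q_next - alpha * next_log_pi)

    # Critic loss of Equation (38): mean square error between prediction and target.
    q_loss = F.mse_loss(q1(states, actions), y) + F.mse_loss(q2(states, actions), y)

    q_optimizer.zero_grad()
    q_loss.backward()
    q_optimizer.step()
    return q_loss.item()
```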
In each step of the decision-making process, the agent first carries out ADN power flow and carbon emission flow calculations according to the changes in nodal load and sustainable energy output, obtains the carbon potential distribution of the nodes not yet connected to generation equipment, and updates the state space variables. It then makes a decision based on the load and carbon potential information in the state space; according to the analysis of the dynamic impact of distributed generation on nodal carbon potential in Section 2.2, the carbon potential state of a node is an important basis for the action decision. Finally, the agent checks whether the constraints are satisfied, calculates the reward value, and completes the decision-making for one dispatch interval.
Algorithm 1 Solving dispatch model based on improved SAC
1: Initialize: policy network πϕ, two Q networks Qθ1 and Qθ2, target Q networks Qθ1′ and Qθ2′, and replay buffer D.
2: for each environment step do:
3:  Sample action at ∼ πϕ(⋅∣st) from the policy at state st.
4:  Execute action at, receive reward rt and next state st+1.
5:  Store transition (st, at, rt, st+1) in D.
   end for
6: for each update step do:
7:  Randomly sample a batch of transitions (st, at, rt, st+1) from D.
8:  Compute target value y according to Equation (37).
9:  Update critic networks according to Equation (38).
10: Update policy network according to Equation (35).
11: Adjust temperature parameter α according to Equation (36).
12: Softly update target Q networks.
   end for
13: Output: Learned policy network parameters ϕ.
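Steps 10–12 of Algorithm 1 (the policy update of Equation (35), the temperature adjustment of Equation (36), and the target network update) can be sketched in the same generic SAC form; the function signature and the soft-update coefficient τ below are illustrative assumptions.

```python
import torch

def actor_alpha_and_target_update(policy, q1, q2, target_q1, target_q2,
                                  log_alpha, states, policy_optimizer,
                                  alpha_optimizer, target_entropy, tau=0.005):
    """Steps 10-12 of Algorithm 1 in generic SAC form (illustrative)."""
    alpha = log_alpha.exp()

    # Step 10 (Equation (35)): maximize min Q minus the entropy-weighted log-probability.
    actions, log_pi = policy(states)          # reparameterized action sample
    q_min = torch.min(q1(states, actions), q2(states, actions))
    policy_loss = (alpha.detach() * log_pi - q_min).mean()
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    # Step 11 (Equation (36)): keep the policy entropy near the target H_0.
    alpha_loss = -(log_alpha * (log_pi.detach() + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()

    # Step 12: soft update of the target Q networks.
    for target_net, net in ((target_q1, q1), (target_q2, q2)):
        for tp, p in zip(target_net.parameters(), net.parameters()):
            tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)
```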
6. Case Simulation
6.1. Experimental Configuration
To validate the effectiveness and advantages of the proposed low-carbon dispatch method for active distribution networks, it is tested on a modified IEEE 33-bus system. The node system is shown in Figure 5, with wind turbines (WTs) at nodes 5, 15, and 30, photovoltaic arrays at nodes 19, 24, and 32, dispatchable gas-fired units at nodes 4, 11, and 29, and electrical energy storage devices at nodes 18 and 33. The operating parameters of the devices are shown in Table 1.
The wind power, PV generation, load, and main-grid carbon potential data used for distribution network operation are obtained from Tianjin, China, with a sampling interval of 15 min. The dataset is cleaned, and the data are scaled to the corresponding level of the IEEE 33-bus system.
The hyperparameter configuration for training the SAC networks is shown in Table 2.
All models in this paper were trained in a Python 3.6.9 environment on a Windows 11 system with an AMD Ryzen 7 4800H CPU and an RTX 3060 GPU.
6.2. Evaluation of the Training Process
The convergence trend of the average reward curve obtained during training is an important indicator for evaluating the performance of deep reinforcement learning algorithms.
Figure 6 shows the training curves of the agent using the original reward function over 10,000 training episodes. We use the linear sum of the objective function and the normalized penalty term as the original reward function. The reward curve of the improved algorithm proposed in this paper is shown in Figure 7.
In the initial stage, the agent explores with high stochasticity under the maximum entropy strategy, and most decisions deviate from the optimization objective, so the reward curve fluctuates strongly. As training proceeds, the interaction between the agent and the operating environment of the distribution network becomes more effective, the reward value gradually rises, and the reward curve finally converges to a high value at the end of training, indicating that the agent has learned the optimal dispatch strategy for the multiple objectives. Compared with the simple linear combination reward function, the reward curve of the Gaussian-distribution-based method fluctuates less, indicating that the algorithm is more stable; moreover, its reward converges at about 2200 episodes, compared with about 4600 episodes for the control method. Thus, the improved reward function using Gaussian sampling outperforms the baseline algorithm for most of the training time.
6.3. Scheduling Result Performance
The agent trained on the training dataset is capable of real-time optimal dispatch of the active distribution network. To verify the performance of the improved SAC algorithm, data from one dispatch cycle (24 h) are selected from the test set to evaluate the scheduling of each device during active distribution network operation, as well as the changes in the distribution network indices.
From Figure 8, it can be seen that the gas units and energy storage equipment in the ADN act according to the carbon potential of the main grid, the output of wind and solar energy, and load changes. During low-consumption hours, such as 1:15–6:00, the carbon potential of the main grid is low, so the gas units reduce their output, the energy storage equipment charges, and the distribution network increases its intake of low-carbon power from the main grid. During peak hours, such as 16:30–18:30, when the carbon potential of the main grid is high, the distribution network increases the output of the gas-fired units and discharges the energy storage equipment to release the clean power stored during low-carbon-potential hours, thereby reducing the use of high-carbon power from the main grid.
The change in the operating status of the 1# gas unit as the carbon potential of the main grid varies is shown in Figure 9. When the carbon potential of the main grid is lower than the carbon emission factor of the gas unit, such as during 0:00–7:00, the output power of the gas unit is low and mainly serves to regulate the stable operation of the distribution network. When the carbon potential of the main grid is higher than the carbon emission factor of the gas-fired units, such as during 10:00–17:00, the gas-fired units replace part of the power input from the main grid; under the given load and storage demand, the gas-fired units then provide a better low-carbon benefit than importing the same amount of electricity from the main grid, and thus the unit output power increases.
Figure 10 shows the scheduling results of the 1# energy storage equipment under the changing carbon potential of its connected 3# node, with the initial SOC of the energy storage equipment set to 80% and its initial carbon emission factor set to 0.6 kgCO2/kWh. In the initial stage, the energy storage equipment discharges continuously to meet the high load demand of the distribution network. When the load demand decreases and the nodal carbon potential falls below the carbon emission factor of the energy storage equipment, such as during 1:45–7:30, the energy storage equipment switches to charging and absorbs low-carbon power from the distribution network; based on the changes in nodal carbon potential and charging power in the model of Equation (9), its discharge carbon emission factor at the end of the charging period is calculated to be 0.57 kgCO2/kWh. When the load demand increases, such as during 7:45–17:30, the energy storage device discharges continuously and outputs the low-carbon power stored during the previous charging period; its discharge carbon emission factor of 0.57 kgCO2/kWh is lower than the nodal carbon potential, thereby reducing carbon emissions.
The 33-node distribution network is plotted as a 6 × 6 grid diagram representing the distribution of nodal carbon potential at a given moment, with color indicating the level of nodal carbon potential. Figure 11 shows the distribution of nodal carbon potential at different times when the dispatch of gas units and energy storage devices is not considered, while Figure 12 shows the carbon potential distribution over different time periods when the proposed model and optimal operation strategy are used. Comparing the two figures, it can be seen intuitively that the proposed method reduces the overall carbon potential during active distribution network operation, expands the reach of zero-carbon energy sources such as wind and photovoltaics, and improves their utilization by controlling gas-fired units and energy storage devices in response to the new energy output and the high- and low-carbon variations of the main network.
In summary, under the combined variations in carbon potential, wind and solar output, and load in the active distribution network, the agent trained with the improved SAC algorithm can control the operating states of the gas units and energy storage devices: during low-carbon hours of the main network, it reduces the output of the gas units and increases the charging power of the energy storage devices to absorb low-carbon power; during high-carbon hours of the main network, it increases the output of the gas units and controls the energy storage equipment to release the stored low-carbon power. In this way, low-carbon operation of the distribution network is realized while load supply and stable operation are maintained.
6.4. Comparison of Model-Solving Algorithms
Under the hyperparameter configuration of Table 2, the improved SAC algorithm proposed in this paper is compared with the classical SAC, DDPG [30], and PPO [31] algorithms, and the comparison of carbon emission costs is shown in Table 3.
The DDPG method defines two neural networks: one for the policy (Actor) and one for the value function (Critic), similar to the SAC method discussed in this paper. It uses an experience replay buffer to store the agent’s experiences, allowing for random sampling during training to break data correlation. PPO includes a shared main network for generating policies and estimating value functions. It optimizes the objective function through gradient ascent to update the network parameters, facilitating the learning of the agent.
The improved SAC algorithm used in this paper utilizes a Gaussian-distribution reward function, converges to the optimal solution faster, and reduces carbon emissions by 6.4% compared with the classical SAC algorithm. Compared with PPO, which requires training on massive amounts of data, and DDPG, which adopts a deterministic policy, the carbon emissions of the proposed method are reduced by 7% and 12.7%, respectively, realizing optimal operation of the proposed active distribution network low-carbon dispatch model.
7. Conclusions
In this paper, a model and method for low-carbon dispatch of active distribution networks combining carbon emission flow calculation theory and deep reinforcement learning are developed. By establishing the operation model and carbon flow calculation model of a lossy distribution network, carbon measurement is realized for each operational link; on this basis, a low-carbon dispatch model of the active distribution network is established that accounts for intermittent wind and solar energy, giving the carbon emission cost of distribution network operation a clearer interpretation, while the calculated nodal carbon potential supports the optimal decision-making of the gas units and the energy storage equipment. The powerful data-fitting and decision-making ability of deep reinforcement learning is used to solve for the optimal operation strategy, and the classical SAC algorithm is improved with a combination of Gaussian distributions in the reward function, which significantly improves the training performance of the agent. Finally, in the case study, the agent makes decisions based on real-time measured load demand, wind and solar output, and main-network carbon potential data without any forecast information, which improves the adaptability of the system to uncertainty and verifies the effectiveness of the model and the algorithm improvement proposed in this paper.