Article

A Reinforcement Learning-Based Approach for Distributed Photovoltaic Carrying Capacity Analysis in Distribution Grids

1 State Grid Shandong Electric Power Research Institute, No. 2000 Wangyue Road, Shizhong District, Jinan 250002, China
2 School of Electrical Engineering and Automation, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Energies 2025, 18(18), 5029; https://doi.org/10.3390/en18185029
Submission received: 18 June 2025 / Revised: 13 July 2025 / Accepted: 24 July 2025 / Published: 22 September 2025

Abstract

Driven by the “dual-carbon” goals, the penetration rate of distributed photovoltaics (PV) in distribution networks has increased rapidly. However, the continuous growth of distributed PV installed capacity poses significant challenges to the carrying capacity of distribution networks. Reinforcement learning (RL), with its capability to handle high-dimensional nonlinear problems, plays a critical role in analyzing the carrying capacity of distribution networks. This study constructs an evaluation model for distributed PV carrying capacity and proposes a corresponding quantitative evaluation index system by analyzing the core factors that influence it. An optimization scheme based on deep reinforcement learning is adopted, introducing the Deep Deterministic Policy Gradient (DDPG) algorithm to solve the evaluation model. Finally, simulations on the IEEE-33 bus system validate the feasibility of the reinforcement learning approach for this problem.

1. Introduction

With the advancement of the “dual-carbon” goals, the penetration rate of distributed photovoltaics (PV) in distribution networks has rapidly increased, making it a critical force in energy transition. It offers advantages such as low energy consumption, minimal environmental pollution, and flexible installation locations. However, the continuous growth in distributed PV installation scale poses severe challenges to the carrying capacity of distribution networks. The power output of distributed PV is intermittent and random, and the large-scale integration of distributed PV can lead to intensified voltage fluctuations at distribution network nodes, thereby affecting the safe operation of the distribution network [1]. During periods of abundant noon sunlight, a surge in PV power generation may trigger local voltage violations. At night or on rainy days, when PV power generation is insufficient, reliance on power purchased from the main grid exacerbates the risk of voltage sags. Additionally, the complexity of distribution networks further increases the difficulty of carrying capacity assessment and optimization.
The concept of the distributed PV carrying capacity of a distribution network is defined as the maximum capacity of distributed PV that can be integrated into the network while satisfying the constraints of safe operation [2]. Ref. [3] proposed models for the comprehensive carrying capacity and comprehensive carrying capacity coefficient of distribution networks but did not mention methods for analyzing PV carrying capacity. Ref. [4] analyzed the impacts of distributed PV integration on the power quality and fault current of distribution networks but only considered the static effects of distributed PV integration, ignoring its dynamic characteristics. Ref. [5] constructed a carrying capacity model with the objective of maximizing the carrying capacity of distributed new energy sources but did not consider the impacts of extreme scenarios of distributed generation output on the carrying capacity of distribution networks. Refs. [6,7] improved the operational efficiency of the system and the accommodation capacity of new energy through optimized scheduling of resources, providing new ideas for enhancing the carrying capacity of distributed PV in distribution networks.
Reinforcement learning (RL), an emerging machine learning approach, exhibits robust adaptive and optimization capabilities. Its core principle involves an agent optimizing decision-making strategies by continuously exploring and leveraging reward signals from the environment to maximize long-term cumulative returns [8]. Distinctively, RL does not require extensive pre-labeled data but achieves autonomous learning through dynamic interactions, making it particularly suitable for addressing complex and uncertain decision-making problems. The research and application of RL in power systems provide a foundation for analyzing the distributed PV carrying capacity of distribution networks. Reference [9] combines meta-learning with deep reinforcement learning (DRL) to address voltage violations caused by PV output fluctuations and randomness, yet lacks in-depth analysis of carrying capacity assessment methods. Reference [10] proposes a multi-agent DRL-based local voltage control framework for distributed energy resources, effectively mitigating voltage fluctuations and indirectly enhancing PV hosting capacity. Reference [11] presents a multi-time scale DRL reactive power optimization strategy, offering an efficient approach for reactive power management under high PV penetration. References [12,13,14] summarize DRL applications in various distribution network scenarios, providing comprehensive theoretical and practical references for RL-based PV hosting capacity improvement, though their case studies are highly specific and lack validation in large-scale grids or multi-user environments. References [15,16,17] optimize PV inverter active/reactive power outputs via RL to regulate voltage, indirectly contributing to hosting capacity enhancement. References [18,19,20] employ DRL for energy coordination and demand response management to reduce electricity costs, but their models exhibit high computational complexity. Reference [21] introduces a chance-constrained programming-based PV hosting capacity assessment method combined with RL for capacity optimization, yet overlooks safety verification under high-penetration scenarios. Collectively, these studies leverage RL for power system optimization but do not directly address the explicit analysis of distribution network PV hosting capacity.
According to the definition in reference [2], determining the distributed PV hosting capacity is inherently an optimization problem. Reinforcement learning, with its robust adaptive and optimization capabilities, offers a promising solution. Therefore, introducing RL technology to optimize distribution network operation strategies can serve as an effective approach to enhance the accommodation capacity of distributed PV, improve power system stability, and reduce economic costs.

2. Analysis of Carrying Capacity of Distributed Photovoltaic Power Generation in Distribution Network

2.1. Influencing Factors of Photovoltaic Carrying Capacity

In the power output structure of distributed photovoltaics (PV) in distribution networks, the PV components generally include PV panels, PV inverters, protection and control systems, and energy storage systems [22].
The typical structure diagram of a distribution network after connecting distributed photovoltaics is shown in Figure 1.
When distributed photovoltaics (PV) are integrated into a distribution network, they alter the voltage distribution and line power flow of the network. Suppose there are n nodes on a distribution network feeder, and the voltage at the k-th node is
U_k = U_0 - \sum_{i=1}^{k} \frac{r l_i \sum_{j=i}^{n} (P_{L,j} - P_{pv,j}) + x l_i \sum_{j=i}^{n} (Q_{L,j} - Q_{pv,j})}{U_{i-1}}
where U0 represents the voltage at the connection node between the primary grid and the distribution network; Ui−1 is the voltage at node i − 1; PL,j and QL,j denote the active and reactive power of the load at node j, respectively; Ppv,j and Qpv,j represent the active and reactive power output by the distributed PV system PVj connected at node j, where j = 1, 2, 3, …, n; li is the line length between nodes i − 1 and i; r and x are the resistance and reactance per unit length of the transmission line, respectively.
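To make Equation (1) concrete, the short Python sketch below computes the voltage profile of a hypothetical radial feeder with distributed PV; all per-unit data are illustrative placeholders, not values from this paper.

```python
# Illustrative sketch of Eq. (1): voltage profile along a radial feeder with PV.
import numpy as np

def feeder_voltages(U0, r, x, seg_len, P_load, Q_load, P_pv, Q_pv):
    """Voltages U_1..U_n along an n-node radial feeder, following Eq. (1)."""
    n = len(P_load)
    U = np.empty(n + 1)
    U[0] = U0
    for k in range(1, n + 1):
        drop = 0.0
        for i in range(1, k + 1):
            P_net = np.sum(P_load[i - 1:] - P_pv[i - 1:])  # net downstream active power
            Q_net = np.sum(Q_load[i - 1:] - Q_pv[i - 1:])  # net downstream reactive power
            drop += (r * seg_len[i - 1] * P_net + x * seg_len[i - 1] * Q_net) / U[i - 1]
        U[k] = U0 - drop
    return U[1:]

# Hypothetical 5-node feeder: heavy PV near the feeder end raises the end-node voltage.
P_L = np.array([0.4, 0.3, 0.3, 0.2, 0.2]); Q_L = 0.3 * P_L
P_pv = np.array([0.0, 0.0, 0.2, 0.4, 0.8]); Q_pv = np.zeros(5)
print(feeder_voltages(1.0, 0.01, 0.008, np.ones(5), P_L, Q_L, P_pv, Q_pv))
```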
When the power output of distributed photovoltaics (PV) exceeds the load at a node, reverse power flow occurs. According to Equation (1), the node voltage increases with the increase in PV power output, and the closer to the end of the distribution network feeder, the more significant the voltage rise, which may cause voltage violations. Additionally, the reverse current superimposed on the original load current may exceed the conductor’s current-carrying capacity, accelerating conductor temperature rise and aging. If the reverse power flow passes through a transformer, it may cause reverse overload of the transformer, leading to winding overheating due to long-term reverse power and shortening its service life. Therefore, to ensure the safe and stable operation of the distribution network and prevent transformer overheating and current violations caused by reverse power flow, the following constraints need to be met:
(1)
Line ampacity constraint
I_l < I_{l,\max}
where Il represents the current flow in line l, while Il,max denotes the maximum allowable current-carrying capacity of line l.
(2)
Transformer reverse loading ratio constraint
\frac{\sum_{i \in S_N} P_{pv,i} - \sum_{i \in S_N} P_{L,i}}{S_{T,\max}} \le \lambda_{\max}
where S_N is the set of all nodes in the distribution network, S_{T,max} is the maximum capacity that transformer T can transmit, and λ_max is the maximum reverse loading ratio of the transformer, typically set to 80%.
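The two operating limits in Equations (2) and (3) can be checked directly from measured or simulated quantities. The sketch below is a minimal illustration that uses the 80% reverse-loading limit mentioned above; all other numbers are hypothetical.

```python
# Minimal sketch of the safety checks in Eqs. (2)-(3); inputs are hypothetical.
import numpy as np

def line_ampacity_ok(I_line, I_max):
    """Eq. (2): every line current must stay below its ampacity."""
    return np.all(np.asarray(I_line) < np.asarray(I_max))

def transformer_reverse_ok(P_pv, P_load, S_T_max, lam_max=0.8):
    """Eq. (3): net reverse power through the transformer vs. its rating."""
    reverse = np.sum(P_pv) - np.sum(P_load)
    return reverse / S_T_max <= lam_max

print(line_ampacity_ok([120.0, 340.0], [600.0, 600.0]))          # True
print(transformer_reverse_ok([2.0, 3.5], [1.0, 0.8], 5.0))       # (5.5 - 1.8) / 5.0 = 0.74 <= 0.8
```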

2.2. PV Carrying Capacity Assessment Model

The distributed photovoltaic (PV) carrying capacity of a distribution network generally refers to the maximum installable capacity of distributed PV systems that can be accommodated in a specific area under the premise of safe and stable operation of the distribution network, typically expressed in terms of DC-side peak power or AC-side rated power. Determining the PV carrying capacity requires not only considering factors such as the ampacity of grid equipment, voltage regulation capability, and reverse power flow limitations but also the capacity ratio between PV modules and PV inverters. Therefore, when quantifying the distributed PV carrying capacity of a distribution network, it is essential to first clarify which type of capacity (DC or AC) is being used.

2.2.1. Quantitative Indicators for Photovoltaic Carrying Capacity

In this paper, the photovoltaic (PV) carrying capacity is expressed by the maximum installable capacity of distributed PV systems
f_1 = \sum_{i \in S_N} C_{pv,i}
where f1 denotes the total capacity of distributed PV integrated into the distribution network; and Cpv,i represents the capacity of distributed PV that can be connected to node i.
Therefore, this paper constructs a distributed PV carrying capacity model for distribution networks with the objectives of maximizing the integration capacity of distributed PV and minimizing operational costs. The objective function is as follows
\max f_1
\min f_2 = f_{\text{cost}} + f_{\text{loss}} = \rho_g P_g + \rho_P \sum_{l \in S_L} Z_l I_l^2 + \rho_{pv} P_{pvc}
where f_cost is the cost of purchasing electricity from the main grid; f_loss is the sum of the active power loss cost and PV curtailment cost during system operation; ρ_g is the unit cost of purchasing electricity from the main grid; ρ_P is the unit cost of active power loss; Z_l is the impedance of branch l; I_l is the current in branch l; S_L is the set of all branches in the distribution network; ρ_pv is the unit cost of PV curtailment; P_pvc is the amount of curtailed PV power.
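As an illustration of how the two objectives are evaluated, the following sketch computes f1 and f2 from the quantities defined above; the unit prices ρ_g, ρ_P, ρ_pv and all input data are hypothetical.

```python
# Sketch of the two objectives: total installable PV capacity f1 and operating cost f2.
import numpy as np

def f1_total_capacity(C_pv):
    """Eq. (4): total distributed PV capacity over all candidate nodes."""
    return float(np.sum(C_pv))

def f2_operating_cost(P_g, I_branch, Z_branch, P_curtail,
                      rho_g=0.5, rho_P=0.3, rho_pv=0.2):
    """Purchase cost + active-loss cost + curtailment cost, as in the objective above."""
    f_cost = rho_g * P_g
    f_loss = rho_P * float(np.sum(np.asarray(Z_branch) * np.asarray(I_branch) ** 2)) \
             + rho_pv * P_curtail
    return f_cost + f_loss

print(f1_total_capacity([0.7, 1.0, 1.2]))
print(f2_operating_cost(P_g=3.2, I_branch=[0.4, 0.3],
                        Z_branch=[0.05, 0.08], P_curtail=0.1))
```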

2.2.2. Restrictive Conditions

(1)
Distributed PV Power Output Constraint
0 \le P_{pv,i} \le P_{pv,i,\max}, \quad -Q_{pv,i,\max} \le Q_{pv,i} \le Q_{pv,i,\max}
where P_pv,i and Q_pv,i are the active and reactive power outputs of the distributed PV connected to node i, respectively; P_pv,i,max and Q_pv,i,max are the upper limits of the active and reactive power outputs of the PV at node i.
(2)
Node Voltage Security Constraint
U_{i,\min} \le U_i \le U_{i,\max}
where U_i is the voltage magnitude at node i; U_{i,min} and U_{i,max} are the lower and upper limits of the voltage magnitude at node i.
(3)
Line Ampacity Constraint: as shown in Equation (2).
(4)
Power Flow Balance Constraint
P_{pv,i} + P_{g,i} - P_{L,i} - U_i \sum_{j=1}^{N} U_j (G_{ij} \cos\theta_{ij} + B_{ij} \sin\theta_{ij}) = 0
Q_{pv,i} + Q_{g,i} - Q_{L,i} + Q_{C,i} + U_i \sum_{j=1}^{N} U_j (G_{ij} \sin\theta_{ij} - B_{ij} \cos\theta_{ij}) = 0
where Gij and Bij are the conductance and susceptance between nodes i and j; QC,i is the additional reactive power compensation at node i; θij is the voltage phase angle difference between nodes i and j; Pg,i and Qg,i are the active and reactive power purchased from the primary grid at node i; N is the total number of nodes in the network.
(5)
Primary Grid Power Constraint
P_{g,\min} \le P_g \le P_{g,\max}, \quad Q_{g,\min} \le Q_g \le Q_{g,\max}
where Pg,max and Pg,min are the upper and lower limits of the active power output from the main grid; Qg,max and Qg,min are the upper and lower limits of the reactive power output from the primary grid.
(6)
Transformer Reverse Loading Ratio Constraint: as shown in Equation (3).
Based on the above constraints, a distributed PV carrying capacity assessment model for distribution networks can be constructed to quantify the carrying capacity, providing an analytical basis for subsequent capacity improvement strategies.

3. Optimization Solution for Distributed Photovoltaic Carrying Capacity Based on Reinforcement Learning

The PV carrying capacity evaluation model constructed above balances the maximization of distributed PV integration capacity with the minimization of operational costs. However, due to the complex structure of distribution networks and the large number of variables in the optimization model, traditional methods such as Second-Order Cone Programming (SOCP) require tens of minutes for computation, failing to meet the requirements of minute-level dispatching. In contrast, reinforcement learning (RL) models offer rapid response, strong adaptability, and optimization capabilities. They can achieve autonomous learning through dynamic interaction without requiring large amounts of pre-labeled data, making them particularly suitable for addressing complex and uncertain decision-making problems. Therefore, this paper employs an RL model and uses the Deep Deterministic Policy Gradient (DDPG) algorithm to solve the optimization problem.

3.1. Basic Principles of Reinforcement Learning

Reinforcement learning (RL), a crucial branch in the field of machine learning, centers around the concept of maximizing expected rewards by iteratively constructing and refining an action strategy through environmental interaction. This process is inherently dynamic, where an agent continuously receives feedback from the environment, updates its policy based on this feedback, and eventually converges to an optimal strategy after numerous iterations [23]. RL’s distinctive “trial-and-error” learning mechanism grants it unparalleled advantages over traditional optimization methods when dealing with uncertain environments. The basic framework of reinforcement learning is shown in Figure 2.
The framework primarily consists of two key components: the environment and the agent. The agent is further composed of three elements: input I, reward R, and policy P. When performing a task, the agent first uses I to gather the current state from the environment. Based on this state, the agent selects and executes an action according to its current policy P. This action triggers a state transition in the environment, and the agent receives a reward signal as feedback. This interaction cycle repeats, allowing the agent to continuously collect new data and optimize its decision-making. Through iterative learning, the agent gradually refines its policy until an optimal strategy for the task is obtained.
This interaction between the agent and the environment can be formalized using a Markov Decision Process (MDP), defined by the five-tuple (S, A, R, P, γ): state space S encompasses all possible environmental states the agent may encounter; action space A comprises all feasible actions the agent can take in the environment; reward function R provides feedback to the agent, quantifying the desirability of actions; transition probability P describes the probability of transitioning between states given an action; discount factor γ balances immediate and future rewards, reflecting the value of long-term outcomes [23].

3.2. Deep Deterministic Policy Gradient (DDPG) Algorithm

The Deep Deterministic Policy Gradient (DDPG) algorithm combines the Deep Q-Network (DQN) with the Deterministic Policy Gradient (DPG) [24,25], fundamentally adopting an Actor–Critic architecture that approximates the policy and value function with two deep neural networks.

3.2.1. Actor–Critic Framework

In high-dimensional spaces, states and actions are represented by high-dimensional vectors s and a. In the DDPG algorithm, the Actor network (policy network) uses a neural network μ(s|θ^μ) to approximate the policy μ_θ(s), and the Critic network (value function network) uses a neural network Q(s,a|θ^Q) to approximate the action-value function Q(s,a). To mitigate instability in deep neural network training, such as parameter oscillations and overfitting, DDPG introduces target networks, copies of the Actor and Critic networks denoted μ′(s|θ^μ′) and Q′(s,a|θ^Q′), where θ^μ′ and θ^Q′ are the parameters of the target networks. The network structure is illustrated in Figure 3.
Based on the above, the four network functions of the DDPG algorithm are as follows:
(1)
Original Actor Network
After each training iteration, the Actor network selects optimal actions based on the current state and interacts with the environment to generate new states and immediate rewards. It outputs the policy that maximizes the state-action value function:
\max Q(s_j, a_j \mid \theta^Q), \quad a_j = \mu(s_j \mid \theta^\mu)
Its loss function is formulated as
\min L_\mu = -\frac{1}{N_{\text{batch}}} \sum_j Q(s_j, a_j \mid \theta^Q) \Big|_{a_j = \mu(s_j \mid \theta^\mu)}
where the subscript j denotes the index of Nbatch samples (state transition processes) selected from the experience replay buffer.
(2)
Target Actor Network
During the experience replay phase, given the next state sj+1 from the selected sample (sj, aj, rj+1, sj+1), the target Actor network outputs the corresponding next optimal action:
a' = \mu'(s_{j+1} \mid \theta^{\mu'})
(3)
Original Critic Network
The original Critic network evaluates the action-value function and gradually approximates the true action-value. Its loss function is expressed as
L_Q = \frac{1}{N_{\text{batch}}} \sum_j (\delta_j)^2 = \frac{1}{N_{\text{batch}}} \sum_j \left[ y_j - Q(s_j, a_j \mid \theta^Q) \right]^2
where δj is the temporal difference error:
\delta_j = r_{j+1} + \gamma Q'(s_{j+1}, \mu'(s_{j+1} \mid \theta^{\mu'}) \mid \theta^{Q'}) - Q(s_j, a_j \mid \theta^Q)
(4)
Target Critic Network
The target Critic network estimates the target Q-value yj of the action-value function based on the reward feedback from the environment. The expression for yj is
y_j = r_{j+1} + \gamma Q'(s_{j+1}, \mu'(s_{j+1} \mid \theta^{\mu'}) \mid \theta^{Q'})
where rj+1 represents the immediate reward obtained after executing the action at step j.

3.2.2. Parameter Updates

The core theoretical foundation of DDPG lies in the deterministic policy gradient theorem. For a deterministic policy a= μθ(s), its gradient is
\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_{\theta} \mu_\theta(s) \, \nabla_a Q(s, a) \big|_{a = \mu_\theta(s)} \right]
This gradient guides the Actor network updates using the Q-value gradient estimated by the Critic network, forming a closed-loop optimization where “the Actor makes decisions, and the Critic evaluates them.” The updates follow gradient descent, with the Actor network’s TD update rule
\theta^{\mu}_{j+1} = \theta^{\mu}_j + \alpha_\mu \nabla_a Q(s_j, a_j \mid \theta^Q) \big|_{a_j = \mu(s_j \mid \theta^\mu)} \nabla_{\theta^\mu} \mu(s_j \mid \theta^\mu)
where αμ is the learning rate for the Actor network.
The Critic network’s TD update rule is
\theta^{Q}_{j+1} = \theta^{Q}_j + \alpha_Q \, \delta_j \, \nabla_{\theta^Q} Q(s_j, a_j \mid \theta^Q)
where αQ is the learning rate for the Critic network.
Target network parameters can be updated in two ways: directly copying weights from the original network after a fixed number of steps, or softly updating the parameters toward the original network's values using a small learning rate τ. DDPG employs the second approach:
\theta^{\mu'}_{j+1} = \tau \theta^{\mu}_j + (1 - \tau) \theta^{\mu'}_j, \quad \theta^{Q'}_{j+1} = \tau \theta^{Q}_j + (1 - \tau) \theta^{Q'}_j
where τ stabilizes training by preventing drastic target value changes. The overall DDPG framework is shown in Figure 4 below.
During each training episode, the original Actor network selects an action based on the current state s_t. Exploration is enhanced by adding random noise N_t to this action, yielding the executed action a_t. The agent interacts with the environment, receiving the reward r_t+1 and the next state s_t+1. The transition (s_t, a_t, r_t+1, s_t+1) is stored in the experience replay buffer. Each learning step samples N_batch transitions from the buffer to update the network parameters via Equations (11)–(20).
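To make this interaction loop concrete, the following PyTorch sketch implements one DDPG learning step along the lines of Equations (11)–(20): sampling from a replay buffer, a Critic update against the target-network TD value, an Actor update that maximizes the Critic's Q estimate, and soft target-network updates. The network sizes, dummy transition data, and optimizer choice here are illustrative assumptions rather than the configuration used in this paper.

```python
# Minimal DDPG training-step sketch (replay buffer, Actor/Critic updates, soft targets).
import random
from collections import deque
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

s_dim, a_dim, gamma, tau = 8, 3, 0.99, 0.005        # illustrative dimensions
actor, critic = Actor(s_dim, a_dim), Critic(s_dim, a_dim)
actor_t, critic_t = Actor(s_dim, a_dim), Critic(s_dim, a_dim)
actor_t.load_state_dict(actor.state_dict()); critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)                      # experience replay buffer

def train_step(batch_size=64):
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)
    s, a, r, s2 = (torch.stack(x) for x in zip(*batch))
    # Critic: minimise the TD error against the target networks' value estimate.
    with torch.no_grad():
        y = r + gamma * critic_t(s2, actor_t(s2))
    loss_c = nn.functional.mse_loss(critic(s, a), y)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    # Actor: maximise Q(s, mu(s)), i.e. minimise its negative.
    loss_a = -critic(s, actor(s)).mean()
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    # Soft target-network updates with rate tau.
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

# Fill the buffer with dummy transitions and run a few updates.
for _ in range(200):
    buffer.append((torch.randn(s_dim), torch.rand(a_dim), torch.randn(1), torch.randn(s_dim)))
for _ in range(10):
    train_step()
```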

3.2.3. Advantages of the DDPG Algorithm in Optimizing Distributed PV Hosting Capacity in Distribution Networks

The Deep Deterministic Policy Gradient (DDPG) algorithm exhibits significant adaptability to the optimization problem of distributed PV hosting capacity in distribution networks, with its suitability primarily manifested in the following dimensions:
(1)
Sequential Decision-Making Attribute
Evaluating the distributed PV hosting capacity requires integrating load profiles and generation patterns across multiple time intervals, where historical data influences subsequent decisions. As a prominent reinforcement learning method based on the Markov Decision Process (MDP), DDPG inherently possesses sequential decision-making capabilities, making it well-suited for the temporal dynamics of PV hosting capacity analysis in distribution networks.
(2)
Full-Cycle Optimality
The core goal of optimizing distributed PV hosting capacity in distribution networks is to maximize the integrated PV capacity over the entire operating cycle. The training samples of the DDPG algorithm are state transition sequences (st, at, rt+1, st+1), which exhibit static characteristics of single-time-interval learning in data form. However, its learning process is oriented toward full-cycle optimality, guaranteed by three mechanisms: large discount factor settings, temporal difference (TD) solution methods, and environmental exploration mechanisms.
The discount factor quantifies the weight of future rewards in current decision-making. The action-value function learned by the Critic network is correlated with all subsequent decision behaviors, with its original mathematical expression as follows:
Q(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, a_t = a \right]
From this equation, any state-action value function Q(s,a) is essentially the mathematical expectation of the weighted sum of rewards from all possible subsequent decisions, discounted by the factor γ. Thus, to emphasize global decision effectiveness, a larger discount factor can strengthen the correlation between current and subsequent decisions. As long as the discount factor is non-zero, reinforcement learning inherently incorporates the impact of future actions during action selection, with its specific value only affecting algorithm convergence speed and stability.
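As a small numerical illustration (with hypothetical rewards), a discount factor close to 1 keeps later rewards strongly weighted in Q(s, a):

```python
# Tiny numeric example of the discounted return above; the rewards are hypothetical.
gamma = 0.99
rewards = [1.0, 0.5, 0.8, 1.2]                     # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
q = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(q)  # 1.0 + 0.99*0.5 + 0.99**2*0.8 + 0.99**3*1.2 ≈ 3.44
```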
The parameter update of Q(s,a) adopts the temporal difference method shown in Equation (16). This method does not require a complete transition path from the current state to the terminal state during iteration; instead, it can complete training using state transition fragments from each time interval, significantly improving training efficiency. Therefore, despite the time-interval-based static nature of DDPG training data, its optimization objective remains focused on full-cycle optimal decisions.
The environmental exploration mechanism ensures the global optimality of solutions. In any state s, the exploration mechanism randomly selects actions from the entire action space with a certain probability, ensuring sufficient traversal of all state and action spaces to avoid local optima and guarantee global optimality.
In summary, the solution results of the DDPG algorithm can fully meet the optimization requirements for distributed PV hosting capacity in distribution networks.
(3)
Differences and Advantages of DDPG Compared to Other Deep Reinforcement Learning Methods
DDPG differs significantly from other deep reinforcement learning methods. Taking PPO (Proximal Policy Optimization) and A2C (Advantage Actor–Critic) as examples—all three belong to the Actor–Critic framework, but their core mechanisms vary, endowing DDPG with unique advantages in solving the distributed PV hosting capacity problem of distribution networks.
In the distribution network PV hosting capacity problem, actions are continuous variables. DDPG adopts a “deterministic policy,” where the Actor network directly outputs optimal action values (rather than action probability distributions). Combined with deep networks, it generates precise continuous actions without discretizing the action space—discretization would reduce precision and lead to suboptimal solutions. In contrast, PPO and A2C inherently rely on “stochastic policies,” outputting action probability distributions. In continuous action spaces, they generate actions through sampling, which may introduce additional errors and lack precision in fine adjustments (e.g., slightly increasing PV capacity at a node to approach constraint boundaries).
The distribution network state space includes voltage, power, and other variables across multiple nodes, forming a high-dimensional continuous state. The relationship between states and actions exhibits strong nonlinearity due to line losses, reactive power compensation, etc. Both the Actor and Critic networks in DDPG use deep neural networks, efficiently approximating nonlinear mappings from high-dimensional states to actions/values. While PPO and A2C can also utilize deep networks, their on-policy nature makes them more sensitive to network structure—for instance, overly deep networks tend to overfit due to sample correlation. DDPG’s experience replay mechanism mitigates this issue.
Distribution network safety constraints (e.g., voltage deviation ≤ ±6%, line load rate ≤ 100%) are “hard constraints.” The algorithm must stably converge to constraint-satisfying solutions, avoiding infeasible solutions (e.g., voltage violations) caused by policy oscillations during training. DDPG introduces “dual target networks,” reducing fluctuations in value estimation by slowly updating target network parameters; meanwhile, the experience replay buffer breaks temporal correlation between samples, preventing oscillations in gradient descent. PPO improves stability through “policy update clipping” (clip mechanism) but remains inherently on-policy, leaving sample correlation unresolved. A2C uses asynchronous multi-thread training, which is prone to parameter oscillations due to policy differences between threads, resulting in poorer stability.
The continuous action characteristics, high-dimensional state space, and strict constraint requirements of the distribution network PV hosting capacity problem make DDPG more advantageous than PPO and A2C: its deterministic policy ensures precise continuous action output; the off-policy mechanism enhances sample efficiency; and dual target networks and experience replay improve training stability. Ultimately, DDPG can more efficiently and accurately solve for the maximum PV integration capacity while satisfying distribution network safety constraints.

3.3. Optimization of Distributed Photovoltaic Carrying Capacity Based on Deep Reinforcement Learning

When solving the distributed PV carrying capacity model for distribution networks using the Deep Deterministic Policy Gradient (DDPG) algorithm, the first step is to transform the PV carrying capacity model into a reinforcement learning (RL) five-tuple (S, A, R, P, γ) framework based on the Markov Decision Process (MDP), fully accounting for the objective function and all constraints. The objective function can be designed as the immediate reward component in the RL process, while a common strategy for handling constraints is to incorporate them into the immediate reward via penalty functions [26]. However, an excessive number of penalty functions can negatively impact the algorithm’s convergence and stability, and accurately determining penalty values is challenging. Therefore, it is more optimal to integrate constraints into the initial definition of the state and action spaces and the design of state transition rules.
(1)
State Space Selection
The state space S should include all factors influencing decision-making. For the problem of enhancing distributed PV carrying capacity in distribution networks studied here, the state space comprises node voltage magnitudes, line current magnitudes, and time-specific features (current hour, light intensity, active/reactive load, and PV power output indicators). The state space is defined in Table 1 as follows:
In this study, the analyzed distribution network system satisfies a connected radial structure, so the dimensions in the state space adhere to the following relationship:
N_U = N_I + 1
The state space S is expressed as
S = \{ S_{\text{DN}}, S_T \}
where SDN denotes the set of distribution network elements, including node voltage magnitudes and line current magnitudes; ST represents time-specific features, which are introduced to ensure that actions taken by the agent can better satisfy all constraints over a 24 h period.
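A minimal sketch of assembling this state vector is given below; it assumes the five time-specific features of Table 1 are scalar values (hour, light intensity, active load, reactive load, and a PV output indicator), and all numbers are placeholders.

```python
# Sketch of assembling the state S = {S_DN, S_T} from Table 1; data are placeholders.
import numpy as np

def build_state(v_nodes, i_lines, hour, irradiance, p_load, q_load, p_pv):
    assert len(v_nodes) == len(i_lines) + 1, "radial network: N_U = N_I + 1"
    s_dn = np.concatenate([v_nodes, i_lines])                        # S_DN
    s_t = np.array([hour / 24.0, irradiance, p_load, q_load, p_pv])  # S_T (5 features)
    return np.concatenate([s_dn, s_t])

state = build_state(np.full(33, 1.0), np.zeros(32), hour=12,
                    irradiance=0.9, p_load=0.6, q_load=0.2, p_pv=0.8)
print(state.shape)  # (33 + 32 + 5,) = (70,)
```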
(2)
Action Space Selection
To simplify the problem analysis, this study calculates the maximum installable capacity of distributed PV and minimizes curtailment under the premise of given PV installation locations and quantities. Therefore, the action space A—serving as the decision variable in the model—includes distributed PV capacity and curtailment. The action space is defined in Table 2 as follows:
The action space A is expressed as
A = \{ A_{pv}, A_{cpv} \}
where Apv denotes the set of distributed PV capacities; Acpv represents the set of PV curtailment amounts for each installed PV system at each time period t.
During each time period t of the day, based on the current state st, the original Actor network in the DDPG algorithm outputs the corresponding action at from the predefined action space A according to its decision logic. The Critic network then evaluates the quality of this decision using Q(s,a|θQ) based on the current state st and the action at from the Actor network. The specific architectures of these two networks are shown in Figure 3.
(3)
State Transition
The state transition probability P determines how the next state st+1 evolves in the environment after the Actor network outputs action at. Following the action at, node voltage magnitudes, line current magnitudes, and other state values in st+1 are accurately calculated using power flow equations. If the power flow calculation fails to converge, the current episode is terminated, and a large negative reward is assigned to penalize invalid actions. This process relies on physical models to ensure transition accuracy while handling outliers to prevent the agent from learning infeasible strategies, collectively forming the state transition probability P.
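As an illustration of this transition logic, the sketch below applies a PV action, re-runs an AC power flow, and terminates the episode with a large negative reward when the flow fails to converge. It uses the pandapower library and its built-in 33-bus case as a stand-in environment; the paper does not specify a power flow tool, so this choice and the bus/action values are assumptions.

```python
# Sketch of the state-transition step: apply the PV action, solve the power flow,
# and end the episode with a penalty if the flow does not converge (assumed setup).
import pandapower as pp
import pandapower.networks as pn

def transition(net, pv_buses, pv_p_mw, fail_penalty=-100.0):
    """Apply PV injections (the action), run the power flow, return (state, done, penalty)."""
    net.sgen.drop(net.sgen.index, inplace=True)           # reset previous PV injections
    for bus, p in zip(pv_buses, pv_p_mw):
        pp.create_sgen(net, bus=bus, p_mw=p, q_mvar=0.0)  # PV modelled as static generators
    try:
        pp.runpp(net)
    except pp.LoadflowNotConverged:
        return None, True, fail_penalty                   # invalid action: terminate episode
    voltages = net.res_bus.vm_pu.to_numpy()
    currents = net.res_line.i_ka.to_numpy()
    return (voltages, currents), False, 0.0

net = pn.case33bw()                                       # stand-in 33-bus test network
state, done, pen = transition(net, pv_buses=[5, 17, 24], pv_p_mw=[0.3, 0.3, 0.5])
print(done, state[0].max())
```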
(4)
Immediate Reward
For the immediate reward, variables in the objective function that need to be maximized are designed as reward elements in the reinforcement learning process. Meanwhile, voltage and current safety constraints are key constraints limiting the distributed photovoltaic (PV) carrying capacity of distribution networks [27,28]; thus, variables in the objective function that need to be minimized, along with voltage and current violation constraints, are incorporated into the immediate reward through penalty functions. The objective of this study is to maximize distributed PV capacity and minimize operational costs (including main grid power purchase costs, active power loss costs, and PV curtailment costs). Thus, the immediate reward rt+1 is defined as
r_{t+1} = R_1 f_1 - R_2 f_2 - R_3 V_{\text{over}} - R_4 I_{\text{over}}
where R1 is the reward coefficient for PV capacity; R2 is the penalty coefficient for operational costs; R3 and R4 are penalty coefficients for voltage and current violations, respectively; and Vover and Iover are the absolute values of voltage and current violations.
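A minimal sketch of this reward is shown below, assuming the voltage limits (0.94–1.06 p.u.) and line ampacity (600 A) later specified in Section 4.1; the coefficients R1–R4 are hypothetical tuning values.

```python
# Sketch of the immediate reward r_{t+1}; limits follow Section 4.1, R1-R4 are hypothetical.
import numpy as np

def immediate_reward(f1, f2, voltages, currents,
                     v_min=0.94, v_max=1.06, i_max=600.0,
                     R1=1.0, R2=0.1, R3=10.0, R4=10.0):
    v = np.asarray(voltages)
    v_over = np.sum(np.maximum(v - v_max, 0.0) + np.maximum(v_min - v, 0.0))  # voltage violations
    i_over = np.sum(np.maximum(np.asarray(currents) - i_max, 0.0))            # current violations
    return R1 * f1 - R2 * f2 - R3 * v_over - R4 * i_over

print(immediate_reward(f1=6.5, f2=2.1, voltages=[1.00, 1.05, 1.07], currents=[520.0, 610.0]))
```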
(5)
Discount Factor
The sum of immediate rewards over an episode constitutes the total return, where the discount factor γ (0 ≤ γ ≤ 1) balances the importance of immediate and future rewards. A value closer to 1 emphasizes long-term cumulative rewards. In this study, maximizing PV capacity over an episode—with current decisions influencing future choices—requires prioritizing cumulative rewards. Therefore, γ is set to 0.99 to highlight long-term benefits.
All objective functions, constraints, and improvement measures for the distributed PV carrying capacity model in distribution networks are thus integrated into the proposed RL framework.

4. Simulation Examples and Analysis

4.1. Example Environment Description

This paper uses the IEEE-33 bus distribution system as an example to calculate the distributed PV carrying capacity of the distribution network based on the reinforcement learning model and to verify the effectiveness of the carrying capacity improvement measures adopted in this paper. Since Python indexing in the reinforcement learning implementation starts from 0, the node numbers of the IEEE-33 bus system are reduced by one, and the renumbered nodes are used throughout the following text. The IEEE-33 bus distribution network system is shown in Figure 5 [29]. To simulate a distribution network with a high proportion of distributed PV, a distributed PV system is connected to every other node from node 2 to node 32, and each PV capacity is treated as a decision variable to be optimized.
The lower limit of the node voltage in the distribution network is set to 0.94 p.u. and the upper limit to 1.06 p.u. The maximum allowable ampacity of the transmission lines is set to 600 A, which depends on the conductor material. The light intensity is fitted from the PV output curve of a given location, with 24 points taken evenly over the 24 h of a day. The branch impedance data are the same as in reference [30]; the node load data are obtained by multiplying the data in reference [30] by the 24 h load demand of a given location to form the 24 h load profile of the IEEE-33 bus system. The power base value is 1 MW, and the voltage base value is 12.66 kV.
The DDPG network structure is shown in Figure 3. The Actor network has two hidden layers with 256 and 128 neurons, respectively; the Critic network likewise has two hidden layers with 256 and 128 neurons. The hidden-layer activation function is ReLU. The learning rate of the Actor network is αμ = 0.0001 and that of the Critic network is αQ = 0.001. The soft-update rate of the target networks is τ = 0.005, so only a small fraction of new parameters is mixed in at each update to maintain stability. The experience replay buffer capacity is 10^5, and sampling is random with a batch size of 64.
Based on the above parameters, this paper uses Python 3.9.0 with Visual Studio Code 1.100.3 to perform the calculations on a computer with an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz processor; the algorithm is run for 1500 iterations.
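For reference, the sketch below instantiates the Actor and Critic networks with the layer sizes, activation, learning rates, soft-update rate, buffer capacity, batch size, and iteration count listed above; the state/action dimensions and the Sigmoid output layer (to keep capacities and curtailment non-negative) are illustrative assumptions.

```python
# Sketch of the network sizes and hyperparameters reported in Section 4.1.
# s_dim/a_dim and the Sigmoid output layer are assumptions for illustration only.
import torch
import torch.nn as nn

def make_actor(s_dim, a_dim):
    return nn.Sequential(nn.Linear(s_dim, 256), nn.ReLU(),
                         nn.Linear(256, 128), nn.ReLU(),
                         nn.Linear(128, a_dim), nn.Sigmoid())  # non-negative action outputs

def make_critic(s_dim, a_dim):
    return nn.Sequential(nn.Linear(s_dim + a_dim, 256), nn.ReLU(),
                         nn.Linear(256, 128), nn.ReLU(),
                         nn.Linear(128, 1))

s_dim, a_dim = 70, 32   # e.g. 33 voltages + 32 currents + 5 time features; 16 capacities + 16 curtailments
actor, critic = make_actor(s_dim, a_dim), make_critic(s_dim, a_dim)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)    # alpha_mu
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)  # alpha_Q
TAU, BUFFER_SIZE, BATCH_SIZE, N_ITER = 0.005, 10**5, 64, 1500
```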

4.2. Example Results and Analysis

Based on the previously established reinforcement learning theoretical framework, the Deep Deterministic Policy Gradient (DDPG) algorithm is employed to iteratively train the model and optimize its parameters. The total training time amounts to 1995.47 s, and the average decision-making time is 0.17 milliseconds.
Within the DDPG algorithm architecture, the Critic network’s loss function serves as a quantitative metric for evaluating the model’s learning performance and convergence status, providing core feedback during the iterative training process. The dynamic evolution of the Critic network’s loss function is shown in Figure 6.
The analysis of the results in the figure shows that the constructed model exhibits excellent learning performance and convergence characteristics. In the initial stage of the DDPG algorithm, the focus is mainly on environment exploration, and then the algorithm gradually converges as the number of steps increases. It can be seen that the model trains well, achieving relatively stable convergence at about half the total number of steps.
Additionally, the variation in the total reward value during the training process is illustrated in Figure 7. As shown in the figure, the total reward gradually increases and stabilizes over time, indicating that the model has been trained effectively. The total reward reaches a high level at about half of the total number of training steps and then remains stable, with occasional small fluctuations that do not significantly affect the overall result.
The trained model is used to solve the distributed PV carrying capacity model for the distribution network, with results presented in Figure 8 and Figure 9.
Figure 8 shows the 24 h curves of total power output and curtailment for PV systems connected to 16 nodes, as well as the variation in primary grid power purchase and active power loss. Figure 9 displays the optimal capacity configuration for each PV node after training. The maximum installable distributed PV capacity in the distribution network is 6.5680 p.u., with a total daily curtailment of 0.6489 p.u. (1.59% of total generation), main grid power purchase of 27.2667 p.u., and total active power loss of 22.3294 p.u.
Analyzing the results above, the bar chart shows that node 24 has the highest PV capacity, close to 1.2 p.u. Node 24 is located at the end of the distribution network, where node voltages are relatively less affected by the upstream supply, so the impact of PV integration on the overall network voltage is comparatively controllable. A large power injection there is less likely to cause problems such as voltage violations, so the upper limit of PV capacity that can be connected is higher.
Node 10 has a PV capacity of about 0.7 p.u., node 26 about 1.0 p.u., node 28 about 0.8 p.u., and node 30 about 0.7 p.u., which are among the higher values of all PV-connected nodes. These nodes occupy favorable positions in the network topology. Nodes 28 and 30, for example, are located on feeder branches at some distance from the supply point, giving them relatively flexible power regulation margins; under the voltage and power balance constraints they can therefore accommodate comparatively large PV capacities. Nodes 4, 6, 8, 12, 14, 16, and 32 have smaller and similar capacities, between 0.2 and 0.4 p.u. Nodes 2, 18, 20, and 22 have the smallest capacities, close to 0, at the low end of the overall range.

4.3. Rationality Analysis of Results

The rationality of the results is analyzed from the perspective of voltage constraints based on the constructed distributed PV carrying capacity model for the distribution network. The voltage variations in all nodes in the distribution system over 24 h under the above capacity configuration are shown in Figure 10.
Figure 10 shows the following:
  • 0:00–6:00: Node voltages remain stable at 0.96–1.00 p.u. with minimal fluctuations. During this nighttime period, PV systems generate little to no power, and the distribution network is primarily supplied by traditional power sources with stable loads, resulting in steady voltages;
  • 6:00–10:00: Node voltages gradually rise, with some exceeding 1.00 p.u. and reaching relatively high levels around 10:00 (most at 1.02–1.06 p.u.). As daylight increases, PV systems start generating power, injecting electricity into the distribution network. According to Equation (1), increased power injection at nodes—with constant or slightly changing loads—raises voltages;
  • 10:00–14:00: Voltages remain high, with minor fluctuations at some nodes. This period coincides with the peak PV output, where continuous power injection sustains elevated voltages;
  • 14:00–18:00: Voltages gradually decline from high levels to near 1.00 p.u. As light intensity weakens, PV output decreases, reducing power injection and causing voltages to drop;
  • 18:00–23:00: Voltages continue to decline and stabilize at 0.96–1.00 p.u., similar to nighttime levels. PV systems stop generating power, and the network relies again on traditional power sources. Stable loads lead to stable voltages.
The analysis of the voltage curves reveals that during peak PV output, node voltages approach the allowable upper limit. With the current PV capacity configuration, further increasing PV integration under constant curtailment rates could easily trigger voltage violations, threatening power system security. Conversely, expanding PV capacity while maintaining voltage within safe thresholds (under unchanged conditions) would increase curtailment and operational costs, contradicting the study’s optimization objectives. These results align with actual operational patterns, demonstrating high rationality and reliability.

5. Conclusions

This paper addresses the problem of enhancing the carrying capacity of distributed photovoltaic (PV) systems in distribution networks, focusing on the maximum installable PV capacity under safe operation constraints (i.e., distributed PV carrying capacity) in networks with high PV penetration. Deep reinforcement learning (DRL) is employed for carrying capacity analysis. The main conclusions are as follows:
(1) A methodology for analyzing distributed PV carrying capacity in distribution networks is proposed. By examining PV power output structures, key factors influencing carrying capacity are identified. A quantitative model for distributed PV carrying capacity in distribution networks is established, along with metrics to quantify this capacity.
(2) Reinforcement learning is adopted to optimize and solve the distributed photovoltaic (PV) carrying capacity problem. First, the basic principles of reinforcement learning and the DDPG algorithm are introduced. Then, a corresponding reinforcement learning model is constructed based on the objective function and constraints of the distributed PV carrying capacity model for distribution networks. Furthermore, by analyzing the differences between the DDPG algorithm and other deep reinforcement learning methods, it is concluded that the DDPG algorithm is highly suitable for solving the distributed PV carrying capacity of distribution networks. Finally, the IEEE-33 bus distribution system is taken as an example to verify the feasibility of the DDPG algorithm proposed in this paper for the research content.
In addition, although this paper proposes an innovative method based on deep reinforcement learning and achieves phased results in theoretical analysis, due to limitations in research time and objective conditions, the existing research still has several directions for optimization and expansion, as follows:
(1) Practical promotion of the proposed method in larger-scale power systems. The model established in this paper is only verified through simulation in the IEEE-33 bus distribution system. Whether the method is applicable to larger-scale power grids remains to be confirmed.
(2) Excessively long algorithm training time and occasional poor convergence. Although the decision-making time after training is short (at the millisecond level), the average training time for a single run exceeds 30 min, and it will be even longer for larger-scale power grids, indicating that the learning efficiency needs to be improved.
(3) Transfer application of deep reinforcement learning. The algorithm adopted in this paper is only studied and applied to the analysis of distributed PV carrying capacity in distribution networks. It is a solution for specific problems under specific constraints, lacking research on its universality.

Author Contributions

Formal analysis and Writing—original draft, S.S.; Data curation and Conceptualization, S.Y.; Investigation and Funding acquisition, P.Y.; Project administration and Methodology, Y.C.; Investigation and Project administration, J.X.; Software and Resources, Y.W.; Validation and Visualization, Y.Y.; Software and Writing—review and editing, Z.H.; Writing—review and editing and Supervision, L.Y.; Writing—review and editing, X.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by State Grid Shandong Electric Power Research Institute, grant number “52062623003K”, and the APC was funded by State Grid Shandong Electric Power Research Institute. This paper is supported by the project: Research and Application of Active Synchronization Grid Formation and Grid Interaction Technology for Large-Scale Distributed Photovoltaic Storage Clusters—Topic 4: Research on Main and Distribution Grid Coordination and Interaction Support Technology under Conditions of Extremely High Penetration of Distributed Photovoltaic Storage by State Grid Shandong Electric Power Research Institute (52062623003K).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to express our gratitude to the editors and reviewers for their suggestions regarding the publication of this paper. Your suggestions and assistance have been of great help in improving this paper. We would also like to thank State Grid Shandong Electric Power Research Institute for their assistance and cooperation in this project (52062623003K).

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the Data Availability Statement. This change does not affect the scientific content of the article.

Abbreviations

The following abbreviations are used in this manuscript:
PV: Photovoltaic
RL: Reinforcement learning
DRL: Deep reinforcement learning
MDP: Markov Decision Process
DDPG: Deep deterministic policy gradient
PPO: Proximal Policy Optimization
A2C: Advantage Actor–Critic

References

  1. Chai, Y.Y.; Guo, L. Distributed Voltage Control of Distribution Network with High Penetration Photovoltaics. Power Syst. Technol. 2018, 42, 738–746. [Google Scholar] [CrossRef]
  2. National Energy Administration. Guidelines for Assessment of Distributed Generation Access to Power Grid Capacity (DL/T 2041—2019); China Electric Power Press: Beijing, China, 2019.
  3. Deng, Z.; Liu, M.; Chen, H.; Lu, W.; Dong, P. Optimal Scheduling and Operation of Active Distribution Network Considering Comprehensive Capacity. Electr. Power Constr. 2020, 41, 67–75. [Google Scholar] [CrossRef]
  4. Qi, X.G.; Xi, P.; Liu, H.; Xu, Z. Analysis of Distribution Network’s Acceptance Capacity for Distributed Photovoltaics. Adv. Electr. Power Energy 2017, 5, 143–152. [Google Scholar] [CrossRef]
  5. Hao, W.B.; Meng, Z.G.; Zhang, Y.; Xie, B.; Peng, P.; Wei, J. Research on Capacity Assessment Method for Multi-Distributed Generation Access to Distribution Network under New Power System. Power Syst. Prot. Control 2023, 51, 23–33. [Google Scholar] [CrossRef]
  6. Wang, J.; Xie, H.; Sun, J. Energy Optimal Scheduling of Active Distribution Network Based on Chance-Constrained Programming. Power Syst. Prot. Control 2014, 42, 45–52. [Google Scholar] [CrossRef]
  7. Luo, J.M.; Liu, L.Y.; Liu, P.; Ye, R.; Qin, P. Research on Optimal Scheduling Method of Active Distribution Network Considering Source-Network-Load-Storage Coordination. Power Syst. Prot. Control 2022, 50, 167–173. [Google Scholar] [CrossRef]
  8. Yu, T.; Zhou, B. Application and Prospect of Reinforcement Learning in Power Systems. Power Syst. Prot. Control 2009, 37, 122–128. [Google Scholar] [CrossRef]
  9. Huang, H.; Zhang, A.A. Load Forecasting of Integrated Energy System Based on Decomposition Algorithm Combined with Meta-Learning. Autom. Electr. Power Syst. 2024, 48, 151–160. [Google Scholar] [CrossRef]
  10. Xi, W.; Li, P.; Li, P.; Cai, T.T.; Wei, M.J.; Yu, H. Distributed Generation Local Adaptive Voltage Control Method Based on Deep Reinforcement Learning. Autom. Electr. Power Syst. 2022, 46, 25–31. [Google Scholar] [CrossRef]
  11. Hu, D.E.; Peng, Y.G.; Wei, W.; Xiao, T.; Cai, T.; Xi, W. Multi-Time-Scale Deep Reinforcement Learning Reactive Power Optimization Strategy for Distribution Network. Proc. CSEE 2022, 42, 5034–5044. [Google Scholar] [CrossRef]
  12. Hu, W.H.; Cao, D.; Huang, Q.; Zhang, B.; Li, S.; Chen, Z. Application of Deep Reinforcement Learning in Optimal Operation of Distribution Network. Autom. Electr. Power Syst. 2023, 47, 174–191. [Google Scholar] [CrossRef]
  13. Zhang, Z.D.; Zhang, D.X.; Qiu, R.C. Deep Reinforcement Learning for Power System Applications: An Overview. CSEE J. Power Energy Syst. 2020, 6, 213–225. [Google Scholar] [CrossRef]
  14. Xi, L.; Zhou, L.; Xu, Y.; Chen, X. A Multi-Step Unified Reinforcement Learning Method for Automatic Generation Control in Multi-Area Interconnected Power Grid. IEEE Trans. Sustain. Energy 2021, 12, 1406–1415. [Google Scholar] [CrossRef]
  15. El Helou, R.; Kalathil, D.; Xie, L. Fully Decentralized Reinforcement Learning-Based Control of Photovoltaics in Distribution Grids for Joint Provision of Real and Reactive Power. arXiv 2020, arXiv:2008.01231. [Google Scholar]
  16. Wang, J.H.; Xu, W.K.; Gu, Y.J.; Song, W.; Green, T.C. Multi-Agent Reinforcement Learning for Active Voltage Control on Power Distribution Networks. Advances in Neural Information Processing Systems. arXiv 2021. [Google Scholar] [CrossRef]
  17. Suchithra, J.; Robinson, D.; Rajabi, A. Hosting Capacity Assessment Strategies and Reinforcement Learning Methods for Coordinated Voltage Control in Electricity Distribution Networks: A Review. Energies 2023, 16, 2371. [Google Scholar] [CrossRef]
  18. Zhai, S.W.; Li, W.Y.; Qiu, Z.Y.; Zhang, X.; Hou, X. An Improved Deep Reinforcement Learning Method for Dispatch Optimization Strategy of Modern Power Systems. Entropy 2023, 25, 546. [Google Scholar] [CrossRef] [PubMed]
  19. Sang, J.S.; Sun, H.B.; Kou, L. Deep Reinforcement Learning Microgrid Optimization Strategy Considering Priority Flexible Demand Side. Sensors 2022, 22, 2256. [Google Scholar] [CrossRef] [PubMed]
  20. Wang, L.; Hu, G.; Wu, H.; Tan, K.; Zhou, C.; Zhu, Y.J. Multi-Energy Coordinated Optimization Method for Distributed Energy System Based on Hierarchical Deep Reinforcement Learning. Autom. Electr. Power Syst. 2024, 48, 67–76. [Google Scholar] [CrossRef]
  21. Ding, Q.X.; Qin, H.P.; Wan, C.; Peng, D.; Li, Y. Assessment of Distributed Photovoltaic Hosting Capacity in Distribution Network Based on Chance-Constrained Programming. J. Northeast. Electr. Power Univ. 2022, 42, 28–38. [Google Scholar] [CrossRef]
  22. Zhu, Y.Q. New Energy and Distributed Generation Technology, 3rd ed.; Zhu, L.Z., Zhao, H.Y., Eds.; Peking University Press: Beijing, China, 2022; pp. 43–44. [Google Scholar]
  23. Peng, L.Y. Adaptive Uncertainty Economic Dispatch of Power Systems Based on Deep Reinforcement Learning. Master’s Thesis, Wuhan University, Wuhan, China, 2020. [Google Scholar]
  24. Ni, S.; Cui, C.G. Multi-Time-Scale Online Reactive Power Optimization of Distribution Network Based on Deep Reinforcement Learning. Autom. Electr. Power Syst. 2021, 45, 77–85. [Google Scholar] [CrossRef]
  25. Hasselt, H.V.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. arXiv 2015, arXiv:1509.06461. [Google Scholar] [CrossRef]
  26. Wang, Y. Research on Optimal Scheduling of Power Systems with Wind Power Based on Chance-Constrained Goal Programming. Master’s Thesis, North China Electric Power University, Beijing, China, 2017. [Google Scholar]
  27. Abad, M.S.S.; Ma, J.; Zhang, D.W.; Ahmadyar, A.S.; Marzooghi, H. Probabilistic Assessment of Hosting Capacity in Radial Distribution Systems. IEEE Trans. Sustain. Energy 2018, 9, 1935–1947. [Google Scholar] [CrossRef]
  28. Li, Z.; Bao, X.; Shao, Y.; Peng, P.; Wang, W. Studying Accommodation Ability of Distributed Photovoltaic Considering Various Voltage Regulation Measures. Power Syst. Prot. Control 2018, 46, 10–16. [Google Scholar] [CrossRef]
  29. Zheng, Y.Z.; Zhou, K.; Yang, Y.; Diao, H.; Hua, L.; Wang, R.; Liu, K.; Guo, Q. Robust Assessment Method for Hosting Capacity of Distribution Network in Mountainous Areas for Distributed Photovoltaics. Energies 2025, 18, 2394. [Google Scholar] [CrossRef]
  30. Ding, T.; Li, F.X.; Li, X.; Sun, H.B.; Bo, R. Interval Radial Power Flow Using Extended DistFlow Formulation and Krawczyk Iteration Method with Sparse Approximate Inverse Preconditioner. IET Gener. Transm. Distrib. 2015, 9, 1998–2006. [Google Scholar] [CrossRef]
Figure 1. Distribution grid structure with high percentage of distributed photovoltaics.
Figure 2. Basic framework of reinforcement learning.
Figure 3. Network structures of Actor (a), Critic (b), and their target networks.
Figure 4. Overall Framework of DDPG.
Figure 5. IEEE-33 node distribution system [29].
Figure 6. Variation in Critic network loss function.
Figure 7. Variation in total reward value.
Figure 8. Curves of total PV power output, total curtailment, main grid power purchase, and active power loss.
Figure 9. PV node capacity configuration.
Figure 10. Voltage variations in nodes in the distribution system.
Table 1. State space definition.
Component | Dimension
Node voltage magnitudes | N_U
Line current magnitudes | N_I
Time-specific features | 5

Table 2. Action space definition.
Component | Dimension
Distributed PV capacity | N_pv
Distributed PV curtailment | N_pv × T
