Article

Deep Reinforcement Learning-Based Voltage Regulation Using Electric Springs in Active Distribution Networks

by Jesus Ignacio Lara-Perez 1, Gerardo Trejo-Caballero 2, Guillermo Tapia-Tinoco 3, Luis Enrique Raya-González 4 and Arturo Garcia-Perez 1,*
1 Electronic Engineering Department, University of Guanajuato, Salamanca 36885, Mexico
2 Mechatronic Engineering Department, Tecnológico Nacional de México/Instituto Tecnológico Superior de Irapuato, Irapuato 36821, Mexico
3 Agricultural Engineering Department, University of Guanajuato, Irapuato 36500, Mexico
4 Postgraduate in Biosciences, University of Guanajuato, Irapuato 36500, Mexico
* Author to whom correspondence should be addressed.
Technologies 2026, 14(2), 87; https://doi.org/10.3390/technologies14020087
Submission received: 20 December 2025 / Revised: 22 January 2026 / Accepted: 28 January 2026 / Published: 1 February 2026

Abstract

The increasing penetration of distributed generation in active distribution networks (ADNs) introduces significant voltage regulation challenges due to the intermittent nature of renewable energy sources. Electric springs (ESs) have emerged as a cost-effective alternative to conventional FACTS devices for voltage regulation, requiring minimal energy storage while providing fast, flexible reactive power compensation. This paper proposes a deep reinforcement learning (DRL)-based approach for voltage regulation in balanced active distribution networks with distributed generation. Electric springs are deployed at selected buses in series with noncritical loads to provide flexible voltage support. The main contributions of this work are: (1) a novel region-based penalized reward function that effectively guides the DRL agent to minimize voltage deviations; (2) a coordinated control strategy for multiple ESs using the Deep Deterministic Policy Gradient (DDPG) algorithm, representing the first application of DRL to ES-based voltage regulation; (3) a systematic hyperparameter tuning methodology that significantly improves controller performance; and (4) comprehensive validation demonstrating an approximately 40% reduction in mean voltage deviation relative to the no-control baseline. Three well-known continuous-control DRL algorithms, Twin Delayed Deep Deterministic Policy Gradient (TD3), Proximal Policy Optimization (PPO), and DDPG, are first evaluated using the default hyperparameter configurations provided by MATLAB R2022b. Based on this baseline comparison, a dedicated hyperparameter-tuning procedure is then applied to DDPG to improve the robustness and performance of the resulting controller. The proposed approach is evaluated through simulation studies on the IEEE 33-bus and IEEE 69-bus test systems with time-varying load profiles and fluctuating renewable generation scenarios.

1. Introduction

Traditional power grids have undergone significant evolution in recent years due to the increasing integration of distributed generation (DG) resources, transforming them into ADNs characterized by bidirectional power flows and dynamic, time-varying operating conditions. Among the key advantages of this new paradigm are reduced transmission losses, enhanced resilience to power outages, and lower greenhouse gas emissions [1].
However, the high penetration of renewable energy sources introduces significant voltage regulation challenges due to their inherent intermittency and variability. Voltage regulation in ADNs has been extensively investigated using various control strategies, each characterized by distinct levels of coordination and communication architecture [2]. In addition, many of these schemes incorporate intelligent control techniques to cope with the increasing complexity and uncertainty of modern power systems. Among the most widely used approaches are metaheuristic algorithms and reinforcement learning methods, which enable real-time system optimization under dynamic, nonlinear operating conditions. The controlled elements include renewable energy sources, Flexible AC Transmission Systems/Distributed Flexible AC Transmission Systems (FACTS/D-FACTS), compensation devices, electric vehicles, and energy storage systems (ESSs), all aimed at maintaining voltage stability and enhancing overall system performance [3].
In particular, metaheuristic algorithms have been applied in both local and centralized voltage-control frameworks to tune reactive-power compensator setpoints and transformer tap positions [4,5]. Nevertheless, these methods exhibit significant limitations in real-time applications. They rely on static parameters, lack online feedback, incur high computational costs, and require manual tuning of algorithm hyperparameters. These factors reduce their robustness in the face of changing scenarios.

1.1. Deep Reinforcement Learning for Voltage Control

Deep reinforcement learning algorithms have gained relevance in power systems applications due to their capacity to learn optimal policies directly from interaction with the environment, without requiring an explicit model of the network [3]. Recent studies in energy applications have also reported model-free learning-based control strategies designed to handle uncertainties [6,7,8]. Unlike traditional optimization methods, DRL agents can adapt to changing conditions in real time and handle the high-dimensional state and action spaces typical of modern ADNs.
Several DRL algorithms have been successfully applied to voltage regulation problems. The DDPG algorithm [9] enables continuous control by combining deep neural networks with actor–critic architectures, making it suitable for reactive power dispatch applications. Cao et al. [2] demonstrated the effectiveness of multi-agent DRL for decentralized voltage control with high PV penetration. The TD3 algorithm [10] addresses overestimation issues in DDPG via twin critics and delayed policy updates, improving learning stability. More recently, the Soft Actor–Critic (SAC) algorithm has been applied to hierarchical Volt/VAR control, achieving robust performance under varying network conditions [11]. Additionally, PPO [12] has emerged as an alternative policy optimization method with improved training stability.
DRL has been successfully applied to tasks such as demand response management [13], ESS management [14], voltage control in ADNs [15], and hybrid schemes such as hierarchical Volt/VAR control using the SAC algorithm [11]. Perera and Kamalaruban [16] provide a comprehensive review of DRL applications in energy systems, highlighting the growing adoption of these techniques for real-time control problems.

1.2. Voltage Regulation Devices: From FACTS to Electric Springs

Along with these methodological advances, innovations in voltage regulation devices have emerged, with the electric spring (ES) standing out in recent years. This device enables flexible, real-time voltage regulation that adapts to load and generation variability [17]. From an operating-principle perspective, ESs can provide fast-acting, distributed voltage support suitable for real-time operation, similar to devices such as ESSs and Static Synchronous Compensators (STATCOMs) [14,17]. The ES operates by modulating the voltage across a series-connected noncritical load, effectively creating a “smart load” that can exchange reactive power as needed. This operating principle enables voltage regulation at the point of common coupling (PCC) and supports demand-side management. Note that this work does not aim to compare different grid-support devices (STATCOMs, D-FACTS, ESSs), but rather to systematically compare and tune different deep reinforcement learning techniques. In this context, the ES is adopted as a representative case study due to its nonlinear dynamics and demand-side implementation.
Table 1 summarizes the main characteristics of different voltage regulation approaches, highlighting the advantages of ES-based solutions.
Cooperative and centralized schemes have been proposed using multiple ESs, employing algorithms such as Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) to optimize their operation and mitigate voltage deviations [18]. Chen et al. [19] developed distributed cooperative control strategies for multiple DC electric springs, demonstrating the feasibility of coordinated ES operation in distribution networks. More recently, Saha and Dutta [20] extended ES applications to combined voltage–frequency regulation in multi-area power systems, demonstrating the versatility of this technology.
Akhtar et al. [21] demonstrated that smart loads with ESs can contribute to primary frequency control through reactive compensation, while Luo et al. [17] provided a detailed comparison between ES-based distributed voltage control and STATCOM, concluding that ESs offer comparable performance at lower cost for distribution network applications.

1.3. Research Gap and Contributions

Despite the advances achieved, most existing approaches using metaheuristic algorithms and ES devices for voltage control still exhibit significant limitations for real-time operation and in high-uncertainty contexts.
In contrast, DRL algorithms offer clear advantages in dynamic environments, providing faster response times and more precise voltage control. However, despite advances in both DRL-based control and ES technology, no prior studies have addressed their combined application for voltage control in ADNs.
This gap is significant because: (1) ESs offer a cost-effective alternative to conventional FACTS devices for voltage regulation, (2) DRL provides the adaptive, model-free control capability needed to handle the uncertainty of renewable generation, and (3) the combination of both technologies has the potential to achieve superior voltage regulation performance while minimizing infrastructure costs.
This paper presents a centralized voltage control scheme based on the DDPG algorithm for the coordinated operation of multiple ESs in an ADN. The main contributions of this work are as follows:
  • A novel region-based penalized reward function that effectively guides the DRL agent to minimize voltage deviations while avoiding aggressive control actions.
  • A coordinated control strategy for multiple ESs using the DDPG algorithm, representing the first application of DRL techniques to ES-based voltage regulation in ADNs.
  • A systematic hyperparameter tuning methodology based on grid search that significantly improves the performance and robustness of the DDPG controller compared to default configurations.
  • Comprehensive validation through simulation studies on the IEEE 33-bus and IEEE 69-bus test systems, demonstrating approximately a 40% reduction in mean voltage deviation compared to the no-control baseline.
The control agent is trained and validated through offline simulations under multiple load variations and active-power-injection scenarios. The reward function is explicitly formulated to minimize voltage deviation at the buses. The results demonstrate the feasibility and effectiveness of the proposed method, highlighting its ability to provide fast, robust regulation in the face of fluctuations and dynamic changes in the system.
The remainder of this paper is organized as follows: Section 2 presents the theoretical background on ES modeling and DRL algorithms. Section 3 describes the proposed framework and methodology. Section 4 details the case study setup. Section 5 presents and analyzes the simulation results. Section 6 discusses the findings, limitations, and future directions. Finally, Section 7 concludes the paper.

2. Modeling and Theoretical Background of the Electric Spring and the DDPG Algorithm

This section establishes the theoretical basis of this work. First, it presents the steady-state mathematical model of the ES, followed by a concise overview of the DDPG algorithm equations.

2.1. Electric Spring Operation for Voltage Regulation in Active Distribution Networks

The power stage of the ES comprises a DC/AC PWM inverter, a DC-link capacitor, and a low-pass output filter. Since the DC-link capacitor is the sole energy buffering element, the system operates exclusively through reactive power exchange. For analytical simplicity, the ES is modeled as an ideal controlled AC voltage source.
The ES is connected in series with a constant-impedance load Z_{NC,j}, referred to as the noncritical load (NCL), forming the smart load (SL). The capacity of this SL depends on the ES output voltage V_{ES,j} and the impedance Z_{NC,j}. Figure 1 illustrates the connection of the SL to an arbitrary bus j of the ADN, in parallel with a ZIP load representing the critical load (CL). These elements are connected to the slack bus, represented by a Thévenin equivalent with voltage V_G, while Z_{ij} denotes the impedance of the distribution line connecting buses i and j. The current phasors include the ZIP load current (I_{ZIP}), the distribution line current (I_{ij}), the current in the NCL (I_{NC,j}), and the current flowing to a further connection bus (I_{jk}). Furthermore, V_j represents the PCC voltage for the SL. ES activation is defined by the binary variable A_j ∈ {0, 1}, where A_j = 1 denotes activated and A_j = 0 denotes deactivated.
Reference [23] outlines the procedure for establishing the steady-state single-phase ES model, demonstrating high reliability when integrated with the backward–forward sweep method (BFSM). This process aims to solve the ADN iteratively by updating the voltage and current values, computing the ES model, and performing power flow analysis. The accuracy of the proposed model is not evaluated in this work, as it has been addressed and validated in prior studies [23,24]. In this study, the ES is used as a test platform to evaluate the performance and adaptability of different DRL-based control strategies, rather than as a means to assess or benchmark grid-support technologies.
This work uses a ZIP load model for the CL. The equations for active power (P_j) and reactive power (Q_j) are given in (1) and (2), respectively. In particular, this model employs the parameters f_{1j}–f_{6j} as weighting factors for the ZIP load components (constant impedance, constant current, and constant power), subject to the normalization constraints f_{1j} + f_{2j} + f_{3j} = 1 and f_{4j} + f_{5j} + f_{6j} = 1.
For this model, a constant-power ZIP load is used. The parameters P_{O,j} and Q_{O,j} denote the rated active and reactive power, respectively, while V_F specifies the nominal voltage magnitude of the CL.
P_j = P_{O,j} \left[ f_{1j} + f_{2j} \frac{V_j}{V_F} + f_{3j} \left( \frac{V_j}{V_F} \right)^{2} \right], (1)
Q_j = Q_{O,j} \left[ f_{4j} + f_{5j} \frac{V_j}{V_F} + f_{6j} \left( \frac{V_j}{V_F} \right)^{2} \right]. (2)
From the apparent power components of the CL, the current phasor I_{ZIP} is obtained as shown in Equation (3), in which Y_{ZIP} denotes the overall admittance of the ZIP load.
I_{ZIP} = \frac{P_j - jQ_j}{V_j V_j^{*}} V_j = Y_{ZIP} V_j. (3)
Similarly, I_{jk} is computed using the admittance form, as depicted in Equation (4), where the equivalent admittance of the line segment jk is represented by Y_{jk}.
I_{jk} = \frac{I_{jk}}{V_j} V_j = Y_{jk} V_j. (4)
During the iterative solution process, Equation (5) computes the phasor V_G. Both its magnitude and angle are dynamically adjusted in response to voltage deviations at the PCC, variations in the distribution line parameters, and changes in the equivalent admittances Y_{ZIP} and Y_{jk}.
V_G = V_j \left[ 1 + Z_{ij} \left( Y_{ZIP} + Y_{jk} \right) \right]. (5)
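Equations (3)–(5) are easy to verify numerically with complex phasors. The sketch below uses illustrative per-unit values (assumed, not taken from the paper):

```python
import numpy as np

# Phasor sketch of Eqs. (3)-(5); all numeric values are illustrative (per unit).
v_j = 0.98 * np.exp(1j * np.deg2rad(-2.0))       # PCC voltage phasor V_j
p_j, q_j = 0.9, 0.4                              # CL active/reactive power
y_zip = (p_j - 1j * q_j) / (v_j * np.conj(v_j))  # Eq. (3): ZIP admittance
i_zip = y_zip * v_j                              # Eq. (3): I_ZIP = Y_ZIP * V_j
y_jk = 1.0 / (0.05 + 0.03j)                      # admittance of segment j-k
z_ij = 0.02 + 0.01j                              # line impedance between i and j
v_g = v_j * (1 + z_ij * (y_zip + y_jk))          # Eq. (5): slack-side voltage
```

A quick sanity check: setting Z_{ij} = 0 collapses Equation (5) to V_G = V_j, as expected for a zero-impedance connection.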
Equation (6) computes the phase angle \varphi_{1j} between V_j and V_G, as stated in the geometric relationships shown in [22]. The angle \varphi_{0j} associated with the series impedance of the distribution line is calculated in Equation (7) as follows:
\varphi_{1j} = \arg\left[ 1 + Z_{ij} \left( Y_{ZIP} + Y_{jk} \right) \right], (6)
\varphi_{0j} = \tan^{-1}\left( \frac{X_{ij}}{R_{ij}} \right). (7)
The complex impedance Z_{ij} consists of resistive and inductive components, R_{ij} and X_{ij}, respectively. Additionally, Equation (8) incorporates a_j and b_j from Equations (9)–(11) to compute the angle \theta_j. This angle is essential for satisfying the analytical model and ensuring that the ES operates in reactive mode, along with the operating conditions (a_j and b_j) at the PCC.
These operational conditions define the ES output voltage limits. Specifically, the primary operational boundary of the ES is governed by the control variable m_j, constrained to the range [-1, 1], where m_j = 1 corresponds to full capacitive mode and m_j = -1 denotes full inductive mode. Depending on the active and reactive power at the bus, a portion of the load is classified as NCL, which, in turn, sets the maximum power the ES can supply.
\theta_j = \sin^{-1}(m_j) - \tan^{-1}\left( \frac{b_j}{a_j} \right), (8)
a_j = \frac{V_G V_j R_{NC,j}}{K_j} \cos(\varphi_{0j} + \varphi_{1j}), (9)
b_j = \frac{1}{2}\left( \frac{V_j K_j}{R_{NC,j}} \right)^{2} + \frac{V_G V_j R_{NC,j}}{K_j} \sin(\varphi_{0j} + \varphi_{1j}), (10)
K_j = \sqrt{(R_{jk})^{2} + (X_{jk})^{2}}. (11)
To correctly integrate the SL with the BFSM, both the magnitudes and phase angles of I_{NC,j} and V_{ES,j} must be determined. First, the magnitudes are computed, as shown in (12) and (13).
I_{NC,j} = \frac{V_j}{R_{NC,j}} \cos(\theta_j / 2), (12)
V_{ES,j} = V_j \sin(\theta_j / 2) \pm I_{NC,j} X_{NC,j}. (13)
The resistance and reactance of Z_{NC,j} are represented by R_{NC,j} and X_{NC,j}, respectively. In (13), the result changes sign depending on the power factor of the NCL: it is positive when I_{NC,j} lags its voltage and negative when it leads. The phase angles are determined using the angle \lambda_j of V_j. For I_{NC,j}, the phase angle is \theta_{I_{NC,j}} = \lambda_j - \theta_j/2, and for V_{ES,j}, the phase angle is \theta_{ES,j} = \lambda_j - \theta_j/2 + \mathrm{sign}(\theta_j)\,\pi/2. These relations ensure that the ES operates in reactive compensation mode and confirm the validity of the mathematical model. The electrical variables of the ES model are calculated so that it can be integrated into the BFSM, enabling the calculation of the current I_{ij} through the backward sweep, as shown in (14).
I_{ij} = I_{NC,j} + I_{jk} + I_{ZIP}. (14)
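As a compact illustration of how Equations (8) and (12)–(13) fit together, the sketch below evaluates the ES operating point for given a_j and b_j; the inputs are hypothetical placeholders rather than values computed from Equations (9)–(11).

```python
import math

# Hedged sketch of the ES operating point: Eq. (8) for theta_j, then the
# magnitudes of Eqs. (12)-(13). a_j and b_j are illustrative placeholders.
def es_operating_point(m_j, a_j, b_j, v_j, r_nc, x_nc, lagging=True):
    assert -1.0 <= m_j <= 1.0                        # ES control-variable bound
    theta = math.asin(m_j) - math.atan2(b_j, a_j)    # Eq. (8)
    i_nc = (v_j / r_nc) * math.cos(theta / 2.0)      # Eq. (12)
    sign = 1.0 if lagging else -1.0                  # +/- term in Eq. (13)
    v_es = v_j * math.sin(theta / 2.0) + sign * i_nc * x_nc  # Eq. (13)
    return theta, i_nc, v_es
```

For m_j = 0 and b_j = 0 the angle θ_j vanishes, so the NCL current reduces to V_j/R_{NC,j} and the ES voltage to the reactive-drop term alone.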

2.2. Reinforcement Learning DDPG Algorithm

An RL system trains one or more agents to interact with an environment E iteratively at discrete time steps t. At each step, the agent receives an observation x_t, selects an action a_t, and receives a scalar reward r_t.
In power systems, DRL is particularly well-suited to complex, nonlinear, and time-varying control problems, such as voltage regulation in ADNs. In this work, we employ a deterministic policy within the DDPG framework [9]. This policy allows the direct adoption of the corresponding deterministic Bellman recursion [25]. Thus, for a policy \mu : S \to A, the action–value function Q^{\mu}(s, a) satisfies (15):
Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E} \left[ r(s_t, a_t) + \gamma\, Q^{\mu}\left( s_{t+1}, \mu(s_{t+1}) \right) \right]. (15)
As a result, the expectation depends only on the environment dynamics, enabling off-policy learning of Q^{\mu} using transitions generated by a different stochastic behavior policy \beta.
In the DPG framework, a deterministic policy \mu(s|\theta^{\mu}) represents the actor, mapping each state to a specific action, while the critic Q(s, a|\theta^{Q}) is trained using the Bellman equation above. The actor is then updated by performing gradient ascent on the expected return. The deterministic policy gradient is given by (16):
\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s_t \sim \rho^{\beta}} \left[ \nabla_{\theta^{\mu}} Q(s, a|\theta^{Q}) \big|_{s = s_t,\, a = \mu(s_t|\theta^{\mu})} \right]. (16)

2.2.1. DDPG Architecture and Key Components

The DDPG algorithm [9] extends the Deterministic Policy Gradient (DPG) framework by incorporating deep neural networks to approximate both the actor and critic functions. The key components of DDPG are:
  • Actor Network \mu(s|\theta^{\mu}): A neural network that maps states to continuous actions. The output layer uses a tanh activation function to bound actions within the valid range [-1, 1], which in this application corresponds to the ES control variable m_j.
  • Critic Network Q(s, a|\theta^{Q}): A neural network that estimates the action-value function. The critic takes both the state and action as inputs and outputs a scalar Q-value representing the expected cumulative reward.
  • Experience Replay Buffer: A finite-sized buffer that stores transition tuples (s_t, a_t, r_t, s_{t+1}) from agent-environment interactions. During training, mini-batches are uniformly sampled from this buffer to break temporal correlations and improve learning stability.
  • Target Networks: Separate copies of the actor (\mu') and critic (Q') networks with parameters \theta^{\mu'} and \theta^{Q'}, respectively. These target networks are updated slowly using soft updates:
    \theta' \leftarrow \tau \theta + (1 - \tau) \theta', (17)
    where \tau \ll 1 is the target smoothing factor. This mechanism stabilizes learning by providing consistent target values for the Bellman backup.
  • Exploration Noise: To encourage exploration, noise is added to the actor's output during training. DDPG typically uses an Ornstein–Uhlenbeck (OU) process to generate temporally correlated noise suitable for physical control tasks.
The critic is trained by minimizing the mean squared Bellman error:
L(\theta^{Q}) = \mathbb{E}_{(s, a, r, s') \sim D} \left[ \left( Q(s, a|\theta^{Q}) - y \right)^{2} \right], (18)
where D is the replay buffer and the target y is computed as:
y = r + \gamma\, Q'\left( s', \mu'(s'|\theta^{\mu'}) \,\middle|\, \theta^{Q'} \right). (19)
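The target computation in (19) and the soft update in (17) reduce to a few lines. The sketch below stubs the networks with plain callables and numpy arrays; it is an assumption-level illustration, not the authors' MATLAB implementation.

```python
import numpy as np

# Target value of Eq. (19), with the target actor/critic passed as callables.
def td_target(r, s_next, gamma, target_actor, target_critic):
    a_next = target_actor(s_next)                    # mu'(s' | theta^mu')
    return r + gamma * target_critic(s_next, a_next)

# Soft update of Eq. (17): theta' <- tau*theta + (1 - tau)*theta', tau << 1.
def soft_update(theta_target, theta, tau=0.001):
    return tau * theta + (1.0 - tau) * theta_target

# Stub networks: a constant policy and a constant critic.
y = td_target(1.0, None, 0.99, lambda s: 0.5, lambda s, a: 2.0)
print(y)  # -> 2.98
```

With τ = 0.001 the target parameters track the learned parameters with a time constant of roughly a thousand updates, which is what stabilizes the Bellman backup.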

2.2.2. Twin Delayed DDPG (TD3) Algorithm

The TD3 algorithm [10] addresses several limitations of DDPG, particularly the tendency to overestimate Q-values, which can lead to suboptimal policies. TD3 introduces three key modifications:
  • Clipped Double Q-Learning: TD3 maintains two critic networks Q_1 and Q_2, each with its own target network. The target value uses the minimum of the two critics to reduce overestimation:
    y = r + \gamma \min_{i=1,2} Q_i'\left( s', \tilde{a} \,\middle|\, \theta^{Q_i'} \right), (20)
    where \tilde{a} = \mu'(s'|\theta^{\mu'}) + \epsilon and \epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c) is clipped noise added to the target action.
  • Delayed Policy Updates: The actor network is updated less frequently than the critics (typically once every two critic updates). Updating the actor less frequently allows the critic estimates to stabilize before being used to update the actor, reducing the accumulation of errors.
  • Target Policy Smoothing: Noise is added to the target policy's actions (as shown in (20)) to smooth the Q-function estimates and prevent the policy from exploiting narrow peaks in the Q-landscape.
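The three modifications above combine in the target computation of (20); a minimal sketch with stubbed actor and critics (assumptions, not the paper's networks) could look like:

```python
import numpy as np

# Sketch of the TD3 target of Eq. (20): target-policy smoothing followed by
# clipped double-Q learning over two target critics.
def td3_target(r, s_next, gamma, target_actor, target_critics,
               sigma=0.2, c=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    eps = float(np.clip(rng.normal(0.0, sigma), -c, c))      # clipped noise
    a_tilde = float(np.clip(target_actor(s_next) + eps, -1.0, 1.0))
    q_min = min(q(s_next, a_tilde) for q in target_critics)  # pessimistic critic
    return r + gamma * q_min

# Two critics that disagree: the smaller (pessimistic) estimate is used.
critics = [lambda s, a: 2.0, lambda s, a: 3.0]
y = td3_target(1.0, None, 0.99, lambda s: 0.0, critics)
print(y)  # -> 2.98
```

Taking the minimum over the twin critics is what counteracts the Q-value overestimation that plain DDPG suffers from.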
Table 2 summarizes the main differences between DDPG and TD3.

2.2.3. Comparison with Other DRL Algorithms

Beyond DDPG and TD3, other DRL algorithms have been applied to voltage regulation problems. The PPO algorithm [12] is a policy gradient method that uses a clipped surrogate objective to ensure stable updates without the need for trust-region constraints. While PPO has shown robust performance across various domains, it typically requires more samples than off-policy methods such as DDPG. The SAC algorithm [11] incorporates entropy regularization to encourage exploration and has demonstrated strong performance in continuous control tasks.
DDPG was selected as the primary algorithm due to: (1) its proven effectiveness in continuous control problems with deterministic optimal policies, (2) its compatibility with the reactive power dispatch problem, where actions are continuous and bounded, and (3) its relatively straightforward hyperparameter tuning compared to entropy-regularized methods. TD3 and PPO are included as baselines to evaluate whether their enhancements provide benefits for this specific application.

3. Framework of the Proposed Approach

This section provides a detailed examination of the proposed framework for optimizing ES operation within the ADN. The framework leverages a DRL algorithm to mitigate voltage deviations caused by fluctuating user consumption patterns and DG active power injections.

3.1. Reinforcement Learning Environment

The architecture of the proposed DRL-based approach for voltage regulation is shown in Figure 2.
In the proposed framework, a set of ESs is controlled by a single agent that interacts with the environment, which is represented as an ADN using the IEEE 33-bus system, as shown in Figure 2. At each time step, stochastic variations in load and distributed generation modify the network's operating point. The environment provides an observation consisting of the vector of dynamic voltage deviations \Delta V_D(t) with respect to the dynamic voltage profile V_D. This vector constructs the state s_t, which is then fed to the DDPG algorithm to compute the reactive-power setpoints for the ESs. The resulting voltages are then used to calculate the reward according to (21) and (22), explicitly encouraging the agent to reduce \overline{\Delta V_D}. The time-delay block in Figure 2 closes the discrete-time loop: after the ESs inject or absorb reactive power, the ADN voltages are updated, a new \Delta V_D(t+1) value is obtained, and the next state–action–reward tuple is generated.
The architecture of this control scheme also includes the following components:
  • State: s_t^b \in S_t represents the information obtained after the actor applies an action in the environment at time t. In this work, the voltage deviation of each bus is defined as s_t^b, where b denotes the bus index. S_t depicts the state information of all the buses at time t.
  • Action: m_t^c \in M_t denotes the control signal issued by the agent, which determines the reactive power injection or absorption of ES number c (where c identifies each ES in the system) at iteration t. The set M_t represents the actions of all ESs at t.
  • Reward: R_E represents the cumulative reward obtained in the current episode (E_p) and is computed as follows in (21):
    R_E(E_p) = \sum_{t=0}^{T} R_D(t), (21)
    where T is the total number of time steps, and R_D(t) is the reward at t. This work assigns R_D(t) based on the penalization region, as defined in (22).
R_D(t) = \begin{cases}
0 & \text{if } \overline{\Delta V_D} / \overline{\Delta V_B} \le 0.7, \\
-15 & \text{if } 0.7 < \overline{\Delta V_D} / \overline{\Delta V_B} \le 0.8, \\
-30 & \text{if } 0.8 < \overline{\Delta V_D} / \overline{\Delta V_B} \le 0.9, \\
-45 & \text{if } 0.9 < \overline{\Delta V_D} / \overline{\Delta V_B} \le 1, \\
-60 & \text{otherwise.}
\end{cases} (22)
The piecewise reward R_D(t) in (22) is based on the deviation ratio \overline{\Delta V_D} / \overline{\Delta V_B}, which compares the mean dynamic voltage deviation obtained with the ES control action, \overline{\Delta V_D}, against the mean deviation of the base case without any control, \overline{\Delta V_B}. The neutral threshold value (0.7) was selected after preliminary parametric studies, as shown in the Sensitivity Analysis of Reward Function Threshold Section. A uniformly stepped penalty is applied, increasing by 15 units for each additional 10% degradation in the deviation ratio. This shaping guides the agent progressively toward the target region while keeping the reward bounded to avoid excessively large penalties. By scaling the penalty with the ratio interval, the agent is strongly incentivized to learn strategies that keep the voltage deviation ratio as low as possible.
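The region-based rule in (22) maps directly onto a small function (α = 0.7 is the neutral threshold selected in this work):

```python
# Region-based penalized reward of Eq. (22). dev_ratio is the mean controlled
# deviation divided by the mean no-control (base-case) deviation.
def region_reward(dev_ratio, alpha=0.7):
    if dev_ratio <= alpha:
        return 0.0                        # inside the target region: no penalty
    for bound, penalty in ((0.8, -15.0), (0.9, -30.0), (1.0, -45.0)):
        if dev_ratio <= bound:
            return penalty                # 15-unit step per 10% degradation
    return -60.0                          # worse than the no-control baseline

print([region_reward(r) for r in (0.65, 0.75, 0.95, 1.2)])
# -> [0.0, -15.0, -45.0, -60.0]
```

Because the penalty is bounded below at −60, a single bad step cannot dominate the episode return in (21).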

Sensitivity Analysis of Reward Function Threshold

To justify the selection of the 0.7 threshold, a systematic sensitivity analysis was conducted by varying the threshold \alpha from 0.5 to 0.9 in increments of 0.1. For each threshold value, the DDPG agent was trained for 250 episodes under identical conditions, and the resulting performance was evaluated based on: (1) final mean voltage deviation \overline{\Delta V_D}, (2) convergence speed, (3) control action variability, and (4) training time.
The results in Table 3 reveal that the lowest threshold (\alpha = 0.5) failed to converge within 250 episodes. Although \alpha = 0.6 yields a marginally lower mean voltage deviation, it converges more slowly and requires substantially longer training time. Additionally, it exhibits the highest action variability, suggesting more aggressive control behavior. Higher thresholds (\alpha = 0.8, 0.9) reduce the elapsed training time and produce smoother actions, but at the cost of poorer voltage regulation performance. Overall, \alpha = 0.7 offers the best trade-off, achieving low voltage deviation with reasonable convergence speed and moderate action variability. This value also corresponds to a physically meaningful target of 30% improvement over the base case, a challenging yet attainable goal for the ES configuration used in this study.
The value of the ratio changes constantly due to the update of \overline{\Delta V_D} at every t, as shown in (23):
\overline{\Delta V_D}(t) = \frac{1}{n} \sum_{b=1}^{n} \left| V_{Bus}(b) - 1 \right|. (23)
In this equation, n denotes the total number of buses, and V_{Bus}(b) represents the voltage at bus b. Similarly, \overline{\Delta V_B} is calculated using its corresponding bus voltages.
The loop continues until the maximum number of time steps is reached or the reset condition in (24) is met.
\mathrm{Reset} = \overline{\Delta V_D} < 0.7\, \overline{\Delta V_B}. (24)
This stopping criterion is based on the same target threshold, α = 0.7 , as the reward design. An episode is considered successful once the controller achieves at least a 30% reduction in mean voltage deviation relative to the no-control baseline. To avoid premature termination due to transient improvements, a minimum of 10 steps is enforced before an episode can terminate.
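Equations (23) and (24), together with the minimum-step guard described above, can be sketched as:

```python
import numpy as np

# Eq. (23): mean absolute per-unit voltage deviation over all n buses.
def mean_deviation(v_bus):
    return float(np.mean(np.abs(np.asarray(v_bus, dtype=float) - 1.0)))

# Eq. (24) plus the 10-step guard against premature termination.
def should_reset(dev_d, dev_b, step, alpha=0.7, min_steps=10):
    return step >= min_steps and dev_d < alpha * dev_b
```

An episode terminates once the controlled deviation drops below 70% of the base-case deviation, but only after the tenth step, so transient improvements do not end the episode early.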

3.2. Training and Evaluation

To evaluate the proposed methodology, the DDPG algorithm is trained using MATLAB's Reinforcement Learning Toolbox and custom code. The first step of training, as shown in Figure 3, consists of obtaining an initial DG and load variation from the real profiles used in this work, specifically at E_p = 1.
The ADN solution is obtained by solving the power flow of the IEEE 33-bus grid using the BFSM algorithm. After this sequence, all the voltages at each bus are available, enabling the computation of \overline{\Delta V_D}, which is used to compute the observations for the state S_t. With this information, the algorithm can take a set of actions M_t to regulate the operational state of the ESs. After applying the corresponding actions to the ADN, the power flow solution is computed, yielding a reward R_D, which is then evaluated to determine whether the reset condition is met or the T condition is satisfied. If the reset condition is met, the episode terminates, and t is reset to 1 to start the next episode (E_p + 1). This threshold reflects the point at which the agent has successfully learned to maintain the voltage deviation within acceptable operational limits. At this point, the agent receives a reward of R_D = 0 for remaining within the desired operating region, as shown in (22). This indicates convergence of the training process and satisfactory control performance.
On the other hand, if the T condition has not yet been reached, the algorithm will repeat the solution process, proposing a new set of actions every t. Training is complete when the maximum number of episodes has been reached.

3.3. Hyperparameter Tuning

In this work, an experimental tuning strategy based on the grid search method is used to adjust key hyperparameters, including the number of training episodes, the actor and critic learning rates, the decay rate, and the episode length [26].
To implement this strategy, each hyperparameter was first evaluated independently. Specifically, a set of five candidate values was defined for each hyperparameter, uniformly distributed within a ±30% interval around its default value. These defaults correspond to widely adopted configurations in continuous-control DRL algorithms and are consistently reported in prior studies. Table 4 (column DDPG-B) summarizes the hyperparameters considered in this study. The impact of each candidate value was assessed by analyzing the evolution of the reward function throughout training, enabling the identification of hyperparameters with high sensitivity, as evidenced by significant variations in convergence speed, reward stability, and steady-state performance. Hyperparameters with limited influence on these metrics were retained at their default values.
Subsequently, a subset of highly sensitive hyperparameters was fixed for further tuning. For these parameters, the search intervals were expanded to broader ranges derived from the literature on DDPG-based control and voltage regulation. A sequential tuning procedure was then applied: for each hyperparameter, all other parameters were held constant while candidate values were evaluated individually. For every configuration, ten independent training runs were performed to mitigate the effects of stochasticity inherent to neural network initialization and exploration noise. The parameter value that achieved the highest average reward while simultaneously ensuring stable voltage profiles within acceptable operational limits was retained before proceeding to the next hyperparameter.
This sequential process was repeated across the subset of sensitive hyperparameters until further adjustments produced negligible changes in convergence behavior and voltage regulation performance within the explored ranges.
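The two-stage procedure above (±30% screening grids, then one-at-a-time sequential tuning with repeated runs per candidate) can be outlined as follows. This is an illustrative Python sketch, not the authors' MATLAB code; the deterministic toy objective and all identifier names are assumptions.

```python
import statistics

def candidate_grid(default, n=5, span=0.3):
    """Screening stage: n candidate values uniformly spaced within +/-30%
    of the hyperparameter's default value."""
    lo, hi = default * (1 - span), default * (1 + span)
    return [lo + i * (hi - lo) / (n - 1) for i in range(n)]

def sequential_tuning(defaults, candidates, train_fn, runs=10):
    """Tuning stage: for each sensitive hyperparameter, hold the others
    fixed, average the reward over `runs` independent training runs per
    candidate, and keep the best value before moving to the next one."""
    tuned = dict(defaults)
    for name, values in candidates.items():
        scores = {v: statistics.mean(train_fn({**tuned, name: v})
                                     for _ in range(runs))
                  for v in values}
        tuned[name] = max(scores, key=scores.get)
    return tuned

# Toy deterministic objective (assumption): reward peaks at lr = 1e-3, gamma = 0.99.
def toy_train(p):
    return -1e3 * abs(p["actor_lr"] - 1e-3) - abs(p["gamma"] - 0.99)

best = sequential_tuning(
    {"actor_lr": 1e-4, "gamma": 0.95},
    {"actor_lr": [1e-4, 5e-4, 1e-3, 5e-3], "gamma": [0.90, 0.95, 0.99]},
    toy_train,
)
```

Averaging over multiple runs per candidate mirrors the paper's mitigation of stochasticity from network initialization and exploration noise; with a real DRL `train_fn`, each call would launch a full training run.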

4. Case Study

This section describes the experimental framework used to evaluate the proposed strategy on the IEEE 33-bus network, comparing a no-control base case against the controlled case.

Test Scenarios Description

The proposed methodology optimizes the operation of the ESs in the IEEE 33-bus radial distribution feeder, depicted in Figure 4. Line impedances and bus load data are from [27], and grid properties are listed in Table 5. All users connected to the ADN have a residential consumer profile, as described in [28].
The consumption trend of the load profiles is analyzed in Figure 5, which shows detailed curves for the 24 h retrieved from the public ERCOT database [29]. These load profiles are assigned to different grid branches, as indicated by the color coding in Figure 4. Seven DG sources are distributed across the grid, comprising four photovoltaic panels (PVs) and three wind turbines (WTs), and accounting for 60% of the total demand. The behavior of the DG source profiles is depicted in Figure 6, where the PV and WT generation profiles are from the public NREL dataset [30]. The DG sources are placed in a distributed configuration to efficiently deliver injected active power to elements farther from the main grid. To evaluate the proposed methodology, 12 ESs are distributed across the ADN. These ESs perform voltage regulation via reactive power compensation and are controlled using the DDPG algorithm. Each ES operates at 80% of the CL’s active power capacity. This value lies within the range of smart-load participation levels explored in previous ES-based studies [17,21]. To implement the control strategy, a specific neural network architecture and a set of experimentally tuned hyperparameters are adopted for the DDPG-B configuration, as summarized in Table 4. For comparison, the same table reports the hyperparameters used for DDPG-A and TD3, which follow the default MATLAB settings. The actor and critic learning rates were adjusted experimentally to ensure stable training for this specific application. PPO hyperparameters are not listed because, as an on-policy method, several TD3/DDPG-specific parameters (e.g., replay buffer size, target-network smoothing) do not apply.
Once trained, the DRL agents regulate the grid voltage at 15-min intervals over 24 h to reduce Δ V D ¯ below 70% of Δ V B ¯ .
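The dispatch cadence and control target above reduce to a few quantities, sketched here in illustrative Python (the 80% participation level and the 0.7 threshold come from the text; the helper names are assumptions):

```python
def es_capacity(p_cl, participation=0.8):
    """Each ES operates at 80% of the corresponding CL active-power capacity."""
    return participation * p_cl

# One control action every 15 min over a 24-h horizon.
steps_per_day = 24 * 60 // 15   # 96 dispatch intervals

def meets_target(dvd, dvb, threshold=0.7):
    """Control objective: mean deviation below 70% of the base-case value."""
    return dvd < threshold * dvb
```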
Finally, to assess the robustness of the proposed DRL controller, the three DRL configurations (tuned DDPG-B, default DDPG-A, and TD3) are benchmarked against a no-control base case. The simulations are performed on a PC with a 13th Gen Intel(R) Core(TM) i7-13650HX 2.60 GHz processor (Santa Clara, CA, USA) running MATLAB R2022b.

5. Results

This section is divided into two parts: the key results from training the DDPG controller and the performance evaluation during testing for both the base and controlled cases.

5.1. Training Results

Figure 7 illustrates the average reward (blue line) and episode reward (purple line) across 200 training episodes for the DDPG agent. With the current parameters and reward function, the maximum achievable reward is zero. At the beginning of training, the reward shows low, fluctuating values, indicating high variability in the agent's decisions. This behavior is expected, as the agent undergoes an exploratory phase without prior knowledge of the environment or of how to properly regulate the grid voltage through ES operation. As training progresses, the episode rewards show a clear trend toward convergence, with gradually increasing values. Nevertheless, after episode 40, sudden drops in the episode reward are observed. These drops are likely due to grid conditions the agent had not previously encountered, such as load fluctuations and active power injections, in response to which the DDPG agent takes atypical exploratory actions to find a suitable solution. As training continues, these fluctuations decrease, and by episode 60 the agent begins to show steady episode-reward behavior. Around episode 100, both curves stabilize, indicating that the agent has learned an effective policy. The average reward approaches zero, showing that the agent attains the desired solution specified by the reward function, namely a reduction of the voltage deviation Δ V D ¯ below 70% of Δ V B ¯ .

5.2. Testing Results on the IEEE 33-Bus Network

The voltage regulation results are presented in Figure 8, which provides a comprehensive view of the system performance through two complementary representations. Figure 8a displays the minimum bus voltage recorded at each hour, enabling a direct assessment of ANSI C84.1 compliance [31]. The results indicate that the proposed strategy using DDPG-B raises the minimum hourly voltage level, thereby reducing the frequency of undervoltage events compared to the base case. Figure 8b shows the temporal evolution of bus voltages, where shaded regions represent the voltage range (minimum to maximum) across all 33 buses at each 15-min interval. In the base case, the feeder exhibits significant voltage regulation challenges during two critical periods. The first undervoltage event occurs around 6:15 h (point A), where the bus voltage drops to 0.9492 p.u.; this behavior is attributed to the morning load ramp-up while DG output remains minimal. Between 8:00 h and 17:00 h, the voltage profile recovers as DG participation increases. The most severe undervoltage condition occurs at point B (21:15 h), where the minimum voltage drops to 0.9203 p.u., a 7.97% deviation from the nominal value. In contrast, the ESs controlled by the DDPG-B algorithm effectively mitigate undervoltage conditions. The shaded orange region in Figure 8b is notably narrower than the blue region, indicating reduced voltage variability. The controlled case exhibits only a single undervoltage event, at point C (21:15 h), where the minimum voltage reaches 0.9445 p.u. This event affects buses 12–17 at the feeder end, where cumulative line losses and limited ES capacity create challenging conditions. Nevertheless, the DDPG-B controller reduces the maximum voltage deviation from 7.97% to 5.55%, an improvement of approximately 30%.
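The percentage figures above follow directly from the reported minimum voltages; a quick check in illustrative Python, using the values quoted in the text:

```python
def deviation_pct(v_min, v_nom=1.0):
    """Maximum voltage deviation as a percentage of the nominal voltage."""
    return 100.0 * (v_nom - v_min) / v_nom

base = deviation_pct(0.9203)                # base case at point B: about 7.97%
ctrl = deviation_pct(0.9445)                # DDPG-B at point C: about 5.55%
improvement = 100.0 * (base - ctrl) / base  # relative reduction, about 30%
```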
To better understand these results, the agent’s actions are analyzed in more detail. Figure 9a shows the ES reactive power ( Q E S ) across the testing scenarios, while Figure 9b displays the voltage distribution at each bus across all the test cases, linking ES operation to its impact on system voltage. The agent’s control actions exhibit diversity; this is evident in the boxplots in Figure 9a, where each of the 12 ESs shows different reactive power patterns. This diversity in Q E S magnitudes demonstrates that, regardless of load fluctuations, the algorithm enables coordinated ES operation to determine an overall solution, thereby reducing Δ V D ¯ . Notably, the boxplot of bus 19 in Figure 9a is the only one whose reactive-power values are mostly positive, indicating that this ES operates in inductive mode most of the time and absorbs reactive power, which reduces the local voltage. This behavior is mainly explained by its electrical location: the ES is connected to the branch supplied from bus 2, one of the closest buses to the slack bus, so the voltage drop along this feeder is smaller, and the voltage at bus 19 is higher than at downstream buses. Consequently, the loads connected to this bus experience voltages closer to 1 p.u., as shown in Figure 9b.
A similar case is observed at bus 22, where the boxplot shows a limited spread, indicating that its voltage remains consistently close to 1 p.u. throughout the simulation, even though its ES, unlike the one at bus 19, does not operate in inductive mode. Although this ES is connected to the same branch, its location at the distal end results in a voltage reduction. In contrast, for the ESs located at buses 7, 8, 14, 18, 29, 30, 31, and 32, the voltages are lower and consequently exhibit larger deviations. This effect is associated with the distance from the slack bus, which shifts the ESs' operating range into capacitive mode.
On the other hand, the ESs connected to buses 24 and 25 operate at higher magnitudes in capacitive mode. Although their connection buses exhibit low voltage deviations and operate close to the nominal voltage, these ESs still contribute significantly to capacitive compensation. This behavior is explained by the fact that these devices have the largest compensation capacity, enabling a broader impact on the ADN. As a result, they not only compensate local loads but also provide voltage support to other connected loads. The simulation results obtained with the DDPG algorithm confirm that the training process was effective, enabling the agent to adapt to different operating scenarios while maintaining voltage stability.
Figure 10 compares the no-control base case with the four DRL configurations: TD3, PPO, DDPG-A, and DDPG-B. The boxplots show the distribution of the total voltage deviation over a 24-h period, summarizing the median, interquartile range, and extreme values, and thereby enabling a detailed comparison of how effectively each controller keeps the bus voltages close to the nominal value.
In the no-control base case, the boxplot shows the widest spread and the largest voltage deviations, with a few values exceeding 0.045 p.u. By contrast, TD3 and DDPG-A exhibit intermediate performance, with mean deviations close to 0.02 p.u., while PPO yields a lower mean deviation (0.0184 p.u.), the best among the default configurations. Nevertheless, the proposed DDPG setting (DDPG-B) yields a much more concentrated distribution and the lowest mean value of Δ V D ¯ among all tested agents. In this case, the 25th and 75th percentiles lie closer to the median, and the minimum deviations are lower, indicating more consistent voltage regulation and more reliable control actions. The mean value of Δ V D ¯ is approximately 40% lower than in the base case. Regarding the training statistics, DDPG-B achieves the lowest training time (3.76 min), whereas TD3, PPO, and DDPG-A require more than 6 min. A similar trend is observed in the convergence behavior: DDPG-B converges around episode 100, while TD3 and DDPG-A converge only after roughly 200 episodes. PPO stabilizes earlier, despite occasional reward drops. These results support the effectiveness of the proposed hyperparameter tuning.
Although all DRL techniques effectively mitigate voltage deviations, these findings confirm that the tuned DDPG-B agent significantly reduces Δ V D ¯ and provides more uniform behavior across different operating scenarios, outperforming DDPG-A and TD3 in terms of both voltage deviation and consistency.

5.3. Comparison with Metaheuristic-Based Centralized Voltage Control

A comparative performance analysis between the proposed DDPG-B-based control strategy and metaheuristic-based centralized voltage regulation approaches, including GA, PSO, and Grey Wolf Optimizer (GWO), is conducted in this subsection. These techniques are commonly adopted in centralized voltage control schemes and therefore constitute suitable benchmarks for evaluating the effectiveness of the proposed DRL-based approach. For implementation, GWO follows the open-source reference code in [32], while GA and PSO use the MATLAB toolbox provided in [33]. For all three metaheuristics, the default hyperparameter settings are used.
The comparison is conducted under the same network configuration, operating conditions, and scenarios defined in the proposed case study, ensuring a fair and consistent evaluation across all methods. Two key performance indicators are considered: Δ V D ¯ and the computational execution time required to obtain a solution.
The metaheuristic-based methods effectively solve the voltage regulation problem on the IEEE 33-bus active distribution network. Their mean voltage deviations, Δ V D ¯ , are comparable to those obtained with the DDPG-B-based control strategy, as reported in Table 6, indicating similar voltage regulation performance under the considered operating conditions. However, the computational performance differs significantly: on average, GA, PSO, and GWO require more than 2 min per scenario, while the DDPG-B-based DRL approach requires only 0.0159 min per scenario. This difference arises because the DRL approach exploits a pre-trained policy to generate control actions through a single forward pass. The substantial reduction in execution time highlights the superior computational efficiency of the proposed method, which is particularly relevant for real-time voltage regulation applications.
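The reported timings imply a large speed advantage for the pre-trained policy. The following back-of-the-envelope Python sketch uses the figures quoted above; the 2-min value is the stated lower bound for the metaheuristics, so the resulting speedup factor is itself a lower bound.

```python
drl_min = 0.0159   # DDPG-B execution time per scenario (min), as reported
meta_min = 2.0     # lower bound reported for GA, PSO, and GWO (min)

speedup = meta_min / drl_min   # more than 125x faster per scenario
day_total = drl_min * 96       # about 1.5 min for a full day of 15-min dispatches
```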

5.4. Scalability of the Centralized Approach

To assess the scalability of the proposed centralized DRL control framework, the experimental evaluation is extended to the IEEE 69-bus system. In this setup, five ES units are installed at buses with high active power demand, as listed in Table 7, and each ES operates at 80% of the CL active power capacity. Three PV units are included using the same PV generation profiles (PV1–PV3) as in the previous case study, accounting for 60% of the total active power demand, and the residential load profile used in the 33-bus case is retained and appropriately assigned across the 69-bus network. The proposed reward function is kept unchanged, following the same criteria defined in (22), and the tuned DDPG-B hyperparameter configuration is likewise retained.
Despite the increased network size and expanded observation space, DDPG-B achieved a Δ V D ¯ value of 0.0046 p.u., representing a 38.49% improvement over the base case and demonstrating the scalability of the proposed DRL configuration. Further performance improvements can be achieved by retuning the hyperparameters specifically for the IEEE 69-bus system.

6. Discussion

This section provides a critical analysis of the proposed DRL-based voltage control strategy, discussing its advantages and limitations, comparing it with existing approaches, and identifying practical implementation considerations.

6.1. Performance Analysis and Comparison

The proposed controller was trained to coordinate the ES units under stochastic load and distributed generation variations, achieving an average reduction of about 40% in the mean voltage deviation compared to the no-control base case, while keeping all bus voltages within the ANSI limits across 96 test scenarios. In addition, the DDPG-B configuration outperformed DDPG-A, TD3, and PPO, exhibiting a lower mean voltage deviation and tighter boxplots, indicating more consistent and reliable voltage regulation.
The superior performance of DDPG-B over TD3 and PPO in this application is noteworthy, as these algorithms generally outperform DDPG in continuous-control benchmarks, an advantage commonly attributed to TD3’s more conservative Q-value estimation and PPO’s conservative policy updates. However, the relatively simple reward structure and bounded action space in this voltage regulation problem may reduce these advantages. Furthermore, the hyperparameter tuning applied to DDPG-B but not to TD3 or PPO (which used default hyperparameters except for learning rates in TD3) likely contributed to this performance gap.

6.2. Advantages of the Proposed Approach

The proposed DRL-ES voltage regulation scheme offers several advantages over conventional methods:
  • Model-free adaptation: Unlike optimization-based approaches that require accurate network models, the DDPG agent learns directly from interactions with the environment, enabling adaptation to model uncertainties and unmodeled dynamics.
  • Real-time operation: Once trained, the DDPG agent can compute control actions in milliseconds, making it suitable for real-time voltage regulation under rapidly changing conditions.
  • Coordinated multi-ES control: The centralized agent implicitly learns the interactions between multiple ESs, enabling coordinated operation without requiring explicit communication protocols.
  • Cost-effective infrastructure: ESs require minimal energy storage compared to battery-based ESSs, reducing capital costs while achieving comparable voltage regulation performance.
  • Flexibility: The trained agent can handle diverse operating scenarios without retraining, as demonstrated by consistent performance across 96 test cases with varying load and generation profiles.

6.3. Limitations and Challenges

Despite its promising results, the proposed approach has several limitations that should be acknowledged:
  • Centralized architecture: The current implementation relies on a centralized agent with full observability of all bus voltages. This architecture requires a reliable communication infrastructure and introduces a single point of failure. For larger networks, communication delays and bandwidth constraints may become significant concerns.
  • Training data requirements: The DDPG agent requires extensive offline training with representative scenarios. If operational conditions deviate significantly from the training data (e.g., extreme weather events or major network reconfigurations), performance may degrade.
  • Idealized ES model: The ES is modeled as an ideal controlled AC voltage source, neglecting converter losses, DC-link voltage dynamics, and harmonic distortion. While this simplification is common in planning studies, practical implementations may exhibit different behavior.
  • Reactive power only: The ES configuration used in this work provides only reactive power compensation. For severe voltage deviations, active power injection from ESSs or curtailment of DG sources may be necessary.
  • Scalability concerns: The state and action spaces grow linearly with the number of buses and ESs. For large-scale networks with hundreds of buses, the current architecture may face scalability challenges that require distributed or hierarchical approaches.

6.4. Practical Implementation Considerations

For practical deployment of the proposed strategy, several considerations should be addressed:
  • Communication infrastructure: A supervisory control and data acquisition (SCADA) system or advanced metering infrastructure (AMI) is needed to collect real-time voltage measurements and dispatch control signals to ESs.
  • Safety constraints: Additional constraints should be incorporated to prevent the ESs from operating beyond their physical limits and to ensure fail-safe behavior during communication failures.
  • Online adaptation: Implementing online fine-tuning mechanisms would allow the agent to adapt to slow drifts in network characteristics without complete retraining.
  • Regulatory compliance: The control strategy should be validated against utility standards and regulations governing voltage regulation and power quality.

6.5. Future Research Directions

These results highlight the potential of appropriately tuned DRL-controlled ESs as an effective and robust solution for voltage regulation in ADNs with high renewable penetration and variable load profiles. Future research will focus on extending this framework to multi-agent DRL architectures, considering larger and more complex distribution networks, and integrating ESs with other compensation devices to further improve scalability and resilience.
Specific future directions include:
  • Developing distributed multi-agent DRL frameworks where each ES is controlled by a local agent with limited communication, thereby improving scalability and robustness.
  • Integration of more detailed ES models, including converter dynamics, losses, and protection schemes.
  • Combination of ES-based reactive compensation with battery ESSs for coordinated voltage-and-frequency regulation.
  • Extension to unbalanced three-phase networks and investigation of phase-specific voltage regulation.
  • Transfer learning approaches to enable rapid adaptation when deploying trained agents to new network configurations.

7. Conclusions

This paper presented a novel deep reinforcement learning-based voltage regulation strategy using electric springs in active distribution networks. The main conclusions of this work are as follows:
  • The proposed DDPG-based controller successfully coordinates 12 electric springs to regulate voltage in the IEEE 33-bus test system under time-varying load and distributed generation conditions. The trained agent achieves an approximately 40% reduction in mean voltage deviation compared to the no-control baseline. In addition, a scalability study was conducted on the IEEE 69-bus test system using the same load and distributed generation conditions, with 5 electric springs installed, while keeping the reward design and the tuned DDPG-B hyperparameter configuration unchanged. Under this setting, the controller achieves a 38.49% improvement over the no-control case.
  • The region-based penalized reward function effectively guides the agent toward maintaining voltage deviations below the target threshold while avoiding overly aggressive control actions. The sensitivity analysis confirms that the selected threshold of 0.7 provides an effective trade-off between voltage regulation performance and training stability.
  • Systematic hyperparameter tuning of the DDPG algorithm (DDPG-B configuration) yields significant performance improvements over the default DDPG configuration (DDPG-A), TD3, and PPO, highlighting the importance of algorithm tuning for specific applications rather than relying on default parameters.
  • The combination of DRL and electric springs represents a promising approach for cost-effective voltage regulation in distribution networks with high renewable penetration. Electric springs provide flexible reactive compensation without requiring large energy storage elements, while DRL enables adaptive, model-free control that can handle the uncertainty of renewable generation.
  • The proposed approach keeps all bus voltages within ANSI C84.1 limits (0.95–1.05 p.u.) in 95 out of 96 test scenarios, with only minor violations occurring at remote buses during peak demand periods.
While the centralized architecture and idealized ES model represent limitations that should be addressed in future work, the results demonstrate the feasibility and effectiveness of DRL-based ES coordination for voltage regulation. The methodology developed in this work provides a foundation for practical implementation and further research on distributed control architectures and more realistic ES models.

Author Contributions

Conceptualization, J.I.L.-P., G.T.-C., G.T.-T. and A.G.-P.; methodology, J.I.L.-P., G.T.-C. and G.T.-T.; software, J.I.L.-P., G.T.-T. and L.E.R.-G.; validation, J.I.L.-P., G.T.-T. and G.T.-C.; formal analysis, J.I.L.-P., G.T.-C., G.T.-T. and A.G.-P.; investigation, J.I.L.-P., G.T.-C. and G.T.-T.; resources, A.G.-P. and G.T.-T.; data curation, J.I.L.-P. and L.E.R.-G.; writing—original draft preparation, J.I.L.-P.; writing—review and editing, J.I.L.-P., G.T.-C., G.T.-T., L.E.R.-G. and A.G.-P.; visualization, L.E.R.-G., G.T.-C., G.T.-T. and A.G.-P.; supervision, A.G.-P., G.T.-C. and G.T.-T.; project administration, A.G.-P., G.T.-C. and G.T.-T.; funding acquisition, A.G.-P. and G.T.-T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used are available in [29,30].

Acknowledgments

This work was supported by the Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI) through a PhD scholarship (No. 748907) awarded to J. I. Lara-Perez.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

The following abbreviations and symbols are used throughout this manuscript:
Abbreviations
ADN  Active Distribution Network
BFSM  Backward–Forward Sweep Method
CL  Critical Load
D-FACTS  Distributed Flexible AC Transmission Systems
DDPG  Deep Deterministic Policy Gradient
DG  Distributed Generation
DPG  Deterministic Policy Gradient
DRL  Deep Reinforcement Learning
ES  Electric Spring
ESS  Energy Storage System
FACTS  Flexible AC Transmission Systems
GA  Genetic Algorithm
GWO  Grey Wolf Optimizer
NCL  Non-Critical Load
PCC  Point of Common Coupling
PPO  Proximal Policy Optimization
PSO  Particle Swarm Optimization
PV  Photovoltaic
RL  Reinforcement Learning
SAC  Soft Actor–Critic
SL  Smart Load
STATCOM  Static Synchronous Compensator
TD3  Twin Delayed Deep Deterministic Policy Gradient
WT  Wind Turbine
ZIP  Constant-impedance, constant-current, constant-power load model
Symbols
A j Activation variable of ES at bus j ( A j { 0 , 1 } )
a j , b j Operating condition parameters at the PCC
E  Environment in the RL framework
E p Episode number
f 1 j – f 6 j   Weighting factors for ZIP load components
I i j Current phasor in distribution line from bus i to j
I N C j Current phasor through the noncritical load at bus j
I Z I P Current phasor through the ZIP load
K j Impedance magnitude factor
m j ES control variable ( 1 m j 1 )
M t Set of all ES actions at time t
P j , Q j Active and reactive power of ZIP load at bus j
P O j , Q O j Rated active and reactive power of the ZIP load
Q E S Reactive power output of the electric spring
Q μ ( s , a ) Action-value function for policy μ
R D ( t ) Reward at time step t
R E Cumulative reward per episode
R i j , X i j Resistance and reactance of distribution line ij
R N C j , X N C j Resistance and reactance of noncritical load at bus j
s t State at time step t
S t Set of all bus states at time t
T  Total number of time steps per episode
V G Thevenin equivalent voltage phasor
V F Nominal voltage magnitude of the critical load
V j Voltage phasor at bus j
V E S j Output voltage phasor of ES at bus j
Δ V B ¯ Mean voltage deviation of base case (no control)
Δ V D ¯ Mean dynamic voltage deviation with ES control
Y j k Equivalent admittance of line segment jk
Y Z I P Overall admittance of the ZIP load
Z i j Complex impedance of distribution line ij
Z N C j Complex impedance of noncritical load at bus j
γ Discount factor in RL
θ j Angle parameter for ES operation
θ μ , θ Q Parameters of actor and critic networks
λ j Phase angle of voltage V j
μ ( s | θ μ ) Deterministic policy (actor)
φ 0 j , φ 1 j Angle parameters for ES model

References

  1. Jiao, W.; Chen, J.; Wu, Q.; Li, C.; Zhou, B.; Huang, S. Distributed Coordinated Voltage Control for Distribution Networks with DG and OLTC Based on MPC and Gradient Projection. IEEE Trans. Power Syst. 2022, 37, 680–690. [Google Scholar] [CrossRef]
  2. Cao, D.; Zhao, J.; Hu, W.; Ding, F.; Huang, Q.; Chen, Z.; Blaabjerg, F. Data-Driven Multi-Agent Deep Reinforcement Learning for Distribution System Decentralized Voltage Control with High Penetration of PVs. IEEE Trans. Smart Grid 2021, 12, 4137–4150. [Google Scholar] [CrossRef]
  3. Zhang, X.; Wu, Z.; Sun, Q.; Gu, W.; Zheng, S.; Zhao, J. Application and progress of artificial intelligence technology in the field of distribution network voltage Control: A review. Renew. Sustain. Energy Rev. 2024, 192, 114282. [Google Scholar] [CrossRef]
  4. Naderi, E.; Pourakbari-Kasmaei, M.; Abdi, H. An efficient particle swarm optimization algorithm to solve optimal power flow problem integrated with FACTS devices. Appl. Soft Comput. 2019, 80, 243–262. [Google Scholar] [CrossRef]
  5. Khan, N.H.; Wang, Y.; Tian, D.; Raja, M.A.Z.; Jamal, R.; Muhammad, Y. Design of Fractional Particle Swarm Optimization Gravitational Search Algorithm for Optimal Reactive Power Dispatch Problems. IEEE Access 2020, 8, 146785–146806. [Google Scholar] [CrossRef]
  6. Hui, J. Adaptive sliding mode load-following control of a small modular reactor via reinforcement learning, nonlinear extended state observer, and neural network. Energy 2025, 333, 137317. [Google Scholar] [CrossRef]
  7. Hui, J. Nonlinear extended state observer-based model-free near-optimal sliding mode water level controller of an inverted U-tube steam generator. Eng. Appl. Artif. Intell. 2026, 163, 112755. [Google Scholar] [CrossRef]
  8. Toubeau, J.-F.; Bakhshideh Zad, B.; Hupez, M.; De Grève, Z.; Vallée, F. Deep Reinforcement Learning-Based Voltage Control to Deal with Model Uncertainties in Distribution Networks. Energies 2020, 13, 3928. [Google Scholar] [CrossRef]
  9. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  10. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1587–1596. [Google Scholar]
  11. Sun, X.; Xu, Z.; Qiu, J.; Liu, H.; Wu, H.; Tao, Y. Optimal Volt/Var Control for Unbalanced Distribution Networks with Human-in-the-Loop Deep Reinforcement Learning. IEEE Trans. Smart Grid 2024, 15, 2639–2651.
  12. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
  13. Qian, T.; Liang, Z.; Shao, C.; Zhang, H.; Hu, Q.; Wu, Z. Offline DRL for Price-Based Demand Response: Learning From Suboptimal Data and Beyond. IEEE Trans. Smart Grid 2024, 15, 4618–4635.
  14. Zheng, W.; Pi, R.; Zhong, X.; Yang, C. Demand response for home energy management systems: A novel dual-agent DRL approach. Energy Syst. 2025.
  15. Xiong, K.; Hu, W.; Cao, D.; Zhang, G.; Chen, Z.; Blaabjerg, F. A novel two-level deep reinforcement learning enabled game approach for incentive-based distributed voltage regulation with participation of autonomous photovoltaic inverters. Energy 2025, 324, 135934.
  16. Perera, A.T.D.; Kamalaruban, P. Applications of reinforcement learning in energy systems. Renew. Sustain. Energy Rev. 2021, 137, 110618.
  17. Luo, X.; Akhtar, Z.; Lee, C.K.; Chaudhuri, B.; Tan, S.-C.; Hui, S.Y.R. Distributed Voltage Control with Electric Springs: Comparison with STATCOM. IEEE Trans. Smart Grid 2015, 6, 209–219.
  18. Lee, C.-K.; Liu, H.; Tan, S.-C.; Chaudhuri, B.; Hui, S.Y.R. Electric Spring and Smart Load: Technology, System-Level Impact and Opportunities. IEEE J. Emerg. Sel. Top. Power Electron. 2020, 9, 6524–6544.
  19. Chen, T.; Liu, Y.; Tan, S.-C.; Hui, S.Y.R. Distributed Cooperative Control of Multiple DC Electric Springs for Voltage Regulation in DC Microgrids. IEEE Trans. Ind. Electron. 2018, 65, 5520–5530.
  20. Saha, S.; Dutta, S. Electric springs for coordinated voltage and frequency regulation in multi-area interconnected power systems. Sci. Rep. 2025, 15, 2847.
  21. Akhtar, Z.; Chaudhuri, B.; Hui, S.Y.R. Primary Frequency Control Contribution From Smart Loads Using Reactive Compensation. IEEE Trans. Smart Grid 2015, 6, 2356–2365.
  22. Tapia-Tinoco, G.; Valencia-Rivera, G.H.; Valtierra-Rodriguez, M.; Garcia-Perez, A.; Granados-Lieberman, D. Optimal Placement of Electric Springs in Unbalanced Distribution Networks Using Improved Backward/Forward Sweep Method Based Genetic Algorithm. J. Mod. Power Syst. Clean Energy 2025, 13, 940–952.
  23. Tapia-Tinoco, G.; Granados-Lieberman, D.; Rodriguez-Alejandro, D.A.; Valtierra-Rodriguez, M.; Garcia-Perez, A. A Robust Electric Spring Model and Modified Backward Forward Solution Method for Microgrids with Distributed Generation. Mathematics 2020, 8, 1326.
  24. Wang, Q.; Cheng, M.; Chen, Z.; Wang, Z. Steady-State Analysis of Electric Springs with a Novel δ Control. IEEE Trans. Power Electron. 2015, 30, 7159–7169.
  25. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
  26. Jaber, Y.; Dharmasena, P.; Nassif, A.; Nassif, N. Hyperparameter Optimization of Neural Networks Using Grid Search for Predicting HVAC Heating Coil Performance. Buildings 2025, 15, 2753.
  27. Baran, M.; Wu, F. Network reconfiguration in distribution systems for loss reduction and load balancing. IEEE Trans. Power Deliv. 1989, 4, 1401–1407.
  28. Ur Rehman, A.; Ali, M.; Iqbal, S.; Habib, S.; Shafiq, A.; Elbarbary, Z.M.; Barnawi, A.B. Transition towards a sustainable power system: MA-DA&DC framework based voltage control in high PV penetration networks. Energy Rep. 2023, 9, 5922–5936.
  29. Electric Reliability Council of Texas, Inc. ERCOT Load Profiling. 2025. Available online: https://www.ercot.com/mktinfo/loadprofile (accessed on 1 April 2025).
  30. National Renewable Energy Laboratory. NREL Grid Modernization. 2025. Available online: https://www.nrel.gov/grid (accessed on 1 April 2025).
  31. ANSI C84.1-2020; American National Standard for Electric Power Systems and Equipment—Voltage Ratings (60 Hertz). American National Standards Institute: Washington, DC, USA, 2020.
  32. Mirjalili, S.; Mirjalili, S.M.; Lewis, A. Grey Wolf Optimizer. Adv. Eng. Softw. 2014, 69, 46–61.
  33. Yarpiz. YPEA: Yarpiz Evolutionary Algorithms. Available online: https://yarpiz.com/477/ypea-yarpiz-evolutionary-algorithms (accessed on 1 April 2025).
Figure 1. Integration of the ES circuit configuration into the ADN. Adapted from [22].
Figure 2. DRL-based control framework architecture.
Figure 3. Flowchart of the proposed DDPG training.
Figure 4. Modified IEEE 33 radial distribution feeder incorporating DGs and ESs. Adapted from [27].
Figure 5. Load profile behavior (residential consumption). Based on ERCOT data [29].
Figure 6. Photovoltaic panel and wind turbine active power injection profiles. Based on NREL data [30].
Figure 7. DDPG agent training curve.
Figure 8. Bus voltage results over the 24-h simulation period: (a) Minimum bus voltage per hour, (b) Voltage profiles over 96 time intervals (15-min resolution).
Figure 9. Performance evaluation of the ESs: (a) Reactive power response, (b) Voltage profile distribution of all test cases.
Figure 10. Mean voltage deviation comparison.
Table 1. Comparison of voltage regulation approaches in ADNs.
| Characteristic | STATCOM [17] | ESS [14] | Electric Spring [17] |
|---|---|---|---|
| Energy storage required | No | Yes (large) | No (minimal DC-link) |
| Capital cost | High | Very high | Moderate |
| Response time | Fast | Medium | Fast |
| Scalability | Limited | Limited | High |
| Active power capability | Limited | Yes | No (reactive only) |
| Installation complexity | High | Moderate | Low |
Table 2. Comparison between DDPG and TD3 algorithms.
| Feature | DDPG | TD3 |
|---|---|---|
| Number of critics | 1 | 2 (twin critics) |
| Q-value estimation | Single critic | Minimum of two critics |
| Policy update frequency | Every step | Delayed (every d steps) |
| Target action noise | None | Clipped Gaussian |
| Overestimation bias | Prone to overestimation | Reduced through clipping |
| Training stability | Moderate | Improved |
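The key distinction in the table, TD3's clipped double-Q target versus DDPG's single-critic target, can be sketched in a few lines of Python. This is an illustrative sketch only; the function names and numeric values are hypothetical and do not come from the paper's MATLAB implementation.

```python
def ddpg_target(reward, gamma, q_next):
    # DDPG: a single critic estimates the next-state Q-value directly,
    # which makes the target prone to overestimation bias.
    return reward + gamma * q_next

def td3_target(reward, gamma, q1_next, q2_next):
    # TD3: take the minimum of two critics ("clipped double-Q")
    # to reduce overestimation bias.
    return reward + gamma * min(q1_next, q2_next)

if __name__ == "__main__":
    r, gamma = 1.0, 0.99
    q1, q2 = 10.0, 8.0  # the two critics disagree on the next-state value
    print(ddpg_target(r, gamma, q1))      # uses the (possibly inflated) estimate
    print(td3_target(r, gamma, q1, q2))   # uses the more conservative estimate
```

When the critics disagree, the TD3 target is always the more conservative of the two, which is the mechanism behind the "Reduced through clipping" and "Improved" entries in the table.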
Table 3. Sensitivity analysis of reward function threshold α .
| Threshold α | Mean ΔVD̄ (p.u.) | Convergence (Episodes) | Action Std. Dev. | Training Time (s) |
|---|---|---|---|---|
| 0.5 | – | – | – | – |
| 0.6 | 0.0157 | 232 | 0.7386 | 913.7078 |
| 0.7 | 0.0159 | 100 | 0.5570 | 225.3243 |
| 0.8 | 0.0211 | 182 | 0.2182 | 152.0885 |
| 0.9 | 0.0222 | <5 | 0.1705 | 160.4302 |
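The sensitivity study above varies the threshold α of the region-based penalized reward defined in the body of the paper. The sketch below shows one plausible shape for such a function, in which deviations inside a tolerance region contribute a small cost proportional to the deviation while buses outside the region incur a large fixed penalty. The function name, the penalty weight, and the interpretation of α as the region boundary are assumptions for illustration, not the paper's exact formulation.

```python
def region_based_reward(voltages_pu, alpha, penalty=10.0):
    # Hypothetical region-based penalized reward (illustrative sketch only).
    # Inside the region (|V - 1.0| <= alpha): small cost proportional to the
    # deviation, gently shaping the agent toward 1.0 p.u.
    # Outside the region: a large fixed penalty per violating bus, strongly
    # discouraging excursions beyond the tolerance band.
    reward = 0.0
    for v in voltages_pu:
        dev = abs(v - 1.0)
        if dev <= alpha:
            reward -= dev
        else:
            reward -= penalty
    return reward
```

Under a form like this, the trade-off in the table is intuitive: a tighter region penalizes more states and yields lower final deviation at the cost of longer, noisier training, whereas a looser region converges quickly but tolerates larger deviations.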
Table 4. Hyperparameters used for each DRL agent.
| Hyperparameter | TD3 | DDPG-A | DDPG-B |
|---|---|---|---|
| Discount factor | 0.99 | 0.99 | 0.9 |
| Experience buffer length | 1 × 10^4 | 1 × 10^4 | 1 × 10^7 |
| Mini-batch size | 64 | 64 | 128 |
| Target smooth factor τ | 0.0050 | 1 × 10^-3 | 1 × 10^-3 |
| Sample time | 1 | 1 | 1 |
| Exploration noise | Gaussian | Ornstein–Uhlenbeck (0.3) | Ornstein–Uhlenbeck (0.1) |
| Actor learning rate | 1 × 10^-6 | 1 × 10^-6 | 5 × 10^-6 |
| Critic learning rate | 1 × 10^-5 | 1 × 10^-5 | 5 × 10^-5 |
| Variance decay rate | 0 | 0 | 10^-5 |
| Hidden layer 1 (critic) | 256 | 256 | 256 |
| Hidden layer 2 (critic) | 256 | 256 | 128 |
| Hidden layer 1 (actor) | 256 | 256 | 400 |
| Hidden layer 2 (actor) | 256 | 256 | 300 |
| Number of episodes | 500 | 500 | 200 |
| Time steps per episode | 500 | 500 | 200 |
Table 5. Properties of the IEEE 33-bus radial distribution feeder.
| Property | Value |
|---|---|
| Number of buses | 33 |
| Number of loads | 32 |
| Total active power demand (kW) | 3715 |
| Total reactive power demand (kVAr) | 2300 |
| Number of photovoltaic units | 4 |
| Number of wind turbine units | 3 |
| Total DG active power generation capacity (kW) | 2229 |
| Number of ES units | 12 |
Table 6. Comparison of solution methodologies.
| Algorithm | Max ΔVD̄ (p.u.) | Min ΔVD̄ (p.u.) | Execution Time (min) |
|---|---|---|---|
| DDPG-B | 0.0306 | 0.0013 | 0.0159 |
| GA | 0.0288 | 0.0006 | 2.3629 |
| GWO | 0.0290 | 0.0006 | 2.3898 |
| PSO | 0.0289 | 0.0006 | 3.5140 |
Table 7. IEEE 69-bus scalability summary.
| Case | Device | No. | Location (Bus) | Max Power (kW) | ΔVD̄ (p.u.) | Reduction (%) |
|---|---|---|---|---|---|---|
| Base case | PV gen. | 3 | 11, 49, 59 | 760.4, 760.4, 760.4 | 0.0076 | – |
| DDPG-B | ES | 5 | 12, 21, 50, 61, 64 | 116, 91.2, 307.7, 995.2, 181.6 | 0.0046 | 38.49 |
|  | PV gen. | 3 | 11, 49, 59 | 760.4, 760.4, 760.4 |  |  |
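The metric reported in Tables 6 and 7 is the mean voltage deviation ΔVD̄. A minimal sketch of how such a metric and the tabulated percentage reduction can be computed is given below; the definitions are assumed (the paper's exact formulation may, for instance, average over both buses and time steps). Applied to the Table 7 values, going from 0.0076 to 0.0046 p.u. gives roughly a 39.5% reduction, close to the reported 38.49%, with the small gap plausibly due to rounding of the tabulated deviations.

```python
def mean_voltage_deviation(voltages_pu):
    # Mean absolute deviation of bus voltages from the 1.0 p.u. reference
    # (assumed definition of the deviation metric in the tables).
    return sum(abs(v - 1.0) for v in voltages_pu) / len(voltages_pu)

def reduction_percent(base_dev, controlled_dev):
    # Percentage reduction of the deviation metric relative to the
    # uncontrolled base case.
    return 100.0 * (base_dev - controlled_dev) / base_dev

# Table 7 values: base case 0.0076 p.u., DDPG-B 0.0046 p.u.
print(round(reduction_percent(0.0076, 0.0046), 1))  # prints 39.5
```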