Article

Load Frequency Control of Power Systems with an Energy Storage System Based on Safety Reinforcement Learning

1 State Grid Shandong Electric Power Research Institute, Ji'nan 250003, China
2 School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an 710049, China
3 School of Future Technology, Xi'an Jiaotong University, Xi'an 710049, China
4 School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an 710049, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(6), 1897; https://doi.org/10.3390/pr13061897
Submission received: 28 March 2025 / Revised: 25 May 2025 / Accepted: 4 June 2025 / Published: 16 June 2025

Abstract: Load frequency control (LFC) is a critical component in power systems that is employed to stabilize frequency fluctuations and ensure power quality. As energy storage systems (ESSs) are increasingly integrated into the grid, managing additional constraints has become more challenging. To address these challenges, this paper proposes a safety reinforcement learning-based approach that incorporates ESSs into the LFC framework. By formulating a constrained Markov decision process (CMDP), this approach overcomes the limitations of conventional Markov decision processes (MDPs) by explicitly handling system constraints. Furthermore, a long short-term memory (LSTM)-based cost prediction critic network is introduced to improve the accuracy of cost predictions, and a primal-dual deep deterministic policy gradient (PD-DDPG) algorithm is employed to solve the CMDP. Simulation results demonstrate significant improvements: a 58.2% faster settling time, a 72.5% reduction in peak frequency deviation, and a 68.2% lower mean absolute error while maintaining all operational constraints.

1. Introduction

Load frequency control (LFC) serves as a fundamental mechanism for maintaining power system stability by continuously adjusting generator outputs to mitigate frequency deviations caused by load fluctuations and disturbances [1,2,3]. The integration of energy storage systems into modern grids has opened new avenues for enhancing frequency regulation performance. Unlike conventional generation units, an ESS can provide rapid power injection/absorption to counteract frequency anomalies, offering superior dynamic response characteristics [4,5,6]. Among various ESS technologies, flywheel energy storage systems have emerged as particularly promising solutions for LFC applications due to their exceptional power density, ultrafast response capability, and virtually unlimited cycle life.
Recent studies have identified several technical challenges in LFC systems. Research on thermoelectric modules [7] has demonstrated the significant impact of auxiliary system losses on net energy output. Investigations of interior permanent magnet synchronous machines control systems [8] have revealed that conventional approaches often require additional sensors or observers, increasing system complexity. Furthermore, non-intrusive load monitoring studies [9,10] have highlighted the need for advanced adaptive control capabilities to handle spatiotemporal diversity in load patterns. Significant advancements in related fields provide valuable insights for LFC optimization. Impedance analysis of grid-connected converters [11] emphasizes the critical role of accurate modeling for small-signal stability. Developments in sustainable structural batteries [12] and virtual power plant operation [13] demonstrate effective multi-objective coordination for balancing economic and security requirements. Particularly, case studies of battery-integrated renewable energy systems [14] have proven the effectiveness of bi-level optimization frameworks in resolving operational target conflicts.
However, traditional control paradigms face inherent limitations. Conventional LFC approaches are usually developed based on physical dynamic models of power systems and require high real-time computational resources. These methods often rely on linearized models and classical control techniques, which may struggle to handle the increasing complexity of modern power grids. To improve system performance, the authors of [15] proposed a unified PID tuning method for LFC, based on the two-degree-of-freedom internal model control approach, enhancing robustness and adaptability to system variations. The authors of [16] introduced a direct-indirect adaptive fuzzy control method, utilizing local frequency deviations and tie-line power deviations to dynamically adjust control strategies. The authors of [17] employed a full-order generalized state observer in a two-layer active disturbance rejection control framework, integrating equivalent input disturbance compensation to improve disturbance rejection capabilities and ensure high dynamic performance. The authors of [18] developed a robust control scheme for multi-area power systems using second-order sliding mode control combined with an extended disturbance observer, treating load variations and tie-line power deviations as lumped disturbances to enhance system stability. The authors of [19] investigated a model predictive control approach for interconnected power systems, utilizing a simplified Nordic power system model to optimize frequency control while accounting for operational constraints such as tie-line power flow limits and generation rate constraints. The authors of [20] presented a comprehensive approach integrating system modeling, simulation, and optimization with advanced control techniques; their work combined particle swarm optimization with proportional-integral-derivative controllers to enhance LFC system performance. The authors of [21] developed a novel tube-based distributed model predictive control scheme, comprising a nominal MPC component and an ancillary feedback mechanism, to achieve optimal inter-area generation coordination in multi-area power systems. Nevertheless, the growing integration of energy storage systems imposes supplementary operational constraints that are inherently difficult to capture through conventional physical modeling approaches. These emerging constraints (i) substantially elevate the complexity of LFC system design and (ii) pose significant challenges for real-time optimization processes, particularly under high renewable penetration scenarios.
To address these challenges, reinforcement learning (RL), including deep reinforcement learning (DRL), has emerged as a promising solution due to its powerful searching and learning capabilities in uncertain and complex environments. Unlike conventional model-based controllers, RL-based approaches can adaptively learn optimal control policies without requiring an explicit physical model of the system. Notable research efforts have been made in this area. The authors of [22] pioneered a correlated Q(λ) learning algorithm to optimize cooperative equilibrium control strategies in multi-area LFC systems. The authors of [23] developed a hierarchical reinforcement learning architecture featuring dual agent modules (estimator and controller), establishing an intelligent frequency regulator with enhanced uncertainty handling capabilities. The authors of [24] devised a multi-agent reinforcement learning framework employing optimized discretized joint action spaces to achieve precision LFC performance. The authors of [25] introduced a bio-inspired artificial emotional reinforcement learning controller that synergizes mechanical logic with affective computing principles to improve adaptive control. The authors of [26] implemented a deep feature extraction paradigm using stacked denoising auto-encoders, enabling continuous action space optimization for LFC performance enhancement. The authors of [27] incorporated an attention-based critic network to selectively process environmental states, achieving a reduction in training time compared to conventional methods. To address challenges including actuator malfunctions, system uncertainties, and limited communication bandwidth, the authors of [28] developed innovative event-triggered LFC strategies with reliability guarantees. In [29], a distributed observer-based framework was established for event-triggered LFC in multi-area power systems vulnerable to cyber threats. The authors of [30] proposed a model-assisted reinforcement learning approach, where an emulator network approximates power system dynamics and zeroth-order optimization enables gradient estimation for control policy generation.
The existing studies predominantly employ direct control strategies, wherein RL agents generate operational commands through real-time frequency signal processing. While these approaches have exhibited satisfactory performance in power system implementations, purely data-driven deep reinforcement learning (DRL) methods encounter fundamental limitations in precisely modeling the nonlinear dynamics of power systems during the training phase. This modeling inaccuracy may consequently yield either suboptimal control decisions or, in severe cases, system instability.
In light of these challenges, this paper proposes a safety reinforcement learning-based approach to enhance LFC performance in power systems integrated with energy storage systems. Specifically, we formulate the problem as a Constrained Markov Decision Process (CMDP), which extends the conventional MDP framework by incorporating explicit constraint handling through a cost function. This formulation enables the proposed method to effectively balance frequency regulation performance with operational constraints. Furthermore, a long short-term memory (LSTM)-based cost prediction critic network is introduced, leveraging supervised learning techniques to improve cost prediction accuracy. To solve the CMDP, we employ a primal-dual Deep Deterministic Policy Gradient (PD-DDPG) algorithm, which efficiently accommodates nonlinear constraints while optimizing frequency control performance. The main novelty of this work lies in its integrated safety reinforcement learning framework for LFC, which fundamentally advances prior approaches through two key innovations: (1) a CMDP formulation that explicitly incorporates operational constraints into the optimization framework, providing provable safety guarantees compared to existing heuristic penalty-based methods, and (2) a PD-DDPG algorithm enhanced with an LSTM-based critic network, which not only solves the CMDP through Lagrangian relaxation but also captures temporal dependencies in cost prediction—addressing critical limitations of standard DDPG implementations and feedforward critics in current literature. This dual innovation enables rigorous constraint satisfaction while achieving superior dynamic performance in renewable-rich power systems.
The rest of the paper is organized as follows. The linearized LFC model and nonlinear behaviors used throughout the paper are introduced in Section 2. Section 3 presents the framework of the proposed method. Section 4 verifies the effectiveness and advantages of the proposed method with simulation results. Finally, the conclusion of this paper is presented in Section 5.

2. Power System Frequency Response Model

This section briefly introduces the linearized LFC model and nonlinear behaviors.

2.1. Linearized LFC Model

In this paper, the microgrid includes a diesel generator and a flywheel energy storage system, with the structure shown in Figure 1.
The whole LFC system is composed of a generator, governor, turbine, and other links involved in the secondary control loop [31]. The system dynamics can be expressed as Equations (1)–(3).
$$\Delta \dot{f} = \frac{1}{H}\left(\Delta P_G + \Delta P_E - \Delta P_L\right) - \frac{D}{H}\Delta f \quad (1)$$
$$\Delta \dot{P}_G = \frac{1}{T_t}\Delta X_G - \frac{1}{T_t}\Delta P_G \quad (2)$$
$$\Delta \dot{X}_G = \frac{1}{T_f}\Delta u_G - \frac{1}{R_f T_f}\Delta f - \frac{1}{T_f}\Delta X_G, \quad (3)$$
where $\Delta P_G$ and $\Delta u_G$ represent the diesel generator's output power and control signal, respectively. $\Delta X_G$ is the output signal of the governor. $T_f$ and $T_t$ are the time constants of the governor and the diesel generator, respectively. $R_f$ is the droop coefficient of the diesel generator. $\Delta f$ denotes the frequency deviation of the microgrid. $H$ represents the inertia constant, and $D$ represents the generator damping coefficient.
The flywheel energy storage system model is represented as a first-order transfer function, with its dynamic model expressed as Equation (4).
$$\Delta \dot{P}_E = \frac{1}{T_E}\Delta u_E - \frac{1}{T_E}\Delta P_E, \quad (4)$$
where $\Delta P_E$ and $\Delta u_E$ represent the output power and control signal of the flywheel energy storage system, respectively. $T_E$ is the time constant of the flywheel energy storage system.
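For concreteness, the following is a minimal simulation sketch of the linearized dynamics in Equations (1)-(4), using forward Euler integration and parameter values from Table 1; the integration step and the step load disturbance profile are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Parameters from Table 1; the disturbance profile and step size are assumptions.
H, D = 14.22, 0.0                # inertia constant and damping (p.u./Hz)
T_f, T_t, T_E = 10.0, 0.10, 1.0  # governor, turbine, and ESS time constants (s)
R_f = 0.05                       # droop coefficient (Hz/p.u.)

def lfc_derivatives(x, u_G, u_E, dP_L):
    """x = [df, dP_G, dX_G, dP_E]; returns dx/dt per Equations (1)-(4)."""
    df, dP_G, dX_G, dP_E = x
    d_df   = (dP_G + dP_E - dP_L) / H - (D / H) * df   # Eq. (1)
    d_dP_G = (dX_G - dP_G) / T_t                       # Eq. (2)
    d_dX_G = (u_G - df / R_f - dX_G) / T_f             # Eq. (3)
    d_dP_E = (u_E - dP_E) / T_E                        # Eq. (4)
    return np.array([d_df, d_dP_G, d_dX_G, d_dP_E])

dt, x = 0.01, np.zeros(4)
trace = []
for k in range(int(100 / dt)):                 # 100 s open-loop horizon
    dP_L = 0.04 if k * dt >= 10.0 else 0.0     # hypothetical 0.04 p.u. step load
    x = x + dt * lfc_derivatives(x, u_G=0.0, u_E=0.0, dP_L=dP_L)
    trace.append(x[0])                         # record frequency deviation
```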

2.2. Nonlinear Behaviors

For practical application of the proposed method, it is necessary to model the nonlinear characteristics that physical dynamics impose on power systems. This paper considers two nonlinear power characteristics of the diesel generator: the generation rate constraint (GRC) and the power increment change constraint, as shown in Equations (5) and (6).
$$\Delta P_G = \min\left(\Delta P_G^{\max}, \max(0, \Delta P_G)\right) + \max\left(\Delta P_G^{\min}, \min(0, \Delta P_G)\right) \quad (5)$$
$$\Delta P_G = \int \left[\min\left(\overline{\delta}, \max\left(0, \frac{dP_G(t)}{dt}\right)\right) + \max\left(\underline{\delta}, \min\left(0, \frac{dP_G(t)}{dt}\right)\right)\right] dt, \quad (6)$$
where $\max(x, y)$ and $\min(x, y)$ return the larger and the smaller of $x$ and $y$, respectively. $\overline{\delta}$ and $\underline{\delta}$ represent the upper and lower limits of the diesel generator's power ramp rate, and $\Delta P_G^{\max}$ and $\Delta P_G^{\min}$ represent the upper and lower limits of its power increment. The power increment constraint of the flywheel energy storage system is also considered, as shown in Equation (7).
$$\Delta P_E = \min\left(\Delta P_E^{\max}, \max(0, \Delta P_E)\right) + \max\left(\Delta P_E^{\min}, \min(0, \Delta P_E)\right), \quad (7)$$
where $\Delta P_E^{\max}$ and $\Delta P_E^{\min}$ represent the upper and lower limits of the power increment of the flywheel energy storage system, respectively.
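To make the saturation logic concrete, here is a small sketch of how Equations (5)-(7) can be enforced in simulation; the symmetric lower ramp limit is an assumption, since Table 1 lists only the upper value.

```python
# Limit values from Table 1; the lower ramp limit is assumed symmetric.
dPG_max, dPG_min = 0.03, -0.03      # diesel power increment limits (p.u.)
dPE_max, dPE_min = 0.025, -0.025    # flywheel power increment limits (p.u.)
ramp_up, ramp_dn = 0.0017, -0.0017  # GRC ramp limits (p.u./s)

def clamp_increment(dP, lo, hi):
    """Equations (5) and (7): saturate a power increment to [lo, hi]."""
    return min(hi, max(0.0, dP)) + max(lo, min(0.0, dP))

def apply_grc(dP_G_prev, dP_G_cmd, dt):
    """Equation (6): rate-limit the diesel generator's power change."""
    rate = (dP_G_cmd - dP_G_prev) / dt
    rate = min(ramp_up, max(0.0, rate)) + max(ramp_dn, min(0.0, rate))
    return dP_G_prev + rate * dt

# Example: a commanded 0.05 p.u. increment is first clamped, then rate-limited.
dP_G = apply_grc(0.0, clamp_increment(0.05, dPG_min, dPG_max), dt=0.01)
```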

3. Main Results

3.1. Constrained Markov Decision Process

Figure 2 shows the framework of the proposed Primal-Dual DDPG. The LFC model considering the flywheel energy storage system is transformed into a CMDP. In a conventional MDP, a control strategy that violates the constraints can only be discouraged through a negative reward, and the appropriate size of this penalty is difficult to define. The CMDP extends the MDP with a cost function, which can better balance future discounted rewards with control safety constraints.
In the Markov decision process, at each time step the agent observes the current state $s_t$ of the power system, performs an action $a_t$, obtains a reward $r_t$ and a cost $c_t$, and transitions to the next state $s_{t+1}$, until the total control time period is reached. Based on the LFC problem considering the flywheel energy storage system, the CMDP for each time step is constructed as follows:
The state space of the CMDP is defined as Equation (8).
$$s_t = \left[\Delta f_t,\; \int \Delta f\, dt,\; \frac{d\Delta f}{dt}\right], \quad (8)$$
where $\int \Delta f\, dt$ and $d\Delta f/dt$ represent the integral and the derivative of the frequency deviation, respectively.
The action space of the CMDP is defined as Equation (9).
$$a_t = \left[\Delta u_G^t,\; \Delta u_E^t\right]. \quad (9)$$
The reward function of the CMDP is defined as Equation (10).
$$r_t = \begin{cases} \alpha, & |\Delta f_t| < 0.02\ \mathrm{Hz} \\ \xi_1 \Delta f_t^2, & 0.02\ \mathrm{Hz} \le |\Delta f_t| < 0.05\ \mathrm{Hz} \\ \xi_2 \Delta f_t^2, & 0.05\ \mathrm{Hz} \le |\Delta f_t| < 0.10\ \mathrm{Hz} \\ \xi_3 \Delta f_t^2, & 0.10\ \mathrm{Hz} \le |\Delta f_t| < 0.15\ \mathrm{Hz} \\ \xi_4 \Delta f_t^2, & 0.15\ \mathrm{Hz} \le |\Delta f_t| \le 0.20\ \mathrm{Hz} \\ \beta, & |\Delta f_t| > 0.20\ \mathrm{Hz}, \end{cases} \quad (10)$$
where $\alpha > 0$ is the positive reward, given only when the frequency deviation is below 0.02 Hz. $\xi_1$, $\xi_2$, $\xi_3$, and $\xi_4$ are the penalty coefficients for the corresponding frequency deviation intervals, with $\xi_4 < \xi_3 < \xi_2 < \xi_1 < 0$. The large penalty $\beta$ is applied when the frequency deviation exceeds 0.20 Hz.
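A direct transcription of the tiered reward in Equation (10) is given below; the numerical values of $\alpha$, $\beta$, and $\xi_1$ through $\xi_4$ are hypothetical placeholders chosen only to satisfy the stated ordering constraints.

```python
ALPHA, BETA = 1.0, -10.0        # assumed: alpha > 0, beta a large penalty
XI = [-1.0, -2.0, -5.0, -8.0]   # assumed: xi_4 < xi_3 < xi_2 < xi_1 < 0

def reward(df):
    """Tiered reward of Equation (10); df is the frequency deviation in Hz."""
    a = abs(df)
    if a < 0.02:
        return ALPHA
    if a < 0.05:
        return XI[0] * df ** 2
    if a < 0.10:
        return XI[1] * df ** 2
    if a < 0.15:
        return XI[2] * df ** 2
    if a <= 0.20:
        return XI[3] * df ** 2
    return BETA
```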
The cost function of the CMDP is defined as Equation (11).
$$c_t = c_G^t + c_E^t + c_\delta^t$$
$$c_G^t = \max\left(0, \Delta P_G^t - \Delta P_G^{\max}\right) + \max\left(0, \Delta P_G^{\min} - \Delta P_G^t\right)$$
$$c_E^t = \max\left(0, \Delta P_E^t - \Delta P_E^{\max}\right) + \max\left(0, \Delta P_E^{\min} - \Delta P_E^t\right)$$
$$c_\delta^t = \max\left(0, \frac{dP_G^t}{dt} - \overline{\delta}\right) + \max\left(0, \underline{\delta} - \frac{dP_G^t}{dt}\right), \quad (11)$$
where $c_G^t$ and $c_E^t$ quantify violations of the power increment limits of the diesel generator and the flywheel energy storage system, respectively, and $c_\delta^t$ quantifies violations of the diesel generator's ramp-rate (GRC) limits.
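The violation cost of Equation (11) maps directly to code; a minimal sketch follows, with limit values from Table 1 and a symmetric lower ramp limit assumed.

```python
def cost(dP_G, dP_E, ramp_G):
    """Equation (11): total constraint-violation cost at one time step."""
    c_G = max(0.0, dP_G - 0.03)     + max(0.0, -0.03 - dP_G)      # increment, diesel
    c_E = max(0.0, dP_E - 0.025)    + max(0.0, -0.025 - dP_E)     # increment, flywheel
    c_d = max(0.0, ramp_G - 0.0017) + max(0.0, -0.0017 - ramp_G)  # GRC, diesel
    return c_G + c_E + c_d
```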
The objective function of the CMDP is defined as Equation (12), which builds directly upon the standard CMDP framework established in [32,33].
$$\pi^* = \arg\max_\pi R(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^t r_t \,\middle|\, s_t, a_t\right] \quad \text{s.t.} \quad C(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^t c_t \,\middle|\, s_t, a_t\right] \le d, \quad (12)$$
where $\gamma \in [0, 1]$ is the discount factor, $\pi$ is the policy mapping from states to actions, $r_t$ is the reward at time $t$, $c_t$ is the cost at time $t$, $T$ is the total control time period, and $d$ is the cost budget. The objective is to maximize the future discounted reward while satisfying the control safety constraint.

3.2. LSTM-Based Cost Critic Network

A cost prediction network based on LSTM is proposed. This network takes the state and action of the frequency control model as inputs to predict the control cost at the current time step, thus realizing a nonlinear mapping from state–action pairs to costs. This method can capture the complex temporal dependencies between states and actions during the system’s operation, effectively improving the accuracy of cost prediction.
In this framework, a PID controller is used to control the LFC model considering the flywheel energy storage system. By collecting real-time information from the power system, a labeled dataset $D = \{(s_t, a_t, c_t)\}$ is constructed, where $s_t$ and $a_t$ are the inputs and outputs of the PID controller, respectively. The labeled dataset is used to train the LSTM network, thereby establishing a mapping from state and action to cost via supervised learning. During actual control, the cost calculated using Equation (11) is also used as the training label. Together with the dynamic response data of the flywheel energy storage system, this forms effective data pairs for constructing the LSTM model.
The cost prediction network is built on the LSTM architecture. It receives an input $x_t$ at time step $t$ and updates its internal state through the hidden state $h_t$ and memory cell $c_t^{LSTM}$. The state update equations are given in Equation (13).
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{c}_t^{LSTM} = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right)$$
$$c_t^{LSTM} = f_t \odot c_{t-1}^{LSTM} + i_t \odot \tilde{c}_t^{LSTM}$$
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t \odot \tanh\left(c_t^{LSTM}\right), \quad (13)$$
where $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, respectively; $\odot$ represents element-wise multiplication; $\sigma$ is the sigmoid activation function; $W_f$, $W_i$, $W_o$, and $W_c$ are weight matrices; and $b_f$, $b_i$, $b_o$, and $b_c$ are bias terms.
Once the updated hidden state $h_t$ is obtained, the output of the cost prediction network is mapped to the predicted cost via fully connected layers, as shown in Equation (14).
$$Q_c(s_t, a_t) = h^{(n)}\left(h^{(n-1)}\left(\cdots h^{(1)}(h_t)\right)\right), \quad (14)$$
where $h^{(1)}, \ldots, h^{(n)}$ denote the fully connected layers, with $h^{(n)}$ applying the output-layer activation. $Q_c(s_t, a_t)$ is the predicted cost.
The cost prediction network is trained using supervised learning. The training of the cost prediction network aims to minimize the difference between the actual control cost and the cost predicted by the network at each time step, as shown in Equation (15):
$$\min_{\theta_c^Q}\ \left(y - Q_c(s_t, a_t)\right)^2, \quad (15)$$
where $\theta_c^Q$ represents the parameters of the cost prediction network and $y$ is the label from the dataset.
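A minimal PyTorch sketch of the LSTM-based cost critic and its supervised training step (Equations (13)-(15)) is shown below; the layer sizes, sequence interface, and learning rate are hypothetical choices, not values from the paper.

```python
import torch
import torch.nn as nn

class LSTMCostCritic(nn.Module):
    """Maps a short history of state-action pairs to a predicted cost Q_c(s_t, a_t)."""
    def __init__(self, state_dim=3, action_dim=2, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + action_dim, hidden, batch_first=True)  # Eq. (13)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                  nn.Linear(64, 1))                             # Eq. (14)

    def forward(self, sa_seq):             # sa_seq: (batch, T, state_dim + action_dim)
        h_seq, _ = self.lstm(sa_seq)
        return self.head(h_seq[:, -1, :])  # predicted cost at the last step

critic_c = LSTMCostCritic()
opt = torch.optim.Adam(critic_c.parameters(), lr=1e-3)

def train_step(sa_seq, y):
    """Supervised regression of Equation (15); y: (batch, 1) labeled costs from Eq. (11)."""
    loss = nn.functional.mse_loss(critic_c(sa_seq), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```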

3.3. The Proposed Primal-Dual DDPG

A method based on Primal-Dual DDPG is proposed for solving CMDP. By introducing a Lagrange multiplier, the inequality constraint problem in the objective function is transformed into an unconstrained problem. Additionally, by utilizing a cost prediction neural network, the efficiency of the safe reinforcement learning algorithm is improved, enabling the update of the actor network.
The original CMDP problem is shown in Equation (12). To solve it, we introduce a Lagrange multiplier $\lambda$ and reformulate the problem as the unconstrained min–max optimization shown in Equation (16).
$$\mathcal{L}(\pi, \lambda) = R(\pi) - \lambda\left(C(\pi) - d\right). \quad (16)$$
The primal-dual optimization then becomes Equation (17):
$$\max_\pi \min_{\lambda \ge 0} \mathcal{L}(\pi, \lambda). \quad (17)$$
Based on PD-DDPG [34], the Lagrange multiplier is updated to penalize constraint violations, shown in Equation (18).
$$\lambda \leftarrow \max\left(0,\; \lambda + \eta_\lambda\left(C(\pi) - d\right)\right), \quad (18)$$
where $\eta_\lambda$ is the learning rate for $\lambda$. In our method, $C(\pi)$ is estimated by the LSTM-based cost critic network $Q_c(s_t, a_t)$. The policy $\pi$ is trained to maximize $\mathcal{L}(\pi, \lambda)$, effectively balancing reward maximization and constraint satisfaction.
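In code, the dual update of Equation (18) is a one-line projected step; the learning rate below is an assumed value.

```python
def update_dual(lmbda, cost_estimate, d, eta_lambda=1e-3):
    """Equation (18): increase lambda when the estimated cost exceeds the budget d."""
    return max(0.0, lmbda + eta_lambda * (cost_estimate - d))
```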
A deep reinforcement learning controller is constructed, consisting of the actor network $\mu(s|\theta^\mu)$, the reward critic network $Q_R(s, a|\theta_R^Q)$, the target actor network $\mu'(s|\theta^{\mu'})$, and the target reward critic network $Q_R'(s, a|\theta_R^{Q'})$. The parameters of the actor network $\theta^\mu$ and the reward critic network $\theta_R^Q$ are initialized and then copied to their respective target networks.
The interaction experience pool and dual variable λ are initialized. Interaction experiences between the actor network and the power system are obtained through simulation experiments to populate the experience pool.
The parameters of the actor network, reward critic network, target actor network, and target reward critic network are optimized based on the interaction experience pool. Specifically, the optimization process includes the following steps based on Equations (19)–(27).
1. Target reward critic value calculation: $K$ interaction experiences $\{(s_k, a_k, r_k, c_k, s_{k+1})\}_{k=1}^{K}$ are sampled from the experience pool, and the target reward critic value $Q_{\text{target},k}$ is computed as
$$Q_{\text{target},k} = r_k + \gamma Q_R'\!\left(s_{k+1}, \mu'(s_{k+1}|\theta^{\mu'})\,\big|\,\theta_R^{Q'}\right), \quad (19)$$
where $\theta^{\mu'}$ and $\theta_R^{Q'}$ denote the parameters of the target actor network and the target reward critic network, respectively, $\gamma$ is the discount factor, $r_k$ is the immediate reward at step $k$, and $s_{k+1}$ is the next state from the experience tuple $(s_k, a_k, r_k, c_k, s_{k+1})$.
2. Reward critic network update: The parameters of the reward critic network are updated by minimizing the loss function $L_R$:
$$L_R = \frac{1}{K}\sum_{k=1}^{K}\left(Q_{\text{target},k} - Q_R(s_k, a_k\,|\,\theta_R^Q)\right)^2. \quad (20)$$
3. Action gradient calculation: Based on the Lagrangian in Equation (16), the action gradient $d_a$ is computed as
$$d_a = \frac{1}{K}\sum_{k=1}^{K}\left(\nabla_{\theta^\mu} Q_R - \lambda\, \nabla_{\theta^\mu} Q_C\right), \quad (21)$$
where $\nabla_{\theta^\mu} Q_R$ and $\nabla_{\theta^\mu} Q_C$ are computed via the chain rule, as shown in Equations (22) and (23):
$$\nabla_{\theta^\mu} Q_R = \nabla_a Q_R\!\left(s, \mu(s|\theta^\mu)\,\big|\,\theta_R^Q\right)\nabla_{\theta^\mu}\mu(s|\theta^\mu)\Big|_{s=s_k} \quad (22)$$
$$\nabla_{\theta^\mu} Q_C = \nabla_a Q_C\!\left(s, \mu(s|\theta^\mu)\,\big|\,\theta_C^Q\right)\nabla_{\theta^\mu}\mu(s|\theta^\mu)\Big|_{s=s_k}, \quad (23)$$
where $\nabla_a Q_R$, $\nabla_a Q_C$, and $\nabla_{\theta^\mu}\mu(s|\theta^\mu)$ are obtained from the reward critic network, the cost prediction network, and the actor network, respectively, via backpropagation. The parameters of the actor network are then updated by gradient ascent on the Lagrangian:
$$\theta^\mu \leftarrow \theta^\mu + \eta_a \cdot d_a, \quad (24)$$
where $\eta_a$ is the learning rate of the actor network.
4. Dual variable gradient calculation: Based on Equation (18), the gradient of the dual variable $\lambda$ is calculated as
$$d_\lambda = \frac{1}{K}\sum_{k=1}^{K} Q_C\!\left(s_k, \mu(s_k|\theta^\mu)\,\big|\,\theta_C^Q\right) - d. \quad (25)$$
The dual variable is then updated by a projected gradient step that increases $\lambda$ when the cost constraint is violated:
$$\lambda \leftarrow \max\left(0, \lambda + \eta_\lambda d_\lambda\right), \quad (26)$$
where $\eta_\lambda$ is the learning rate of the dual variable.
5. Soft update of target networks: The target actor network and target reward critic network are updated using the soft update rule:
$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}, \qquad \theta_R^{Q'} \leftarrow \tau\theta_R^Q + (1-\tau)\theta_R^{Q'}, \quad (27)$$
where $\tau$ is a preset soft update parameter.
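The five steps above can be condensed into a single update routine. The following PyTorch sketch assumes the networks, optimizers, and replay buffer already exist; the network call signatures, the buffer API, and all hyperparameter values are hypothetical, and the cost critic is shown with a flat (s, a) interface for brevity.

```python
import torch

GAMMA, TAU, ETA_LAMBDA, D_BUDGET = 0.99, 0.1, 1e-3, 0.1   # assumed values

def pd_ddpg_update(actor, actor_tgt, critic_r, critic_r_tgt, critic_c,
                   buffer, actor_opt, critic_opt, lmbda, batch_size=64):
    s, a, r, c, s2 = buffer.sample(batch_size)   # hypothetical buffer API

    # Steps 1-2: reward critic regression toward the target value, Eqs. (19)-(20).
    with torch.no_grad():
        q_target = r + GAMMA * critic_r_tgt(s2, actor_tgt(s2))
    critic_loss = ((q_target - critic_r(s, a)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 3: ascend the Lagrangian w.r.t. the policy, Eqs. (21)-(24).
    actions = actor(s)
    actor_loss = -(critic_r(s, actions) - lmbda * critic_c(s, actions)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Step 4: dual variable update, Eqs. (25)-(26).
    with torch.no_grad():
        d_lambda = critic_c(s, actor(s)).mean().item() - D_BUDGET
    lmbda = max(0.0, lmbda + ETA_LAMBDA * d_lambda)

    # Step 5: soft update of the target networks, Eq. (27).
    for p, p_tgt in zip(actor.parameters(), actor_tgt.parameters()):
        p_tgt.data.mul_(1 - TAU).add_(TAU * p.data)
    for p, p_tgt in zip(critic_r.parameters(), critic_r_tgt.parameters()):
        p_tgt.data.mul_(1 - TAU).add_(TAU * p.data)
    return lmbda
```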

4. Case Study

The proposed PD-DDPG LFC method is evaluated on a power system with parameters detailed in Table 1, which also outlines the system constraints. For the controller updates in the simulations, the size of the memory replay buffer, the learning rate $\eta$, and the soft update constant $\tau$ for the target networks are set to 8000, 0.00005, and 0.1, respectively. DDPG [35] is used as a comparison algorithm; it applies different fixed penalty coefficients to penalize actions that violate constraints, with the penalized reward defined as
$$r_p = r_t - \rho c_t,$$
where ρ is the penalty coefficient, set to 0.001 and 0.005, respectively. A PID controller, whose data builds the LFC database, is also used for comparison, and its gains are tuned manually. The memory replay buffer is initialized by the LFC database. Figure 3 shows that the final converged loss of the LSTM-based network is significantly lower than that of the CNN-based network. Figure 4 shows the prediction cost comparison between CNN-based and LSTM-based cost critic networks, demonstrating that the LSTM-based network achieves better prediction cost performance than the CNN-based network.
Step disturbances are applied at the following times: a 0.04 p.u. disturbance at t = 100 s, a 0.06 p.u. disturbance at t = 250 s, a 0.05 p.u. disturbance at t = 400 s, and a 0.03 p.u. disturbance at t = 550 s. Figure 5 shows the frequency deviation of the power system. It can be observed that, although the DDPG method with a penalty factor ($\rho$ = 0.005) performs better than the proposed PD-DDPG method when handling positive step disturbances, it performs worse than even the PID controller when facing negative step disturbances. Overall, the proposed PD-DDPG method provides the best control performance, with a smaller overshoot in the frequency deviation and a faster recovery speed, demonstrating its superior ability to regulate frequency. As shown in Table 2, the proposed PD-DDPG achieves superior overall performance despite a 13.8% longer settling time (55.02 s) than DDPG ($\rho$ = 0.001) (48.33 s), delivering a 41.4% lower peak deviation (0.0061 p.u.) and a 30% lower MAE (0.0007 p.u.) while settling 58.2% faster than the conventional PID controller. This trade-off favors transient stability and robustness, since limiting overshoot is more critical than absolute settling time in practical power systems. Figure 6 and Figure 7 show the power output of the diesel generator and the flywheel energy storage system, respectively. Due to their different constraint conditions, the power output response curves of the two units exhibit distinct characteristics. Under the PD-DDPG control strategy, both the diesel generator and the flywheel energy storage system better respect these constraints during the update process. The proposed method coordinately reduces the power output fluctuations of both units, significantly reducing the amplitude of frequency deviation fluctuations and effectively improving frequency regulation performance.
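For reproducibility, the performance metrics of Table 2 can be computed from a simulated frequency trace as sketched below; the 2% settling band and the per-disturbance evaluation window are assumptions about the metric definitions, which the paper does not spell out.

```python
import numpy as np

def metrics(t, df, t_dist, t_end, band=0.02):
    """Settling time, peak deviation, and MAE for one disturbance window."""
    mask = (t >= t_dist) & (t < t_end)
    seg, ts = df[mask], t[mask]
    peak = np.max(np.abs(seg))
    mae = np.mean(np.abs(seg))
    inside = np.abs(seg) <= band * peak        # assumed 2% settling band
    # first instant after which the deviation stays inside the band
    settle_idx = next((i for i in range(len(seg)) if inside[i:].all()),
                      len(seg) - 1)
    return ts[settle_idx] - t_dist, peak, mae
```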

5. Conclusions

This paper proposes a safety reinforcement learning framework for load frequency control in modern power systems with energy storage integration, featuring a constrained Markov decision process formulation solved through a novel primal-dual deep deterministic policy gradient algorithm enhanced by an LSTM-based critic network. The proposed approach demonstrates superior constraint-handling capability and frequency regulation performance compared to conventional methods, particularly in renewable-rich power systems, while maintaining strict operational limits, thereby providing an effective solution for next-generation smart grid applications.

Author Contributions

Conceptualization, S.G. and Y.L.; methodology, S.G.; software, S.G. and X.C.; validation, Y.L. and X.C.; formal analysis, S.G.; investigation, Z.L. and E.L.; resources, Z.L. and E.L.; data curation, X.C.; writing—original draft preparation, S.G.; writing—review and editing, X.C.; visualization, X.C. and K.L.; supervision, Y.L.; project administration, S.G. and M.Z.; funding acquisition, Y.L., K.L. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Science and Technology Project of the State Grid Shandong Electric Power Research Institute, “Research and Application of Collaborative Control Technology for Multi-type Energy Storage and Temporal Complementarity” (No. 520626240009).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Song Gao, Yudun Li, Zhengtang Liang and Enren Liu were employed by the State Grid Shandong Electric Power Research Institute. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhao, C.; Topcu, U.; Li, N.; Low, S. Design and stability of load-side primary frequency control in power systems. IEEE Trans. Autom. Control 2014, 59, 1177–1189. [Google Scholar] [CrossRef]
  2. Shi, T.; Wang, C.; Chen, Z. The multi-point cooperative control strategy for electrode boilers supporting grid frequency regulation. Processes 2025, 13, 785. [Google Scholar] [CrossRef]
  3. Singh, V.P.; Kishor, N.; Samuel, P. Load frequency control with communication topology changes in smart grid. IEEE Trans. Ind. Inform. 2016, 12, 1943–1952. [Google Scholar] [CrossRef]
  4. El-Rifaie, A.M. A novel lyrebird optimization algorithm for enhanced generation rate-constrained load frequency control in multi-area power systems with proportional integral derivative controllers. Processes 2025, 13, 949. [Google Scholar] [CrossRef]
  5. Huynh, V.V.; Minh, B.L.N.; Amaefule, E.N.; Tran, A.-T.; Tran, P.T. Highly robust observer sliding mode based frequency control for multi area power systems with renewable power plants. Electronics 2021, 10, 274. [Google Scholar] [CrossRef]
  6. Zhang, C.-K.; Jiang, L.; Wu, Q.; He, Y.; Wu, M. Delay-dependent robust load frequency control for time delay power systems. IEEE Trans. Power Syst. 2013, 28, 2192–2201. [Google Scholar] [CrossRef]
  7. Miao, Z.; Meng, X.; Li, X.; Liang, B.; Watanabe, H. Enhancement of net output power of thermoelectric modules with a novel air-water combination. Appl. Therm. Eng. 2025, 258, 124745. [Google Scholar] [CrossRef]
8. Ding, S.; Liu, C.; Fan, Z.; Hang, J. Lumped parameter adaptation-based automatic MTPA control for IPMSM drives by using stator current impulse response. IEEE Trans. Energy Convers. 2025. [Google Scholar] [CrossRef]
  9. Lin, L.; Liu, J.; Huang, N.; Li, S.; Zhang, Y. Multiscale spatio-temporal feature fusion based non-intrusive appliance load monitoring for multiple industrial industries. Appl. Soft Comput. 2024, 167, 112445. [Google Scholar] [CrossRef]
10. Lin, L.; Ma, X.; Chen, C.; Xu, J.; Huang, N. Imbalanced industrial load identification based on optimized CatBoost with entropy features. J. Electr. Eng. Technol. 2024, 19, 4817–4832. [Google Scholar] [CrossRef]
  11. Rong, Q.; Hu, P.; Yu, Y.; Wang, D.; Cao, Y.; Xin, H. Virtual external perturbance-based impedance measurement of grid-connected converter. IEEE Trans. Ind. Electron. 2024, 72, 2644–2654. [Google Scholar] [CrossRef]
  12. Kusekar, S.K.; Pirani, M.; Birajdar, V.D.; Borkar, T.; Farahani, S. Toward the progression of sustainable structural batteries: State-of-the-art review. SAE Int. J. Sustain. Transp. Energy Environ. Policy 2025, 5, 283–308. [Google Scholar] [CrossRef]
  13. Akbari, E.; Naghibi, A.F.; Veisi, M.; Shahparnia, A.; Pirouzi, S. Multi-objective economic operation of smart distribution network with renewable-flexible virtual power plants considering voltage security index. Sci. Rep. 2024, 14, 19136. [Google Scholar] [CrossRef] [PubMed]
  14. Navesi, R.B.; Jadidoleslam, M.; Moradi-Shahrbabak, Z.; Naghibi, A.F. Capability of battery-based integrated renewable energy systems in the energy management and flexibility regulation of smart distribution networks considering energy and flexibility markets. J. Energy Storage 2024, 98, 113007. [Google Scholar] [CrossRef]
15. Tan, W. Unified tuning of PID load frequency controller for power systems via IMC. IEEE Trans. Power Syst. 2009, 25, 341–350. [Google Scholar] [CrossRef]
  16. Yousef, H.A.; Khalfan, A.-K.; Albadi, M.H.; Hosseinzadeh, N. Load frequency control of a multi-area power system: An adaptive fuzzy logic approach. IEEE Trans. Power Syst. 2014, 29, 1822–1830. [Google Scholar] [CrossRef]
  17. Liu, F.; Li, Y.; Cao, Y.; She, J.; Wu, M. A two-layer active disturbance rejection controller design for load frequency control of interconnected power system. IEEE Trans. Power Syst. 2015, 31, 3320–3321. [Google Scholar] [CrossRef]
  18. Liao, K.; Xu, Y. A robust load frequency control scheme for power systems based on second-order sliding mode and extended disturbance observer. IEEE Trans. Ind. Inform. 2017, 14, 3076–3086. [Google Scholar] [CrossRef]
  19. Ersdal, A.M.; Imsland, L.; Uhlen, K. Model predictive load-frequency control. IEEE Trans. Power Syst. 2015, 31, 777–785. [Google Scholar] [CrossRef]
20. Ogar, V.N.; Hussain, S.; Gamage, K.A. Load frequency control using the particle swarm optimisation algorithm and PID controller for effective monitoring of transmission line. Energies 2023, 16, 5748. [Google Scholar] [CrossRef]
21. Liu, X.; Wang, C.; Kong, X.; Zhang, Y.; Wang, W.; Lee, K.Y. Tube-based distributed MPC for load frequency control of power system with high wind power penetration. IEEE Trans. Power Syst. 2023, 39, 3118–3129. [Google Scholar] [CrossRef]
22. Yu, T.; Wang, H.; Zhou, B.; Chan, K.W.; Tang, J. Multi-agent correlated equilibrium Q(λ) learning for coordinated smart generation control of interconnected power grids. IEEE Trans. Power Syst. 2014, 30, 1669–1679. [Google Scholar] [CrossRef]
  23. Singh, V.P.; Kishor, N.; Samuel, P. Distributed multi-agent system-based load frequency control for multi-area power system in smart grid. IEEE Trans. Ind. Electron. 2017, 64, 5151–5160. [Google Scholar] [CrossRef]
  24. Daneshfar, F. Intelligent load-frequency control in a deregulated environment: Continuous-valued input, extended classifier system approach. IET Gener. Transm. Distrib. 2013, 7, 551–559. [Google Scholar] [CrossRef]
  25. Yin, L.; Yu, T.; Zhou, L.; Huang, L.; Zhang, X.; Zheng, B. Artificial emotional reinforcement learning for automatic generation control of large-scale interconnected power grids. IET Gener. Transm. Distrib. 2017, 11, 2305–2313. [Google Scholar] [CrossRef]
  26. Yan, Z.; Xu, Y. Data-driven load frequency control for stochastic power systems: A deep reinforcement learning method with continuous action search. IEEE Trans. Power Syst. 2018, 34, 1653–1656. [Google Scholar] [CrossRef]
  27. Yang, F.; Huang, D.; Li, D.; Lin, S.; Muyeen, S.; Zhai, H. Data-driven load frequency control based on multi-agent reinforcement learning with attention mechanism. IEEE Trans. Power Syst. 2022, 38, 5560–5569. [Google Scholar] [CrossRef]
  28. Zhang, M.; Dong, S.; Wu, Z.-G.; Chen, G.; Guan, X. Reliable event-triggered load frequency control of uncertain multiarea power systems with actuator failures. IEEE Trans. Autom. Sci. Eng. 2022, 20, 2516–2526. [Google Scholar] [CrossRef]
  29. Zhang, M.; Dong, S.; Shi, P.; Chen, G.; Guan, X. Distributed observer-based event-triggered load frequency control of multiarea power systems under cyber attacks. IEEE Trans. Autom. Sci. Eng. 2022, 20, 2435–2444. [Google Scholar] [CrossRef]
  30. Chen, X.; Zhang, M.; Wu, Z.; Wu, L.; Guan, X. Model-free load frequency control of nonlinear power systems based on deep reinforcement learning. IEEE Trans. Ind. Inform. 2024, 20, 6825–6833. [Google Scholar] [CrossRef]
  31. Bevrani, H. Robust Power System Frequency Control; Springer: Berlin/Heidelberg, Germany, 2014; Volume 4. [Google Scholar]
  32. Wachi, A.; Sui, Y. Safe reinforcement learning in constrained markov decision processes. In Proceedings of the International Conference on Machine Learning PMLR, Virtual, 13–18 July 2020; pp. 9797–9806. [Google Scholar]
  33. Altman, E. Constrained Markov Decision Processes; Routledge: London, UK, 2021. [Google Scholar]
  34. Liang, Q.; Que, F.; Modiano, E. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv 2018, arXiv:1802.06480. [Google Scholar]
  35. Liu, G.-X.; Liu, Z.-W.; Wei, G.-X. Model-free load frequency control based on multi-agent deep reinforcement learning. In Proceedings of the 2021 IEEE International Conference on Unmanned Systems, Beijing, China, 15–17 October 2021; pp. 815–819. [Google Scholar]
36. Morsali, J.; Zare, K.; Hagh, M.T. Appropriate generation rate constraint (GRC) modeling method for reheat thermal units to obtain optimal load frequency controller (LFC). In Proceedings of the 5th Conference on Thermal Power Plants (CTPP), Tehran, Iran, 10–11 June 2014; pp. 29–34. [Google Scholar]
Figure 1. Schematic diagram of the linearized LFC model with a diesel generator and a flywheel energy storage system.
Figure 2. Framework of the proposed Primal-Dual DDPG.
Figure 3. Training loss comparison of CNN-based vs. LSTM-based cost critic network.
Figure 4. Prediction cost comparison of CNN-based vs. LSTM-based cost critic network. Note: the LSTM-based network demonstrates superior prediction accuracy for system operational costs compared to the CNN-based approach.
Figure 5. Frequency deviation using different methods tested on the power system.
Figure 6. Power output of the diesel generator using different methods. Note: The slope demonstrates full compliance with the GRC, while maintaining strict output boundaries of [−0.03, 0.03] p.u. throughout operation.
Figure 7. Power output of the flywheel energy storage system using different methods. Note: The operational constraints strictly limit the FESS power output within [−0.025, 0.025] p.u.
Table 1. Parameters of the power system.

| Parameter | Value | Parameter | Value | Parameter | Value |
|---|---|---|---|---|---|
| $\Delta P_G^{\max}$ (p.u.) | 0.03 | $T_f$ (s) | 10 | $H$ (p.u./Hz) | 14.22 |
| $\Delta P_G^{\min}$ (p.u.) | −0.03 | $T_t$ (s) | 0.10 | $D$ (p.u./Hz) | 0 |
| $\Delta P_E^{\max}$ (p.u.) | 0.025 | $\overline{\delta}$ (p.u./s) | 0.0017 [36] | $R_f$ (Hz/p.u.) | 0.05 |
| $\Delta P_E^{\min}$ (p.u.) | −0.025 | $T_E$ (s) | 1 | | |
Table 2. Performance metrics of different methods.

| Method | Settling Time (s) | Peak (p.u.) | Mean Absolute Error (MAE) (p.u.) |
|---|---|---|---|
| Standard PID | 131.51 | 0.0222 | 0.0022 |
| DDPG [35] with $\rho$ = 0.001 | 48.33 | 0.0104 | 0.0010 |
| DDPG [35] with $\rho$ = 0.005 | 70.50 | 0.0153 | 0.0008 |
| Proposed PD-DDPG | 55.02 | 0.0061 | 0.0007 |
