1. Introduction
The global energy crisis poses a significant challenge and calls for an accelerated energy transition. Establishing a power system that relies predominantly on renewable energy sources is crucial for national energy security [1,2,3]. However, the integration of clean technologies such as wind power and photovoltaic generation into the power grid has diminished the system’s frequency regulation capability and exacerbated oscillation issues [4,5]. To regulate grid frequency, an automatic excitation adjustment device is commonly used on generators, but it can increase equipment costs. The most widely used technique for frequency regulation is load frequency control (LFC) [6], which plays a vital role in the stable operation of the power system by adjusting generating units to maintain the balance between the load and the system’s generation capacity [7,8,9].
Numerous researchers have focused on improving control schemes for LFC. Techniques such as PID control, model predictive control (MPC), and sliding mode control (SMC) have been proposed for LFC to ensure that each control area complies with LFC commands and to enhance the performance and stability of the power system [10,11,12]. However, these control methods depend heavily on the system model, and they involve complex mathematical calculations and optimization processes for parameter estimation and parameter tuning [13,14]. Consequently, data-driven control approaches have been proposed for multi-area LFC power systems.
With the increasing integration of renewable energy into power systems, the field of LFC faces significant challenges. The traditional centralized control approach, constrained by the inherent inertia of synchronous generators, may lack the flexibility required to respond promptly to these challenges. While centralized control imposes minimal constraints on control signals to achieve optimal performance, it suffers from a substantial computational burden, making it unsuitable for large-scale interconnected power systems with highly integrated subsystems. Furthermore, in multi-area interconnected systems, the control areas may belong to different entities, complicating the acquisition of global information. Therefore, there is a pressing need to design a distributed LFC method. Along these lines, a two-layer MPC is studied in [15], a distributed automatic LFC scheme in [16], and, in [17], a PI controller that uses the error between the feeder power flow and its scheduled value to estimate the change in the power reference of all distributed energy resource units. Consequently, there is an urgent need to devise a data-driven distributed LFC approach that not only responds to the dynamics and distributed characteristics of modern power systems but also acknowledges their inherent limitations and constraints.
In recent years, data-driven control approaches have demonstrated significant potential for coordinated frequency control in multi-area power systems [18,19,20]. These approaches are designed to optimize a control objective directly from the system’s observed responses. By employing an appropriate training strategy and defining an effective reward mechanism, data-driven control systems that leverage multi-agent reinforcement learning (RL) can achieve collaborative outcomes even in the absence of comprehensive communication among the control agents. In this vein, a deep reinforcement learning model based on a continuous action domain is proposed in [21] to enhance the LFC performance of single-area power systems. For multi-area interconnected power systems, a data-driven coordinated LFC method based on multi-agent deep reinforcement learning (MADRL) is introduced in [22]. In the MADRL framework, each agent, represented by a deep neural network, collaboratively adjusts the generation commands to achieve the global objectives of the multi-area power system. Reference [23] designed an Evolutionary Multi-agent Deep Meta-actor–critic (EMA-DMAC) algorithm, which introduces meta-reinforcement and evolutionary learning to enable rapid collaborative learning of group intelligence, thus improving the robustness and quality of the obtained LFC scheme. Moreover, a hierarchical reinforcement learning method based on the decomposition of value functions, which allows strategy optimization at multiple levels and thus hierarchical task-solving, is presented in [24,25]. However, the above methods demand substantial computational resources and incur significant communication overhead.
Recent studies have proposed various constrained and safe reinforcement learning frameworks, such as Lagrangian-based methods, constrained policy optimization approaches, and safety-aware multi-agent reinforcement learning techniques. Furthermore, several advanced MADRL algorithms have demonstrated promising scalability in large-scale multi-agent environments.
Motivated by the limitations of existing MADRL algorithms, such as the lack of constraint consideration and the limited processing capability for large-scale problems, this paper proposes a data-driven LFC scheme for renewable power systems. Specifically, a multi-agent Actor-Double-Critic deep reinforcement learning scheme is developed to ensure real-time scheduling that complies with the safe-operation constraints of the system within the LFC; Self-Critic and Cons-Critic networks are used to calculate the action value and the action cost of the agents, respectively. The presented approach reduces training difficulty and mitigates the impact of sparse immediate rewards and safety constraint costs, improving the convergence speed of multi-agent training while ensuring real-time scheduling that meets the system safety operation constraints. Its control performance is compared with that of PI control and fuzzy PI control.
The remainder of this paper is structured as follows: the multi-area LFC system with the constrained Markov cooperative game model is introduced in Section 2. Sections 3 and 4 present the multi-agent Actor-Double-Critic deep reinforcement learning algorithm and the associated distributed optimization for multi-area LFC power systems. Section 5 presents the simulation results and comparative analysis. The conclusions of this work are given in Section 6.
2. Problem Statement
2.1. Load Frequency Control System Modelling
The multi-area LFC flow, which is investigated using the multi-agent Actor-Double-Critic deep reinforcement learning method, is described in Figure 1. The parameters of the i-th control area are listed in Table 1. For the designed LFC power system, the dynamics of the i-th control area are represented as follows:
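To make the kind of area dynamics referred to here concrete, the following minimal sketch integrates a commonly used textbook linearized model of one control area (governor, turbine, and generator-load block) with forward Euler. It is not taken from this paper: the model structure, the class `AreaParams`, the function `area_step`, and all numerical values are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AreaParams:
    # Hypothetical per-unit values, not the parameters of Table 1.
    M: float = 10.0   # equivalent inertia constant
    D: float = 1.0    # load damping coefficient
    Tg: float = 0.1   # governor time constant (s)
    Tt: float = 0.4   # turbine time constant (s)
    R: float = 0.05   # speed droop coefficient


def area_step(state, u, dP_load, dP_tie, p, dt=0.01):
    """One forward-Euler step of a textbook linearized area model.

    state = (df, dPm, dPg): frequency deviation, mechanical power
    deviation, and governor output deviation. u is the LFC command.
    """
    df, dPm, dPg = state
    ddf = (dPm - dP_load - dP_tie - p.D * df) / p.M
    ddPm = (dPg - dPm) / p.Tt
    ddPg = (u - df / p.R - dPg) / p.Tg
    return (df + dt * ddf, dPm + dt * ddPm, dPg + dt * ddPg)


if __name__ == "__main__":
    params = AreaParams()
    state = (0.0, 0.0, 0.0)
    for _ in range(6000):  # 60 s with a step load and no secondary control
        state = area_step(state, u=0.0, dP_load=0.1, dP_tie=0.0, p=params)
    print("frequency deviation after 60 s:", state[0])
```

Without a secondary control signal the frequency deviation settles at a nonzero offset determined by the droop and damping; removing this offset is the task of the LFC command `u`.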
2.2. Multi-Area LFC Power System with Constrained Markov Cooperative Game Model
In this paper, the MADRL is applied to enhance the generation speed of scheduling decisions. For the designed multi-area LFC power system, each control area is equipped with its own frequency controller, which is formulated within an MADRL framework. The system is divided into multiple areas, with an agent assigned to each area to make scheduling decisions for resource regulation. A multi-agent Actor-Double-Critic algorithm is designed to train the multi-agent system, ensuring that the scheduling decisions generated by the agents meet the safety and reliability requirements of system operation.
The system partitioning method influences the composition of adjustment resources within each area, consequently impacting the stability and convergence speed of MADRL training. The algorithm proposed in this paper assumes a predefined system partitioning result, with the system network topology and partitioning remaining unchanged. Additionally, it is assumed that agents can share operational state and action information during the centralized training process.
The Constrained Markov Cooperative Game (CMCG) model comprises the following elements: a set of agents M; the system state space S; the observable state space of each agent; the action space of each agent; the instant reward function of agent i; the global security constraint cost function C; the global real-time reward function R; the system state transition probability P; and the discount factor.
In the interaction process between the agents and the environment, agent i first observes the operating status of its own area and generates an action according to its policy function; this action is the regulation resource scheduling decision for that area. Then, according to the joint action of the agents, the system feeds back to each agent its immediate reward and the global safety constraint cost. Finally, the system transitions to the next state. State transition samples, including the current state, action, immediate reward, constraint cost, and next state of all agents, are collected and stored in the experience replay pool.
In the process of interacting with the environment, each agent utilizes samples from the experience replay pool to optimize the accuracy of the action-value and action-cost evaluation functions. Based on the optimization goals constructed by the action-value and action-cost evaluation functions, each agent updates its own policy function. This interaction process is repeated until all agents converge and no longer change their policy functions. At this point, the multi-agent policy function represents the real-time optimal scheduling strategy model for the system.
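The sample collection described above can be sketched as follows. This is a minimal illustration, assuming a simple in-memory buffer; the names `Transition` and `ReplayPool`, the field layout, and the capacity are hypothetical and are not taken from the paper.

```python
import random
from collections import deque
from typing import NamedTuple, Tuple


class Transition(NamedTuple):
    """One joint state-transition sample for all agents (assumed layout)."""
    obs: Tuple[Tuple[float, ...], ...]       # per-agent observations
    actions: Tuple[float, ...]               # per-agent actions
    rewards: Tuple[float, ...]               # per-agent immediate rewards
    cost: float                              # global safety constraint cost
    next_obs: Tuple[Tuple[float, ...], ...]  # per-agent next observations


class ReplayPool:
    """Fixed-capacity experience replay pool; oldest samples are dropped."""

    def __init__(self, capacity=100_000, seed=0):
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, transition):
        self._buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)
```

Each agent's critics would then be trained on mini-batches drawn with `sample`, while the actors keep interacting with the environment and calling `store`.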
In a multi-area LFC power system, the frequency of each area is influenced by the LFC schemes of the other areas. Therefore, the proposed MADRL approach for the multi-area LFC power system employs centralized learning, in which all agents are updated in each iteration. For each area’s LFC scheme, the frequency deviation, the area control error (ACE), and the external disturbances constitute the state space, as shown in the following equations:
For the action, the gradient of the centralized action-value function Q with respect to the control action is quantified. For agent i, taking the gradient with respect to its own control action, the following can be obtained:
The gradient of frequency deviation in various locations with respect to the control actions needs to be estimated based on the above formula.
During the interaction process, the system environment translates the actions output by the agent into scheduling decisions based on the established mapping relationship. This ensures that the scheduling decisions consistently adhere to the constraints of the regulating resources.
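As an illustration of the state and action handling described above, the sketch below builds a three-element observation (frequency deviation, ACE, disturbance) and maps a normalized actor output to a bounded regulation command. The ACE expression is the conventional tie-line-bias form; the function names, the frequency bias `beta`, and the limits `u_min` and `u_max` are assumptions, since the paper's own mapping is not reproduced here.

```python
def area_control_error(dP_tie, beta, df):
    """Conventional tie-line-bias ACE: tie-line deviation plus bias term."""
    return dP_tie + beta * df


def build_observation(df, dP_tie, beta, dP_load):
    """Observation of one area: (frequency deviation, ACE, disturbance)."""
    return (df, area_control_error(dP_tie, beta, df), dP_load)


def map_action(a, u_min, u_max):
    """Map a normalized action in [-1, 1] to a command in [u_min, u_max].

    Clipping first guarantees that the scheduling decision always respects
    the limits of the regulating resource (assumed box constraints).
    """
    a = max(-1.0, min(1.0, a))
    return u_min + 0.5 * (a + 1.0) * (u_max - u_min)
```

For example, `map_action(5.0, -0.2, 0.2)` is clipped to the upper limit, so an out-of-range actor output can never produce an infeasible command.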
For a multi-area LFC system, the frequency deviation and the ACE determine the frequency regulation performance. Accordingly, the reward function for the MADRL agent in each area of the LFC system is designed as follows:
where the coefficients are the weighting factors of the reward functions corresponding to the different control areas and control objectives.
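A reward of this kind can be sketched as a weighted penalty on the frequency deviation and the ACE of each area. The exact functional form used in the paper is not reproduced here; the absolute-value penalty, the function names, and the default weights `w_f` and `w_ace` below are assumptions made for illustration.

```python
def area_reward(df, ace, w_f=1.0, w_ace=0.5):
    """Assumed reward for one area: negative weighted absolute deviations."""
    return -(w_f * abs(df) + w_ace * abs(ace))


def global_reward(area_signals, weights):
    """Sum of area rewards; area_signals and weights are per-area pairs.

    area_signals: list of (df, ace) tuples, one per control area.
    weights: list of (w_f, w_ace) tuples, one per control area.
    """
    return sum(area_reward(df, ace, w_f, w_ace)
               for (df, ace), (w_f, w_ace) in zip(area_signals, weights))
```

With this sign convention the reward is zero only when both the frequency deviation and the ACE vanish, and it becomes more negative as either grows.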
The global optimization objective of the multi-agent cooperation alliance can be expressed as the following:
3. Multi-Agent Actor-Double-Critic Deep Reinforcement Learning Algorithm for LFC Power System
Because the global instantaneous reward cannot accurately reflect the contribution of the i-th agent’s strategy to the global optimization goal, optimizing each agent’s individual strategy during training may reduce the overall optimization objective. This can lead to conflicts between agents and reduce the training convergence speed. Therefore, using the global instantaneous reward as the immediate reward of each agent is not suitable for fostering a cooperative relationship among agents that maximizes the global optimization objective.
To improve the convergence of MADRL training, the cost-sharing idea of the Vickrey–Clarke–Groves (VCG) auction mechanism is applied to the design of the agents’ instant reward function, so that it accurately reflects the contribution of each agent’s strategy to the global optimization goal. The VCG mechanism maximizes social welfare by treating the losses that a bidder’s participation imposes on the other bidders as that bidder’s auction cost. This approach ensures that the optimal bidding strategy, which maximizes the utility of each individual bidder, also maximizes the overall social welfare of the bidder group.
In this paper, the frequency deviation and ACE in the area where agent i is located are regarded as the benefits of its participation in the cooperation.
The VCG cost of agent i is the difference between the global optimization objective when it does not participate in real-time scheduling and the value created by the other agents in the alliance when it does participate. Since agents make decisions based solely on the operating status of their respective regions, their action strategies are independent of one another. Therefore, whether agent i participates in scheduling does not change the strategies of the other agents. The VCG cost can be expressed as follows:
where the policy set is the one that does not contain agent i, and the associated term is the global immediate reward and penalty term when agent i does not participate in scheduling.
The utility of agent i participating in the cooperative alliance is the difference between its revenue and its VCG cost. Taking this utility as the value function of agent i gives
Therefore, the immediate reward function of agent i can be obtained as
It can be seen from the above formula that it reflects the impact of a single agent on the global optimization goal while considering other agents. This approach can alleviate conflicts when each agent optimizes its own strategy in a distributed manner and promote convergence in MADRL training.
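The verbal definitions above (VCG cost as the objective without agent i minus the value created by the others with agent i, and utility as benefit minus VCG cost) can be sketched numerically as follows. This is only an illustration of that bookkeeping; the function name `vcg_rewards`, the argument layout, and the way the "without agent i" objective would be obtained are assumptions.

```python
def vcg_rewards(benefits, global_without):
    """Per-agent utilities following the verbal VCG construction.

    benefits[i]: benefit created in agent i's own area when all agents
        participate (for example, a penalty on frequency deviation and ACE).
    global_without[i]: global objective obtained when agent i does not
        participate while the other agents keep their strategies.
    """
    total = sum(benefits)
    rewards = []
    for i, b_i in enumerate(benefits):
        others_value = total - b_i                  # value created by the others
        vcg_cost = global_without[i] - others_value
        rewards.append(b_i - vcg_cost)              # utility = benefit - cost
    return rewards


if __name__ == "__main__":
    # Two agents; objectives are negative penalties (hypothetical numbers).
    print(vcg_rewards([-1.0, -2.0], [-4.0, -3.5]))
```

Under these definitions each utility equals the global objective with agent i minus the global objective without it, which is why the reward measures an agent's marginal contribution rather than its local performance alone.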
Remark 1. The VCG mechanism does not theoretically guarantee convergence; however, it serves as a reward-shaping strategy that improves the alignment between local and global objectives, thereby facilitating cooperative learning and enhancing training stability. The centralized-training, decentralized-execution (CTDE) framework has been widely adopted in multi-agent reinforcement learning because it balances cooperative learning capability and practical deployment requirements.
4. Construction of Multi-Agent Distributed Optimization
When the multi-agents reach the Nash equilibrium state, the joint strategy can be regarded as the optimal solution of the global optimization objective function:
where the tolerance is a small value approaching 0, the safety threshold defines the acceptable operating region of the system, and the expectation term is the expected value of the cumulative global security constraint cost under the joint strategy. In this paper, a 0–1 state quantity is used to characterize the system frequency regulation in period t: 1 represents convergence of the frequency regulation, and 0 represents no convergence.
To solve the above equation, the Lagrange multiplier method is applied, converting it into an unconstrained minimax problem as shown in the following equation:
where λ is the Lagrange multiplier. The solution process of the above equation is as follows:
The two variables of the minimax problem are updated iteratively until no agent changes its policy function any longer. In this section, a Self-Critic network and a Cons-Critic network are applied to evaluate the action-value and the action-cost of each agent, respectively. Each network has its own weight parameters, and the outputs of the Self-Critic and Cons-Critic networks are the corresponding evaluation results.
To mitigate the impact of the sparsity of immediate rewards and security constraints on training convergence, the action-value and action-cost evaluation functions can be rewritten as follows:
These formulas can be derived as
Therefore, during the training process, only the frequency deviation and ACE samples are used to update the Self-Critic network, while the non-convergent samples are used to update the Cons-Critic network. The proposed Self-Critic and Cons-Critic structures are designed to alleviate the adverse effects of sparse learning signals by providing complementary local and coordination-oriented value evaluations. This dual-critic mechanism supplies richer feedback during policy learning, which contributes to improved learning efficiency and convergence behavior.
The Self-Critic and Cons-Critic networks can be updated according to the following equation:
where the two sample sets are, respectively, the frequency deviation convergence sample set and the non-convergence sample set in the experience replay pool D, and the two targets are the target evaluation results of the Self-Critic and Cons-Critic networks, respectively.
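The separation of training data between the two critics can be sketched as below. This is a simplified illustration assuming one-step temporal-difference targets and a mean-squared-error loss; the dictionary key `converged`, the helper names, and the discount value are hypothetical, and the paper's exact target definitions are not reproduced.

```python
def split_by_convergence(batch):
    """Partition replay samples by the 0-1 convergence indicator.

    Converged samples feed the Self-Critic (action-value) update and
    non-converged samples feed the Cons-Critic (action-cost) update.
    """
    converged = [t for t in batch if t["converged"] == 1]
    non_converged = [t for t in batch if t["converged"] == 0]
    return converged, non_converged


def td_target(signal, next_estimate, gamma=0.95):
    """One-step target: immediate reward (or cost) plus discounted estimate."""
    return signal + gamma * next_estimate


def mean_squared_error(estimates, targets):
    """Critic loss over a sample set; zero for an empty set."""
    if not estimates:
        return 0.0
    return sum((e - y) ** 2 for e, y in zip(estimates, targets)) / len(estimates)
```

Keeping the two sample sets apart means each critic is fitted only on the samples that carry information about its own signal, which is the intent of the decoupled evaluation described in this section.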
The policy gradient of the Actor network for each agent can be expressed as follows:
Then, the updated Cons-Critic network of each agent is applied to update the Lagrange multiplier of the distributed optimization problem. To ensure that the required condition on the multiplier always holds, an auxiliary variable is introduced, which yields the following:
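One common way to keep a Lagrange multiplier valid during training is to parameterize it through an unconstrained auxiliary variable. The sketch below uses a softplus parameterization with a dual-ascent step; whether the paper uses this particular form cannot be confirmed from the text, so the parameterization, the sign convention (the multiplier grows when the expected cost exceeds the threshold), the function names, and the step size are all assumptions.

```python
import math


def softplus(x):
    """Numerically stable softplus; always strictly positive."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def sigmoid(x):
    """Derivative of softplus with respect to x."""
    return 1.0 / (1.0 + math.exp(-x))


def update_multiplier(x, expected_cost, threshold, lr=0.05):
    """One dual-ascent step on the auxiliary variable x, with lam = softplus(x).

    The gradient of the Lagrangian with respect to lam is assumed to be
    (expected_cost - threshold); the chain rule adds the factor sigmoid(x).
    Returns the new auxiliary variable and the new multiplier.
    """
    x_new = x + lr * (expected_cost - threshold) * sigmoid(x)
    return x_new, softplus(x_new)
```

Here `expected_cost` would come from the Cons-Critic estimate averaged over a mini-batch, and the resulting multiplier weights the cost term in each agent's policy objective.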
Remark 2. Compared with the multi-agent Actor–Critic algorithm, the proposed algorithm allows for the decoupled evaluation of action-value and action-cost, reducing the complexity of fitting a single value network. Additionally, it mitigates the impact of sparse immediate rewards and safety constraint costs on evaluation accuracy, thereby enhancing the convergence of MADRL training. Equation (21) reflects the influence of the agent’s strategy on action-value and action-cost under different system operating states. This ensures that the frequency decisions generated by the multi-agent system meet the safety operation constraints while maximizing the system’s performance optimization goals.
5. Case Study and Discussion
To verify the effectiveness of the proposed method, a two-area LFC power system and a three-area LFC power system are studied in this section.
5.1. Case 1: Two-Area LFC Power System with Multi-Agent Actor-Double-Critic Deep Reinforcement Learning Algorithm
In this case, a two-area LFC power system using the multi-agent Actor-Double-Critic deep reinforcement learning algorithm is implemented to demonstrate the efficacy of the newly developed LFC approach. The system parameters are detailed in Table 2. To evaluate the performance of the designed LFC scheme based on the MADRL framework for a two-area power system, comparative experiments are conducted with PI control and fuzzy PI control.
Based on the partitioning of the simulation system, a fully connected layer neural network is employed. The input layer of the agent’s Actor network contains three neurons, corresponding to the observed dimensions of its respective area. This network features two hidden layers, with 128 and 200 neurons, respectively. The output layer comprises a single neuron. The learning rate for the Actor network is set at 0.001. The agent’s Critic network integrates both state and action pathways, which subsequently pass through a ReLU layer followed by a fully connected layer. The state pathway incorporates two hidden layers with 128 and 200 neurons, respectively, utilizing the ReLU function as the activation function. The action pathway includes a single hidden layer with 200 neurons. The learning rate for the Critic network is uniformly set at 0.0001.
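The layer sizes given above can be sketched with a small, dependency-free forward pass. The sketch follows the stated dimensions (Actor: 3 inputs, hidden layers of 128 and 200 neurons, 1 output; Critic: a 128 and 200 neuron state path and a 200 neuron action path that are merged, passed through a ReLU, and followed by a fully connected layer). The ReLU in the Actor's hidden layers, the tanh output, merging by addition, the critic input size, and the initialization are assumptions, and training with the stated learning rates (0.001 and 0.0001) is not shown.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) of a fully connected layer."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, z) for z in v]


class Actor:
    """3 -> 128 -> 200 -> 1, with an assumed tanh output in [-1, 1]."""

    def __init__(self, obs_dim=3, seed=0):
        rng = random.Random(seed)
        sizes = [obs_dim, 128, 200, 1]
        self.layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

    def __call__(self, obs):
        h = list(obs)
        for layer in self.layers[:-1]:
            h = relu(dense(h, layer))
        return math.tanh(dense(h, self.layers[-1])[0])


class Critic:
    """State path (128, 200) and action path (200), merged by addition."""

    def __init__(self, obs_dim=3, act_dim=1, seed=1):
        rng = random.Random(seed)
        self.s1 = init_layer(obs_dim, 128, rng)
        self.s2 = init_layer(128, 200, rng)
        self.a1 = init_layer(act_dim, 200, rng)
        self.out = init_layer(200, 1, rng)

    def __call__(self, obs, action):
        s = dense(relu(dense(list(obs), self.s1)), self.s2)
        a = dense([action], self.a1)
        merged = relu([si + ai for si, ai in zip(s, a)])
        return dense(merged, self.out)[0]
```

In the Actor-Double-Critic scheme, two critics with this structure would be instantiated per agent, one as the Self-Critic and one as the Cons-Critic.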
In this case, the total simulation time is set to 150 s. To analyze the performance of the designed LFC scheme, three load disturbances (specified in Hz) are applied at three specified times (in s), respectively. The simulation results are shown in Figure 2. The frequency deviation response curves for Area 1 and Area 2 are shown in Figure 2a and Figure 2b, respectively. The maximum absolute values and the average absolute values of the frequency deviations under the different controllers are quantified in Table 3.
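The two indices reported in the tables can be computed from a sampled frequency deviation trace as in the sketch below. The relative-improvement formula is an assumption about how the percentage gains over the baseline controllers are defined, and the function names are hypothetical.

```python
def deviation_metrics(df_series):
    """Return (maximum absolute value, average absolute value) of a trace."""
    abs_values = [abs(v) for v in df_series]
    return max(abs_values), sum(abs_values) / len(abs_values)


def improvement_percent(baseline, proposed):
    """Assumed definition: relative reduction of an index, in percent."""
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    # Hypothetical traces, not data from the paper.
    pi_trace = [0.0, -0.04, 0.02, -0.01, 0.0]
    rl_trace = [0.0, -0.01, 0.005, 0.0, 0.0]
    pi_max, pi_avg = deviation_metrics(pi_trace)
    rl_max, rl_avg = deviation_metrics(rl_trace)
    print(improvement_percent(pi_max, rl_max), improvement_percent(pi_avg, rl_avg))
```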
As reported in Table 3, the proposed method improves on both the Fuzzy PI algorithm and the PI algorithm in terms of the maximum absolute value and the average absolute value of the system frequency deviations. Under the given load disturbances, the proposed method suppresses the disturbances quickly, responds faster than the two conventional methods, and yields significantly smaller frequency deviations, which allows the system to return rapidly to a stable state. As illustrated in Figure 2a,b, the proposed method results in smaller frequency fluctuations and smoother frequency curves than the other algorithms.
To further demonstrate the performance of the designed MADRL-based LFC scheme, random disturbances are introduced into the two-area interconnected power system model. Every 60 s, a random load disturbance is generated and applied simultaneously to both Area 1 and Area 2, with an upper limit on its magnitude (specified in Hz). The frequency deviation response curves for Area 1 and Area 2 are shown in Figure 2c and Figure 2d, respectively. As presented in Figure 2c,d, the proposed method significantly improves on both the Fuzzy PI algorithm and the PI algorithm.
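A disturbance signal of this type (a new random value every 60 s, held constant in between and bounded in magnitude) can be generated as in the sketch below. The bound `limit`, the uniform distribution, the sampling step, and the function name are assumptions, since the numerical limit and the distribution are not recoverable from the text.

```python
import random


def random_disturbance_profile(total_time, dt, hold=60.0, limit=0.05, seed=0):
    """Piecewise-constant random load disturbance.

    A new value is drawn uniformly from [-limit, limit] every `hold`
    seconds and held until the next draw. The same profile can be applied
    to every area to disturb them simultaneously.
    """
    rng = random.Random(seed)
    n_steps = int(round(total_time / dt))
    steps_per_hold = int(round(hold / dt))
    profile, current = [], 0.0
    for k in range(n_steps):
        if k % steps_per_hold == 0:
            current = rng.uniform(-limit, limit)
        profile.append(current)
    return profile
```

For a 150 s run this yields three segments (0 to 60 s, 60 to 120 s, and 120 to 150 s), each with its own disturbance level.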
The maximum absolute values and average absolute values of the system frequency deviations under the different controllers are quantified in Table 4.
Compared to other methods, the proposed approach results in smaller frequency fluctuations and less impact on the power grid. Under various load disturbances, it can rapidly reduce the frequency deviations across multiple areas to zero, thereby achieving stability.
5.2. Case 2: Three-Area LFC Power System with Multi-Agent Deep Reinforcement Learning
To further examine the effectiveness of the designed MADRL-based LFC scheme, a three-area LFC power system model is considered. The total simulation time is set to 300 s, with a controller sampling time of 0.03 s.
Fixed load disturbances (specified in Hz) are introduced to the different areas at specified times: one disturbance is added to Area 1, one to Area 2, and one to Area 3, each at its own time instant (in s). The simulation results are shown in Figure 3.
It is observed that the proposed method significantly improves on both the Fuzzy PI algorithm and the PI algorithm. The maximum absolute values and average absolute values of the system frequency deviations under the different controllers are quantified in Table 5.
The proposed method effectively controls the frequency deviations caused by the different disturbances and exhibits smaller fluctuations. Although the Fuzzy PI and PI algorithms can also manage frequency deviations in the three-area power system, they respond more slowly and require longer adjustment periods.
Figure 3 demonstrates that the proposed method effectively reduces the impact of load disturbances on the power system, with faster response times and smoother frequency deviation curves.
6. Conclusions
A multi-agent Actor-Double-Critic deep reinforcement learning algorithm has been developed for LFC power systems in this paper. This framework ensures that the scheduling decisions generated by the multi-agent system meet the safe and reliable operation requirements of the power system. To enhance the convergence of the MADRL training process, the cost-sharing idea of the VCG auction mechanism has been applied to the design of the agents’ instant reward function to reflect the contribution of each agent’s strategy to the global optimization goal. The results indicate that, compared with traditional LFC schemes, the designed MADRL-based LFC scheme can further mitigate the impact of sparse immediate rewards and safety constraint costs, ensuring real-time scheduling decisions that comply with the system safety operation constraints.
However, the current study assumes a centralized-training and decentralized-execution environment and does not explicitly consider communication failures, system partitioning, or fully distributed learning scenarios.
Future work will focus on extending the proposed framework to high-renewable-penetration power systems, evaluating its performance under severe renewable-energy uncertainties and communication-constrained environments, incorporating explainable reinforcement learning techniques, and developing rigorous stability and safety analysis methods to further enhance its practical applicability.