Article

Load Frequency Control via Multi-Agent Reinforcement Learning and Consistency Model for Diverse Demand-Side Flexible Resources

1 The College of Electrical Engineering, Shanghai University of Electric Power, Shanghai 200090, China
2 State Grid Shanghai Electric Power Research Institute, Shanghai 200063, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(6), 1752; https://doi.org/10.3390/pr13061752
Submission received: 11 April 2025 / Revised: 17 May 2025 / Accepted: 20 May 2025 / Published: 2 June 2025

Abstract

With the high-proportion integration of renewable energy into the power grid, the fast-response capabilities of demand-side flexible resources (DSFRs), such as electric vehicles (EVs) and thermostatic loads, have become critical for frequency stability. However, the diverse dynamic characteristics of heterogeneous resources lead to high modeling complexity. Traditional reinforcement learning methods, which rely on neural networks to approximate value functions, often suffer from training instability and lack the effective quantification of resource regulation costs. To address these challenges, this paper proposes a multi-agent reinforcement learning frequency control method based on a Consistency Model (CM). This model incorporates power, energy, and first-order inertia characteristics to uniformly characterize the response delays and dynamic behaviors of EVs and air conditioners (ACs), providing a reduced-order analytical foundation for large-scale coordinated control. On this basis, a policy gradient controller is designed. By using projected gradient descent, it ensures that control actions satisfy physical boundaries. A reward function including state deviation penalties and regulation costs is constructed, dynamically adjusting penalty factors according to resource states to achieve priority configuration for frequency regulation. Simulations on the IEEE 39-node system demonstrate that the proposed method significantly outperforms traditional approaches in terms of frequency deviation, algorithm training efficiency, and frequency regulation economy.

1. Introduction

The large-scale integration of renewable energy and the electrification of the demand side are driving the power system toward a rapid transition to a high-proportion renewable energy and high-proportion power electronic device paradigm. Power electronic devices, relying on semiconductor-based converter technology, cannot provide the mechanical inertia support inherent in traditional synchronous generators [1]. Meanwhile, the significant decline in the penetration rate of traditional synchronous generators further exacerbates the attenuation of the system’s overall inertia level [2,3]. The large-scale integration of intermittent renewable energy sources (RESs), such as wind and photovoltaic (PV) power, has intensified power fluctuations in electrical grids, posing significant challenges to frequency stability [4,5]. Although traditional generation-side regulation remains indispensable, DSFRs, including EVs and thermostatically controlled loads, have become an important complement to frequency regulation due to their controllability and fast-response capabilities.
In recent years, growing global interest in DSFRs has been witnessed. However, the capacity limitations and diversity of individual units make it impossible for them to participate in frequency modulation (FM) directly, so resource aggregation and quantification have become key to unlocking their FM potential [6,7]. For instance, reference [8] explored the role of EVs in peak shaving and frequency regulation from the perspective of vehicle/grid interaction. References [9,10] used Markov state models to aggregate load populations. Reference [11] developed an equivalent energy storage model for ACs by discretizing load power regulation. Such studies provide a theoretical basis for the improvement in frequency deviation using a single type of DSFR. Reference [12] proposed an optimized method for the frequency response by constructing a dynamic model of the AC system and integrating EV scheduling strategies. Reference [13] investigated the role of EVs and energy storage devices in islanded hybrid microgrids for frequency regulation and proposed a variety of advanced control strategies that can effectively improve the frequency stability of the system. Reference [14] proposed that the aggregator must master the flexibility of clustering and analytically represent the aggregated feasible domain of the power curve. References [15,16] adopted high-dimensional multivariate load aggregation models to reflect real-time energy dynamics. However, these methods focus on detailed state parameters and suffer from high complexity; moreover, the number of parameters grows exponentially with the order, leading to difficulties in model convergence during real-time control and limiting their applicability in large-scale coordinated scenarios.
Traditional frequency control methods, which rely on physical models to construct control strategies and leverage the advantages of simple parameter tuning and clear response mechanisms, have long supported the frequency stability of power systems. Reference [17] proposed a multistage constant compressor speed control strategy based on genetic algorithms, setting temperature as the input control factor and compressor speed as the output factor to optimize the control strategy. Reference [18] developed a novel PID control scheme for the Pound/Drever/Hall frequency tracking system, aiming to address the instability of theoretical inputs and the difficulty in accurately determining the transfer functions of controlled modules and actuators. Reference [19] introduced a consensus strategy derived from droop control, enabling distributed battery energy storage systems to simultaneously participate in different frequency control requirements. However, these methods are highly dependent on precise system parameters and fixed operational scenario assumptions, exposing significant limitations in complex environments with high-proportion new energy grid integration and diverse DSFR participation: On one hand, the output fluctuations of intermittent power sources such as wind and PV exhibit strong randomness, while the dynamic responses of DSFRs show nonlinear and time-varying characteristics. Traditional models struggle to accurately characterize the complex coupling relationships among multiple agents. On the other hand, fixed control parameters cannot adaptively adjust to match the real-time changing potential of frequency regulation resources, leading to decreased control accuracy and imbalance in the system economy.
The participation of diverse flexible resources in frequency response control within power systems involves massive and complex data, which are not only large-scale and high-dimensional but also difficult to acquire. Traditional methods have strict requirements for data integrity and accuracy, and their control effectiveness in frequency response is compromised when data are incomplete or exhibit complex nonlinear relationships [20,21]. Deep Reinforcement Learning (DRL) can automatically learn highly abstract and effective feature representations from massive, complex, or even incomplete data, and solve for optimal frequency control strategies under difficult data acquisition conditions by continuously exploring different flexible resource control schemes [22]. Reference [23] proposed a DRL-based control strategy to minimize the energy consumption of heating, ventilation, and air conditioning (HVAC) systems and user thermal discomfort. Reference [24] developed a novel Deep Q-Network (DQN) algorithm with tilt-based fuzzy cascade control for system frequency regulation across different operating regions. Reference [25] introduced a large-scale multi-agent twin delayed Deep Deterministic Policy Gradient (DDPG) algorithm with congenital cognition for load frequency control (LFC), incorporating prior knowledge into agents before pre-training based on human innate cognitive mechanisms. However, these methods consider the long-term costs of constraints through expected values but cannot guarantee constraint satisfaction at every time step. Reference [26] proposed a DDPG-based dynamic droop control strategy to determine optimal droop gains via a multi-reward function design for real-time optimal operation and frequency regulation. Reference [22] developed a simulator network with a historical database instead of a critic network to obtain policy gradients for backpropagation. Nonetheless, these approaches overlook the security risks of constraint violations caused by poorly trained reinforcement learning (RL) policies during training [27]. Specifically, without model knowledge, RL policies learn from the feedback of random exploration (i.e., trial-and-error), where random control decisions may frequently violate system operation constraints. Agents in traditional online DRL methods require extensive “trial-and-error,” leading to erroneous decisions [28]. Such erroneous decisions can result in uncomfortable indoor temperatures in buildings and suboptimal regulation performance, imposing high costs in practice and rendering them unreliable for power systems demanding high operational reliability.
To address the above challenges, this paper proposes a multi-agent reinforcement learning method based on a CM to fully leverage the speed and flexibility of demand-side resources in frequency control. First, a coordinated control model for different types of flexible resources participating in grid frequency regulation is established. By utilizing the analytical tractability of reduced-order models, transfer functions for integrating the frequency regulation process are developed to match the required dynamic frequency response. Subsequently, abandoning the DDPG’s reliance on neural networks to approximate value functions, an analytical controller based on frequency deviation gradients is designed. A reward function incorporating the regulation cost of flexible resources is constructed, integrating the state of charge of EVs and temperature deviation penalties for ACs, thereby optimizing the participation of flexible resources. The contributions of this study to the research field are as follows:
  • A universal model incorporating power constraints, energy constraints, and first-order inertia characteristics is constructed, using only a few key parameters to characterize resource response delays and dynamic behaviors.
  • A policy gradient controller is designed to guide the update of agent network parameters by combining frequency deviations and resource response characteristics. Through projected gradient descent, the parameter update direction is adjusted to the tangent direction of the feasible domain, ensuring that actions always satisfy constraints.
  • A novel reward function for frequency control is proposed, dynamically adjusting weight factors and penalty terms based on resource states to achieve a priority configuration for different optimization objectives.
This paper is organized as follows. Section 2 proposes a model that unifies the state parameters describing the response characteristics of different types of flexible resources. Section 3 presents the parameter updating mechanism that improves the DDPG algorithm with the participation of demand-side resources. Section 4 verifies the effectiveness of the proposed method on the IEEE 39-node system.

2. A Consistency Model for Frequency Control Involving Diverse Flexible Resources

The large geographic distribution, tiny individual size, and variety of types of flexible resources allow for quick reactions to changes in system frequency. However, there are several obstacles to overcome when modeling numerous resources in depth: The addition of more state variables makes the regulation process more difficult and drastically increases the model order. Given that a uniform first-order lag element can be used to represent EVs and ACs in standardized modeling during grid frequency control, this paper proposes a unified characterization of dynamic features based on their shared properties. This approach achieves substantial model simplification while preserving essential physical features.

2.1. Consistency Model for Air Conditioning Loads

The activation/deactivation of AC units depends on whether the indoor temperature resides within a predefined setpoint range. The interior temperature must be kept between the upper and lower bounds shown in Figure 1. The ACs automatically turn on when the temperature rises above the upper threshold and turn off when it falls below the lower threshold. Within the permissible temperature range, AC units provide bidirectional frequency regulation capability: an AC that is currently off can be switched on to provide downward regulation capacity as long as the indoor temperature remains within acceptable limits, and, conversely, an AC that is currently on can be switched off to provide upward regulation capacity under the same condition.
The first-order thermodynamic equivalent model for ACs’ differential equation is written as follows:
\theta_i^{t+1} = \theta_i^t + \alpha\,(\theta_0^t - \theta_i^t) - b\,P_{AC}^{\max}\,s^t
where α is the thermal inertia coefficient, α = 1/(RC); R represents the building's equivalent thermal resistance; C denotes the equivalent thermal capacitance; b = η/C; η is the energy efficiency coefficient; P_AC^max is the AC's power consumption; θ_0^t is the ambient temperature; θ_i^t is the interior temperature; and s^t is the binary on/off state. The interior temperature at the end of period V is as follows:
\theta_i^V = (1-\alpha)^V \theta_i^0 + \alpha \sum_{t=1}^{V} (1-\alpha)^{V-t}\,\theta_0^t - b \sum_{t=1}^{V} (1-\alpha)^{V-t}\,P_{AC}^{\max}\,s^t
To avoid frequent switching, a temperature dead band is imposed, and the indoor temperature is limited as follows:
\theta_c - \frac{\varepsilon}{2} \leq \theta_i \leq \theta_c + \frac{\varepsilon}{2}
where θ c is the AC’s setpoint temperature and ε is the dead band width. Combining Equations (2) and (3) yields the following:
\frac{\theta^V - \theta_c - \frac{\varepsilon}{2}}{b} \leq \sum_{t=1}^{V} (1-\alpha)^{V-t}\,P_{AC}^{\max}\,s^t \leq \frac{\theta^V - \theta_c + \frac{\varepsilon}{2}}{b}
While the AC’s electrical power consumption and interior temperature are generally constant during steady-state operation, changes to the AC’s setpoint temperature upset the indoor thermal equilibrium. However, a shift from steady-state to dynamic behavior is triggered by a change in the setpoint temperature. The ACs’ electrical power reacts instantly during this brief period, while the building’s thermal inertia causes the indoor temperature to fluctuate slowly. This lag persists until a new thermal equilibrium is established, restoring the system to steady-state conditions.
The discrete power variable P AC max s t is converted to a continuous variable P AC , with the constraint expressed as follows [29]:
0 \leq P_{AC} \leq P_{AC}^{\max}
The ACs’ energy is completely used up when the room temperature hits the top limit, as shown by the adjustable power capacitance being zero. In contrast, the energy of the ACs is fully charged when the room temperature falls to the lower limit. The baseline charging/discharging power for the ACs cluster is defined as follows [30]:
P_t^{base} = \frac{\theta_0 - \theta_c}{\eta R}
where P t base is the baseline charging/discharging power of the AC cluster.
S_{t+1}^{AC} = (1-\alpha)\,S_t^{AC} + \left(P_t^{AC} - P_t^{base}\right)\Delta t
where S t A C is the energy state of ACs at time t; S t + 1 A C is the energy state of ACs at time t + 1; and Δ t is the time step duration.
The upper and lower limits of the CM’s charging/discharging power can be obtained:
P_t^{ACmax} = P_t^{ACsummax} - P_t^{base}
P_t^{ACmin} = P_t^{ACsummin} - P_t^{base}
where P t ACsummax and P t ACsummin are the maximum and minimum power limits of the ACs cluster, respectively.
The aggregated ACs cluster CM is expressed as follows:
P_t^{ACmin} \leq P_t^{AC,CM} \leq P_t^{ACmax}
S_t^{ACmin} \leq S_t^{AC,CM} \leq S_t^{ACmax}
S_{t+1}^{AC,CM} = (1-\alpha)\,S_t^{AC,CM} + P_t^{AC,CM}\,\Delta t
where S t AC , CM and P t AC , CM represent the energy state and power of the ACs cluster CM, respectively; P t ACmax and P t ACmin denote its upper and lower power limits; and S t ACmax and S t ACmin indicate its maximum and minimum energy state boundaries.
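The aggregated CM above can be exercised with a few lines of code. The sketch below clips a requested regulation power to the cluster's power boundary and then enforces the energy boundary through the energy-state update; the function name and all numerical values are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the aggregated ACs-cluster consistency model: a requested
# regulation power is clipped to the CM's power and energy boundaries
# before the energy-state update. All numbers are illustrative.
def ac_cm_step(S, P_cmd, alpha, dt, P_min, P_max, S_min, S_max):
    """One CM step: clip to the power boundary, then enforce the energy boundary."""
    P = min(max(P_cmd, P_min), P_max)          # power boundary
    S_next = (1 - alpha) * S + P * dt          # energy-state update
    if S_next > S_max:                         # energy boundary (upper)
        P = (S_max - (1 - alpha) * S) / dt
        S_next = S_max
    elif S_next < S_min:                       # energy boundary (lower)
        P = (S_min - (1 - alpha) * S) / dt
        S_next = S_min
    return S_next, P

# Example: an AC cluster asked to absorb 8 MW for 5 min (assumed values).
S, dt, alpha = 2.0, 5 / 60, 0.02               # MWh, h, per-step loss factor
S_next, P_applied = ac_cm_step(S, P_cmd=8.0, alpha=alpha, dt=dt,
                               P_min=-6.0, P_max=6.0, S_min=0.0, S_max=2.5)
print(P_applied, round(S_next, 3))             # command is clipped to 6 MW
```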
The equivalent transfer function of the aggregated ACs cluster CM participating in frequency regulation can be derived, yielding the following relationship between frequency deviation and power output:
\Delta P_{AC}(s) = -\frac{K_{AC}}{1 + T_{AC}\,s}\,e^{-\tau_{AC} s}\,\Delta f(s)
where Δ P AC ( s ) is the charging/discharging power of the ACs cluster to the grid; T AC is the charge/discharge time constant of the ACs cluster; K AC is the droop control parameter of the ACs cluster; τ AC is the time delay of ACs; and s is the Laplace variable.
When flexible resources are in a high state of charge, their charge/discharge efficiency is higher, and they should be prioritized for charge/discharge operations. Therefore, the relationship between the state of charge of flexible resources and the droop control coefficient is as follows:
K_{FL} = K_{FL}^{base}\left(1 - \frac{\left|F_{SOC}^{FL} - F_{SOCS}^{FL}\right|}{F_{SOCS}^{FL}}\right)
where K_FL^base is the reference droop control coefficient of the flexible resources (K_AC^base and K_EV^base for the ACs and EVs, respectively); F_SOCS^FL is the set value of the state of charge of the flexible resources; and F_SOC^FL is the state of charge of the flexible resources.
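A minimal sketch of the SOC-dependent droop coefficient above is given below; the base coefficient and SOC values are assumed for illustration.

```python
# Sketch of the SOC-dependent droop coefficient: the reference coefficient is
# scaled down as the state of charge moves away from its set value,
# de-prioritizing resources with little regulation headroom.
def droop_coefficient(k_base, f_soc, f_soc_set):
    return k_base * (1.0 - abs(f_soc - f_soc_set) / f_soc_set)

# Assumed numbers: base coefficient 20 MW/Hz, SOC set value 0.5.
print(droop_coefficient(20.0, f_soc=0.50, f_soc_set=0.5))  # 20.0 (full priority)
print(droop_coefficient(20.0, f_soc=0.85, f_soc_set=0.5))  # 6.0 (reduced priority)
```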

2.2. Consistency Model for Electric Vehicles

Users’ desire to participate in peak-shaving auxiliary services, such as virtual energy storage, is crucial, because EV owners have the option to discontinue the service at any moment. Such opt-out behavior affects control tactics by creating uncertainty during the planned service period. The viable charging/discharging power zone in Figure 2 that takes into account EV user behavior is shown by the shaded area. The frequency regulation capability of EVs exhibits strong coupling with user behavior characteristics. Therefore, it is necessary to construct a model that integrates the following:
In the figure, the red line segment represents the lower bound of the feasible charging/discharging region for EVs, indicating the minimum battery energy requirement during regulation. Here, t EVon denotes the grid-connection time of EVs, and t EVoff represents the grid-disconnection time.
The energy feasible region in Figure 2a is determined by Equations (15) and (16), while that in Figure 2b is defined by Equations (15)–(18). The upper and lower power limits derived from these energy feasible regions are given by Equations (19) and (20), respectively.
S_i^{EVmax}(t) = \min\left(S_i^{on} + P_i^{\max}\,(t - t_i^{EVon}),\ S_i^{\max}\right), \quad t \in [t_i^{EVon}, t_i^{EVoff}]
S_i^{EVmin\prime}(t) = \max\left(S_i^{on} + P_i^{\min}\,(t - t_i^{EVon}),\ S_i^{\min}\right), \quad t \in [t_i^{EVon}, t_i^{EVoff}]
S_i^{EVmin\prime\prime}(t) = \max\left(S_i^{ex} + P_i^{\min}\,(t_i^{EVoff} - t),\ S_i^{\min}\right), \quad t \in [t_i^{EVon}, t_i^{EVoff}]
S_i^{EVmin}(t) = \max\left\{S_i^{EVmin\prime}(t),\ S_i^{EVmin\prime\prime}(t)\right\}, \quad t \in [t_i^{EVon}, t_i^{EVoff}]
P_i^{EVmax}(t) = \min\left(\frac{S_i^{EVmax}(t+1) - S_i^{EVmax}(t)}{\Delta t},\ P_i^{\max}\right)
P_i^{EVmin}(t) = \max\left(\frac{S_i^{EVmin}(t+1) - S_i^{EVmin}(t)}{\Delta t},\ P_i^{\min}\right)
where S_i^{EVmax}(t) and S_i^{EVmin}(t) are the maximum and minimum energy states of the i-th EV at time t; P_i^{EVmax}(t) and P_i^{EVmin}(t) represent its maximum charging and discharging power; S_i^{EVmin′}(t) and S_i^{EVmin″}(t) denote the minimum energy thresholds associated with the grid-connection and grid-disconnection requirements, respectively; S_i^{on} is the initial energy state of the i-th EV when connecting to the grid; S_i^{ex} is the required energy state of the i-th EV when disconnecting from the grid; t_i^{EVon} and t_i^{EVoff} indicate the grid-connection and disconnection times of the i-th EV; and Δt is the time step.
The EVs unit’s power and electricity boundaries can be used to determine the EV cluster’s boundary in the manner described below [31]:
S_t^{EV,CM\max/\min} = \sum_{i=1}^{N_{EV}} S_i^{EV\max/\min}\,x_i(t)
P_t^{EV,CM\max/\min} = \sum_{i=1}^{N_{EV}} P_i^{EV\max/\min}\,x_i(t)
where N_EV is the number of EVs in the cluster; x_i(t) represents the grid-connected status of the i-th EV; and S_t^{EV,CMmax/min} and P_t^{EV,CMmax/min} are the upper and lower limits of the electrical energy and power of the EV cluster CM, respectively.
The EV cluster CM is as follows:
P_t^{EV,CM\min} \leq P_t^{EV,CM} \leq P_t^{EV,CM\max}
S_t^{EV,CM\min} \leq S_t^{EV,CM} \leq S_t^{EV,CM\max}
S_{t+1}^{EV,CM} = S_t^{EV,CM} + P_t^{EV,CM}\,\Delta t
where S t EV , CM and P t EV , CM are the energy state and power of the EV cluster’s CM, respectively.
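The following sketch illustrates how the per-EV energy envelopes of Equations (15)–(18) can be aggregated over grid-connected vehicles to obtain the cluster's energy boundaries; the fleet data, symmetric power limits, and helper function are illustrative assumptions.

```python
# Sketch of the EV cluster aggregation: per-EV energy envelopes are built from
# connection time, departure energy, and power limits, then summed over
# grid-connected vehicles. All parameters are illustrative.
def ev_energy_envelope(t, s_on, s_ex, s_min, s_max, p_max, p_min, t_on, t_off):
    """Upper/lower energy bounds of one EV at time t (Eqs. (15)-(18))."""
    s_up = min(s_on + p_max * (t - t_on), s_max)        # charging at full power
    s_lo1 = max(s_on + p_min * (t - t_on), s_min)       # discharging at full power
    s_lo2 = max(s_ex + p_min * (t_off - t), s_min)      # must still reach s_ex at departure
    return s_up, max(s_lo1, s_lo2)

# Three assumed EVs: (s_on, s_ex, s_min, s_max, p_max, p_min, t_on, t_off), energies in kWh.
fleet = [(20, 45, 10, 60, 7, -7, 18.0, 7.0 + 24),
         (30, 50, 10, 60, 7, -7, 19.0, 6.0 + 24),
         (15, 40, 10, 60, 3, -3, 20.0, 8.0 + 24)]
t_now = 22.0                                            # 22:00
connected = [ev for ev in fleet if ev[6] <= t_now <= ev[7]]   # x_i(t) = 1

# Cluster energy boundaries: sum the individual envelopes over connected EVs.
S_max_cm = sum(ev_energy_envelope(t_now, *ev)[0] for ev in connected)
S_min_cm = sum(ev_energy_envelope(t_now, *ev)[1] for ev in connected)
print(round(S_max_cm, 1), round(S_min_cm, 1))
```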
The EV cluster CM involved in frequency regulation has the following equivalent transfer function:
\Delta P_{EV}(s) = -\frac{K_{EV}}{1 + T_{EV}\,s}\,e^{-\tau_{EV} s}\,\Delta f(s)
where Δ P EV ( s ) is the charging/discharging power of the EV cluster to the grid; T EV is the charge/discharge time constant of the EV cluster; K EV is the droop control parameter of the EV cluster; and τ EV is the time delay of EVs.

2.3. Assessing the Controllable Potential Considering the Uncertainty of Flexible Resources

In the above-mentioned CM for flexible resources, the number of flexible resources connected to the power grid at any given time is highly correlated with the maximum and minimum charging and discharging power, as well as the upper and lower limits of the electricity quantity. To meet the time margin mandated by the power market regulations at the day-ahead stage, a system for predicting parameters must be established based on historical data and user behavior patterns. By mining the spatiotemporal distribution patterns of the operating status of flexible resource clusters, we can predict the time-series changes in the number of resource accesses in each period, allowing for the daily calculation of the power and electricity boundaries of the CM response capability.
The boundary parameters of power and electrical energy in the CM are predicted using the gated recurrent unit (GRU) technique, which handles time-series data well. For the day-ahead prediction stage, the historical data of the CM's initial state of charge, grid-connection time, and off-grid time are separated into training and testing groups, and the training data are used to mine the inherent relationships in the data. Rolling correction mainly covers three situations: new flexible resources connected to the grid during the response period, flexible resources going off-grid later than planned, and flexible resources going off-grid early.
1. For flexible resources newly connected to the grid during the response period, if the demand-side flexible resources can reach the minimum power capacity at the end of the response period, the lower limit of the FM power potential of the flexible resources is corrected as follows (a code sketch of this correction is given at the end of this subsection):
F_{\Delta SOC,k}^{CM\min} = F_{SOC,CM}^{target} - F_{SOC}^{CM}(t_0) - \frac{P_{CM}^{\max}\,\eta_c\,\left[t_{CM}^{set} - (k\Delta t + T)\right]}{S_{CM}}
P_t^{CM\min} = \begin{cases} \min\!\left(P_{CM}^{\max},\ \dfrac{F_{\Delta SOC,k}^{CM\min}\,S_{CM}}{\eta_c\,(k\Delta t + T - t_0)}\right), & F_{\Delta SOC,k}^{CM\min} \geq 0 \\[2ex] \max\!\left(P_{CM}^{\min},\ \dfrac{F_{\Delta SOC,k}^{CM\min}\,S_{CM}\,\eta_d}{k\Delta t + T - t_0}\right), & F_{\Delta SOC,k}^{CM\min} < 0 \end{cases}
where t_0 is the current moment; kΔt + T is the moment when the response ends; F_{ΔSOC,k}^{CMmin} is the minimum charging amount or the maximum discharging amount of the flexible resource; S_CM is the electrical energy of the flexible resource; η_c and η_d are the charging and discharging efficiencies, respectively; F_{SOC,CM}^{target} is the set value of the state of charge when the flexible resource goes off-grid; t_CM^set is the set off-grid moment; P_CM^max is the maximum charging/discharging power; and P_t^CMmin is the lower limit of the frequency regulation power within the response period.
The upper limit of the FM power potential needs to be revised to prevent flexible resources from overcharging during the response period:
P_t^{CM\max} = \min\!\left(P_{CM}^{\max},\ \frac{F_{\Delta SOC,k}^{CM\max}\,S_{CM}}{\eta_c\,(k\Delta t + T - t_0)}\right)
F_{\Delta SOC,k}^{CM\max} = 100\% - F_{SOC}^{CM}(t_0)
where F Δ SOC , k CM max is the allowable increase in the electrical energy of the flexible resource within the remaining time of the response period; P t CMmax is the upper limit of the frequency regulation power within the response period; and F SOC CM ( t 0 ) is the state of charge of the flexibility resource at time t 0 .
2. Similar to the scenario for recently grid-connected flexible resources, for flexible resources with a delayed off-grid time, the frequency regulation capability for the remaining period should be re-evaluated if the CM has reached the set electrical energy. If the CM has not reached the set electrical energy, the upper and lower limits of the frequency regulation power of the flexible resources should be revised to the allowable maximum charging power.
3. For flexible resources that go off-grid early, the upper and lower limits of their frequency regulation power should be revised to zero, and they do not participate in the frequency response for the remaining time.
The rolling correction and evaluation process of the frequency regulation potential is shown in Figure 3. Among them, Δ t is the evaluation time interval; T is the length of the evaluation period; and Δ t and T are set to 5 min and 1 h, respectively.
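The sketch below illustrates the rolling correction for the first situation (a resource newly connected during the response period): it computes the minimum required SOC change and the corrected lower and upper FM power limits. The function and all parameter values (capacity, efficiencies, times) are illustrative assumptions, not the paper's settings.

```python
# Sketch of the rolling correction for a flexible resource that connects to the
# grid during the response period (situation 1 above). All values are illustrative.
def correct_fm_limits(f_soc_now, f_soc_target, s_cm, p_max, p_min,
                      eta_c, eta_d, t0, t_resp_end, t_offgrid):
    # Minimum SOC change that must happen inside the response period so the
    # target SOC is still reachable by charging at full power afterwards.
    f_dsoc_min = f_soc_target - f_soc_now - p_max * eta_c * (t_offgrid - t_resp_end) / s_cm
    # Corrected lower limit of the FM power within the response period.
    if f_dsoc_min >= 0:
        p_low = min(p_max, f_dsoc_min * s_cm / (eta_c * (t_resp_end - t0)))
    else:
        p_low = max(p_min, f_dsoc_min * s_cm * eta_d / (t_resp_end - t0))
    # Corrected upper limit so the resource cannot overcharge.
    f_dsoc_max = 1.0 - f_soc_now
    p_high = min(p_max, f_dsoc_max * s_cm / (eta_c * (t_resp_end - t0)))
    return p_low, p_high

# Assumed EV-like resource: 50 kWh, +/-7 kW, 95% efficiencies, SOC 0.4 with a
# target of 0.6, response period ends at t = 1.0 h, off-grid at t = 2.0 h.
print(correct_fm_limits(0.4, 0.6, 50.0, 7.0, -7.0, 0.95, 0.95,
                        t0=0.0, t_resp_end=1.0, t_offgrid=2.0))
```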

2.4. Consistency Model Considering the Participation of Different Types of Resources in Frequency Control

Agents are used to control large-scale, variable resources. Figure 4 illustrates how the power grid frequency control model is set up with flexible resources involved. Here, R_G, R_1, and R_FL are the droop control coefficients; T_Ga and T_Gb are the time constants of the speed governor and the steam turbine of the frequency regulation unit, respectively; K_1, T_1, e^{-sτ_1} and K_FL, T_FL, e^{-sτ_FL} represent the droop control parameters, time constants, and time delay links of the different flexible resource clusters; τ_1 and τ_FL are the delay time constants of the different flexible resource clusters; ΔP_1 and ΔP_FL represent the power outputs of the different flexible resources; ΔP_G denotes the generator power output; ΔP_L is the disturbance power; and s is the Laplace variable. The inputs of the controller include the system frequency deviation Δf, the derivative Δf_I and the integral Δf_d of the deviation, the reference operating points P_FL^0, P_1^0, and P_G^0 of the flexible resources and the generator unit, and the adjustable potential of the CM. H and D are the system inertia time constant and the load damping coefficient. The controller outputs are the control commands A_G, A_1, and A_FL of each frequency regulation unit. The power adjustment of each frequency regulation unit is determined by the controller output and the droop control coefficient. The power variation of the flexible resources in the frequency response can be expressed as follows:
\Delta P_{FL,j} = -\frac{A_j\,\Delta f}{R_j}
where Δ P FL , j represents the power variation in the j-th flexible resource agent.
The relationship between the total response power of the frequency regulation units and flexible resource clusters and the system frequency can be expressed as follows:
\sum_{i=1}^{m} \Delta P_{G,i}(s) + \sum_{j=1}^{n} \Delta P_{FL,j}(s) - \Delta P_L(s) = \Delta f(s)\,(2Hs + D)
where m and n represent the numbers of frequency regulation units and flexible resource clusters, respectively; ΔP_{G,i}(s) and ΔP_{FL,j}(s) are the power variations of the i-th frequency regulation unit and the j-th flexible resource cluster; and ΔP_L(s) is the disturbance power.
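A time-domain reading of this relationship is sketched below: the frequency deviation is integrated from the power balance 2H·d(Δf)/dt = ΣΔP_G + ΣΔP_FL − ΔP_L − D·Δf, with simple droop laws standing in for the regulation units. The inertia, damping, droop gains, and disturbance size are illustrative assumptions.

```python
# Time-domain sketch of the aggregated frequency response relation above:
# 2H * d(df)/dt = sum(dP_G) + sum(dP_FL) - dP_L - D * df,
# with simple droop laws for one generator and one flexible-resource cluster.
# All parameters (H, D, droop gains, disturbance size) are illustrative.
H, D = 5.0, 1.0            # inertia constant and load damping (assumed)
R_g, K_fl = 0.05, 15.0     # generator droop and flexible-resource gain (assumed)
dP_L = 0.1                 # step load disturbance (p.u.)
dt, df = 0.01, 0.0         # integration step (s) and frequency deviation

history = []
for k in range(3000):                       # 30 s of simulated response
    dP_g = -df / R_g                        # generator droop response
    dP_fl = -K_fl * df                      # flexible-resource droop response
    d_df = (dP_g + dP_fl - dP_L - D * df) / (2 * H)
    df += d_df * dt                         # forward-Euler integration
    history.append(df)

print(round(min(history), 4), round(history[-1], 4))   # nadir and settled deviation
```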

3. Proposed Method

The DDPG algorithm adopted in this paper employs an “actor-critic” architecture. The proposed control model framework is shown in Figure 5. During the iterative computation at time step t, the actor first generates an action through the policy network u ( s , θ u ) based on the observed state of the CM at this moment. Subsequently, the CM undergoes a state transition according to the control policy at this time, reaching the state s t + 1 at the next time step. Meanwhile, the reward r t at time t is fed back to the agent. The agent controller can obtain a large number of experience samples through interactions with the environment, which are stored in an experience replay buffer. These sample data are first input into the critic model, and ultimately used to update the agent’s network parameters θ in the form of policy gradients.
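A skeleton of this interaction loop is sketched below; the toy environment, the linear policy, and the omitted update step are stand-ins for the CM-based simulator and the gradient rules developed in the following subsections, not the paper's implementation.

```python
# Skeleton of the agent-environment interaction loop described above: the actor
# produces an action from the observed state, the environment returns the next
# state and reward, transitions go to a replay buffer, and mini-batches are
# sampled for the parameter update. Environment and update rule are stand-ins.
import random
from collections import deque

class ToyFrequencyEnv:
    """Stand-in for the CM-based frequency-response environment (assumed)."""
    def __init__(self):
        self.df = 0.0
    def reset(self):
        self.df = 0.0
        return self.df
    def step(self, action):
        disturbance = random.uniform(-0.05, 0.05)
        self.df += 0.1 * (action + disturbance - self.df)   # toy dynamics
        reward = -abs(self.df)                              # penalize deviation
        return self.df, reward

def actor(state, theta):
    return theta * state                  # linear policy stand-in, u(s; theta)

env, theta = ToyFrequencyEnv(), -0.5
buffer = deque(maxlen=10_000)             # experience replay buffer
state = env.reset()
for t in range(500):
    action = actor(state, theta)
    next_state, reward = env.step(action)
    buffer.append((state, action, reward, next_state))
    state = next_state
    if len(buffer) >= 32:
        batch = random.sample(list(buffer), 32)   # mini-batch for the update
        # ... policy-gradient update of theta would go here ...
print(len(buffer))
```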

3.1. Control Objectives and Reward Functions

In this paper, the control objective is to minimize the weighted sum of the system frequency deviation and the total FM cost, and the optimal multi-agent cooperative control strategy is formulated by jointly considering frequency stability and the total FM cost.
A state-action value function Q(s, A) is defined as follows to measure the system's overall performance when all agents cooperate to regulate it [32,33,34]:
Q(s, A) = -\lambda \left[ \sum_{t=1}^{T} \gamma^t \big(\Delta f(t)\big)^2 \Delta t \right] - (1-\lambda) \left[ \sum_{t=1}^{T} \left( \sum_{i \in G}^{m} \Delta P_{G,i}(t)\, C_{G,i}\, \Delta t + \sum_{j \in FL}^{n} \Delta P_{FL,j}(t)\, C_{FL,j}\, \Delta t \right) \right]
r_i = -u_i\,|\Delta f_i| - w_1\,C_{FL,i} - w_2\left(F_{SOC}^{EV,i} - F_{SOCS}^{EV}\right)^2 - w_3\,\Delta P_i^{base}
where Q ( s , A ) is the action value function; G and FL are the sets of the frequency regulation units and the flexible resource clusters, respectively; m and n represent the numbers of the frequency regulation units and the flexible resource clusters, respectively; Δ t and T represent the size of the time step and the length of the experience trajectory, respectively; s is the state information of the system; Δ P G , i ( t ) and Δ P FL , j ( t ) are the dimensionally unified outputs; Δ f ( t ) is the dimensionally unified frequency deviation; C G , i is the unit power output cost of the generator unit; C FL , j is the unit power output cost of the flexible resource cluster; and λ is the weighting coefficient of the reward function. Since the reward function includes the frequency control effect and the frequency regulation cost, the applicability of the algorithm can be tested by comparing different proportions. Additionally, a discount factor γ ( 0 , 1 ) is included in this study to ensure that the current frequency deviation is given more weight than future ones and that the system frequency deviation can be removed as quickly as possible. This study aims to lower the power system’s frequency regulation cost while maintaining the frequency control effect by prioritizing the power system’s frequency control effect over the economy. r i is the reward function of the agent; | Δ f i | is the frequency deviation; u i and w 1 are the weights corresponding to the frequency regulation effect and the frequency regulation economy; w 2 is the penalty coefficient for the EV cluster deviating from the target; w 3 is the penalty coefficient for the change in the power reference value of the ACs cluster; Δ P i base is the change in the power reference value of the i-th ACs cluster; F SOC EV , i is the state of charge of the i-th EVs; and F SOCS EV is the set value of the state of charge of the EVs.
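The per-agent reward r_i above can be evaluated as in the following sketch; the weights, the cost term, and the state values are illustrative assumptions.

```python
# Sketch of the per-agent reward r_i above: frequency-deviation penalty,
# regulation-cost penalty, SOC-target penalty for EVs, and a penalty on
# changes of the ACs' power reference (absolute value used here).
# Weights and states are assumed for illustration.
def agent_reward(df, c_fl, f_soc_ev, f_soc_set, dp_base,
                 u=1.0, w1=0.2, w2=0.5, w3=0.1):
    return (-u * abs(df)                        # frequency regulation effect
            - w1 * c_fl                         # regulation cost
            - w2 * (f_soc_ev - f_soc_set) ** 2  # EV SOC deviation from target
            - w3 * abs(dp_base))                # change of the ACs' power reference

# Example: 0.02 Hz deviation, 0.3 cost units, SOC 0.45 vs. target 0.5,
# 0.1 p.u. shift of the AC baseline power (all illustrative).
print(round(agent_reward(0.02, 0.3, 0.45, 0.5, 0.1), 4))
```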

3.2. Actor Network Parameter Update Method

The optimal control action is obtained by maximizing the expected state-action value function E_{s∼ρ}[Q(s, A)] as the evaluation index. Thus, this study applies the chain rule to calculate the gradient of the expected state-action value function with respect to the parameters θ = [θ_{u_1}, θ_{u_2}, …, θ_{u_{m+n}}] sequentially, following the gradient ascent concept. The iteratively updated parameters move along this gradient toward the maximum of the expected state-action value function, yielding the network parameters that allow each policy network to output the best action. The gradient and the corresponding parameters are updated as follows:
\nabla_{\theta_{u_j}} E_{s\sim\rho}\left[Q(s,A)\right] \approx \frac{1}{N_u}\sum_{i} \nabla_{\theta_{u_j}} Q(s,A)\Big|_{s=s_i} = \frac{1}{N_u}\sum_{i} \nabla_{A_j} Q(s,A)\,\nabla_{\theta_{u_j}} u_j(s,\theta_{u_j})\Big|_{s=s_i}
\theta_{u_j} \leftarrow \theta_{u_j} + \eta\,\nabla_{\theta_{u_j}} E_{s\sim\rho}\left[Q(s,A)\right]
where η ∈ (0, 1) is the learning rate of the policy network parameters θ_{u_j} and N_u is the number of samples in a mini-batch; the gradient of the expected state-action value function with respect to the parameters is approximated by averaging the gradients of the N_u sampled state-action value functions with respect to the parameters.
To update the parameters, ∇_{A_j} Q(s, A) and ∇_{θ_{u_j}} u_j(s; θ_{u_j}) must be obtained first. The expression of ∇_{θ_{u_j}} u_j(s; θ_{u_j}) for the policy network is as follows:
\nabla_{\theta_{u_j}} u_j(s; \theta_{u_j}) = \nabla_{\theta_{u_j}}\left( f_{\theta^{(p)}}^{(p)}\left[\cdots f_{\theta^{(1)}}^{(1)}(s)\cdots\right]\right)
where p represents the number of layers of the neural network; the state s is the input of the neural network; and f_{θ^{(l)}}^{(l)} denotes the l-th layer together with its activation function.
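The layer-by-layer chain rule behind ∇_{θ_{u_j}} u_j(s; θ_{u_j}) is illustrated by the following sketch of a tiny two-layer policy network with a manual backward pass; the layer sizes, tanh activation, and random weights are assumptions made for illustration.

```python
# Tiny two-layer policy network u_j(s; theta) with a manual backward pass,
# illustrating the layer-by-layer chain rule used for grad_theta u_j.
# Layer sizes, the tanh activation, and the weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1 parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # layer 2 parameters

def policy_and_grads(s):
    z1 = W1 @ s + b1
    h1 = np.tanh(z1)                  # f^(1): first-layer activation
    a = W2 @ h1 + b2                  # f^(2): linear output layer, action u_j
    # Backward pass (chain rule), propagating d a / d(parameters):
    dW2 = h1[None, :]                 # d a / d W2
    db2 = np.ones(1)                  # d a / d b2
    dh1 = W2.ravel()                  # d a / d h1
    dz1 = dh1 * (1.0 - h1 ** 2)       # back through tanh
    dW1 = np.outer(dz1, s)            # d a / d W1
    db1 = dz1                         # d a / d b1
    return a, (dW1, db1, dW2, db2)

state = np.array([0.02, -0.1, 0.5])   # e.g., [df, integral term, SOC], assumed
action, grads = policy_and_grads(state)
print(action, grads[0].shape)
```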
During the iterative updating of the network parameters, in order to find the gradient of the expectation of the state-action value function with respect to the parameters, it is crucial to obtain the gradient of the state-action value function with respect to the controller action.
According to the chain rule of differentiation, for the i-th FM unit (i ∈ G), the following is obtained:
\nabla_{A_{i}} Q_{u_{i}}(s, A) = -\left[ 2\lambda \sum_{t=1}^{T} \gamma^t\, \frac{\Delta f(t)\, \nabla_{A_{i}} \Delta f(t)}{f_0^2}\, \Delta t + (1-\lambda) \sum_{t=1}^{T} \frac{\nabla_{A_{i}} \Delta P_{G,i}(t)\, C_{G,i}}{P_0\, C_0}\, \Delta t \right], \quad i \in G
For the j-th flexible resource cluster (j ∈ FL), the following is obtained:
\nabla_{A_{j}} Q_{u_{j}}(s, A) = -\left[ 2\lambda \sum_{t=1}^{T} \gamma^t\, \frac{\Delta f(t)\, \nabla_{A_{j}} \Delta f(t)}{f_0^2}\, \Delta t + (1-\lambda) \sum_{t=1}^{T} \frac{\nabla_{A_{j}} \Delta P_{FL,j}(t)\, C_{FL,j}}{P_0\, C_0}\, \Delta t \right], \quad j \in FL
where P 0 , C 0 , and f 0 denote the datum values used for normalization when unifying the scale.
Based on the frequency response model, the relationship between the system frequency deviation and the output commands of the FM units and the flexible resource controllers can be expressed in the following form:
\Delta f(s)\,(2Hs + D) = \sum_{i=1,\,i\in G}^{m}\left[-\frac{A_i(s)\,\Delta f(s)}{R_i}\right] + \sum_{j=1,\,j\in FL}^{n}\frac{k_{FL,j}\left[-\dfrac{A_j(s)\,\Delta f(s)}{R_j}\right]}{1 + sT_{FL}} - \Delta P_L
When compared to the generator set, the flexible resource cluster’s time constant is essentially insignificant.
The actual active output of the FM units and flexible resource clusters is regulated by the droop control factor and the controller output, and the maximum and minimum output power limits must also be met. The maximum/minimum output power constraint of the generators and the flexible resource clusters can be further expressed as follows:
P_{G/FL,j}^{\min} \leq P_{G/FL,j}^{0} - \frac{A_j\,\Delta f}{R_j} \leq P_{G/FL,j}^{\max}
where P G / FL 0 , P G / FL min , and P G / FL max denote the base operating point, the minimum output power constraint, and the maximum output power constraint.
Let P_{G/FL,j} = P_{G/FL,j}^0 − A_j Δf / R_j, and consider A_j = u_j(s; θ_{u_j}); then, the gradient of P_{G/FL,j} with respect to the network parameters can be expressed as follows:
\nabla_{\theta_{u_j}} P_{G/FL,j} = -\nabla_{\theta_{u_j}} A_j\,\frac{\Delta f}{R_j} - \frac{A_j}{R_j}\,\nabla_{\theta_{u_j}} \Delta f = -\nabla_{\theta_{u_j}} u_j(s;\theta_{u_j})\,\frac{1}{R_j}\left(\Delta f + A_j\,\nabla_{A_j}\Delta f\right)
From Equation (31), if the policy network parameter θ_{u_j} is updated along the following direction, the power output of the corresponding generator or flexible resource cluster increases; conversely, it decreases.
\theta_{u_j} \leftarrow \theta_{u_j} + \eta\,\nabla_{\theta_{u_j}} P_{G/FL,j}
Combining the gradient ascent direction given by Equation (36) with the generator and flexible resource power constraints, the final rule for updating the policy network parameters is obtained as follows:
\theta_{u_j} \leftarrow \begin{cases} \theta_{u_j} + \eta\,\nabla_{\theta_{u_j}} E_{s\sim\rho}\left[Q(s,A)\right], & P_{G/FL,j}^{\min} < P_{G/FL,j} < P_{G/FL,j}^{\max} \\[1ex] \theta_{u_j} + \eta\left\{ \nabla_{\theta_{u_j}} E_{s\sim\rho}\left[Q(s,A)\right] - \dfrac{\nabla_{\theta_{u_j}} E_{s\sim\rho}\left[Q(s,A)\right] \cdot \nabla_{\theta_{u_j}} P_{G/FL,j}}{\left\| \nabla_{\theta_{u_j}} P_{G/FL,j} \right\|^2}\, \nabla_{\theta_{u_j}} P_{G/FL,j} \right\}, & \text{else} \end{cases}
The above rule shows that if the parameters were updated purely along the gradient of the expected state-action value function, the output power of the generating units or flexible resource clusters could exceed its limits. To ensure that the trajectory of the parameter update always remains within the feasible domain delineated by the frequency control model, the parameters are then updated in the subsequent iteration along the projection of the gradient of the expected state-action value function onto the plane normal to the vector ∇_{θ_u} P_{G/FL,j}.
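A compact sketch of this projected update is given below: when the implied output power sits outside its bounds, the component of the Q-gradient along ∇_θ P_{G/FL,j} is removed before the step. The gradient vectors, power bounds, and learning rate are illustrative assumptions.

```python
# Sketch of the projected parameter update described above: when stepping
# along the Q-gradient would push the implied output power outside its bounds,
# the gradient is first projected onto the plane normal to grad_theta(P).
# The gradient vectors, bounds, and learning rate are illustrative.
import numpy as np

def projected_update(theta, grad_q, grad_p, power, p_min, p_max, eta=0.05):
    if p_min < power < p_max:                       # inside the feasible domain
        return theta + eta * grad_q
    # Remove the component of grad_q along grad_p (move tangent to the boundary).
    projection = (grad_q @ grad_p) / (grad_p @ grad_p) * grad_p
    return theta + eta * (grad_q - projection)

theta = np.zeros(3)
grad_q = np.array([0.8, -0.2, 0.4])     # grad of E[Q] w.r.t. theta (assumed)
grad_p = np.array([1.0, 0.0, 0.0])      # grad of output power w.r.t. theta (assumed)
# Power already at its upper limit: the update keeps theta on the boundary plane.
print(projected_update(theta, grad_q, grad_p, power=1.0, p_min=-1.0, p_max=1.0))
```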

4. System Studies

4.1. Test System

This case simulation focuses on a 39-node power system. This system not only has ten traditional generators that are all in grid-connected operation status, but also includes three EV clusters, three AC clusters, and three PV power plants, respectively, connected to nodes eight, twenty-four, and twenty-eight. Among them, the droop coefficients, governor inertia time constants, and other parameters of different traditional generators vary significantly, and their ramp characteristics and unit power generation costs also differ. The parameter settings refer to reference [35]. The system load is a dynamic active load. The power fluctuation curve within 10 min is shown in Figure 6, with the power fluctuation range being from 2820 MW to 2991 MW, which is used to simulate the change in the active power demand. The load damping coefficient is set to one. For the flexibility resources in the system, intelligent agent control units are established to achieve the coordinated control of large-scale flexible resources. The AC and EV clusters are modeled using the proposed CM in this paper, and a unified first-order inertia link describes their dynamic characteristics. During the simulation, all control commands need to undergo a power boundary check by the CM to ensure that the commands comply with physical and user constraints. In addition, the parameter values, output power reference values, maximum charge and discharge powers, and unit power scheduling costs of the flexible resources are listed in Table 1 and Table 2, respectively. The reference power, power upper and lower limits, unit power scheduling cost, regulation factor, and inertia time constant of the generator sets are listed in Table 3.
As shown in Figure 7, six CMs are deployed in groups by resource type: CM1–CM3 are responsible for the charging/discharging coordination of EV clusters, which correspond to the PV-connected nodes due to the energy storage characteristics of EVs; CM4–CM6 are responsible for the power regulation of AC clusters, which correspond to the nodes with concentrated loads due to the load characteristics of ACs.
The simulation was implemented in MATLAB/Simulink R2022b on a machine with a Core i7-10875H CPU, 32 GB of memory, and an NVIDIA GeForce GTX 1650 Ti GPU.
To verify the effectiveness of the proposed algorithm for the real-time control of the system frequency in large-scale grid-connected scenarios of new energy, this section compares the control performance of several commonly used reinforcement learning algorithms, namely the DQN and DDPG algorithms. The reward decay rate, sampling period, and number of learning iterations of the DQN algorithm are set to 0.7, 0.1 s, and 800 episodes, respectively. It should be noted that the DQN algorithm employs a discrete action space, which means that, when facing continuously varying system states, its action selection is coarse-grained and unable to achieve precise adjustment, resulting in limited control accuracy. The DDPG parameter settings are the same as in reference [36], with 800 training iterations. It should also be noted that this algorithm does not incorporate a constraint penalty term and may therefore generate actions that violate practical constraints during the control process. Under the scenario of the large-scale integration of renewable energy, this could cause the operation of some devices in the system to exceed safety limits or optimal operating ranges, thereby affecting the system's stability and reliability. The parameter settings of the proposed algorithm, including the number of experience trajectories M, the capacity of the experience replay pool D, the number of mini-batch samples N, the learning rate η, the weighting coefficient λ, and the number of agents g, are shown in Table 4.

4.2. System Random and Load Perturbation

To simulate the operation of a complex interconnected power system, this section uses load disturbances and PV output as the simulated disturbance inputs, with a simulation time of 600 s, as shown in Figure 8. The rectangular load disturbance, serving as an idealized step model in the simulations, is used to verify the algorithm's response capability to sudden disturbances and can be extended to more complex load fluctuation scenarios in practical applications [37,38,39].
Figure 9 illustrates the dynamic response curves of system frequency deviations under different control algorithms across three scenarios. To ensure the clear presentation of curve trends and completely avoid marker symbols from obscuring data fluctuations, some markers were removed. The horizontal axis denotes time (600 s), while the vertical axis represents frequency deviation (Hz). Red, orange, and blue curves correspond to the control performance of DQN, DDPG, and the proposed method, respectively. The simulation scenarios include the following: the EVs only (Scenario I), the ACs only (Scenario II), and the clustered CM participation (Scenario III). The results demonstrate that while the DQN algorithm employs discrete action spaces, EV charging/discharging and AC power regulation inherently involve continuous variables. The DQN algorithm requires discretizing continuous control variables into finite actions, leading to insufficient power regulation precision and notable response hysteresis under dynamic disturbances. Although the DDPG algorithm operates in continuous action spaces, it fails to integrate the CM’s power/energy boundaries into its policy network or explicitly address multi-resource coordination, thereby struggling to optimize power allocation between EVs and ACs. To address these limitations, the proposed method employs a unified CM framework that comprehensively models the power constraints, energy constraints, and inertial characteristics of both EVs and ACs. By incorporating gradient projection optimization into policy network parameter updates, this approach ensures strict adherence to physical constraints while significantly reducing computational complexity. Under the three scenarios, the proposed strategy restricts frequency deviations to (−0.028 Hz, 0.030 Hz), (−0.021 Hz, 0.027 Hz), and (−0.017 Hz, 0.021 Hz), respectively. The comparative analysis demonstrates that the developed CM effectively suppresses frequency fluctuations in power systems. Both load clusters exhibit a substantial frequency regulation capacity throughout daily operations and participate in system frequency regulation with enhanced fault tolerance. This strategy provides a superior solution for frequency regulation challenges in renewable-dominated grids, particularly addressing the complexities of high renewable penetration. The proposed framework offers critical advantages in maintaining grid stability while optimizing multi-resource coordination.
Table 5 reveals the core advantages of the CM in integrating two types of flexible resources—EVs and ACs: traditional methods (DQN/DDPG) exhibit limitations even in single scenarios. The DQN algorithm, constrained by its discrete action space, fundamentally stems from the mismatch between discrete regulation granularity and the continuous dynamic characteristics of devices. While the DDPG algorithm optimizes regulation accuracy through a continuous action space, it relies on indirect correction by reducing rewards after command violations occur. In contrast, the proposed method leverages the CM to minimize violations of system operation constraints via projected gradient descent, transforming independent resources into aggregate optimization within collaborative scenarios. This provides a feasible solution for the efficient participation of multi-type flexible resources in grid regulation.
To investigate the impact of different cluster scales on frequency control performance, this study establishes three AC clusters of varying scales: a small-scale cluster with 1000 individual ACs, a medium-scale cluster with 5000 units, and a large-scale cluster with 10,000 units.
As shown in Figure 10, the small-scale cluster exhibits significant individual differences, leading to the “overloading of some devices and idling of others” during regulation. This results in insufficient regulation power and only limited frequency support (frequency fluctuation range (−0.046, 0.048 Hz)), verifying the “small perturbation” characteristic of single loads. The medium-scale cluster transforms individual differences into “group-averaged characteristics” through collective effects, achieving a notable leap in regulation capability: compared with the small-scale cluster, its deviation suppression effect improves to (−0.035, 0.038 Hz), demonstrating the enhanced regulation capacity enabled by scaling. The large-scale cluster realizes a qualitative change in regulation capability, with its frequency fluctuation range narrowed to (−0.026, 0.025 Hz), proving that large-scale aggregation can transform AC clusters into critical regulatory resources for power systems. This phenomenon also validates the CM aggregation mechanism, i.e., the collective effect of load clusters narrows the frequency fluctuation range.
Figure 11 compares the iterative convergence of the cumulative reward values of the DQN and DDPG algorithms and the proposed method. The cumulative reward values of the DQN and DDPG algorithms stabilize after 640 and 520 episodes, respectively. Since each episode contains an experience trajectory of 150 steps, the DQN and DDPG algorithms require 96,000 and 78,000 updates, respectively, to converge. In contrast, the proposed method requires only 150 episodes (22,500 updates) to complete the parameter training of the deep neural networks. In addition, the convergence curve of the proposed algorithm has the smallest oscillation amplitude. The DDPG algorithm exhibits significantly lower cumulative rewards than the proposed method, accompanied by notable oscillations. This deficiency stems from its reliance on experience replay with random sampling, which disrupts the temporal correlations among power system states. Consequently, it learns slowly in extreme operational scenarios and fails to incorporate the physical constraints of DSFRs, thereby frequently triggering penalty mechanisms. Meanwhile, the DQN approach suffers from insufficient control precision or the curse of dimensionality due to action space discretization, while its Q-value maximization strategy tends to overestimate action values, leading to policy deviation. In contrast, the proposed method calculates policy gradients in real time via gradient optimization without relying on experience replay, and the projected gradient method ensures the feasibility of control commands, thereby avoiding penalty losses and increasing the number of effective regulation instances. This verifies the superiority of the proposed strategy in complex power system control.
Table 6 gives a comparison of the different algorithms in terms of system frequency deviation, training time, and decision time. The proposed algorithm improves over the DQN (43.8%, 66.2%, 20%) and DDPG (27.8%, 76.0%, 56.1%) algorithms in terms of frequency control, training time, and decision time. The proposed method achieves better frequency control under the disturbances of new energy and load and greatly reduces the time required for reinforcement learning training [37,38,39]. These advantages are attributed to the CM's unified modeling of heterogeneous resources, which reduces model complexity compared to traditional detailed state-space methods.

4.3. High Percentage of New Energy Disturbances

A high proportion of new energy exhibits strong randomness. To simulate the stochastic variations of a high percentage of new energy sources, only PV is considered as a random fluctuation in the system, and the frequency regulation effect of the load is not considered. Starting from a steady-state initial condition, the system is subjected to PV fluctuations at nodes eight, twenty-four, and twenty-eight; the power fluctuations of the PV power stations used are shown in Figure 12.
Figure 13 presents the system frequency variations of different algorithms under the scenario of high-proportion renewable energy random fluctuations, where only PV is used as the random fluctuation input, and the load frequency regulation effect is not considered. As shown in Figure 12, the PV fluctuations at nodes eight, twenty-four, and twenty-eight are the primary causes of frequency changes. It can be intuitively observed from the curves in Figure 13 that the frequency fluctuation curve of the proposed method is the smoothest. In contrast, the frequency fluctuation amplitudes of the DQN and DDPG algorithms are significantly larger. The proposed method can effectively suppress frequency fluctuations due to its unified modeling and optimization strategy, which enables it to quickly react to random PV fluctuations, adjust system power allocation, and stabilize the frequency. However, restricted by its discrete action space, the DQN algorithm cannot achieve fine-grained control during power regulation. Facing continuous random variations in PV, it struggles to quickly match appropriate actions, leading to larger frequency fluctuations. Since the DDPG algorithm does not effectively handle physical constraints, it fails to accurately coordinate the power output of each device when PV fluctuations change the system’s power demand, resulting in significant frequency fluctuations. These results demonstrate that the proposed method exhibits obvious advantages in frequency control under high-proportion renewable energy random fluctuation scenarios, providing a reliable solution for frequency stability control in high-proportion renewable energy power grids.
Data from Table 7 demonstrate that the proposed method exhibits significant advantages over the DQN algorithm (81.1%, 36.7%) and DDPG algorithm (50.0%, 29.6%) in terms of absolute frequency deviation and maximum frequency deviation, respectively. This is of great significance for actual system operation: taking absolute frequency deviation as an example, this advantage means that the average degree of system frequency deviation from the rated frequency is significantly reduced, minimizing impacts on the normal operation of various devices in the system. In scenarios with high-proportion renewable energy and random fluctuations, rapid changes in PV power pose a severe test to the algorithm’s control capabilities. Due to its discrete action space, the DQN algorithm struggles to achieve fast and precise power regulation in the face of frequent PV power fluctuations. For instance, when PV power fluctuates drastically over a short period, the DQN algorithm cannot select appropriate actions from its limited discrete action set for adjustment in a timely fashion, leading to large frequency deviations. Although the DDPG algorithm has a continuous action space, it lacks the effective handling of physical constraints. When PV fluctuations cause changes in system power demand, the DDPG algorithm fails to accurately coordinate power output. For example, during a sudden drop in PV power, the DDPG algorithm may not restrict the power consumption of certain devices, causing the system frequency to decline rapidly and resulting in significant frequency deviations. In contrast, the proposed method, leveraging unified modeling and optimized strategies, precisely adapts to random PV fluctuations, and rapidly responds to and suppresses frequency deviations. It fully considers the physical constraints of each device in the system, reasonably allocates power during PV fluctuations, maintains stable system operation, and ensures that frequency deviations remain at low levels.
The proposed method introduces the weighting factor λ in the objective function to balance the frequency deviation control effect and the total system FM cost. The frequency and economic indicators under different weighting factors are shown in Table 8.
Table 8 presents the influence of the weight factors in the reward function on frequency deviation and regulation cost, reflecting the critical role of the CM in multi-objective optimization. When λ = 1, the reward function prioritizes minimizing frequency deviation, and the CM’s power boundary constraints ensure that devices do not overload during extreme regulation, resulting in minimal frequency deviation but the highest total cost due to frequent activation of high-cost discharge resources. As λ decreases, the system reduces reliance on high-cost resources, achieving a dynamic balance between frequency stability and economic regulation. This advantage stems from the CM’s unified modeling of heterogeneous resources and the reduction in trial-and-error costs during training.

5. Conclusions

This study proposed a multi-agent reinforcement learning framework integrating a CM to achieve the collaborative frequency control of multi-type DSFRs, such as EVs and ACs. By developing a unified modeling mechanism, the research translates the dynamic characteristics of heterogeneous resources and user constraints into computable collaborative regulation parameters. Considering the uncertainty of flexible resources, a frequency collaborative control model was constructed. The projected gradient descent algorithm was introduced to ensure that control commands meet real-time device physical boundaries and user demands. A state-aware multi-objective optimization strategy was further designed to dynamically adjust weight and penalty factors based on resource real-time states, achieving priority configuration for different objectives such as frequency stability, user constraints, and regulation economy. The proposed method enables the collaborative characterization of multi-resource dynamic behaviors, a reliable guarantee of control command feasibility, and the effective balance of multi-objective optimization. Simulation results demonstrate that this framework significantly enhances frequency regulation performance in complex scenarios, providing a novel solution for large-scale flexible resources to participate in grid regulation. It holds important value for promoting renewable energy integration and enhancing the flexibility regulation capability of power systems.
However, on the demand side, a large number of thermal, cooling, and gas energy resources also exhibit fast-response and equivalent energy storage characteristics. Future research should focus on analyzing how to incorporate more diverse resources into the model, integrating demand-side flexibility to provide multiple services for the power system. Additionally, this paper only uses the weighted sum of frequency regulation cost and performance in the reward function without a specific cost analysis. In practical scenarios, the relationship between frequency regulation cost and performance is more complex and requires further investigation.

Author Contributions

Writing—original draft, X.L.; Writing—review & editing, G.Y.; Formal analysis, T.C. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Tiantian Chen and Jing Liu were employed by the State Grid Shanghai Electric Power Research Institute. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Nomenclature

Variables
α: The thermal inertia coefficient
R: The building's equivalent thermal resistance
C: The equivalent thermal capacitance
η: The energy efficiency coefficient
P_AC^max: The power consumption of the ACs
θ_0(t): The ambient temperature
θ_i(t): The interior temperature
s(t): The binary on/off state
θ_c: The ACs' setpoint temperature
ε: The dead-band width
P_t^base: The baseline charging/discharging power
S_t^AC: The energy state of the ACs at time t
Δt: The time-step duration
P_t^ACsum,max: The maximum power limit
P_t^ACsum,min: The minimum power limit
S_t^AC,CM: The energy state of the AC cluster CM
P_t^AC,CM: The power state of the AC cluster CM
S_t^ACmax: The maximum energy state boundary
S_t^ACmin: The minimum energy state boundary
ΔP_AC(s): The charging/discharging power of the AC cluster
T_AC: The charge/discharge time constant of the AC cluster
K_AC: The droop control parameter of the AC cluster
τ_AC: The time delay of the ACs
s: The Laplace variable
K_ACbase: The reference droop control coefficient for ACs
K_EVbase: The reference droop control coefficient for EVs
F_SOCS^FL: The set value of the state of charge of flexible resources
F_SOC^FL: The state of charge of flexible resources
K_FLbase: The reference droop control coefficient of flexible resources
t_EVon: The grid-connection time of EVs
t_EVoff: The grid-disconnection time of EVs
S_i^EVmax(t): The maximum energy state of the i-th EV at time t
S_i^EVmin(t): The minimum energy state of the i-th EV at time t
P_i^EVmax(t): The maximum charging/discharging power of the i-th EV
P_i^EVmin(t): The minimum charging/discharging power of the i-th EV
S_i^EVmin(t): The minimum energy threshold at the grid-connection time
S_i^EVmin(t): The minimum energy threshold at the grid-disconnection time
S_i^on: The initial energy state of the i-th EV
S_i^ex: The required energy state of the i-th EV
N_EV: The number of EV clusters
x_i(t): The grid-connected status of the i-th vehicle
S_i^EVmax/min: The upper and lower energy limits of the EV cluster CM
P_i^EVmax/min: The upper and lower power limits of the EV cluster CM
S_t^EV,CM: The energy state of the EV cluster's CM
P_t^EV,CM: The power of the EV cluster's CM
ΔP_EV(s): The charging/discharging power of the EV cluster
T_EV: The charge/discharge time constant of the EV cluster
K_EV: The droop control parameter of the EV cluster
τ_EV: The time delay of EVs
t_0: The current moment
kΔt + T: The moment when the response ends
F_ΔSOC,k^CMmin: The minimum charging amount or the maximum discharging amount of the flexible resource
S^CM: The electrical energy of the flexible resource
η_c: The charging efficiency
η_d: The discharging efficiency
F_SOC,CM^target: The set value of the state of charge when the flexible resource is off-grid
t_CM^set: The set off-grid moment
P_CM^max: The maximum charging and discharging power
P_t^CMmin: The lower limit of the frequency regulation power within the response period
F_ΔSOC,k^CMmax: The allowable increase in the electrical energy
P_t^CMmax: The upper limit of the frequency regulation power within the response period
F_SOC^CM(t_0): The state of charge of the flexible resource at time t_0
R: The droop control coefficient
T_Ga: The time constant of the speed governor
T_Gb: The time constant of the steam turbine
M_G(s): The governor-steam turbine dynamics
K_FL: The droop control parameter of the flexible resource cluster
T_FL: The time constant of the flexible resource cluster
e^(−sτ_FL): The time-delay link of the different flexible resource clusters
τ_FL: The delay time constant of the different flexible resource clusters
ΔP_FL: The power output of the different flexible resources
ΔP_CM: The CM power output
Δf: The frequency deviation
Δf_I: The integral of the frequency deviation
Δf_d: The derivative of the frequency deviation
P_FL0: The reference operating point of the flexible resource
P_G0: The reference operating point of the generator
A_j: The controller output
ΔP_FL,j: The power variation of the j-th flexible resource cluster
m: The number of frequency regulation units
n: The number of flexible resource clusters
ΔP_G,i(s): The power variation of the i-th frequency regulation unit
ΔP_L(s): The disturbance power
H: The system's inertia time constant
D: The load damping coefficient
π: The action policy
u(s, θ_u): The policy network
s: The state
Q(s, A, θ_Q): The action-value function
θ_u^j: The neural network parameters
ΔP_G,i(t): The dimensionally unified outputs
C_G,i: The unit power output cost of the generator
C_FL,j: The unit power output cost of the flexible resource cluster
λ: The weighting coefficient
r_i: The reward function
γ: The discount factor
u_i: Weight coefficient
w_1: Weight coefficient
w_2: Weight coefficient
ΔP_i^base: The change in the power reference value
F_SOC^EV,i: The state of charge of the i-th EV
F_SOCS^EV: The set value of the state of charge of EVs
E_{s~ρ}[Q(s, A)]: The expected state-action value function
η: The learning rate
f_θ^(l): The activation function of the l-th layer
p: The number of layers of the neural network
P_0: The datum value
P_G/FL,0: The base operating point
P_G/FL^min: The minimum output power constraint
P_G/FL^max: The maximum output power constraint
Δf̄: The absolute frequency deviation
t_1: The training time
t_2: The decision time
F̄_G: The average frequency modulation (FM) cost
Abbreviations
DSFRs: Demand-side flexible resources
EVs: Electric vehicles
CM: Consistency Model
ACs: Air conditioners
FM: Frequency modulation
DRL: Deep reinforcement learning
PV: Photovoltaic
DQN: Deep Q-network
DDPG: Deep deterministic policy gradient
LFC: Load frequency control

References

1. Fang, J.; Li, H.; Tang, Y.; Blaabjerg, F. On the Inertia of Future More-Electronics Power Systems. IEEE J. Emerg. Sel. Top. Power Electron. 2019, 7, 2130–2146.
2. Hu, P.; Li, Y.; Yu, Y.; Blaabjerg, F. Inertia estimation of renewable-energy-dominated power system. Renew. Sustain. Energy Rev. 2023, 183, 113481.
3. Moore, P.; Alimi, O.A.; Abu-Siada, A. A Review of System Strength and Inertia in Renewable-Energy-Dominated Grids: Challenges, Sustainability, and Solutions. Challenges 2025, 16, 12.
4. Yu, G.; Liu, C.; Tang, B.; Chen, R.; Lu, L.; Cui, C.; Hu, Y.; Shen, L.; Muyeen, S.M. Short term wind power prediction for regional wind farms based on spatial-temporal characteristic distribution. Renew. Energy 2022, 199, 599–612.
5. Yu, G.Z.; Lu, L.; Tang, B.; Wang, S.Y.; Chen, R.S.; Chung, C.Y. Ultra-short-term Wind Power Subsection Forecasting Method Based on Extreme Weather. IEEE Trans. Power Syst. 2022, 38, 5045–5056.
6. Azizi, S.; Sun, M.; Liu, G.; Terzija, V. Local Frequency-Based Estimation of the Rate of Change of Frequency of the Center of Inertia. IEEE Trans. Power Syst. 2020, 35, 4948–4951.
7. Liu, X.; Liu, Y.; Liu, J.; Xiang, Y.; Yuan, X. Optimal planning of AC-DC hybrid transmission and distributed energy resource system: Review and prospects. CSEE J. Power Energy Syst. 2019, 5, 409–422.
8. Shrivastava, S.; Khalid, S.; Nishad, D.K. Impact of EV interfacing on peak-shelving and frequency regulation in a microgrid. Sci. Rep. 2024, 14, 31514.
9. Zhao, H.; Wu, Q.; Huang, S.; Zhang, H.; Liu, Y.; Xue, Y. Hierarchical control of thermostatically controlled loads for primary frequency support. IEEE Trans. Smart Grid 2018, 9, 2986–2998.
10. Jendoubi, I.; Sheshyekani, K.; Dagdougui, H. Aggregation and optimal management of TCLs for frequency and voltage control of a microgrid. IEEE Trans. Power Deliv. 2021, 36, 2085–2096.
11. Yu, Z.; Bao, Y.; Yang, X. Day-ahead scheduling of air-conditioners based on equivalent energy storage model under temperature-set-point control. Appl. Energy 2024, 368, 123481.
12. Cai, L.; Yang, C.; Li, J.; Liu, Y.; Yan, J.; Zou, X. Study on Frequency-Response Optimization of Electric Vehicle Participation in Energy Storage Considering the Strong Uncertainty Model. World Electr. Veh. J. 2025, 16, 35.
13. Shukla; Rajan, R.; Garg, M.M.; Panda, A.K. Driving grid stability: Integrating electric vehicles and energy storage devices for efficient load frequency control in isolated hybrid microgrids. J. Energy Storage 2024, 89, 111654.
14. Wen, Y.; Hu, Z.; You, S.; Duan, X. Aggregate Feasible Region of DERs: Exact Formulation and Approximate Models. IEEE Trans. Smart Grid 2022, 13, 4405–4423.
15. Wang, S.; Wu, W. Aggregate Flexibility of Virtual Power Plants with Temporal Coupling Constraints. IEEE Trans. Smart Grid 2021, 12, 5043–5051.
16. Feng, C.; Chen, Q.; Wang, Y.; Kong, P.Y.; Gao, H.; Chen, S. Provision of Contingency Frequency Services for Virtual Power Plants with Aggregated Models. IEEE Trans. Smart Grid 2023, 14, 2798–2811.
17. Huang, X.; Li, K.; Xie, Y.; Liu, B.; Liu, J.; Liu, Z.; Mou, L. A novel multistage constant compressor speed control strategy of electric vehicle air conditioning system based on genetic algorithm. Energy 2022, 241, 122903.
18. Liu, W.; Tan, J.; Cui, J. Design of a PID Control Scheme for Pound-Drever–Hall Frequency-Tracking System Based on System Identification and Fuzzy Control. IEEE Sens. J. 2024, 24, 39252–39259.
19. He, L.; Tan, Z.; Li, Y.; Cao, Y.; Chen, C. A Coordinated Consensus Control Strategy for Distributed Battery Energy Storages Considering Different Frequency Control Demands. IEEE Trans. Sustain. Energy 2024, 15, 304–315.
20. Xie, J.; Sun, W. Distributional Deep Reinforcement Learning-Based Emergency Frequency Control. IEEE Trans. Power Syst. 2022, 37, 2720–2730.
21. Li, J.; Yu, T.; Cui, H. A multi-agent deep reinforcement learning-based "Octopus" cooperative load frequency control for an interconnected grid with various renewable units. Sustain. Energy Technol. Assess. 2022, 51, 101899.
22. Chen, X.; Zhang, M.; Wu, Z.; Wu, L.; Guan, X. Model-Free Load Frequency Control of Nonlinear Power Systems Based on Deep Reinforcement Learning. IEEE Trans. Ind. Inform. 2024, 20, 6825–6833.
23. Yu, L.; Sun, Y.; Xu, Z.; Shen, C.; Yue, D.; Jiang, T.; Guan, X. Multi-agent deep reinforcement learning for HVAC control in commercial buildings. IEEE Trans. Smart Grid 2021, 12, 407–419.
24. Sahu, P.C.; Baliarsingh, R.; Prusty, R.C.; Panda, S. Novel DQN optimised tilt fuzzy cascade controller for frequency stability of a tidal energy-based AC microgrid. Int. J. Ambient Energy 2020, 43, 3587–3599.
25. Li, J.; Zhou, T. Prior Knowledge Incorporated Large-Scale Multiagent Deep Reinforcement Learning for Load Frequency Control of Isolated Microgrid Considering Multi-Structure Coordination. IEEE Trans. Ind. Inform. 2024, 20, 3923–3934.
26. Lee, W.-G.; Kim, H.-M. Deep Reinforcement Learning-Based Dynamic Droop Control Strategy for Real-Time Optimal Operation and Frequency Regulation. IEEE Trans. Sustain. Energy 2025, 16, 284–294.
27. Dobbe, R.; Hidalgo-Gonzalez, P.; Karagiannopoulos, S.; Henriquez-Auba, R.; Hug, G.; Callaway, D.S.; Tomlin, C.J. Learning to control in power systems: Design and analysis guidelines for concrete safety problems. Electr. Power Syst. Res. 2020, 189, 106615.
28. Zhang, Z.; Zhang, D.; Qiu, R.C. Deep reinforcement learning for power system applications: An overview. CSEE J. Power Energy Syst. 2019, 6, 213–225.
29. Hao, H.; Somani, A.; Lian, J.; Carroll, T.E. Generalized aggregation and coordination of residential loads in a smart community. In Proceedings of the 2015 IEEE International Conference on Smart Grid Communications (SmartGridComm), Miami, FL, USA, 2–5 November 2015; pp. 67–72.
30. Hao, H.; Wu, D.; Lian, J.; Yang, T. Optimal Coordination of Building Loads and Energy Storage for Power Grid and End User Services. IEEE Trans. Smart Grid 2018, 9, 4335–4345.
31. Ulbig, A.; Andersson, G. Analyzing operational flexibility of electric power systems. In Proceedings of the 2014 Power Systems Computation Conference, Wroclaw, Poland, 18–22 August 2014; pp. 1–8.
32. Yu, P.; Zhang, H.; Song, Y.; Hui, H.; Huang, C. Frequency Regulation Capacity Offering of District Cooling System: An Intrinsic-Motivated Reinforcement Learning Method. IEEE Trans. Smart Grid 2023, 14, 2762–2773.
33. Yu, P.; Zhang, H.; Song, Y. Adaptive Tie-Line Power Smoothing with Renewable Generation Based on Risk-Aware Reinforcement Learning. IEEE Trans. Power Syst. 2024, 39, 6819–6832.
34. Yu, P.; Zhang, H.; Song, Y.; Hui, H.; Chen, G. District Cooling System Control for Providing Operating Reserve Based on Safe Deep Reinforcement Learning. IEEE Trans. Power Syst. 2024, 39, 40–52.
35. Moeini, A. Open data IEEE test systems implemented in SimPower Systems for education and research in power grid dynamics and control. In Proceedings of the 2015 50th International Universities Power Engineering Conference, Stoke on Trent, UK, 3 December 2015; pp. 1–6.
36. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv 2020, arXiv:1706.02275.
37. Yang, F.; Huang, D.; Li, D.; Lin, S.; Muyeen, S.M.; Zhai, H. Data-Driven Load Frequency Control Based on Multi-Agent Reinforcement Learning with Attention Mechanism. IEEE Trans. Power Syst. 2023, 38, 5560–5569.
38. Zhang, G.; Li, J.; Xing, Y.; Bamisile, O.; Huang, Q. Data-driven load frequency cooperative control for multi-area power system integrated with VSCs and EV aggregators under cyber-attacks. ISA Trans. 2023, 143, 440–457.
39. Yan, Z.; Xu, Y. A Multi-Agent Deep Reinforcement Learning Method for Cooperative Load Frequency Control of a Multi-Area Power System. IEEE Trans. Power Syst. 2020, 35, 4599–4608.
40. Wu, Z.; Lv, Z.; Huang, X.; Li, Z. Data driven frequency control of isolated microgrids based on priority experience replay soft deep reinforcement learning algorithm. Energy Rep. 2024, 11, 2484–2492.
Figure 1. Air-conditioning load operation characteristic diagram.
Figure 2. Feasible range of electric vehicle charging and discharging.
Figure 3. Rolling evaluation diagram of flexible resource frequency modulation potential.
Figure 4. Different types of flexible resources participate in grid frequency coordination control.
Figure 5. Frequency cooperative control principle model.
Figure 6. Load power within 10 min.
Figure 7. 39-node power system.
Figure 8. Strong random interference to the system.
Figure 9. Frequency deviation of different scenarios under load and PV perturbations: EVs only (Scenario I, (a)), ACs only (Scenario II, (b)), and clustered CM participation (Scenario III, (c)).
Figure 10. The frequency control effect of air-conditioning units of different scales participating in frequency control.
Figure 11. Convergence curves of different algorithms.
Figure 12. Dynamic power changes.
Figure 13. Frequency deviation in different scenarios under high-scale PV perturbation: EVs only (Scenario I, (a)), ACs only (Scenario II, (b)), and clustered CM participation (Scenario III, (c)).
Table 1. Parameter settings of different resources.
Architectural AC parameters:
Thermal capacity: 690.87 kJ/°C | Initial set temperature: 25 °C
Thermal resistance: 7.12 °C/kW | Outdoor temperature: 33 °C
EV parameters:
Energy consumption per kilometer: 0.149 kW·h/km | Charge state upper limit: 90%
EV charging and discharging efficiency: 95% | Lower limit of charge state: 10%
Expected state of charge at home: 70% | Expected state of charge when leaving home: 40%
Table 2. Parameters for different flexible resources.
Parameter | Value
EV charge/discharge power (MW) | 25, 28
EV discharge cost (yuan/MWh) | 90
EV charge cost (yuan/MWh) | 95
AC charge/discharge power (MW) | 20, 24.5
AC discharge cost (yuan/MWh) | 75
AC charge cost (yuan/MWh) | 79
Table 3. Parameter settings for generator sets.
Genset | Minimum Output Power/MW | Maximum Output Power/MW | Generation Cost/(yuan/MWh) | Base Power/MW | Regulation Factor/p.u. | Inertia Time Constant/s
G1 | 150 | 1000 | 35 | 500 | 3.5 | 5.5
G2 | 150 | 1000 | 35 | 500 | 3.5 | 5.5
G3 | 90 | 600 | 40 | 300 | 4 | 4.5
G4 | 90 | 600 | 40 | 300 | 4 | 4.5
G5 | 90 | 600 | 40 | 300 | 4 | 4.5
G6 | 50 | 350 | 60 | 200 | 4.5 | 3.5
G7 | 50 | 350 | 60 | 200 | 4.5 | 3.5
G8 | 30 | 180 | 75 | 100 | 5 | 2.5
G9 | 30 | 180 | 75 | 100 | 5 | 2.5
G10 | 30 | 180 | 75 | 100 | 5 | 2.5
Table 4. Parameter setting.
Parameter | M | D | N | η | γ | g
Value | 500 | 6400 | 32 | 0.03 | 0.95 | 16
Table 5. Comparison of algorithmic control effects in different scenarios.
Method | DQN | DDPG | Proposed Method
Scenario I | (−0.039, 0.046) Hz | (−0.034, 0.042) Hz | (−0.030, 0.035) Hz
Scenario II | (−0.035, 0.041) Hz | (−0.030, 0.035) Hz | (−0.024, 0.026) Hz
Scenario III | (−0.028, 0.030) Hz | (−0.021, 0.027) Hz | (−0.017, 0.021) Hz
Table 6. Performance comparison of different algorithms under PV and load perturbations.
Performance Indicator | DQN | DDPG | Proposed Method
Δf̄/Hz | 0.025 | 0.021 | 0.015
t1/min | 14.9 | 21.6 | 5.1
t2/s | 0.0056 | 0.0088 | 0.0043
Table 7. Performance comparison of different algorithms under high percentage of PV perturbations.
Performance Indicator | DQN | DDPG | Proposed Method
Δf̄/Hz [37,38,39,40] | 0.105 | 0.034 | 0.020
Max Δf | 0.028 | 0.025 | 0.015
t1/min | 15.2 | 24.4 | 6.3
t2/s | 0.0113 | 0.0154 | 0.0096
Table 8. Performance with different weighting factors.
Weighting factor λ | 1 | 0.8 | 0.6 | 0.4 | 0.2
Δf̄ (Hz) | 0.012 | 0.014 | 0.016 | 0.019 | 0.021
F̄_G (yuan) | 5998 | 5904 | 5745 | 5803 | 5769
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
