1. Introduction
The transition from conventional fossil-fuel-based power generation to renewable energy sources (RESs) has significantly transformed the global energy landscape, establishing sustainable and eco-friendly electricity networks [1,2,3]. The current paradigm shift in energy production, characterized by the widespread adoption of renewable sources such as wind and solar energy, owes much to their abundant supply and decreasing costs [4]. Nevertheless, this transition presents substantial challenges to the stability and dependability of electrical grids [5].
Traditional power systems primarily rely on synchronous generators (SGs), which offer the necessary inertia to maintain frequency stability through their large rotating masses [6]. However, with the increased penetration of RESs that lack physical inertia, such as wind and photovoltaic (PV) generation, the system’s overall inertia is reduced, leading to a higher risk of frequency instabilities [7]. This issue is more prominent in islanded microgrids that operate autonomously and cannot depend on the central grid’s inertia for frequency stabilization [8]. A paramount concern is ensuring frequency stability in islanded microgrids, where voltage source converters (VSCs) interface with RESs. These microgrids are often deprived of the inertial support that synchronous generators provide to maintain grid stability. Therefore, a meticulous approach to maintaining frequency stability becomes necessary to ensure a reliable and uninterrupted power supply.
Microgrids (MGs) have emerged as a pivotal element in the evolution of electricity distribution networks, signifying a transformative shift from traditional power systems towards a more distributed, smart grid topology, attributed largely to the integration of distributed energy resources (DERs) [9], including both renewable and conventional energy sources. A microgrid is a network of DERs that can operate in islanded or grid-connected modes [10], and microgrids can be DC, AC, or hybrid [11]. They enhance power quality [12,13], improve energy security [14], enable the integration of storage systems [15,16], and optimize system efficiency. Microgrids also offer economic advantages [17]: they reduce peak load prices, participate in demand response markets, and provide frequency management services to the larger grid [18].
Moreover, the utilization of power-electronics-linked (PEL) technologies in microgrids, despite their benefits, presents notable obstacles. These include intricate control issues resulting from short lines and low inertia within microgrids, leading to voltage and frequency management complications [19]. The interdependence between reactive and active powers, arising from microgrid-specific features such as relatively large R/X ratios [20], poses pivotal considerations for control and market dynamics, particularly regarding voltage characteristics. Additionally, the limited fault current contribution of PEL-based DERs during system faults raises safety and protection concerns [21]. Microgrids also demand computational and communication resources, much like larger power systems, calling for cost-effective and efficient solutions to these challenges. Abrupt or significant load changes can also cause instability in isolated microgrid systems [22]. Sustaining system stability becomes especially demanding when incorporating a blend of inertia-based generators, static-converter-based photovoltaics, wind power, and energy storage devices. This complexity is further compounded by the integration of power electronic devices and virtual synchronous generators, necessitating comprehensive investigations and close equipment coordination to ensure stability.
Various methods are used for microgrid frequency control, including conventional droop control [23] and its more advanced variant, adaptive droop control [24]. Other notable methods include robust control, fractional-order control, fuzzy control, proportional–integral–derivative (PID) control, adaptive sliding mode control [25], and adaptive neural network constraint controllers [26]. Advanced primary control methods relying on communication offer superior voltage regulation and effective power sharing, but they require communication lines among the inverters, which can increase the system’s cost and potentially compromise its reliability and expandability due to long-distance communication challenges [27]. Although control techniques have made significant advancements, challenges common to primary control methods remain. These include slow transient response, frequency and voltage amplitude deviations, and circulating currents among inverters caused by line impedance mismatches [28]. Owing to microgrids’ complexities and varied operational conditions, each control method has advantages and disadvantages, making it difficult for a single control scheme to effectively address all drawbacks in all applications. Ongoing research in this field is crucial for improving the design and implementation of future microgrid architectures, ensuring they can meet the dynamic and diverse needs of modern power systems [29].
Virtual inertia (VI) has been introduced to address these challenges in power systems, particularly in microgrids [30]. VI-based inverters emulate the behavior of traditional SGs. These systems come in various configurations, such as virtual synchronous machines (VSMs) [31], virtual synchronous generators (VSGs) [32], and synchronverters. By emulating the inertia response of a conventional SG, these VI-based systems help stabilize the power grid frequency, thus countering the destabilizing effects of high RES penetration. While implementing VI-based inverters has shown promising results in stabilizing frequency in microgrids, it also presents new challenges and research directions. The selection of a suitable topology depends on the system control architecture and the desired level of detail in replicating the dynamics of synchronous generators. This variety in implementation reflects the evolving nature of VI systems and underscores the need for further research, particularly in the systems-level integration of these technologies [33].
The introduction and advancement of VI technologies in microgrids mark a significant step towards accommodating the growing share of RESs in power systems while maintaining system stability and reliability [34]. As power systems continue to evolve towards a more sustainable and renewable-centric model, the role of VI in ensuring smooth and stable operation becomes increasingly crucial [35].
The current landscape of power system control is characterized by increasing complexity, nonlinearity, and uncertainty, leading to the adoption of machine learning techniques as a significant breakthrough. In particular, reinforcement learning (RL) has shown considerable potential in addressing intricate control challenges in power systems [36]. RL enables a more adaptable and responsive approach to VI control, crucial for maintaining frequency stability in microgrids heavily reliant on RESs [37].
The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is a notable advancement in RL. TD3 is an extension of the Deep Deterministic Policy Gradient (DDPG) algorithm that addresses the overestimation bias found in value-based methods such as Deep Q-Networks (DQNs) and inherited by DDPG. The TD3 algorithm leverages a pair of critic networks to estimate the value function, which helps reduce the overestimation bias. Additionally, the actor network in TD3 is updated less frequently than the critic networks, further stabilizing the learning process [38]. The use of target networks and delayed policy updates in TD3 enhances the stability and performance of the RL agent, making it a robust choice for complex and continuously evolving systems like power grids.
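To make these mechanisms concrete, the following is a minimal, illustrative sketch of one TD3 training step in Python (PyTorch). It is not the implementation used in this work; the network sizes, hyperparameters, and replay-batch interface are assumptions chosen only to show the clipped double-Q target, target-policy smoothing, and delayed actor/target updates.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, act_limit = 4, 1, 1.0  # assumed dimensions for illustration

def mlp(sizes, out_act=nn.Identity):
    # Fully connected network: ReLU hidden layers, configurable output activation.
    layers = []
    for i in range(len(sizes) - 1):
        act = nn.ReLU if i < len(sizes) - 2 else out_act
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)

actor = mlp([obs_dim, 64, 64, act_dim], out_act=nn.Tanh)   # bounded policy output
critic1 = mlp([obs_dim + act_dim, 64, 64, 1])              # first Q-value critic
critic2 = mlp([obs_dim + act_dim, 64, 64, 1])              # second Q-value critic
actor_t, critic1_t, critic2_t = map(copy.deepcopy, (actor, critic1, critic2))  # targets

pi_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
q_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)
gamma, tau, policy_delay, noise_std, noise_clip = 0.99, 0.005, 2, 0.2, 0.5

def td3_step(batch, step):
    obs, act, rew, obs2, done = batch  # mini-batch of tensors from the replay buffer
    with torch.no_grad():
        # Target-policy smoothing: perturb the target action with clipped noise.
        noise = (noise_std * torch.randn_like(act)).clamp(-noise_clip, noise_clip)
        act2 = (act_limit * actor_t(obs2) + noise).clamp(-act_limit, act_limit)
        # Clipped double-Q: use the minimum of the two target critics.
        q_next = torch.min(critic1_t(torch.cat([obs2, act2], dim=-1)),
                           critic2_t(torch.cat([obs2, act2], dim=-1)))
        target = rew + gamma * (1.0 - done) * q_next
    q1 = critic1(torch.cat([obs, act], dim=-1))
    q2 = critic2(torch.cat([obs, act], dim=-1))
    q_loss = ((q1 - target) ** 2).mean() + ((q2 - target) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    if step % policy_delay == 0:
        # Delayed policy update: maximize the first critic's value of the actor's action.
        pi_loss = -critic1(torch.cat([obs, act_limit * actor(obs)], dim=-1)).mean()
        pi_opt.zero_grad()
        pi_loss.backward()
        pi_opt.step()
        # Soft (Polyak) update of all target networks.
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```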
In the context of power systems, RL can be instrumental in optimizing the operation of VI systems. Implementing RL in VI systems involves training an RL agent to control the parameters of the VI system, such as the amount of synthetic inertia to be provided, based on the real-time state of the grid. The agent learns to predict the optimal control actions that would minimize frequency deviations and ensure grid stability, even in the face of unpredictable changes in load or generation [39].
The RL agent’s ability to continuously learn and adapt makes it particularly suited for managing VI systems in dynamic and uncertain grid conditions. For instance, in scenarios with sudden changes in load or unexpected fluctuations in RES output, the RL agent can quickly adjust the VI parameters to compensate for these changes, thereby maintaining grid frequency within the desired range. This adaptability is crucial, given the stochastic nature of RES and the increasing complexity of modern power grids.
Furthermore, implementing RL in VI systems can lead to more efficient and cost-effective grid management. By optimizing the use of VI resources, RL can help reduce the need for expensive traditional spinning reserves, leading to economic benefits for utilities and consumers. It also supports the integration of more RESs into the grid, contributing to the transition towards a more sustainable and low-carbon power system. Applying RL therefore offers a promising pathway for enhancing the operation and efficiency of virtual inertia systems in power grids. In microgrid control, [40] introduced a new variable fractional-order PID (VFOPID) controller, specifically designed for VI applications, that can be fine-tuned online using a neural-network-based algorithm. The proposed VFOPID offers several advantages, including improved system robustness, disturbance rejection, and adaptability to time-delay systems; however, several technical issues remain for the virtual inertia control (VIC) system in terms of algorithm performance, including reducing computational complexity, enhancing accuracy, and improving robustness, as well as testing the proposed controller on a nonlinear microgrid system. Ref. [41] addressed the challenges of inertia droop characteristics in interconnected microgrids and proposed an ANN-based control system to improve coordination in multi-area microgrid control systems. Additionally, [42] presented a secondary controller that utilizes DDPG techniques to ensure voltage and frequency stability in islanded microgrids, with future work including the study of high RES penetration levels. Ref. [43] explored a two-stage deep reinforcement learning strategy that enables virtual power plants to offer frequency regulation services and issue real-time directives to DER aggregators, demonstrating the potential of advanced machine learning in optimizing microgrid operations. This highlights the need for greater utilization of RL techniques in virtual inertia applications and paves the way for newer techniques such as TD3.
This paper addresses a significant issue in power system control: the underutilization of reinforcement learning techniques in implementing VI systems for islanded microgrids. Integrating RESs into microgrids is a step towards sustainable energy, but it can lead to frequency deviations that impact stability and reliability. To tackle this issue, a VI controller based on the TD3 and DDPG algorithms is proposed. The RL-based VI controller is designed to optimize the VI system’s response to frequency deviations, thereby enhancing the stability and reliability of islanded microgrids. This approach fills a critical gap in applying advanced reinforcement learning methods to VI, contributing to the development of more resilient and efficient power systems. This work aims to demonstrate the potential of RL in revolutionizing the control mechanisms of modern power systems, particularly in the context of frequency regulation in microgrids.
The remainder of the paper is organized as follows: Section 2 provides a detailed model of the microgrid system under study. Section 3 introduces the RL algorithms and details their operational principles. Section 4 presents the simulation results, highlighting the efficacy of the proposed RL-based VI controller in regulating frequency deviations. Finally, the paper concludes by summarizing the key contributions of the present study and outlining future directions of research in advancing microgrid technology.
2. System Model
The microgrid system under study represents a common configuration used by oil and gas industries situated in remote areas far from the central power grid. The system also represents a typical power system when the grid is disconnected for an extended period and only emergency supply and renewable energy sources are available. This microgrid predominantly relies on synchronous generators and supplies both motor and static loads. In recent developments, the system has been augmented by integrating renewable energy sources. A prototypical site powered by synchronous generators utilizes droop control to distribute the load evenly. This setup serves as a model in the current study to simulate the dynamic operations characteristic of a standard oil and gas facility. Moreover, an adjacent DC microgrid, sourced from local renewable energy, has been implemented to support the AC grid loads.
Figure 1 illustrates the microgrid configuration being analyzed. This system comprises a diesel generator, various static loads, and induction motor loads. These components are all interconnected at the AC microgrid’s point of common coupling (PCC). Additionally, the DC microgrid is linked to the AC grid through a VSC, which is regulated by a virtual inertia control loop driven by a reinforcement learning agent based on TD3.
The DC microgrid consists of a constant power source representing renewable energy sources, such as a PV or wind system.
The system outlined in Figure 1 is the basis for analyzing the microgrid’s frequency response, focusing on the rate of change of frequency (RoCoF) and the frequency nadir. It also examines how AC-side fluctuations impact the DC microgrid’s DC voltage. Furthermore, the study delves into the dynamic efficacy of the proposed virtual inertia controller for frequency stabilization, as illustrated in the same figure.
For the training of the reinforcement learning agent, a small-signal linearized model of the microgrid’s components has been developed. The outcomes of this analysis are detailed in the subsequent subsections.
2.1. DC Microgrid Modeling
In this study, the VSC serves as the pivotal link between the DC and AC microgrids being examined. The control of the VSC plays a crucial role in maintaining the microgrid’s stability, especially during contingency scenarios. This is achieved by supporting the microgrid frequency, in terms of the rate of change of frequency (RoCoF) and nadir, through the provision of virtual inertia. The VSC accomplishes this by adopting reinforcement learning techniques.
The control system incorporates a reinforcement learning agent trained to reduce frequency deviations and improve nadir values. The study also includes a comparative analysis of two reinforcement learning agents, the DDPG and the TD3, to assess their effectiveness in mirroring dynamic behavior and enhancing overall performance. The subsequent sections detail the results of these methods and present a comparative analysis.
The net power contribution of the DC microgrid towards the AC grid, denoted as P_DC, is calculated by deducting the sum of the constant power load (P_CPL) and the resistive load present in the DC microgrid from the constant power source. Concurrently, the power transmission to the AC microgrid is represented by P_AC. Furthermore, the behavior of the DC link capacitor (C_DC) associated with the interconnecting VSC is described by Equation (1). Disregarding any losses in the VSC, Equation (2) delineates the power delivered to the AC microgrid,
where v_d and v_q represent the voltages of the AC grid in the DQ reference frame, and i_d and i_q represent the output currents of the VSC in the same frame. In the DC microgrid, the renewable energy sources are effectively represented as a constant power source with a power output of P_s. This constant power is consistently fed into the AC grid, and the voltage of the DC grid is controlled through the interlinking VSC. This simplification is justified by the relatively slow variation in the power output of renewable energy sources, especially when compared to the dynamics of the inertia support loop. The resistive loads within the DC microgrid are modeled as a resistance, denoted as R. As a result, the surplus power generated by the DC microgrid can be determined by the following calculation:
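For orientation, given the definitions above, these relations commonly take the following form in the DQ frame; the notation here is assumed for illustration, and the exact per-unit scaling used in the study may differ:

C_{DC}\, V_{DC}\, \frac{dV_{DC}}{dt} \approx P_{DC} - P_{AC}, \qquad
P_{AC} = \frac{3}{2}\left( v_d\, i_d + v_q\, i_q \right), \qquad
P_{DC} = P_{s} - P_{CPL} - \frac{V_{DC}^{2}}{R}.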
The linearized equation of the power transferred to the AC microgrid is given by (4), where V_base and S_base are the base voltage and power of the system, the subscript 0 denotes the operating point around which the system is linearized, and Δi_d and Δi_q are the small changes in the currents in the DQ reference frame.
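Under these definitions, and assuming the DQ-frame voltages are held at their operating-point values, the linearization reduces to a sketch of the following form (the per-unit scaling here is an assumption for illustration):

\Delta P_{AC} \approx \frac{3}{2\, S_{base}} \left( v_{d0}\, \Delta i_d + v_{q0}\, \Delta i_q \right).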
Figure 2 depicts the current control loop of the VSC, where the reference current values are denoted as i_d* and i_q*. In this setup, the q-axis reference is maintained at zero, whereas the d-axis reference is derived from the virtual inertia loop; the control consists of an outer virtual inertia loop and an inner current loop with decoupling components. K_p and K_i represent the proportional–integral (PI) controller gains of the current loop. Notably, the virtual inertia loop is driven by a reference signal provided by the agent’s actions, directly influencing the i_d* reference. The agent’s action is generated by the RL framework, where the states of the environment are measured through the system frequency and the DC link voltage; both are compared to their nominal values to produce the error signals used as states. The reward function drives the agent’s learning to generate the actions that provide the required virtual inertia support.
2.2. Model of Induction Machine
The dynamics of the induction motor (IM), particularly the relations between its stator and rotor voltages and currents within the rotating reference frame, are listed in the equations denoted as (5). While these equations can be formulated using a variety of state variables, including both fluxes and currents, these choices are not independent of one another. For the purpose of cohesively integrating the IM’s state equations into the broader linearized model of the microgrid, it is more advantageous to use currents as the state variables. Consequently, the interplay between stator and rotor voltages and currents within the IM is detailed in the universally recognized synchronous DQ reference frame, as outlined in the following equations [44].
In these equations, L_s and L_r denote the inductances of the stator and rotor, respectively. Similarly, R_s and R_r refer to the resistances of the stator and rotor. Additionally, L_m signifies the mutual inductance, ω_s refers to the synchronous speed, and ω_r indicates the speed of the rotor. The electromagnetic torque is then formulated in terms of these quantities.
The relation between the electromagnetic torque and the mechanical dynamics can then be established, where P stands for the number of poles, J denotes the combined inertia of the motor and its load, and T_L refers to the torque exerted by the load. It is important to note that before proceeding with the linearization of these machine equations, one must consider the influence of the stator supply frequency, which is governed by the droop equations in a microgrid system. This necessitates accounting for the small-signal variations, essential for developing a comprehensive and integrated model for small-signal analysis. Therefore, the linear differential equations for the induction machine can be articulated in state-space form,
where x is the state vector, u is the input vector, A is the system matrix, and B is the input matrix, which is divided into two parts, B_1 and B_2.
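For reference, the electromagnetic torque and rotor mechanical dynamics referred to above commonly take the following standard DQ-frame form; the symbols follow the definitions given in this subsection, but the exact per-unit scaling used in the study may differ:

T_e = \frac{3}{2}\,\frac{P}{2}\, L_m \left( i_{qs}\, i_{dr} - i_{ds}\, i_{qr} \right), \qquad
\frac{2J}{P}\,\frac{d\omega_r}{dt} = T_e - T_L.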
2.3. Model of Diesel Generator
2.3.1. Generator Model
The AC microgrid model utilized in this study incorporates a diesel generator, along with the dynamics of both the governor and the automatic voltage regulator (AVR). The synchronous generator within this model is defined such that ω_s represents the synchronous speed, and the difference between the actual rotor speed and this synchronous speed is expressed as Δω. The stator currents, voltages, and fluxes are expressed in the DQ reference frame, while the rotor fluxes and the input field voltage from the exciter are expressed in their per-unit form, and P_m stands for the mechanical power input from the turbine.
The constants for this per-unit model are detailed in Table 1. The equations that model the diesel generator follow [45,46] and are detailed in [30].
The state-space equations of the synchronous generator are described in (9a) and (9b), and the matrices of the synchronous generator are described in (9c) and (9d).
2.3.2. Governor and Engine Model
In this model, the governor and turbine are configured to endow the generator with a droop gain. This feature is critical for illustrating power distribution when multiple generators are in operation. Furthermore, the throttle actuator and the engine within the model are simulated using a low-pass filter approach, each associated with its own time constant, one for the throttle actuator and one for the engine, detailed as follows [45]:
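A minimal sketch of this droop-plus-lag structure is given below; the symbols R_d (droop gain), τ_a (actuator time constant), and τ_e (engine time constant) are assumed here for illustration and may not match the notation of the original model:

P_m(s) = \frac{1}{1 + s\,\tau_a}\cdot\frac{1}{1 + s\,\tau_e}\left( P_{ref} - \frac{\Delta\omega}{R_d} \right).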
2.3.3. AVR Model
The AVR in the model is designed in line with the IEEE AC5A type, as illustrated in
Figure 3. Key parameters of this AVR are shown in
Table 2.
The AVR setup is determined as follows:
3. Reinforcement Learning Controller
RL is a subset of machine learning in which an agent is trained to make optimal decisions through interactions with an environment guided by a system of states and rewards. This learning process involves the agent developing a policy, essentially a function that maps given states to actions, with the aim of maximizing cumulative rewards over time. In this paper, the RL controller is trained to provide frequency support by controlling the virtual inertia. In this context, the key components of an RL task are observation states, actions, and rewards, as shown in Figure 4.
In the system addressed in this study, the observation state and action are represented as S_t and A_t, respectively. The state comprises Δf, the frequency deviation from its nominal value, and ΔV_DC, the deviation of the DC link voltage from its nominal value, together with the integrated values of these errors. The action A_t refers to the reference input for the VSC controller. The RL framework involves the RL agent interacting with a learning environment, in this case, the VSC controller. At each time step t, the environment provides the RL agent with a state observation S_t. The RL controller then executes an action from its action space, observes the immediate reward R_t, and updates the value of the state–action pair accordingly. This iterative process of exploration and refinement enables the RL controller to approximate an optimal control policy. The reward function is designed to penalize the frequency deviation, the DC link voltage deviation, and the magnitude of the previous action by the RL agent, as follows:
such that |Δf| is the absolute value of the frequency deviation from the nominal value, |ΔV_DC| is the absolute DC link voltage deviation from the nominal value, and A_{t-1} is the previous action by the RL agent; the values of the parameters used in the reward function are shown in Table 3.
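As an illustration of this shaping, a minimal Python sketch of the observation and reward computation is shown below. The nominal values and penalty weights here are placeholders for the parameters reported in Table 3, not the actual values used in the study.

```python
import numpy as np

F_NOM = 60.0      # nominal frequency in Hz (assumed placeholder)
VDC_NOM = 800.0   # nominal DC link voltage in V (assumed placeholder)
W1, W2, W3 = 1.0, 0.1, 0.01  # penalty weights (placeholders for Table 3 values)

def observation(freq, v_dc, int_df, int_dvdc):
    """State: frequency error, DC link voltage error, and their integrated values."""
    return np.array([freq - F_NOM, v_dc - VDC_NOM, int_df, int_dvdc])

def reward(freq, v_dc, prev_action):
    """Penalize frequency deviation, DC voltage deviation, and the previous action."""
    return -(W1 * abs(freq - F_NOM) + W2 * abs(v_dc - VDC_NOM) + W3 * abs(prev_action))
```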
In this study, two RL agents are presented; the first agent is based on DDPG, presented and discussed in detail in [47], and the second agent is based on TD3. This section presents the structure and the training algorithm of the TD3 agent. The TD3 algorithm is an advanced model-free, online, off-policy reinforcement learning method that evolved from the DDPG algorithm. Designed to address DDPG’s tendency to overestimate value functions, TD3 incorporates key modifications for improved performance: it learns two Q-value functions and uses the minimum of the two estimates when forming the learning targets, updates the policy and the target networks less frequently than the Q functions, and adds noise to the target actions to avoid exploitation of actions with high Q-value estimates. The structure of the TD3 actor and critic networks used in this article is shown in Figure 5, and the structure of the DDPG actor and critic is shown in Figure 6.
The network architectures were designed using a comprehensive approach that balanced several considerations, including task complexity, computational resources, empirical methods, insights from the existing literature, and the demands of the different network functions. Networks with more layers and neurons are needed in complex scenarios with high-dimensional state spaces and continuous action spaces. The methodology for selecting the most appropriate network architecture was mainly empirical, entailing the exploration and evaluation of various configurations. This iterative process typically begins with the deployment of relatively simple models, with subsequent adjustments involving incremental increases in complexity in response to training performance and computational time. The existing literature and benchmarks relevant to our task further informed our design choices. By examining successful network configurations applied to similar problems, we could draw upon established insights and best practices as a foundation for our architectural decisions. The activation function at the output neuron of the actor network greatly affected the network’s performance during training; the tanh activation function proved the best fit for the actor network architecture and produced better outcomes than the ReLU activation function.
During its training phase, a TD3 agent actively updates its actor and critic models at each time step, a process integral to its learning. It also employs a circular experience buffer to store past experiences, a crucial aspect of iterative learning. The agent utilizes mini-batches of these stored experiences, randomly sampled from the buffer, to update the actor and critic. Furthermore, the TD3 agent perturbs the chosen action with stochastic noise at each training step, an approach that enhances exploration and learning efficacy.
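The buffer and exploration mechanisms described above can be sketched as follows; the buffer capacity, noise level, and action bound are illustrative assumptions rather than the values used in this work:

```python
import random
from collections import deque

import numpy as np

BUFFER_SIZE = 100_000   # assumed capacity of the circular experience buffer
EXPLORE_STD = 0.1       # assumed standard deviation of the exploration noise
ACT_LIMIT = 1.0         # assumed bound on the action (VSC reference signal)

buffer = deque(maxlen=BUFFER_SIZE)  # oldest experiences are overwritten when full

def store(obs, act, rew, next_obs, done):
    buffer.append((obs, act, rew, next_obs, done))

def sample_minibatch(batch_size=64):
    batch = random.sample(buffer, batch_size)
    return [np.asarray(x) for x in zip(*batch)]  # obs, act, rew, next_obs, done arrays

def explore(policy_action):
    """Perturb the deterministic policy output with stochastic noise for exploration."""
    noisy = policy_action + np.random.normal(0.0, EXPLORE_STD, np.shape(policy_action))
    return np.clip(noisy, -ACT_LIMIT, ACT_LIMIT)
```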
The TD3 uses a combination of deterministic policy gradients and Q-learning to approximate the policy and value functions. The algorithm uses a deterministic actor function, denoted by π(S; θ_π), where θ_π are its parameters; it takes the current state as input and outputs deterministic actions that maximize the long-term reward. The target actor function uses the same structure and parameterization as the actor function but with periodically updated parameters for stability. The TD3 also uses two Q-value critics, Q_k(S, A; θ_k) for k = 1, 2, which take an observation S and action A as inputs and output the expected long-term reward. The two critics generally have the same structure but distinct, differently initialized parameters θ_1 and θ_2. The TD3 additionally utilizes two target critics whose parameters θ'_k are periodically updated with the latest critic parameters. The actor, the target actor, the critics, and their respective targets share identical structures and parameterization forms.
The actor network in a TD3 agent is trained by updating the actor and critic properties at each time step during learning, using mini-batches sampled from the circular experience buffer, with the chosen action perturbed by stochastic noise at each training step, as described above. The actor is trained using a policy gradient. This gradient, ∇_{θ_π} J, is approximated over each mini-batch from the product of two terms: G_a, the gradient of the minimum critic output with respect to the action, and G_π, the gradient of the actor output with respect to the actor parameters, both evaluated for the observation S_i. The actor parameters are then updated using the learning rate α_π.
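A standard form of this actor update, as used in TD3 implementations, is sketched below; the symbols are assumed to match the definitions above, with M the mini-batch size:

\nabla_{\theta_\pi} J \approx \frac{1}{M} \sum_{i=1}^{M} G_{ai}\, G_{\pi i}, \qquad
G_{ai} = \nabla_{A} \min_{k} Q_k(S_i, A; \theta_k)\Big|_{A = \pi(S_i;\theta_\pi)}, \qquad
G_{\pi i} = \nabla_{\theta_\pi} \pi(S_i; \theta_\pi),

with the parameter update \theta_\pi \leftarrow \theta_\pi + \alpha_\pi \nabla_{\theta_\pi} J.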
In the TD3 algorithm, the critics are trained at each training step by minimizing a loss L_k for each critic network. The loss is calculated over a mini-batch of sampled experiences, where y_i is the target value for the ith sample, Q_k(S_i, A_i; θ_k) is the output of the kth critic network for the state S_i and action A_i, and θ_k are the parameters of the kth critic network. This training process helps the critics accurately estimate the expected rewards, contributing to the overall effectiveness of the TD3 algorithm. The critic parameters are then updated using the learning rate α_Q.
The target networks are then slowly updated using the target smoothing factor τ.
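A standard form of these critic and target updates is sketched below; the notation is assumed to be consistent with the definitions above, with γ the discount factor, ε the clipped target-policy smoothing noise, and M the mini-batch size:

y_i = R_i + \gamma \min_{k} Q'_k\!\left(S'_i,\ \pi'(S'_i;\theta'_\pi) + \varepsilon;\ \theta'_k\right), \qquad
L_k = \frac{1}{2M} \sum_{i=1}^{M} \left( y_i - Q_k(S_i, A_i; \theta_k) \right)^2,

followed by the soft updates \theta'_k \leftarrow \tau\,\theta_k + (1-\tau)\,\theta'_k and \theta'_\pi \leftarrow \tau\,\theta_\pi + (1-\tau)\,\theta'_\pi.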
The training algorithm of the TD3 agent is shown in Figure 7.