Article

Networked Multi-Agent Deep Reinforcement Learning Framework for the Provision of Ancillary Services in Hybrid Power Plants

by Muhammad Ikram, Daryoush Habibi and Asma Aziz *
School of Engineering, Edith Cowan University, Joondalup, Perth, WA 6027, Australia
* Author to whom correspondence should be addressed.
Energies 2025, 18(10), 2666; https://doi.org/10.3390/en18102666
Submission received: 28 March 2025 / Revised: 13 May 2025 / Accepted: 16 May 2025 / Published: 21 May 2025
(This article belongs to the Collection Artificial Intelligence and Smart Energy)

Abstract

Inverter-based resources (IBRs) are becoming more prominent due to the increasing penetration of renewable energy sources that reduce power system inertia, compromising power system stability and grid support services. At present, optimal coordination among generation technologies remains a significant challenge for frequency control services. This paper presents a novel networked multi-agent deep reinforcement learning (N—MADRL) scheme for optimal dispatch and frequency control services. First, we develop a model-free environment consisting of a photovoltaic (PV) plant, a wind plant (WP), and an energy storage system (ESS) plant. The proposed framework uses a combination of multi-agent actor-critic (MAAC) and soft actor-critic (SAC) schemes for optimal dispatch of active power, mitigating frequency deviations, aiding reserve capacity management, and improving energy balancing. Second, frequency stability and optimal dispatch are formulated in the N—MADRL framework using the physical constraints under a dynamic simulation environment. Third, a decentralised coordinated control scheme is implemented in the HPP environment using communication-resilient scenarios to address system vulnerabilities. Finally, the practicality of the N—MADRL approach is demonstrated in a Grid2Op dynamic simulation environment for optimal dispatch, energy reserve management, and frequency control. Results demonstrated on the IEEE 14 bus network show that compared to PPO and DDPG, N—MADRL achieves 42.10% and 61.40% higher efficiency for optimal dispatch, along with improvements of 68.30% and 74.48% in mitigating frequency deviations, respectively. The proposed approach outperforms existing methods under partially, fully, and randomly connected scenarios by effectively handling uncertainties, system intermittency, and communication resiliency.

1. Introduction

1.1. Background

The potential of ancillary services from inverter-based resources (IBRs) is evolving with the ongoing penetration of renewable energy sources. To effectively deliver frequency and non-frequency ancillary services, it is essential to develop advanced control techniques for non-synchronous generation [1]. A hybrid power plant (HPP) combines independent electricity generation and storage technologies connected at a single point to provide coordinated electrical power services. This integration optimises land use and increases profitability by enabling participation in various energy markets, including capacity, time-varying pricing, and ancillary services. However, due to grid code requirements, the provision of grid support and ancillary services from HPPs is still in the early stages of development [2]. Frequency-control ancillary services from HPPs are challenging due to the complexity and coordination of heterogeneous technologies. Several studies have explored conventional approaches that involve plant-level control by introducing time-varying set points [3,4], but have encountered challenges with communication delays, slower response, and optimal control. The grid-forming control schemes in IBR-enabled HPPs were investigated in [5] to provide coordinated droop control and inertial support for frequency regulation and mitigation of frequency deviations. In [6], an HPP-enabled environment was considered to investigate the system performance in relation to frequency deviation under highly intermittent situations. In this case, the HPP model was optimised by minimising the frequency overshoot and improving the frequency stability.
For real-time coordination in plant controllers, estimating frequency response by dispatching active power is discussed in [7]. However, this approach does not apply to HPP controllers due to the complexity of coordination control. Different levels of control that need to be coordinated include asset-level control, plant-level control, HPP-level control, and HPP energy management system (EMS) control, as shown in Figure 1 [8]. Thus, it becomes challenging for the HPP EMS control level to acquire time-sensitive frequency control ancillary services such as fast frequency response (FFR) and black start (BS) due to the inverter’s response on different control levels. IBR-enabled systems generate large volumes of data through metering and control systems, offering valuable opportunities for data-driven analysis. These insights can be used to develop practical solutions for frequency control, particularly through cooperative control schemes. System responses are monitored and optimised in these approaches to meet various control objectives [9]. These systems are optimised through tailored training protocols based on Markov chain principles, data-driven strategies, and multi-agent control schemes. Multi-agent reinforcement learning (MARL) schemes are employed for frequency stability using partial communication to enable synergy among controllers and promote cooperative behaviour across the network. Various studies have explored MARL for decentralised generation control and joint optimisation [10]. For example, Markov game theory and graph-theoretic approaches have been applied to frequency control services, where distributed control nodes work together to achieve global control objectives. However, these methods are often computationally inefficient for real-time coordination, and require precise tuning of parameters for accurate modelling and optimal results [11].
Reinforcement learning (RL) approaches are better suited for low-dimensional action domains, which limits their effectiveness in complex and data-intensive IBR-enabled environments. Deep reinforcement learning (DRL) addresses high-dimensional state-action processes, utilising deep neural networks (DNN) for processing high-dimensional data and RL for decision-making. In one study [12], DRL was applied to complex distribution networks to improve voltage stability and frequency control. Another study [13] used DRL with the deep deterministic policy gradient (DDPG) algorithm to optimise voltage fluctuations in PV networks with high solar penetration. Based on policy networks, DRL algorithms have been used to solve frequency control optimisation problems across low- and high-dimensional state-action spaces. For example, in [14], a multi-agent DRL (MADRL) framework was implemented for the bidding process in PV plants, aiming to maximise power utilisation and cost optimisation. Similarly, MADRL was proposed in [15] for optimal demand response (DR) in electric vehicle (EV) charging infrastructure to optimise performance. A comprehensive review of MADRL in [16] highlighted various DRL algorithms for high-dimensional data and assessed their robustness in addressing voltage and frequency stability issues. In [17], MADRL was proposed for real-time voltage stability control to enhance both the learning efficiency and robustness of DRL performance.
Multi-agent deep deterministic policy gradient (MADDPG) and deep-Q-network (DQN) schemes were proposed in [18] for time-sensitive applications to optimise the operation and performance of virtual power plants (VPP). MADDPG employs a decentralised actor-critic architecture in a multi-agent control environment. This setup enables global information access and centralised training, while agents make autonomous decisions based on local observations and execute them in a decentralised manner. For load frequency control (LFC) in multi-area power systems, MADRL has been applied to regulate global functions using local control signals to manage frequency fluctuations [19]. In [20], the authors discussed the possibility of real-time multi-agent coordination; however, optimal selection of hyperparameters poses challenges during fine-tuning and implementation. The MADRL approach proposed in [21] focused on optimising multi-carrier energy grids by utilising stacked denoising autoencoders to dispatch energy nodes hourly. Similarly, an MARL model was introduced in [22] to address optimal load scheduling, energy management, and dispatch challenges in complex energy systems. Both approaches aim to enhance the coordination and efficiency of energy distribution while optimising performance across varying energy nodes and loads [23]. In [24], multi-microgrid scheduling schemes were implemented using soft actor-critic (SAC) and multi-agent control techniques. Meanwhile, an RL-based approach was adopted in [25], utilising a Markov decision process (MDP) in the MADRL framework for timescale bidding in IBR-enabled PV plants. This method coordinates multiple agents by considering stored energy, solar generation, bidding decisions, and other generation sources, with historical data used to initiate real-time coordination and bidding processes in the environment. This approach has demonstrated high profitability and effective management of energy imbalances across multiple timescales [26]. However, it does not address optimisation improvement for numerous IBR-enabled technologies, such as wind and ESS. Table 1 summarises state-of-the-art MADRL schemes for power systems applications involving multiple energy sources, coordination schemes, various algorithms, network scalability, and control performance.
The techniques mentioned in Table 1 are focused on DNN architectures, which are mostly implemented for model-free and high-dimensional complex networks. However, DNN-based approaches often operate as black boxes during the training and execution phases when processing high-dimensional state-action data. This lack of transparency can sometimes result in significant failures, particularly when the model struggles to converge to optimal state and control functions [45]. In addition, several studies have suggested graph-theoretic approaches with MADRL-based models for maintaining voltage stability in complex distribution network systems. This approach uses transformers and graph neural networks (GNN) instead of actor-critic networks for centralised training and decentralised execution [46]. However, while this approach has been applied to optimise physical parameters such as voltage control and stability, it does not fully address the heterogeneity of agents or the nonlinear behaviour of IBR-enabled power networks. Additionally, power resource scheduling is critical for providing ancillary services, but it introduces further challenges when dealing with intermittent energy sources.
This paper presents a networked MADRL (N—MADRL) approach to address these challenges for frequency control services in HPP environments. The N—MADRL approach utilises physical control parameters to develop a model-free framework. This framework is independent of specific controller input data and is highly scalable, fully adaptive, and decentralised for coordinating PV, WP, and ESS plants. In addition, it offers real-time coordination, control synergy, and computational efficiency using Stable-Baselines DRL algorithms. In light of the complexity and nonlinearity of the HPP environment, the N—MADRL model overcomes three key challenges: (1) developing a model-free environment to coordinate numerous energy assets in the HPP network; (2) reducing the intermittency of HPP assets in optimal dispatch, reserve management, and frequency control services; and (3) addressing system vulnerabilities and dynamic performance with a networked communication structure. The proposed N—MADRL scheme ensures fast responses and compliance with grid code requirements for the HPP network. In addition, it effectively manages decentralised coordination across different levels of the HPP system, ensuring the timely provision of optimal dispatch and frequency control services.

1.2. Contribution and Outline

Despite the advancements in coordination control and optimal dispatch of energy assets for harnessing frequency control services in HPP networks, the existing literature on MADRL model-free frameworks for various energy management applications is still in the early exploration stage. Several challenges must be addressed for a fully coordinated HPP environment to provide frequency control ancillary services. First, the conceptual framework of MADRL, discussed in [47,48], is focused on the small-scale aggregation of distributed energy resources (DER), and does not fully exploit the utility-scale HPP environment through networked coordination control. Second, existing models rely on physical constraints and control parameters for the delivery of optimal dispatch [49], reserve management [50], and voltage control [51] based on cost optimisation, lacking focus on dynamic and performance-based control functions for utility-scale HPP environments. Lastly, the communication resiliency and delay-dependent energy network investigated in [52,53] for optimal dispatch decision-making does not address a decentralized networked structure for fast-response frequency control services, regardless of the communication structure. To address these challenges, in this paper we propose a networked multi-agent deep reinforcement learning (N—MADRL) framework in a utility-scale HPP for optimal dispatch and frequency control ancillary services. The primary contributions of this paper are outlined as follows:
(1).
For optimal coordination, most of the relevant literature focuses on centralised and hierarchical control schemes such as those in [30,35]. Here, we develop a model-free HPP environment using the MASAC approach to enable coordination between heterogeneous PV, WP, and ESS agents using local control inputs without global information. In addition, our MASAC approach combines the MAAC and SAC schemes to handle scalability and coordination control with only local information. Compared to existing schemes [38,39], the proposed approach efficiently reduces frequency deviations and regulates HPP frequency for real-time decision-making.
(2).
To facilitate decentralised coordination and improve the learning process, a centralised training and decentralised execution (CTDE) mechanism is proposed in which every agent has unique characteristics for real-time decision-making. Compared with [43,44], in which frequency deviation was considered as a cost-optimisation reward function, we propose a performance-based shared reward function for optimal dispatch of HPP assets, reserve management, and power balancing for frequency control and system stability.
(3).
Considering the communication complexity of HPP agents, which ultimately impacts the robustness and stability of the training model [52], we propose a networked communication model that uses partially, fully, and randomly connected communication topologies for training, testing, and validation processes without affecting performance. The proposed scheme provides robust, stable, and resilient performance regardless of communication topology and delays in real-time decision-making.
The remainder of this paper is structured as follows: the research problem is formulated in Section 2, including the objective functions and constraints for the dynamic HPP environment; the proposed N—MADRL framework is presented in detail in Section 3; the numerical simulation results of the proposed framework are explained in Section 4 through a number of case studies; the results are discussed and analysed in Section 5; finally, Section 6 provides our conclusions and recommendations for future work.

2. Problem Formulation

In this section, the problem formulation for the proposed optimisation problem is addressed, focusing on the frequency stability and optimal dispatch of the HPP environment. The multi-objective function includes power imbalance, frequency deviation, reserve management, and active power sharing. The constraints of the proposed problem focus on PV and WP curtailment, power balancing of the ESS plant, energy storage in the ESS plant, and frequency control. The details are presented in the following sections.

2.1. Description of the HPP Environment

The proposed HPP environment is modelled on the IEEE 14 bus network using a dynamic simulation environment. The HPP environment comprises 14 generation sources: six PV plants, four wind plants, and four ESS plants. The capacities of the PV plants are 11.7 MW, 14.8 MW, 17.6 MW, 10.4 MW, 10.7 MW, and 12.2 MW, located at buses 1, 2, 4, 11, 8, and 14, respectively. The capacities of the WP plants are 26.7 MW, 41.4 MW, 33.4 MW, and 22.8 MW, connected to buses 3, 6, 9, and 7, respectively. The capacities of the ESS plants are 21.8 MWh, 25.75 MWh, 18.6 MWh, and 22.7 MWh, located at buses 5, 12, 13, and 10, respectively. The reserve margins are set to 20% of the maximum capacities of the PV, WP, and ESS plants. A detailed description of the HPP environment is provided in Section 4.1, and the physical parameters are listed in the nomenclature.

2.2. Objective Function

This study aims to maintain frequency stability via optimal dispatch of active power, mitigation of frequency deviation, optimisation of the reserve margin, and improvements to the power sharing capabilities of the HPP environment, as presented in Equation (1):
F = \left\{ \beta_1 \left| P_{HP} - P_{HP}^{ref} \right| + \beta_2 \left| f_{HP} - f_{HP}^{ref} \right| + \beta_3 \left( R_{PV} + R_{WP} + R_{ESS} \right) + \beta_4 \left( P_{PV}^{r} + P_{WP}^{r} + P_{ESS}^{r} - P_{HP} \right) \right\}
where $\beta_1, \beta_2, \beta_3, \beta_4$ are the weighting factors, which are set dynamically for frequency stability, while $\left| P_{HP} - P_{HP}^{ref} \right|$, $\left| f_{HP} - f_{HP}^{ref} \right|$, $R_{PV} + R_{WP} + R_{ESS}$, and $P_{PV}^{r} + P_{WP}^{r} + P_{ESS}^{r} - P_{HP}$ are the power imbalance, frequency deviation, reserve management, and power balancing terms, respectively.

2.3. Constraints

For optimal dispatch and frequency stability, the following constraints must be considered [8].

2.3.1. Curtailment of PV and WP Plant

For integration and coordination of the PV and WP plants, the power capacity should not exceed the threshold for stability of the HPP, as presented below.
P_{PV}^{c} = \begin{cases} P_{PV}, & \text{if } P_{HP} \le P_{HP}^{\max} \\ P_{PV} - (1-\alpha)\left( P_{HP} - P_{HP}^{\max} \right), & \text{if } P_{HP} > P_{HP}^{\max} \end{cases}
P_{WP}^{c} = \begin{cases} P_{WP}, & \text{if } P_{HP} \le P_{HP}^{\max} \\ P_{WP} - \alpha\left( P_{HP} - P_{HP}^{\max} \right), & \text{if } P_{HP} > P_{HP}^{\max} \end{cases}
In Equations (2) and (3), $P_{HP} \le P_{HP}^{\max}$ indicates no curtailment and $P_{HP} > P_{HP}^{\max}$ indicates curtailment, while $\alpha = 1$ corresponds to curtailment of the WP plant and $\alpha = 0$ to curtailment of the PV plant. A binary $\alpha$ was chosen to simplify interoperability and partitioning for the PV and WP agents in the proposed environment, which helps to ensure availability and prioritisation. A continuous $\alpha$, which would allow proportional curtailment of PV and WP, was not considered in this environment.

2.3.2. Power Balancing of the ESS Plant

The ESS plant provides power balancing capabilities by charging during excess power generation and discharging under low power generation, as described in Equation (4):
P_{ESS}^{c} = \begin{cases} P_{ESS}^{\min}, & \text{if } P_{HP} - P_{WP}^{r} - P_{PV}^{r} \le P_{ESS}^{\min} \\ P_{ESS}^{\max}, & \text{if } P_{HP} - P_{WP}^{r} - P_{PV}^{r} \ge P_{ESS}^{\max} \\ P_{HP} - P_{WP}^{r} - P_{PV}^{r}, & \text{otherwise} \end{cases}
In Equation (4), $P_{HP} - P_{WP}^{r} - P_{PV}^{r} \le P_{ESS}^{\min}$ corresponds to the maximum discharging rate, $P_{HP} - P_{WP}^{r} - P_{PV}^{r} \ge P_{ESS}^{\max}$ to the maximum charging rate, and $P_{HP} - P_{WP}^{r} - P_{PV}^{r}$ to the dynamically adjusted charging and discharging rates. It is worth mentioning that the economic costs of ESS operation, such as cycle penalties, state-of-health (SoH), and degradation cost, are not considered in this constraint. Our proposed N—MADRL model is fully performance-oriented, and as such does not include the costs associated with operation.

2.3.3. Energy Storage at ESS Plant

The optimal energy storage at ESS plant with capacity limits is presented in Equation (5):
E_{ESS}^{\min} \le E_{ESS} \le E_{ESS}^{\max}.

2.3.4. Frequency Control

Frequency regulation in the HPP plant ensures safe limits, helping to avoid under-frequency and over-frequency conditions by coordinating curtailments while dynamically adjusting ESS charging and inertial support, as shown in Equation (6):
f_{HP}^{\min} \le f_{HP} \le f_{HP}^{\max}.
The proposed constraints are specific for optimal dispatch and frequency control, while the inverter dynamics, such as voltage and current loops, are not considered in the N—MADRL environment. Thus, the reward is entirely performance-based for optimal dispatch and frequency regulation objectives. In addition, the problem of maintaining frequency stability via optimally dispatching active power, mitigating frequency deviation, optimising reserve management, and improving power sharing capabilities is presented for real-time HPP grid conditions. The optimisation problem is formulated by considering the actual HPP environment, its dynamic behaviour, uncertainties in HPP assets, and the decentralised coordination of multi-agent control of the HPP environment. Conventional optimisation control schemes, such as MPC and rule-based schemes, require a centralised control environment, leading to challenges around scalability, performance, coordination, and adaptability. To address these limitations, the N—MADRL framework proposed in this research provides a decentralised, adaptive, scalable, and coordinated approach for providing frequency-control ancillary services in HPP environments.
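To make the curtailment and ESS constraints concrete, the following minimal Python sketch applies Equations (2)–(4) to a single operating point. The function names, variable names, and numeric limits are illustrative placeholders rather than part of the paper's implementation.

```python
import numpy as np

def apply_curtailment(p_pv, p_wp, p_hp, p_hp_max, alpha=1):
    """Curtailment rule of Eqs. (2)-(3): if the HPP output exceeds its
    limit, the surplus is removed from the WP plant (alpha = 1) or the
    PV plant (alpha = 0), mirroring the binary alpha discussed above."""
    surplus = p_hp - p_hp_max
    if surplus <= 0:                       # no-curtailment region
        return p_pv, p_wp
    return p_pv - (1 - alpha) * surplus, p_wp - alpha * surplus

def ess_setpoint(p_hp, p_pv_r, p_wp_r, p_ess_min, p_ess_max):
    """ESS power-balancing rule of Eq. (4): the residual demand after the
    renewable reserves, clipped to the charging/discharging limits."""
    residual = p_hp - p_wp_r - p_pv_r
    return float(np.clip(residual, p_ess_min, p_ess_max))

# Illustrative operating point (MW values are placeholders).
p_pv_c, p_wp_c = apply_curtailment(p_pv=60.0, p_wp=110.0,
                                   p_hp=180.0, p_hp_max=170.0)
p_ess_c = ess_setpoint(p_hp=180.0, p_pv_r=12.0, p_wp_r=25.0,
                       p_ess_min=-20.0, p_ess_max=20.0)
```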

3. Proposed Networked Multi-Agent Deep Reinforcement Learning (N—MADRL) Approach

This section addresses the network representation of N—MADRL to develop a decentralised coordination control network for the HPP environment. The preliminaries of MADRL presented in this section are used for learning policies according to states, actions, observations, rewards, and the transition state of N—MADRL agents in the environment. The control policies of the N—MADRL framework are presented using the MASAC approach, which combines MAAC and SAC schemes. The scalability and heterogeneity issues of N—MADRL are addressed and developed through decentralised control functions. At the end of this section, the implementation of N—MADRL is presented using current and target value functions.

3.1. Network Representation

In a conventional power network, grid operators use centralised control for power balancing and frequency regulation. Today, due to the increasing penetration of IBR-enabled resources, centralised control is changing to decentralised control to meet scalability, computational, and optimal performance requirements. In addition, the complexity of centralised control networks can be reduced by decentralised control systems thanks to their ability to independently and autonomously operate power network resources. As our proposed HPP network is an IBR-enabled power system, we used the N—MADRL framework for frequency-control ancillary services. In the proposed N—MADRL framework, a decentralised network of HPP resources is coordinated using multi-agent control, placing each resource within a collaborative learning environment for frequency regulation and power balancing. In the proposed N—MADRL framework, each resource is considered as an independent agent coordinating with corresponding agents to create a multi-agent environment. The decision-making process of a multi-agent control environment involves learning and executing without the need for centralised control systems.
To shape the centralised control operation within a decentralised model, our proposed HPP environment is divided into three zones for the PV, WP, and ESS plants; each zone has a decentralised topology for interacting with neighbouring plants. Let the HPP environment be represented by the graph $G = \{V, E\}$, where $V$ is the set of agents and $E$ the set of communication links. The numbers of PV, WP, and ESS agents are $N_{PV}$, $N_{WP}$, and $N_{ESS}$, with the assets of each zone given by $N_{PV} = \{N_{pv1}, \ldots, N_{pvn}\}$, $N_{WP} = \{N_{wp1}, \ldots, N_{wpn}\}$, and $N_{ESS} = \{N_{ess1}, \ldots, N_{essn}\}$, respectively. The agents are connected through $G = \{V_{PV}, V_{WP}, V_{ESS}, E\}$ with $E \subseteq V_{PV} \times V_{WP} \times V_{ESS}$, and the total number of HPP agents is $V_{HPP} = V_{PV} + V_{WP} + V_{ESS}$. The adjacency matrix has $A_{ij} = 1$ for adjacent agents and $A_{ij} = 0$ for non-adjacent agents, the degree matrix $D = \mathrm{diag}[D_{11}, \ldots, D_{nn}]$ records the number of connections of each agent, and the Laplacian matrix $L = D - A$ identifies the connected agents. The Laplacian matrix $L_{ij}$ of the networked graph is shown below.
L_{ij} = \begin{cases} -1, & \text{if } j \in N_i \\ |N_i|, & \text{if } j = i \\ 0, & \text{otherwise} \end{cases}
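As a concrete illustration of the graph quantities above, the short sketch below builds the adjacency, degree, and Laplacian matrices for a small hypothetical communication graph; the six-agent ring topology is purely illustrative and not the paper's 14-agent layout.

```python
import numpy as np

def laplacian(adjacency: np.ndarray) -> np.ndarray:
    """Graph Laplacian L = D - A for a communication graph, matching
    Eq. (7): L[i, j] = -1 if agents i and j are neighbours, and
    L[i, i] = |N_i|, the number of neighbours of agent i."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

# Hypothetical 6-agent ring: two PV, two WP and two ESS agents.
A = np.zeros((6, 6), dtype=int)
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1

L = laplacian(A)
assert np.allclose(L.sum(axis=1), 0)   # rows of a Laplacian sum to zero
```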

3.2. MADRL Preliminaries

For the MADRL network of the HPP environment, a Markov decision process (MDP) is used for learning policies according to their states, actions, observations, rewards, and the global state of the agents [46]. For a given time $t$, the MDP is represented by the five-tuple $\langle N, S^t, A^t, O^t, R^t \rangle$, where $N = \{1, 2, \ldots, n\}$ is the set of agents, $S^t$ is the global state, $A^t = \{a_1, a_2, a_3, \ldots, a_n\}$ is the set of agent actions, $O^t = \{o_1, o_2, o_3, \ldots, o_n\}$ is the set of agent observations, and the reward is $R_i : S^t \times \{r_1, r_2, r_3, \ldots, r_n\} \rightarrow R^t$. Each agent chooses an action based on its observation, which is called the policy mapping $\pi_i : S_i \rightarrow P(A_i)$. After the agent executes action $A^t$, it receives an immediate reward $r_i^t : S_i^t \times A_i^t \rightarrow R^t$ and proceeds to the next state according to $P(S^{t+1} \mid S^t, A^t)$. The agent then observes new states and performs actions based on the new observations $P(O_i^{t+1} \mid S^t)$. This process continues until the optimal policy $\pi_i^* : \{\pi_1, \pi_2, \pi_3, \ldots, \pi_n\}$ is maximised, as follows:
M_{\pi_i^*} = \max_{\pi_i} \; \mathbb{E}_{a_i^t \sim \pi_i} \left[ \sum_{t=0}^{\infty} \gamma^t r_i^t \right]
where $M_{\pi_i^*}$ is the expected return from the optimal policy and $\gamma \in [0, 1]$ is the discount factor used to weight immediate and long-term future rewards.
In addition, a state-value function represented by V t ( s ) is used to maximise the policy returns for optimal dispatch and frequency stability. The action-value function is represented by Q t ( s , a ) , which performs optimal actions in specific states for optimal dispatch and frequency stability. The state-value and action-value functions of agents are represented as follows:
V_i(s = f_{HP}, P_{HP}) = \max_{\pi_i} \; \mathbb{E}_{a_i^t \sim \pi_i} \left[ \sum_{t=0}^{T} \gamma^t r_i^t \;\middle|\; s = f_{HP}, P_{HP} \right],
Q_i(s = f_{HP}, P_{HP},\, a = P_{PV}, P_{WP}, P_{ESS}) = \max_{\pi_i} \; \mathbb{E}_{a_i^t \sim \pi_i} \left[ \sum_{t=0}^{T} \gamma^t r_i^t \;\middle|\; s = f_{HP}, P_{HP},\, a = P_{PV}, P_{WP}, P_{ESS} \right].
In the N—MADRL scheme, the agents operate under a partially observable Markov game (POMG) [40] with system state $S^t = \{P_{HP}, P_{PV}^{r}, P_{WP}^{r}, P_{ESS}^{r}\}$. The observations $O_i^t, O_j^t, O_k^t$ for the actions $A_i^t, A_j^t, A_k^t$ of agents $i, j, k$ at time $t$ are bounded within $[-l, l]$, where $l$ is the range of optimal dispatch and reserve management. In addition, every agent interacts only with neighbouring agents in the PV, WP, and ESS zones. The states and observations of the agents are limited to their zones, meaning that they do not contribute to global agents in the global space. In this way, the POMG is represented as follows:
  • Agents: Agents i , j , k represent PV, WP, and ESS assets, respectively.
  • Region Set: The HPP environment is considered in terms of PV, WP, and ESS zones.
  • System States: The system state S t contains the global information set of all agents, for instance P HP , P PV r , P WP r , P ESS r .
  • Agent Observation: The observation sets of PV, WP, and ESS agents are O i t , O j t , O k t , respectively.
  • Agent Actions: The actions $A_i^t, A_j^t, A_k^t$ of agents $i, j, k$ at time $t$ represent the optimal dispatch and frequency control decisions. Each action is bounded within $[-l, l]$, where $l$ represents the range of optimal dispatch and reserve management.
  • Reward Function: For optimal dispatch and frequency stability, the shared reward function r h p p t accumulates the system reward r s y s t and agent reward r p v t + r w p t + r e s s t , formulated as follows:
r_{sys}^{t} = -\beta_1 \left| P_{HP} - P_{HP}^{ref} \right| - \beta_2 \left| f_{HP} - f_{HP}^{ref} \right| + \beta_3 \left( R_{PV} + R_{WP} + R_{ESS} \right) + \beta_4 \left( P_{PV}^{r} + P_{WP}^{r} + P_{ESS}^{r} - P_{HP} \right)
where $\left| P_{HP} - P_{HP}^{ref} \right|$ and $\left| f_{HP} - f_{HP}^{ref} \right|$ are negative rewards, while $R_{PV} + R_{WP} + R_{ESS}$ and $P_{PV}^{r} + P_{WP}^{r} + P_{ESS}^{r} - P_{HP}$ are positive rewards. The agent rewards are formulated on power balancing, energy reserve, and frequency control, as follows:
r_{pv}^{t} = -\alpha_1 \left| P_{pv}^{r} - P_{pv}^{ref} \right| - \alpha_2 \left| f_{HP} - f_{HP}^{ref} \right| + \alpha_3 R_{pv},
r_{wp}^{t} = -\omega_1 \left| P_{wp}^{r} - P_{wp}^{ref} \right| - \omega_2 \left| f_{HP} - f_{HP}^{ref} \right| + \omega_3 R_{wp},
r_{ess}^{t} = -\sigma_1 \left| P_{ess}^{r} - P_{ess}^{ref} \right| - \sigma_2 \left| f_{HP} - f_{HP}^{ref} \right| + \sigma_3 R_{ess}.
The evaluation of the reward coefficients is aligned with optimal dispatch and frequency stability. Thus, the weights $\beta_1$, $\alpha_1$, $\omega_1$, and $\sigma_1$ are two to three times greater than the remaining weights. The optimisation of these weights is focused on the N—MADRL model to develop performance-based rewards [54]. This approach prioritises grid stability and frequency regulation, as opposed to the cost optimisation in [14], which focused on minimising cost despite compromising grid performance. The shared reward of the N—MADRL scheme for the HPP environment is
r_{hpp}^{t} = r_{sys}^{t} + r_{pv}^{t} + r_{wp}^{t} + r_{ess}^{t}.
(A minimal code sketch of this shared reward computation is given after the list below.)
  • State Transitions: When the PV, WP, or ESS agents execute actions $A_i^t$, $A_j^t$, and $A_k^t$ under states $S_i^t$, $S_j^t$, and $S_k^t$, the transition function moves the environment to new states $S_i^{t+1}$, $S_j^{t+1}$, and $S_k^{t+1}$. As the transition functions of the agents are unknown within the environment, the proposed N—MADRL approach chooses an optimal policy for the HPP agents. The optimal policy $\pi_i^* : \{\pi_1, \pi_2, \pi_3, \ldots, \pi_n\}$ maximises the expected values before completion of the transition function.
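The following sketch evaluates the shared reward of Equations (11)–(15) for one time step. The weight values and operating numbers are placeholders chosen only to respect the stated rule that the dispatch weights are two to three times larger than the others.

```python
def system_reward(p_hp, p_hp_ref, f_hp, f_hp_ref, reserves, p_r_total,
                  beta=(2.0, 1.0, 1.0, 1.0)):
    """System reward r_sys of Eq. (11): deviation terms are penalised,
    reserve and power-balancing terms are rewarded (beta values are
    illustrative placeholders)."""
    b1, b2, b3, b4 = beta
    return (-b1 * abs(p_hp - p_hp_ref) - b2 * abs(f_hp - f_hp_ref)
            + b3 * sum(reserves) + b4 * (p_r_total - p_hp))

def agent_reward(p_r, p_ref, f_hp, f_hp_ref, reserve, w=(2.0, 1.0, 1.0)):
    """Per-agent reward of Eqs. (12)-(14) with illustrative weights."""
    w1, w2, w3 = w
    return -w1 * abs(p_r - p_ref) - w2 * abs(f_hp - f_hp_ref) + w3 * reserve

# Shared HPP reward of Eq. (15) for a single hypothetical time step.
r_sys = system_reward(p_hp=98.0, p_hp_ref=100.0, f_hp=49.98, f_hp_ref=50.0,
                      reserves=(4.0, 6.0, 5.0), p_r_total=101.0)
r_hpp = r_sys + sum(agent_reward(p, p_ref, 49.98, 50.0, res)
                    for p, p_ref, res in [(32.0, 33.0, 4.0),   # PV zone
                                          (45.0, 46.0, 6.0),   # WP zone
                                          (21.0, 22.0, 5.0)])  # ESS zone
```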

3.3. Multi-Agent and Soft Actor–Critic (MASAC) Approach

In the multi-agent actor-critic (MAAC) approach, the actor selects actions to maximise the current rewards, while the critic evaluates actions for the actor to maximise expected future rewards. Referring to Equations (9) and (10), the value functions $V_i(s = f_{HP}, P_{HP})$ and $Q_i(s = f_{HP}, P_{HP}, a = P_{PV}, P_{WP}, P_{ESS})$ are optimised for the state-action context. Agents $V_{PV}, V_{WP}, V_{ESS}$ contain decentralised actors that use the local states $P_{PV}, P_{WP}, P_{ESS}$ in the policy $\pi_i^*$. For optimal dispatch and frequency control, this provides optimal results for the actor loss and critic loss as well as fast convergence and an optimal learning process that maximises rewards. Through global observation, the critic network provides optimal actions $A_i^*$, $A_j^*$, and $A_k^*$ for the current states $S_i^t$, $S_j^t$, and $S_k^t$. In testing, each agent $V_{PV}, V_{WP}, V_{ESS}$ with states $P_{PV}, P_{WP}, P_{ESS}$ accesses its own actor for the optimal actions $A_i^*$, $A_j^*$, and $A_k^*$, ensuring operation in a decentralised fashion. As the HPP environment has a continuous action space, a soft actor-critic (SAC) scheme is integrated with MAAC to measure the entropy arising from the uncertainties and randomness of policy actions. Unlike the DDPG and PPO schemes, SAC balances agent exploration and exploitation during training. The expected reward is updated using entropy maximisation, as follows:
J(\pi) = \sum_{k=0}^{n} \mathbb{E}_{(s_k, a_k) \sim \pi} \left[ r(s_k, a_k) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_k)\big) \right]
where $\mathcal{H}\big(\pi(\cdot \mid s_k)\big)$ is the entropy function and $\alpha$ is the stochasticity control parameter for policy optimisation. Referring to Equations (9) and (10), the expected returns are modified for the entropy-regularised functions, as follows:
V^{\pi}(s = f_{HP}, P_{HP}) = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k \left( r_k + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_k)\big) \right) \;\middle|\; s_0 = f_{HP}, P_{HP} \right],
Q^{\pi}(s = f_{HP}, P_{HP},\, a = P_{PV}, P_{WP}, P_{ESS}) = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k \left( r_k + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_k)\big) \right) \;\middle|\; s_0 = f_{HP}, P_{HP},\, a_0 = P_{PV}, P_{WP}, P_{ESS} \right].
The relationship between V π ( s ) and Q π ( s , a ) is expressed as follows:
V^{\pi}(s = f_{HP}, P_{HP}) = \mathbb{E}_{a \sim \pi} \left[ Q^{\pi}(s = f_{HP}, P_{HP},\, a = P_{PV}, P_{WP}, P_{ESS}) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_k)\big) \;\middle|\; s_0 = f_{HP}, P_{HP} \right].
To balance the exploration process and develop a probabilistic model instead of a deterministic model, the entropy-regularised Bellman equation is used to handle a highly sampled complex HPP environment. Because the HPP environment is a probabilistic model, the entropy-regularised Bellman equation is expressed as follows:
Q^{\pi}(s = f_{HP}, P_{HP},\, a = P_{PV}, P_{WP}, P_{ESS}) = \mathbb{E}_{s_{k+1} \sim P,\, a_{k+1} \sim \pi} \left[ r_k + \gamma \left( Q^{\pi}(s_{k+1}, a_{k+1}) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_{k+1})\big) \right) \right].
Using Equation (20), the entropy-regularised Bellman equation is then rearranged as follows:
Q^{\pi}(s = f_{HP}, P_{HP},\, a = P_{PV}, P_{WP}, P_{ESS}) = \mathbb{E}_{s_{k+1} \sim P} \left[ r_k + \gamma \, V^{\pi}(s_{k+1} = f_{HP}, P_{HP}) \right].
The policy in the HPP environment is improved and updated at every k iteration for the value functions V π ( s ) and Q π ( s , a ) .
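A compact way to see the effect of the entropy term in Equations (20) and (21) is the bootstrapped target used when fitting the soft Q-function. The sketch below, written with PyTorch tensors, uses illustrative values for the discount gamma and the temperature alpha rather than the paper's settings.

```python
import torch

def soft_q_target(reward, next_q, next_log_prob, gamma=0.99, alpha=0.2):
    """Entropy-regularised Bellman target in the spirit of Eqs. (20)-(21):
    y = r + gamma * (Q(s', a') - alpha * log pi(a'|s')), where
    -log pi(a'|s') is a one-sample estimate of the policy entropy H."""
    return reward + gamma * (next_q - alpha * next_log_prob)

# Dummy mini-batch standing in for samples drawn from the replay buffer.
r        = torch.tensor([0.10, -0.05])     # shared rewards r_hpp
q_next   = torch.tensor([1.20,  0.80])     # target critic Q(s', a')
log_prob = torch.tensor([-1.50, -0.90])    # log pi(a'|s') from the actor
y = soft_q_target(r, q_next, log_prob)     # targets for the critic update
```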

3.4. Scalability and Heterogeneity

As the HPP environment includes PV, WP, and ESS assets with different capacities, control functions, and operating zones, the networked approach of the MADRL model meets the need for scalability and heterogeneity of agents. Let the state of agents V HPP , V PV , V WP , V ESS be P HP , P PV , P WP , P ESS such that P HP = P PV + P WP + P ESS using Equations (2)–(4). Similarly, the frequencies using Equation (6) are f PV , f WP , f ESS , f HP . The initial states P PV , P WP , P ESS and f PV , f WP , f ESS after k iterations are updated with a control input U i j , as shown below.
\begin{bmatrix} P_{PV}(k+1) \\ P_{WP}(k+1) \\ P_{ESS}(k+1) \\ P_{HP}(k+1) \end{bmatrix} = \begin{bmatrix} P_{PV}(k) \\ P_{WP}(k) \\ P_{ESS}(k) \\ P_{HP}(k) \end{bmatrix} + \begin{bmatrix} U_{ij}^{pv}(k) \\ U_{ij}^{wp}(k) \\ U_{ij}^{ess}(k) \\ U_{ij}^{hpp}(k) \end{bmatrix}
\begin{bmatrix} f_{PV}(k+1) \\ f_{WP}(k+1) \\ f_{ESS}(k+1) \\ f_{HP}(k+1) \end{bmatrix} = \begin{bmatrix} f_{PV}(k) \\ f_{WP}(k) \\ f_{ESS}(k) \\ f_{HP}(k) \end{bmatrix} + \begin{bmatrix} U_{ij}^{pv}(k) \\ U_{ij}^{wp}(k) \\ U_{ij}^{ess}(k) \\ U_{ij}^{hpp}(k) \end{bmatrix}
Referring to Equation (22) for the active power states and Equation (23) for frequency control, the control inputs are $U_{ij}^{pv}, U_{ij}^{wp}, U_{ij}^{ess}$, which after coordination with neighbouring agents become $U_{ij}^{pv} = \sum_{j \in N_i} (P_j^{pv} - P_i^{pv})$, $U_{ij}^{wp} = \sum_{j \in N_i} (P_j^{wp} - P_i^{wp})$, and $U_{ij}^{ess} = \sum_{j \in N_i} (P_j^{ess} - P_i^{ess})$. Equation (22) is then updated as:
P_{PV}(k+1) = P_{PV}(k) + \alpha_{pv} \sum_{j \in N_i} \left( P_j^{pv}(k) - P_i^{pv}(k) \right),
P_{WP}(k+1) = P_{WP}(k) + \alpha_{wp} \sum_{j \in N_i} \left( P_j^{wp}(k) - P_i^{wp}(k) \right),
P_{ESS}(k+1) = P_{ESS}(k) + \alpha_{ess} \sum_{j \in N_i} \left( P_j^{ess}(k) - P_i^{ess}(k) \right).
The convergence factors α pv , α wp , α ess ranging from 0–1 represent the speed of convergence in decentralised coordination and assess the stability in communication resiliency. Based on sensitivity analysis and optimal convergence, the lowest value of α pv , α wp , α ess demonstrates the slowest convergence, while the highest values of α pv , α wp , α ess accelerate the convergence process. In our proposed N—MADRL environment, the range of α pv , α wp , α ess is kept dynamic for the PV, WP, and ESS zones to assess the communication resiliency in dynamically changing topologies along with the speed of decentralised coordination. The combined convergence factors are expressed as follows:
\alpha = \frac{1}{N} \left( \sum_{i \in P_{PV}} \alpha_{pv} + \sum_{j \in P_{WP}} \alpha_{wp} + \sum_{k \in P_{ESS}} \alpha_{ess} \right).
The resiliency of the N—MADRL scheme is evaluated by considering communication failures under dynamic topologies in the HPP environment, as shown in Figure 2. For the time-varying dynamic topology at time $t$, the control inputs for the PV, WP, and ESS agents are $U_{ij,t}^{pv}, U_{ij,t}^{wp}, U_{ij,t}^{ess}$, respectively. The PV, WP, and ESS agents exchange their control inputs with neighbouring agents as $U_{ij,t}^{pv} = \sum_{j \in N_i} (P_{j,t}^{pv} - P_{i,t}^{pv})$, $U_{ij,t}^{wp} = \sum_{j \in N_i} (P_{j,t}^{wp} - P_{i,t}^{wp})$, and $U_{ij,t}^{ess} = \sum_{j \in N_i} (P_{j,t}^{ess} - P_{i,t}^{ess})$, respectively. In this way, the control functions and training of heterogeneous agents under the dynamic topology are redeveloped, and Equations (24)–(26) are updated as follows:
P_{PV}(k+1, t) = P_{PV}(k, t) + \alpha_{pv} \sum_{j \in N_i} \left( P_j^{pv}(k, t) - P_i^{pv}(k, t) \right),
P_{WP}(k+1, t) = P_{WP}(k, t) + \alpha_{wp} \sum_{j \in N_i} \left( P_j^{wp}(k, t) - P_i^{wp}(k, t) \right),
P_{ESS}(k+1, t) = P_{ESS}(k, t) + \alpha_{ess} \sum_{j \in N_i} \left( P_j^{ess}(k, t) - P_i^{ess}(k, t) \right).
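The coordination updates of Equations (24)–(30) have the structure of a consensus iteration, which the following sketch reproduces for a small hypothetical PV zone; the three-agent line topology and step size are illustrative only, and a time-varying topology is modelled simply by passing a different adjacency matrix at each call.

```python
import numpy as np

def consensus_step(p, adjacency, alpha):
    """One decentralised coordination step in the form of Eqs. (24)-(30):
    p_i(k+1) = p_i(k) + alpha * sum_{j in N_i} (p_j(k) - p_i(k))."""
    p = np.asarray(p, dtype=float)
    correction = adjacency @ p - adjacency.sum(axis=1) * p
    return p + alpha * correction

# Hypothetical three-agent PV zone connected in a line topology.
p_pv = np.array([11.7, 14.8, 17.6])
A_t  = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
for _ in range(50):
    p_pv = consensus_step(p_pv, A_t, alpha=0.35)
print(p_pv)   # the set points converge towards a common coordinated level
```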

3.5. N—MADRL Implementation

The detailed illustration of the N—MADRL approach is presented in Figure 3. The current value network is a Q-value network in the training process, which minimises the difference between the predicted and target values of the Q functions. Similarly, the target value network is a stabilised version of the current value network, which helps to mitigate training instability and divergence. As the HPP environment is nonstationary for the PV, WP, and ESS assets, the control policy of each agent is interdependent and requires optimal coordination. The N—MADRL scheme uses a centralised critic network and decentralised actor networks for optimal training of the HPP environment. It uses decentralised actors with the collective observations and actions of the PV, WP, and ESS assets, $C_k^{pv} = \{o_1^k, \ldots, o_n^k, a_1^k, \ldots, a_n^k \mid \theta_i^Q\}$, $C_k^{wp} = \{o_1^k, \ldots, o_n^k, a_1^k, \ldots, a_n^k \mid \theta_j^Q\}$, and $C_k^{ess} = \{o_1^k, \ldots, o_n^k, a_1^k, \ldots, a_n^k \mid \theta_k^Q\}$, respectively. The critic network then estimates the Q-values $Q^{\pi}(s = f_{HP}, P_{HP}, a = P_{PV}, P_{WP}, P_{ESS})$ using centralised training. For the states $P_{PV}, P_{WP}, P_{ESS}$, the agents access their actors for the actions $A_i^* = \mu(o_i^k \mid \theta_i^{\mu})$, $A_j^* = \mu(o_j^k \mid \theta_j^{\mu})$, and $A_k^* = \mu(o_k^k \mid \theta_k^{\mu})$ in a decentralised fashion, which requires only the observation of each agent. The proposed Q-value and target value functions are formulated as follows:
L(\theta_{i,j,k}^{Q}) = \mathbb{E}_{(o_k, a_k, s_{k+1}, r_k) \sim D} \left[ \left( y_{pv}^{k} - C_k^{pv} \right)^2 + \left( y_{wp}^{k} - C_k^{wp} \right)^2 + \left( y_{ess}^{k} - C_k^{ess} \right)^2 \right],
y_{i,j,k}^{t} = r_{hpp}^{t} + \gamma \left( Q^{\pi}(s_{k+1}, a_{k+1}) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_{k+1})\big) \right).
Here, L ( θ i , j , k Q ) is the loss function, C i , j , k are the parameters for the Q functions, and y i , j , k t are the target values of the HPP agents. Because the reward and inputs in critic networks differ in the HPP environment, the control policy in the learning and training process is customised for the PV, WP, and ESS agents. Therefore, we compared the proposed algorithm with the on-policy PPO algorithm and off-policy DDPG algorithm for learning the optimal policy, maximising the reward, and improving the policy. The proposed policy gradient is formulated as follows:
\nabla_{\theta_{i,j,k}^{\mu}} J \approx \mathbb{E}_{o_k \sim D} \left[ \nabla_{a_i, a_j, a_k} \left( Q^{\pi}(s_{k+1}, a_{k+1}) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_{k+1})\big) \right) \nabla_{\theta_{i,j,k}^{\mu}} \mu\big(o_{i,j,k} \mid \theta_{i,j,k}^{\mu}\big) \right].
For training stabilisation, the target values $\theta'_{i,j,k}$ and the expected gradient $\nabla_{\theta_{i,j,k}^{\mu}} J$ are related as
\theta'_{i,j,k} \leftarrow \theta_{i,j,k} + \tau \, \nabla_{\theta_{i,j,k}^{\mu}} J.
Soft updating of the target value network θ i , j , k uses expected value network θ i , j , k μ J , with τ indicating the learning rate. This gradually updates the target network to prevent instability in the learning process. The N—MADRL scheme is explained in Algorithm 1; for actors of PV, WP, ESS agents, the actions are A i k , A j k , A k k , respectively. Observations o i k , o j k , and o k k are retrieved in accordance with state information s i k , s j k , and s k k using the current policy π ( · θ i , j , k μ ) for cumulative rewards r h p p t and the next state observations o i k + 1 , o j k + 1 , and o k k + 1 stored in the replay buffer D . To accelerate the training process, samples from the PV, WP, and ESS agents are segmented in mini-batches B ( D ) in the replay buffer. The training process in the N—MADRL approach has M episodes and T steps. The training process continues until the PV, WP, and ESS agents in the HPP environment converge for optimal dispatch, reserve maximisation, and frequency control.
Algorithm 1 Proposed N—MADRL Scheme
1: Initialise environment and agent parameters.
2: for each agent $N = \{1, 2, \ldots, n\}$ do
3:     Initialise parameters for the current actor-critic and target actor-critic networks.
4:     Initialise control actions $\theta_{i,j,k}^{\mu}$ of agents $i, j, k$.
5:     Initialise control actions in the PV, WP, and ESS zones: $\theta_i^{\mu}$, $\theta_j^{\mu}$, and $\theta_k^{\mu}$.
6: end for
7: for each episode $m = 1$ to $M$ do
8:     Initialise CTDE for exploration of the HPP environment.
9:     Receive initial observations $O_i^t, O_j^t, O_k^t$ for the PV, WP, and ESS zones.
10:     Store observations in replay buffer $D$.
11:     for each time step $t = 1$ to $T$ do
12:         Select actions: $A_i^* = \mu(o_i^t \mid \theta_i^{\mu})$, $A_j^* = \mu(o_j^t \mid \theta_j^{\mu})$, $A_k^* = \mu(o_k^t \mid \theta_k^{\mu})$.
13:         Execute observations and actions $C_k^{pv}, C_k^{wp}, C_k^{ess}$ for the PV, WP, and ESS agents.
14:         for each agent $i, j, k = 1$ to $N_{PV}, N_{WP}, N_{ESS}$ do
15:             Store tuples in replay buffer $D$: $S_i^t, S_j^t, S_k^t$ (states), $A_i^t, A_j^t, A_k^t$ (actions), $r_{hpp}^t$ (reward), $S_i^{t+1}, S_j^{t+1}, S_k^{t+1}$ (new states).
16:         end for
17:         if $D$ is full then
18:             for each agent $i, j, k = 1$ to $N_{PV}, N_{WP}, N_{ESS}$ do
19:                 Sample mini-batches $B(D)$ from replay buffer $D$.
20:                 Update actor weights in the current value network: $L(\theta_{i,j,k}^{Q})$.
21:                 Update critic weights in the current value network: $L(\theta_{i,j,k}^{Q})$.
22:                 Update actor weights in the target value network: $y_{i,j,k}^{t}$.
23:                 Update critic weights in the target value network: $y_{i,j,k}^{t}$.
24:                 Update policy gradient: $\nabla_{\theta_{i,j,k}^{\mu}} J$.
25:             end for
26:             Update the target value network for the PV, WP, and ESS agents: $\theta'_{i,j,k} \leftarrow \theta_{i,j,k} + \tau \, \nabla_{\theta_{i,j,k}^{\mu}} J$.
27:         end if
28:     end for
29: end for
30: Repeat until convergence.
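The control flow of Algorithm 1 (experience collection into a shared replay buffer, mini-batch sampling once the buffer is sufficiently filled, and soft updates of the target parameters) can be sketched as follows. The environment dynamics and the per-zone "update" are deliberately toy stand-ins, so only the loop structure, buffer handling, and soft target update reflect the algorithm; none of the names come from Grid2Op or Stable-Baselines3.

```python
import random
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
buffer, BATCH, TAU = deque(maxlen=50_000), 64, 0.005
theta = {z: rng.normal(size=4) for z in ("pv", "wp", "ess")}      # actor params
theta_target = {z: v.copy() for z, v in theta.items()}            # target params

def toy_env_step(obs, actions):
    """Stand-in for the HPP environment: random next observations and a
    shared reward that penalises the total dispatch deviation."""
    next_obs = {z: rng.normal(size=4) for z in obs}
    r_hpp = -abs(sum(actions.values()))
    return next_obs, r_hpp

obs = {z: rng.normal(size=4) for z in theta}
for episode in range(20):                                         # M episodes
    for step in range(50):                                        # T steps
        actions = {z: float(theta[z] @ obs[z]) for z in theta}    # decentralised acting
        next_obs, r_hpp = toy_env_step(obs, actions)
        buffer.append((obs, actions, r_hpp, next_obs))            # store tuple
        obs = next_obs
        if len(buffer) >= BATCH:                                  # centralised training
            batch = random.sample(list(buffer), BATCH)
            for z in theta:
                # Placeholder update: nudge each zone's actor towards states
                # that produced higher shared rewards r_hpp.
                grad = np.mean([r * o[z] for o, _, r, _ in batch], axis=0)
                theta[z] = theta[z] + 1e-3 * grad
                # Soft update of the target parameters (cf. Eq. (34)).
                theta_target[z] = theta_target[z] + TAU * (theta[z] - theta_target[z])
```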
The detailed flowchart of the proposed N—MADRL framework is presented in Figure 4. The N—MADRL framework initiates the agents in the Grid2Op environment of the HPP power system. Each HPP source is modelled as an independent agent in one of three zones of the IEEE 14 bus network. Multi-agent interaction proceeds in episodes, with the agents observing the HPP environment at each time step. The optimal actions for dispatch and frequency control are executed to obtain episodic rewards and the global reward of the HPP environment. Reward maximisation is conducted over the continuous action space using the experience samples stored in the replay buffer. For optimal memory management, the buffer is segmented into mini-batches and checked for convergence of the proposed learning policies. When the replay buffer is full, the convergence of the N—MADRL policies for optimal dispatch and frequency control is evaluated; otherwise, the process is repeated and the actor-critic network of MASAC is updated until convergence occurs. Meanwhile, the updated target value network is consistently tracked against the convergence parameter. This loop continues until the experience samples collected in the replay buffer fully align the target network with the value network for learning the control policies for optimal dispatch and frequency control. This process occurs in a centralised fashion among the three zones of the HPP environment during the training process. In the execution stage, the decentralised control policies are tested and exercised in order to assess optimal dispatch, reserve management, power imbalance, and active power sharing.

4. Numerical Simulations

In this section, the proposed N—MADRL environment is explained through the Grid2Op dynamic simulation environment and its control parameters. The results of training and learning control policies are evaluated and compared with DRL algorithms. The optimal dispatch analysis is presented for dynamically changing, partially-connected, and fully-connected communication topologies.

4.1. N—MADRL Environment

Simulations of the N—MADRL model were conducted on a computer with a 64-bit Windows 10 OS, a Core i7 workstation, a 2.50 GHz processor, and 16 GB RAM. The N—MADRL model was developed in Python using a Gymnasium environment and implements several DRL algorithms, including MAAC and SAC. The simulations of the HPP power system were executed in the Grid2Op dynamic simulation environment [55]. Table 2 shows the parameter selection for the N—MADRL, PPO, and DDPG algorithms.
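As an illustration of how a Gymnasium environment is paired with a Stable-Baselines3 learner, the sketch below uses a toy single-zone stand-in for the Grid2Op HPP environment and the single-agent SAC implementation from Stable-Baselines3; the observation/action layout, reward, and hyperparameters are assumptions for demonstration, not the paper's configuration or its multi-agent MASAC scheme.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import SAC

class HPPDispatchEnv(gym.Env):
    """Toy stand-in for the HPP dispatch environment (illustrative only)."""
    def __init__(self):
        # observation: [P_HP, frequency deviation, reserve]; action: set points
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(-0.5, 0.5, size=3).astype(np.float32)
        return self.state, {}

    def step(self, action):
        # reward penalises dispatch and frequency deviation, in the spirit of Eq. (11)
        reward = -float(np.abs(self.state[:2] - action[:2]).sum())
        self.state = self.np_random.uniform(-0.5, 0.5, size=3).astype(np.float32)
        return self.state, reward, False, False, {}

model = SAC("MlpPolicy", HPPDispatchEnv(), learning_rate=3e-4, verbose=0)
model.learn(total_timesteps=1_000)
```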
The flowchart for the numerical simulations of the N—MADRL framework is illustrated in Figure 5 for training, validation, and evaluation of the HPP environment. The initialisation of the HPP model in the Grid2Op environment was started using the IEEE 14 bus system. The Grid2Op environment was associated with the N—MADRL framework by using the CTDE mechanism for agent interactions and exchanging control inputs. Decentralised coordination among the PV, WP, and ESS plants was initiated by observing the respective states and taking the appropriate actions for optimal dispatch and frequency deviation control. This process continued until the N—MADRL shared reward functions for the HPP environment were maximised. To stabilise the learning and control policies, the MASAC approach was implemented in the training process to observe the attention and entropy regularisation of the N—MADRL environment. The results were evaluated for dynamic, partially-connected, and fully-connected communication topologies to assess the training stability. In addition, random samples were introduced into the training samples to observe uncertainty control by assessing variance and standard deviations. The results of the proposed N—MADRL framework were compared with those of the PPO and DDPG DRL algorithms for optimal dispatch and frequency control. In addition, the performance metrics were evaluated for N—MADRL, PPO, and DDPG to observe the cumulative rewards, frequency deviations, and convergence rates in the HPP environment.
Because the objective and constraints are not formulated as a static optimisation case, they were integrated with the Grid2Op dynamic simulation environment using the N—MADRL control scheme. In this scenario, the multi-agent controllers interact with a dynamically changing frequency driven by imbalances among the HPP environment resources. The policy learning process of the N—MADRL framework captures the nonlinear frequency behaviour from Grid2Op to provide realistic grid conditions. The Grid2Op dynamic simulation environment supports time-domain system states, action-based multi-agent interactions, and communication contingencies. In this environment, the proposed N—MADRL approach implemented centralised training and decentralised execution using the integrated MAAC and SAC algorithms for the optimal dispatch of the multi-agent controllers. The nonlinearity of frequency in the Grid2Op environment is captured by considering the physical constraints of the HPP environment, the Grid2Op simulation network effects, and the nonlinear mapping of optimal dispatch actions.
Figure 6 depicts the IEEE 14 bus HPP system with the distribution of PV, WP, and ESS plants, 19 transmission lines, and 14 generation sources. Table 3 lists the dynamics of the $P_{PV}, P_{WP}, P_{ESS}$ HPP environment in the Grid2Op simulations. In the Grid2Op simulations, the PV, WP, and ESS plants participate in optimal dispatch, reserve management, and frequency control services. Specifically, six PV plants with capacities of 11.7 MW, 14.8 MW, 17.6 MW, 10.4 MW, 10.7 MW, and 12.2 MW are connected to buses 1, 2, 4, 11, 8, and 14, respectively. In addition, four WP plants are connected to buses 3, 6, 9, and 7, with capacities of 26.7 MW, 41.4 MW, 33.4 MW, and 22.8 MW, respectively. Similarly, the four ESS plants have capacities of 21.8 MWh, 25.75 MWh, 18.6 MWh, and 22.7 MWh and are connected to buses 5, 12, 13, and 10, respectively. The $P_{PV}, P_{WP}, P_{ESS}$ data were simulated on dynamic, partially-connected, and fully-connected communication topologies. The algorithm parameters were configured from the Gymnasium environment and the Stable-Baselines3 DRL algorithms, while the $P_{PV}, P_{WP}, P_{ESS}$ characteristics were programmed using the Grid2Op simulation platform with the reserve margins set to 20% of the maximum capacities of the PV, WP, and ESS plants.

4.2. Results Validation

4.2.1. Training Process

For effective generalisation, the data were partitioned into 80% training, 10% validation, and 10% testing datasets. The training dataset was used to optimise the policy network, the validation dataset for tuning the hyperparameters, and the testing dataset for assessing the model’s ability to generalise. In addition, the robustness of the proposed model was evaluated by implementing random five-fold cross-validation, in which four folds were used for training and one fold for validation in each iteration.
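A minimal sketch of the 80/10/10 split and the random five-fold cross-validation described above, applied to a hypothetical pool of episode indices:

```python
import numpy as np

rng = np.random.default_rng(42)
episodes = rng.permutation(1000)          # hypothetical pool of episode indices

# 80% training, 10% validation, 10% testing
n = len(episodes)
train, valid, test = np.split(episodes, [int(0.8 * n), int(0.9 * n)])

# random five-fold cross-validation over the training pool
folds = np.array_split(rng.permutation(train), 5)
for k in range(5):
    valid_fold = folds[k]
    train_folds = np.concatenate([folds[i] for i in range(5) if i != k])
    # ...train the policy on train_folds and validate on valid_fold...
```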
The training process of the N—MADRL scheme was benchmarked against continuous action-space DRL algorithms, namely DDPG and PPO. DDPG offers an off-policy actor-critic architecture with deterministic policy gradients for handling continuous action-space data in the HPP environment, which helps to optimise active power dispatch for frequency control and complex decision-making. In contrast, PPO is an on-policy policy gradient algorithm whose clipped objective balances exploration and exploitation, making for a stable training process. In addition, PPO helps to minimise frequency deviations in a complex HPP environment. Moreover, both schemes use actor-critic frameworks, cooperative control, and uncertainty management to address such complex environments.
The performance of the DDPG and PPO algorithms compared to N—MADRL was evaluated using localised training and execution. In contrast, N—MADRL achieved optimal performance using centralised training and decentralised execution (CTDE) for optimal dispatch and frequency stability. Figure 7 shows the cumulative rewards and frequency deviations in the training processes of the N—MADRL, DDPG, and PPO algorithms. In Figure 7, the solid lines show the average values of random experiments and the error margins are represented with shaded lines.
Initially, the HPP agents did not stabilise the frequencies among HPP assets due to the unknown environment and limited agent interactions. The cumulative rewards were maximised and the frequency deviations reduced as the training progressed. The N—MADRL approach presented higher performance than DDPG and PPO, dispatching an average power of 0.22 MW with an average frequency deviation of 0.013 Hz. This superior performance is due to its partial observability, attention mechanism, and nonstationary behaviour. In contrast, PPO provided an average power dispatch of 0.38 MW with a frequency deviation of 0.041 Hz, while DDPG offered an average power dispatch of 0.57 MW with an average frequency deviation of 0.053 Hz. The variance in PPO’s convergence is due to its stochastic behaviour and limited exploration within the highly dynamic HPP environment. PPO offers clipping behaviour to avoid instabilities, but is not able to control large variances. The instabilities in the performance of PPO are due to Gaussian distribution samples in the highly dynamic POMG HPP environment. In addition, PPO utilises only new samples and does not use samples from experience, ultimately resulting in policy shifts. On the other hand, N—MADRL uses centralised critic and entropy regularisation, which enhances its performance, helping it to achieve 42.10% and 61.40% more efficiency in optimal dispatch than PPO and DDPG, respectively. In terms of frequency stability, N—MADRL is more stable than PPO and DDPG, achieving improvements of 68.30% and 74.48%, respectively.

4.2.2. Comparison of DRL Algorithms

To validate the performance of the proposed N—MADRL scheme in comparison with other DRL schemes, a peak generation scenario was considered in which optimal dispatch and frequency deviation pose a challenging issue. Table 4 provides a computational cost comparison of the N—MADRL, PPO, and DDPG algorithms in terms of training, validation, and testing times. It shows that the computational cost of N—MADRL for training is lower than that of PPO and DDPG despite the complex architecture incorporating MAAC and SAC. In addition, the validation and testing results of N—MADRL are higher than those of PPO and DDPG, proving its robust convergence and computational efficiency. These results provide validation for and practical deployment of real-world HPP environments. The DDPG algorithm uses a replay buffer in centralised training in the actor–critic network to store o i k , o j k , o k k for states s i k , s j k , s k k , using the current policy π ( · θ i , j , k μ ) for rewards r h p p t and the next state s i k + 1 , s j k + 1 , s k k + 1 without using additional centralised training of the HPP environment. In contrast, N—MADRL performs execution solely on local states and operates independently for PV, WP, ESS assets based on P PV , P WP , P ESS for P HP and f HP , as shown in Figure 8a.
For the PPO scheme, the centralised training of agents is developed by using the agents' collective data to update the proximal policy. DDPG and PPO do not use the additional centralised data of the HPP environment to optimise control performance. In addition, PPO executes a stochastic policy $\pi(\cdot \mid \theta_{i,j,k}^{\mu})$ using observations $o_i^k, o_j^k, o_k^k$ and states $s_i^k, s_j^k, s_k^k$ for the HPP agents to estimate the cumulative reward $r_{hpp}^t$, as shown in Figure 8b.
In contrast, N—MADRL uses an MAAC scheme as an attention mechanism for the heterogeneous agents P PV , P WP , P ESS to estimate global states P HP and f HP . The N—MADRL scheme uses integrated SAC C i , j , k parameters for Q-value and target functions y i , j , k t to minimise the loss function L ( θ i , j , k Q ) and maximise accumulative rewards. As the episodic reward and input of the critic network are different at the initial stage, the training process is not mature at the beginning. Exploration and entropy maximisation H ( π ( · s k ) ) using experience samples from the replay buffer helps to maximise the reward compared to PPO and DDPG algorithms, as shown in Figure 8c.
The frequency deviations are assessed with existing and proposed optimisation schemes under the dynamic simulation environment of Grid2Op, stochastic policies of the N—MADRL learning process, and control performance of the HPP environment. It is worth noting that small frequency deviations in PV, WP, and ESS regions are due to the absence of explicit modelling parameters, the dynamic simulation environment of Grid2Op, and the N—MADRL learning process. Moreover, the control performance indicates that our proposed framework provides superior stability compared to PPO and DDPG. Table 5 provides the performance comparison of the N—MADRL, PPO, and DDPG algorithms.

4.3. Optimal Dispatch Analysis

Referring to Equations (22)–(23), the optimal dispatch of the PV, WP, and ESS plants is evaluated considering dynamic, partially-connected, and fully-connected topologies to assess the resiliency of the proposed N—MADRL scheme. This analysis uses the dynamics of the HPP environment from Table 3 and the simulation environment in Figure 6. Figure 9 illustrates the optimal dispatch of HPP assets considering dynamic topologies, as mentioned in Equations (28)–(30).
Referring to Equations (24)–(26), partially-connected and fully-connected communication topologies were considered for optimal dispatch across PV, WP, and ESS zones to observe the influence of communication. Due to the external control inputs of U ij pv , U ij wp , and U ij ess for policy update π ( · θ i , j , k μ ) , the PV, WP, and ESS agents converge rapidly for optimal dispatch after only a few iterations k, as shown in Figure 10 and Figure 11.

5. Discussion and Results Analysis

In this section, the results of the proposed N—MADRL framework are discussed and analysed. We evaluate the robustness of N—MADRL with uncertainty control and communication resiliency to assess the frequency deviation and optimal dispatch objectives. In addition, the results of the performance analysis of the HPP environment and PV, WP, and ESS zones are evaluated and assessed in detail.

5.1. Robustness Analysis

The robustness of the proposed N—MADRL framework was analysed and evaluated with uncertainty control to assess the training resiliency with random samples. In addition, the communication resiliency was evaluated for partially-connected and fully-connected communication topologies in order to assess the stability of training and learning control policies.

5.1.1. Uncertainty Control

The uncertainty of the HPP environment was considered for the PV and WP plants due to their intermittent nature, and was captured through different scenarios. Standard deviations $\{\sigma_{PV}, \sigma_{WP}, \sigma_{ESS}\}$ were introduced into the training of the PV, WP, and ESS plants. As the PV, WP, and ESS assets contribute uniformly to optimal dispatch, we introduced random variations $\{\mathrm{rand}(\sigma_{PV}), \mathrm{rand}(\sigma_{WP}), \mathrm{rand}(\sigma_{ESS})\}$ to build the uncertainties. In this way, the power generation $P_{PV}, P_{WP}, P_{ESS}$ and the optimal dispatch for $P_{HP}$ and $f_{HP}$ become $\{P_{PV} + \mathrm{rand}(\sigma_{PV})\}$, $\{P_{WP} + \mathrm{rand}(\sigma_{WP})\}$, and $\{P_{ESS} + \mathrm{rand}(\sigma_{ESS})\}$.
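The perturbation described above can be sketched as follows; the standard deviations and plant outputs are illustrative numbers only.

```python
import numpy as np

rng = np.random.default_rng(7)

def perturb(p, sigma):
    """Add zero-mean random variation with standard deviation sigma,
    as in P_PV + rand(sigma_PV)."""
    return np.asarray(p, dtype=float) + rng.normal(0.0, sigma, size=np.shape(p))

p_pv  = perturb([11.7, 14.8, 17.6, 10.4, 10.7, 12.2], sigma=0.8)   # PV plants (MW)
p_wp  = perturb([26.7, 41.4, 33.4, 22.8],             sigma=1.5)   # WP plants (MW)
p_ess = perturb([21.8, 25.75, 18.6, 22.7],            sigma=0.4)   # ESS plants (MW)
```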
Randomly generated training samples were used with the proposed N—MADRL model to validate this uncertainty construction. Figure 12 shows the variations in inertial response, cumulative reward, and frequency deviation under uncertainty. The frequency deviation increases as power generation peaks in the middle of the day. The variations in the optimal dispatch of P_HP and the frequency f_HP stabilise for N—MADRL due to execution of the optimal policy π(·|θ_{i,j,k}^μ) and maximisation of the entropy regularisation H(π(·|s_k)). In this experiment, the performance of the proposed N—MADRL model is again higher than that of PPO and DDPG, while the frequency deviations under uncertainty remain within the limits of the HPP environment. In uncertain situations, the reward coefficients are balanced to control power imbalances, reserve management, active power sharing, and frequency deviations. Explicitly tuning the frequency deviation factor β_2 could sharpen frequency stability, but would over-optimise the proposed model and consequently reduce the generalisability of the HPP environment. Instead, the proposed N—MADRL model uses weighted reward factors β_1 to β_4 that jointly account for active power balancing, reserve management, and the reduction of frequency deviations.
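For illustration, the weighted reward discussed above can be composed as a single penalty function; the β values and the individual penalty terms below are assumptions, not the paper's exact reward formulation.

```python
# Illustrative composition of the weighted reward discussed above; the beta
# weights and the individual penalty terms are assumptions, not the exact
# reward used in the paper.
def hpp_reward(power_imbalance_mw, freq_deviation_hz, reserve_shortfall_mw,
               sharing_error_mw, beta=(1.0, 10.0, 0.5, 0.5)):
    """Return a negative weighted penalty: larger deviations give a lower reward."""
    b1, b2, b3, b4 = beta
    return -(b1 * abs(power_imbalance_mw)
             + b2 * abs(freq_deviation_hz)
             + b3 * abs(reserve_shortfall_mw)
             + b4 * abs(sharing_error_mw))

print(hpp_reward(0.3, 0.02, 0.1, 0.05))  # -0.575 for this illustrative operating point
```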

5.1.2. Communication Resiliency

Communication resiliency is considered in the HPP environment by changing the communication topologies among the P_PV, P_WP, and P_ESS agents. Changing the communication topology ultimately changes r_hpp^t through the updated policy π(·|θ_{i,j,k}^μ). The following scenarios with changing communication topologies were considered:
  • Scenario One (S1): every agent P_PV, P_WP, P_ESS is fully connected within its zone during the training process of the DDPG, PPO, and N—MADRL algorithms.
  • Scenario Two (S2): every agent P_PV, P_WP, P_ESS is partially connected within its zone during the training process of the DDPG, PPO, and N—MADRL algorithms.
The communication topologies were selected by dynamically assigning the convergence rates α_pv, α_wp, α_ess. For the fully-connected topologies, the rate varied between 0.60 and 0.70, with every agent connected to its neighbouring agents in a mesh topology. For the partially-connected topologies, the convergence rate of α_pv, α_wp, α_ess was set between 0.35 and 0.45, with every agent connected to at least one neighbour to exchange the control inputs U_ij^pv, U_ij^wp, U_ij^ess. For the randomly-connected topologies, the convergence rate was set between 0.35 and 0.70, with the number of connections changed after every iteration of learning the control policies. Equivalently, the fully-connected topologies make 60% to 70% of the possible connections available for the exchange of the control inputs U_ij^pv, U_ij^wp, U_ij^ess, the partially-connected topologies make 35% to 45% available, and the dynamically-connected topologies use 35% to 70% of the connections to share the control inputs within the HPP environment.
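A hypothetical helper for sampling such topologies is sketched below; the connection fractions mirror the 60% to 70%, 35% to 45%, and 35% to 70% ranges stated above, while the number of agents and the rule guaranteeing at least one neighbour per agent are illustrative assumptions.

```python
import numpy as np

# Hypothetical helper that samples an undirected communication topology with a
# target connection fraction, mirroring the fully, partially, and dynamically
# connected scenarios described above. Each agent keeps at least one neighbour,
# as required in the partially connected case.
rng = np.random.default_rng(seed=1)

def sample_topology(n_agents, connection_fraction):
    adjacency = (rng.random((n_agents, n_agents)) < connection_fraction).astype(float)
    adjacency = np.triu(adjacency, k=1)
    adjacency = adjacency + adjacency.T            # undirected links, no self-loops
    for i in range(n_agents):
        if adjacency[i].sum() == 0:                # guarantee at least one neighbour
            j = (i + 1) % n_agents
            adjacency[i, j] = adjacency[j, i] = 1.0
    return adjacency

fully_connected = sample_topology(6, rng.uniform(0.60, 0.70))
partially_connected = sample_topology(6, rng.uniform(0.35, 0.45))
dynamically_connected = sample_topology(6, rng.uniform(0.35, 0.70))  # resampled each iteration
```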
Figure 13 considers fully-connected topologies in each zone to compare the frequency stability of the DDPG, PPO, and N—MADRL schemes. Disturbances are randomly initiated in the PV, WP, and ESS zones to observe the frequency deviations. The largest frequency deviations are observed for DDPG, which uses a single-agent replay buffer and only local state information in the absence of the external control inputs U_ij^pv, U_ij^wp, U_ij^ess. Similarly, the PPO algorithm interacts with the environment for the policy update π(·|θ_{i,j,k}^μ) but is executed locally without considering the dynamics of the wider HPP environment. Our proposed N—MADRL provides robust performance due to its full coordination and control among the HPP agents.
Referring to Figure 14, a partially-connected communication topology is considered, and disturbances are initiated at t = 22 s, t = 16 s, and t = 12 s to observe the frequency deviations in the PV, WP, and ESS zones. Initially, the DDPG scheme shows higher fluctuations and damped oscillations due to Q-value overestimation and insufficient coordination among the HPP agents. The PPO algorithm performs better than DDPG owing to its stochastic policy and more stable training. The proposed N—MADRL model achieves the highest performance due to its independent training, shared reward maximisation, and ability to handle the nonstationary behaviour of the HPP agents. Based on these results, N—MADRL provides optimal performance compared to DDPG and PPO in both scenarios.
Both partially- and fully-connected communication topologies are considered in order to approximate realistic, delay-dependent communication; explicit delay attributes and latency models are not investigated for the N—MADRL approach. The proposed N—MADRL approach maintains optimal performance in both communication scenarios without degrading the learned policies or robustness. Table 6 summarises the control performance of the proposed algorithm in terms of optimal dispatch, minimised frequency deviation, and communication resiliency.

5.2. Performance Analysis

The N—MADRL approach was evaluated in the Grid2Op power simulation environment on the IEEE 14 bus network with the participation of PV, WP, and ESS plants. A random communication scenario was considered, referring to Equations (24)–(30), to check the overall influence of communication on performance. In the random topology, the control inputs U_ij^pv, U_ij^wp, and U_ij^ess of the PV, WP, and ESS agents are coordinated with neighbouring agents for estimation of the global states P_PV(k+1), P_WP(k+1), and P_ESS(k+1), respectively.
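A minimal interaction sketch with a Grid2Op IEEE 14-bus environment is shown below. The environment name, the redispatch action dictionary, and the four-tuple returned by env.step follow standard Grid2Op usage [55], but version-specific differences are possible, and the constant redispatch policy is only a placeholder for the trained N—MADRL agents.

```python
import grid2op

# Minimal Grid2Op interaction sketch on an IEEE 14-bus environment. This is not
# the paper's N-MADRL training loop: the constant redispatch on generator 0
# (assumed dispatchable) is only a placeholder policy.
env = grid2op.make("l2rpn_case14_sandbox")
obs = env.reset()

for step in range(10):
    action = env.action_space({"redispatch": [(0, 0.5)]})  # +0.5 MW on generator 0
    obs, reward, done, info = env.step(action)
    if done:                      # episode ended (e.g., grid collapse); start a new one
        obs = env.reset()
```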
The factors α_pv, α_wp, α_ess determine the speed of convergence and coordination for frequency control, as shown in Figure 9. Referring to Equation (27), the optimal range of α_pv, α_wp, α_ess is between 0.550 and 0.710 for connected topologies. For the dynamic topology, the control policy π(·|θ_{i,j,k}^μ) changes for the P_PV(k), P_WP(k), and P_ESS(k) plants. Due to the adaptive nature and communication resiliency of the proposed N—MADRL approach, its cumulative rewards and frequency deviation control outperform those of the DDPG and PPO algorithms, as shown in Figure 15.
Referring to Table 3, the performance of the PV, WP, and ESS zones is analysed along with its impact on the HPP environment for optimal dispatch and the control of frequency deviations. Table 7 presents the control performance of the PV plant, which is examined under the dynamic communication topology. Due to the optimal reserve margin R_PV and convergence factor α_pv, the frequency deviations for N—MADRL exhibit the lowest susceptibility compared to PPO and DDPG, as shown in Figure 16a.
Similarly, the control performance of the WP plant was observed under time-varying communication topologies, as shown in Table 8. Due to the optimal control inputs U_ij^wp and converging factor α_wp, the WP zone provides more stable control for the N—MADRL approach than for PPO and DDPG, as shown in Figure 16b.
In addition, the control of frequency deviations in the ESS plant exhibits similar results under the random communication topologies, as shown in Figure 16c. Table 9 reports the control inputs U_ij^ess, α_ess, and R_ESS, with the N—MADRL approach in the ESS zone demonstrating lower susceptibility to frequency deviations than the PPO and DDPG schemes.
Due to the heterogeneity and optimal coordination of the N—MADRL model, Figure 16 demonstrates an adaptive and flexible approach for the HPP environment, providing fast-changing control inputs U_ij,t^pv, U_ij,t^wp, U_ij,t^ess, converging factors α_pv, α_wp, α_ess, and reserve margins R_PV, R_WP, R_ESS for frequency stability and optimal dispatch. Moreover, Table 7, Table 8 and Table 9 provide further insights into the different HPP zone scenarios for frequency stability and for the performance of the HPP environment under dynamic communication topologies. The individual PV, WP, and ESS assets are considered for frequency control using their physical constraints, and communication resiliency is assessed under numerous topologies and control inputs.

6. Conclusions

This paper has presented a novel N—MADRL actor-critic architecture that develops a model-free environment to coordinate PV, WP, and ESS plants within an HPP. The proposed N—MADRL approach maintains frequency stability through optimal dispatch of active power, thereby mitigating frequency deviations, optimising reserve management, and improving power sharing capabilities. The results indicate that: (1) compared to the PPO and DDPG schemes, the proposed N—MADRL approach achieves 42.10% and 61.40% higher efficiency for optimal dispatch and 68.30% and 74.48% improvement in the mitigation of frequency deviations, respectively; (2) despite the intermittent nature of the PV and WP plants, the control scenarios examined show that the optimal dispatch and reserve management of the proposed approach effectively minimise frequency deviations; (3) under random, partially-connected, and fully-connected communication topologies for the PV, WP, and ESS plants assessed at different timescales, the proposed scheme exhibits robust performance and higher communication resiliency than the PPO and DDPG schemes, demonstrating robustness to communication vulnerabilities in the HPP environment; and (4) in the evaluation of the dynamic performance and scalability of the HPP environment for optimal power dispatch and energy reserve management, N—MADRL demonstrates the lowest susceptibility to frequency deviations compared to the PPO and DDPG approaches.

The proposed framework addresses the optimal dispatch problem using a decentralised coordination mechanism to enhance the operational stability and frequency security of the HPP environment. Physical constraints such as active power sharing, energy reserve margins, and frequency deviations are considered when updating the policy network, and the N—MADRL agents rely solely on local control parameters within each HPP zone for frequency control and stability. A dynamic Grid2Op simulation environment with the IEEE 14 bus network was used to validate the effectiveness of the N—MADRL approach in handling uncertainty, system intermittency, and communication resiliency. The proposed approach is ready to be deployed by transmission system operators for optimal dispatch and frequency control, and can provide grid support services in steady-state and emergency conditions by responding to power imbalances and system disturbances. In future work, this research could be extended to the provision of fast frequency control ancillary services by considering the influence of inverter dynamics, ESS cost-oriented reward formulation, the control parameters, and the optimisation of the frequency response. The electric power industry must enable fast frequency support ancillary services from IBR-enabled HPPs to meet inertial requirements and rapid frequency regulation in modern power grids.

Author Contributions

Conceptualization, M.I., D.H., and A.A.; methodology, M.I. and A.A.; software, M.I.; validation, M.I. and A.A.; formal analysis, M.I.; investigation, M.I. and A.A.; resources, M.I.; data curation, M.I.; writing—original draft preparation, M.I.; writing—review and editing, A.A. and D.H.; visualization, M.I. and A.A.; supervision, A.A.; project administration, A.A. and D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available on request.

Acknowledgments

The first author would like to thank Edith Cowan University for the awarded (ECU-HDR) higher degree research scholarship to support this project.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

IBR: Inverter-based resources
HPP: Hybrid power plant
PV: Photovoltaic
WP: Wind plant
ESS: Energy storage system
EMS: Energy management system
FFR: Fast frequency response
BS: Black start
DR: Demand response
EV: Electric vehicle
VPP: Virtual power plant
DER: Distributed energy resources
MPC: Model predictive control
DNN: Deep neural network
RL: Reinforcement learning
MARL: Multi-agent reinforcement learning
LFC: Load frequency control
GNN: Graph neural network
MG: Markov game
POMG: Partially-observable Markov game
DRL: Deep reinforcement learning
DDPG: Deep deterministic policy gradient
MADDPG: Multi-agent deep deterministic policy gradient
PPO: Proximal policy optimisation
MDP: Markov decision process
DQN: Deep Q-network
CTDE: Centralised training and decentralised execution
MADRL: Multi-agent deep reinforcement learning
SAC: Soft actor–critic
MAAC: Multi-agent actor–critic
MASAC: Multi-agent and soft actor–critic
N—MADRL: Networked multi-agent deep reinforcement learning
Variables
P_HP, f_HP: Power and frequency of the hybrid power plant
P_PV, P_WP, P_ESS: Power of the PV, WP, and ESS plants
f_PV, f_WP, f_ESS: Frequency of the PV, WP, and ESS plants
i, j, k: Indices of the PV, WP, and ESS agents
N_PV, N_WP, N_ESS: Number of PV, WP, and ESS agents
V_PV, V_WP, V_ESS: PV, WP, and ESS plant agents
o_i^k, o_j^k, o_k^k: Observations of the PV, WP, and ESS agents
s_i^k, s_j^k, s_k^k: States of the PV, WP, and ESS agents
A_i^t, A_j^t, A_k^t: Actions of the PV, WP, and ESS agents
r_i^k, r_j^k, r_k^k: Rewards of the PV, WP, and ESS agents
π_i^*, π_j^*, π_k^*: Optimal policies of the PV, WP, and ESS agents
U_ij^pv, U_ij^wp, U_ij^ess: Control inputs of the PV, WP, and ESS agents
α_pv, α_wp, α_ess: Convergence coefficients of the PV, WP, and ESS agents
C_k^pv, C_k^wp, C_k^ess: Collective actions and observations of the agents
V^π(s), Q^π(s,a): Value functions of the HPP agents
θ_i^Q, θ_j^Q, θ_k^Q: Parameters of the actor networks for agents i, j, k
y_{i,j,k}^t: Proposed target value of agents i, j, k
L(θ_{i,j,k}^Q): Proposed loss function of agents i, j, k
α H(π(·|s_{k+1})): Entropy-regularised exploration of the agents
∇_{θ_{i,j,k}^μ} μ(o_{i,j,k} | θ_{i,j,k}^μ): Gradient of the policy for agents i, j, k

References

  1. Čović, N.; Pavić, I.; Pandžić, H. Multi-energy balancing services provision from a hybrid power plant: PV, battery, and hydrogen technologies. Appl. Energy 2024, 374, 123966. [Google Scholar] [CrossRef]
  2. Klyve, Ø.S.; Grab, R.; Olkkonen, V.; Marstein, E.S. Influence of high-resolution data on accurate curtailment loss estimation and optimal design of hybrid PV–wind power plants. Appl. Energy 2024, 372, 123784. [Google Scholar] [CrossRef]
  3. Kim, Y.S.; Park, G.H.; Kim, S.W.; Kim, D. Incentive design for hybrid energy storage system investment to PV owners considering value of grid services. Appl. Energy 2024, 373, 123772. [Google Scholar] [CrossRef]
  4. Zhang, T.; Xin, L.; Wang, S.; Guo, R.; Wang, W.; Cui, J.; Wang, P. A novel approach of energy and reserve scheduling for hybrid power systems: Frequency security constraints. Appl. Energy 2024, 361, 122926. [Google Scholar] [CrossRef]
  5. Askarov, A.; Rudnik, V.; Ruban, N.; Radko, P.; Ilyushin, P.; Suvorov, A. Enhanced Virtual Synchronous Generator with Angular Frequency Deviation Feedforward and Energy Recovery Control for Energy Storage System. Mathematics 2024, 12, 2691. [Google Scholar] [CrossRef]
  6. Leng, D.; Polmai, S. Virtual Synchronous Generator Based on Hybrid Energy Storage System for PV Power Fluctuation Mitigation. Appl. Sci. 2019, 9, 5099. [Google Scholar] [CrossRef]
  7. Pourbeik, P.; Sanchez-Gasca, J.J.; Senthil, J.; Weber, J.; Zadkhast, P.; Ramasubramanian, D.; Rao, S.D.; Bloemink, J.; Majumder, R.; Zhu, S.; et al. A Generic Model for Inertia-Based Fast Frequency Response of Wind Turbines and Other Positive-Sequence Dynamic Models for Renewable Energy Systems. IEEE Trans. Energy Convers. 2024, 39, 425–434. [Google Scholar] [CrossRef]
  8. Long, Q.; Das, K.; Pombo, D.V.; Sørensen, P.E. Hierarchical control architecture of co-located hybrid power plants. Int. J. Electr. Power Energy Syst. 2022, 143, 108407. [Google Scholar] [CrossRef]
  9. Ekomwenrenren, E.; Simpson-Porco, J.W.; Farantatos, E.; Patel, M.; Haddadi, A.; Zhu, L. Data-Driven Fast Frequency Control Using Inverter-Based Resources. IEEE Trans. Power Syst. 2024, 39, 5755–5768. [Google Scholar] [CrossRef]
  10. Jendoubi, I.; Bouffard, F. Multi-agent hierarchical reinforcement learning for energy management. Appl. Energy 2023, 332, 120500. [Google Scholar] [CrossRef]
  11. Guruwacharya, N.; Chakraborty, S.; Saraswat, G.; Bryce, R.; Hansen, T.M.; Tonkoski, R. Data-Driven Modeling of Grid-Forming Inverter Dynamics Using Power Hardware-in-the-Loop Experimentation. IEEE Access 2024, 12, 52267–52281. [Google Scholar] [CrossRef]
  12. Zhu, D.; Yang, B.; Liu, Y.; Wang, Z.; Ma, K.; Guan, X. Energy management based on multi-agent deep reinforcement learning for a multi-energy industrial park. Appl. Energy 2022, 311, 118636. [Google Scholar] [CrossRef]
  13. Li, Y.; Hou, J.; Yan, G. Exploration-enhanced multi-agent reinforcement learning for distributed PV-ESS scheduling with incomplete data. Appl. Energy 2024, 359, 122744. [Google Scholar] [CrossRef]
  14. Ochoa, T.; Gil, E.; Angulo, A.; Valle, C. Multi-agent deep reinforcement learning for efficient multi-timescale bidding of a hybrid power plant in day-ahead and real-time markets. Appl. Energy 2022, 317, 119067. [Google Scholar] [CrossRef]
  15. Jin, R.; Zhou, Y.; Lu, C.; Song, J. Deep reinforcement learning-based strategy for charging station participating in demand response. Appl. Energy 2022, 328, 120140. [Google Scholar] [CrossRef]
  16. Vázquez-Canteli, J.R.; Nagy, Z. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl. Energy 2019, 235, 1072–1089. [Google Scholar] [CrossRef]
  17. Xia, Q.; Wang, Y.; Zou, Y.; Yan, Z.; Zhou, N.; Chi, Y.; Wang, Q. Regional-privacy-preserving operation of networked microgrids: Edge-cloud cooperative learning with differentiated policies. Appl. Energy 2024, 370, 123611. [Google Scholar] [CrossRef]
  18. Sun, X.; Xie, H.; Qiu, D.; Xiao, Y.; Bie, Z.; Strbac, G. Decentralized frequency regulation service provision for virtual power plants: A best response potential game approach. Appl. Energy 2023, 352, 121987. [Google Scholar] [CrossRef]
  19. Kofinas, P.; Dounis, A.I.; Vouros, G.A. Fuzzy Q-Learning for multi-agent decentralized energy management in microgrids. Appl. Energy 2018, 219, 53–67. [Google Scholar] [CrossRef]
  20. May, R.; Huang, P. A multi-agent reinforcement learning approach for investigating and optimising peer-to-peer prosumer energy markets. Appl. Energy 2023, 334, 120705. [Google Scholar] [CrossRef]
  21. Zhang, B.; Hu, W.; Ghias, A.M.; Xu, X.; Chen, Z. Two-timescale autonomous energy management strategy based on multi-agent deep reinforcement learning approach for residential multicarrier energy system. Appl. Energy 2023, 351, 121777. [Google Scholar] [CrossRef]
  22. Xu, X.; Xu, K.; Zeng, Z.; Tang, J.; He, Y.; Shi, G.; Zhang, T. Collaborative optimization of multi-energy multi-microgrid system: A hierarchical trust-region multi-agent reinforcement learning approach. Appl. Energy 2024, 375, 123923. [Google Scholar] [CrossRef]
  23. Lv, C.; Liang, R.; Zhang, G.; Zhang, X.; Jin, W. Energy accommodation-oriented interaction of active distribution network and central energy station considering soft open points. Energy 2023, 268, 126574. [Google Scholar] [CrossRef]
  24. Wang, Y.; Cui, Y.; Li, Y.; Xu, Y. Collaborative optimization of multi-microgrids system with shared energy storage based on multi-agent stochastic game and reinforcement learning. Energy 2023, 280, 128182. [Google Scholar] [CrossRef]
  25. Bui, V.H.; Su, W. Real-time operation of distribution network: A deep reinforcement learning-based reconfiguration approach. Sustain. Energy Technol. Assess. 2022, 50, 101841. [Google Scholar] [CrossRef]
  26. Li, H.; He, H. Optimal Operation of Networked Microgrids With Distributed Multi-Agent Reinforcement Learning. In Proceedings of the 2024 IEEE Power & Energy Society General Meeting (PESGM), Seattle, WA, USA, 21–25 July 2024; pp. 1–5. [Google Scholar]
  27. Fang, X.; Wang, J.; Song, G.; Han, Y.; Zhao, Q.; Cao, Z. Multi-agent reinforcement learning approach for residential microgrid energy scheduling. Energies 2019, 13, 123. [Google Scholar] [CrossRef]
  28. Anjaiah, K.; Dash, P.; Bisoi, R.; Dhar, S.; Mishra, S. A new approach for active and reactive power management in renewable based hybrid microgrid considering storage devices. Appl. Energy 2024, 367, 123429. [Google Scholar] [CrossRef]
  29. Wu, H.; Qiu, D.; Zhang, L.; Sun, M. Adaptive multi-agent reinforcement learning for flexible resource management in a virtual power plant with dynamic participating multi-energy buildings. Appl. Energy 2024, 374, 123998. [Google Scholar] [CrossRef]
  30. Zhao, H.; Wang, B.; Liu, H.; Sun, H.; Pan, Z.; Guo, Q. Exploiting the flexibility inside park-level commercial buildings considering heat transfer time delay: A memory-augmented deep reinforcement learning approach. IEEE Trans. Sustain. Energy 2021, 13, 207–219. [Google Scholar] [CrossRef]
  31. Ebrie, A.S.; Kim, Y.J. Reinforcement Learning-Based Multi-Objective Optimization for Generation Scheduling in Power Systems. Systems 2024, 12, 106. [Google Scholar] [CrossRef]
  32. Qiu, D.; Wang, Y.; Zhang, T.; Sun, M.; Strbac, G. Hybrid Multiagent Reinforcement Learning for Electric Vehicle Resilience Control Towards a Low-Carbon Transition. IEEE Trans. Ind. Inform. 2022, 18, 8258–8269. [Google Scholar] [CrossRef]
  33. Wang, Y.; Qiu, D.; Strbac, G.; Gao, Z. Coordinated Electric Vehicle Active and Reactive Power Control for Active Distribution Networks. IEEE Trans. Ind. Inform. 2023, 19, 1611–1622. [Google Scholar] [CrossRef]
  34. Singh, V.P.; Kishor, N.; Samuel, P. Distributed Multi-Agent System-Based Load Frequency Control for Multi-Area Power System in Smart Grid. IEEE Trans. Ind. Electron. 2017, 64, 5151–5160. [Google Scholar] [CrossRef]
  35. Yu, T.; Wang, H.Z.; Zhou, B.; Chan, K.W.; Tang, J. Multi-Agent Correlated Equilibrium Q(λ) Learning for Coordinated Smart Generation Control of Interconnected Power Grids. IEEE Trans. Power Syst. 2015, 30, 1669–1679. [Google Scholar] [CrossRef]
  36. Shen, R.; Zhong, S.; Wen, X.; An, Q.; Zheng, R.; Li, Y.; Zhao, J. Multi-agent deep reinforcement learning optimization framework for building energy system with renewable energy. Appl. Energy 2022, 312, 118724. [Google Scholar] [CrossRef]
  37. Lu, R.; Li, Y.C.; Li, Y.; Jiang, J.; Ding, Y. Multi-agent deep reinforcement learning based demand response for discrete manufacturing systems energy management. Appl. Energy 2020, 276, 115473. [Google Scholar] [CrossRef]
  38. Wu, J.; He, H.; Peng, J.; Li, Y.; Li, Z. Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus. Appl. Energy 2018, 222, 799–811. [Google Scholar] [CrossRef]
  39. Ajagekar, A.; Decardi-Nelson, B.; You, F. Energy management for demand response in networked greenhouses with multi-agent deep reinforcement learning. Appl. Energy 2024, 355, 122349. [Google Scholar] [CrossRef]
  40. Wang, Y.; Qiu, D.; Strbac, G. Multi-agent deep reinforcement learning for resilience-driven routing and scheduling of mobile energy storage systems. Appl. Energy 2022, 310, 118575. [Google Scholar] [CrossRef]
  41. Zhang, J.; Sang, L.; Xu, Y.; Sun, H. Networked Multiagent-Based Safe Reinforcement Learning for Low-Carbon Demand Management in Distribution Networks. IEEE Trans. Sustain. Energy 2024, 15, 1528–1545. [Google Scholar] [CrossRef]
  42. Tavakol Aghaei, V.; Ağababaoğlu, A.; Bawo, B.; Naseradinmousavi, P.; Yıldırım, S.; Yeşilyurt, S.; Onat, A. Energy optimization of wind turbines via a neural control policy based on reinforcement learning Markov chain Monte Carlo algorithm. Appl. Energy 2023, 341, 121108. [Google Scholar] [CrossRef]
  43. Li, J.; Yu, T.; Zhang, X. Coordinated load frequency control of multi-area integrated energy system using multi-agent deep reinforcement learning. Appl. Energy 2022, 306, 117900. [Google Scholar] [CrossRef]
  44. Dong, L.; Lin, H.; Qiao, J.; Zhang, T.; Zhang, S.; Pu, T. A coordinated active and reactive power optimization approach for multi-microgrids connected to distribution networks with multi-actor-attention-critic deep reinforcement learning. Appl. Energy 2024, 373, 123870. [Google Scholar] [CrossRef]
  45. Deptula, P.; Bell, Z.I.; Doucette, E.A.; Curtis, J.W.; Dixon, W.E. Data-based reinforcement learning approximate optimal control for an uncertain nonlinear system with control effectiveness faults. Automatica 2020, 116, 108922. [Google Scholar] [CrossRef]
  46. Zhang, B.; Cao, D.; Hu, W.; Ghias, A.M.; Chen, Z. Physics-Informed Multi-Agent deep reinforcement learning enabled distributed voltage control for active distribution network using PV inverters. Int. J. Electr. Power Energy Syst. 2024, 155, 109641. [Google Scholar] [CrossRef]
  47. Li, S.; Hu, W.; Cao, D.; Chen, Z.; Huang, Q.; Blaabjerg, F.; Liao, K. Physics-model-free heat-electricity energy management of multiple microgrids based on surrogate model-enabled multi-agent deep reinforcement learning. Appl. Energy 2023, 346, 121359. [Google Scholar] [CrossRef]
  48. Abid, M.S.; Apon, H.J.; Hossain, S.; Ahmed, A.; Ahshan, R.; Lipu, M.H. A novel multi-objective optimization based multi-agent deep reinforcement learning approach for microgrid resources planning. Appl. Energy 2024, 353, 122029. [Google Scholar] [CrossRef]
  49. Hua, M.; Zhang, C.; Zhang, F.; Li, Z.; Yu, X.; Xu, H.; Zhou, Q. Energy management of multi-mode plug-in hybrid electric vehicle using multi-agent deep reinforcement learning. Appl. Energy 2023, 348, 121526. [Google Scholar] [CrossRef]
  50. Xie, J.; Ajagekar, A.; You, F. Multi-Agent attention-based deep reinforcement learning for demand response in grid-responsive buildings. Appl. Energy 2023, 342, 121162. [Google Scholar] [CrossRef]
  51. Xiang, Y.; Lu, Y.; Liu, J. Deep reinforcement learning based topology-aware voltage regulation of distribution networks with distributed energy storage. Appl. Energy 2023, 332, 120510. [Google Scholar] [CrossRef]
  52. Guo, G.; Zhang, M.; Gong, Y.; Xu, Q. Safe multi-agent deep reinforcement learning for real-time decentralized control of inverter based renewable energy resources considering communication delay. Appl. Energy 2023, 349, 121648. [Google Scholar] [CrossRef]
  53. Si, R.; Chen, S.; Zhang, J.; Xu, J.; Zhang, L. A multi-agent reinforcement learning method for distribution system restoration considering dynamic network reconfiguration. Appl. Energy 2024, 372, 123625. [Google Scholar] [CrossRef]
  54. Tzani, D.; Stavrakas, V.; Santini, M.; Thomas, S.; Rosenow, J.; Flamos, A. Pioneering a performance-based future for energy efficiency: Lessons learnt from a comparative review analysis of pay-for-performance programmes. Renew. Sustain. Energy Rev. 2022, 158, 112162. [Google Scholar] [CrossRef]
  55. Donnot, B. Grid2Op: A Testbed Platform to Model Sequential Decision Making in Power Systems. Available online: https://github.com/Grid2op/grid2op (accessed on 31 March 2025).
Figure 1. Illustration of coordination control of a hybrid power plant. The solid line flow denotes bidirectional cyber-physical communication. The dotted lines show the PV, WP, and ESS asset exchange variables at different control levels.
Figure 2. Illustration of heterogeneous HPP agents in the N—MADRL framework. Global state information is used for centralised training and local state information is used for decentralised execution by the PV, WP, and ESS agents.
Figure 3. Illustration of the proposed N—MADRL framework. The samples of HPP agents are trained in the current value network and estimate the Q-value function. The target value network is used as feedback to improve the training stability and sample efficiency.
Figure 4. Flowchart of the proposed N—MADRL framework for optimal dispatch and frequency control observing training stability and model efficiency.
Figure 5. Flowchart of numerical simulations of the proposed N—MADRL framework for optimal dispatch and frequency control.
Figure 6. Grid2Op HPP environment of the IEEE 14-bus system for PV, WP, and ESS plants.
Figure 7. Training results in the HPP environment: (a) cumulative rewards of the PPO, DDPG, and N—MADRL algorithms and (b) frequency deviation of the PPO, DDPG, and N—MADRL algorithms.
Figure 8. Frequency deviations in the HPP environment: (a) frequency deviation in the PV zone using the PPO, DDPG, and N—MADRL algorithms; (b) frequency deviation in the WP zone using the PPO, DDPG, and N—MADRL algorithms; (c) frequency deviation in the ESS zone using the PPO, DDPG, and N—MADRL algorithms.
Figure 9. Optimal dispatch under dynamic communications: (a) optimal dispatch of PV agents; (b) optimal dispatch of WP agents; (c) optimal dispatch of ESS agents; (d) frequency regulation of the HPP environment.
Figure 10. Optimal dispatch under partial communication: (a) optimal dispatch of P_PV agents for f_HP; (b) optimal dispatch of P_WP agents for f_HP; (c) optimal dispatch of P_ESS agents for f_HP; (d) frequency regulation of the HPP f_HP by optimal dispatch of the P_PV, P_WP, and P_ESS agents.
Figure 11. Optimal dispatch under full communication: (a) optimal dispatch of P_PV agents for f_HP; (b) optimal dispatch of P_WP agents for f_HP; (c) optimal dispatch of P_ESS agents for f_HP; (d) frequency regulation of the HPP f_HP by optimal dispatch of the P_PV, P_WP, and P_ESS agents.
Figure 12. Uncertainty control in the HPP environment: (a) inertial response of the HPP environment using the PPO, DDPG, and N—MADRL algorithms; (b) cumulative rewards of the HPP environment under uncertainty situations; (c) frequency deviation in the HPP environment under uncertainty situations.
Figure 13. Fully-connected communication topologies: (a) frequency deviations in the PV zone, (b) frequency deviations in the WP zone, and (c) frequency deviations in the ESS zone.
Figure 14. Partially-connected communication topologies: (a) frequency deviations in the PV zone, (b) frequency deviations in the WP zone, and (c) frequency deviations in the ESS zone.
Figure 15. Performance analysis of the HPP environment: (a) cumulative rewards under dynamic communication and (b) frequency deviations under dynamic communication.
Figure 16. Performance analysis of the IEEE 14 bus HPP environment: (a) frequency deviation under dynamic communication in the PV zone, (b) frequency deviation under dynamic communication in the WP zone, and (c) frequency deviation under dynamic communication in the ESS zone.
Table 1. MADRL schemes for energy management applications.
Refs. | Energy Source | Coordination | Algorithms | Scalability | Performance
[27] | BESS, DG | Decentralized | Q-learning | Low | Improves power dispatch
[28] | PV, ESS, WP | Centralized | MPC | Low | Energy scheduling
[29] | VPP | Distributed | Q-learning | Moderate | Energy cost optimization
[30] | Microgrid | Centralized | SAC, DDQN | Moderate | Enhances thermal management
[31] | PV, ESS | Distributed | MOPS | Low | Improves cost optimization
[32,33] | PV, WP, EV | Distributed | Dec-POMDP | Moderate | Marginal cost optimization
[34,35] | Smart grid | Centralized | Dec-POMDP | Moderate | Reduces frequency deviation
[36,37,38] | PV, WP, ESS | Decentralized | D3Q, MADDPG | Moderate | Fast convergence
[39,40,41] | WP, ESS, PV | Networked | SAC, DDQN | Moderate | Optimizes energy cost
[42,43,44] | WP, PV | Networked | DDPG, MAAC | Moderate | Energy scheduling
This paper | PV, WP, ESS | Networked | SAC, MAAC | High | Optimal dispatch and reserve
Table 2. Parameter settings for DRL algorithms.
Algorithm | Parameter | Value
N—MADRL | Actor hidden units | {128, 256}
N—MADRL | Actor hidden layers | {2, 5}
N—MADRL | Critic hidden units | {128, 256}
N—MADRL | Critic hidden layers | {2, 5}
N—MADRL | Learning rate | 0.0003
N—MADRL | α_pv, α_wp, α_ess | 0.60
PPO & DDPG | Actor hidden units | 256
PPO & DDPG | Critic hidden units | 256
PPO & DDPG | Optimizer | Adam
PPO & DDPG | Activation function | ReLU
Shared | Learning rate | 0.0003
Shared | Mini-batch size M | 64
Shared | Replay buffer D | 10^6
Shared | Discount factor γ_t | 0.97
Shared | Total iterations k | 10^6
Shared | Soft update τ | 0.10
Table 3. The dynamics of the PV, WP, and ESS plants in the HPP environment.
g-id | Bus | Asset | U_ij^pv | U_ij^wp | U_ij^ess | P_PV^c | P_WP^c | P_ESS^c | R_PV | R_WP | R_ESS
0 | 1 | pv | 0.039 | 0 | 0 | 11.7 | 0 | 0 | 2.34 | 0 | 0
1 | 5 | ess | 0 | 0 | 0.032 | 0 | 0 | 21.8 | 0 | 0 | 4.36
2 | 3 | wind | 0 | 0.026 | 0 | 0 | 26.7 | 0 | 0 | 5.34 | 0
3 | 12 | ess | 0 | 0 | 0.030 | 0 | 0 | 25.75 | 0 | 0 | 5.15
4 | 2 | pv | 0.036 | 0 | 0 | 14.8 | 0 | 0 | 2.96 | 0 | 0
5 | 6 | wind | 0 | 0.023 | 0 | 0 | 41.4 | 0 | 0 | 8.28 | 0
6 | 4 | pv | 0.034 | 0 | 0 | 17.6 | 0 | 0 | 3.52 | 0 | 0
7 | 13 | ess | 0 | 0 | 0.029 | 0 | 0 | 18.6 | 0 | 0 | 3.72
8 | 9 | wind | 0 | 0.025 | 0 | 0 | 33.4 | 0 | 0 | 6.68 | 0
9 | 11 | pv | 0.037 | 0 | 0 | 10.4 | 0 | 0 | 2.08 | 0 | 0
10 | 7 | wind | 0 | 0.021 | 0 | 0 | 22.8 | 0 | 0 | 4.56 | 0
11 | 10 | ess | 0 | 0 | 0.031 | 0 | 0 | 22.7 | 0 | 0 | 4.54
12 | 8 | pv | 0.035 | 0 | 0 | 10.7 | 0 | 0 | 2.14 | 0 | 0
13 | 14 | pv | 0.038 | 0 | 0 | 12.2 | 0 | 0 | 2.44 | 0 | 0
Table 4. Computational cost comparison of the N—MADRL, PPO, and DDPG algorithms.
Algorithm | Training (h) | Validation (min) | Testing (min)
N—MADRL | 6.50 | 25 | 13
PPO | 8.25 | 40 | 21
DDPG | 10.20 | 50 | 24
Table 5. Performance comparison of the N—MADRL, PPO, and DDPG algorithms.
Algorithm | Avg. Reward | Avg. Deviation (Hz) | Avg. Power (MW)
N—MADRL | −9.93 | 0.013 | 0.22
PPO | −12.45 | 0.041 | 0.38
DDPG | −24.58 | 0.053 | 0.57
Table 6. Control performance of the DRL algorithms under partially-connected and fully-connected communication topologies.
Scenario | Algorithm | Reward | Deviation (Hz) | Std. Deviation | Convergence Rate | Threshold | Steps
Scenario 1 | N—MADRL | −25 | 0.085 | 4.37 | 0.61 | 61% | 21,100
Scenario 1 | PPO | −54 | 0.098 | 7.11 | 0.64 | 64% | 28,400
Scenario 1 | DDPG | −128 | 0.106 | 9.34 | 0.62 | 62% |
Scenario 2 | N—MADRL | −35 | 0.092 | 4.37 | 0.37 | 37% | 33,200
Scenario 2 | PPO | −70 | 0.103 | 6.48 | 0.41 | 41% | 41,230
Scenario 2 | DDPG | −135 | 0.160 | 7.15 | 0.39 | 39% |
Table 7. Performance of the PV plant in IEEE 14 bus systems.
g-id | Bus | α_pv | U_ij^pv | P_PV^c | R_PV
0 | 1 | 0.618 | 0.039 | 11.7 | 2.34
4 | 2 | 0.562 | 0.036 | 14.8 | 2.96
6 | 4 | 0.647 | 0.034 | 17.6 | 3.52
9 | 11 | 0.683 | 0.037 | 10.4 | 2.08
12 | 8 | 0.594 | 0.035 | 10.7 | 2.14
13 | 14 | 0.616 | 0.038 | 12.2 | 2.44
Table 8. Performance of the WP plant in IEEE 14 bus systems.
g-id | Bus | α_wp | U_ij^wp | P_WP^c | R_WP
2 | 3 | 0.594 | 0.026 | 26.7 | 5.34
5 | 6 | 0.631 | 0.023 | 41.4 | 8.28
8 | 9 | 0.658 | 0.025 | 33.4 | 6.68
10 | 7 | 0.614 | 0.021 | 22.8 | 4.56
Table 9. Performance of the ESS plant in IEEE 14 bus systems.
g-id | Bus | α_ess | U_ij^ess | P_ESS^c | R_ESS
1 | 5 | 0.670 | 0.032 | 21.8 | 4.36
3 | 12 | 0.621 | 0.030 | 25.75 | 5.15
7 | 13 | 0.632 | 0.029 | 18.6 | 3.72
11 | 10 | 0.594 | 0.031 | 22.7 | 4.54
