A Two-Layer User Energy Management Strategy for Virtual Power Plants Based on HG-Multi-Agent Reinforcement Learning

Tian, Sen; Xiao, Qian; Li, Tianxiang; Wang, Zibo; Qiao, Ji; Zhu, Hong; Ji, Wenlu

doi:10.3390/app15126713

Open AccessArticle

A Two-Layer User Energy Management Strategy for Virtual Power Plants Based on HG-Multi-Agent Reinforcement Learning

by

Sen Tian

¹

,

Qian Xiao

¹

,

Tianxiang Li

^1,*

,

Zibo Wang

²,

Ji Qiao

²

,

Hong Zhu

³ and

Wenlu Ji

³

¹

State Key Laboratory of Intelligent Power Distribution Equipment and System, Tianjin University, Tianjin 300072, China

²

China Electric Power Research Institute, Beijing 100192, China

³

Nanjing Power Supply Company, State Grid Jiangsu Electric Power Co., Nanjing 210019, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(12), 6713; https://doi.org/10.3390/app15126713

Submission received: 6 May 2025 / Revised: 4 June 2025 / Accepted: 11 June 2025 / Published: 15 June 2025

(This article belongs to the Special Issue Emerging Trends in Energy Management: Techniques, Applications and Future Directions)

Download

Browse Figures

Versions Notes

Abstract

Household loads are becoming dominant in virtual power plants (VPP). However, their dispatch potential has not yet been explored due to the lack of detailed user power management. To solve this issue, a novel two-layer user energy management strategy based on HG-multi-agent reinforcement learning has been proposed in this paper. Firstly, a novel two-layer optimization framework is established, where the upper layer is applied to coordinate the scheduling and benefit allocation among various stakeholders and the lower layer is applied to execute intelligent decision-making for users. Secondly, the mathematical model for the framework is established, where a detailed household power management model is proposed in the lower layer, and the generated predicted power demands are used to replace the conventional aggregate model in the upper layer. As a result, the energy consumption behaviors of household users can be precisely described in the scheduling scheme. Furthermore, an HG-multi-agent reinforcement-based method is applied to accelerate the game-solving process. Case study results indicate that the proposed method leads to a reduction in user costs and an increase in VPP profit.

Keywords:

hierarchical game; virtual power plant; multi-agent reinforcement learning; optimized scheduling; demand response; energy management strategy

1. Introduction

As an emerging energy autonomy platform, the Virtual Power Plant (VPP) enables the high-level integration and intelligent coordination of distributed clean and renewable energy resources [1,2,3]. Through advanced control technologies and information and communication systems, VPPs facilitate efficient interaction and optimal scheduling between energy production and consumption sides. This not only enhances the flexibility and reliability of the energy system but also promotes large-scale integration and utilization of clean energy, ensuring a secure, economical, green, and stable power supply for end users. VPPs thus play a vital role in advancing the development of a low-carbon and efficient modern energy system.

According to statistics, household energy consumption accounts for over 27% of total societal energy consumption and continues to rise. This indicates the growing importance of residential users in the overall energy consumption structure. However, as household loads increase in scale, competition among VPP stakeholders intensifies. Conventional scheduling methods overlook differences between household users, which leaves the dispatch potential among users largely unexplored [4,5,6,7]. Therefore, the current research focuses on formulating VPP optimization strategies that meet the needs of multiple stakeholders and accommodate the diversity in the dynamics of user energy behavior. Many studies have been conducted on developing optimized operation strategies for VPPs. Considering the trend towards multi-agent participation in the VPP, the discussion below primarily focuses on two aspects: the energy behaviors of each user and the multi-agent game-solving method.

Regarding the energy behaviors of each user, to explore the dispatch potential of users in particular, their energy consumption patterns should first be modeled. As a crucial means of energy regulation in VPPs, demand response (DR) has garnered significant attention. Some scholars have established operational models under DR mechanisms. In reference [8], a three-stage stochastic two-tier optimization model for VPP is proposed, which enhances the market competitiveness and economic efficiency of virtual power plants by incorporating demand response scenarios for distributed resources within the VPP into the model. In reference [9], a risk-constrained day-ahead scheduling strategy based on information gap decision theory is proposed, considering the uncertainties from the participation of concentrating solar power plants, responsive residential, and industrial loads in the demand response mechanism. In reference [10], a bidirectional flow of power and information between energy service providers and electricity consumers within a smart grid model is established. It applies a price-based dynamic DR mechanism, which utilizes clustering algorithms to identify user types for implementing DR. Various market-clearing models considering DR mechanisms are proposed in reference [11,12,13,14]. Some scholars have placed VPP models within reinforcement learning frameworks for their research. In reference [15], dynamic demands within VPPs are considered, and a multi-agent reinforcement learning energy trading model that includes users and energy supply appliances is developed. In reference [16], a multi-stage reward mechanism for VPPs within a deep learning framework is introduced, quantifying the impact of multiple entities on the VPP during operation, thereby reducing the difficulty of model training. In reference [17,18], an optimization model based on Q-learning to reduce energy costs and dissatisfaction costs caused by DR is proposed. Most of the literature establishes multi-agent models for VPPs, utilizes price-based DR regulation models for energy supply, or employs reinforcement learning frameworks to enhance model accuracy during operation. However, as the variety of electrical appliances in household users increases, the difficulty of load regulation also rises. The energy dispatch instructions issued by VPPs struggle to accommodate diverse user energy demands, thus limiting the dispatch potential on the user side. Reinforcement learning methods, with their ability to iteratively explore and experiment in dynamic environments and possessing learning memory, hold certain advantages in formulating complex multi-appliance instant adjustment schemes. Within reinforcement learning methods, Q-Learning has emerged as a preferred choice for device-level energy management in residential users due to several key advantages: it does not rely on an accurate environmental model, features a simple algorithmic structure, offers strong adaptability to changing conditions, and demonstrates good convergence properties. These characteristics make Q-Learning particularly well-suited for managing the complex and dynamic energy consumption behaviors in household-level VPP applications.

Regarding the multi-agent game-solving method, the diverse array of energy supply and consumption entities within VPPs possesses varying interests. Conventional centralized algorithms have limited capabilities in resolving conflicts among multiple stakeholders, resulting in poor solution speed and accuracy. Therefore, algorithms rooted in game theory have become a more suitable choice. In reference [19], a non-cooperative game model between VPPs is developed to achieve optimal pricing of photovoltaic feed-in with reduced load bias. In reference [20], a hierarchical energy management structure for VPPs is proposed and a two-tier optimal scheduling model based on multi-intelligent agent systems and DR is established to realize the benefits of VPPs. In reference [21,22], the authors consider the incompleteness of the information game of multiple stakeholders within a VPP and research the interaction and gaming behaviors among stakeholders, ultimately improving the overall benefit. In reference [23], the energy management of the VPP is modeled using the multi-leader-multi-follower Stackelberg game theory, which enhances both VPP profit and customer energy consumption levels. In reference [24], a profit-sharing model for multiple VPP operators in energy reserve trading is established, benefit-sharing is designed using asymmetric Nash bargaining, and the model’s validity is confirmed through the alternating direction multiplicative decomposition technique. In reference [25], differences in the accumulated knowledge of game participants are considered, local incomplete game theory is introduced into the computational model of the multi-subject game for multi-agent systems, and the stability of the methodology is analyzed. In reference [26], the authors address the optimal scheduling problem faced by power retail companies in the context of Virtual Power Plants (VPPs), taking into account the uncertainties associated with photovoltaic generation, energy storage, and controllable loads. To this end, a leader–follower (Stackelberg) game-based optimization model is developed. In reference [27], a prospect theory-based Stackelberg game model is proposed to address the energy trading problem between prosumers and power companies under price uncertainty. By introducing reference points and loss aversion coefficients, the model captures the subjective decision-making behavior of prosumers and determines the optimal energy trading price. In reference [28], A multi-level energy management approach based on a multi-leader–follower game theory framework is proposed, enabling negotiation between power suppliers and building users as well as user demand scheduling. This approach effectively reduces the overall system cost. Although the above literature has established a more accurate multi-agent game model, there remains potential for enhancement in algorithmic speed and accuracy.

Based on the above analysis, a novel two-layer user energy management strategy based on HG-multi-agent reinforcement learning is proposed. The primary contributions of this article can be outlined in the following three aspects:

(1): A two-layer optimization framework for the VPP has been proposed, in which the upper layer coordinates the scheduling and benefit allocation among various stakeholders and the lower layer executes intelligent decision-making for users.
(2): A detailed household power management model is proposed to generate predicted power demands and allocate power within each household user in the lower layer.
(3): A HG-multi-agent reinforcement-based method is proposed, which reconstructs the objective functions of each stakeholder with the hierarchical game algorithm to simplify the solving process, utilizing the multi-agent Q-learning algorithm combined with upper-layer solution information to accelerate training and solving.

The following is the outline of the upcoming sections of this paper. Section 2 analyzes the tasks of various stakeholders in the VPP and proposes a two-layer optimization framework. Section 3 presents mathematical models for the two-layer framework. Section 4 introduces the HG-multi-agent reinforcement-based method for the framework. Section 5 discusses the simulation results. Finally, Section 6 concludes this paper.

2. Two-Layer Optimization Framework for VPP

The two-layer optimization framework proposed in this paper is shown in Figure 1.

The framework consists of four categories of stakeholders: the grid, the VPP aggregator, the household collection, and the energy storage station (EST). The EST comprises three types of appliances: batteries (BA), photovoltaics (PV) and wind turbines (WT). The VPP aggregator includes appliances such as combined heat and power systems (CHP) and PV. The household collection consists of multiple household users, each with multiple electrical appliances and a home energy management system (HEMS). The HEMS can integrate artificial intelligence algorithms to assist users in decision-making.

2.1. Operation Tasks of the Two-Layer Optimization Framework

The operation task of each layer in the proposed framework is described as follows:

(1): Upper layer: Coordinate the scheduling and benefit allocation among stakeholders

The mechanism for energy and information flow within the upper layer framework is designed as follows: the grid and the VPP aggregator sign contracts, with the VPP aggregator coordinating and managing household users’ energy consumption. The EST ensures its interests by complying with the VPP aggregator’s management. Each household user reports load adjustment space and satisfaction cost coefficients, which are aggregated into the household collection model as predicted power demands. The focus of the upper-layer framework research lies in the game of multiple stakeholders, which involves the power allocation across each stakeholder.

(2): Lower layer: Execute intelligent decision-making for users

The mechanism for energy and information flow within the lower layer is designed as follows: household users receive the outcomes of the game from the upper layer and adjust household electricity loads by training their HEMS. Users’ appliances are categorized into three types based on their electricity usage: fixed appliances, curtailable appliances, and shiftable appliances, and are assisted by reinforcement learning theory for decision-making. The detailed classification of the devices is provided in Section 3. The focus of research in the lower-layer framework is to devise load adjustment strategies for users considering their energy consumption preferences.

2.2. Comparison of the Framework

The comparison of the conventional framework and the two-layer framework proposed in this paper is shown in Figure 2.

The two-layer framework proposed in this paper expands upon the conventional game-theoretic framework by incorporating household load management on the user side. The conventional game-theoretic framework computes unified scheduling for user collections, yielding results that do not offer practical guidance for individual VPP users. In contrast, the two-layer framework introduced in this paper considers variations in user electricity consumption scale and time-of-use preferences, establishing precise user mathematical models that interact with other VPP stakeholders. This framework formulates scheduling processes that better align with the needs of each stakeholder in the VPP while also devising appliance-level schemes on the user side to unlock their dispatch potential.

3. Mathematical Model for the Proposed Framework

Due to the differing interests of each stakeholder, the upper layer of the VPP exhibits complex game relationships. Therefore, separate mathematical models for each stakeholder and the DR model they rely on are established as follows:

3.1. Mathematical Model for the Upper Layer

(1): VPP aggregator model

The scheduling of the VPP throughout the day is divided into T slots, where the VPP aggregator receives grid load peaking commands, and the price of electricity sold to household users is subject to constraints:

λ_{G R, t}^{b u y} < λ_{A G, t}^{s e l l} < λ_{G R, t}^{s e l l}

(1)

where

λ_{G R, t}^{b u y}

and

λ_{G R, t}^{s e l l}

are the price of electricity purchased and sold from the grid at time t, respectively;

λ_{A G, t}^{s e l l}

is the price of electricity sold by the VPP aggregator at the time slot t in a day.

The VPP aggregator is equipped with CHP and PV, and CHP consumes natural gas to generate electricity, modeling the operating cost of its unit as follows:

E_{C H P, t} = \frac{g_{n a}}{Q_{n a} η_{C H P}} P_{C H P, t}

(2)

where

P_{C H P, t}

is the electrical power output from CHP at time t in a day; g_na and Q_na are the unit price and calorific value of natural gas, respectively;

η_{C H P}

is the generation efficiency of CHP. The profit of the VPP aggregator in a day can be expressed as follows:

E_{A g} = E_{A g}^{s} + E_{G r}^{D R} - E_{E S T}^{D R} - E_{H h}^{D R} - E_{C H P}

(3)

where

E_{A g}^{s}

is the benefit received by the VPP aggregator from the sale of electricity to the household users;

E_{G r}^{D R}

is the DR subsidy from the grid to the VPP aggregator;

E_{E S T}^{D R}

is the DR subsidy paid by the VPP aggregator to the EST;

E_{H h}^{D R}

is the DR subsidy paid by the VPP aggregator to household users; E_CHP is the operating cost of CHP. The calculations of the DR subsidy can be found in Equations (11) and (12).

(2): EST model

The EST is internally configured with PVs and WPs to provide electric energy supplemental services primarily to household users and can be considered for participation in DR missions. For a user collection containing n household users, the capacity of EST at time slot t + 1 in a day can be expressed as follows:

C_{E S T}^{t + 1} = C_{E S T}^{t} + (η_{E S T}^{c} P_{E S T, t}^{c} - \frac{P_{E S T, t}^{d}}{η_{E S T}^{d}}) Δ t

(4)

where

C_{E S T}^{t}

is the energy storage capacity of the EST at time slot t.

P_{E S T, t}^{c}

,

P_{E S T, t}^{d}

are the charging and discharging power of EST at time slot t;

η_{E S T}^{c}

,

η_{E S T}^{d}

are the transfer efficiencies during EST charging and discharging processes.

To guarantee that the EST can continue to provide service, the total power used by all customers at any one time is restricted to not exceed the capacity of the EST:

P_{E S T}^{\min} \leq P_{E S T}^{t} \leq P_{E S T}^{\max}

(5)

where

P_{E S T}^{\max}

and

P_{E S T}^{\min}

are the maximum and minimum values allowed for EST capacity. Consider that the user’s charging and discharging power is limited by EST in any given time slot:

|\frac{P_{E S T}^{t + 1} - P_{E S T}^{t}}{Δ t}| \leq P_{E S T, t}^{d}

(6)

The benefits of the EST consist of both DR incentives and customer payments for electricity consumption, as well as the cost of charging and discharging. Therefore, the benefit function of EST in a day is as follows:

E_{E S T} = E_{E S T}^{D R} + \sum_{t = 1}^{T} (λ_{E S S, t} P_{E S T, t}^{d} Δ t - λ_{c} P_{E S T, t}^{c} Δ t - λ_{d} P_{E S T, t}^{d} Δ t)

(7)

where λ_ESS,t, is the fee that the users need to pay per unit of discharge power; λ_c and λ_d are fees that the EST unit charges and discharging cost coefficients, respectively.

(3): Household collection model

Electric loads of household users are categorized into three types: fixed loads, curtailable loads, and shiftable loads. Fixed loads are inherently inflexible and can only consume power at fixed times, curtailable loads receive a certain degree of load curtailment, and shiftable loads can transfer part of their loads to other times. Therefore, the total electrical loads of a household user under time slot t in a day can be expressed as follows:

L_{H, t} = L_{H, t}^{f} + L_{H, t}^{c} + L_{H, t}^{s}

(8)

where

L_{H, t}^{f}

,

L_{H, t}^{c}

, and

L_{H, t}^{s}

are the fixed, curtailable, and shiftable loads of a household user at time slot t, respectively. The energy cost function of a household user in a day is as follows:

E_{H h} = \sum_{t = 1}^{T} (E_{H h, t}^{A g} + E_{H h, t}^{E S T} + E_{s, t} - E_{H h}^{D R})

(9)

where

E_{H h, t}^{A g}

and

E_{H h, t}^{E S T}

are the cost of electricity purchased by household users from the VPP aggregator and the cost of electricity purchased from EST at time slot t; E_s,t is the satisfaction cost arising from load adjustment for household users.

E_{s, t} = \frac{y_{k}}{2 P_{H h, t}^{r}} {(L_{H h, t}^{r})}^{2} - y_{k} L_{H h, t}^{r} + \frac{y_{k}}{2} P_{H h, t}^{r}

(10)

where

L_{H h, t}^{r}

is the power of class r load after user DR at time slot t; y_k is the comfort factor, which is set by the user according to their preference;

P_{H h, t}^{r}

is the initial value of class r load before the user DR at time slot t.

(4): Laddered DR mechanism

Household users’ electricity behavior is uncertain, so the predicted DR scheme and the actual load-side demand situation may have deviations, that is, the effect of load-side adjustment cannot meet the baseline load. To reduce the impact of this phenomenon, a laddered DR mechanism is proposed as follows:

E_{K, t}^{D R, d} = \{\begin{matrix} E_{K, b}^{D R, d} & P_{K, t}^{D R} \leq B a_{t} \\ E_{K, b}^{D R, d} - (P_{K, t}^{D R} - B a_{t}) (1 + z) & B a_{t} < P_{K, t}^{D R} \leq B a_{t} + j \\ E_{K, b}^{D R, d} - (P_{K, t}^{D R} - B a_{t}) (1 + 2 z) & B a_{t} + j < P_{K, t}^{D R} \leq B a_{t} + 2 j \\ E_{K, b}^{D R, d} - (P_{K, t}^{D R} - B a_{t}) (1 + 3 z) & B a_{t} + 2 j < P_{K, t}^{D R} \leq B a_{t} + 3 j \\ 0 & B a_{t} + 3 j < P_{K, t}^{D R} \end{matrix}

(11)

E_{K, t}^{D R, c} = \{\begin{matrix} E_{K, b}^{D R, c} & B a_{t} \leq P_{K, t}^{D R} \\ E_{K, b}^{D R, c} - (B a_{t} - P_{K, t}^{D R}) (1 + z) & B a_{t} - j \leq P_{K, t}^{D R} < B a_{t} \\ E_{K, b}^{D R, c} - (B a_{t} - P_{K, t}^{D R}) (1 + 2 z) & B a_{t} - 2 j \leq P_{K, t}^{D R} < B a - j \\ E_{K, b}^{D R, c} - (B a_{t} - P_{K, t}^{D R}) (1 + 3 z) & B a_{t} - 3 j \leq P_{K, t}^{D R} < B a - 2 j \\ 0 & P_{K, t}^{D R} < B a - 3 j \end{matrix}

(12)

where

E_{K, t}^{D R, d}

and

E_{K, t}^{D R, c}

are the incentive values for the DR program corresponding to the load spike and load trough moments, respectively; K is the user type, which contains VPP aggregator, EST, and household users;

E_{K, b}^{D R, d}

is the value of the reward for meeting the DR standard;

P_{K, t}^{D R}

and Ba_t are the adjusted and baseline loads of subject K at time slot t, respectively; z and j are the penalty factor and load regulation margin, respectively.

3.2. Mathematical Model for the Lower Layer

In the lower layer of the framework, a mathematical model under the reinforcement learning paradigm is established for household users. The model incorporates three types of appliance agents into HEMS: fixed appliance agent, curtailable appliance agent, and shiftable appliance agent. Under the reinforcement learning paradigm, a task is considered a Markov decision process (MDP) if the state and environment of the intelligence satisfy Markovian properties. An MDP typically consists of five tuples (S, A, P( , ), r, u), where S denotes the state set, A denotes the action space, P( , ) denotes the action transition probability, r denotes the immediate reward, and u is the learning rate. The details of the MDP formulation are described as follows:

(1): State

The environmental information required during the operation of appliances together forms the agent’s state space. In this paper, the state S_t at the time slot t can be defined as follows:

S_{t} = [λ_{t}^{H}, λ_{t + 1}^{H}, \dots, λ_{T}^{H}, L_{t}^{H}, L_{t + 1}^{H}, \dots, L_{T}^{H}]

(13)

where

[λ_{t}^{H}

,

λ_{t + 1}^{H}

, …,

λ_{T}^{H}]

are the tariff under the current time slot t and the predicted tariff from the next time slot t + 1 to the end time slot T;

[L_{t}^{H}

,

L_{t + 1}^{H}

, …,

L_{T}^{H}]

are the adjusted load target under the current time slot t and the adjusted load target from the next time slot t + 1 to the end time slot T.

(2): Action

The control actions of household appliance agents can be categorized based on the type of appliance as follows [29]:

(a) Fixed appliance agent: Fixed appliances, such as EPS, require stability. These appliances are essential for daily life, so their demand must be met. The action space for these appliances includes the following:

A_{f i x e d} = [o n]

(14)

(b) Curtailable appliance agent: Curtailable appliances, such as REF and AC, possess the characteristic of load reduction. Household users can adjust the demand for such appliances within a certain range, enabling a discretized selection of load reduction actions within the following action space:

A_{c u r t a} = [A_{R E F, k}, A_{A C, k}]

(15)

where k is the level, which is classified according to the operating power of the appliance; A_REF,k and A_AC,k represent the action values of REF and AC at the k level.

(c) Shiftable appliance agent: Shiftable appliances, such as CP, WM, and WH, can flexibly adjust the timing of energy usage. These appliances operate at two distinct working points: on and off. The action space for these appliances is as follows:

A_{s h i f t} = [A_{C P}, A_{W M}, A_{W H}]

(16)

(3): Reward

Setting the reward function can constrain the action of the appliance agent to achieve the training needs. The reward function settings of various types of appliance agents are as follows:

(a) Fixed appliance agent: The output of fixed appliances cannot be changed, so their reward value is only related to the tariff. The reward function is as follows:

E_{f i x e d} = \max [(- λ_{A g, t} \times P_{E P S, t}), 0]

(17)

where λ_Ag,t is the time-sharing tariff given by the VPP aggregator’s upper-layer game at time slot t; P_EPS,t is the case of the appliance output of the EPS at time slot t.

(b) Curtailable appliance agent: The incentive function for curtailable appliances includes the satisfaction costs resulting from the tariff and load adjustments. The reward function is as follows:

E_{c u r t a} = \max [(- λ_{A g, t} \times P_{k, t}), 0] - E_{j, t}

(18)

where k is the set of curtailable appliances; E_j,t is the satisfaction cost, given by Equation (10).

(c) Shiftable appliance agent: The incentive function for shiftable appliances includes the satisfaction costs resulting from the tariff and load adjustments. The reward function is as follows:

E_{s h i f t} = \max [(- λ_{A g, t} \times P_{l, t}), 0] - E_{l, t} - λ_{u s e} |L_{t}^{H} - P_{k, t} - P_{l, t}|

(19)

where l is the set of shiftable appliances; E_l,t is the satisfaction cost, given by Equation (10); λ_use is the penalty factor for insufficient user compliance with DR, calculated by Equations (11) and (12).

4. HG-Multi-Agent Reinforcement Learning

To improve the speed and accuracy of model solving, a HG-multi-agent reinforcement learning method is applied to the proposed framework. The objective functions of stakeholders in the upper layer are reconstructed based on the Nash equilibrium principle. Then, the whole objectives are integrated to form a hierarchical game (HG) algorithm. After the results are transmitted to the HEMS in the lower layer, a multi-agent Q-learning algorithm is adopted to train the agents of each appliance within household users, assisting users in making quick decisions.

4.1. Flowchart of the Solving Method

The flowchart of the HG-multi-agent reinforcement learning method is illustrated in Figure 3. In the multi-agent game process, the power allocation among each stakeholder of the VPP is resolved through hierarchical game algorithms based on time-of-use pricing information and overall load adjustment goals for lower-layer household users. In the HEMS training process, each household’s HEMS collects user preference data, combined with time-of-use pricing information, solved by the multi-agent game process, and each HEMS is trained using the multi-agent Q-learning algorithm. This process determines optimal scheduling strategies for each household user, thereby achieving optimized operation of the entire VPP.

4.2. Hierarchical Game Algorithm

The Nash equilibrium principle states that in a game, no participant can improve their outcome by unilaterally changing their strategy. Thus, the combination of strategies chosen by each participant is considered a Nash equilibrium solution. Based on this principle, the following objective function reconstruction method is proposed in the paper:

T_{i}^{H i} = \frac{a b s (T_{i}^{n} - T_{i}^{\max})}{T_{l}}

(20)

where

T_{i}^{H i}

is the objective function of the reconstructed subject i;

T_{i}^{n}

is the benefit of subject i under the current solution;

T_{i}^{\max}

is the maximum benefit that subject i can achieve on its own, given that the other subjects’ strategies are determined; T_l is a positive reference value. Based on the objective function of each subject after the reconstruction, the objectives of the VPP aggregator, the EST and the household collection are integrated into one objective function:

T = \sum_{j = 1}^{d} \{\begin{matrix} β_{1} & φ_{2} < T_{i}^{H i} < φ_{1} \\ β_{2} & φ_{3} < T_{i}^{H i} < φ_{2} \\ ⋮ & ⋮ \\ β_{i} & φ_{i + 1} < T_{i}^{H i} < φ_{i} \\ ⋮ & ⋮ \\ β_{m} & T_{i}^{H i} < φ_{m} \end{matrix}

(21)

where T is the objective function of the hierarchical game algorithm; d is the number of subjects involved in the game, which in this paper is 3; m is the number of levels set by the hierarchical game algorithm, which needs to be set according to the scenario; φ_i (i = 1,2…m) is the threshold set by the hierarchical game algorithm; β_i (i = 1,2…m) is the reward value set by the hierarchical game algorithm, where if β_m is set to 1, then each set of Nash equilibrium solutions corresponding to the objective function of each subject is 1. At this time, the final value of the hierarchical objective function is equivalent to the sum of the number of subjects.

4.3. Multi-Agent Q-Learning Algorithm

After acquiring the time-sharing tariff information and user DR target information from the multi-agent game layer, the actual response strategy is formulated for household users. The multi-agent Q-learning algorithm is an algorithm that selects the optimal action by exploring and exploiting mechanisms to update the optimal Q-values.

The fundamental mechanism of this algorithm involves constructing a Q-table, where the Q-values Q(s_t, a_t) are updated based on the selection of actions and the updating of states until complete convergence is achieved. The optimal Q-value

Q_{π}^{*}

(s_t, a_t) can be obtained by using the Bellman equation given below.

Q_{π}^{*} (s_{t}, a_{t}) = r (s_{t}, a_{t}) + γ * \max Q (s_{t + 1}, a_{t + 1})

(22)

Algorithm 1 outlines the process of the multi-agent Q-learning algorithm for solving the model. This algorithm Algorithm 1 outlines the process of the multi-agent Q-learning algorithm for solving the model. This algorithm achieves optimal decision-making based on the actions of household appliances and their initial states. Specifically, it begins with the initialization of Q-values. Then, at each time slot within each episode, household appliances select an action, calculate the reward value r(s_t, a_t), obtain the next state value s_t+1, and update Q(s_t, a_t) until reaching the maximum time slot. This process is repeated at each episode until the difference in Q-values between consecutive steps is less than a predetermined threshold. At this point, the algorithm terminates and outputs the optimal actions for the appliances.

Algorithm 1: Multi-Agent Q-learning Algorithm
1	Initialization initialize Q-value
2	for episode = 1,2,…, max episode do
3	for t = 1,2,…, max time slot do
4	Initialize the state s_t
5	Choose the arbitrary action a_t of the device by (14)–(16)
6	Calculate the current reward r(s_t, a_t) and the next state s_t+₁ by(17)–(19)
7	Update the Q-value Q(s_t, a_t) by(21)
8	Save s_t+₁
9	Calculate the absolute value diff between the Q values under two episodes
10	End until the diff is less than the set value ρ
11	Output the optimal policy

5. Evaluation Results

5.1. Case Study Setting

To validate the speed and accuracy of the proposed method, simulations are conducted in the scene depicted in Figure 4. In this scene, eight household users participate in DR, and their load forecasts are illustrated in Figure 5, along with the baseline load set by the grid. The power grid sets the overall target load that users need to achieve.

To emphasize the advantages of the two-layer optimization framework for the VPP proposed in this paper, four comparison scenes, as shown in Table 1, are established. The simulation is conducted separately in the aforementioned four scenes. It should be noted that, given that the conventional DR method is not inherently compatible with the household refinement modeling concept proposed in this paper, household users are assumed to distribute the superior instructions evenly while executing response decisions.

The comparative results of the scenes are presented in Table 2. From Table 2, compared to Scene 1, Scene 4 exhibits an increase in VPP aggregator benefits by USD 187.00, a reduction in household energy costs by USD 222.01, and a decrease in EST benefits by USD 21.30. The increase in combined benefits and the reduction in household costs are mainly due to the grid’s DR policy subsidy and the adjustment of customers’ electricity use hours. Since Scene 1 is not included in DR, the energy supply during peak load hours is mainly supplemented by EST, and thus Scene 1 has the highest EST benefits compared to the other scenes.

In comparison to Scene 2, the model proposed in this paper intricately considers household energy preferences, enhancing household satisfaction and DR program feasibility. The conventional DR method exhibits limited load regulation capabilities, resulting in weaker performance across all metrics than in Scene 4.

In Scene 3, the EST exits the game, losing the subsidy from DR compliance. It only needs to meet the household’s load-adjusted shortfall, significantly reducing its benefits. Meanwhile, the VPP aggregator and households take on more load-adjusted tasks, leading to slightly worse relative benefits and energy usage costs than in Scene 4.

5.2. Analysis of Game Results

To illustrate the outcomes of the game, the post-game VPP aggregator electricity price curve, detailed distribution effect charts of DR indicators, and deviation of adjusted household user load from baseline loads are present in Figure 6, Figure 7 and Figure 8.

As shown in Figure 6, it can be seen that in the game process, where the selling price is the interactive factor influencing each agent’s output, the VPP aggregator integrates the interests and coupling relationships among the stakeholders. This integration allows it to achieve an energy selling price that lies between the grid selling price and the purchasing price.

Figure 7 illustrates that households’ final DR effect deviates slightly from the baseline load, with the maximum positive and negative deviations occurring at 14:00 and 16:00, reaching 10.1 kW and −13.3 kW, respectively. In the DR mechanism outlined in this paper, households will not face penalties for load reductions below the baseline, enabling them to selectively trim loads to lower energy consumption costs. During peak energy consumption periods, households balance the satisfaction cost of energy usage with the rewards of DR within the stepped DR mechanism established herein. While deviating somewhat from the baseline load, the DR scheme remains feasible, and the resulting deviations are acceptable to the VPP aggregator.

Figure 8 depicts the appliance-level energy consumption profiles for households 1–8 throughout the day. This scheme considers variations in the timing and scale of electricity usage by household appliances, as well as the satisfaction impact of each household’s participation in DR, thereby achieving tailored effects for individual users. Simultaneously, EST engages in the upper-layer game within the VPP to determine its energy charge and discharge strategies, aligning power output with household responses and effectively implementing DR plans in collaboration with household users.

5.3. Comparative Analysis of the HG-Multi-Agent Reinforcement Learning Method

The HG-multi-agent reinforcement learning method proposed in this paper is compared with other methods cited in references [19,23,24] to showcase its superiority. Comparison method 1, based on reference [19], employs a bi-level game algorithm solved using an iterative search approach. Comparison method 2, from reference [23], constructs a leader–follower game model and utilizes a genetic algorithm to find a solution. Comparison method 3, as described in reference [24], solves the game problem using the ADMM algorithm. The results, including computation time statistics and user cost statistics after DR compensation, are depicted in Figure 9.

In Figure 9a,b, it is evident that the proposed method outperforms the methods in the comparative reference in terms of computational time and user energy costs after DR compensation. Specifically, compared to the method proposed in reference [19], the computational time is reduced by 23.8%. Additionally, compared to the method in reference [24], user costs are decreased by 8.8%. It is worth noting that due to the inherent stochastic nature of this method, the provided computational data represents averaged results from ten simulations of the original test cases.

5.4. Convergence Analysis

The convergence analysis of the HG-multi-agent reinforcement learning method comprises two layers: the upper layer involves the game among the VPP aggregator, household connection, and EST, while the lower layer features the multi-agent Q-learning model represented by the household agents. This study selected HEMS training data from household user 1 for analysis. The overall convergence of this algorithm is illustrated in Figure 10 and Figure 11.

As shown in Figure 10, it is evident that the upper-layer multi-agent game model converges by the 44th iteration, with computational errors for all agents remaining within the predefined threshold. Figure 10 illustrates the training results of HEMS for household user 1, indicating that in the multi-agent Q-learning model, each appliance agent gradually achieves convergence after 200 training iterations. During the training convergence of the equipment agent, a small number of data points that do not align with the convergence trend are removed to provide a clearer visualization of the agent’s convergence. These data points do not affect the final convergence outcome.

5.5. Analysis of Single-Household Load Adjustment

Figure 12 and Figure 13 illustrates the energy consumption of primary household appliances for household user 1 before and after DR. In the model setup of this study, during the training process, users can customize penalty factors for adjusting appliance outputs based on their appliance preferences. The HEMS flexibly adjusts the output of household appliances according to user settings, thereby forming a decision model tailored to household needs.

6. Conclusions

To enhance the dispatch potential of users and maximize the operational benefits of the expanded VPP through competitive agent interactions, this paper proposes a novel two-layer user energy management strategy for the VPP based on multi-agent reinforcement learning. A novel two-layer optimization framework is established, where the upper layer is applied to coordinate the scheduling and benefit allocation among various stakeholders and the lower layer is applied to execute intelligent decision-making for users. An HG-multi-agent reinforcement-based method is applied to accelerate the game-solving process. To validate the effectiveness and superiority of the proposed strategy, simulation verification was conducted, leading to the following conclusions:

(1): Compared to the conventional framework for optimized operation, the proposed framework and strategy can increase VPP aggregator benefit by over 56.2%, reduce user energy costs by over 12.9%, and increase ESS benefit by over 7.8%.
(2): The proposed stepped DR mechanism helped the VPP reduce peak load by approximately 39.33%. Moreover, it safeguards users’ interests in cases of load adjustment deviations, increasing their willingness to respond to the power company’s policies actively.
(3): Compared to conventional methods, the HG-multi-agent reinforcement learning method can reduce calculation time by over 23.8% and lower user costs after DR compensation by over 8.8%.

Author Contributions

Validation, W.J. and H.Z.; formal analysis, T.L. and Q.X.; writing—original draft, S.T.; writing—review and editing, S.T., T.L. and Q.X.; supervision, Z.W., J.Q. and Q.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Science and Technology Project of the State Grid Corporation of China (5700-202418337A-2-1-ZX).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

Authors Zibo Wang and Ji Qiao were employed by the company China Electric Power Research Institute. Authors Hong Zhu and Wenlu Ji were employed by the company Nanjing Power Supply Company, State Grid Jiangsu Electric Power Co. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Pham, T.T.H.; Besanger, Y.; Hadjsaid, N. New challenges in power system restoration with large scale of dispersed generation insertion. IEEE Trans. Power Syst. 2009, 24, 398–406. [Google Scholar] [CrossRef]
Baringo, A.; Baringo, L.; Arroyo, J.M. Day-ahead self-scheduling of a virtual power plant in energy and reserve electricity markets under uncertainty. IEEE Trans. Power Syst. 2019, 34, 1881–1894. [Google Scholar] [CrossRef]
Tang, Y.; Zhai, Q.; Zhao, J. Multi-stage robust scheduling for community microgrid with energy storage. J. Mod. Power Syst. Clean Energy 2023, 11, 1923–1934. [Google Scholar] [CrossRef]
Pudjianto, D.; Ramsay, C.; Strbac, G. Microgrids and virtual power plants: Concepts to support the integration of distributed energy resources. Proc. Inst. Mech. Eng. Part A J. Power Energy 2008, 222, 731–741. [Google Scholar] [CrossRef]
Thavlov, A.; Bindner, H.W. Utilization of flexible demand in a virtual power plant set-up. IEEE Trans. Smart Grid 2015, 6, 640–647. [Google Scholar] [CrossRef]
Xu, J.; Xiao, Q.; Jia, H.; Mu, Y.; Jin, Y.; Lu, W.; Ma, S. A simplified control parameter optimisation method of the hybrid modular multilevel converter in the medium-voltage DC distribution network for improved stability under a weak AC system. IET Energy Syst. Integr. 2024, 6, 512–524. [Google Scholar] [CrossRef]
Liu, Y.; Liu, J.; Wang, T.; Zhang, F.; Zheng, J.; Yin, L.; Ban, M. High-frequency resonance suppression based on cross-coupled filter and improved passive damper for MMC-HVDC system. IEEE Trans. Power Delivery 2024, 39, 1952–1962. [Google Scholar] [CrossRef]
Kardakos, E.G.; Simoglou, C.K.; Bakirtzis, A.G. Optimal offering strategy of a virtual power plant: A stochastic bi-level approach. IEEE Trans. Smart Grid 2016, 7, 794–806. [Google Scholar] [CrossRef]
Zhao, Y.; Lin, Z.; Wen, F.; Ding, Y.; Hou, J.; Yang, L. Risk-constrained day-ahead scheduling for concentrating solar power plants with demand response using info-gap theory. IEEE Trans. Ind. Inform. 2019, 15, 5475–5488. [Google Scholar] [CrossRef]
Wen, L.; Zhou, K.; Feng, W.; Yang, S. Demand side management in smart grid: A dynamic-price-based demand response model. IEEE Trans. Eng. Manag. 2024, 71, 1439–1451. [Google Scholar] [CrossRef]
Vahedipour-Dahraie, M.; Rashidizadeh-Kermani, H.; Parente, M.; Shafie-Khah, M.; Siano, P. Investigating the impact of external demand response flexibility on the market power of strategic virtual power plant. IEEE Access 2022, 10, 84960–84969. [Google Scholar] [CrossRef]
Liang, H.; Ma, J. Data-driven resource planning for virtual power plant integrating demand response customer selection and storage. IEEE Trans. Ind. Inform. 2022, 18, 1833–1844. [Google Scholar] [CrossRef]
Vahedipour-Dahraie, M.; Rashidizadeh-Kermani, H.; Shafie-Khah, M.; Catalão, J.P.S. Risk-averse optimal energy and reserve scheduling for virtual power plants incorporating demand response programs. IEEE Trans. Smart Grid 2021, 12, 1405–1415. [Google Scholar] [CrossRef]
Gazijahani, F.S.; Salehi, J. IGDT-based complementarity approach for dealing with strategic decision making of price-maker VPP considering demand flexibility. IEEE Trans. Ind. Inform. 2020, 16, 2212–2220. [Google Scholar] [CrossRef]
Chen, W.; Qiu, J.; Zhao, J.; Chai, Q.; Dong, Z.Y. Customized rebate pricing mechanism for virtual power plants using a hierarchical game and reinforcement learning approach. IEEE Trans. Smart Grid 2023, 14, 424–439. [Google Scholar] [CrossRef]
Liu, X.; Li, S.; Zhu, J. Optimal coordination for multiple network-constrained VPPs via multi-agent deep reinforcement learning. IEEE Trans. Smart Grid 2023, 14, 3016–3031. [Google Scholar] [CrossRef]
Bui, V.; Hussain, A.; Kim, H. Double deep Q-learning-based distributed operation of battery energy storage system considering uncertainties. IEEE Trans. Smart Grid 2020, 11, 457–469. [Google Scholar] [CrossRef]
Wang, B.; Li, Y.; Ming, W.; Wang, S. Deep reinforcement learning method for demand response management of interruptible load. IEEE Trans. Smart Grid 2020, 11, 3146–3155. [Google Scholar] [CrossRef]
Xu, Z.; Guo, Y.; Sun, H. Competitive pricing game of virtual power plants: Models, strategies, and equilibria. IEEE Trans. Smart Grid 2022, 13, 4583–4595. [Google Scholar] [CrossRef]
Bahloul, M.; Breathnach, L.; Khadem, S. Residential Virtual power plant control: A novel hierarchical multi-objective optimization approach. IEEE Trans. Smart Grid 2025, 16, 1301–1313. [Google Scholar] [CrossRef]
Li, X.; Li, C.; Liu, X.; Chen, G.; Dong, Z.Y. Two-stage community energy trading under end-edge-cloud orchestration. IEEE Internet Things J. 2023, 10, 1961–1972. [Google Scholar] [CrossRef]
Wang, Y.; Wang, H.; Du, X. Optimized operation of multi-virtual power plant for energy sharing based on nash multi-stage robust. IEEE Access 2024, 12, 169805–169823. [Google Scholar] [CrossRef]
Wu, J.K.; Liu, Z.W.; Li, C.; Zhao, Y.; Chi, M. Coordinated operation strategy of virtual power plant based on two-layer game approach. IEEE Trans. Smart Grid 2025, 16, 554–567. [Google Scholar] [CrossRef]
Feng, C.; Zheng, K.; Zhou, Y.; Palensky, P.; Chen, Q. Update scheduling for ADMM-based energy sharing in virtual power plants considering massive prosumer access. IEEE Trans. Smart Grid 2023, 14, 3961–3975. [Google Scholar] [CrossRef]
Cappello, D.; Mylvaganam, T. Distributed differential games for control of multi-agent systems. IEEE Trans. Control Netw. Syst. 2022, 9, 635–646. [Google Scholar] [CrossRef]
Zhang, T.; Li, Y.; Yan, R.; Siada, A.A.; Guo, Y.; Liu, J.; Huo, R. A master-slave game optimization model for electric power companies considering virtual power plant. IEEE Access 2022, 10, 21812–21820. [Google Scholar] [CrossRef]
ElRahi, G.; Etesami, S.R.; Saad, W.; Mandayam, N.B.; Poor, H.V. Managing price uncertainty in prosumer-centric energy trading: A prospect-theoretic stackelberg game approach. IEEE Trans. Smart Grid 2019, 10, 702–713. [Google Scholar] [CrossRef]
Chouikhi, S.; Merghem-Boulahia, L.; Esseghir, M.; Snoussi, H. A game-theoretic multi-level energy demand management for smart buildings. IEEE Trans. Smart Grid 2019, 10, 6768–6781. [Google Scholar] [CrossRef]
Xu, X.; Jia, Y.; Xu, Y.; Xu, Z.; Chai, S.; Lai, C.S. A multi-agent reinforcement learning-based data-driven method for home energy management. IEEE Trans. Smart Grid 2020, 11, 3201–3211. [Google Scholar] [CrossRef]
Wang, Y.; Ai, X.; Tan, Z.; Yan, L.; Liu, S. Interactive dispatch modes and bidding strategy of multiple virtual power plants based on demand response and game theory. IEEE Trans. Smart Grid 2016, 7, 510–519. [Google Scholar] [CrossRef]

Figure 1. Two-layer multi-agent game framework for VPP. (EPS: electric power strip. REF: refrigerator. AC: air conditioner. CP: charging post. WM: washing machine. WH: water heater).

Figure 2. Comparison between the workflow of the conventional framework and the two-layer framework proposed in this paper.

Figure 3. Flowchart of the solving method.

Figure 4. Simulation scene with eight household users.

Figure 5. Household forecast loads and baseline loads.

Figure 6. Post-game aggregator electricity price curve.

Figure 7. Overall load deviation of users under the proposed method.

Figure 8. Detailed distribution effect charts of DR indicators. (From (a–h) represent household 1 to 8.).

Figure 9. Comparison results of different methods. (a) Computational time. (b) User costs after DR compensation.

Figure 10. Convergence curves for the HG-multi-agent reinforcement learning method.

Figure 11. Convergence curve of the multi-agent Q-learning model training for a single household user.

Figure 12. Energy consumption of major appliances with DR in a house.

Figure 13. Energy consumption of major appliances without DR in a house.

Table 1. Comparison scene.

Scene	Scene Description
1	No consideration of DR behaviors
2	Using conventional DR methods [30]
3	Using the proposed method, while EST is not involved in gaming
4	Using the proposed method in this paper

Table 2. Comparison result.

Scene	VPP Aggregator Benefit/USD	Household Benefit/USD	EST Benefit/USD	Total Benefit/USD
1	85.06	513.74	143.97	229.03
2	174.14	334.75	113.76	287.90
3	237.47	302.42	87.92	325.39
4	272.06	291.73	122.67	394.74

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tian, S.; Xiao, Q.; Li, T.; Wang, Z.; Qiao, J.; Zhu, H.; Ji, W. A Two-Layer User Energy Management Strategy for Virtual Power Plants Based on HG-Multi-Agent Reinforcement Learning. Appl. Sci. 2025, 15, 6713. https://doi.org/10.3390/app15126713

AMA Style

Tian S, Xiao Q, Li T, Wang Z, Qiao J, Zhu H, Ji W. A Two-Layer User Energy Management Strategy for Virtual Power Plants Based on HG-Multi-Agent Reinforcement Learning. Applied Sciences. 2025; 15(12):6713. https://doi.org/10.3390/app15126713

Chicago/Turabian Style

Tian, Sen, Qian Xiao, Tianxiang Li, Zibo Wang, Ji Qiao, Hong Zhu, and Wenlu Ji. 2025. "A Two-Layer User Energy Management Strategy for Virtual Power Plants Based on HG-Multi-Agent Reinforcement Learning" Applied Sciences 15, no. 12: 6713. https://doi.org/10.3390/app15126713

APA Style

Tian, S., Xiao, Q., Li, T., Wang, Z., Qiao, J., Zhu, H., & Ji, W. (2025). A Two-Layer User Energy Management Strategy for Virtual Power Plants Based on HG-Multi-Agent Reinforcement Learning. Applied Sciences, 15(12), 6713. https://doi.org/10.3390/app15126713

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Two-Layer User Energy Management Strategy for Virtual Power Plants Based on HG-Multi-Agent Reinforcement Learning

Abstract

1. Introduction

2. Two-Layer Optimization Framework for VPP

2.1. Operation Tasks of the Two-Layer Optimization Framework

2.2. Comparison of the Framework

3. Mathematical Model for the Proposed Framework

3.1. Mathematical Model for the Upper Layer

3.2. Mathematical Model for the Lower Layer

4. HG-Multi-Agent Reinforcement Learning

4.1. Flowchart of the Solving Method

4.2. Hierarchical Game Algorithm

4.3. Multi-Agent Q-Learning Algorithm

5. Evaluation Results

5.1. Case Study Setting

5.2. Analysis of Game Results

5.3. Comparative Analysis of the HG-Multi-Agent Reinforcement Learning Method

5.4. Convergence Analysis

5.5. Analysis of Single-Household Load Adjustment

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI