Article

An Intelligent Two-Stage Dispatch Framework for Cost and Carbon Reduction in Multi-Energy Virtual Power Plants

1 Faculty of Multimedia, Enforcement & Computing, Universiti Geomatika Malaysia, Kuala Lumpur 54200, Malaysia
2 State Grid Jiangxi Electric Power Co., Ltd., Nanchang 330023, China
3 School of Economic Management and Law, Jiangxi Science and Technology Normal University, Nanchang 330013, China
* Author to whom correspondence should be addressed.
Processes 2026, 14(5), 743; https://doi.org/10.3390/pr14050743
Submission received: 25 November 2025 / Revised: 22 January 2026 / Accepted: 9 February 2026 / Published: 25 February 2026
(This article belongs to the Section AI-Enabled Process Engineering)

Abstract

To address the challenge of coordinating economic and environmental objectives for Multi-energy Virtual Power Plants (MEVPPs), particularly under ambitious decarbonization policies such as China’s “dual carbon” goals, this paper proposes a novel two-stage scheduling framework that integrates Deep Reinforcement Learning (DRL) with Model Predictive Control (MPC). The core innovations include the following: (1) high-fidelity physical models capturing wind turbulence correction, photovoltaic temperature-irradiation coupling, and state-of-charge-dependent energy storage efficiency, improving equipment dynamic characterization accuracy by 12.7% compared to conventional models; (2) an enhanced Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm incorporating priority experience replay and adaptive noise exploration, which accelerates convergence by 15.6%; (3) a pioneering coordination architecture of “Day-Ahead MADDPG—Real-Time MPC” that manages uncertainties through bidirectional feedback, where real-time deviations refine the long-term policy via experience replay. Simulation results using historical data from a North China industrial park demonstrate that the framework reduces operating costs by 13.3% and carbon emissions by 17.7% compared to particle swarm optimization, outperforms standard DDPG with 3.2% lower operating costs, 5.8% lower carbon emissions, and a 3.3% higher renewable utilization rate (88.6%), and achieves 55% renewable penetration with only 4.1% curtailment. These results validate the framework’s scalability for high-renewable penetration grids and its real-time feasibility, as confirmed by edge computing deployment with latency below 50 ms. This study offers a technically viable and scalable solution for the operation of low-carbon virtual power plants (VPPs), supporting the transition towards sustainable power systems.

1. Introduction

Driven by global decarbonization strategies, such as China’s “dual carbon” goals (carbon peak by 2030 and carbon neutrality by 2060), the energy structure is transitioning to low-carbon at an accelerated pace. According to the International Renewable Energy Agency (IRENA), global solar installed capacity grew by approximately 53% from 2017 to 2021 [1]. Similarly, the European Union’s revised energy efficiency directive aims to increase the share of renewable energy in final energy consumption from 17% in 2015 to 32% by 2030 [2]. The International Energy Agency (IEA), in its Net Zero by 2050 roadmap, emphasizes enhancing energy management in buildings as a key pathway [3]. As an efficient aggregation and management model for distributed energy resources (DERs), the VPP coordinates diverse entities like wind, photovoltaic (PV), and energy storage systems to achieve optimal energy allocation and economic operation, making it a pivotal technology for building future power systems (Mohammadi et al. 2021) [4]. Concurrently, stringent environmental targets necessitate that VPP operations balance economic and environmental performance, urgently calling for multi-objective optimal-scheduling models that incorporate low-carbon factors.
Deep Reinforcement Learning (DRL), which combines the perceptual strengths of deep learning with the decision-making framework of reinforcement learning, approximates value or policy functions through deep neural networks. Its distinct advantage lies in learning optimal policies through direct interaction with the environment, without reliance on precise system models or perfect forecasts [5]. This makes DRL particularly suitable for high-dimensional, continuous state-action spaces with complex constraints, enabling adaptation to real-time uncertainties. Compared with traditional optimization methods like particle swarm optimization (PSO) or genetic algorithms (GAs), DRL’s online learning and adaptive capabilities offer novel solutions to the complex scheduling problems of VPPs, including the coupled challenges of combined heat and power systems (Stefan et al. 2021) [6].
This paper addresses the optimal scheduling problem in multi-energy VPPs by proposing a two-stage DRL-based strategy. The objective is to achieve synergistic optimization of operating costs, environmental costs (carbon emissions), and renewable energy utilization rates, providing technical support for the high-quality, low-carbon development of energy and power systems. The algorithm demonstrates adaptive learning capabilities in complex, dynamic scenarios. Future work aims to validate and apply the framework in collaboration with industry partners, focusing on typical engineering scenarios such as 100-megawatt distributed energy integration and regional grids with high renewable penetration, thereby promoting the transformation of theoretical research into practical applications.

2. Review of Related Work

The optimal dispatch of multi-energy virtual power plants (VPPs) is challenged by the need to coordinate heterogeneous energy resources, reconcile economic and environmental objectives, and mitigate uncertainties from renewable generation and load fluctuations. Existing methodologies can be broadly categorized, each presenting limitations when scaling to scenarios driven by stringent decarbonization targets.

2.1. Stochastic Optimization and Game-Theoretic Approaches

Model-driven frameworks have dominated early research on VPP scheduling. Stochastic optimization addresses uncertainty through probabilistic scenarios. For instance, Jadidoleslam et al. (2025) proposed a risk-constrained model for VPP participation in day-ahead markets using chance-constrained programming, reducing wind/solar curtailment by 11.2% in IEEE 33-node tests [7]. However, such models often simplify flexible loads as static demands, ignoring the real-time decision dynamics of electric vehicles (EVs) and portable loads, a significant gap given EVs’ growing penetration (18–24%) in modern grids [8]. Game-theoretic models handle multi-entity conflicts via hierarchical structures. Shui et al. (2024) designed a master-slave Stackelberg game between VPP operators and distributed energy resource suppliers, achieving a 7.3% cost reduction [9]. Yet these models typically assume perfect information sharing, neglecting communication delays that can cause a 4–9% loss in profit in practical deployments (Xue et al. 2024) [10]. Crucially, both paradigms exhibit weaknesses: (i) computational intractability: scenario-based methods suffer from the “curse of dimensionality” (solution time scaling with $N^{1.5}$ for $N$ devices) (Naughton et al. 2020) [11]; (ii) model fidelity trade-offs: physical constraints (e.g., gas turbine ramping, battery degradation) are often linearized, leading to 5–8% dispatch deviations (Zhao et al. 2022) [12]; (iii) limited multi-objective synergy: carbon emission constraints are frequently treated as post-processed penalties rather than core optimization objectives (Han et al. 2022) [13].

2.2. Deep Reinforcement Learning Breakthroughs

Data-driven approaches, particularly Deep Reinforcement Learning (DRL), have emerged to overcome model limitations. DRL excels in high-dimensional state-action spaces, enabling adaptive VPP control.
Single-agent frameworks: Lin et al. (2020) reduced VPPs’ operating costs by 12.1% using Deep Deterministic Policy Gradient (DDPG) with edge computing but simplified equipment dynamic constraints [14]. Wei et al. (2025) integrated Lagrangian relaxation with Soft Actor-Critic (SAC) for robust security-constrained optimal power flow (OPF) but required over 350 training episodes for convergence [15].
Multi-agent systems (MAS): The Multi-Agent DDPG (MADDPG) framework has demonstrated effectiveness in managing complex interactive systems, including multi-UAV operations (Yang et al. 2024) [16]. In the energy domain, Li et al. (2025) pioneered MADDPG for VPP-microgrid coordination, boosting profits by 9.7% through hierarchical energy management [17]. For EV-integrated VPPs, Wang et al. (2022) combined SAC and TD3 to optimize charging/discharging strategies, enhancing renewable energy consumption by 8.3% [18]. Techniques like Prioritized Experience Replay (PER) have shown promise in improving sample efficiency in multi-agent settings (Guo et al. 2023) [19].
Handling Partial Observability with Centralized Critics and Stacked Historical States: A known challenge in multi-agent VPP settings is partial observability, where individual agents (e.g., distributed energy resources) lack full system-wide state information, leading to increased training variance and suboptimal coordination. To mitigate this, recent advancements have integrated centralized critics with stacked historical states into frameworks like MADDPG. By incorporating a sequence of past observations into the critic’s input, the agent can infer latent system dynamics and improve policy robustness under incomplete information. This approach has been shown to reduce training variance by up to 30% in partially observable energy systems (Xue et al., 2024) [10]. In our framework, the centralized critic utilizes a stacked state buffer of the previous four time steps, enabling more stable and context-aware policy updates without requiring full decentralization of the critic.
Despite progress, critical gaps persist, as summarized in Table 1:
Specific unresolved issues include: (i) inadequate multi-agent coordination: centralized critics in MADDPG struggle with partial observability in large VPPs, increasing training variance by 25–40% (Xue et al. 2024) [10]; (ii) static exploration strategies: fixed noise fails to balance exploration and exploitation in non-stationary environments, prolonging convergence (Pei et al. 2024) [21]; (iii) simplified equipment models: constant-efficiency assumptions for storage neglect SOC-dependent losses, causing state-of-charge errors (Tang et al. 2025) [22].

2.3. Research Gaps and Our Contributions

Synthesis of Section 2.1 and Section 2.2 reveals three unresolved challenges for DRL-based VPP scheduling under decarbonization targets: (1) high-fidelity equipment dynamics (e.g., wind turbulence, PV temperature-irradiation coupling) are rarely integrated with DRL, degrading scheduling realism (Tang et al. 2025) [22]; (2) multi-objective convergence inefficiency: conflicting economic-environmental goals strain DRL optimization, causing reward sparsity and 18–22% slower convergence in MADDPG (Wei et al. 2025 [15]; Wang et al. 2022 [18]); (3) real-time adaptability deficit: day-ahead DRL plans lack mechanisms to correct rolling prediction errors under high renewable penetration (>40%) (Xue et al. 2024 [10]; Yan et al. 2022 [23]).
This study bridges these gaps through three core contributions identified above: (1) integration of turbulence-corrected wind power, temperature-coupled PV, and SOC-dependent battery efficiency models to capture nonlinear dynamics; (2) an enhanced MADDPG algorithm with priority experience replay and adaptive noise to accelerate convergence by >15.6%; (3) a novel “Day-ahead MADDPG + Real-time MPC” hierarchical architecture that decouples long-term policy optimization from short-term uncertainty absorption via a bidirectional feedback loop.
While MATD3 offers improved value estimation, its slower convergence in coupled energy environments makes it less suitable for real-time VPP scheduling. MADDPG, enhanced with the above mechanisms, provides a better trade-off between training efficiency and policy stability in partially observable, multi-objective settings.

2.4. Distinction from Existing DRL-MPC Frameworks

While hybrid DRL-MPC frameworks have been explored, our approach introduces key distinctions, as detailed in Table 2.
The novelty of this work lies not in proposing individual components de novo, but in their holistic integration into a cohesive framework specifically designed for the economic-environmental dispatch of multi-energy VPPs under uncertainty, validated with high-fidelity models and a realistic case study.

3. Modeling the VPPs System

3.1. Physical Model of Energy Resources

VPPs form a controllable entity by aggregating multiple distributed energy resources; the physical models of the core equipment are as follows:
① Dynamic correction model of wind power generation
The nonlinear relationship between the output power of a wind turbine and wind speed can be described by piecewise functions combined with wind speed volatility correction terms, as follows:
$$P_w = \begin{cases} 0, & v < v_{in} \ \text{or} \ v \ge v_{out} \\ P_r \cdot \beta_v \cdot \left(\dfrac{v - v_{in}}{v_r - v_{in}}\right)^3, & v_{in} \le v < v_r \\ P_r, & v_r \le v < v_{out} \end{cases}$$
Here, the correction factor $\beta_v$ characterizes the effect of wind speed fluctuations on power:
$$\beta_v = 1 - k_v \cdot \left(\frac{\sigma_v}{\sigma_{v,max}}\right)^2$$
In the formula: $P_r$ is the rated power (1.5 MW); $v_{in} = 3\ \mathrm{m/s}$, $v_r = 12\ \mathrm{m/s}$, and $v_{out} = 25\ \mathrm{m/s}$ are, respectively, the cut-in, rated, and cut-out wind speeds; $\sigma_v = \Delta v / \Delta t$ represents wind speed volatility, where $\Delta v$ is the change in wind speed within a period and $\Delta t$ the scheduling time interval; $k_v = 0.2$ is the correction coefficient; and $\sigma_{v,max} = 5\ \mathrm{m/s \cdot h^{-1}}$ is the maximum allowable volatility (Manwell et al. 2010) [25].
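As an illustration, the piecewise curve and volatility correction above can be sketched in Python (a minimal sketch using the parameter values from the text; the function names are ours):

```python
# Parameters from Section 3.1: rated power (MW) and cut-in/rated/cut-out speeds (m/s)
P_R, V_IN, V_R, V_OUT = 1.5, 3.0, 12.0, 25.0
K_V, SIGMA_V_MAX = 0.2, 5.0  # correction coefficient, maximum allowable volatility

def beta_v(sigma_v: float) -> float:
    """Volatility correction factor: beta_v = 1 - k_v * (sigma_v / sigma_v_max)^2."""
    return 1.0 - K_V * (sigma_v / SIGMA_V_MAX) ** 2

def wind_power(v: float, sigma_v: float = 0.0) -> float:
    """Piecewise turbine output (MW), with the turbulence correction applied
    in the cubic region between cut-in and rated speed."""
    if v < V_IN or v >= V_OUT:
        return 0.0
    if v < V_R:
        return P_R * beta_v(sigma_v) * ((v - V_IN) / (V_R - V_IN)) ** 3
    return P_R

print(wind_power(12.0))      # rated region: 1.5
print(wind_power(7.5, 2.5))  # mid-curve output reduced by volatility
```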
② Photovoltaic temperature-irradiation coupling model
The output power of photovoltaic cells is affected by the coupling effect of solar radiation intensity and cell temperature, and the mathematical model is as follows:
$$P_{pv} = \eta_{pv}(T_c) \cdot A_{pv} \cdot H_t$$
Among them, the coupling relationship between the cell temperature $T_c$ and the solar radiation intensity $H_t$ is:
$$T_c = T_a + \gamma_h \cdot H_t$$
The efficiency-temperature nonlinear function is as follows:
$$\eta_{pv}(T_c) = \eta_{pv,ref} \cdot e^{-\alpha_T \cdot (T_c - T_{ref})}$$
In the formula: $A_{pv} = 11{,}111\ \mathrm{m^2}$ (corresponding to 2 MWp installed capacity); $T_c$ is the cell temperature (°C); $\gamma_h = 0.02\ \mathrm{°C \cdot (kW/m^2)^{-1}}$ is the thermal accumulation coefficient; $\eta_{pv,ref} = 18\%$ is the reference efficiency at 25 °C; and $\alpha_T = 0.005\ \mathrm{°C^{-1}}$ is the temperature coefficient, consistent with typical values for crystalline silicon modules (Skoplaki et al. 2008) [26].
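A minimal Python sketch of the temperature-irradiation coupling follows, assuming irradiance $H_t$ is given in W/m² so that the thermal accumulation coefficient is applied per W/m² (an interpretation of the units above, which would otherwise yield a negligible temperature rise):

```python
import math

A_PV = 11_111.0   # module area (m^2), ~2 MWp installed capacity
ETA_REF = 0.18    # reference efficiency at T_ref = 25 C
ALPHA_T = 0.005   # temperature coefficient (1/C)
GAMMA_H = 0.02    # thermal accumulation coefficient (C per W/m^2, assumed unit)
T_REF = 25.0

def pv_power(T_a: float, H_t: float) -> float:
    """PV output (MW) from ambient temperature T_a (C) and irradiance H_t (W/m^2)."""
    T_c = T_a + GAMMA_H * H_t                            # cell-temperature coupling
    eta = ETA_REF * math.exp(-ALPHA_T * (T_c - T_REF))   # efficiency decays with T_c
    return eta * A_PV * H_t / 1e6                        # W -> MW

# At 25 C ambient and 1000 W/m^2: T_c = 45 C, eta ~ 0.163, output ~ 1.81 MW
print(round(pv_power(25.0, 1000.0), 3))
```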
③ Energy storage variable efficiency charge–discharge model
The charging power $P_{b,ch}(t) \le 0$ (negative) and discharging power $P_{b,dis}(t) \ge 0$ (positive) of the battery energy storage system are constrained by the state of charge (SOC) as follows:
$$\begin{cases} P_{b,ch}^{min} \le P_{b,ch}(t) \le 0 \\ 0 \le P_{b,dis}(t) \le P_{b,dis}^{max} \\ SOC_{min} \le SOC(t) \le SOC_{max} \\ SOC(t) = SOC(t-1) - \dfrac{\eta_{ch} P_{b,ch}(t) \Delta t}{E_b} - \dfrac{P_{b,dis}(t) \Delta t}{\eta_{dis} E_b} \end{cases}$$
where the relationship between the variable efficiency function η b ( S O C ) and SOC is as follows:
$$\eta_b(SOC) = \eta_{b,nom} \cdot \left[1 - \delta_b \cdot \left(\frac{SOC - SOC_{opt}}{SOC_{max} - SOC_{min}}\right)^2\right]$$
In the formula: $P_{b,ch}^{min} = -2\ \mathrm{MW}$ (charging); $P_{b,dis}^{max} = 2\ \mathrm{MW}$ (discharging); $SOC_{min} = 0.2$; $SOC_{max} = 0.9$; $\eta_{b,nom} = 0.85$ is the rated efficiency; $\delta_b = 0.15$ is the efficiency decay coefficient; $SOC_{opt} = 0.5$ is the optimal operating point; $E_b = 2\ \mathrm{MWh}$ is the storage capacity; and $\eta_{ch} = 0.90$, $\eta_{dis} = 0.85$.
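The SOC-dependent efficiency and the one-step SOC update (with charging negative and discharging positive, as defined above) can be sketched as follows; the clipping of SOC to its bounds is our illustrative safeguard:

```python
# Storage parameters from Section 3.1
SOC_MIN, SOC_MAX, SOC_OPT = 0.2, 0.9, 0.5
ETA_NOM, DELTA_B = 0.85, 0.15
E_B = 2.0                 # storage capacity (MWh)
ETA_CH, ETA_DIS = 0.90, 0.85

def eta_soc(soc: float) -> float:
    """SOC-dependent efficiency eta_b(SOC), peaking at the optimal operating point."""
    return ETA_NOM * (1.0 - DELTA_B * ((soc - SOC_OPT) / (SOC_MAX - SOC_MIN)) ** 2)

def soc_update(soc: float, p_ch: float, p_dis: float, dt: float = 1.0) -> float:
    """One-step SOC update; p_ch <= 0 (charging), p_dis >= 0 (discharging), in MW."""
    soc_next = soc - ETA_CH * p_ch * dt / E_B - p_dis * dt / (ETA_DIS * E_B)
    return min(max(soc_next, SOC_MIN), SOC_MAX)   # enforce SOC bounds

print(round(eta_soc(0.5), 3))                 # peak efficiency: 0.85
print(round(soc_update(0.5, -1.0, 0.0), 3))   # charge at 1 MW for 1 h (hits SOC_max)
```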
④ Gas turbine multi-stage ramping model
The output power of a gas turbine is constrained by multi-stage ramp rates, expressed mathematically as follows:
$$\begin{cases} P_{gt,min} \le P_{gt}(t) \le P_{gt,max} \\ P_{gt}(t) - P_{gt}(t-1) \le R_{up}(P_{gt}(t-1)) \\ P_{gt}(t-1) - P_{gt}(t) \le R_{down}(P_{gt}(t-1)) \end{cases}$$
Among them, the ramp rate depends on the load section:
Low load section ($P_{gt} < 2\ \mathrm{MW}$): $R_{up,1} = 0.5\ \mathrm{MW/h}$, $R_{down,1} = 0.4\ \mathrm{MW/h}$;
Medium load section ($2\ \mathrm{MW} \le P_{gt} < 4\ \mathrm{MW}$): $R_{up,2} = 1\ \mathrm{MW/h}$, $R_{down,2} = 0.8\ \mathrm{MW/h}$;
High load section ($P_{gt} \ge 4\ \mathrm{MW}$): $R_{up,3} = 0.7\ \mathrm{MW/h}$, $R_{down,3} = 0.6\ \mathrm{MW/h}$.
In the formula, $P_{gt,min} = 1\ \mathrm{MW}$ and $P_{gt,max} = 5\ \mathrm{MW}$ are the power boundaries.
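A sketch of the load-section-dependent ramp limits, projecting a requested setpoint onto the feasible interval (the projection step is our illustrative addition, not part of the source model):

```python
def ramp_limits(p_prev: float) -> tuple[float, float]:
    """Load-section-dependent (R_up, R_down) ramp rates (MW/h) for the gas turbine."""
    if p_prev < 2.0:
        return 0.5, 0.4   # low-load section
    if p_prev < 4.0:
        return 1.0, 0.8   # medium-load section
    return 0.7, 0.6       # high-load section

def clip_ramp(p_prev: float, p_target: float) -> float:
    """Project a requested setpoint onto the feasible ramp interval and power bounds."""
    r_up, r_down = ramp_limits(p_prev)
    p = min(max(p_target, p_prev - r_down), p_prev + r_up)
    return min(max(p, 1.0), 5.0)   # P_gt in [1, 5] MW

print(clip_ramp(3.0, 5.0))   # medium-load ramp-up limit: 3.0 + 1.0 = 4.0
```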
⑤ Gas turbine combined heat and power (CHP) model
A typical gas turbine thermoelectric ratio ranges from 0.5 to 0.7:
$$P_{gt,heat}(t) = c_{hr} \cdot P_{gt}(t)$$
In the formula, $P_{gt,heat}(t)$ represents the thermal power output at time $t$ (MWth), and $c_{hr} = 0.6$ is the thermoelectric ratio (thermal power/electrical power) [27].
⑥ Dynamic model of the heat storage device
$$Q_{th}(t) = Q_{th}(t-1) + \eta_{ch} P_{ch}(t) \Delta t - \frac{P_{dis}(t) \Delta t}{\eta_{dis}}$$
Constraints:
$$\begin{cases} 0 \le Q_{th}(t) \le 3\ \mathrm{MWh_{th}} \\ 0 \le P_{ch}(t) \le 1.5\ \mathrm{MW_{th}} \\ 0 \le P_{dis}(t) \le 1.5\ \mathrm{MW_{th}} \end{cases}$$
In the formula: $Q_{th}(t)$ represents the heat storage capacity at time $t$ (MWhth); $Q_{th}(t-1)$ that at the previous moment (MWhth); $\eta_{ch} = 0.95$ is the charging efficiency; $P_{ch}(t)$ the charging power (MWth); $P_{dis}(t)$ the discharging power (MWth); $\Delta t$ the time step (h); and $\eta_{dis} = 0.95$ the discharging efficiency.
Based on the above innovative physical model and the operating rules of the equipment, VPPs can optimize the scheduling of energy resources to participate in electricity market transactions. It can ensure a reliable power supply while achieving multi-objective, synergistic optimization to minimize operating costs, maximize renewable energy consumption, and minimize carbon emissions. The DRL algorithm, with its adaptive learning capabilities in high-dimensional, complex scenarios, provides an effective tool for solving multi-period scheduling problems with dynamic constraints.

3.2. System Architecture and Parameters

The MEVPP system architecture constructed in this paper is shown in Figure 1, which includes wind power (1.5 MW), photovoltaic power (2 MWp), a gas turbine (5 MW), battery energy storage (2 MWh), and a controllable load. The key parameters of each device are shown in Table 3. The carbon emission coefficient for the gas turbine is 0.35 kg/MWh, while emissions from wind, PV, and storage are considered negligible (Lokupitiya et al. 2006) [28]. The parameters are based on common models and typical values in the energy field, forming a standardized yet realistic scenario for algorithm verification.

3.3. Operational Constraints

When studying the DRL-based MEVPP optimization scheduling algorithm, achieving the low-carbon goal requires constructing a multi-objective, two-stage scheduling model for VPPs and clarifying the various constraints to ensure the feasibility and effectiveness of the scheduling strategy. In practical applications, VPP scheduling requires not only a stable power supply but also careful consideration of carbon emissions, economy, and other relevant environmental indicators.
① Power balance constraints:
$$\sum_{i=1}^{n} P_{Gi}(t) = \sum_{j=1}^{m} P_{Dj}(t), \quad \forall t$$
In the formula, $P_{Gi}$ represents the output power of the $i$-th power generation device; $P_{Dj}$ the power demand of the $j$-th load; $t$ the time variable; and $n$ and $m$ the numbers of generation devices and loads. This ensures that the total power of all generation devices is balanced in real time with the total load.
② Equipment output boundary constraints
$$P_{G,i,min} \le P_{Gi}(t) \le P_{G,i,max}, \quad \forall i, t$$
In the formula, $P_{G,i,min}$ and $P_{G,i,max}$ represent the minimum and maximum output (MW) of the $i$-th power generation device. This limits each device to its safe operating range, avoiding damage or reduced efficiency from overload or underload.
③ Carbon emission constraints
$$\sum_{t=1}^{24} \sum_{i=1}^{n} E_{G,i}(t) \le E_{cap} = 1500\ \mathrm{kg}$$
In the formula, $E_{G,i}(t) = \alpha_{G,i} \cdot P_{G,i}(t)$ is the carbon emission of device $i$ at time $t$ (kg), and $\alpha_{G,i}$ is the carbon emission coefficient of the $i$-th device (kg/MWh).
④ Ramp rate constraint
$$|P_{G,i}(t) - P_{G,i}(t-1)| \le R_{G,i,max}, \quad \forall i, t$$
In the formula, $|P_{G,i}(t) - P_{G,i}(t-1)|$ is the absolute change in power, and $R_{G,i,max}$ is the maximum ramp rate of the $i$-th device. This limits the rate of power change to avoid mechanical damage from abrupt fluctuations.
⑤ Cost function constraints for generator sets
$$C_{G,i}(t) = a_{G,i} P_{G,i}^2(t) + b_{G,i} P_{G,i}(t) + c_{G,i}$$
In the formula, $C_{G,i}(t)$ represents the operating cost of the $i$-th generator set at time $t$; $a_{G,i}$, $b_{G,i}$, and $c_{G,i}$ are the cost coefficients. By optimizing these cost functions, the optimal generation combination can be identified to maximize economic benefit.
⑥ Thermal power balance constraints
$$P_{gt,heat}(t) + P_{b,heat}(t) = D_{heat}(t), \quad \forall t$$
In the formula, $P_{gt,heat}(t)$ represents the thermal power output of the gas turbine (MWth), $P_{b,heat}(t)$ the thermal power output of the heat storage unit (MWth), and $D_{heat}(t)$ the thermal load demand (MWth) at time $t$; the constraint holds at all times $t$.
The fine-grained physical model and complex operational constraints collectively define the high-dimensional, continuous, nonlinear environmental state space and the strictly restricted action space that DRL agents must interact with, while designing their multi-objective reward function is also highly challenging. This complexity is the core reason that drives us to adopt the DRL approach.

4. DRL Optimization Scheduling Algorithm

4.1. Overview of the Two-Stage Scheduling Framework

This paper presents a two-stage scheduling framework of ‘upper-level day-ahead plan generation and lower-level real-time rolling optimization’. The upper layer employs an improved MADDPG algorithm to formulate the day-ahead scheduling policy. During training, historical or forecast data can be used to simulate the environment and accelerate learning, but the core of the DRL paradigm remains the agent learning a robust policy by interacting with a simulated or real environment, observing the consequences of its actions (states and rewards), and iteratively improving its decisions. The trained policy is then deployed for day-ahead planning. Crucially, the agent does not become dependent on forecast data: forecasts serve as supplementary context rather than deterministic inputs, and the agent learns to decide based on the observed state, which may include the deviation between forecast and reality, thereby enhancing robustness. The bidirectional feedback mechanism between MPC and MADDPG further reinforces this interactive learning paradigm, as real-time deviations refine the long-term policy through experience replay. The resulting hierarchical ‘Day-Ahead Planning—Real-Time Rolling Optimization’ structure is illustrated in Figure 2.
① Upper layer: Day-ahead plan generation
$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$
In the formula, $\gamma = 0.99$ is the discount factor and $T = 24$ is the scheduling horizon; the agent processes wind and solar prediction data from the NSRDB dataset.
The lower layer, based on model predictive control (MPC), solves a finite-horizon optimization problem on an hourly basis, using the latest measurements (wind, solar, and load) and short-term forecasts within a rolling optimization horizon (H = 4 h) to adjust the upper layer’s day-ahead plan in real time against prediction errors and load fluctuations.
② Lower layer: Real-time rolling optimization
This optimization adjusts the day-ahead plan from the upper-layer MADDPG to handle real-time prediction errors and load fluctuations. The core optimization problem is defined as follows:
$$\min \sum_{k=0}^{H-1} \left[C_{op}(t+k) + \lambda \cdot E_{carbon}(t+k)\right]$$
where
  • H = 4 is the prediction and control horizon.
  • Cop(t + k) is the operating cost at time t + k, comprising gas turbine fuel costs, equipment maintenance costs, and costs/revenues from grid interaction.
  • Ecarbon(t + k) is the carbon emissions at time t + k.
  • λ is a weighting coefficient for carbon emissions, linking environmental cost to the economic objective.
Rolling over this H = 4 h horizon, the lower layer responds to load fluctuations and prediction errors and dynamically adjusts the day-ahead plan, achieving “day-ahead, real-time” collaborative optimization.
The key synergy mechanism is that the optimization results of the lower-level MPC (particularly the deviation between the actually executed actions and the MPC’s recommended actions) are fed back into the experience replay pool of the upper-level MADDPG for online updating of its long-term policy. The optimization is solved using the Interior Point Optimizer (IPOPT) (Wächter et al. 2006) [29], which is warm-started with the previous solution to improve computational efficiency.
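The plan-correct-feedback loop described above can be outlined structurally in Python; `day_ahead_plan` and `mpc_adjust` below are illustrative stubs of ours, not the paper’s actual MADDPG policy or IPOPT-based MPC:

```python
from collections import deque

# Shared experience pool (illustrative stand-in for the MADDPG replay buffer)
replay_buffer = deque(maxlen=100_000)

def day_ahead_plan(forecast_load: float) -> list[float]:
    """Stub for the trained MADDPG policy: returns 24 hourly setpoints (MW)."""
    return [forecast_load] * 24

def mpc_adjust(planned: list[float], measured: float, horizon: int = 4) -> list[float]:
    """Stub for the rolling MPC correction over the next `horizon` steps;
    here a simple proportional tracking of the measured deviation."""
    return [p + 0.5 * (measured - p) for p in planned[:horizon]]

# One scheduling day: hourly rolling correction; deviations feed back into replay
plan = day_ahead_plan(2.0)
for t in range(24):
    measured = 2.0 + 0.3 * ((-1) ** t)        # synthetic measurement deviation
    adjusted = mpc_adjust(plan[t:], measured)
    deviation = adjusted[0] - plan[t]          # executed vs. planned action
    replay_buffer.append((t, plan[t], adjusted[0], deviation))

print(len(replay_buffer))  # 24 feedback transitions stored for policy refinement
```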

4.2. Algorithm Improvements and Optimizations

① Prioritized experience replay
Samples are weighted by their temporal-difference (TD) error $\delta_i$ to improve learning efficiency on key samples:
$$w_i = \left(\frac{|\delta_i|}{\delta_{max}}\right)^{\alpha} + \epsilon$$
In the formula, $\alpha = 0.6$ is the priority exponent and $\epsilon = 0.001$ is a smoothing constant.
The sampling probability is as follows:
$$p_i = \frac{w_i^{\beta}}{\sum_j w_j^{\beta}}$$
In the formula, $\beta = 0.4$ is the importance-sampling correction coefficient, which is gradually annealed to 1 over training to reduce bias.
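Following the two formulas above (note they differ slightly from the standard PER formulation, which applies $\beta$ only to the importance-sampling weights), a minimal sampling sketch:

```python
import random

ALPHA, EPS = 0.6, 1e-3
BETA = 0.4   # importance-sampling correction, annealed toward 1 during training

def priorities(td_errors):
    """w_i = (|delta_i| / delta_max)^alpha + eps, as in the text."""
    d_max = max(abs(d) for d in td_errors) or 1.0   # guard against all-zero errors
    return [(abs(d) / d_max) ** ALPHA + EPS for d in td_errors]

def sampling_probs(weights, beta=BETA):
    """p_i = w_i^beta / sum_j w_j^beta."""
    powered = [w ** beta for w in weights]
    total = sum(powered)
    return [p / total for p in powered]

def sample_batch(buffer, td_errors, k=2, seed=0):
    """Draw k transitions with probability proportional to their priority."""
    probs = sampling_probs(priorities(td_errors))
    random.seed(seed)
    return random.choices(buffer, weights=probs, k=k)

buffer = ["s0", "s1", "s2", "s3"]
probs = sampling_probs(priorities([0.1, 2.0, 0.5, 0.05]))
assert abs(sum(probs) - 1.0) < 1e-9
print(max(range(4), key=lambda i: probs[i]))  # transition with the largest TD error: 1
```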
② Adaptive noise exploration strategy
Noise attenuation model:
$$\sigma(t) = \sigma_{max} \cdot e^{-k t} + \sigma_{min}, \quad k = 0.001$$
In the formula, $\sigma_{max} = 0.3$ is the initial noise intensity and $\sigma_{min} = 0.05$ the final noise intensity. In the early stage of training, larger Gaussian noise $\mathcal{N}(0, \sigma_{max})$ is injected to explore new strategies; in the later stage, it is attenuated toward $\mathcal{N}(0, \sigma_{min})$ to fine-tune the policy, balancing exploration and exploitation.
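The decay schedule can be written directly (a minimal sketch using the constants above; `explore` is our illustrative helper):

```python
import math
import random

SIGMA_MAX, SIGMA_MIN, K = 0.3, 0.05, 0.001

def noise_std(episode: int) -> float:
    """sigma(t) = sigma_max * exp(-k t) + sigma_min: decaying exploration noise."""
    return SIGMA_MAX * math.exp(-K * episode) + SIGMA_MIN

def explore(action: float, episode: int, rng: random.Random) -> float:
    """Add zero-mean Gaussian exploration noise with the episode-dependent std."""
    return action + rng.gauss(0.0, noise_std(episode))

print(round(noise_std(0), 3))      # early training: 0.35
print(round(noise_std(2000), 3))   # late training: ~0.091
```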
③ Network architecture and optimizer
Actor network: 3 fully connected layers (128 neurons each); input: state $s_t$; output: deterministic policy $\mu(s_t; \theta)$; ReLU activations;
Critic network: 2 fully connected layers (256 neurons each); input: state-action pair $(s_t, a_t)$; output: action-value function $Q(s_t, a_t)$; LeakyReLU activations;
Optimizer: improved Adadelta algorithm, with the parameter update formulas:
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2$$
$$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t$$
In the formula, $\rho = 0.95$ is the forgetting factor and $\epsilon = 10^{-6}$ a smoothing constant.
④ Computational Overhead Analysis
The introduced algorithmic enhancements incur modest additional computational costs, which are outweighed by the gains in learning efficiency. The Prioritized Experience Replay (PER) mechanism requires maintaining a priority queue and computing sampling probabilities based on TD errors, increasing per-step computation by approximately 10–15% compared to uniform replay. However, this is offset by a 15.6% reduction in required training episodes, yielding a net decrease in total training time. The adaptive noise exploration strategy adds negligible overhead, involving only a simple exponential-decay update per episode. In deployment, both components operate within the edge-computing latency budget (<50 ms), ensuring real-time feasibility.

4.3. Algorithm Framework Design

The improved Deep Deterministic Policy Gradient (DDPG) algorithm, built on the Actor-Critic framework to handle scheduling optimization over continuous action spaces, is adapted to the high-dimensional states and complex constraints of MEVPPs.
① Definition of state space
State vector
$$s_t = \{P_{w,t}, P_{pv,t}, P_{b,t}, P_{gt,t}, SOC_t, D_t, \lambda_t, \epsilon_{w,t}, \epsilon_{pv,t}\}$$
In the formula: $\{P_{w,t}, P_{pv,t}, P_{b,t}, P_{gt,t}\}$ are the real-time outputs of the wind, photovoltaic, energy storage, and gas turbine units; $SOC_t$ is the energy storage state of charge; $D_t$ the real-time power load; $\lambda_t$ the time-of-use electricity price; and $\epsilon_{w,t}, \epsilon_{pv,t}$ the wind and solar prediction errors. The prediction errors are included not because the agent requires forecasts to function, but so that it can observe and learn from the system’s inherent uncertainty. By seeing the discrepancy between past forecasts and realized values in the state, the agent discerns patterns of uncertainty and learns a more adaptive, conservative policy that performs well even when forecasts are imperfect. This preserves the model-free spirit of DRL: the agent handles uncertainty directly through the state transitions and rewards observed during real-time interaction with the environment, adjusting its policy according to the consequences of its actions.
② Action space definition
Action vectors:
$$a_t = (\Delta P_{w,t}, \Delta P_{pv,t}, \Delta P_{b,t}, \Delta P_{gt,t})$$
In the formula, the energy storage power adjustment range is $\Delta P_{b,t} \in [-2, 2]\ \mathrm{MW}$ and the gas turbine’s is $\Delta P_{gt,t} \in [-1, 1]\ \mathrm{MW}$, limited by the physical boundaries of the equipment.
③ Note on Action Clipping
In practice, the continuous actions output by the Actor network may exceed the feasible ranges defined by the equipment’s physical limits (Section 3.3). To ensure the realizability and safety of the dispatch strategy, action clipping is applied to the raw action $a_t$ before execution. Specifically, each component of $a_t$ is constrained to its allowable interval:
$$a_{t,clipped} = \mathrm{clip}(a_t, a_{min}, a_{max})$$
where $a_{min}$ and $a_{max}$ are the vectors of minimum and maximum allowable adjustments for each device, as given in Equation (26). Clipping prevents infeasible commands (e.g., excessive charging/discharging rates) that could violate operational constraints or damage equipment, while maintaining gradient flow during training. Since clipping is applied after the network output, the policy can still learn to approach boundary-optimal actions.
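Element-wise clipping reduces to a one-liner; the storage and gas-turbine ranges below follow Section 4.3, while the wind and PV bounds are illustrative assumptions of ours:

```python
# Per-device adjustment bounds (MW): wind, PV (assumed), storage, gas turbine
A_MIN = [0.0, 0.0, -2.0, -1.0]
A_MAX = [1.5, 2.0,  2.0,  1.0]

def clip_action(a):
    """Element-wise clip of the raw actor output onto the feasible box."""
    return [min(max(x, lo), hi) for x, lo, hi in zip(a, A_MIN, A_MAX)]

raw = [1.8, -0.3, 2.7, -1.4]   # infeasible raw actor output
print(clip_action(raw))        # each component projected onto its interval
```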
④ Design of the reward function
Multi-objective weighted summation form:
$$r_t = -\left(\alpha \hat{C}_{op,t} + \beta \hat{C}_{grid,t} + \gamma (1 - \hat{R}_{re,t}) + \delta \hat{E}_{carbon,t}\right)$$
In the formula: $C_{op,t}$ is the operating cost (gas turbine fuel cost plus equipment maintenance cost); $C_{grid,t}$ the grid interaction cost (electricity purchase cost minus electricity sale revenue); $R_{re,t}$ the renewable energy consumption rate (1 minus the wind and solar curtailment rate); $E_{carbon,t}$ the carbon emissions; and $\hat{\cdot}$ denotes normalized metrics. The weight coefficients are determined by the Analytic Hierarchy Process (AHP): judgment matrices for the target and criterion layers are constructed, and after passing the consistency test the weights $\alpha = 0.4$ (operating cost), $\beta = 0.2$ (grid cost), $\gamma = 0.3$ (consumption rate), and $\delta = 0.1$ (carbon emissions) are obtained, balancing economic and environmental objectives.
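With the AHP weights above, the reward computation reduces to a weighted sum over normalized terms (a minimal sketch; the inputs are assumed to already be normalized to [0, 1]):

```python
WEIGHTS = dict(alpha=0.4, beta=0.2, gamma=0.3, delta=0.1)  # AHP-derived weights

def reward(c_op: float, c_grid: float, r_re: float, e_carbon: float) -> float:
    """Negative weighted sum of normalized cost/emission terms; r_re is the
    renewable consumption rate, so (1 - r_re) penalizes curtailment."""
    w = WEIGHTS
    return -(w["alpha"] * c_op + w["beta"] * c_grid
             + w["gamma"] * (1.0 - r_re) + w["delta"] * e_carbon)

# Perfect renewable utilization with zero costs and emissions -> reward 0
print(reward(0.0, 0.0, 1.0, 0.0))
# A typical normalized operating point
print(round(reward(0.5, 0.3, 0.9, 0.2), 3))  # -0.31
```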

5. Case Analysis and Result Discussion

5.1. Simulation Environment and Parameter Settings

(1) System parameters
A case study based on a North China industrial park is conducted using one year of historical data (2022) from the National Solar Radiation Database (NSRDB) for wind/PV and synthesized load profiles calibrated with real consumption patterns [30]. A typical daily load curve is selected, with the electrical power peak of 2.8 MW occurring at 12:00; the thermal load peaks at 1.5 MWth (19:00) and troughs at 0.8 MWth (4:00). Wind power is rated at 1.5 MW with an output fluctuation range of ±20% of the rated value, and the photovoltaic output fluctuation range is ±15%. The benchmark carbon price is λ_base = 30 $/t with sensitivity coefficient k_c = 0.5, and the gas turbine cost coefficients are a_gt = 0.02, b_gt = 1.5, c_gt = 0.5.
Three-segment time-of-use electricity pricing (in US dollars) is applied: peak (8:00–22:00) $0.15/kWh; flat (6:00–8:00 and 22:00–24:00) $0.10/kWh; valley (0:00–6:00) $0.05/kWh. The gas turbine has a rated power of 5 MW, an electrical efficiency of 35%, and a carbon emission coefficient of 0.35 kg/MWh. Battery storage capacity: 2 MWh; charge/discharge efficiency: 90%/85%; SOC range: 0.2–0.9.
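The three-segment tariff can be encoded as a simple lookup (an illustrative sketch; hours run from 0 to 23):

```python
def tou_price(hour):
    """Return the time-of-use electricity price in $/kWh for the given hour of day."""
    if 8 <= hour < 22:
        return 0.15   # peak: 8:00-22:00
    if hour < 6:
        return 0.05   # valley: 0:00-6:00
    return 0.10       # flat: 6:00-8:00 and 22:00-24:00

print([tou_price(h) for h in (3, 7, 12, 23)])  # -> [0.05, 0.1, 0.15, 0.1]
```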
(2) Algorithm parameters
For the PSO and DE baselines: population size 50, 200 iterations; DE mutation (scaling) factor 0.8, crossover probability 0.9. For the improved DDPG: experience replay buffer capacity 10^5, batch size 64; soft-update coefficient τ = 0.01; 2000 training episodes, each simulating 7 days; PER priority exponent α = 0.6, importance-sampling correction coefficient β = 0.4. Reward weights: operating cost α = 0.4, grid interaction cost β = 0.2, renewable energy absorption rate γ = 0.3, carbon emissions δ = 0.1. The upper-level day-ahead planning cycle is T = 24 h; the lower-level real-time optimization horizon is H = 4 h, updated on an hourly rolling basis.
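For context, the PER parameters listed above (priority exponent α = 0.6, importance-sampling exponent β = 0.4) enter the sampling step roughly as follows (a generic prioritized-experience-replay sketch, not the authors' code):

```python
def per_probabilities(priorities, alpha=0.6):
    """Sampling probability p_i^alpha / sum_j p_j^alpha for each stored transition."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def is_weights(probs, beta=0.4):
    """Importance-sampling weights (N * P(i))^(-beta), normalized by the maximum."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]

probs = per_probabilities([1.0, 2.0, 4.0])   # higher TD error -> sampled more often
weights = is_weights(probs)                  # rarely sampled transitions get larger correction
print(abs(sum(probs) - 1.0) < 1e-12, max(weights) == 1.0)  # -> True True
```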

5.2. Convergence Performance Analysis

Figure 3 and Table 4 compare the convergence of the improved DDPG against traditional SGD-based training and the standard DDPG. The average reward of the improved DDPG converged 15.2% faster than with traditional SGD (learning rate 0.01, momentum 0.9), and the number of training episodes required to reach the same average reward level (−1250) was 8.7% lower than that of the traditional DDPG. The algorithm stabilized after about 1500 episodes with an average reward of −1250, indicating a significant reduction in system operating costs.
Ablation experiments compared four configurations: the original DDPG, DDPG with priority experience replay (PER), DDPG with adaptive noise, and the complete improved algorithm, evaluating both convergence speed (the number of episodes needed to reach the −1250 reward) and final performance. As shown in Table 5, PER and the adaptive noise exploration strategy contributed convergence-rate improvements of approximately 6.2% and 4.5%, respectively, verifying the effectiveness of these two improvements.
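The convergence-speed figures in Table 5 follow directly from the episode counts relative to the 1600-episode basic DDPG baseline:

```python
def speedup_pct(baseline_episodes, episodes):
    """Relative reduction in episodes needed to converge, in percent."""
    return 100.0 * (baseline_episodes - episodes) / baseline_episodes

for name, ep in [("A + PER", 1500), ("A + adaptive noise", 1530), ("Complete model", 1350)]:
    print(name, speedup_pct(1600, ep))
# -> 6.25, 4.375, 15.625 (Table 5 rounds these to 6.25%, 4.38%, 15.63%)
```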

5.3. Multi-Objective Optimization Effects

To verify the comprehensive performance of the improved DDPG algorithm in MEVPPs scheduling, this section compares it with PSO, DE, and the traditional DDPG algorithm. All comparisons use the same typical daily load curve, equipment parameters, and electricity price mechanism, and stable results were obtained after 2000 training iterations. The specific optimization effects are shown in Table 6. The improved DDPG performs better in the following respects: its operating cost is 13.3% lower than PSO and 3.2% lower than the traditional DDPG, mainly due to the coordinated scheduling of gas turbines and energy storage, and its carbon emissions are 19.98% lower than DE. This aligns with the growing focus on DRL for low-carbon energy systems, as seen in the joint optimization of shared storage and distribution networks, and demonstrates the framework's effectiveness in achieving a multi-objective balance under the "dual carbon" goals.

5.4. Scheduling Strategy Analysis

This section analyzes the typical daily dispatch results of the proposed improved MADDPG–MPC framework, as shown in Figure 4. The scheduling strategy demonstrates effective coordination among heterogeneous resources to achieve multi-objective optimization.
Renewable energy utilization was prioritized. During the peak photovoltaic generation period (10:00–15:00), the energy storage system was charged at up to 1.8 MW to minimize photovoltaic curtailment. The gas turbine was dispatched primarily during peak load periods to balance supply and demand: its output reached 3.5 MW at the electrical load peak (12:00) and 4.2 MW during the thermal load peak (19:00), effectively smoothing load fluctuations.
The energy storage system operated under a price-arbitrage strategy. It was charged at 0.8 MW during the off-peak period (02:00–05:00) when electricity prices were low, and discharged at 1.5 MW during the evening peak period (18:00–22:00), to reduce operating costs by leveraging time-of-use price differentials.
Carbon emissions were concentrated during gas turbine operation. Periods of high output (12:00–14:00 and 19:00–21:00) contributed 210 kg and 252 kg of CO2, respectively, accounting for 64.08% of the total daily emissions. This highlights the critical role of flexible resources, such as storage and controllable loads, in decarbonizing dispatchable generation.
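The quoted share can be verified directly from the two peak-window contributions and the 721 kg daily total reported for the improved DDPG in Table 6:

```python
# Gas-turbine peak-window emissions reported above.
peak_windows_kg = 210 + 252        # 12:00-14:00 and 19:00-21:00
daily_total_kg = 721               # improved DDPG daily emissions (Table 6)
share_pct = 100.0 * peak_windows_kg / daily_total_kg
print(round(share_pct, 2))  # -> 64.08
```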
Overall, the scheduling results validate the framework’s ability to harmonize economic operation, renewable integration, and emission reduction through time-coordinated control of generation, storage, and load. The strategy’s responsiveness to real-time conditions and price signals further confirms its suitability for high-renewable-penetration scenarios.

5.5. Sensitivity Analysis

When the carbon price rose from 20 $/t to 50 $/t, the scheduling strategy changed as follows (see Table 7): the daily operating time of the gas turbine decreased from 8 h to 5 h and carbon emissions fell by 22.7%; energy storage charging and discharging increased by 15.3%, replacing gas-turbine supply through "off-peak charging and peak discharging"; and operating costs rose by 4.8% while environmental benefits improved significantly, indicating that the carbon price mechanism can effectively guide the low-carbon operation of VPPs. Under a wind and solar prediction error of ±20%, the average operating cost of the improved DDPG algorithm was $11,410, an increase of about 4.7% over the $10,892 of the error-free scenario. When the installed capacity ratio of wind and solar power increased from 35% to 55%, the system's curtailment rate dropped from 9.2% to 4.1%; at the 55% ratio, the curtailment rate of the improved DDPG algorithm was 3.2 percentage points lower than that of the traditional DDPG algorithm. In the 55% high-renewable scenario, the improved DDPG algorithm converged 10.5% faster than it did in the 35% scenario.

5.6. Reward Weight Sensitivity Analysis

To justify the AHP-derived reward weights (α = 0.4, β = 0.2, γ = 0.3, δ = 0.1), we performed a sensitivity analysis by varying each weight by ±0.1 while renormalizing the others. Figure 5 shows the percentage changes, relative to the benchmark configuration, in the three targets of cost (α), carbon emissions (δ), and the renewable energy absorption rate (γ) (the influence of the grid interaction cost weight β is relatively small and is not shown). The results show that system performance is not sensitive to weight changes, with most variations remaining within ±5%, confirming the robustness of the weight selection.
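The perturb-and-renormalize scheme described above can be sketched as follows (our illustrative reading of the procedure: one weight shifts by ±0.1 and the remaining weights are rescaled so that the four still sum to one):

```python
def perturb_weights(weights, key, delta):
    """Shift weights[key] by delta and rescale the rest so the total stays at 1."""
    new = dict(weights)
    new[key] = weights[key] + delta
    rest_sum = sum(v for k, v in weights.items() if k != key)
    scale = (1.0 - new[key]) / rest_sum
    for k in weights:
        if k != key:
            new[k] = weights[k] * scale
    return new

base = {"alpha": 0.4, "beta": 0.2, "gamma": 0.3, "delta": 0.1}
shifted = perturb_weights(base, "alpha", +0.1)
print(round(shifted["alpha"], 3), round(sum(shifted.values()), 10))  # -> 0.5 1.0
```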

5.7. Online Deployment and Edge Computing Feasibility

To realize the potential of the "MADDPG–MPC" two-stage framework, the algorithm must be deployed in an actual VPP control system, which poses the following challenges:
In terms of model computational complexity, the Actor network (3 × 128 fully connected layers) and the Critic network (2 × 256 fully connected layers) of the MADDPG algorithm contain tens of thousands to hundreds of thousands of parameters, and inference requires significant computing resources. The lower-level MPC optimizer solves the optimization problem subject to device constraints over a 4 h time horizon, and computational time must be controlled. Regarding communication delays, VPP optimization relies on real-time acquisition of DER status data and issuance of scheduling instructions. The end-to-end delay target of the communication link is less than 100 ms. High delay will reduce the response speed of the MPC closed-loop control, affecting the optimization effect and even stability. In terms of data synchronization and reliability, data acquisition, transmission, and timestamp synchronization across massive heterogeneous devices are the basis for ensuring the accuracy of the state space, and communication interruptions or data packet loss require the algorithm to have fault-tolerant and robust mechanisms.
Edge computing provides a feasible deployment architecture: deploying the trained MADDPG Actor networks and the lower-level MPC optimizer on edge servers shortens the physical path from data acquisition through computation to instruction issuance, reducing communication latency. High-performance edge computing devices, such as the NVIDIA Jetson AGX Xavier/Orin, deliver tens of TOPS of AI computing power and ample RAM, enough to support DRL model inference and MPC optimization for small- and medium-scale VPPs. The Actor network can be compressed using techniques such as model pruning, quantization, or knowledge distillation to reduce computational load and memory usage. MADDPG training is carried out in the cloud or a data center, which also maintains the experience replay pool under a cloud-edge collaborative strategy; the parameters of the trained policy model are periodically pushed to the edge nodes. This architecture addresses latency and scalability concerns, confirming engineering feasibility.
By combining edge computing architecture and model optimization techniques, the two-stage optimization algorithm framework has engineering feasibility for deployment in actual VPPs control systems. In the future, the focus will be on model lightweighting, low-latency communication protocol optimization, and the realization of cloud-edge collaboration mechanisms.

6. Conclusions and Future Prospects

This study establishes a two-stage deep reinforcement learning framework (MADDPG–MPC) to address the economic-environmental dispatch problem of multi-energy virtual power plants.
The core contributions and findings are: (1) High-Fidelity Modeling Enhances Accuracy: The integration of dynamic wind turbulence correction, PV temperature-irradiation coupling, and SOC-dependent storage efficiency models reduced aggregate equipment modeling errors by 12.7% compared to conventional static models, leading to more realistic and reliable scheduling. (2) Enhanced Algorithm Improves Learning: The improved MADDPG algorithm, incorporating priority experience replay and adaptive noise exploration, accelerated convergence by 15.6% and achieved a high renewable energy absorption rate of 88.6%. (3) Hierarchical Architecture Achieves Synergistic Optimization: The coordinated “day-ahead MADDPG + real-time MPC” architecture, featuring bidirectional policy feedback, effectively decoupled long-term optimization from short-term uncertainty management. It reduced operating costs by 13.3% and carbon emissions by 17.7% compared to particle swarm optimization, demonstrating superior multi-objective performance. (4) Engineering Feasibility is Confirmed: Deployment analysis on an edge computing platform (NVIDIA Jetson AGX Xavier) achieved critical latency targets (<50 ms for inference, 2.1 s for MPC), validating the framework’s real-time applicability for VPPs at the 100 MW scale. The framework also showed excellent scalability, maintaining a low 4.1% curtailment rate at 55% renewable penetration.
This research provides a technically viable and replicable blueprint for combining the long-term learning capability of DRL with the short-term robustness of MPC. It supports the power system transition towards high renewable penetration and net-zero emissions.
Limitations and future work are as follows: (1) The current framework's performance, though robust, is still influenced by the quality of short-term forecasts under highly volatile conditions. Future work will explore a purely real-time, end-to-end DRL framework that operates without forecast data, relying solely on current and historical system states. (2) Future research will extend the framework to coordinate GW-level VPP clusters and integrate more advanced low-carbon technologies, such as Carbon Capture, Utilization and Storage (CCUS) and Power-to-Gas (P2G) systems, as explored in other deep decarbonization studies (Fan et al. 2024; Samende et al. 2023) [31,32], to further minimize the carbon footprint of dispatchable generation. (3) Practical deployment challenges, including cybersecurity protocols, standardization of communication interfaces for heterogeneous devices, and business model design, warrant further investigation in collaboration with industry partners.

Author Contributions

Conceptualization, H.N. and J.W.; methodology, J.W.; validation, X.T.; formal analysis, X.T. and H.N.; resources, J.W. and Y.W.; data curation, J.W.; writing—original draft preparation, H.N., Y.W., X.T. and J.W.; writing—review and editing, H.N., X.T. and J.W.; visualization, X.T.; supervision, H.N. and X.T.; project administration, X.T.; funding acquisition, Y.W., X.T. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Grid Headquarters’ Science and Technology Project “Research on Key Technologies for Electric-Carbon Coupling Planning and Operation of a New Power System to Promote Regional Collaborative Carbon Reduction” (Project Number: 5218A023000N).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Yonghua Wang was employed by the company State Grid Jiangxi Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. International Renewable Energy Agency (IRENA). Renewable Capacity Statistics 2022; International Renewable Energy Agency (IRENA): Abu Dhabi, United Arab Emirates, 2022. [Google Scholar]
  2. International Renewable Energy Agency (IRENA). Renewable Energy Prospects for the European Union, January 2018. Available online: https://www.irena.org/-/media/Files/IRENA/Agency/Publication/2018/Feb/IRENA_REmap_EU_2018.pdf (accessed on 8 February 2026).
  3. International Energy Agency (IEA). Net Zero by 2050: A Roadmap for the Global Energy Sector; International Energy Agency (IEA): Paris, France, 2021. [Google Scholar]
  4. Mohammadi, H.; Karimi, H.; Liu, L. A review on virtual power plant for energy management. Sustain. Energy Technol. Assess. 2021, 47, 101370. [Google Scholar] [CrossRef]
  5. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  6. Stefan, R.; Johannes, M.; Joachim, G.; Oliver, B. Computational intelligence based on optimization of hierarchical virtual power plants. Energy Syst. 2021, 12, 517–544. [Google Scholar] [CrossRef]
  7. Jadidoleslam, M. Risk-constrained participation of virtual power plants in day-ahead energy and reserve markets based on multi-objective operation of active distribution network. Sci. Rep. 2025, 15, 9145. [Google Scholar] [CrossRef]
  8. International Energy Agency (IEA). Global EV Outlook 2023; International Energy Agency (IEA): Paris, France, 2023. [Google Scholar]
  9. Shui, J.; Peng, D.; Zeng, H.; Song, Y.; Yu, Z.; Yuan, X.; Shen, C. Optimal scheduling of multiple entities in virtual power plant based on the master-slave game. Appl. Energy 2024, 376, 124286. [Google Scholar] [CrossRef]
  10. Xue, L.; Zhang, Y.; Wang, J.; Li, H.; Li, F. Privacy-preserving multi-level co-regulation of VPPs via hierarchical safe deep reinforcement learning. Appl. Energy 2024, 371, 123654. [Google Scholar] [CrossRef]
  11. Naughton, J.; Wang, H.; Riaz, S.; Cantoni, M.; Mancarella, P. Optimization of multi-energy virtual power plants for providing multiple market and local network services. Electr. Power Syst. Res. 2020, 189, 106775. [Google Scholar] [CrossRef]
  12. Zhao, H.; Zhang, C.; Zhao, Y.; Wang, X. Low-Carbon Economic Dispatching of Multi-Energy Virtual Power Plant with Carbon Capture Unit Considering Uncertainty and Carbon Market. Energies 2022, 15, 7225. [Google Scholar] [CrossRef]
  13. Han, Z.; Zhang, Y.; Li, B. Two-stage Optimization Scheduling of Cold, Heat and Power Virtual Power Plant Based on Multi-scenario Technology. Electr. Meas. Instrum. 2022, 59, 174–180. [Google Scholar]
  14. Lin, L.; Guan, X.; Peng, Y.; Wang, N.; Maharjan, S.; Ohtsuki, T. Deep Reinforcement Learning for Economic Dispatch of Virtual Power Plant in Internet of Energy. IEEE Internet Things J. 2020, 7, 3288–3301. [Google Scholar] [CrossRef]
  15. Wei, X.; Chan, K.W.; Wang, G.; Hu, Z.; Zhu, Z.; Zhang, X. Robust preventive and corrective security-constrained OPF for worst contingencies with the adoption of VPP: A safe reinforcement learning approach. Appl. Energy 2025, 380, 124970. [Google Scholar] [CrossRef]
  16. Yang, J.; Yang, X.; Yu, T. Multi-Unmanned Aerial Vehicle Confrontation in Intelligent Air Combat: A Multi-Agent Deep Reinforcement Learning Approach. Drones 2024, 8, 382. [Google Scholar] [CrossRef]
  17. Li, Y.; Chang, W.; Yang, Q. Deep reinforcement learning based hierarchical energy management for virtual power plant with aggregated multiple heterogeneous microgrids. Appl. Energy 2025, 382, 125333. [Google Scholar] [CrossRef]
  18. Wang, J.; Guo, C.; Yu, C.; Liang, Y. Virtual power plant containing electric vehicles scheduling strategies based on deep reinforcement Learning. Electr. Power Syst. Res. 2022, 205, 107714. [Google Scholar] [CrossRef]
  19. Guo, G.; Gong, Y. Multi-Microgrid Energy Management Strategy Based on Multi-Agent Deep Reinforcement Learning with Prioritized Experience Replay. Appl. Sci. 2023, 13, 2865. [Google Scholar] [CrossRef]
  20. Domínguez-Barbero, D.; García-González, J.; Sanz-Bobi, M.Á.; García-Cerrada, A. Energy management of a microgrid considering nonlinear losses in batteries through Deep Reinforcement Learning. Appl. Energy 2024, 368, 123435. [Google Scholar] [CrossRef]
  21. Pei, Y.; Ye, K.; Zhao, J.; Yao, Y.; Su, T.; Ding, F. Visibility-enhanced model-free deep reinforcement learning algorithm for voltage control in realistic distribution systems using smart inverters. Appl. Energy 2024, 372, 123758. [Google Scholar] [CrossRef]
  22. Tang, X.; Wang, J. Deep Reinforcement Learning-Based Multi-Objective Optimization for Virtual Power Plants and Smart Grids: Maximizing Renewable Energy Integration and the Grid Efficiency. Processes 2025, 13, 1809. [Google Scholar] [CrossRef]
  23. Yan, Q.; Zhang, M.; Lin, H.; Li, W. Two-stage adjustable robust optimal dispatching model for multi-energy virtual power plant considering multiple uncertainties and carbon trading. J. Clean. Prod. 2022, 336, 130400. [Google Scholar] [CrossRef]
  24. Axehill, D.; Besselmann, T.; Raimondo, D.M.; Morari, M. A parametric branch and bound approach to suboptimal explicit hybrid MPC. Automatica 2014, 50, 240–246. [Google Scholar] [CrossRef]
  25. Manwell, J.F.; McGowan, J.G.; Rogers, A.L. Wind Energy Explained: Theory, Design and Application, 2nd ed.; Wiley: Hoboken, NJ, USA, 2010. [Google Scholar]
  26. Skoplaki, E.; Palyvos, J.A. On the temperature dependence of photovoltaic module electrical performance: A review of efficiency/power correlations. Sol. Energy 2009, 83, 614–624. [Google Scholar] [CrossRef]
  27. Chicco, G.; Mancarella, P. Distributed multi-generation: A comprehensive view. Renew. Sustain. Energy Rev. 2009, 13, 535–551. [Google Scholar] [CrossRef]
  28. Lokupitiya, E.; Paustian, K. Agricultural soil greenhouse gas emissions: A review of National Inventory Methods. J. Environ. Qual. 2006, 35, 1413–1427. [Google Scholar] [CrossRef]
  29. Wächter, A.; Biegler, L.T. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Program. 2006, 106, 5–57. [Google Scholar] [CrossRef]
  30. National Solar Radiation Database (NSRDB). Data for North China Region. 2022. Available online: https://nsrdb.nrel.gov/ (accessed on 10 February 2025).
  31. Fan, J.; Zhang, J.; Yuan, L.; Yan, R.; He, Y.; Zhao, W.; Nin, N. Deep Low-Carbon Economic Optimization Using CCUS and Two-Stage P2G with Multiple Hydrogen Utilizations for an Integrated Energy System with a High Penetration Level of Renewables. Sustainability 2024, 16, 5722. [Google Scholar] [CrossRef]
  32. Samende, C.; Fan, Z.; Cao, J.; Fabián, R.; Baltas, G.N.; Rodríguez, P. Battery and Hydrogen Energy Storage Control in a Smart Energy Network with Flexible Energy Demand Using Deep Reinforcement Learning. Energies 2023, 16, 6770. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the MEVPPs system architecture.
Figure 2. Flowchart of the two-stage MADDPG–MPC training and deployment process.
Figure 3. Comparison of convergence curves of different algorithms.
Figure 4. Typical daily scheduling results of the improved MADDPG–MPC.
Figure 5. Sensitivity analysis of reward weight (± variation). Note: α = cost weight (0.4 benchmark); δ = carbon weight (0.1 benchmark); γ = absorption weight (0.3 benchmark); orange indicates the renewable energy absorption rate, dark gray the total system cost, and grayish white carbon emissions.
Table 1. Limitations of existing DRL-based VPP scheduling studies.
Study | Method | Strengths | Weaknesses
Lin et al. [14] | DDPG + Edge | Low computation latency | Simplified PV/wind models
Wei et al. [15] | L-SAC | Robust to contingencies | Slow convergence (>350 episodes)
Li et al. [17] | MADDPG | Multi-microgrid coordination | No carbon emission optimization
Wang et al. [18] | SAC-TD3 hybrid | EV strategy optimization | Ignored thermal-electrical coupling
Barbero et al. [20] | TD3 | Nonlinear battery modeling | Single-objective (cost minimization)
Table 2. Comparison with existing DRL-MPC frameworks for VPP scheduling.
Aspect | Typical SAC/TD3-MPC [e.g., Axehill et al. (2014) [24]] | Proposed MADDPG–MPC Framework
DRL core | Single-agent (SAC, TD3) | Multi-agent (MADDPG) for distributed entity coordination
Model integration | Often uses standard, linearized equipment models | Integrates high-fidelity, nonlinear physical models (wind turbulence, PV thermal coupling, variable storage efficiency)
Feedback mechanism | MPC corrects actions; DRL policy is static post-training | Bidirectional feedback: MPC deviations are fed into MADDPG's experience replay for online policy refinement
Objective handling | Often single-objective or a fixed weighted sum | Explicit multi-objective reward with AHP-derived weights, synergizing cost, carbon, and renewable utilization
Real-time adaptation | MPC handles short-term deviations | Combines MAS long-term learning with MPC short-term robustness, enhanced by adaptive exploration
Table 3. Key equipment parameters of MEVPPs.
Device Type | Rated Power | Efficiency/Characteristics | Carbon Emission Coefficient (kg/MWh) | Response Time (s)
Wind turbine | 1.5 MW | N/A (wind-speed determined) | 0 | 10–30
Photovoltaic power station | 2.0 MWp | ≈18% (affected by temperature) | 0 | 5–15
Gas turbine | 5.0 MW | Electrical efficiency 35%, electric-to-heat ratio 0.6 | 0.35 | 5–60
Battery energy storage | 2.0 MW/2 MWh | Charge 90%, discharge 85% | 0 | <1
Thermal energy storage device | 1.5 MWth/3 MWh | Charge 95%, discharge 95% | 0.35 | <1
Note: The data in Table 3 are self-defined benchmark parameters for the simulation experiment, set as part of the research design based on common models and typical values in the energy field, in order to construct a standardized scenario suitable for algorithm verification.
Table 4. Random experiment statistics.
Algorithms | Episodes Required for Convergence (Mean ± SD) | Final Reward (Mean ± SD)
Improved DDPG | 1350 ± 42 | −1250 ± 18
Traditional DDPG | 1600 ± 68 | −1190 ± 25
SAC | 1450 ± 55 | −1220 ± 22
Table 5. Results of the ablation experiment.
Model | Episodes Required for Convergence | Convergence Speed Improvement
Basic DDPG (A) | 1600 | –
A + PER (B) | 1500 | 6.25%
A + Adaptive noise (C) | 1530 | 4.38%
Complete model (D) | 1350 | 15.63%
Table 6. Comparison of multi-objective optimization results of different algorithms.
Algorithms | Operating Costs (USD) | Carbon Emissions (kg) | Absorption Rate (%) | Computation Time (s) | Cost MAPE (%)
PSO | 12,568 | 876 | 78.5 | 45.6 | 7.8
DE | 11,842 | 901 | 81.2 | 38.9 | 6.5
Traditional DDPG | 11,256 | 765 | 85.3 | 22.4 | 4.5
Improved DDPG | 10,892 | 721 | 88.6 | 25.7 | 3.2
Note: MAPE is the mean absolute percentage error of operating costs under a wind and solar prediction error of ±10%.
Table 7. Sensitivity analysis data sheet.
Carbon Price ($/t CO2) | Gas Turbine Operation (h) | Carbon Emissions (kg) | Energy Storage Cycle Capacity (MWh) | Operating Costs ($)
20 (Benchmark) | 8 | 721 | 6.0 | 10,892
50 | 5 | 557 | 6.92 (+15.3%) | 11,410

Share and Cite

Ni, H.; Wang, Y.; Tang, X.; Wang, J. An Intelligent Two-Stage Dispatch Framework for Cost and Carbon Reduction in Multi-Energy Virtual Power Plants. Processes 2026, 14, 743. https://doi.org/10.3390/pr14050743
