1. Introduction
Against the backdrop of high-penetration wind and solar grid integration and the rapid expansion of the hydrogen economy, integrated energy systems (IES) within industrial parks increasingly require the coordination of electricity, heat, and hydrogen under intense multi-energy coupling, significant intertemporal dynamics, and high uncertainty [1,2]. Hydrogen energy, particularly through electrolyzer and hydrogen storage systems, introduces an electricity-to-hydrogen coupling pathway [3]. Its demand side is driven by exogenous hydrogen-sector requirements, which must be met dynamically through hydrogen production and inventory management. Conversion and storage equipment collectively determine the viable operating envelope and carbon emission outcomes [4].
In parallel, incentive-based demand response (DR) has increasingly been framed as a settlement-grade resource rather than a purely price-guided flexibility option [5]. DR has also been applied in different industrial contexts; for instance, Iris and Lam [6] investigated demand response and energy management in seaports under renewable uncertainty, showing that DR is relevant not only to conventional power systems but also to broader industry-specific energy operations. In such programs, baseline definition, measurement and verification (M&V), and auditable settlement rules directly determine the correctness of the realized reduction and the exposure of incentive payments [7]. Valentini et al. [8] used a structured review to summarize customer baseline load (CBL) estimation methods and showed that baseline uncertainty and methodological choices can materially bias impact evaluation and settlement results. Li et al. [9] categorized baseline estimation techniques for incentive-based DR from an application-requirement perspective and highlighted that data availability, behind-the-meter behaviors, and multi-service participation can undermine settlement accuracy and controllability. Ellman and Xiao [10] used a multi-stage stochastic dynamic programming model to study incentives for baseline manipulation under uncertain event schedules, demonstrating that settlement rules can induce strategic behavior beyond the intended load curtailment. Wang et al. [11] developed an MDP-based analytical framework for baseline manipulation in baseline-based DR programs and derived structural insights on customers' underconsumption/overconsumption strategies, which distort settlement fairness. Qi et al. [12] applied probabilistic baseline prediction to settlement-sensitive load-reduction potential assessment, showing that uncertainty-aware baseline modeling is essential for contract execution and payment credibility. Šikšnys et al. [13] proposed a two-stage decision-oriented baseline selection framework for practical baseline choice under real consumption data, emphasizing that auditable baselines require both technical feasibility screening and performance evaluation. Collectively, these studies indicate that baseline definition, verification of the realized response, and payment recalculation rules directly determine settlement correctness and incentive exposure, while budget controllability becomes an explicit operational requirement. Dispatch is thereby elevated from single-objective cost minimization to a coupled coordination task that must simultaneously deliver low-carbon economic performance and settlement feasibility under strict operational security boundaries [14,15,16]. In settlement-oriented incentive-based DR, dispatch decisions are no longer evaluated solely by operational cost but also by the auditability of realized reductions and the controllability of incentive exposure. Baseline definitions, M&V procedures, and payment recalculation rules induce an implicit "settlement feasibility region": a schedule that is physically feasible may still be non-settleable if the verified reduction cannot be reconstructed from metering data, or if the incentive budget can be exceeded under the prescribed accounting rules. This coupling makes economic dispatch a joint optimization of multi-energy operation and settlement integrity, where operational security constraints and settlement eligibility must be satisfied together.
Deterministic MILP formulations remain effective for structured constraint representation in IES; however, their online performance is often sensitive to forecast errors and modeling deviations under renewable variability and device dynamics. Ma et al. [17] used data-driven distributionally robust optimization to address source–load uncertainty in electric–thermal–hydrogen IES scheduling, illustrating the need for uncertainty-aware formulations beyond deterministic MILP. Cuisinier et al. [18] used extended rolling-horizon optimization to counter the accumulation of forecast errors in operational planning, confirming that receding-horizon updates are critical when predictions degrade over time. Fernández et al. [19] used a two-stage deterministic EMS architecture (rolling-horizon planning with fast local adaptation) to mitigate forecast-error impacts on objective values, further motivating closed-loop corrective control. In parallel, deep reinforcement learning (DRL) provides an alternative by learning closed-loop policies through interaction. Liang et al. [20] used deep RL (SAC) for real-time optimal scheduling in integrated energy systems, demonstrating improved renewable utilization with online policy control. Liu et al. [21] used a data-driven DRL scheduling framework for coordinated dispatch in integrated electricity–heat–gas–hydrogen systems with demand-side flexibility. Li et al. [22] used a safe DRL scheduling approach (AutoML-enhanced safe RL with forecasting and DR) for constraint-aware IES scheduling under renewable uncertainty. Prabawa and Choi [23] used safe-DRL-assisted two-stage energy management to address operational security in active distribution networks with hydrogen fueling stations. Nevertheless, despite the growing body of research on DRL-based scheduling for integrated energy systems, two key limitations remain insufficiently addressed. First, most existing studies treat demand response primarily as an operational flexibility resource and do not explicitly incorporate settlement-grade mechanisms, such as baseline reconstruction, M&V, payment recalculation, and incentive-budget ledger evolution, into the dispatch state and transition process. As a result, a schedule that is physically feasible and economically attractive in simulation may still be non-settleable in practice if the realized reduction cannot be independently verified or if the incentive expenditure exceeds what the prescribed accounting rules allow. Second, although safe reinforcement learning has been introduced to improve constraint awareness, existing approaches mainly focus on physical feasibility and rarely address the joint requirement of operational security and settlement eligibility in incentive-based DR. In multi-constraint electro–heat–hydrogen dispatch, this omission is particularly critical because settlement-grade DR introduces additional intertemporal states beyond energy storage, including baseline windows, verified response histories, and remaining incentive budgets. Ignoring these variables can lead to policies that appear cost-effective during training but fail in deployment due to hidden ledger violations and non-auditable response outcomes.
To close this gap, this paper proposes a Safety Transformer–PPO framework for low-carbon economic dispatch with settlement-oriented incentive-based demand response in integrated electro–heat–hydrogen energy systems. The main contributions are threefold. First, a settlement-aware dispatch model is established by explicitly embedding baseline–verification–settlement rules and budget-ledger states into the environment, so that feasibility is defined jointly by physical operability and settlement eligibility. Second, a causal Transformer–PPO architecture is developed to capture long-horizon temporal dependencies induced by multi-energy coupling, renewable uncertainty, and intertemporal ledger evolution. Third, a dual-layer safety mechanism is introduced, in which Lagrange-based statistical constraint regulation is combined with execution-layer quadratic-programming projection to enforce hard physical feasibility during deployment. In this way, the proposed framework moves beyond conventional cost-oriented DRL dispatch and provides an audit-ready, safety-constrained, and settlement-compatible solution for industrial-park integrated energy management.
3. Safety Transformer–PPO with Integrated Settlement-Oriented IDR
3.1. MDP Modelling: States, Actions and Ledger Variables
The scheduling problem is formulated as a Markov Decision Process (MDP) [27]. To ensure settlement consistency and safe convergence, the state s_t explicitly incorporates inter-period energy states and settlement ledger variables alongside load, generation capacity, and price signals. Among the ledger components, the residual budget for load transfer is tracked explicitly as a state variable.
The action a_t is defined as a continuous vector that balances supply-side scheduling with demand-side incentive deployment. One component governs the settlement payment and incentive-budget consumption, while another determines the requested physical load-shifting quantity within the transferable-load envelope.
The instantaneous reward is set to the negative of the total operating cost, r_t = −C_t, so that return maximization corresponds to cost minimization.
For readability, the main state and action variables, together with their units and bounds, are compactly summarized in Table 1.
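To illustrate how settlement ledger variables can sit alongside physical states in the MDP, the following minimal Python sketch groups the state components described above. All field names are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class DispatchState:
    # Physical and market observations (illustrative names)
    elec_load_mw: float
    heat_load_mw: float
    h2_demand_kg: float
    renewable_mw: float
    price_buy: float
    # Inter-period energy states
    battery_soc: float          # fraction of usable capacity
    h2_inventory_kg: float
    # Settlement ledger variables
    incentive_budget_left: float    # remaining incentive budget (CNY)
    shift_budget_left_mwh: float    # residual transferable-load budget

    def as_vector(self):
        """Flatten to the numeric state vector consumed by the policy."""
        return [self.elec_load_mw, self.heat_load_mw, self.h2_demand_kg,
                self.renewable_mw, self.price_buy, self.battery_soc,
                self.h2_inventory_kg, self.incentive_budget_left,
                self.shift_budget_left_mwh]
```

The point of the sketch is that the ledger fields evolve across steps exactly like storage states, so the policy can condition on remaining budgets when deciding incentive deployment.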
3.2. Causal Transformer–PPO Architecture
To capture long-horizon coupling induced by ramping limits, multi-energy storage dynamics, renewable uncertainty, and settlement ledger evolution, we adopt a causal Transformer as the sequence encoder inside an actor–critic PPO [28] framework. The overall architecture is illustrated in Figure 2. Unlike a feedforward policy that maps only the current observation to an action, the proposed encoder explicitly models the most recent operating trajectory, which is essential for dispatch problems where feasible and economic decisions depend on temporal context.
The policy and value networks are conditioned on a sliding window of the most recent T = 24 hourly observations, which is designed to capture the dominant daily periodicity in loads, renewable availability, and price signals while remaining lightweight for online deployment. Each hourly observation aggregates heterogeneous physical and economic variables. To avoid scale imbalance across these heterogeneous inputs and to stabilize optimization, all features are normalized using training-set statistics and then mapped through a learned linear projection into a shared latent space with embedding dimension d = 128. The resulting token sequence is processed by a compact Transformer encoder composed of N = 3 stacked blocks. Each block uses four-head self-attention and a position-wise feedforward network with hidden size 256, together with residual connections and layer normalization to support stable gradient flow and improve generalization. Importantly, we apply a strict causal attention mask so that the representation at the current hour attends only to the available history within the window, i.e., the interval spanning from t − T + 1 to t. This prevents any information leakage from future steps during training and evaluation, and ensures that the learned policy remains consistent with real-time operation where future realizations are not observable. After encoding, the representation of the final token (corresponding to the current hour) is used as a compact context vector summarizing the recent system trajectory, and it is passed to the actor–critic heads.
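The effect of the strict causal mask can be illustrated with a single-head self-attention sketch in NumPy. This is a toy illustration (one head, no residuals or layer normalization, toy embedding size), not the paper's implementation; the perturbation check at the end verifies that a future hour cannot influence earlier representations.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, d) observation window.

    A strictly upper-triangular mask blocks attention to future steps, so
    the representation at hour t depends only on hours t-T+1 .. t.
    """
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores[future] = -1e9                               # mask future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d = 24, 16                         # 24-hour window, toy embedding size
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)

# Causality check: perturbing the last hour must not change earlier outputs.
x2 = x.copy()
x2[-1] += 1.0
out2 = causal_self_attention(x2, Wq, Wk, Wv)
assert np.allclose(out[:-1], out2[:-1])
```

In the full architecture this masked attention is stacked (three blocks, four heads, d = 128) and the final token's representation serves as the context vector for the actor–critic heads.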
On top of the causal Transformer encoder, the actor is implemented as a lightweight two-layer MLP (128 → 128) that outputs the parameters of a diagonal Gaussian policy for continuous controls, enabling stochastic exploration during training while maintaining a simple and scalable action distribution. The sampled action is then passed through a squashing and affine scaling procedure to enforce element-wise actuator bounds before it enters the safety layer described. This separation is intentional: the actor focuses on producing a high-quality raw control signal in a normalized action space, while feasibility with respect to hard operational constraints is enforced downstream by the safety mechanism. The critic shares the same causal Transformer encoder to form a consistent representation of the recent trajectory and reduce computational overhead. A separate two-layer MLP (128 → 128 → 1) maps the context vector to a scalar state-value estimate, which is used for advantage estimation and policy updates. Sharing the encoder between actor and critic improves sample efficiency and stabilizes training, while keeping the heads separate prevents interference between action generation and value regression.
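The squashing-and-affine-scaling step can be sketched as follows. The paper does not specify the exact squashing function, so tanh is assumed here as one common choice; `lo` and `hi` stand for the element-wise actuator bounds.

```python
import numpy as np

def squash_and_scale(raw_action, lo, hi):
    """Map an unbounded Gaussian sample onto element-wise actuator bounds:
    tanh squashes to (-1, 1), then an affine map rescales onto [lo, hi]."""
    u = np.tanh(np.asarray(raw_action, dtype=float))
    return lo + 0.5 * (u + 1.0) * (hi - lo)
```

Note that this only enforces the box bounds of the normalized action space; hard operational constraints (ramping, SOC, inventories) are still handled downstream by the safety layer.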
Training is performed using PPO with the standard clipped surrogate objective and generalized advantage estimation (GAE) to balance bias and variance in policy gradients. For reproducibility, we fix all key hyperparameters across experiments: discount factor γ = 0.99, GAE parameter λ = 0.95, clipping coefficient 0.2, entropy coefficient 0.01, value-loss coefficient 0.5, and Adam learning rate 3 × 10⁻⁴. Each policy update uses a mini-batch size of 256 and runs for 10 epochs over collected rollouts. The rollout length is set to 2048 steps, which provides sufficiently diverse on-policy trajectories for stable optimization without excessively delaying updates. In total, training proceeds for 3 × 10⁵ interaction steps. This configuration offers a pragmatic trade-off between stability and computational cost, and it is kept identical across baselines to ensure that performance differences are attributable to architectural and safety-design choices rather than tuning artifacts.
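The GAE computation referenced above follows the standard backward recursion; the sketch below uses the stated γ = 0.99 and λ = 0.95 as defaults and is a generic illustration, not code from the authors' implementation.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout.

    rewards: per-step rewards r_0..r_{T-1}
    values:  value estimates V(s_0)..V(s_{T-1})
    last_value: bootstrap value V(s_T) for the state after the rollout
    """
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        # TD residual, then exponentially weighted accumulation
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    returns = adv + values   # targets for the value-loss term
    return adv, returns
```

With γ = λ = 1 and zero values, the advantage at each step reduces to the undiscounted reward-to-go, which is a convenient sanity check for the recursion.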
3.3. Dual-Layer Safety Mechanisms
This paper categorizes constraints into statistical constraints and hard physical constraints, which are treated separately. Statistical constraints include the incentive budget, the maximum power curtailment rate, the desired emission cap, and the maximum average power-purchase rate. Each statistical constraint k is represented by a constraint cost C_k with upper bound d_k and is regulated with the Lagrange primal–dual approach, whose dual variables are updated by projected gradient ascent:

λ_k ← max(0, λ_k + η_λ (Ĵ_{C_k} − d_k)),

where λ_k denotes the Lagrange multiplier (dual variable) corresponding to the k-th constraint, η_λ is the learning rate for the dual variables, and Ĵ_{C_k} is the sample estimate of the expected constraint cost.
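A minimal sketch of this projected dual-ascent step, with hypothetical constraint names and illustrative numbers (the actual costs and bounds come from the scheduling environment):

```python
def dual_update(lmbda, cost_estimate, bound, lr=0.01):
    """Projected dual ascent: increase the multiplier when the sampled
    constraint cost exceeds its upper bound; clip at zero otherwise."""
    return max(0.0, lmbda + lr * (cost_estimate - bound))

# One update per statistical constraint (illustrative values only)
multipliers = {"budget": 0.0, "curtailment": 0.0, "emission": 0.0, "purchase": 0.0}
estimates   = {"budget": 1.2, "curtailment": 0.3, "emission": 0.9, "purchase": 0.5}
bounds      = {"budget": 1.0, "curtailment": 0.5, "emission": 1.0, "purchase": 0.4}
for k in multipliers:
    multipliers[k] = dual_update(multipliers[k], estimates[k], bounds[k], lr=0.5)
```

Multipliers for violated constraints (budget, purchase) grow and penalize the policy objective, while satisfied constraints keep their multipliers clipped at zero.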
Hard physical constraints encompass power upper and lower bounds, turbine/boiler ramping, battery state of charge, hydrogen storage inventory, safety venting, and energy balance. To prevent training collapse caused by excessive boundary violations during exploration, this paper introduces an action-feasible-region projection at the execution layer. At execution, the raw policy action ã_t may violate hard constraints due to stochastic exploration or approximation errors. The deployed action a_t is therefore computed by solving a quadratic projection problem that minimally modifies ã_t while satisfying a convex approximation of the feasible set. A standard form is

a_t = argmin_{a, ξ ≥ 0} ‖a − ã_t‖₂² + ρ‖ξ‖₂²  s.t.  A a ≤ b + ξ,

where the linear system A a ≤ b encodes the standard unit-level bounds and consistency constraints defined by Table 2 and the corresponding device equations above, including ramping limits, SOC and hydrogen-inventory bounds, interconnection limits, and one-step linearized balance constraints. The slack variable ξ is introduced only for numerical robustness, with a large penalty factor ρ that strongly discourages violations. In implementation, mutual exclusivity between battery charging and discharging is enforced by rule-based gating before the QP projection, so that only one of the charging and discharging power variables can remain active at each step. The projection problem is low-dimensional (equal to the action dimension) and can be solved efficiently at each control step, making it suitable for rolling online control.
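For the special case of box bounds plus a single one-step balance equality, the minimal-change projection admits a near-closed-form solution via bisection on a scalar dual variable. The sketch below illustrates this structure only; the full QP in the paper also carries ramping and inventory constraints and is solved with a general-purpose solver.

```python
import numpy as np

def project_action(a_raw, lo, hi, balance=None, tol=1e-9):
    """Minimal-change projection of a raw action onto box bounds and,
    optionally, a one-step balance constraint sum(a) == balance.

    For this structure the QP  min ||a - a_raw||^2  has the solution
    a = clip(a_raw + nu, lo, hi), where the scalar dual variable nu is
    found by bisection on the (monotone) balance residual.
    """
    if balance is None:
        return np.clip(a_raw, lo, hi)

    def residual(nu):
        return np.clip(a_raw + nu, lo, hi).sum() - balance

    nu_lo, nu_hi = -1e6, 1e6
    for _ in range(200):                 # bisection on the monotone residual
        nu = 0.5 * (nu_lo + nu_hi)
        if residual(nu) > 0:
            nu_hi = nu
        else:
            nu_lo = nu
        if nu_hi - nu_lo < tol:
            break
    return np.clip(a_raw + 0.5 * (nu_lo + nu_hi), lo, hi)
```

The projected action stays as close as possible to the raw policy output while restoring feasibility, which is the property the execution layer relies on.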
The explicit unit-level operating limits and parameter ranges used by the execution-layer projection are those already defined in Table 2 and the corresponding device equations above; they are therefore not repeated here for brevity. These standard algebraic forms are common in safe control and constrained RL implementations [29,30], while the focus of the present work is on their integration into the proposed Safety Transformer–PPO dispatch framework.
4. Results
4.1. Scenario, Data, and Experimental Protocol
The case study is based on an anonymized real-world industrial-park integrated energy system in eastern Inner Mongolia, featuring a peak electrical load of approximately 9 MW, a trough electrical load of approximately 5 MW, a peak thermal load of approximately 10 MW, and a hydrogen demand of approximately 0.8 tons per day.
Table 2 summarizes the key equipment parameters.
Comparative methods include deterministic MILP, PPO-MLP, Transformer–PPO (without safety mechanisms), and the proposed Safety Transformer–PPO. The empirical data used in this study are derived from an anonymized industrial-park operation scenario in eastern Inner Mongolia. To protect commercially sensitive information, the representative time series shown in Figure 3 are not raw plant measurements but confidentiality-preserving profiles obtained after anonymization and bounded perturbation processing. These profiles retain the main temporal characteristics of electricity load, heat load, hydrogen demand, and renewable availability; Figure 3 presents the representative anonymized typical-day profiles used to illustrate the empirical operating conditions. In this study, the MILP benchmark serves as a deterministic optimization reference under the same equipment capacities, tariff settings, and perturbation-evaluation protocol, rather than as a receding-horizon MPC or a robust/stochastic MILP benchmark.
For the learning-based methods, training is performed on a 1-month dataset composed of hourly multi-energy operational series, whereas final evaluation is conducted on disjoint out-of-sample perturbation realizations generated under a common scenario family and perturbation protocol. Here, the same perturbation protocol means a common scenario family and uncertainty-generation rule shared across methods for fair comparison, rather than reuse of identical realizations in both training and testing. The perturbations mainly include renewable forecast deviations and metering noise, introduced both to reflect practical uncertainty and to avoid disclosure of sensitive original trajectories. For confidentiality reasons, the exact perturbation realizations are not released; however, all methods are evaluated under the same equipment capacities, price settings, and perturbation-generation protocol to ensure fair comparison.
All experiments were implemented in Python 3.8 using PyTorch as the deep-learning framework. The execution-layer quadratic projection was solved using Gurobi. Experiments were conducted on a Windows-based workstation equipped with an Intel Core i7-14700KF CPU and an NVIDIA GeForce RTX 4070 SUPER GPU. Unless otherwise specified, each learning-based method was evaluated over 20 random seeds, i.e., {25, 1025, 2025, …, 19025}, and the reported results are aggregated over the corresponding out-of-sample test runs.
4.2. Comparative Analysis of Economic Efficiency and Low-Carbon Performance
Based on the typical-day comparison results in Table 3, the proposed Safety Transformer–PPO achieves the lowest total cost, 12.52 ± 0.13 × 10⁴ CNY, outperforming MILP (13.63 × 10⁴ CNY), PPO-MLP (14.36 ± 0.32 × 10⁴ CNY), and Transformer–PPO (13.38 ± 0.23 × 10⁴ CNY). This corresponds to cost reductions of approximately 8.1% relative to MILP, 12.8% relative to PPO-MLP, and 6.4% relative to Transformer–PPO. To further quantify cross-seed variability, the 95% confidence intervals of the total cost are 12.46–12.58 × 10⁴ CNY for Safety Transformer–PPO, 13.27–13.49 × 10⁴ CNY for Transformer–PPO, and 14.22–14.50 × 10⁴ CNY for PPO-MLP. These confidence intervals show that the proposed method not only attains the lowest mean total cost but also exhibits the narrowest uncertainty band among the compared learning-based methods, indicating stronger run-to-run consistency. The cost advantage is mainly attributed to lower procurement cost (1.55 ± 0.09 × 10⁴ CNY), lower gas cost (10.20 ± 0.08 × 10⁴ CNY), lower carbon cost (0.54 ± 0.03 × 10⁴ CNY), and lower penalty cost (0.06 ± 0.03 × 10⁴ CNY), which together indicate improved peak-purchase suppression, more effective renewable accommodation, and better overall low-carbon economic performance under the same system configuration.
In terms of operational feasibility, Table 3 further shows that constraint handling is a key differentiator. Transformer–PPO without safety yields a non-compliance rate of 4.8 ± 1.2%, and PPO-MLP still exhibits 0.8 ± 0.5%, whereas the proposed Safety Transformer–PPO maintains 0.0 ± 0.0%, matching the deterministic MILP result while achieving better economy. Correspondingly, the approximate 95% confidence intervals of the non-compliance rate are 4.27–5.33% for Transformer–PPO, 0.58–1.02% for PPO-MLP, and 0.00–0.00% for Safety Transformer–PPO. This suggests that the economic advantage of the proposed method is achieved without relying on hard-constraint violations and remains consistently feasible across the evaluated runs. Although its incentive expenditure is the highest among the compared methods (0.10 ± 0.03 × 10⁴ CNY), the overall effect remains favorable because the reductions in procurement, penalty, and carbon-related costs dominate. Overall, the proposed method provides the most favorable balance between economic performance and operational safety among the compared approaches.
Figure 4 shows the typical-day load transfer profile under the proposed method. The transferred load is mainly shifted out of the peak window (18:00–21:00) and compensated during off-peak hours, which indicates that the demand response is used as a settlement-eligible temporal reshaping rather than an unrealistic load “deletion”. This concentration of negative adjustments in the evening peak is consistent with the goal of suppressing peak procurement, while the rebound arranged in low-price periods reduces the likelihood of daytime disturbance, improving the verifiability and billability of DR in practical settlement.
Figure 5 presents the typical-day SOC trajectory of a battery. The SOC exhibits a clear “off-peak charging, peak discharging” pattern, and—more importantly—remains within the prescribed safety boundaries throughout the horizon, demonstrating that the safety constraints do not merely penalize violations after the fact but effectively shape deployable actions at execution. In operational terms, this SOC discipline provides the cross-period flexibility needed to support peak shaving and renewable accommodation while preventing boundary-hitting behaviors that often occur in unconstrained exploration.
Figure 6 extends the analysis to a monthly window (30 days, 720 h) and shows the distribution of load-shifting decisions over time. Compared with a fixed-rule peak-shifting strategy, the shifting periods and magnitudes vary across different dates, suggesting that the controller adjusts the DR volume in response to changing operating conditions, including renewable output, load levels, and price signals.
Figure 7 shows the monthly SOC evolution of the battery. Throughout the 30-day horizon, the SOC remains within the admissible operating bounds, and no out-of-bound event is observed. To avoid relying solely on visual inspection, two month-scale indicators are further reported for the same rollout, namely the cumulative non-compliance rate and the SOC boundary-hit frequency. The former quantifies the proportion of time steps with operational constraint violations over the monthly horizon, while the latter quantifies the proportion of time steps at which the battery SOC falls within 2% of either operating bound. In the reported monthly rollout, the cumulative non-compliance rate is 0.0%, whereas the SOC boundary-hit frequency is approximately 50%. These results show that the battery is repeatedly dispatched close to its admissible limits over the extended horizon, yet without any observed boundary violation.
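The two month-scale indicators can be computed directly from rollout logs. The sketch below assumes the 2% band is measured relative to the admissible SOC span, which is our reading of the definition above; function and variable names are illustrative.

```python
import numpy as np

def month_scale_indicators(soc, soc_min, soc_max, violations, band=0.02):
    """Cumulative non-compliance rate and SOC boundary-hit frequency over a
    monthly rollout (e.g. 720 hourly steps).

    violations: boolean array marking steps with any hard-constraint violation
    band:       near-boundary tolerance as a fraction of the admissible span
                (assumed interpretation of "within 2% of either bound")
    """
    soc = np.asarray(soc, dtype=float)
    span = soc_max - soc_min
    near_bound = (soc <= soc_min + band * span) | (soc >= soc_max - band * span)
    return float(np.mean(violations)), float(np.mean(near_bound))
```

Reporting both values together distinguishes a policy that is merely conservative (low boundary-hit frequency) from one that exploits the full admissible range without violating it.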
4.3. Training Convergence and Safety Statistics
This subsection reports the training convergence and safety statistics of the proposed Safety Transformer–PPO using learning curves and execution-layer diagnostics. Convergence is evaluated by the typical-day total cost (consistent with the cost-based reward/return definition), while safety is quantified by the hard-constraint non-compliance rate measured during execution. In addition, we report two projection-layer indicators—projection activation rate and correction magnitude—to characterize the extent to which the execution-layer feasibility projection intervenes throughout training.
Figure 8 shows the evolution of the typical-day total cost over training episodes for the three learning-based methods, namely Safety Transformer–PPO, Transformer–PPO without safety, and PPO-MLP, with the deterministic MILP result included as a reference. The solid curves denote the mean values over 20 random-seed runs, while the shaded bands indicate ±1 standard deviation. Overall, the training process exhibits a clear pattern of progressive cost reduction followed by gradual stabilization, which is typical of PPO-style policy optimization. In the early training stage, all learning-based methods remain at relatively high cost levels, reflecting exploratory behavior and limited policy quality. As training proceeds, the total cost decreases steadily and eventually approaches a stable plateau, indicating that the policy updates become progressively smaller and the agent enters a relatively stable operating regime.
Across the entire training horizon, Safety Transformer–PPO converges to the lowest mean cost level among the compared learning-based methods. The final plateau is consistent with the evaluation results reported in Table 3, where Safety Transformer–PPO achieves a total cost of 12.52 ± 0.13 × 10⁴ CNY, outperforming Transformer–PPO without safety (13.38 ± 0.23 × 10⁴ CNY) and PPO-MLP (14.36 ± 0.32 × 10⁴ CNY). In addition, the cost gap becomes visible already in the middle stage of training, suggesting that the proposed method reaches a competitive regime earlier and maintains a more favorable cost profile thereafter. Another noteworthy characteristic is that the proposed method exhibits a narrower late-stage standard-deviation band than the baseline learning methods. This pattern is consistent with the cross-seed statistics in Table 3 and indicates lower cross-run dispersion after the policy approaches convergence.
Figure 9 reports the hard-constraint non-compliance rate versus training episodes. Here, non-compliance is defined as the fraction of decision steps where any hard physical constraint is violated during execution. This metric directly reflects whether a learned policy is operationally deployable under strict security constraints. The curves show a pronounced separation between the proposed method and the baselines. Safety Transformer–PPO drives the non-compliance rate down rapidly and maintains an approximately zero level at convergence. In contrast, Transformer–PPO without safety stabilizes at a substantially higher violation exposure, and PPO-MLP converges to a smaller yet non-negligible level. Beyond the final values, the transient behavior is also informative: non-compliance is typically higher in early training and decreases as training progresses, which indicates that infeasible behaviors are more frequent during exploration and gradually diminish as the policy improves.
Figure 10 provides an execution-layer diagnostic by reporting the projection activation rate over training. The activation rate is high during early training and decreases gradually as training proceeds. This indicates that, at the beginning, the raw policy frequently proposes infeasible or near-infeasible actions, requiring frequent projection. As the policy improves, the raw action distribution becomes increasingly compatible with feasibility requirements, so fewer projection interventions are needed.
Figure 11 complements Figure 10 by reporting the average correction magnitude, measured as the L₂ norm between the projected action and the raw policy action (‖a − ã‖₂), averaged over decision steps. While the activation rate indicates how often the projection intervenes, the correction magnitude indicates how strongly it modifies the action when it does. In the curve, the correction magnitude decreases from a larger initial level to a small plateau, consistent with the decreasing activation rate in Figure 10. In practice, these two diagnostics should be interpreted together: an ideal learning outcome is characterized by both a low activation rate and a small correction magnitude, indicating that the policy itself produces feasible actions and the projection layer acts primarily as a lightweight safeguard.
The projection diagnostics are mainly intended to assess whether the learned policy systematically relies on execution-layer correction. Some degree of correction is expected during early training because of stochastic exploration and approximation errors. However, as training proceeds, both the projection activation rate and the average correction magnitude decrease markedly, suggesting that the learned policy becomes increasingly compatible with the feasible region rather than systematically relying on the projection layer after convergence.
Overall, the training curves indicate that Safety Transformer–PPO converges to a lower cost level while achieving zero hard-constraint non-compliance at convergence. The projection diagnostics further show that intervention frequency and intervention strength both decrease over training, implying that the learned policy becomes increasingly consistent with feasibility requirements.
5. Discussion
5.1. Ablation of the Dual-Layer Safety Mechanism
To isolate the contribution of each safety component, we conduct an ablation study on the proposed dual-layer safety design. We evaluate four variants: (i) full Safety Transformer–PPO (Lagrange updates + execution-layer QP projection), (ii) Lagrange-only (remove QP projection), (iii) QP-only (remove Lagrange updates), and (iv) Transformer–PPO without safety. All variants share the same Transformer–PPO backbone, training budget, uncertainty injection, and state/action definitions; only the safety components are toggled. We report the typical-day cost breakdown and the hard-constraint non-compliance rate to jointly assess economic performance and operational feasibility.
Table 4 reveals a clear cost–safety trade-off across the four variants. The full Safety Transformer–PPO achieves the lowest total cost (12.52 × 10⁴ CNY) with zero non-compliance (0.0%). In contrast, the safety-free Transformer–PPO baseline yields a higher total cost (13.38 × 10⁴ CNY) and a markedly higher violation exposure (4.8%). When the QP projection is removed (Lagrange-only), the total cost improves to 12.70 × 10⁴ CNY, but non-compliance increases to 1.6%, indicating that dual-variable regulation alone is insufficient to eliminate hard violations during execution. When the Lagrange updates are removed (QP-only), hard feasibility is preserved (0.0% non-compliance), but total cost degrades to 12.88 × 10⁴ CNY, suggesting that feasibility enforcement alone does not guarantee cost-efficient operation.
Beyond the headline totals, the component-level breakdown in Table 4 helps localize the sources of improvement. Relative to the safety-free Transformer–PPO, the full method reduces procurement cost (1.55 × 10⁴ CNY vs. 1.95 × 10⁴ CNY), gas cost (10.20 × 10⁴ CNY vs. 10.55 × 10⁴ CNY), carbon cost (0.54 × 10⁴ CNY vs. 0.60 × 10⁴ CNY), and penalty (0.06 × 10⁴ CNY vs. 0.14 × 10⁴ CNY), while keeping battery cost in a narrow range (0.06–0.07 × 10⁴ CNY). Incentive expenditure remains bounded (0.08–0.11 × 10⁴ CNY), implying that the observed cost reduction is primarily driven by improved operational decisions rather than aggressive incentive spending. Comparing the two ablations, QP-only exhibits higher carbon cost (0.60 × 10⁴ CNY) and penalty (0.07 × 10⁴ CNY) than the full method, whereas Lagrange-only attains a closer cost composition but incurs non-compliance (1.6%).
Overall, the ablation results indicate that feasibility and economic optimality are coupled but not identical objectives. Execution-layer projection is decisive for eliminating hard violations, whereas dual-variable regulation improves the quality of feasible actions, particularly by reducing procurement reliance and carbon-related cost. The combined design therefore achieves a preferable cost–safety trade-off compared with either component alone.
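The dual-variable regulation contrasted here with the projection layer can be sketched as projected dual ascent on a soft-constraint budget. The exact constrained-PPO update is not specified in this excerpt, so the learning rate, cost trajectory, and budget below are illustrative assumptions.

```python
def dual_ascent_step(lmbda, episode_cost, budget, lr=0.05):
    """Projected dual ascent for one soft constraint:
    lambda <- max(0, lambda + lr * (J_c - d)).
    The multiplier grows while the estimated constraint cost J_c exceeds the
    budget d, raising the penalty weight in the policy objective, and decays
    back toward zero (never below it) once the constraint is satisfied."""
    return max(0.0, lmbda + lr * (episode_cost - budget))

# Toy trajectory of constraint-cost estimates: violations shrink over training.
costs = [3.0, 2.0, 1.2, 0.8, 0.5]
budget = 1.0
lam = 0.0
history = []
for J_c in costs:
    lam = dual_ascent_step(lam, J_c, budget)
    history.append(lam)
```

The multiplier first rises while costs exceed the budget and then relaxes, mirroring how dual regulation steers the policy toward low-cost feasible actions rather than clipping them after the fact.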
5.2. Robustness to Forecast Error Bounds
Since the evaluation already accounts for renewable forecast errors and metering noise, we further test robustness by scaling the renewable forecast error bound to ε ∈ {±5%, ±10%, ±15%} while keeping capacities and price settings unchanged. For strict comparability with the Results section, the ε = ±10% case matches the typical-day evaluation setting, and Table 5 reports the corresponding cost components together with the non-compliance rate.
As shown in Table 5, the total cost increases monotonically as uncertainty grows, from 12.40 × 10^4 CNY at ε = ±5% to 12.74 × 10^4 CNY at ε = ±15%. The increase is primarily driven by higher procurement cost (1.50 → 1.65), carbon cost (0.52 → 0.58), and penalty cost (0.04 → 0.09), all in 10^4 CNY, while gas cost changes only slightly (10.18 → 10.24) and battery cost remains nearly invariant (0.07). Incentive expenditure shows a mild increase (0.09 → 0.11), consistent with a larger verified response under stronger perturbations in the settlement calculation.
Across all tested uncertainty levels, the non-compliance rate remains 0.0%, indicating that the execution-layer safety projection preserves hard feasibility even when forecast perturbations widen. From an operational viewpoint, this means robustness is reflected not only by moderate cost degradation but also by guaranteed constraint satisfaction. Nevertheless, the economic deterioration under larger ε suggests that very high uncertainty may require retraining under broader disturbance distributions and/or introducing conservative feasibility margins in the projection constraints.
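The bounded-perturbation protocol behind this sweep can be sketched as multiplicative forecast errors drawn inside the bound ε. The uniform error distribution, forecast values, and seeding below are illustrative assumptions rather than the paper's exact disturbance model.

```python
import numpy as np

def realize_renewables(forecast, eps, rng):
    """Sample realized renewable output with a multiplicative forecast error
    bounded by +/- eps (e.g. eps = 0.10 for the ±10% case)."""
    err = rng.uniform(-eps, eps, size=forecast.shape)
    return forecast * (1.0 + err)

rng = np.random.default_rng(0)
forecast = np.array([80.0, 120.0, 60.0])  # hypothetical PV/wind forecast, kW
for eps in (0.05, 0.10, 0.15):
    realized = realize_renewables(forecast, eps, rng)
    # Realized output always stays inside the stated error bound.
    assert np.all(np.abs(realized / forecast - 1.0) <= eps)
```

Widening ε enlarges the realized-output envelope the dispatcher must absorb, which is the mechanism behind the monotone cost growth in Table 5; the safety projection then accounts for the unchanged 0.0% non-compliance.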
5.3. Quantitative Comparison with Related Studies
To better position the proposed method within the recent literature, two levels of comparison should be distinguished. First, Table 3 already provides a controlled same-case benchmark by reproducing representative baseline methods under identical system settings, perturbation protocols, and evaluation metrics. Second, Table 6 below offers a literature-level quantitative comparison with representative recent studies. Because the compared studies differ in case scale, device portfolio, tariff setting, uncertainty assumptions, and benchmark definition, this comparison is intended as a relative positioning analysis rather than a strict ranking of absolute operating performance.
As shown in Table 6, existing studies have reported meaningful economic and low-carbon improvements in hydrogen-related integrated energy systems, but most do not explicitly combine settlement-oriented DR execution with execution-layer hard-feasibility protection. In this sense, the distinctive contribution of the present work lies not in claiming the largest reported cost-reduction percentage across heterogeneous cases, but in jointly achieving economic improvement, settlement-aware DR implementation, and zero observed hard-constraint non-compliance within a unified sequential decision framework. Therefore, Table 6 should be interpreted as a literature-level positioning analysis, whereas the controlled same-case method comparison is provided in Table 3.
5.4. Limitations
Although the proposed Safety Transformer–PPO shows favorable economic performance, hard-constraint feasibility, and robustness within the tested settings, several limitations should be noted. First, the current validation is conducted on a single anonymized industrial-park configuration; therefore, the reported results should be interpreted as evidence of effectiveness for this class of electro–heat–hydrogen dispatch problems rather than as a guarantee of direct transferability to parks with substantially different device portfolios, tariff mechanisms, demand structures, or hydrogen-consumption patterns. Second, the method remains sensitive to data quality and settlement design, especially baseline-definition choices, metering reliability, and disturbance distributions, all of which may affect verified response quantities, incentive-ledger evolution, and economic performance. Finally, real deployment would also require stable integration with plant-level EMS/SCADA infrastructure, reliable M&V pipelines, and periodic model maintenance under seasonal variation or distribution shifts. These issues define important directions for future work on broader cross-site validation, stronger interpretability, and deployment-oriented system integration.
It should be noted that the present study adopts cost parameters for hydrogen- and heat-related technologies that are held fixed in the tested dispatch setting. In practical industrial deployment, however, these costs may evolve with technology learning, market maturity, and economies of scale, which could in turn affect the relative utilization of hydrogen conversion, storage, and heat-supply units. The reported economic results should therefore be interpreted under the current parameter setting rather than as a prediction of future technology-cost trajectories. Future work will incorporate cost-learning scenarios and scale-dependent parameter settings for hydrogen- and heat-related equipment to further assess their impact on dispatch performance and economic competitiveness.