Article

Low-Carbon Economic Dispatch and Settable Incentive-Based Demand Response for Integrated Electro–Heat–Hydrogen Energy Systems Based on Safety Transformer–PPO

1
Inner Mongolia Power (Group) Co., Ltd., Baotou 014000, China
2
School of Electrical Engineering and Automation, Tianjin University of Technology, Tianjin 300384, China
*
Author to whom correspondence should be addressed.
Energies 2026, 19(6), 1578; https://doi.org/10.3390/en19061578
Submission received: 4 March 2026 / Revised: 19 March 2026 / Accepted: 20 March 2026 / Published: 23 March 2026
(This article belongs to the Special Issue Energy Systems: Optimization, Modeling, and Simulation)

Abstract

This paper proposes a safety-constrained Transformer–PPO framework for low-carbon economic dispatch with settable incentive-based demand response (DR) in wind–PV integrated electro–thermal–hydrogen industrial-park energy systems. Hydrogen is modeled as exogenous hydrogen-domain demand and is satisfied through electrolyzer production and hydrogen inventory dynamics. A causal Transformer captures long-horizon multi-energy coupling and intertemporal constraints and is trained with PPO under uncertainty. A dual-layer safety mechanism combines dual-variable (Lagrange multiplier) updates for statistical constraints with an execution-layer quadratic-programming action projection to enforce hard physical constraints, including operating limits, ramping, battery SOC, hydrogen inventory bounds, and energy balance. Baseline–verification–settlement rules and budget-ledger states are embedded to ensure verifiable response quantities and settlement outcomes that are traceable and independently recompilable. Case studies on a real industrial-park scenario in Inner Mongolia show reduced peak-hour maximum grid purchase demand and constraint violations, together with lower total cost, carbon cost, and curtailment penalties versus MILP, PPO-MLP, and Transformer–PPO without safety mechanisms.

1. Introduction

Against the backdrop of high penetration of wind and solar grid integration alongside the rapid expansion of the hydrogen economy, integrated energy systems (IES) within industrial parks increasingly require the coordination of electricity, heat energy and hydrogen under conditions of intense multi-energy coupling, significant intertemporal dynamics and high uncertainty [1,2]. Hydrogen energy, particularly through electrolyzer and hydrogen storage systems, introduces an electricity-to-hydrogen coupling pathway [3]. Its demand side is driven by exogenous hydrogen sector requirements, necessitating dynamic fulfilment through hydrogen production and inventory management. Conversion and storage equipment collectively determine viable operational zones and carbon emission outcomes [4].
In parallel, incentive-based demand response (DR) has been increasingly framed as a settlement-grade resource rather than a purely price-guided flexibility option [5]. Demand response has also been applied in different industrial contexts. For instance, Iris and Lam [6] investigated demand response and energy management in seaports under renewable uncertainty, showing that DR is relevant not only to conventional power systems but also to broader industry-specific energy operations. In such programs, baseline definition, measurement & verification (M&V), and auditable settlement rules directly determine the correctness of realized reduction and the exposure of incentive payments [7]. Valentini et al. [8] used a structured review to summarize customer baseline load (CBL) estimation methods and showed that baseline uncertainty and methodological choices can materially bias impact evaluation and settlement results. Li et al. [9] used an application-requirement-driven review to categorize baseline estimation techniques for incentive-based DR and highlighted that data availability, behind-the-meter behaviors, and multi-service participation can undermine settlement accuracy and controllability. Ellman and Xiao [10] used a multi-stage stochastic dynamic programming model to address incentives for baseline manipulation under uncertain event schedules, demonstrating that settlement rules can induce strategic behavior beyond intended load curtailment. Wang et al. [11] used an MDP-based analytical framework to address baseline manipulation in baseline-based DR programs and derived structural insights on customers’ underconsumption/overconsumption strategies, which distort settlement fairness. Qi et al. [12] used probabilistic baseline prediction to address settlement-sensitive load reduction potential assessment, showing that uncertainty-aware baseline modeling is essential for contract execution and payment credibility. Šikšnys et al. [13] used a two-stage decision-oriented baseline selection framework to address practical baseline choice under real consumption data, emphasizing that auditable baselines require both technical feasibility screening and performance evaluation.
Collectively, these studies indicate that baseline definition, verification of realized response, and payment recalculation rules directly determine settlement correctness and incentive exposure, while budget controllability becomes an explicit operational requirement, thereby elevating dispatch from single-objective cost minimization to a coupled coordination task that must simultaneously deliver low-carbon economic performance and settlement feasibility under strict operational security boundaries [14,15,16]. In settlement-oriented incentive-based demand response, dispatch decisions are no longer evaluated solely by operational cost but also by the auditability of realized reductions and the controllability of incentive exposure. Baseline definitions, M&V procedures, and payment recalculation rules induce an implicit “settlement feasibility region”: a schedule that is physically feasible may still be non-settleable if the verified reduction cannot be reconstructed from metering data, or if the incentive budget can be exceeded under the prescribed accounting rules. This coupling makes economic dispatch a joint optimization of multi-energy operation and settlement integrity, where operational security constraints and settlement eligibility must be jointly satisfied.
Deterministic MILP formulations remain effective for structured constraint representation in IES; however, their online performance is often sensitive to forecast errors and modeling deviations under renewable variability and device dynamics. Ma et al. [17] used data-driven distributionally robust optimization to address source–load uncertainty in electric–thermal–hydrogen IES scheduling, illustrating the need for uncertainty-aware formulations beyond deterministic MILP. Cuisinier et al. [18] used extended rolling-horizon optimization to address the accumulation of forecast errors in operational planning, confirming that receding-horizon updates are critical when predictions degrade over time. Fernández et al. [19] used a two-stage deterministic EMS architecture (rolling-horizon planning and fast local adaptation) to address forecast-error impacts on objective values, further motivating closed-loop corrective control. In parallel, DRL provides an alternative by learning closed-loop policies through interaction. Liang et al. [20] used deep RL (SAC) to address real-time optimal scheduling in integrated energy systems, demonstrating improved renewable utilization with online policy control. Liu et al. [21] used a data-driven DRL scheduling framework to address coordinated dispatch in integrated electricity–heat–gas–hydrogen systems with demand-side flexibility. Li et al. [22] used a safe DRL scheduling approach (AutoML-enhanced safe RL with forecasting and DR) to address constraint-aware IES scheduling under renewable uncertainty. Prabawa and Choi [23] used safe-DRL-assisted two-stage energy management to address operational security in active distribution networks with hydrogen fueling stations. Nevertheless, despite the growing body of research on DRL-based scheduling for integrated energy systems, two key limitations remain insufficiently addressed. 
First, most existing studies treat demand response primarily as an operational flexibility resource, but do not explicitly incorporate settlement-grade mechanisms such as baseline reconstruction, measurement & verification (M&V), payment recalculation, and incentive-budget ledger evolution into the dispatch state and transition process. As a result, a schedule that is physically feasible and economically attractive in simulation may still be non-settleable in practice if the realized reduction cannot be independently verified or if the incentive expenditure exceeds the prescribed accounting rules. Second, although safe reinforcement learning has been introduced to improve constraint awareness, existing approaches mainly focus on physical feasibility and rarely address the joint requirement of operational security and settlement eligibility in incentive-based demand response. In multi-constraint electro–heat–hydrogen dispatch, this omission is particularly critical because settlement-grade DR introduces additional intertemporal states beyond energy storage, including baseline windows, verified response histories, and remaining incentive budgets. Ignoring these variables can lead to policies that appear cost-effective during training but fail in deployment due to hidden ledger violations and non-auditable response outcomes.
To close this gap, this paper proposes a Safety Transformer–PPO framework for low-carbon economic dispatch with settlement-oriented incentive-based demand response in integrated electro–heat–hydrogen energy systems. The main contributions are threefold. First, a settlement-aware dispatch model is established by explicitly embedding baseline–verification–settlement rules and budget-ledger states into the environment, so that feasibility is defined jointly by physical operability and settlement eligibility. Second, a causal Transformer–PPO architecture is developed to capture long-horizon temporal dependencies induced by multi-energy coupling, renewable uncertainty, and intertemporal ledger evolution. Third, a dual-layer safety mechanism is introduced, in which Lagrange-based statistical constraint regulation is combined with execution-layer quadratic-programming projection to enforce hard physical feasibility during deployment. In this way, the proposed framework moves beyond conventional cost-oriented DRL dispatch and provides an audit-ready, safety-constrained, and settlement-compatible solution for industrial-park integrated energy management.

2. System Architecture and Low-Carbon Economic Dispatch Model

2.1. Electricity–Heat–Hydrogen Coupling System Configuration

This section constructs an IES model for electricity–heat–hydrogen systems tailored to scenarios with high penetration of wind and solar power, targeting a specific industrial park in eastern Inner Mongolia, China. This model facilitates coordinated energy supply and flexible dispatch. Figure 1 illustrates the energy supply structure and its coupling relationships, where the park delivers energy to end-users through an electricity–heat–hydrogen coupling chain.

2.2. Energy Balance and Equipment Constraints

To ensure physical consistency and settlement capability, this section presents the balance relationships for the electricity, heat, and hydrogen subsystems, along with the energy conversion models and operational constraints for each controllable unit [24].
Electrical-side power balance:
$$P_t^{g} + P_t^{pv} + P_t^{wind} + P_t^{gt} + P_t^{dis} = L_t^{e} + P_t^{el} + P_t^{eb} + P_t^{ch} + P_t^{curt}$$
where $P_t^{g}$ is the grid exchange power (purchase or sale), $P_t^{wind}$ and $P_t^{pv}$ are the wind and solar outputs, $P_t^{gt}$ is the gas turbine generation power, $P_t^{el}$ is the electrolyzer power consumption, $P_t^{eb}$ is the electric boiler power consumption, $P_t^{ch}$ and $P_t^{dis}$ are the battery charging and discharging powers, $L_t^{e}$ is the demand response-adjusted electrical load, and $P_t^{curt}$ is the renewable energy curtailment.
Heat-side balance:
$$Q_t^{gt} + Q_t^{gb} + Q_t^{eb} = D_t^{h} + Q_t^{loss}$$
where $Q_t^{gt}$ represents the heat recovered from the gas turbine via the waste heat boiler, $Q_t^{gb}$ and $Q_t^{eb}$ denote the heat generated by the gas boiler and the electric boiler, respectively, $D_t^{h}$ is the heat demand, and $Q_t^{loss}$ is the heat loss.
Hydrogen-side inventory dynamics:
$$S_{t+1}^{H_2} = S_t^{H_2} + H_t^{prod} - D_t^{H_2} - H_t^{vent}$$
where $S_t^{H_2}$ denotes the hydrogen inventory held in storage, $H_t^{prod}$ the electrolyzer hydrogen output, $D_t^{H_2}$ the hydrogen consumption, and $H_t^{vent}$ the hydrogen lost to venting and self-consumption.
Electrolytic hydrogen production is modeled as a linear function of electricity consumption:
$$H_t^{prod} = \alpha_{el} P_t^{el} \Delta t$$
Waste-heat recovery is approximated using the heat-to-power ratio:
$$Q_t^{gt} = \kappa_{whb} P_t^{gt}$$
Generation capacity limits and ramping capability of the gas turbine:
$$\underline{P}^{gt} \le P_t^{gt} \le \overline{P}^{gt}, \quad \left| P_t^{gt} - P_{t-1}^{gt} \right| \le R^{gt}$$
where $\underline{P}^{gt}$ and $\overline{P}^{gt}$ denote the lower and upper power limits, and $R^{gt}$ represents the ramping limit.
Gas consumption follows from the electrical output and the generation efficiency:
$$G_t^{gt} = \frac{P_t^{gt} \Delta t}{\eta_{gt} \, LHV}$$
where $LHV$ denotes the lower heating value, and $\eta_{gt}$ represents the gas turbine efficiency.
Gas boilers are subject to constraints similar to those of electric boilers, which are not repeated here. The battery state of charge evolves as:
$$SOC_{t+1} = SOC_t + \frac{\eta_{ch} P_t^{ch} \Delta t}{E_{bat}} - \frac{P_t^{dis} \Delta t}{\eta_{dis} E_{bat}}$$
To prevent simultaneous charging and discharging within the same period, mutual exclusion and power bounds are imposed in the action projection:
$$P_t^{ch} \cdot P_t^{dis} = 0$$
$$0 \le P_t^{ch} \le \overline{P}^{ch}, \quad 0 \le P_t^{dis} \le \overline{P}^{dis}$$
Grid electricity purchase and sale bounds:
$$0 \le P_t^{g+} \le \overline{P}^{imp}, \quad 0 \le P_t^{g-} \le \overline{P}^{exp}$$
where $P_t^{g+}$ and $P_t^{g-}$ denote electricity purchases and sales, and $\overline{P}^{imp}$ and $\overline{P}^{exp}$ represent the interconnection line upper limits.
Hydrogen storage safety bounds:
$$\underline{S}^{H_2} \le S_t^{H_2} \le \overline{S}^{H_2}$$
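The battery and hydrogen-inventory updates above can be sketched as one-step transition functions. This is a minimal illustration, not the authors' implementation: the class name `StorageParams`, the function names, and every numeric default are hypothetical placeholders rather than values from Table 2.

```python
from dataclasses import dataclass


@dataclass
class StorageParams:
    # All defaults are hypothetical placeholders, not values from the paper.
    eta_ch: float = 0.95    # battery charging efficiency
    eta_dis: float = 0.95   # battery discharging efficiency
    e_bat: float = 4.0      # battery energy capacity, MWh
    alpha_el: float = 20.0  # hydrogen yield, kg per MWh of electrolyzer input
    dt: float = 1.0         # step length, h


def step_soc(soc, p_ch, p_dis, p):
    """Battery update: SOC rises with charging and falls with discharging."""
    if p_ch > 0 and p_dis > 0:
        raise ValueError("simultaneous charging and discharging is not allowed")
    return soc + p.eta_ch * p_ch * p.dt / p.e_bat - p_dis * p.dt / (p.eta_dis * p.e_bat)


def step_h2(s_h2, p_el, d_h2, h_vent, p):
    """Hydrogen inventory update with linear electrolyzer production."""
    h_prod = p.alpha_el * p_el * p.dt
    return s_h2 + h_prod - d_h2 - h_vent
```

Checking the returned values against the SOC and inventory bounds is left to the caller, which mirrors the way the paper delegates bound enforcement to the action projection.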

2.3. Low-Carbon Economic Objective Function

To establish a unified metric for evaluating the costs incurred by the industrial park in energy procurement and sales, equipment operation, curtailment penalties/comfort penalties, and carbon emissions, the following comprehensive objective function is first presented [25]:
$$J = \sum_{t=1}^{T} \left[ c_t^{buy} P_t^{g+} - c_t^{sell} P_t^{g-} + c_{gas} \left( G_t^{gt} + G_t^{gb} \right) + c_{curt} P_t^{curt} + c_{bat} \left( P_t^{ch} + P_t^{dis} \right) + Pay_t \right] + C_{CO_2}$$
where $c_{bat}$ denotes the equivalent battery lifetime depreciation cost coefficient (approximated as the depreciation cost per 1 kWh charged and discharged), $Pay_t$ represents demand response settlement payments, and $c_{curt}$ is the curtailment penalty coefficient. Other unit operational and maintenance costs are treated as negligible and are disregarded.
The carbon cost is calculated as ‘carbon price × excess emissions’:
$$C_{CO_2} = \pi_{CO_2} \max \left( 0, \; \sum_{t} \left[ \gamma_{g} P_t^{g+} + \gamma_{gas} \left( G_t^{gt} + G_t^{gb} \right) \right] - E_{quota} \right)$$
where $\gamma_{g}$ and $\gamma_{gas}$ denote the emission factors for electricity and gas consumption, respectively, and $E_{quota}$ represents the daily quota.
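As a small worked illustration of the carbon-cost rule, the sketch below charges only the emissions above the daily quota. The function name `carbon_cost` and the numbers in the usage note are hypothetical; `gas` is assumed to hold the combined gas turbine and gas boiler consumption per step.

```python
def carbon_cost(p_buy, gas, gamma_g, gamma_gas, e_quota, price):
    """Carbon price times emissions above the daily quota (zero if below)."""
    emissions = sum(gamma_g * pb + gamma_gas * g for pb, g in zip(p_buy, gas))
    return price * max(0.0, emissions - e_quota)
```

With purchases `[1.0, 2.0]`, gas use `[10.0, 10.0]`, factors 0.5 and 0.2, and a quota of 5, the emissions total 5.5 and only the 0.5 excess is priced. Raising the quota above 5.5 makes the carbon cost zero, which is the kink introduced by the `max` term.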

2.4. Settlement-Eligible DR Modeling

To ensure DR is settlement-eligible, we model the baseline definition, execution, verification, and accounting as explicit system dynamics [26]. In the present formulation, DR is modeled at the park-aggregated controllable-load level rather than through an individually calibrated customer behavioral function. This operator-centric formulation is intended for contract-based, centrally coordinated industrial-park DR, where an agreed transferable-load envelope is schedulable at the aggregator level rather than through decentralized customer price elasticity. Let $L_t^{raw}$ denote the underlying electricity demand without DR action, $\Delta L_t^{req}$ the requested aggregate load-shifting amount issued by the operator, and $L_t^{meter}$ the realized metered demand after DR execution, with:
$$L_t^{meter} = \mathrm{clip} \left( L_t^{raw} - \Delta L_t^{req}, \; L_t^{\min}, \; L_t^{\max} \right)$$
Therefore, $L_t^{meter}$ is not directly selected by the agent, but is the realized post-DR outcome within the transferable-load envelope. The incentive price $\rho_t$ determines the settlement payment level, whereas $\Delta L_t^{req}$ determines the requested load-shifting quantity. The baseline $L_t^{base}$ is computed from a historical window $W_t$ through an operator $\mathcal{B}$:
$$L_t^{base} = \mathcal{B} \left( \{ L_\tau^{meter} \}_{\tau \in W_t} \right)$$
where $\mathcal{B}$ can represent commonly used baseline families and is kept fixed during evaluation for auditability. Thus, the verified reduction used for settlement is computed from the realized metered demand rather than being directly engineered as a control action.
The historical window $W_t$ is fully pre-decision and consists of fixed metered observations preceding the current dispatch horizon; it is not updated using simulated loads generated within the current episode. Hence, the agent cannot manipulate the current baseline $L_t^{base}$ through its within-horizon actions. After the baseline is fixed, the clipping rule bounds settlement-eligible verified reduction, and the rebound constraints require energy compensation within the rebound window, thereby constraining strategic within-horizon profile distortion.
The verified reduction is computed using a measurement & verification rule with clipping and non-negativity:
$$L_t^{shift} = \mathrm{clip} \left( L_t^{base} - L_t^{meter}, \; 0, \; \overline{L}_t^{shift} \right)$$
To reflect practical “rebound” and comfort considerations, we enforce intertemporal consistency for shiftable energy. Over a dispatch horizon $\mathcal{T}$, the shifted energy is required to be compensated within a rebound window $\mathcal{R}(t)$:
$$\sum_{t \in \mathcal{T}} L_t^{shift} \Delta t \le \overline{E}^{shift}, \quad L_t^{meter} \ge L_t^{\min}, \quad L_t^{meter} \le L_t^{\max}$$
$$\sum_{\tau \in \mathcal{R}(t)} \left( L_\tau^{meter} - L_\tau^{raw} \right) \Delta t = \sum_{\tau \in \mathcal{R}(t)} L_\tau^{shift} \Delta t$$
which guarantees that DR primarily reshapes the load temporally rather than unrealistically eliminating energy demand. Settlement payment is computed as:
$$Pay_t = \rho_t L_t^{shift} \Delta t$$
and the incentive budget ledger evolves as:
$$B_{t+1}^{rem} = B_t^{rem} - Pay_t, \quad 0 \le B_t^{rem} \le \overline{B}$$
thereby enforcing budget controllability. These settlement dynamics are included in the environment transition so that feasibility is defined jointly by physical constraints and settlement/accounting constraints.
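The execution, verification, payment, and ledger rules of this subsection can be chained into a single settlement step, sketched below. The function `settle_step` and its argument names are hypothetical. One point is an assumption beyond the text: when a payment would exceed the remaining budget, the sketch caps the payment so the ledger stays non-negative, whereas the paper states the non-negativity only as a constraint.

```python
def clip(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def settle_step(l_raw, dl_req, l_min, l_max, l_base, shift_cap, rho, budget, dt=1.0):
    """One settlement step: metered load, verified reduction, payment, ledger."""
    # Realized metered demand inside the transferable-load envelope.
    l_meter = clip(l_raw - dl_req, l_min, l_max)
    # Verified reduction: non-negative and capped at the eligible maximum.
    l_shift = clip(l_base - l_meter, 0.0, shift_cap)
    pay = rho * l_shift * dt
    # Assumption: cap the payment at the remaining budget (not stated in the paper).
    pay = min(pay, budget)
    return l_meter, l_shift, pay, budget - pay
```

For example, a 1.5 MW request against a 9 MW raw load with an 8.8 MW baseline and a 1 MW eligibility cap yields a metered load of 7.5 MW but a verified reduction of only 1 MW, so the payment is based on the capped quantity rather than on the request.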

3. Safety Transformer–PPO with Integrated Settable IDR Approach

3.1. MDP Modelling: States, Actions and Ledger Variables

The scheduling problem is formulated as a Markov Decision Process (MDP) [27]. To ensure settlement consistency and safe convergence, the state $s_t$ explicitly incorporates inter-period energy states and settlement ledger variables alongside load, generation capacity, and price signals. The state adopted in this paper can be represented as:
$$s_t = \{ L_t^{e}, D_t^{h}, D_t^{H_2}, P_t^{pv}, P_t^{wind}, c_t^{buy}, c_t^{sell}, \pi_{CO_2}, P_t^{gt}, Q_t^{gb}, SOC_t, S_t^{H_2}, B_t^{rem}, T \}$$
where $B_t^{rem}$ represents the remaining incentive budget for load transfer.
The action is a continuous vector that balances supply-side scheduling with demand-side incentive deployment:
$$a_t = \{ P_t^{gt}, P_t^{el}, P_t^{eb}, Q_t^{gb}, P_t^{ch}, P_t^{dis}, \rho_t, \Delta L_t^{req} \}$$
where $\rho_t$ governs settlement payment and incentive-budget consumption and $\Delta L_t^{req}$ determines the requested physical load-shifting quantity within the transferable-load envelope.
The instantaneous reward is set to the negative cost:
$$r_t = -J - C_{CO_2}$$
For readability, the main state and action variables, together with their units and bounds, are compactly summarized in Table 1.

3.2. Causal Transformer–PPO Architecture

To capture long-horizon coupling induced by ramping limits, multi-energy storage dynamics, renewable uncertainty, and settlement ledger evolution, we adopt a causal Transformer as the sequence encoder inside an actor–critic PPO [28] framework. The overall architecture is illustrated in Figure 2. Unlike a feedforward policy that maps only the current observation to an action, the proposed encoder explicitly models the most recent operating trajectory, which is essential for dispatch problems where feasible and economic decisions depend on temporal context.
The policy and value networks are conditioned on a sliding window of the most recent $T = 24$ hourly observations, which is designed to capture the dominant daily periodicity in loads, renewable availability, and price signals while remaining lightweight for online deployment. Each hourly observation aggregates heterogeneous physical and economic variables. To avoid scale imbalance across these heterogeneous inputs and to stabilize optimization, all features are normalized using training-set statistics and then mapped through a learned linear projection into a shared latent space with embedding dimension $d = 128$. The resulting token sequence is processed by a compact Transformer encoder composed of $N = 3$ stacked blocks. Each block uses four-head self-attention and a position-wise feedforward network with hidden size 256, together with residual connections and layer normalization to support stable gradient flow and improve generalization. Importantly, we apply a strict causal attention mask so that the representation at the current hour attends only to the available history within the window, i.e., the interval spanning from $t - T + 1$ to $t$. This prevents any information leakage from future steps during training and evaluation, and ensures that the learned policy remains consistent with real-time operation where future realizations are not observable. After encoding, the representation of the final token (corresponding to the current hour) is used as a compact context vector summarizing the recent system trajectory, and it is passed to the actor–critic heads.
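The effect of the causal mask can be illustrated without a deep-learning framework. The sketch below builds a lower-triangular mask and applies it to one row of attention scores; the function names are hypothetical and the code shows only the masking principle, not the paper's encoder.

```python
import math


def causal_mask(window):
    """mask[i][j] is True when token i may attend to token j, i.e. j <= i."""
    return [[j <= i for j in range(window)] for i in range(window)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; masked positions get zero weight."""
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]
```

In a window of length 24, row 23 of the mask is entirely `True`, so the final token (the current hour) sees the whole window, while earlier tokens never receive weight from later hours regardless of how large those scores are.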
On top of the causal Transformer encoder, the actor is implemented as a lightweight two-layer MLP (128 → 128) that outputs the parameters of a diagonal Gaussian policy for continuous controls, enabling stochastic exploration during training while maintaining a simple and scalable action distribution. The sampled action is then passed through a squashing and affine scaling procedure to enforce element-wise actuator bounds before it enters the safety layer described in Section 3.3. This separation is intentional: the actor focuses on producing a high-quality raw control signal in a normalized action space, while feasibility with respect to hard operational constraints is enforced downstream by the safety mechanism. The critic shares the same causal Transformer encoder to form a consistent representation of the recent trajectory and reduce computational overhead. A separate two-layer MLP (128 → 128 → 1) maps the context vector to a scalar state-value estimate, which is used for advantage estimation and policy updates. Sharing the encoder between actor and critic improves sample efficiency and stabilizes training, while keeping the heads separate prevents interference between action generation and value regression.
Training is performed using PPO with the standard clipped surrogate objective and generalized advantage estimation (GAE) to balance bias and variance in policy gradients. For reproducibility, we fix all key hyperparameters across experiments: discount factor $\gamma = 0.99$, GAE parameter $\lambda = 0.95$, clipping coefficient 0.2, entropy coefficient 0.01, value-loss coefficient 0.5, and Adam learning rate $3 \times 10^{-4}$. Each policy update uses a mini-batch size of 256 and runs for 10 epochs over collected rollouts. The rollout length is set to 2048 steps, which provides sufficiently diverse on-policy trajectories for stable optimization without excessively delaying updates. In total, training proceeds for $3 \times 10^{5}$ interaction steps. This configuration offers a pragmatic trade-off between stability and computational cost, and it is kept identical across baselines to ensure that performance differences are attributable to architectural and safety-design choices rather than tuning artifacts.
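For readers less familiar with PPO, the two quantities named above can be written in a few lines. This is a generic textbook-style sketch using the stated defaults (discount 0.99, GAE parameter 0.95, clipping coefficient 0.2); the function names are hypothetical, and the rollout is assumed to contain no terminal states.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout (no terminal states)."""
    adv = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The clipping removes the incentive to push the probability ratio beyond 1 ± 0.2 in the direction that would improve the objective, which is what keeps each policy update close to the data-collecting policy.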

3.3. Dual-Layer Safety Mechanisms

This paper categorizes constraints into statistical constraints and hard physical constraints for separate treatment. Statistical constraints include the incentive budget, the maximum power curtailment rate, the desired emission cap, and the maximum average power purchase rate. Each is represented by a constraint cost $c_t^{(k)}$ and handled with the Lagrangian primal–dual approach:
$$\max_{\pi} \; \mathbb{E} \left[ \sum_{t} r_t \right] \quad \text{s.t.} \quad \mathbb{E} \left[ \sum_{t} c_t^{(k)} \right] \le d_k$$
where $d_k$ denotes the upper bound of the $k$-th constraint. The dual variables are updated as:
$$\lambda_k \leftarrow \left[ \lambda_k + \eta_k \left( \hat{C}_t^{(k)} - d_k \right) \right]_{+}$$
where $\lambda_k$ denotes the Lagrange multiplier (dual variable) corresponding to the $k$-th constraint, $\eta_k$ represents the step size for the dual variable, $\hat{C}_t^{(k)}$ signifies the sample estimate of the constraint cost, and $[\cdot]_{+}$ denotes projection onto the non-negative reals.
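The dual-variable update is a projected gradient-ascent step on the multiplier, which the following sketch makes explicit. The function names are hypothetical. The second function shows one common way to feed the multipliers back into the policy update, by subtracting the weighted constraint costs from the reward; that shaping is an assumption here, since the text does not spell out the primal step.

```python
def dual_update(lam, cost_estimate, limit, step_size):
    """Projected dual ascent: lam grows while the constraint is violated."""
    return max(0.0, lam + step_size * (cost_estimate - limit))


def penalized_reward(reward, costs, lams):
    """Lagrangian reward (assumed form): r minus the sum of lam_k * c_k."""
    return reward - sum(l * c for l, c in zip(lams, costs))
```

When the estimated cost sits below its limit, the multiplier decays and is held at zero by the projection, so a satisfied constraint stops influencing the policy.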
Hard physical constraints encompass power upper and lower bounds, turbine/boiler ramping, battery state of charge, hydrogen storage inventory, safety venting, and energy balance. To prevent training collapse due to excessive boundary violations during exploration, this paper introduces an action projection onto the feasible region at the execution layer, which solves a quadratic program that modifies the sampled policy action as little as possible subject to the hard constraints:
$$a_t^{*} = \arg\min_{a} \left\| a - a_t \right\|^{2}$$
where $a_t$ here denotes the sampled policy action. At execution, the raw policy action $\tilde{a}_t$ may violate hard constraints due to stochastic exploration or approximation errors. We therefore compute the deployed action $a_t$ by solving a quadratic projection problem that minimally modifies $\tilde{a}_t$ while satisfying a convex approximation of the feasible set. A standard form is:
$$\min_{a_t, \, \zeta_t} \; \frac{1}{2} \left\| a_t - \tilde{a}_t \right\|_2^2 + \beta \left\| \zeta_t \right\|_1 \quad \text{s.t.} \quad A_t a_t \le b_t + \zeta_t, \quad \zeta_t \ge 0$$
where $A_t a_t \le b_t$ encodes the standard unit-level bounds and consistency constraints defined by Table 2 and the corresponding device equations above, including ramping limits, SOC and hydrogen-inventory bounds, interconnection limits, and one-step linearized balance constraints. The slack variable $\zeta_t$ is introduced only for numerical robustness, with a large penalty factor $\beta$ to strongly discourage violations. In implementation, mutual exclusivity between battery charging and discharging is enforced by rule-based gating before the QP projection, so that only one of $\overline{P}^{ch}$, $\overline{P}^{dis}$ can remain active at each step. The projection problem is low-dimensional (equal to the action dimension) and can be solved efficiently at each control step, making it suitable for rolling online control.
The explicit unit-level operating limits and parameter ranges used by the execution-layer projection are those already defined in Table 2 and the corresponding device equations above; they are therefore not repeated here for brevity. These standard algebraic forms are common in safe control and constrained RL implementations [29,30], while the focus of the present work is on their integration into the proposed Safety Transformer–PPO dispatch framework.
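A deliberately simplified sketch of the execution-layer idea follows. It covers only constraints that act on a single variable as an interval, for which the Euclidean projection reduces to clipping; the coupled constraints (energy balance, SOC and inventory bounds across devices) would still require a QP solver as described in the text. The function names are hypothetical, and the net-direction gating rule is an assumed example because the paper does not specify its gating rule.

```python
def project_gt_power(p_raw, p_prev, p_min, p_max, ramp):
    """Project a raw gas-turbine setpoint onto its capacity and ramp interval."""
    lo = max(p_min, p_prev - ramp)
    hi = min(p_max, p_prev + ramp)
    if lo > hi:
        raise ValueError("capacity and ramp limits are inconsistent")
    # For an interval, the closest feasible point is obtained by clipping.
    return max(lo, min(hi, p_raw))


def gate_battery(p_ch, p_dis, p_ch_max, p_dis_max):
    """Assumed gating rule: keep the net direction, then apply the power limit."""
    net = p_dis - p_ch
    if net >= 0:
        return 0.0, min(net, p_dis_max)
    return min(-net, p_ch_max), 0.0
```

Applying the gating before the projection keeps the remaining problem convex, since the bilinear condition that forbids simultaneous charging and discharging is resolved by rule rather than passed to the solver.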

4. Results

4.1. Scenario, Data, and Experimental Protocol

The case study is based on an anonymized real-world industrial-park integrated energy system in eastern Inner Mongolia, featuring a peak electrical load of approximately 9 MW, a trough electrical load of approximately 5 MW, a peak thermal load of approximately 10 MW, and a hydrogen demand of approximately 0.8 tons per day. Table 2 summarizes the key equipment parameters.
Comparative methods include: deterministic MILP, PPO-MLP, Transformer–PPO (without safety mechanisms), and the proposed Safety Transformer–PPO. The empirical data used in this study are derived from an anonymized industrial-park operation scenario in eastern Inner Mongolia. To protect commercially sensitive information, the representative time-series shown in Figure 3 are not raw plant measurements, but confidentiality-preserving profiles obtained after anonymization and bounded perturbation processing. These profiles retain the main temporal characteristics of electricity load, heat load, hydrogen demand, and renewable availability, and Figure 3 presents the representative anonymized typical-day profiles used to illustrate the empirical operating conditions. In this study, the MILP benchmark serves as a deterministic optimization reference under the same equipment capacities, tariff settings, and perturbation-evaluation protocol, rather than as a receding-horizon MPC or a robust/stochastic MILP benchmark.
For the learning-based methods, training is performed on a 1-month dataset composed of hourly multi-energy operational series, whereas final evaluation is conducted on disjoint out-of-sample perturbation realizations generated under a common scenario family and perturbation protocol. Here, the same perturbation protocol means a common scenario family and uncertainty-generation rule shared across methods for fair comparison, rather than reuse of identical realizations in both training and testing. The perturbations mainly include renewable forecast deviations and metering noise, introduced both to reflect practical uncertainty and to avoid disclosure of sensitive original trajectories. For confidentiality reasons, the exact perturbation realizations are not released; however, all methods are evaluated under the same equipment capacities, price settings, and perturbation-generation protocol to ensure fair comparison.
All experiments were implemented in Python 3.8 using PyTorch as the deep-learning framework. The execution-layer quadratic projection was solved using Gurobi. Experiments were conducted on a Windows-based workstation equipped with an Intel Core i7-14700KF CPU and an NVIDIA GeForce RTX 4070 SUPER GPU. Unless otherwise specified, each learning-based method was evaluated over 20 random seeds, i.e., {25, 1025, 2025, …, 19,025}, and the reported results are aggregated over the corresponding out-of-sample test runs.

4.2. Comparative Analysis of Economic Efficiency and Low-Carbon Performance

Based on the typical-day comparison results in Table 3, the proposed Safety Transformer–PPO achieves the lowest total cost, with (12.52 ± 0.13) × $10^4$ CNY, outperforming MILP (13.63 × $10^4$ CNY), PPO-MLP ((14.36 ± 0.32) × $10^4$ CNY), and Transformer–PPO ((13.38 ± 0.23) × $10^4$ CNY). This corresponds to cost reductions of approximately 8.1% relative to MILP, 12.8% relative to PPO-MLP, and 6.4% relative to Transformer–PPO. To further quantify cross-seed variability, the 95% confidence intervals of the total cost are (12.46–12.58) × $10^4$ CNY for Safety Transformer–PPO, (13.27–13.49) × $10^4$ CNY for Transformer–PPO, and (14.22–14.50) × $10^4$ CNY for PPO-MLP. These confidence intervals show that the proposed method not only attains the lowest mean total cost, but also exhibits the narrowest uncertainty band among the compared learning-based methods, thereby indicating stronger run-to-run consistency. The cost advantage is mainly attributed to lower procurement cost ((1.55 ± 0.09) × $10^4$ CNY), lower gas cost ((10.20 ± 0.08) × $10^4$ CNY), lower carbon cost ((0.54 ± 0.03) × $10^4$ CNY), and lower penalty cost ((0.06 ± 0.03) × $10^4$ CNY), which together indicate improved peak-purchase suppression, more effective renewable accommodation, and better overall low-carbon economic performance under the same system configuration.
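The quoted percentage reductions follow directly from the mean total costs. The short check below reproduces them; the helper name `relative_reduction` is hypothetical, and the inputs are the mean values quoted in the text in units of $10^4$ CNY.

```python
def relative_reduction(baseline, proposed):
    """Percentage cost reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Mean total costs as quoted in the text, in 10^4 CNY.
costs = {"MILP": 13.63, "PPO-MLP": 14.36, "Transformer-PPO": 13.38}
proposed = 12.52

reductions = {name: round(relative_reduction(c, proposed), 1) for name, c in costs.items()}
print(reductions)  # {'MILP': 8.1, 'PPO-MLP': 12.8, 'Transformer-PPO': 6.4}
```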
In terms of operational feasibility, Table 3 further shows that constraint handling is a key differentiator. Transformer–PPO without safety yields a non-compliance rate of 4.8 ± 1.2%, and PPO-MLP still exhibits 0.8 ± 0.5%, whereas the proposed Safety Transformer–PPO maintains 0.0 ± 0.0%, matching the deterministic MILP result while achieving better economy. Correspondingly, the approximate 95% confidence intervals of the non-compliance rate are 4.27–5.33% for Transformer–PPO, 0.58–1.02% for PPO-MLP, and 0.00–0.00% for Safety Transformer–PPO. This suggests that the economic advantage of the proposed method is achieved without relying on hard-constraint violations and remains consistently feasible across the evaluated runs. Although its incentive expenditure is the highest among the compared methods ((0.10 ± 0.03) × $10^4$ CNY), the overall effect remains favorable because the reductions in procurement, penalty, and carbon-related costs dominate. Overall, the proposed method provides the most favorable balance between economic performance and operational safety among the compared approaches.
Figure 4 shows the typical-day load transfer profile under the proposed method. The transferred load is mainly shifted out of the peak window (18:00–21:00) and compensated during off-peak hours, which indicates that the demand response is used as a settlement-eligible temporal reshaping rather than an unrealistic load “deletion”. This concentration of negative adjustments in the evening peak is consistent with the goal of suppressing peak procurement, while the rebound arranged in low-price periods reduces the likelihood of daytime disturbance, improving the verifiability and billability of DR in practical settlement.
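The distinction between temporal reshaping and load deletion can be made concrete with a simple energy-neutrality check on an hourly shift profile. The sketch below is illustrative only: the profile values are hypothetical, the peak window 18:00–21:00 is assumed to cover the hourly steps 18, 19, and 20, and the helper name `summarize_shift` is not taken from the original study.

```python
PEAK_HOURS = range(18, 21)  # assumed hourly steps inside the 18:00-21:00 window


def summarize_shift(shift_mw):
    """Summarize a 24-step load-shift profile (MW, negative = load moved out)."""
    if len(shift_mw) != 24:
        raise ValueError("expected 24 hourly values")
    moved_out = -sum(min(s, 0.0) for s in shift_mw)
    moved_in = sum(max(s, 0.0) for s in shift_mw)
    peak_reduction = -sum(min(shift_mw[h], 0.0) for h in PEAK_HOURS)
    return {
        "moved_out_mwh": moved_out,
        "moved_in_mwh": moved_in,
        "net_mwh": moved_in - moved_out,  # zero means pure reshaping
        "peak_share": peak_reduction / moved_out if moved_out > 0 else 0.0,
    }


# Hypothetical profile: 0.5 MW removed in each peak hour, repaid at 02:00-04:00.
profile = [0.0] * 24
for hour in PEAK_HOURS:
    profile[hour] = -0.5
for hour in (2, 3, 4):
    profile[hour] = 0.5

summary = summarize_shift(profile)
print(summary)
```

A net value of zero indicates that all reduced energy is compensated later, and a peak share of one indicates that every reduction falls inside the peak window.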
Figure 5 presents the typical-day SOC trajectory of the battery. The SOC exhibits a clear “off-peak charging, peak discharging” pattern and, more importantly, remains within the prescribed safety boundaries throughout the horizon, demonstrating that the safety constraints do not merely penalize violations after the fact but effectively shape deployable actions at execution. In operational terms, this SOC discipline provides the cross-period flexibility needed to support peak shaving and renewable accommodation while preventing boundary-hitting behaviors that often occur in unconstrained exploration.
Figure 6 extends the analysis to a monthly window (30 days, 720 h) and shows the distribution of load-shifting decisions over time. Compared with a fixed-rule peak-shifting strategy, the shifting periods and magnitudes vary across different dates, suggesting that the controller adjusts the DR volume in response to changing operating conditions, including renewable output, load levels, and price signals.
Figure 7 shows the monthly SOC evolution of the battery. Throughout the 30-day horizon, the SOC remains within the admissible operating bounds, and no out-of-bound event is observed. To avoid relying solely on visual inspection, two month-scale indicators are further reported for the same rollout, namely the cumulative non-compliance rate and the SOC boundary-hit frequency. The former quantifies the proportion of time steps with operational constraint violations over the monthly horizon, while the latter quantifies the proportion of time steps at which the battery SOC falls within 2% of either operating bound. In the reported monthly rollout, the cumulative non-compliance rate is 0.0%, whereas the SOC boundary-hit frequency is approximately 50%. These results show that the battery is repeatedly dispatched close to its admissible limits over the extended horizon, yet without any observed boundary violation.
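The two month-scale indicators can be computed directly from a rollout log. The sketch below is a minimal illustration under stated assumptions: "within 2% of either operating bound" is interpreted as an absolute margin of 0.02 p.u. around the SOC limits 0.1 and 0.9, the short trajectory is hypothetical, and the function names are not from the original study.

```python
def non_compliance_rate(violation_flags):
    """Fraction of time steps at which any hard constraint is violated."""
    return sum(1 for flag in violation_flags if flag) / len(violation_flags)


def boundary_hit_frequency(soc, soc_min=0.1, soc_max=0.9, margin=0.02, tol=1e-9):
    """Fraction of steps with SOC within `margin` (p.u., assumed absolute) of a bound."""
    hits = sum(
        1
        for s in soc
        if s - soc_min <= margin + tol or soc_max - s <= margin + tol
    )
    return hits / len(soc)


# Hypothetical eight-step SOC trajectory with no constraint violation.
soc_trajectory = [0.10, 0.11, 0.50, 0.60, 0.89, 0.90, 0.40, 0.30]
violations = [False] * len(soc_trajectory)

print(non_compliance_rate(violations))         # 0.0
print(boundary_hit_frequency(soc_trajectory))  # 0.5
```

The example shows how a zero non-compliance rate can coexist with a high boundary-hit frequency: the SOC touches the limits but never crosses them.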

4.3. Training Convergence and Safety Statistics

This subsection reports the training convergence and safety statistics of the proposed Safety Transformer–PPO using learning curves and execution-layer diagnostics. Convergence is evaluated by the typical-day total cost (consistent with the cost-based reward/return definition), while safety is quantified by the hard-constraint non-compliance rate measured during execution. In addition, we report two projection-layer indicators—projection activation rate and correction magnitude—to characterize the extent to which the execution-layer feasibility projection intervenes throughout training.
Figure 8 shows the evolution of the typical-day total cost over training episodes for the three learning-based methods, namely Safety Transformer–PPO, Transformer–PPO without safety, and PPO-MLP, with the deterministic MILP result included as a reference. The solid curves denote the mean values over 20 random-seed runs, while the shaded bands indicate ±1 standard deviation. Overall, the training process exhibits a clear pattern of progressive cost reduction followed by gradual stabilization, which is typical of PPO-style policy optimization. In the early training stage, all learning-based methods remain at relatively high cost levels, reflecting exploratory behavior and limited policy quality. As training proceeds, the total cost decreases steadily and eventually approaches a stable plateau, indicating that the policy updates become progressively smaller and the agent enters a relatively stable operating regime.
Across the entire training horizon, Safety Transformer–PPO converges to the lowest mean cost level among the compared learning-based methods. The final plateau is consistent with the evaluation results reported in Table 3, where Safety Transformer–PPO achieves a total cost of (12.52 ± 0.13) × 10⁴ CNY, outperforming Transformer–PPO without safety at (13.38 ± 0.23) × 10⁴ CNY and PPO-MLP at (14.36 ± 0.32) × 10⁴ CNY. In addition, the cost gap is already visible in the middle stage of training, suggesting that the proposed method reaches a competitive regime earlier and maintains a more favorable cost profile thereafter. Another noteworthy characteristic is that the proposed method exhibits a narrower late-stage standard-deviation band than the baseline learning methods. This pattern is consistent with the cross-seed statistics in Table 3 and indicates lower cross-run dispersion after the policy approaches convergence.
Figure 9 reports the hard-constraint non-compliance rate versus training episodes. Here, non-compliance is defined as the fraction of decision steps where any hard physical constraint is violated during execution. This metric directly reflects whether a learned policy is operationally deployable under strict security constraints. The curves show a pronounced separation between the proposed method and the baselines. Safety Transformer–PPO drives the non-compliance rate down rapidly and maintains an approximately zero level at convergence. In contrast, Transformer–PPO without safety stabilizes at a substantially higher violation exposure, and PPO-MLP converges to a smaller yet non-negligible level. Beyond the final values, the transient behavior is also informative: non-compliance is typically higher in early training and decreases as training progresses, which indicates that infeasible behaviors are more frequent during exploration and gradually diminish as the policy improves.
Figure 10 provides an execution-layer diagnostic by reporting the projection activation rate over training. The activation rate is high during early training and decreases gradually as training proceeds. This indicates that, at the beginning, the raw policy frequently proposes infeasible or near-infeasible actions, requiring frequent projection. As the policy improves, the raw action distribution becomes increasingly compatible with feasibility requirements, so fewer projection interventions are needed.
Figure 11 complements Figure 10 by reporting the average correction magnitude, measured by the L2 norm between the projected action and the raw policy action (‖a − ã‖₂), averaged over decision steps. While the activation rate indicates “how often” the projection intervenes, the correction magnitude indicates “how strongly” it modifies the action when it intervenes. In the curve, the correction magnitude decreases from a larger initial level to a small plateau, consistent with the decreasing activation rate in Figure 10. In practice, these two diagnostics should be interpreted together: an ideal learning outcome is characterized by both a low activation rate and a small correction magnitude, indicating that the policy itself is producing feasible actions and the projection layer acts primarily as a lightweight safeguard.
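The two projection diagnostics can be illustrated with a deliberately simplified stand-in for the execution-layer projection. The sketch below assumes that only per-step box bounds from Table 1 are enforced, in which case the Euclidean projection reduces to element-wise clipping; the QP used in the study additionally handles coupled constraints such as ramping, SOC, hydrogen inventory, and energy balance, which are omitted here. The raw actions, the tolerance, and all function names are hypothetical.

```python
import math

# Per-step action bounds (MW) from Table 1; coupled constraints are omitted.
BOUNDS = {
    "gt": (1.0, 4.0),   # gas turbine electric output
    "el": (0.0, 2.0),   # electrolyzer power consumption
    "ch": (0.0, 1.0),   # battery charging power
    "dis": (0.0, 1.0),  # battery discharging power
}


def project_box(action):
    """Euclidean projection onto the box, which reduces to clipping."""
    return {k: min(max(v, BOUNDS[k][0]), BOUNDS[k][1]) for k, v in action.items()}


def correction_norm(raw, projected):
    """L2 distance between the projected action and the raw action."""
    return math.sqrt(sum((projected[k] - raw[k]) ** 2 for k in raw))


def projection_diagnostics(raw_actions, tol=1e-9):
    """Return (activation rate, mean correction magnitude) over decision steps."""
    norms = [correction_norm(a, project_box(a)) for a in raw_actions]
    activation_rate = sum(1 for n in norms if n > tol) / len(norms)
    mean_correction = sum(norms) / len(norms)
    return activation_rate, mean_correction


# Hypothetical raw policy outputs for four decision steps.
raw_actions = [
    {"gt": 4.5, "el": 1.0, "ch": 0.5, "dis": 0.0},  # gas turbine above its limit
    {"gt": 2.0, "el": 1.0, "ch": 0.0, "dis": 0.2},  # feasible
    {"gt": 1.0, "el": 2.3, "ch": 0.0, "dis": 1.4},  # two bounds exceeded
    {"gt": 3.0, "el": 0.5, "ch": 0.2, "dis": 0.0},  # feasible
]
rate, magnitude = projection_diagnostics(raw_actions)
print(rate, magnitude)
```

In this toy example the projection intervenes in half of the steps, and the correction averaged over all steps is about 0.25 MW.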
The projection diagnostics are mainly intended to assess whether the learned policy systematically relies on execution-layer correction. Some degree of correction is expected during early training because of stochastic exploration and approximation errors. However, as training proceeds, both the projection activation rate and the average correction magnitude decrease markedly, suggesting that the learned policy becomes increasingly compatible with the feasible region rather than systematically relying on the projection layer after convergence.
Overall, the training curves indicate that Safety Transformer–PPO converges to a lower cost level while achieving zero hard-constraint non-compliance at convergence. The projection diagnostics further show that intervention frequency and intervention strength both decrease over training, implying that the learned policy becomes increasingly consistent with feasibility requirements.

5. Discussion

5.1. Ablation of the Dual-Layer Safety Mechanism

To isolate the contribution of each safety component, we conduct an ablation study on the proposed dual-layer safety design. We evaluate four variants: (i) full Safety Transformer–PPO (Lagrange updates + execution-layer QP projection), (ii) Lagrange-only (remove QP projection), (iii) QP-only (remove Lagrange updates), and (iv) Transformer–PPO without safety. All variants share the same Transformer–PPO backbone, training budget, uncertainty injection, and state/action definitions; only the safety components are toggled. We report the typical-day cost breakdown and the hard-constraint non-compliance rate to jointly assess economic performance and operational feasibility.
Table 4 reveals a clear cost–safety trade-off across the four variants. The full Safety Transformer–PPO achieves the lowest total cost (12.52 × 10⁴ CNY) with zero non-compliance (0.0%). In contrast, the safety-free Transformer–PPO baseline yields a higher total cost (13.38 × 10⁴ CNY) and a markedly higher violation exposure (4.8%). When QP projection is removed (Lagrange-only), the total cost is 12.70 × 10⁴ CNY, lower than the safety-free baseline, but non-compliance reaches 1.6%, indicating that dual-variable regulation alone is insufficient to eliminate hard violations in execution. When Lagrange updates are removed (QP-only), hard feasibility is preserved (0.0% non-compliance), but the total cost rises to 12.88 × 10⁴ CNY, suggesting that feasibility enforcement alone does not guarantee cost-efficient operation.
Beyond the headline totals, the component-level breakdown in Table 4 helps localize the sources of improvement (all values in units of 10⁴ CNY). Relative to the safety-free Transformer–PPO, the full method reduces procurement cost (1.55 vs. 1.95), gas cost (10.20 vs. 10.55), carbon cost (0.54 vs. 0.60), and penalty (0.06 vs. 0.14), while keeping battery cost in a narrow range (0.06–0.07). Incentive expenditure remains bounded (0.08–0.11), implying that the observed cost reduction is primarily driven by improved operational decisions rather than aggressive incentive spending. Comparing the two ablations, QP-only exhibits higher carbon cost (0.60) and penalty (0.07) than the full method, whereas Lagrange-only attains a closer cost composition but incurs non-compliance (1.6%).
Overall, the ablation results indicate that feasibility and economic optimality are coupled but not identical objectives. Execution-layer projection is decisive for eliminating hard violations, whereas dual-variable regulation improves the quality of feasible actions, particularly by reducing procurement reliance and carbon-related cost. The combined design therefore achieves a preferable cost–safety trade-off compared with either component alone.
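The role of the dual-variable layer can be illustrated with a generic projected dual-ascent rule. This is a textbook form shown under stated assumptions, not necessarily the exact update used in the study: the step size, the constraint limit, and the function names are hypothetical, and the observed statistic is taken to be an episode-level violation rate.

```python
def dual_update(multiplier, observed, limit, step_size):
    """Projected dual ascent: grow the multiplier when the constraint statistic
    exceeds its limit, shrink it otherwise, and keep it non-negative."""
    return max(0.0, multiplier + step_size * (observed - limit))


def lagrangian_cost(cost, multiplier, observed, limit):
    """Penalized objective (to be minimized) that the policy update would see."""
    return cost + multiplier * (observed - limit)


# Hypothetical sequence: a 1.6% violation rate first, then two clean episodes.
limit = 0.01       # assumed tolerated violation rate
step_size = 10.0   # assumed dual step size
multiplier = 0.0
history = []
for observed_rate in (0.016, 0.0, 0.0):
    multiplier = dual_update(multiplier, observed_rate, limit, step_size)
    history.append(multiplier)

print(history)
```

The multiplier rises after the violating episode and decays back toward zero once the statistic stays below its limit, which is why such a rule shapes average behavior but cannot by itself rule out a violation at an individual step.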

5.2. Robustness to Forecast Error Bounds

Since the evaluation already accounts for renewable forecast errors and metering noise, we further test robustness by scaling the renewable forecast error bound to ε ∈ {±5%, ±10%, ±15%} while keeping capacities and price settings unchanged. For strict comparability with the Results section, the ε = ±10% case matches the typical-day evaluation setting, and Table 5 reports the corresponding cost components together with the non-compliance rate.
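One simple way to realize a bounded forecast error is a multiplicative perturbation drawn within ±ε. The sketch below assumes a uniform distribution and a hypothetical wind forecast; the disturbance distribution used in the study is not specified in this section, so both choices are illustrative, as is the function name.

```python
import random


def perturb_forecast(forecast_mw, eps, rng):
    """Scale each value by (1 + u) with u ~ Uniform(-eps, eps); floor at zero."""
    return [max(0.0, p * (1.0 + rng.uniform(-eps, eps))) for p in forecast_mw]


rng = random.Random(0)                # fixed seed for repeatability
wind_forecast = [3.0, 3.5, 2.8, 4.1]  # hypothetical hourly wind forecast, MW
scenarios = {
    eps: perturb_forecast(wind_forecast, eps, rng)
    for eps in (0.05, 0.10, 0.15)
}

for eps, realized in scenarios.items():
    print(eps, [round(v, 3) for v in realized])
```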
As shown in Table 5, the total cost increases monotonically as uncertainty grows, from 12.40 × 10⁴ CNY at ε = ±5% to 12.74 × 10⁴ CNY at ε = ±15%. In units of 10⁴ CNY, the increase is primarily driven by higher procurement cost (1.50 → 1.65), carbon cost (0.52 → 0.58), and penalty (0.04 → 0.09), while gas cost changes only slightly in relative terms (10.18 → 10.24) and battery cost remains invariant (0.07). Incentive expenditure shows a mild increase (0.09 → 0.11), consistent with a larger verified response under higher perturbations in the settlement calculation.
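The component-wise changes between ε = ±5% and ε = ±15% can be tabulated from Table 5. The short sketch below reports both absolute and relative changes; the variable and function names are illustrative.

```python
# Table 5 cost components in units of 10^4 CNY at eps = 5% and eps = 15%.
LOW = {"procurement": 1.50, "gas": 10.18, "carbon": 0.52,
       "penalty": 0.04, "battery": 0.07, "incentive": 0.09}
HIGH = {"procurement": 1.65, "gas": 10.24, "carbon": 0.58,
        "penalty": 0.09, "battery": 0.07, "incentive": 0.11}


def absolute_change(low, high):
    """Component-wise change in units of 10^4 CNY."""
    return {k: round(high[k] - low[k], 2) for k in low}


def relative_change_pct(low, high):
    """Component-wise change as a percentage of the eps = 5% value."""
    return {k: round(100.0 * (high[k] - low[k]) / low[k], 1) for k in low}


abs_change = absolute_change(LOW, HIGH)
rel_change = relative_change_pct(LOW, HIGH)
total_change = round(sum(HIGH.values()) - sum(LOW.values()), 2)

print(abs_change)
print(rel_change)
print(total_change)
```

Gas and carbon cost rise by the same absolute amount (0.06), but this is about 0.6% of the gas cost and about 11.5% of the carbon cost, which is the sense in which the gas cost changes only slightly.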
Across all tested uncertainty levels, the non-compliance rate remains 0.0%, indicating that the execution-layer safety projection preserves hard feasibility even when forecast perturbations widen. From an operational viewpoint, this means robustness is reflected not only by moderate cost degradation but also by guaranteed constraint satisfaction. Nevertheless, the economic deterioration under larger ε suggests that very high uncertainty may require retraining under broader disturbance distributions and/or introducing conservative feasibility margins in the projection constraints.

5.3. Quantitative Comparison with Related Studies

To better position the proposed method within the recent literature, two levels of comparison should be distinguished. First, Table 3 already provides a controlled same-case benchmark by reproducing representative baseline methods under identical system settings, perturbation protocol, and evaluation metrics. Second, Table 6 below offers a literature-level quantitative comparison with representative recent studies. Since the compared studies differ in case scale, device portfolio, tariff setting, uncertainty assumption, and benchmark definition, the comparison is intended as a relative positioning analysis rather than a strict ranking of absolute operating performance.
As shown in Table 6, existing studies have reported meaningful economic and low-carbon improvements in hydrogen-related integrated energy systems, but most of them do not explicitly combine settlement-oriented DR execution with execution-layer hard-feasibility protection. In this sense, the distinctive contribution of the present work does not lie in claiming the largest reported cost-reduction percentage across heterogeneous cases, but in jointly achieving economic improvement, settlement-aware DR implementation, and zero observed hard-constraint non-compliance within a unified sequential decision framework. Therefore, Table 6 should be interpreted as a literature-level positioning analysis, whereas the controlled same-case method comparison is already provided in Table 3.

5.4. Limitations

Although the proposed Safety Transformer–PPO shows favorable economic performance, hard-constraint feasibility, and robustness within the tested settings, several limitations should be noted. First, the current validation is conducted on a single anonymized industrial-park configuration; therefore, the reported results should be interpreted as evidence of effectiveness for this class of electro–heat–hydrogen dispatch problems rather than as a guarantee of direct transferability to parks with substantially different device portfolios, tariff mechanisms, demand structures, or hydrogen-consumption patterns. Second, the method remains sensitive to data quality and settlement design, especially baseline-definition choices, metering reliability, and disturbance distributions, all of which may affect verified response quantities, incentive-ledger evolution, and economic performance. Finally, real deployment would also require stable integration with plant-level EMS/SCADA infrastructure, reliable M&V pipelines, and periodic model maintenance under seasonal variation or distribution shifts. These issues define important directions for future work on broader cross-site validation, stronger interpretability, and deployment-oriented system integration.
It should be noted that the present study adopts fixed-cost parameters for hydrogen- and heat-related technologies in the tested dispatch setting. However, in practical industrial deployment, these costs may evolve with technology learning, market maturity, and economies of scale, which could further affect the relative utilization of hydrogen conversion, storage, and heat supply units. Therefore, the reported economic results should be interpreted under the current parameter setting rather than as a prediction of future technology-cost trajectories. Future work will incorporate cost-learning scenarios and scale-dependent parameter settings for hydrogen- and heat-related equipment to further assess their impact on dispatch performance and economic competitiveness.

6. Conclusions

This paper proposed a Safety Transformer–PPO framework for low-carbon economic dispatch in integrated electro–heat–hydrogen energy systems with settlement-oriented incentive-based demand response. By combining a causal Transformer encoder, PPO-based policy learning, and a dual-layer safety mechanism, the proposed method achieved lower total cost, lower carbon-related cost, and zero observed hard-constraint non-compliance in the tested industrial-park setting, while also showing improved convergence stability across random seeds.
These results should be interpreted within the scope of the current validation setting. The study is conducted on a single anonymized industrial-park configuration under specific tariff and disturbance assumptions, and therefore does not by itself guarantee direct transferability to substantially different operating environments. Future work will further strengthen the benchmarking by incorporating receding-horizon MILP/MPC and robust/stochastic MILP baselines under the same forecast and disturbance settings.

Author Contributions

J.Z.: Methodology; Formal analysis; Writing—original draft. Y.W.: Data curation; Validation; Supervision. H.X.: Methodology; Data curation; Formal analysis. L.N.: Data curation; Project administration. L.Y.: Methodology; Writing—review & editing. W.X.: Methodology; Formal analysis. S.Y.: Formal analysis; Validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Due to the confidentiality of industrial-park operational data, the original dataset and exact perturbation realizations used in this study are not publicly available. Representative anonymized data supporting the findings of this study may be obtained from the corresponding authors upon reasonable request, subject to institutional review and the signing of a confidentiality agreement.

Conflicts of Interest

Authors Jia Zhengjian, Yang Wanchun, Huang Xin, Liang Nan, Liu Yupeng and Wang Xiaojun were employed by the company Inner Mongolia Power (Group) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wang, W.; Chen, X.; Liu, Y.; Wu, L. Forming multi-transmission-node distributed energy resource aggregations in wholesale energy market: An optimal node aggregation approach and admissible capacity expansion regions. Appl. Energy 2026, 410, 127536.
  2. Washizu, A.; Nozu, T. Sustainability Transition to a Low-Carbon Society: Focusing on Rural Areas. In Climate Change Issues and Social Sciences: Towards a Carbon Neutral Society; Springer: Berlin/Heidelberg, Germany, 2025; pp. 45–63.
  3. Kurucan, M.; Özbaltan, M.; Yetgin, Z.; Alkaya, A. Applications of artificial neural network based battery management systems: A literature review. Renew. Sustain. Energy Rev. 2024, 192, 114262.
  4. Wu, L.; Zhang, W.; Chen, W.; Pei, T. A Multi-Time scale optimal scheduling strategy for integrated energy systems considering the power randomness of wind and photovoltaic. Electr. Eng. 2025, 107, 9109–9123.
  5. Pinson, P.; Madsen, H. Benefits and challenges of electrical demand response: A critical review. Renew. Sustain. Energy Rev. 2014, 39, 686–699.
  6. Iris, Ç.; Lam, J.S.L. Optimal energy management and operations planning in seaports with smart grid while harnessing renewable energy under uncertainty. Omega 2021, 103, 102445.
  7. Stanelyte, D.; Radziukyniene, N.; Radziukynas, V. Overview of demand-response services: A review. Energies 2022, 15, 1659.
  8. Valentini, O.; Andreadou, N.; Bertoldi, P.; Lucas, A.; Saviuc, I.; Kotsakis, E. Demand response impact evaluation: A review of methods for estimating the customer baseline load. Energies 2022, 15, 5259.
  9. Li, Z.; Li, H.; Wang, S. Customer baseline load estimation in incentive-based demand response programs: Requirements, solutions, challenges and future perspectives. Renew. Sustain. Energy Rev. 2026, 226, 116383.
  10. Ellman, D.; Xiao, Y. Incentives to manipulate demand response baselines with uncertain event schedules. IEEE Trans. Smart Grid 2020, 12, 1358–1369.
  11. Wang, X.; Tang, W. Modeling and analysis of baseline manipulation in demand response programs. IEEE Trans. Smart Grid 2021, 13, 1178–1186.
  12. Qi, X.; Gong, M.; Huang, F.; Liu, H. Assessment of Load Reduction Potential Based on Probabilistic Prediction of Demand Response Baseline Load. Processes 2025, 14, 52.
  13. Šikšnys, D.; Vaičys, J.; Gudžius, S.; Račkienė, R.; Grigošaitis, M. Unlocking thermal flexibility through demand-side response: Baseline methodology assessment and heating electrification in the Baltic region. Therm. Sci. Eng. Prog. 2026, 70, 104498.
  14. Ukoba, K.; Olatunji, K.O.; Adeoye, E.; Jen, T.-C.; Madyira, D.M. Optimizing renewable energy systems through artificial intelligence: Review and future prospects. Energy Environ. 2024, 35, 3833–3879.
  15. Ferdaus, M.M.; Dam, T.; Anavatti, S.; Das, S. Digital technologies for a net-zero energy future: A comprehensive review. Renew. Sustain. Energy Rev. 2024, 202, 114681.
  16. Gharibvand, H.; Gharehpetian, G.B.; Anvari-Moghaddam, A. A survey on microgrid flexibility resources, evaluation metrics and energy storage effects. Renew. Sustain. Energy Rev. 2024, 201, 114632.
  17. Ma, M.; Long, Z.; Liu, X.; Lee, K.Y. Distributionally robust optimization of electric–thermal–hydrogen integrated energy system considering source–load uncertainty. Energy 2025, 316, 134568.
  18. Cuisinier, É.; Lemaire, P.; Penz, B.; Ruby, A.; Bourasseau, C. New rolling horizon optimization approaches to balance short-term and long-term decisions: An application to energy planning. Energy 2022, 245, 122773.
  19. Fernández, G.; Sanz Osorio, J.; Rocca, R.; Luengo-Baranguan, L.; Torres, M. Practical Considerations for the Development of Two-Stage Deterministic EMS (Cloud–Edge) to Mitigate Forecast Error Impact on the Objective Function. Appl. Sci. 2026, 16, 1844.
  20. Liang, T.; Zhang, X.; Tan, J.; Jing, Y.; Liangnian, L. Deep reinforcement learning-based optimal scheduling of integrated energy systems for electricity, heat, and hydrogen storage. Electr. Power Syst. Res. 2024, 233, 110480.
  21. Liu, J.; Meng, X.; Wu, J. Data-driven optimal scheduling for integrated electricity-heat-gas-hydrogen energy system considering demand-side management: A deep reinforcement learning approach. Int. J. Hydrogen Energy 2025, 103, 147–165.
  22. Li, Y.; Zhao, B.; Li, Y.; Long, C.; Li, S.; Dong, Z.; Shahidehpour, M. Safe-AutoSAC: AutoML-enhanced safe deep reinforcement learning for integrated energy system scheduling with multi-channel informer forecasting and electric vehicle demand response. Appl. Energy 2025, 399, 126468.
  23. Prabawa, P.; Choi, D.-H. Safe deep reinforcement learning-assisted two-stage energy management for active power distribution networks with hydrogen fueling stations. Appl. Energy 2024, 375, 124170.
  24. Zhu, H.; Wang, X.; Wen, Y.; Zhu, J.; Li, J.; Luo, Q.; Liao, C. A review of integrated energy system modeling and operation. Appl. Energy 2025, 400, 126572.
  25. Liu, H.; Li, Y.; Li, S.; Kou, X.; Dong, Y.; Jiang, J.; Ji, F.; Duan, M.; Hao, X.; Hu, W. Heat pump-assisted waste heat recovery for thermal management in hydrogen-enabled integrated energy systems. Energy 2025, 338, 138874.
  26. Xie, Y.; Xiong, W.; Zhang, S.; Li, Z.; Johnson, B.C.; Zhu, H. Deep learning-based distributionally robust optimization scheduling in the low carbon park integrated energy system under multiple uncertainties. Energy 2026, 344, 139886.
  27. Puterman, M.L. Markov decision processes. Handb. Oper. Res. Manag. Sci. 1990, 2, 331–434.
  28. Wen, X.; Duan, Z.; Wang, J.; Hong, Q. The application of improved PPO algorithm in microgrid energy management. Eng. Res. Express 2026, 8, 025317.
  29. Qiu, D.; Dong, Z.; Zhang, X.; Wang, Y.; Strbac, G. Safe reinforcement learning for real-time automatic control in a smart energy-hub. Appl. Energy 2022, 309, 118403.
  30. Wang, Y.; Qiu, D.; Sun, M.; Strbac, G.; Gao, Z. Secure energy management of multi-energy microgrid: A physical-informed safe reinforcement learning approach. Appl. Energy 2023, 335, 120759.
  31. Ecoffet, P.; Fontbonne, N.; André, J.-B.; Bredeche, N. Reinforcement learning with rare significant events: Direct policy search vs. gradient policy search. In Proceedings of the Genetic and Evolutionary Computation Conference Companion; Gecco: Avelin, France, 2021; pp. 97–98. Available online: https://dl.acm.org/doi/pdf/10.1145/3449726.3459462 (accessed on 4 January 2026).
  32. Sopegno, L.; Cirrincione, G.; Martini, S.; Rutherford, M.J.; Livreri, P.; Valavanis, K.P. Transformer-based physics informed proximal policy optimization for UAV autonomous navigation. In 2025 International Conference on Unmanned Aircraft Systems (ICUAS); IEEE: Piscataway, NJ, USA, 2025; pp. 1094–1099.
  33. Su, X.; Zhang, Q.; Fu, Z.; Wu, J.; Qin, T.; Li, C.; Huang, S.; Bi, K. The coordinated multi-energy trading framework for integrated energy systems considering electricity-hydrogen trading and carbon emission flow. Energy 2025, 339, 139145.
  34. Xu, X.; Du, Y. Two-Stage Robust Optimal Configuration of Multi-Energy Microgrid Considering Tiered Carbon Trading and Demand Response. Symmetry 2025, 17, 1999.
  35. Wang, H.; Wu, Q.; Guo, H. Low-Carbon Optimal Operation Strategy of Multi-Energy Multi-Microgrid Electricity–Hydrogen Sharing Based on Asymmetric Nash Bargaining. Sustainability 2025, 17, 4703.
Figure 1. Electricity–Heat–Hydrogen coupling system.
Figure 2. The Causal Transformer–PPO architecture.
Figure 3. The typical daily data.
Figure 4. Typical daily load transfer curve for the proposed method.
Figure 5. Typical daily SOC trajectory of a battery for the proposed method.
Figure 6. Monthly load transfer curve for the proposed method.
Figure 7. Monthly SOC trajectory of a battery for the proposed method.
Figure 8. Training convergence of typical-day total cost. (Solid lines denote the mean values over 20 random seeds, and the shaded bands indicate ±1 standard deviation).
Figure 9. Hard-constraint non-compliance rate during training.
Figure 10. Projection activation rate (the shaded bands indicate correction magnitude).
Figure 11. Average correction magnitude (the shaded bands indicate correction magnitude).
Table 1. Compact definition of state and action variables in the MDP formulation.
| Group | Symbol | Meaning | Unit | Bound/Range |
|---|---|---|---|---|
| State | $L_t^{raw}$ | Underlying electrical demand before DR execution | MW | Exogenous scenario input |
| State | $D_t^{h}$ | Heat demand | MWth | Exogenous scenario input |
| State | $D_t^{H_2}$ | Hydrogen demand | t/h | Exogenous scenario input |
| State | $P_t^{wind}$ | Available wind power | MW | Exogenous scenario input |
| State | $P_t^{pv}$ | Available PV power | MW | Exogenous scenario input |
| State | $SOC_t$ | Battery state of charge | p.u. | 0.1–0.9 |
| State | $S_t^{H_2}$ | Hydrogen inventory in storage tank | t | 0.2–1.5 |
| State | $L_t^{e}$ | Settlement baseline load | MW | / |
| State | $B_t^{rem}$ | Remaining incentive budget | CNY | 0–10,000 |
| Action | $P_t^{gt}$ | Gas turbine electric output | MW | 1–4 |
| Action | $P_t^{el}$ | Electrolyzer power consumption | MW | 0–2 |
| Action | $P_t^{eb}$ | Electric boiler power consumption | MW | 0–2.11 |
| Action | $Q_t^{gb}$ | Gas boiler heat output | MWth | 0–6 |
| Action | $P_t^{ch}$ | Battery charging power | MW | 0–1 |
| Action | $P_t^{dis}$ | Battery discharging power | MW | 0–1 |
| Action | $\rho_t$ | DR incentive price | CNY | / |
| Action | $\Delta L_t^{req}$ | Requested aggregate load-shifting amount | MW | / |
Table 2. Equipment and Constraint Parameters.
| Category | Parameters |
|---|---|
| Power grid | $\bar{P}^{imp} = 8$ MW, $\bar{P}^{exp} = 2$ MW |
| Gas turbine | $\bar{P}^{gt} = 4$ MW, $\underline{P}^{gt} = 1$ MW, $R^{gt} = 1$ MW/h |
| Waste heat heating | $\kappa^{whb} = 1.2$ |
| Gas boiler | $\bar{Q}^{gb} = 6$ MW, $R^{gb} = 1.5$ MW/h, $\eta^{gb} = 0.9$ |
| Electric boiler | $\bar{Q}^{eb} = 2$ MW, $\eta^{eb} = 0.95$ |
| Electrolytic cell | $\bar{P}^{el} = 2$ MW, $R^{el} = 1$ MW/h, $\alpha^{el} = 0.020$ t/MWh |
| Hydrogen storage tank | $\underline{S}^{H_2} = 0.2$ t, $\bar{S}^{H_2} = 1.5$ t |
| Battery | $E^{bat} = 2$ MWh, $\bar{P}^{ch} = \bar{P}^{dis} = 1$ MW, $\eta^{ch} = \eta^{dis} = 0.95$, $SOC \in [0.1, 0.9]$, $c^{bat} = 0.02$ |
| Electricity price (Peak/Flat/Valley) | $c_t^{buy} = 1.10/0.75/0.42$ RMB/kWh, $c_t^{sell} = 0.35$ RMB/kWh |
| Gas | $c^{gas} = 3.5$ /Nm³, $LHV = 9.5$ |
| Penalty | $c^{curt} = 0.2$ /kWh |
| Carbon | $\pi^{CO_2} = 80$ RMB/t |
| Incentive budget | $B_t^{rem} = 10000$ /day |
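To show how the storage parameters in Table 2 bound a single dispatch step, the following sketch applies a commonly used linear storage model. The update equations and the one-hour time step are assumptions for illustration; the exact dynamics are those defined in the model formulation of the paper, and the function names are hypothetical.

```python
# Parameters from Table 2.
E_BAT_MWH = 2.0
ETA_CH = ETA_DIS = 0.95
SOC_MIN, SOC_MAX = 0.1, 0.9
ALPHA_EL_T_PER_MWH = 0.020
H2_MIN_T, H2_MAX_T = 0.2, 1.5


def step_soc(soc, p_ch_mw, p_dis_mw, dt_h=1.0):
    """Assumed linear battery model with charging and discharging losses."""
    return soc + (ETA_CH * p_ch_mw - p_dis_mw / ETA_DIS) * dt_h / E_BAT_MWH


def step_h2(inventory_t, p_el_mw, demand_t_per_h, dt_h=1.0):
    """Assumed tank balance: electrolyzer production minus hydrogen demand."""
    return inventory_t + (ALPHA_EL_T_PER_MWH * p_el_mw - demand_t_per_h) * dt_h


def within(value, low, high, tol=1e-9):
    """True if `value` lies inside [low, high] up to a small tolerance."""
    return low - tol <= value <= high + tol


# Charging at 0.5 MW for one hour from SOC = 0.5 stays inside the bounds,
# whereas charging at the full 1 MW would overshoot the 0.9 limit.
soc_ok = step_soc(0.5, p_ch_mw=0.5, p_dis_mw=0.0)
soc_over = step_soc(0.5, p_ch_mw=1.0, p_dis_mw=0.0)
# Hypothetical hydrogen demand of 0.03 t/h with the electrolyzer at 2 MW.
h2_next = step_h2(1.0, p_el_mw=2.0, demand_t_per_h=0.03)

print(soc_ok, within(soc_ok, SOC_MIN, SOC_MAX))
print(soc_over, within(soc_over, SOC_MIN, SOC_MAX))
print(h2_next, within(h2_next, H2_MIN_T, H2_MAX_T))
```

Under these assumptions, a rated-power action can be individually admissible yet infeasible once the inventory state is taken into account, which is the kind of coupled constraint an execution-layer projection has to handle.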
Table 3. Comparison of Typical Daily Optimization Scheduling Outcomes (×10⁴ CNY).
| Category | MILP | PPO-MLP [31] | Transformer–PPO [32] | Safety Transformer–PPO |
|---|---|---|---|---|
| Total cost | 13.63 | 14.36 ± 0.32 | 13.38 ± 0.23 | 12.52 ± 0.13 |
| Procurement cost | 2.05 | 2.45 ± 0.19 | 1.95 ± 0.15 | 1.55 ± 0.09 |
| Gas costs | 10.70 | 10.95 ± 0.13 | 10.55 ± 0.11 | 10.20 ± 0.08 |
| Carbon cost | 0.62 | 0.66 ± 0.04 | 0.60 ± 0.04 | 0.54 ± 0.03 |
| Penalty | 0.18 | 0.22 ± 0.06 | 0.14 ± 0.05 | 0.06 ± 0.03 |
| Battery cost | 0.06 | 0.05 ± 0.02 | 0.06 ± 0.02 | 0.07 ± 0.02 |
| Incentive expenditure | 0.02 | 0.03 ± 0.02 | 0.08 ± 0.03 | 0.10 ± 0.03 |
| Non-compliance rate (%) | 0.0 | 0.8 ± 0.5 | 4.8 ± 1.2 | 0.0 ± 0.0 |
Note: Results for the learning-based methods are reported as mean ± standard deviation over 20 random seeds. The MILP baseline is deterministic and is therefore reported as a single value.
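As a quick consistency check on Table 3, the mean total cost of each method should equal the sum of its six mean cost components. The sketch below performs this check on the reported means; the dictionary layout and names are illustrative.

```python
COMPONENTS = ("procurement", "gas", "carbon", "penalty", "battery", "incentive")

# Mean values from Table 3, in units of 10^4 CNY.
TABLE3_MEANS = {
    "MILP": {"total": 13.63, "procurement": 2.05, "gas": 10.70,
             "carbon": 0.62, "penalty": 0.18, "battery": 0.06, "incentive": 0.02},
    "PPO-MLP": {"total": 14.36, "procurement": 2.45, "gas": 10.95,
                "carbon": 0.66, "penalty": 0.22, "battery": 0.05, "incentive": 0.03},
    "Transformer-PPO": {"total": 13.38, "procurement": 1.95, "gas": 10.55,
                        "carbon": 0.60, "penalty": 0.14, "battery": 0.06, "incentive": 0.08},
    "Safety Transformer-PPO": {"total": 12.52, "procurement": 1.55, "gas": 10.20,
                               "carbon": 0.54, "penalty": 0.06, "battery": 0.07, "incentive": 0.10},
}


def component_sum(row):
    """Sum of the six cost components, rounded to the table precision."""
    return round(sum(row[name] for name in COMPONENTS), 2)


mismatches = {
    method: (component_sum(row), row["total"])
    for method, row in TABLE3_MEANS.items()
    if component_sum(row) != row["total"]
}
print(mismatches)  # an empty dict means every total matches its components
```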
Table 4. Safety ablation on a typical day (×10⁴ CNY).
| Category | Lagrange-Only | QP-Only | Transformer–PPO | Safety Transformer–PPO |
|---|---|---|---|---|
| Total cost | 12.70 | 12.88 | 13.38 | 12.52 |
| Procurement cost | 1.60 | 1.58 | 1.95 | 1.55 |
| Gas costs | 10.22 | 10.23 | 10.55 | 10.20 |
| Carbon cost | 0.56 | 0.60 | 0.60 | 0.54 |
| Penalty | 0.08 | 0.07 | 0.14 | 0.06 |
| Battery cost | 0.07 | 0.07 | 0.06 | 0.07 |
| Incentive expenditure | 0.10 | 0.11 | 0.08 | 0.10 |
| Non-compliance rate (%) | 1.6 | 0.0 | 4.8 | 0.0 |
Table 5. Safety Transformer–PPO under different forecast error bounds ε (×10⁴ CNY).

| Category | ε = ±5% | ε = ±10% | ε = ±15% |
| --- | --- | --- | --- |
| Total cost | 12.40 | 12.52 | 12.74 |
| Procurement cost | 1.50 | 1.55 | 1.65 |
| Gas costs | 10.18 | 10.20 | 10.24 |
| Carbon cost | 0.52 | 0.54 | 0.58 |
| Penalty | 0.04 | 0.06 | 0.09 |
| Battery cost | 0.07 | 0.07 | 0.07 |
| Incentive expenditure | 0.09 | 0.10 | 0.11 |
| Non-compliance rate (%) | 0.0 | 0.0 | 0.0 |
Table 6. Quantitative comparison with representative recent studies from the perspectives of safety handling and settlement-oriented demand response.

| Ref. | System Type | Method | Reported Economic Results | Settlement-Oriented DR | Safety/Feasibility Handling | Main Distinction Relative to This Work |
| --- | --- | --- | --- | --- | --- | --- |
| Su et al. [33] | Integrated energy system with electricity–hydrogen trading and carbon-emission flow | Coordinated multi-energy trading framework | Total system cost reduced by 6.2%; carbon emissions reduced by 17.3% | No | No explicit execution-layer safety correction | Focuses on carbon-flow-aware coordinated trading rather than settlement-aware DR and safe RL |
| Xu et al. [34] | Multi-energy microgrid with P2G and CCS | Two-stage robust optimization | Total cost reduced by 3.28%; carbon emissions reduced by up to 31.9% | No | No explicit execution-layer safety correction | Emphasizes robust planning/operation, but not settlement-grade DR or execution-layer correction |
| Wang et al. [35] | Multi-energy multi-microgrid electricity–hydrogen sharing system | Asymmetric Nash bargaining + distributed optimization | Total cost of the MEMG network decreased by 12,431.22 CNY; carbon emission reduction ratio reached 11.12% | No | Physical feasibility handled in optimization model | Focuses on bargaining-based multi-microgrid coordination rather than safe sequential control |
| Ours | Electro–heat–hydrogen industrial park | Safety Transformer–PPO | Total cost reduced by 8.1% vs. MILP, 12.8% vs. PPO-MLP, and 6.4% vs. Transformer–PPO | Yes | Dual-variable (Lagrange multiplier) updates plus execution-layer QP action projection | Not applicable |

Share and Cite

MDPI and ACS Style

Zhengjian, J.; Wanchun, Y.; Xin, H.; Nan, L.; Yupeng, L.; Xiaojun, W.; Yu, S. Low-Carbon Economic Dispatch and Settable Incentive-Based Demand Response for Integrated Electro–Heat–Hydrogen Energy Systems Based on Safety Transformer–PPO. Energies 2026, 19, 1578. https://doi.org/10.3390/en19061578
