1. Introduction
Against the backdrop of high-penetration wind and solar grid integration and the rapid expansion of the hydrogen economy, integrated energy systems (IES) within industrial parks increasingly require the coordination of electricity, heat, and hydrogen under intense multi-energy coupling, significant intertemporal dynamics, and high uncertainty [1,2]. Hydrogen energy, particularly through electrolyzer and hydrogen storage systems, introduces an electricity-to-hydrogen coupling pathway [3]. Its demand side is driven by exogenous hydrogen-sector requirements, which must be met dynamically through hydrogen production and inventory management. Conversion and storage equipment collectively determine the viable operating envelope and carbon emission outcomes [4].
In parallel, incentive-based demand response (DR) has increasingly been framed as a settlement-grade resource rather than a purely price-guided flexibility option [5]. DR has also been applied in different industrial contexts; for instance, Iris and Lam [6] investigated demand response and energy management in seaports under renewable uncertainty, showing that DR is relevant not only to conventional power systems but also to broader industry-specific energy operations. In such programs, baseline definition, measurement and verification (M&V), and auditable settlement rules directly determine the correctness of the realized reduction and the exposure of incentive payments [7]. Valentini et al. [8] used a structured review to summarize customer baseline load (CBL) estimation methods and showed that baseline uncertainty and methodological choices can materially bias impact evaluation and settlement results. Li et al. [9] categorized baseline estimation techniques for incentive-based DR from an application-requirement perspective and highlighted that data availability, behind-the-meter behaviors, and multi-service participation can undermine settlement accuracy and controllability. Ellman and Xiao [10] used a multi-stage stochastic dynamic programming model to study incentives for baseline manipulation under uncertain event schedules, demonstrating that settlement rules can induce strategic behavior beyond the intended load curtailment. Wang et al. [11] developed an MDP-based analytical framework for baseline manipulation in baseline-based DR programs and derived structural insights on customers' underconsumption/overconsumption strategies, which distort settlement fairness. Qi et al. [12] applied probabilistic baseline prediction to settlement-sensitive load-reduction potential assessment, showing that uncertainty-aware baseline modeling is essential for contract execution and payment credibility. Šikšnys et al. [13] proposed a two-stage decision-oriented baseline selection framework for practical baseline choice under real consumption data, emphasizing that auditable baselines require both technical feasibility screening and performance evaluation. Collectively, these studies indicate that baseline definition, verification of the realized response, and payment recalculation rules directly determine settlement correctness and incentive exposure, while budget controllability becomes an explicit operational requirement. Dispatch is thereby elevated from single-objective cost minimization to a coupled coordination task that must simultaneously deliver low-carbon economic performance and settlement feasibility under strict operational security boundaries [14,15,16]. In settlement-oriented incentive-based DR, dispatch decisions are no longer evaluated solely by operational cost but also by the auditability of realized reductions and the controllability of incentive exposure. Baseline definitions, M&V procedures, and payment recalculation rules induce an implicit "settlement feasibility region": a schedule that is physically feasible may still be non-settleable if the verified reduction cannot be reconstructed from metering data, or if the incentive budget can be exceeded under the prescribed accounting rules. This coupling makes economic dispatch a joint optimization of multi-energy operation and settlement integrity, where operational security constraints and settlement eligibility must be satisfied together.
Deterministic MILP formulations remain effective for structured constraint representation in IES; however, their online performance is often sensitive to forecast errors and modeling deviations under renewable variability and device dynamics. Ma et al. [17] used data-driven distributionally robust optimization to address source–load uncertainty in electric–thermal–hydrogen IES scheduling, illustrating the need for uncertainty-aware formulations beyond deterministic MILP. Cuisinier et al. [18] used extended rolling-horizon optimization to counter the accumulation of forecast errors in operational planning, confirming that receding-horizon updates are critical when predictions degrade over time. Fernández et al. [19] used a two-stage deterministic EMS architecture (rolling-horizon planning with fast local adaptation) to mitigate forecast-error impacts on objective values, further motivating closed-loop corrective control. In parallel, deep reinforcement learning (DRL) provides an alternative by learning closed-loop policies through interaction. Liang et al. [20] used deep RL (SAC) for real-time optimal scheduling in integrated energy systems, demonstrating improved renewable utilization with online policy control. Liu et al. [21] used a data-driven DRL scheduling framework for coordinated dispatch in integrated electricity–heat–gas–hydrogen systems with demand-side flexibility. Li et al. [22] used a safe DRL scheduling approach (AutoML-enhanced safe RL with forecasting and DR) for constraint-aware IES scheduling under renewable uncertainty. Prabawa and Choi [23] used safe-DRL-assisted two-stage energy management to address operational security in active distribution networks with hydrogen fueling stations. Nevertheless, despite the growing body of research on DRL-based scheduling for integrated energy systems, two key limitations remain insufficiently addressed. First, most existing studies treat demand response primarily as an operational flexibility resource and do not explicitly incorporate settlement-grade mechanisms, such as baseline reconstruction, M&V, payment recalculation, and incentive-budget ledger evolution, into the dispatch state and transition process. As a result, a schedule that is physically feasible and economically attractive in simulation may still be non-settleable in practice if the realized reduction cannot be independently verified or if the incentive expenditure exceeds what the prescribed accounting rules allow. Second, although safe reinforcement learning has been introduced to improve constraint awareness, existing approaches mainly focus on physical feasibility and rarely address the joint requirement of operational security and settlement eligibility in incentive-based DR. In multi-constraint electro–heat–hydrogen dispatch, this omission is particularly critical because settlement-grade DR introduces additional intertemporal states beyond energy storage, including baseline windows, verified response histories, and remaining incentive budgets. Ignoring these variables can lead to policies that appear cost-effective during training but fail in deployment due to hidden ledger violations and non-auditable response outcomes.
To close this gap, this paper proposes a Safety Transformer–PPO framework for low-carbon economic dispatch with settlement-oriented incentive-based demand response in integrated electro–heat–hydrogen energy systems. The main contributions are threefold. First, a settlement-aware dispatch model is established by explicitly embedding baseline–verification–settlement rules and budget-ledger states into the environment, so that feasibility is defined jointly by physical operability and settlement eligibility. Second, a causal Transformer–PPO architecture is developed to capture long-horizon temporal dependencies induced by multi-energy coupling, renewable uncertainty, and intertemporal ledger evolution. Third, a dual-layer safety mechanism is introduced, in which Lagrange-based statistical constraint regulation is combined with execution-layer quadratic-programming projection to enforce hard physical feasibility during deployment. In this way, the proposed framework moves beyond conventional cost-oriented DRL dispatch and provides an audit-ready, safety-constrained, and settlement-compatible solution for industrial-park integrated energy management.
3. Safety Transformer–PPO with Integrated Settlement-Oriented IDR
3.1. MDP Modelling: States, Actions and Ledger Variables
The scheduling problem is formulated as a Markov Decision Process (MDP) [27]. To ensure settlement consistency and safe convergence, the state s_t explicitly incorporates inter-period energy states and settlement ledger variables alongside load, generation capacity, and price signals. Among the ledger components, the residual budget for load transfer is tracked explicitly as a state variable.
The action a_t is defined as a continuous vector that balances supply-side scheduling with demand-side incentive deployment. One component governs the settlement payment and incentive-budget consumption, while another determines the requested physical load-shifting quantity within the transferable-load envelope.
The instantaneous reward is set to the negative of the total operating cost, r_t = −C_t, so that return maximization corresponds to cost minimization.
For readability, the main state and action variables, together with their units and bounds, are compactly summarized in Table 1.
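To illustrate how settlement ledger variables can sit alongside physical states in the MDP, the following minimal Python sketch groups the state components described above. All field names are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class DispatchState:
    # Physical and market observations (illustrative names)
    elec_load_mw: float
    heat_load_mw: float
    h2_demand_kg: float
    renewable_mw: float
    price_buy: float
    # Inter-period energy states
    battery_soc: float          # fraction of usable capacity
    h2_inventory_kg: float
    # Settlement ledger variables
    incentive_budget_left: float    # remaining incentive budget (CNY)
    shift_budget_left_mwh: float    # residual transferable-load budget

    def as_vector(self):
        """Flatten to the numeric state vector consumed by the policy."""
        return [self.elec_load_mw, self.heat_load_mw, self.h2_demand_kg,
                self.renewable_mw, self.price_buy, self.battery_soc,
                self.h2_inventory_kg, self.incentive_budget_left,
                self.shift_budget_left_mwh]
```

The point of the sketch is that the ledger fields evolve across steps exactly like storage states, so the policy can condition on remaining budgets when deciding incentive deployment.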
3.2. Causal Transformer–PPO Architecture
To capture long-horizon coupling induced by ramping limits, multi-energy storage dynamics, renewable uncertainty, and settlement ledger evolution, we adopt a causal Transformer as the sequence encoder inside an actor–critic PPO [28] framework. The overall architecture is illustrated in Figure 2. Unlike a feedforward policy that maps only the current observation to an action, the proposed encoder explicitly models the most recent operating trajectory, which is essential for dispatch problems where feasible and economic decisions depend on temporal context.
The policy and value networks are conditioned on a sliding window of the most recent T = 24 hourly observations, which is designed to capture the dominant daily periodicity in loads, renewable availability, and price signals while remaining lightweight for online deployment. Each hourly observation aggregates heterogeneous physical and economic variables. To avoid scale imbalance across these heterogeneous inputs and to stabilize optimization, all features are normalized using training-set statistics and then mapped through a learned linear projection into a shared latent space with embedding dimension d = 128. The resulting token sequence is processed by a compact Transformer encoder composed of N = 3 stacked blocks. Each block uses four-head self-attention and a position-wise feedforward network with hidden size 256, together with residual connections and layer normalization to support stable gradient flow and improve generalization. Importantly, we apply a strict causal attention mask so that the representation at the current hour attends only to the available history within the window, i.e., the interval spanning from t − T + 1 to t. This prevents any information leakage from future steps during training and evaluation, and ensures that the learned policy remains consistent with real-time operation where future realizations are not observable. After encoding, the representation of the final token (corresponding to the current hour) is used as a compact context vector summarizing the recent system trajectory, and it is passed to the actor–critic heads.
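The effect of the strict causal mask can be illustrated with a single-head self-attention sketch in NumPy. This is a toy illustration (one head, no residuals or layer normalization, toy embedding size), not the paper's implementation; the perturbation check at the end verifies that a future hour cannot influence earlier representations.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, d) observation window.

    A strictly upper-triangular mask blocks attention to future steps, so
    the representation at hour t depends only on hours t-T+1 .. t.
    """
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores[future] = -1e9                               # mask future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d = 24, 16                         # 24-hour window, toy embedding size
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)

# Causality check: perturbing the last hour must not change earlier outputs.
x2 = x.copy()
x2[-1] += 1.0
out2 = causal_self_attention(x2, Wq, Wk, Wv)
assert np.allclose(out[:-1], out2[:-1])
```

In the full architecture this masked attention is stacked (three blocks, four heads, d = 128) and the final token's representation serves as the context vector for the actor–critic heads.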
On top of the causal Transformer encoder, the actor is implemented as a lightweight two-layer MLP (128 → 128) that outputs the parameters of a diagonal Gaussian policy for continuous controls, enabling stochastic exploration during training while maintaining a simple and scalable action distribution. The sampled action is then passed through a squashing and affine scaling procedure to enforce element-wise actuator bounds before it enters the safety layer described. This separation is intentional: the actor focuses on producing a high-quality raw control signal in a normalized action space, while feasibility with respect to hard operational constraints is enforced downstream by the safety mechanism. The critic shares the same causal Transformer encoder to form a consistent representation of the recent trajectory and reduce computational overhead. A separate two-layer MLP (128 → 128 → 1) maps the context vector to a scalar state-value estimate, which is used for advantage estimation and policy updates. Sharing the encoder between actor and critic improves sample efficiency and stabilizes training, while keeping the heads separate prevents interference between action generation and value regression.
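The squashing-and-affine-scaling step can be sketched as follows. The paper does not specify the exact squashing function, so tanh is assumed here as one common choice; `lo` and `hi` stand for the element-wise actuator bounds.

```python
import numpy as np

def squash_and_scale(raw_action, lo, hi):
    """Map an unbounded Gaussian sample onto element-wise actuator bounds:
    tanh squashes to (-1, 1), then an affine map rescales onto [lo, hi]."""
    u = np.tanh(np.asarray(raw_action, dtype=float))
    return lo + 0.5 * (u + 1.0) * (hi - lo)
```

Note that this only enforces the box bounds of the normalized action space; hard operational constraints (ramping, SOC, inventories) are still handled downstream by the safety layer.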
Training is performed using PPO with the standard clipped surrogate objective and generalized advantage estimation (GAE) to balance bias and variance in policy gradients. For reproducibility, we fix all key hyperparameters across experiments: discount factor γ = 0.99, GAE parameter λ = 0.95, clipping coefficient 0.2, entropy coefficient 0.01, value-loss coefficient 0.5, and Adam learning rate 3 × 10⁻⁴. Each policy update uses a mini-batch size of 256 and runs for 10 epochs over collected rollouts. The rollout length is set to 2048 steps, which provides sufficiently diverse on-policy trajectories for stable optimization without excessively delaying updates. In total, training proceeds for 3 × 10⁵ interaction steps. This configuration offers a pragmatic trade-off between stability and computational cost, and it is kept identical across baselines to ensure that performance differences are attributable to architectural and safety-design choices rather than tuning artifacts.
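The GAE computation referenced above follows the standard backward recursion; the sketch below uses the stated γ = 0.99 and λ = 0.95 as defaults and is a generic illustration, not code from the authors' implementation.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout.

    rewards: per-step rewards r_0..r_{T-1}
    values:  value estimates V(s_0)..V(s_{T-1})
    last_value: bootstrap value V(s_T) for the state after the rollout
    """
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        # TD residual, then exponentially weighted accumulation
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    returns = adv + values   # targets for the value-loss term
    return adv, returns
```

With γ = λ = 1 and zero values, the advantage at each step reduces to the undiscounted reward-to-go, which is a convenient sanity check for the recursion.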
3.3. Dual-Layer Safety Mechanisms
This paper categorizes constraints into statistical constraints and hard physical constraints, which are treated separately. Statistical constraints include the incentive budget, the maximum power curtailment rate, the desired emission cap, and the maximum average power-purchase rate. Each statistical constraint k is represented by a constraint cost C_k with upper bound d_k and is regulated with the Lagrange primal–dual approach, whose dual variables are updated by projected gradient ascent:

λ_k ← max(0, λ_k + η_λ (Ĵ_{C_k} − d_k)),

where λ_k denotes the Lagrange multiplier (dual variable) corresponding to the k-th constraint, η_λ is the learning rate for the dual variables, and Ĵ_{C_k} is the sample estimate of the expected constraint cost.
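A minimal sketch of this projected dual-ascent step, with hypothetical constraint names and illustrative numbers (the actual costs and bounds come from the scheduling environment):

```python
def dual_update(lmbda, cost_estimate, bound, lr=0.01):
    """Projected dual ascent: increase the multiplier when the sampled
    constraint cost exceeds its upper bound; clip at zero otherwise."""
    return max(0.0, lmbda + lr * (cost_estimate - bound))

# One update per statistical constraint (illustrative values only)
multipliers = {"budget": 0.0, "curtailment": 0.0, "emission": 0.0, "purchase": 0.0}
estimates   = {"budget": 1.2, "curtailment": 0.3, "emission": 0.9, "purchase": 0.5}
bounds      = {"budget": 1.0, "curtailment": 0.5, "emission": 1.0, "purchase": 0.4}
for k in multipliers:
    multipliers[k] = dual_update(multipliers[k], estimates[k], bounds[k], lr=0.5)
```

Multipliers for violated constraints (budget, purchase) grow and penalize the policy objective, while satisfied constraints keep their multipliers clipped at zero.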
Hard physical constraints encompass power upper and lower bounds, turbine/boiler ramping, battery state of charge, hydrogen storage inventory, safety venting, and energy balance. To prevent training collapse caused by excessive boundary violations during exploration, this paper introduces an action-feasible-region projection at the execution layer. At execution, the raw policy action ã_t may violate hard constraints due to stochastic exploration or approximation errors. The deployed action a_t is therefore computed by solving a quadratic projection problem that minimally modifies ã_t while satisfying a convex approximation of the feasible set. A standard form is

a_t = argmin_{a, ξ ≥ 0} ‖a − ã_t‖₂² + ρ‖ξ‖₂²  s.t.  A a ≤ b + ξ,

where the linear system A a ≤ b encodes the standard unit-level bounds and consistency constraints defined by Table 2 and the corresponding device equations above, including ramping limits, SOC and hydrogen-inventory bounds, interconnection limits, and one-step linearized balance constraints. The slack variable ξ is introduced only for numerical robustness, with a large penalty factor ρ that strongly discourages violations. In implementation, mutual exclusivity between battery charging and discharging is enforced by rule-based gating before the QP projection, so that only one of the charging and discharging power variables can remain active at each step. The projection problem is low-dimensional (equal to the action dimension) and can be solved efficiently at each control step, making it suitable for rolling online control.
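For the special case of box bounds plus a single one-step balance equality, the minimal-change projection admits a near-closed-form solution via bisection on a scalar dual variable. The sketch below illustrates this structure only; the full QP in the paper also carries ramping and inventory constraints and is solved with a general-purpose solver.

```python
import numpy as np

def project_action(a_raw, lo, hi, balance=None, tol=1e-9):
    """Minimal-change projection of a raw action onto box bounds and,
    optionally, a one-step balance constraint sum(a) == balance.

    For this structure the QP  min ||a - a_raw||^2  has the solution
    a = clip(a_raw + nu, lo, hi), where the scalar dual variable nu is
    found by bisection on the (monotone) balance residual.
    """
    if balance is None:
        return np.clip(a_raw, lo, hi)

    def residual(nu):
        return np.clip(a_raw + nu, lo, hi).sum() - balance

    nu_lo, nu_hi = -1e6, 1e6
    for _ in range(200):                 # bisection on the monotone residual
        nu = 0.5 * (nu_lo + nu_hi)
        if residual(nu) > 0:
            nu_hi = nu
        else:
            nu_lo = nu
        if nu_hi - nu_lo < tol:
            break
    return np.clip(a_raw + 0.5 * (nu_lo + nu_hi), lo, hi)
```

The projected action stays as close as possible to the raw policy output while restoring feasibility, which is the property the execution layer relies on.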
The explicit unit-level operating limits and parameter ranges used by the execution-layer projection are those already defined in Table 2 and the corresponding device equations above; they are therefore not repeated here for brevity. These standard algebraic forms are common in safe control and constrained RL implementations [29,30], while the focus of the present work is on their integration into the proposed Safety Transformer–PPO dispatch framework.
4. Results
4.1. Scenario, Data, and Experimental Protocol
The case study is based on an anonymized real-world industrial-park integrated energy system in eastern Inner Mongolia, featuring a peak electrical load of approximately 9 MW, a trough electrical load of approximately 5 MW, a peak thermal load of approximately 10 MW, and a hydrogen demand of approximately 0.8 tons per day.
Table 2 summarizes the key equipment parameters.
Comparative methods include deterministic MILP, PPO-MLP, Transformer–PPO (without safety mechanisms), and the proposed Safety Transformer–PPO. The empirical data used in this study are derived from an anonymized industrial-park operation scenario in eastern Inner Mongolia. To protect commercially sensitive information, the representative time series shown in Figure 3 are not raw plant measurements but confidentiality-preserving profiles obtained after anonymization and bounded perturbation processing. These profiles retain the main temporal characteristics of electricity load, heat load, hydrogen demand, and renewable availability; Figure 3 presents the representative anonymized typical-day profiles used to illustrate the empirical operating conditions. In this study, the MILP benchmark serves as a deterministic optimization reference under the same equipment capacities, tariff settings, and perturbation-evaluation protocol, rather than as a receding-horizon MPC or a robust/stochastic MILP benchmark.
For the learning-based methods, training is performed on a 1-month dataset composed of hourly multi-energy operational series, whereas final evaluation is conducted on disjoint out-of-sample perturbation realizations generated under a common scenario family and perturbation protocol. Here, the same perturbation protocol means a common scenario family and uncertainty-generation rule shared across methods for fair comparison, rather than reuse of identical realizations in both training and testing. The perturbations mainly include renewable forecast deviations and metering noise, introduced both to reflect practical uncertainty and to avoid disclosure of sensitive original trajectories. For confidentiality reasons, the exact perturbation realizations are not released; however, all methods are evaluated under the same equipment capacities, price settings, and perturbation-generation protocol to ensure fair comparison.
All experiments were implemented in Python 3.8 using PyTorch as the deep-learning framework. The execution-layer quadratic projection was solved using Gurobi. Experiments were conducted on a Windows-based workstation equipped with an Intel Core i7-14700KF CPU and an NVIDIA GeForce RTX 4070 SUPER GPU. Unless otherwise specified, each learning-based method was evaluated over 20 random seeds, i.e., {25, 1025, 2025, …, 19025}, and the reported results are aggregated over the corresponding out-of-sample test runs.
4.2. Comparative Analysis of Economic Efficiency and Low-Carbon Performance
Based on the typical-day comparison results in Table 3, the proposed Safety Transformer–PPO achieves the lowest total cost, 12.52 ± 0.13 × 10⁴ CNY, outperforming MILP (13.63 × 10⁴ CNY), PPO-MLP (14.36 ± 0.32 × 10⁴ CNY), and Transformer–PPO (13.38 ± 0.23 × 10⁴ CNY). This corresponds to cost reductions of approximately 8.1% relative to MILP, 12.8% relative to PPO-MLP, and 6.4% relative to Transformer–PPO. To further quantify cross-seed variability, the 95% confidence intervals of the total cost are 12.46–12.58 × 10⁴ CNY for Safety Transformer–PPO, 13.27–13.49 × 10⁴ CNY for Transformer–PPO, and 14.22–14.50 × 10⁴ CNY for PPO-MLP. These confidence intervals show that the proposed method not only attains the lowest mean total cost but also exhibits the narrowest uncertainty band among the compared learning-based methods, indicating stronger run-to-run consistency. The cost advantage is mainly attributed to lower procurement cost (1.55 ± 0.09 × 10⁴ CNY), lower gas cost (10.20 ± 0.08 × 10⁴ CNY), lower carbon cost (0.54 ± 0.03 × 10⁴ CNY), and lower penalty cost (0.06 ± 0.03 × 10⁴ CNY), which together indicate improved peak-purchase suppression, more effective renewable accommodation, and better overall low-carbon economic performance under the same system configuration.
In terms of operational feasibility, Table 3 further shows that constraint handling is a key differentiator. Transformer–PPO without safety yields a non-compliance rate of 4.8 ± 1.2%, and PPO-MLP still exhibits 0.8 ± 0.5%, whereas the proposed Safety Transformer–PPO maintains 0.0 ± 0.0%, matching the deterministic MILP result while achieving better economy. Correspondingly, the approximate 95% confidence intervals of the non-compliance rate are 4.27–5.33% for Transformer–PPO, 0.58–1.02% for PPO-MLP, and 0.00–0.00% for Safety Transformer–PPO. This suggests that the economic advantage of the proposed method is achieved without relying on hard-constraint violations and remains consistently feasible across the evaluated runs. Although its incentive expenditure is the highest among the compared methods (0.10 ± 0.03 × 10⁴ CNY), the overall effect remains favorable because the reductions in procurement, penalty, and carbon-related costs dominate. Overall, the proposed method provides the most favorable balance between economic performance and operational safety among the compared approaches.
Figure 4 shows the typical-day load transfer profile under the proposed method. The transferred load is mainly shifted out of the peak window (18:00–21:00) and compensated during off-peak hours, which indicates that the demand response is used as a settlement-eligible temporal reshaping rather than an unrealistic load “deletion”. This concentration of negative adjustments in the evening peak is consistent with the goal of suppressing peak procurement, while the rebound arranged in low-price periods reduces the likelihood of daytime disturbance, improving the verifiability and billability of DR in practical settlement.
Figure 5 presents the typical-day SOC trajectory of a battery. The SOC exhibits a clear “off-peak charging, peak discharging” pattern, and—more importantly—remains within the prescribed safety boundaries throughout the horizon, demonstrating that the safety constraints do not merely penalize violations after the fact but effectively shape deployable actions at execution. In operational terms, this SOC discipline provides the cross-period flexibility needed to support peak shaving and renewable accommodation while preventing boundary-hitting behaviors that often occur in unconstrained exploration.
Figure 6 extends the analysis to a monthly window (30 days, 720 h) and shows the distribution of load-shifting decisions over time. Compared with a fixed-rule peak-shifting strategy, the shifting periods and magnitudes vary across different dates, suggesting that the controller adjusts the DR volume in response to changing operating conditions, including renewable output, load levels, and price signals.
Figure 7 shows the monthly SOC evolution of the battery. Throughout the 30-day horizon, the SOC remains within the admissible operating bounds, and no out-of-bound event is observed. To avoid relying solely on visual inspection, two month-scale indicators are further reported for the same rollout, namely the cumulative non-compliance rate and the SOC boundary-hit frequency. The former quantifies the proportion of time steps with operational constraint violations over the monthly horizon, while the latter quantifies the proportion of time steps at which the battery SOC falls within 2% of either operating bound. In the reported monthly rollout, the cumulative non-compliance rate is 0.0%, whereas the SOC boundary-hit frequency is approximately 50%. These results show that the battery is repeatedly dispatched close to its admissible limits over the extended horizon, yet without any observed boundary violation.
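The two month-scale indicators can be computed directly from rollout logs. The sketch below assumes the 2% band is measured relative to the admissible SOC span, which is our reading of the definition above; function and variable names are illustrative.

```python
import numpy as np

def month_scale_indicators(soc, soc_min, soc_max, violations, band=0.02):
    """Cumulative non-compliance rate and SOC boundary-hit frequency over a
    monthly rollout (e.g. 720 hourly steps).

    violations: boolean array marking steps with any hard-constraint violation
    band:       near-boundary tolerance as a fraction of the admissible span
                (assumed interpretation of "within 2% of either bound")
    """
    soc = np.asarray(soc, dtype=float)
    span = soc_max - soc_min
    near_bound = (soc <= soc_min + band * span) | (soc >= soc_max - band * span)
    return float(np.mean(violations)), float(np.mean(near_bound))
```

Reporting both values together distinguishes a policy that is merely conservative (low boundary-hit frequency) from one that exploits the full admissible range without violating it.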
4.3. Training Convergence and Safety Statistics
This subsection reports the training convergence and safety statistics of the proposed Safety Transformer–PPO using learning curves and execution-layer diagnostics. Convergence is evaluated by the typical-day total cost (consistent with the cost-based reward/return definition), while safety is quantified by the hard-constraint non-compliance rate measured during execution. In addition, we report two projection-layer indicators—projection activation rate and correction magnitude—to characterize the extent to which the execution-layer feasibility projection intervenes throughout training.
Figure 8 shows the evolution of the typical-day total cost over training episodes for the three learning-based methods, namely Safety Transformer–PPO, Transformer–PPO without safety, and PPO-MLP, with the deterministic MILP result included as a reference. The solid curves denote the mean values over 20 random-seed runs, while the shaded bands indicate ±1 standard deviation. Overall, the training process exhibits a clear pattern of progressive cost reduction followed by gradual stabilization, which is typical of PPO-style policy optimization. In the early training stage, all learning-based methods remain at relatively high cost levels, reflecting exploratory behavior and limited policy quality. As training proceeds, the total cost decreases steadily and eventually approaches a stable plateau, indicating that the policy updates become progressively smaller and the agent enters a relatively stable operating regime.
Across the entire training horizon, Safety Transformer–PPO converges to the lowest mean cost level among the compared learning-based methods. The final plateau is consistent with the evaluation results reported in Table 3, where Safety Transformer–PPO achieves a total cost of 12.52 ± 0.13 × 10⁴ CNY, outperforming Transformer–PPO without safety (13.38 ± 0.23 × 10⁴ CNY) and PPO-MLP (14.36 ± 0.32 × 10⁴ CNY). In addition, the cost gap becomes visible already in the middle stage of training, suggesting that the proposed method reaches a competitive regime earlier and maintains a more favorable cost profile thereafter. Another noteworthy characteristic is that the proposed method exhibits a narrower late-stage standard-deviation band than the baseline learning methods. This pattern is consistent with the cross-seed statistics in Table 3 and indicates lower cross-run dispersion after the policy approaches convergence.
Figure 9 reports the hard-constraint non-compliance rate versus training episodes. Here, non-compliance is defined as the fraction of decision steps where any hard physical constraint is violated during execution. This metric directly reflects whether a learned policy is operationally deployable under strict security constraints. The curves show a pronounced separation between the proposed method and the baselines. Safety Transformer–PPO drives the non-compliance rate down rapidly and maintains an approximately zero level at convergence. In contrast, Transformer–PPO without safety stabilizes at a substantially higher violation exposure, and PPO-MLP converges to a smaller yet non-negligible level. Beyond the final values, the transient behavior is also informative: non-compliance is typically higher in early training and decreases as training progresses, which indicates that infeasible behaviors are more frequent during exploration and gradually diminish as the policy improves.
Figure 10 provides an execution-layer diagnostic by reporting the projection activation rate over training. The activation rate is high during early training and decreases gradually as training proceeds. This indicates that, at the beginning, the raw policy frequently proposes infeasible or near-infeasible actions, requiring frequent projection. As the policy improves, the raw action distribution becomes increasingly compatible with feasibility requirements, so fewer projection interventions are needed.
Figure 11 complements Figure 10 by reporting the average correction magnitude, measured as the L₂ norm between the projected action and the raw policy action (‖a − ã‖₂), averaged over decision steps. While the activation rate indicates how often the projection intervenes, the correction magnitude indicates how strongly it modifies the action when it does. In the curve, the correction magnitude decreases from a larger initial level to a small plateau, consistent with the decreasing activation rate in Figure 10. In practice, these two diagnostics should be interpreted together: an ideal learning outcome is characterized by both a low activation rate and a small correction magnitude, indicating that the policy itself produces feasible actions and the projection layer acts primarily as a lightweight safeguard.
The projection diagnostics are mainly intended to assess whether the learned policy systematically relies on execution-layer correction. Some degree of correction is expected during early training because of stochastic exploration and approximation errors. However, as training proceeds, both the projection activation rate and the average correction magnitude decrease markedly, suggesting that the learned policy becomes increasingly compatible with the feasible region rather than systematically relying on the projection layer after convergence.
Overall, the training curves indicate that Safety Transformer–PPO converges to a lower cost level while achieving zero hard-constraint non-compliance at convergence. The projection diagnostics further show that intervention frequency and intervention strength both decrease over training, implying that the learned policy becomes increasingly consistent with feasibility requirements.
5. Discussion
5.1. Ablation of the Dual-Layer Safety Mechanism
To isolate the contribution of each safety component, we conduct an ablation study on the proposed dual-layer safety design. We evaluate four variants: (i) full Safety Transformer–PPO (Lagrange updates + execution-layer QP projection), (ii) Lagrange-only (remove QP projection), (iii) QP-only (remove Lagrange updates), and (iv) Transformer–PPO without safety. All variants share the same Transformer–PPO backbone, training budget, uncertainty injection, and state/action definitions; only the safety components are toggled. We report the typical-day cost breakdown and the hard-constraint non-compliance rate to jointly assess economic performance and operational feasibility.
Table 4 reveals a clear cost–safety trade-off across the four variants. The full Safety Transformer–PPO achieves the lowest total cost (12.52 × 10⁴ CNY) with zero non-compliance (0.0%). In contrast, the safety-free Transformer–PPO baseline yields a higher total cost (13.38 × 10⁴ CNY) and a markedly higher violation exposure (4.8%). When the QP projection is removed (Lagrange-only), the total cost improves to 12.70 × 10⁴ CNY, but non-compliance increases to 1.6%, indicating that dual-variable regulation alone is insufficient to eliminate hard violations during execution. When the Lagrange updates are removed (QP-only), hard feasibility is preserved (0.0% non-compliance), but total cost degrades to 12.88 × 10⁴ CNY, suggesting that feasibility enforcement alone does not guarantee cost-efficient operation.
Beyond the headline totals, the component-level breakdown in Table 4 helps localize the sources of improvement. Relative to the safety-free Transformer–PPO, the full method reduces procurement cost (1.55 × 10⁴ CNY vs. 1.95 × 10⁴ CNY), gas cost (10.20 × 10⁴ CNY vs. 10.55 × 10⁴ CNY), carbon cost (0.54 × 10⁴ CNY vs. 0.60 × 10⁴ CNY), and penalty (0.06 × 10⁴ CNY vs. 0.14 × 10⁴ CNY), while keeping battery cost in a narrow range (0.06–0.07 × 10⁴ CNY). Incentive expenditure remains bounded (0.08–0.11 × 10⁴ CNY), implying that the observed cost reduction is primarily driven by improved operational decisions rather than aggressive incentive spending. Comparing the two ablations, QP-only exhibits higher carbon cost (0.60 × 10⁴ CNY) and penalty (0.07 × 10⁴ CNY) than the full method, whereas Lagrange-only attains a closer cost composition but incurs non-compliance (1.6%).
Overall, the ablation results indicate that feasibility and economic optimality are coupled but not identical objectives. Execution-layer projection is decisive for eliminating hard violations, whereas dual-variable regulation improves the quality of feasible actions, particularly by reducing procurement reliance and carbon-related cost. The combined design therefore achieves a preferable cost–safety trade-off compared with either component alone.
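The dual-variable regulation contrasted here with the projection layer can be sketched as projected dual ascent on a soft-constraint budget. The exact constrained-PPO update is not specified in this excerpt, so the learning rate, cost trajectory, and budget below are illustrative assumptions.

```python
def dual_ascent_step(lmbda, episode_cost, budget, lr=0.05):
    """Projected dual ascent for one soft constraint:
    lambda <- max(0, lambda + lr * (J_c - d)).
    The multiplier grows while the estimated constraint cost J_c exceeds the
    budget d, raising the penalty weight in the policy objective, and decays
    back toward zero (never below it) once the constraint is satisfied."""
    return max(0.0, lmbda + lr * (episode_cost - budget))

# Toy trajectory of constraint-cost estimates: violations shrink over training.
costs = [3.0, 2.0, 1.2, 0.8, 0.5]
budget = 1.0
lam = 0.0
history = []
for J_c in costs:
    lam = dual_ascent_step(lam, J_c, budget)
    history.append(lam)
```

The multiplier first rises while costs exceed the budget and then relaxes, mirroring how dual regulation steers the policy toward low-cost feasible actions rather than clipping them after the fact.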
5.2. Robustness to Forecast Error Bounds
Since the evaluation already accounts for renewable forecast errors and metering noise, we further test robustness by scaling the renewable forecast error bound to ε ∈ {±5%, ±10%, ±15%} while keeping capacities and price settings unchanged. For strict comparability with the Results section, the ε = ±10% case matches the typical-day evaluation setting, and Table 5 reports the corresponding cost components together with the non-compliance rate.
As shown in Table 5, the total cost increases monotonically as uncertainty grows, from 12.40 × 10^4 CNY at ε = ±5% to 12.74 × 10^4 CNY at ε = ±15%. The increase is primarily driven by higher procurement cost (1.50 → 1.65), carbon cost (0.52 → 0.58), and penalty cost (0.04 → 0.09), all in 10^4 CNY, while gas cost changes only slightly (10.18 → 10.24) and battery cost remains nearly invariant (0.07). Incentive expenditure shows a mild increase (0.09 → 0.11), consistent with a larger verified response under stronger perturbations in the settlement calculation.
Across all tested uncertainty levels, the non-compliance rate remains 0.0%, indicating that the execution-layer safety projection preserves hard feasibility even when forecast perturbations widen. From an operational viewpoint, this means robustness is reflected not only by moderate cost degradation but also by guaranteed constraint satisfaction. Nevertheless, the economic deterioration under larger ε suggests that very high uncertainty may require retraining under broader disturbance distributions and/or introducing conservative feasibility margins in the projection constraints.
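The bounded-perturbation protocol behind this sweep can be sketched as multiplicative forecast errors drawn inside the bound ε. The uniform error distribution, forecast values, and seeding below are illustrative assumptions rather than the paper's exact disturbance model.

```python
import numpy as np

def realize_renewables(forecast, eps, rng):
    """Sample realized renewable output with a multiplicative forecast error
    bounded by +/- eps (e.g. eps = 0.10 for the ±10% case)."""
    err = rng.uniform(-eps, eps, size=forecast.shape)
    return forecast * (1.0 + err)

rng = np.random.default_rng(0)
forecast = np.array([80.0, 120.0, 60.0])  # hypothetical PV/wind forecast, kW
for eps in (0.05, 0.10, 0.15):
    realized = realize_renewables(forecast, eps, rng)
    # Realized output always stays inside the stated error bound.
    assert np.all(np.abs(realized / forecast - 1.0) <= eps)
```

Widening ε enlarges the realized-output envelope the dispatcher must absorb, which is the mechanism behind the monotone cost growth in Table 5; the safety projection then accounts for the unchanged 0.0% non-compliance.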
5.3. Quantitative Comparison with Related Studies
To better position the proposed method within the recent literature, two levels of comparison should be distinguished. First, Table 3 already provides a controlled same-case benchmark by reproducing representative baseline methods under identical system settings, perturbation protocols, and evaluation metrics. Second, Table 6 below offers a literature-level quantitative comparison with representative recent studies. Because the compared studies differ in case scale, device portfolio, tariff setting, uncertainty assumptions, and benchmark definition, this comparison is intended as a relative positioning analysis rather than a strict ranking of absolute operating performance.
As shown in Table 6, existing studies have reported meaningful economic and low-carbon improvements in hydrogen-related integrated energy systems, but most do not explicitly combine settlement-oriented DR execution with execution-layer hard-feasibility protection. In this sense, the distinctive contribution of the present work lies not in claiming the largest reported cost-reduction percentage across heterogeneous cases, but in jointly achieving economic improvement, settlement-aware DR implementation, and zero observed hard-constraint non-compliance within a unified sequential decision framework. Therefore, Table 6 should be interpreted as a literature-level positioning analysis, whereas the controlled same-case method comparison is provided in Table 3.
5.4. Limitations
Although the proposed Safety Transformer–PPO shows favorable economic performance, hard-constraint feasibility, and robustness within the tested settings, several limitations should be noted. First, the current validation is conducted on a single anonymized industrial-park configuration; therefore, the reported results should be interpreted as evidence of effectiveness for this class of electro–heat–hydrogen dispatch problems rather than as a guarantee of direct transferability to parks with substantially different device portfolios, tariff mechanisms, demand structures, or hydrogen-consumption patterns. Second, the method remains sensitive to data quality and settlement design, especially baseline-definition choices, metering reliability, and disturbance distributions, all of which may affect verified response quantities, incentive-ledger evolution, and economic performance. Finally, real deployment would also require stable integration with plant-level EMS/SCADA infrastructure, reliable M&V pipelines, and periodic model maintenance under seasonal variation or distribution shifts. These issues define important directions for future work on broader cross-site validation, stronger interpretability, and deployment-oriented system integration.
It should be noted that the present study adopts cost parameters for hydrogen- and heat-related technologies that are held fixed in the tested dispatch setting. In practical industrial deployment, however, these costs may evolve with technology learning, market maturity, and economies of scale, which could in turn affect the relative utilization of hydrogen conversion, storage, and heat-supply units. The reported economic results should therefore be interpreted under the current parameter setting rather than as a prediction of future technology-cost trajectories. Future work will incorporate cost-learning scenarios and scale-dependent parameter settings for hydrogen- and heat-related equipment to further assess their impact on dispatch performance and economic competitiveness.