Article

A Deterministic Assurance Framework for Licensable Explainable AI Grid-Interactive Nuclear Control

by
Ahmed Abdelrahman Ibrahim
* and
Hak-Kyu Lim
Department of Nuclear Engineering, KEPCO International Nuclear Graduate School (KINGS), Ulsan 45014, Republic of Korea
*
Author to whom correspondence should be addressed.
Energies 2025, 18(23), 6268; https://doi.org/10.3390/en18236268
Submission received: 5 October 2025 / Revised: 2 November 2025 / Accepted: 10 November 2025 / Published: 28 November 2025

Abstract

Deploying deep reinforcement learning (DRL) in safety-critical nuclear control is limited less by raw performance than by the absence of licensable, audit-ready evidence. We introduce a Deterministic Assurance Framework (DTAF) that converts controller behavior into licensing-grade proof by combining the following: (i) deterministic licensing gates tied to formal safety and performance limits (e.g., Total Time Unsafe (TTU) = 0; bounded Transient Severity Score (TSS); and minimum Grid Load-Following Index (GLFI)); (ii) a portfolio of adversarial stress tests representative of off-nominal operation; and (iii) a traceability and explainability package that renders every evaluated action auditable. The DTAF is demonstrated on a high-fidelity pressurized-water-reactor (PWR) simulation model used as a software-in-the-loop testbed. Three governor architectures are evaluated under identical, fixed scenarios: a curriculum-trained Soft Actor–Critic (SAC) agent, and Differential-Evolution-optimized Proportional–Integral–Derivative (PID-DE) and Fuzzy-Logic (FLC-DE) Controllers. Performance is assessed deterministically via gate-aligned metrics—TTU, TSS, GLFI, cumulative control effort (CE_sum), valve-reversal count (V_rev), and speed overshoot (OS_ω). Across the adversarial portfolio, the SAC controller meets the predeclared licensing gates in single-run evaluations, whereas the strong conventional baselines violate gates in specific high-severity cases; where all methods remain within the safe envelope, the SAC delivers a higher GLFI and lower CE_sum, with fewer reversals and reduced overshoot. All licensing conclusions derive from deterministic single-run tests; a small, fixed-seed check (three seeds with descriptive intervals) is reported separately as non-licensing supplementary analysis. By producing transparent, reproducible artifacts, the DTAF offers a regulator-oriented pathway for qualifying DRL controllers in grid-interactive nuclear operations.

1. Introduction

Deep reinforcement learning (DRL) has great potential to improve nuclear reactor control performance, especially for load-following in Pressurized Water Reactors (PWRs). However, a major gap separates this potential from practical deployment: current DRL controllers cannot be licensed under today’s nuclear safety regulations. The U.S. Nuclear Regulatory Commission (NRC) and others have identified verification and validation (V&V), trustworthiness, and assurance as the primary hurdles to adopting AI-based control in safety-critical nuclear systems [1,2]. In essence, black-box control policies that lack deterministic, auditable guarantees are disqualified from licensing. PWR grid load-following demands extremely high reliability and predictability—yet DRL agents, despite their technical prowess, face a regulatory stalemate without a suitable testing and assurance protocol. The missing piece is an auditable deterministic test protocol that can demonstrate an AI controller’s safety to licensing authorities beyond any statistical doubt.
To address this barrier, we propose a Deterministic Assurance Framework (DTAF) tailored for licensing-grade evaluation of AI controllers. The DTAF introduces the following: (i) explicit deterministic licensing gates tied to hard safety limits (for example, preventing frequency trips and bounding speed overshoot OS_ω within allowable margins); (ii) an adversarial stress-test portfolio comprising worst-case grid disturbances and transients designed to probe the controller’s safety envelope; and (iii) a comprehensive traceability and explainability package (integrating eXplainable Reinforcement Learning, XRL) that produces human-interpretable evidence of the controller’s decisions. Under the DTAF, an AI governor must provably keep all key performance indicators within pre-defined deterministic limits during these stress tests. The framework thereby transforms black-box DRL behavior into a licensable pipeline of quantitative gates, stress scenarios, and explainable artifacts that regulators can audit.
We demonstrate the DTAF using a high-fidelity PWR digital model (DM) as a realistic simulation testbed. Three governor controllers are evaluated under identical conditions: a Soft Actor–Critic (SAC) DRL agent, a PID controller optimized by Differential Evolution (PID-DE), and a Fuzzy-Logic Controller optimized by Differential Evolution (FLC-DE). The evaluation metrics include the Total Time Unsafe (TTU), Transient Severity Score (TSS), Grid Load-Following Index (GLFI), cumulative control effort (CE_sum), valve reversal count (V_rev), and speed overshoot (OS_ω). Each metric is mapped to a specific licensing gate with a fixed threshold. For example, the TTU must remain zero for all protected variables, the TSS must not exceed defined severity limits, and the GLFI must stay above minimum grid-following requirements. By design, if any single gate is violated in a test scenario, the candidate controller is disqualified—a stringent criterion aligned with nuclear safety standards.
Our research addresses three core questions under this deterministic protocol: (Q1) Can the SAC agent meet all deterministic licensing gates across an array of adversarial grid disturbance scenarios? (Q2) How do the SAC’s safety, robustness, and performance compare to the PID-DE and FLC-DE baselines under identical stress conditions? (Q3) Can the integrated XRL traceability package explain the SAC’s control actions during severe transients in a way that supports regulatory auditability? To ensure absolute clarity, we declare upfront that all evaluations are fully deterministic. Every stress-test scenario is run in a non-stochastic manner with fixed initial conditions and no random disturbances. The only exception is an auxiliary robustness check (Section 4.10) where we repeat tests with three different fixed random seeds to compute descriptive 95% confidence intervals—and even there, the seeds are predetermined for reproducibility. Crucially, all licensing conclusions in this paper are drawn solely from deterministic results, not from any statistical or probabilistic analysis.
In summary, this work offers three main contributions. (1) We develop the Deterministic Assurance Framework (DTAF)—a regulator-aligned evaluation framework that combines deterministic performance gates, adversarial stress scenarios, and XRL-based traceability into a unified, licensable test pipeline. (2) We present a comprehensive benchmark comparison of a state-aware SAC controller against optimally tuned conventional controllers (PID-DE and FLC-DE) on a high-fidelity PWR simulation model, using identical operating conditions and disturbance scenarios for a fair, rigorous assessment. (3) We provide a complete reproducibility and audit package for public release (containing all training scripts, configuration files, logged time series, and agent checkpoints), enabling independent validation of results. Together, these contributions demonstrate a viable path to make DRL controllers licensable for grid-interactive nuclear plant control by design—resolving the DRL licensing barrier through determinism, stress-testing, and explainability.

2. Related Work

2.1. Reinforcement Learning in Nuclear, Power, and Turbomachinery Control (2019–2025)

In recent years, researchers have explored DRL-based control across nuclear and energy domains to test feasibility. In the nuclear sector, deep RL agents have been applied to reactor operation tasks with encouraging results. For example, Gong et al. survey numerous implementations and conclude that RL can handle complex multi-objective control scenarios in nuclear power plants [3]. Specific case studies have demonstrated that DRL governors can perform continuous reactor coolant and power regulation while meeting multiple objectives [4]. Even microreactors—compact reactors with fast dynamics—have seen prototype RL controllers achieving improved load-following performance over baseline PID tuning [5]. Similarly, in the broader power and turbomachinery arena, DRL techniques have shown promise. A notable example is the use of multi-agent deep RL to optimize boiler–turbine control systems, where adaptive RL-tuned PID policies outperformed conventional tuning in managing a nonlinear multi-input system [6]. These works collectively indicate that DRL can indeed learn effective control policies for complex energy systems. However, they stop at demonstrating performance and do not furnish the licensing-grade evidence (deterministic guarantees, formal safety checks, etc.) that regulators require. In other words, while DRL feasibility in nuclear and turbomachinery control has been established in the 2019–2025 literature, none of these studies deliver the audit-ready determinism or comprehensive safety assurance needed for actual deployment.

2.2. Safety-Aware, Constrained, and Robust RL; Adversarial Evaluation

Concurrently, the DRL research community has developed various techniques to enhance the safety and robustness of learning agents, as well as methods to adversarially test them. One line of work focuses on constrained and safe RL algorithms that enforce safety criteria during training. For instance, Sun et al. propose a chance-constrained RL controller for power plant supervision that uses Lagrange multipliers to strictly respect state constraints (e.g., reactor thermal limits) throughout the learning process [7]. Such approaches embed nuclear engineering knowledge (safety setpoints and margins) directly into the RL optimization, yielding agents that never experience unsafe excursions even while exploring. Another important direction is the adversarial stress-testing of RL policies. Here, the idea is to actively generate worst-case scenarios to probe an agent’s reliability. In the autonomous driving domain, Feng et al. demonstrate a “dense” deep-RL approach that trains adversarial background agents to expose rare failure modes of a vehicle’s policy [8]. This concept—using AI to test AI—highlights how critical safety scenarios can be systematically uncovered. Despite these advances in safe RL and adversarial evaluation, few efforts have integrated them into a unified, regulator-oriented pipeline. In prior nuclear control studies, RL controllers were typically evaluated on nominal or randomly sampled scenarios, rather than an exhaustive set of adversarial drills. Moreover, while robust RL algorithms can limit certain risks (e.g., by adding noise or using domain randomization), regulators ultimately demand deterministic evidence of safety. To date, there is no standard framework in the nuclear domain that combines constrained RL training, adversarial scenario generation, and formal gate-checking of results. Our work fills this gap by incorporating safety constraints and adversarial tests within the DTAF’s deterministic evaluation regime.

2.3. Explainability and Traceability for Deep RL

As DRL agents make increasingly complex control decisions, explainability and traceability have become critical for acceptance in safety-critical systems. The subfield of eXplainable RL (XRL) has produced a variety of techniques to interpret an agent’s behavior [9]. These include feature local sensitivity methods, which highlight the most influential state variables behind an action (e.g., identifying that a reactor RL agent mainly responds to power level and fuel temperature deviations), and policy simplification methods such as surrogate modeling or rule extraction, which approximate the agent’s policy with a human-readable model. For instance, one can train a simple decision tree or linear model to mimic the DRL policy locally, thereby revealing the policy’s decision logic in specific scenarios. Another approach is behavioral cloning or trajectory analysis, where the RL agent’s state-action trajectories under various disturbances are recorded and compared to those of classical controllers to pinpoint differences in strategy. In the nuclear domain, initial steps have been taken to apply XAI techniques for operator support—for example, M. Najar and X. Wang develop explainable AI models to aid reactor operators during accidents [10]. However, achieving full traceability of a DRL controller’s decisions under extreme transients remains an open challenge. Most XRL studies to date focus on either relatively simple environments or post-hoc visualization tools, often divorced from formal verification needs. In our framework, explainability is not an afterthought but a built-in component: the DTAF’s traceability package generates artifacts (such as annotated time-series plots, policy local sensitivity maps, and scenario-wise action comparisons) that accompany each stress-test result. This ensures that for every deterministic pass/fail outcome, there is a corresponding human-interpretable explanation. 
By aligning XRL techniques with the adversarial test scenarios, we aim to make the DRL agent’s behavior transparent and auditable—satisfying the regulator’s need to know why the AI acted as it did, especially in borderline conditions near safety limits.

2.4. High-Fidelity Simulation Models (Digital Models) for Nuclear/Power Control

High-fidelity simulation models, also referred to as digital models (DMs), serve as indispensable tools for developing and validating control logic in nuclear and power systems. These models emulate the real plant’s physics with sufficient detail to capture transient behaviors, making them ideal for rigorous software-in-the-loop testing. In our context, a PWR simulation DM is used as the core testbed within the DTAF to evaluate all controllers under identical conditions. The advantage of a DM is that extreme scenarios—including rapid load ramps, large disturbances, and equipment faults—can be safely executed and repeated deterministically. Researchers have increasingly employed such digital platforms for AI-driven control studies. For example, Lim et al. describe a high-fidelity PWR simulation framework to train and test an RL-based supervisory controller for advanced reactors [11]. By leveraging a detailed simulator of a Generation-IV plant, they could evaluate the RL agent’s long-term performance and maintenance decisions without any risk to real equipment. In general, DMs allow controllers to be stress-tested in silico against scenarios that might be too dangerous or rare to test on actual reactors or turbines. This not only accelerates development but is also a prerequisite for licensing—regulators will not consider AI control strategies that have not been thoroughly vetted on validated simulation models. Today, such evaluations are typically confined to software-in-the-loop experiments, but the same models can be extended to hardware-in-the-loop setups (e.g., connecting the simulator to physical controller hardware or plant interfaces) under the DTAF methodology. The key point is that the DTAF’s deterministic protocol is model-agnostic: it can be applied to any high-fidelity DM to produce evidence (logs, safety gate outcomes, and traceability reports) that is replayable and reviewable.
This approach ensures that by the time an AI controller is a candidate for on-site trials, it comes with a complete simulation-backed safety dossier. In summary, high-fidelity DMs act as the proving ground where modern control algorithms can earn trust by demonstrating compliance with all operational limits and safety requirements in a virtual yet realistic environment [12,13].

2.5. Comparative Synthesis and Benchmark Rationale

Our study synthesizes insights from the above strands into a cohesive assurance framework, with an emphasis on comparing DRL against strong conventional baselines. In commercial reactor operations, traditional controllers like PID and fuzzy logic remain the dominant solutions due to their stability and regulatory acceptance [14,15]. Over the past decade, many researchers have improved these classical controllers using modern optimization techniques (genetic algorithms, particle swarm, and differential evolution) to enhance performance for complex reactor maneuvers [14,15]. Despite this, prior RL works in nuclear control have rarely, if ever, pitted a DRL agent against an equally well-tuned classical controller under stress conditions. Most feasibility studies compared RL to either a default PID or no baseline at all, leaving open the question of whether the AI truly excels beyond what a properly optimized conventional controller could do. By contrast, we adopt the “Strong Benchmark” philosophy: the SAC agent must demonstrably surpass a PID-DE and FLC-DE that have been optimized across many scenarios. This stringent baseline provides a higher confidence threshold for safety-critical acceptance—an AI that only matches a mediocre controller would not justify the licensing risk. Furthermore, our work combines safety, robustness, and explainability in one deterministic framework, whereas previous research typically addressed these aspects in isolation. For instance, some studies incorporate safety constraints or robust training, and others propose XAI methods, but none have unified them into a single pipeline oriented toward regulator review. It is worth noting that we do not include advanced model-based controllers like MPC or H∞ in our benchmarks; this is a deliberate choice to keep the evaluation controller-centric. 
While methods such as MPC can yield excellent performance, they introduce model-dependent tuning and complexity that are beyond our scope—our focus is on comparing a learning-based policy with human-engineered policies under identical conditions. Importantly, excluding MPC/H∞ also reflects practical considerations: nuclear plants today still rely on PID-family controllers [14], so demonstrating AI superiority over this familiar baseline is a more direct and convincing argument for stakeholders. Finally, we emphasize that our DTAF approach aligns with formal safety expectations. Nuclear design standards like IAEA No. SSR-2/1 (Rev. 1) mandate that all operational transients remain within defined, bounded limits [12], and software safety guidelines (IAEA Safety Standards Series No. SSG-39) stress the importance of deterministic behavior in systems important to safety [13]. However, as a recent review pointed out, most AI control research lacks comprehensive adversarial validation frameworks to ensure these criteria are met [16]. By integrating optimized baseline comparisons, adversarial scenario testing, and XRL-driven transparency, we provide a template for deterministic assurance that addresses this gap. In short, our benchmark rationale and synthesis highlight the novelty of the DTAF: it is not about pushing an RL agent in isolation but about proving that the agent can reliably and explainably beat the best conventional solutions under the exacting conditions regulators care about.

2.6. Rationale for Differential Evolution (DE) in Strong Baseline Optimization

We adopt Differential Evolution (DE) to optimize the Proportional–Integral–Derivative (PID) and Fuzzy-Logic Controller (FLC) baselines, so that the conventional comparators in the Deterministic Assurance Framework (DTAF) are truly “strong.” DE is a derivative-free, population-based global optimizer with few hyperparameters; it is simple to implement and reproduce, and it is effective on nonconvex, multimodal, and noisy or piecewise-discontinuous closed-loop objectives typical of controller tuning [17,18,19]. Strategy-adaptive DE variants further improve robustness across heterogeneous problems [20], and large comparative studies show DE variants to be highly competitive—often superior on average—to Particle Swarm Optimization (PSO) across numerical benchmarks and real-world tasks [21]. In power-system control specifically, DE has been applied successfully to tune load-frequency and governor-related controllers, providing a domain-proximal precedent for our use in baseline construction [22]. Alternative optimizers are less suitable for our gate-aware objective: exhaustive grid searches are inefficient in high-dimensional, constrained spaces; gradient-based methods require smoothness and reliable derivatives that closed-loop plants and penalty-augmented objectives generally lack; and Bayesian optimization introduces surrogate-modeling overhead and typically needs specialized treatments for heteroscedastic/noisy evaluations—factors that complicate transparency and reproducibility for licensing evidence [23,24]. In the DTAF, we formulate a composite cost that aggregates the licensing metrics—the Total Time Unsafe (TTU), Transient Severity Score (TSS), Grid Load-Following Index (GLFI), cumulative control effort (CE_sum), valve-reversal count (V_rev), and speed overshoot (OS_ω)—with hard (infinite) penalties for any gate violation and soft penalties within the safe envelope. 
For comparability, the population size, mutation factor, crossover rate, and generation budget are fixed across governors; all settings and seeds are released with the reproducibility package.
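To make the tuning procedure concrete, the sketch below hand-rolls a DE/rand/1/bin loop that minimizes a gate-aware composite cost. The plant here is a deliberately simple stand-in (a first-order lag under PI control tracking a unit step), and the gains, weights, and 10% overshoot gate are illustrative assumptions; the paper's actual cost aggregates TTU, TSS, GLFI, CE_sum, V_rev, and OS_ω over the full PWR model. The hard (infinite) penalty on a gate violation and the soft penalties inside the safe envelope follow the formulation described above.

```python
import random

def closed_loop_metrics(kp, ki, dt=0.05, steps=400):
    """Toy stand-in for the plant simulation: a first-order lag under PI
    control tracking a unit step. Returns gate-aligned summary metrics."""
    y = integ = effort = overshoot = ise = 0.0
    for _ in range(steps):
        e = 1.0 - y
        integ += ki * dt * e               # integral action
        u = kp * e + integ                 # PI command
        effort += abs(u) * dt              # crude CE_sum analogue
        y += dt * (u - y)                  # plant: tau = 1 s
        overshoot = max(overshoot, y - 1.0)
        ise += e * e * dt
    return {"OS": overshoot, "ISE": ise, "CE_sum": effort}

def composite_cost(params, os_gate=0.10):
    """Gate-aware cost: a hard (infinite) penalty on any gate violation,
    soft weighted penalties inside the safe envelope."""
    m = closed_loop_metrics(*params)
    if m["OS"] > os_gate:                  # deterministic licensing gate
        return float("inf")
    return m["ISE"] + 0.01 * m["CE_sum"]

def differential_evolution(cost, bounds, pop_size=20, F=0.7, CR=0.9,
                           gens=40, seed=1):
    """Minimal DE/rand/1/bin with greedy selection and bound clamping."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [cost(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            trial = [pop[i][d] if rng.random() > CR
                     else min(max(pop[a][d] + F * (pop[b][d] - pop[c][d]),
                                  bounds[d][0]), bounds[d][1])
                     for d in range(dim)]
            f = cost(trial)
            if f <= fit[i]:                # greedy one-to-one replacement
                pop[i], fit[i] = trial, f
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]
```

Because selection never accepts a gate-violating trial over a feasible incumbent, the returned tuning is feasible by construction whenever any feasible point is found, which mirrors the disqualification semantics of the licensing gates.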

3. Methodology

3.1. Plant Simulation Environment

We model a Pressurized Water Reactor (PWR) as a deterministic, single-loop surrogate coupling six-group point kinetics, lumped thermal–hydraulics, a first-order valve servo with rate limiting, a first-order turbine path, and a synchronous generator tied to an infinite bus. The overall architecture and signal flow are shown in Figure 1 and Figure 2, respectively. All symbols are defined at point of use; all parameter values appear in-section to ensure exact reproducibility.
Figure 1 illustrates the complete architecture of the proposed Deterministic Assurance Framework (DTAF), which is engineered to rigorously train, test, and formally verify AI controllers within a safe, high-fidelity simulation environment. The framework is structured as a four-layer hierarchy, with each layer encapsulating a distinct set of functions. At the core, Layer 4 (agent) comprises the Soft Actor-Critic (SAC) reinforcement learning controller, which is responsible for generating real-time control commands, or actions. The agent interacts directly with Layer 3 (environment), the PWR-simulation model, which is wrapped in a PWR Unified Gym Environment interface. This environment executes the agent’s action, calculates the resultant system dynamics, and returns the new state vector and a scalar reward signal, thus closing the standard RL feedback loop. This core loop is governed by Layer 2 (The Analysis Engine), which orchestrates the entire experimental and validation process. This layer is responsible for executing the suite of adversarial stress tests, performing multi-run robustness checks, and conducting the explainability (XAI) analyses required to interpret the agent’s decision-making logic.
Finally, Layer 1 (The Deterministic Assurance Engine) serves as the highest level of oversight. It defines the formal safety and control objectives, such as the operational limits for grid frequency (ftrip) and average reactor temperature (Tavg). This engine performs the ultimate safety verification, systematically ensuring that all agent-driven behaviors, even under stress, remain within the pre-defined safe operating envelope.
Figure 2 presents the schematic of the high-fidelity Pressurized Water Reactor (PWR) simulation model, which functions as the core testbed (Layer 3) within the DTAF. The model accurately captures the plant’s essential dual-loop thermodynamic processes critical for load-following simulations. The Primary Loop circulates pressurized water to transfer thermal energy from the Reactor Core to the Steam Generator. The Pressurizer maintains the high system pressure required to prevent the primary coolant from boiling. In the Secondary Loop, this thermal energy is converted into electrical power. Water in the Steam Generator boils, producing high-pressure steam that drives the turbine. The turbine’s rotational energy is converted into electricity by the generator, which is synchronized with the Electrical Grid. This diagram highlights the primary control interface for the AI agent. The agent’s scalar action output directly actuates the Governor Valve (Av), which regulates the mass flow rate of steam to the turbine. The agent’s control objective is to modulate this valve to maintain the stability of the grid frequency (f) by balancing power generation with grid demand, particularly during challenging load-following transients.
Clarification on the Digital Model and Framework Scope:
It is important to clarify the precise role of the PWR simulation within the DTAF. In this study, the high-fidelity model serves as a digital model—a validated, self-contained simulation environment that acts as a representative proxy for the physical plant. While the term ‘digital twin’ often implies a persistent, bidirectional data link to a specific physical asset, our use here refers to a high-fidelity simulation testbed.
The primary novelty of this paper is the DTAF itself: a model-agnostic assurance workflow. The ‘bidirectional data exchange’ and ‘communication protocols’ are therefore software-in-the-loop (SIL) interactions, representing the data passed between the DTAF’s analysis engine and the simulation environment (i.e., states, actions, and rewards). The DTAF is architected to be equally applicable to a fully-fledged digital twin or even a physical system in a hardware-in-the-loop (HIL) configuration. This study uses the high-fidelity digital model to formally validate the framework’s ability to assess, stress-test, and certify controller behavior against established benchmarks. The DTAF framework can be extended in future work to a full-scale digital twin.

3.1.1. Neutron Point Kinetics (Six Delayed Groups)

$$\dot{n}(t) = \frac{\rho(t) - \beta}{\Lambda}\, n(t) + \sum_{i=1}^{6} \lambda_i C_i(t) + S(t) \tag{1}$$
where n(t) is the normalized neutron density [-]; Λ is the prompt neutron generation time [s]; ρ(t) is the total reactivity [Δk/k]; β is the total delayed neutron fraction [-]; λ_i is the decay constant of precursor group i [s−1]; C_i(t) is the concentration of group-i precursors [-]; and S(t) is an external source (set to 0 in nominal runs).
$$\dot{C}_i(t) = \frac{\beta_i}{\Lambda}\, n(t) - \lambda_i C_i(t), \qquad i = 1, \ldots, 6 \tag{2}$$
where β_i is the fraction of delayed neutrons from group i [-], satisfying $\sum_{i=1}^{6} \beta_i = \beta$.
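A minimal explicit-Euler integrator for the kinetics pair can be sketched as follows. The group constants and generation time are typical PWR benchmark values, not the paper's Table 1 and Table 2 entries, which are not reproduced here. Note that the prompt-kinetics term is stiff, so an explicit step at this time-scale needs a Δt much smaller than the 0.05 s plant step (or an exact-equilibrium start).

```python
# Illustrative six-group PWR constants (typical benchmark values; the
# paper's exact Table 1 and Table 2 entries are not reproduced here).
BETA_I   = [0.000215, 0.001424, 0.001274, 0.002568, 0.000748, 0.000273]
LAMBDA_I = [0.0124, 0.0305, 0.111, 0.301, 1.14, 3.01]  # decay constants [1/s]
BETA     = sum(BETA_I)            # total delayed-neutron fraction
GEN_TIME = 2.0e-5                 # prompt generation time Lambda [s], assumed

def equilibrium_precursors(n=1.0):
    """Steady-state C_i from the precursor balance: C_i = beta_i*n/(Lambda*lambda_i)."""
    return [b * n / (GEN_TIME * l) for b, l in zip(BETA_I, LAMBDA_I)]

def kinetics_step(n, C, rho, dt, S=0.0):
    """One explicit-Euler step of the six-group point-kinetics equations.
    The prompt term is stiff: dt must be far below the 0.05 s plant step."""
    dn = ((rho - BETA) / GEN_TIME) * n \
         + sum(l * c for l, c in zip(LAMBDA_I, C)) + S
    dC = [(b / GEN_TIME) * n - l * c for b, l, c in zip(BETA_I, LAMBDA_I, C)]
    return n + dt * dn, [c + dt * dc for c, dc in zip(C, dC)]
```

Starting from the equilibrium precursor concentrations, a zero-reactivity step leaves the neutron density unchanged, while a small positive insertion produces the expected prompt jump followed by slow delayed-neutron growth.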

3.1.2. Reactivity Feedback and Power Mapping

$$\rho(t) = \rho_{\mathrm{ext}}(t) + \alpha_f \left[ T_f(t) - T_{f0} \right] + \alpha_c \left[ T_c(t) - T_{c0} \right] \tag{3}$$
where ρ_ext(t) is the exogenous reactivity [Δk/k] (0 in this study); α_f and α_c are the fuel and coolant temperature coefficients [Δk/k·°C−1]; T_f(t) and T_c(t) are the lumped fuel and coolant temperatures [°C]; and T_f0 and T_c0 are the reference temperatures [°C]. Here α_f = 0 and α_c = 0 to isolate the controller behavior deterministically.
$$P_{th}(t) = \kappa_P\, n(t) \tag{4}$$
where P_th(t) is the reactor thermal power [MW] and κ_P = 1000.0 MW maps the normalized neutron density to thermal power.

3.1.3. Lumped Thermal–Hydraulic Model

$$C_f\, \dot{T}_f(t) = P_{th}(t) - U_{fc} \left[ T_f(t) - T_c(t) \right] \tag{5}$$
where C_f = 30.0 MJ/°C is the effective fuel heat capacity and U_fc = 2.0 MW/°C is the fuel–coolant conductance.
$$C_c\, \dot{T}_c(t) = U_{fc} \left[ T_f(t) - T_c(t) \right] - U_{cs} \left[ T_c(t) - T_{s0} \right] \tag{6}$$
where C_c = 50.0 MJ/°C is the effective coolant heat capacity; U_cs = 20.0 MW/°C is the coolant–secondary conductance; and T_s0 = 270.0 °C is the fixed secondary-side sink temperature.
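With the stated constants, the two lumped balances reduce to a short Euler update, and the equilibrium temperatures follow in closed form (T_c = T_s0 + P_th/U_cs, T_f = T_c + P_th/U_fc). A sketch, using the in-text parameter values:

```python
C_F, U_FC = 30.0, 2.0    # fuel heat capacity [MJ/degC], fuel-coolant conductance [MW/degC]
C_C, U_CS = 50.0, 20.0   # coolant heat capacity [MJ/degC], coolant-secondary conductance
T_S0 = 270.0             # fixed secondary-side sink temperature [degC]

def thermal_step(Tf, Tc, P_th, dt=0.05):
    """One forward-Euler step of the lumped fuel/coolant energy balances."""
    dTf = (P_th - U_FC * (Tf - Tc)) / C_F
    dTc = (U_FC * (Tf - Tc) - U_CS * (Tc - T_S0)) / C_C
    return Tf + dt * dTf, Tc + dt * dTc

def thermal_steady_state(P_th):
    """Analytic equilibrium at constant thermal power."""
    Tc = T_S0 + P_th / U_CS
    Tf = Tc + P_th / U_FC
    return Tf, Tc
```

At P_th = 1000 MW the equilibrium is T_c = 320 °C and T_f = 820 °C, and the Euler iteration relaxes to these values from any nearby initial condition; the slowest coupled mode has a time constant on the order of tens of seconds, well resolved by the 0.05 s step.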

3.1.4. Valve, Turbine-Governor, and Generator

$$\tau_v\, \dot{v}(t) = u(t) - v(t), \qquad v \in [0, 1] \tag{7}$$
where v(t) is the valve position [-]; u(t) is the controller command [-]; and τ_v = 0.30 s is the valve-servo time constant. A deterministic rate limiter $\lvert \dot{v} \rvert \le r_{\max}$ with r_max = 0.15 s−1 is applied to the commanded motion.
$$\tau_m\, \dot{P}_m(t) = K_t\, v(t) - P_m(t) \tag{8}$$
where P_m(t) is the mechanical power into the turbine [MW]; τ_m = 3.0 s is the steam-path/turbine lag; and K_t = 900.0 MW/- maps the valve position to mechanical power.
$$P_e(t) = \eta_g\, P_m(t) \tag{9}$$
where P_e(t) is the electrical power to the grid [MW] and η_g = 0.98 is the generator efficiency.
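The valve servo with its deterministic rate limiter, the turbine lag, and the generator map chain into a single per-step update. A sketch with the stated parameter values:

```python
TAU_V, R_MAX = 0.30, 0.15   # valve-servo time constant [s], rate limit [1/s]
TAU_M, K_T   = 3.0, 900.0   # turbine lag [s], valve-to-mechanical-power gain [MW]
ETA_G        = 0.98         # generator efficiency

def actuator_step(v, Pm, u, dt=0.05):
    """One Euler step of the valve servo (rate-limited, position-bounded),
    the first-order turbine path, and the algebraic generator output."""
    dv = (u - v) / TAU_V
    dv = max(-R_MAX, min(R_MAX, dv))     # deterministic rate limiter
    v = max(0.0, min(1.0, v + dt * dv))  # physical position bounds [0, 1]
    Pm += dt * (K_T * v - Pm) / TAU_M    # turbine lag
    return v, Pm, ETA_G * Pm             # electrical power out
```

Holding the command at the implied nominal valve position v_0 = 600/(0.98 × 900) keeps the chain at the 600 MW operating point, while a full step command moves the valve by at most r_max·Δt per step.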

3.1.5. Measured Outputs and Signals

$$y(t) = \left[ P_e(t),\; v(t),\; T_f(t),\; T_c(t) \right]^{\mathsf{T}}, \qquad f(t) = f_{\mathrm{grid}}(t), \qquad f_{\mathrm{nom}} = 60\ \mathrm{Hz} \tag{10}$$
where y(t) collects the outputs used by controllers and metrics; f(t) is the measured grid frequency [Hz] from the infinite bus; and f_nom = 60 Hz is nominal.

3.1.6. Parameters and Numerics

Numerical integration uses a fixed step Δt = 0.05 s (forward Euler). Initial conditions correspond to the steady state at P_e = P_ref = 600 MW with infinite-bus synchronism. The implied nominal valve position is v_0 ≈ P_ref/(η_g K_t) = 600/(0.98 × 900) ≈ 0.680, and the thermal states satisfy (5) and (6) at steady power with T_s0.
The six-group point-kinetics constants used in (1) and (2) are drawn from a standard PWR benchmark set, and can be seen in Table 1.
For reproducibility of the precursor balance in (2), the delayed-neutron fractions are provided per group, as seen in Table 2.
The thermal blocks used in (5) and (6) and the power-mapping in (4) use the constants listed below in Table 3.
The valve servo (7), turbine path (8), and electrical output (9) are parameterized as follows in Table 4.
Actuator bounds and the integration step used throughout are summarized next, as seen in Table 5.

3.2. Controllers

3.2.1. Proportional–Integral–Derivative (PID) Governor

We employ a discrete-time PID with a filtered derivative and conditional anti-windup. Letting e(k) = r(k) − y(k) be the tracking error, Δt the loop period, u(k) ∈ [u_min, u_max] the valve command, and r_max the rate limit, saturation precedes rate limiting.
$$u_P(k) = K_p\, e(k) \tag{11}$$
where K_p is the proportional gain [-], and e(k) is the tracking error (setpoint minus measured output).
$$I(k) = I(k-1) + K_i\, \Delta t\, e(k) \tag{12}$$
where K_i is the integral gain [s−1], and Δt is the loop period [s].
$$\tau_d\, \dot{\psi}(t) + \psi(t) = \dot{e}(t), \qquad u_D(t) = K_d\, \psi(t) \tag{13}$$
where ψ(t) is the filtered derivative state [1/s]; τ_d is the derivative filter time constant [s]; and K_d is the derivative gain [s].
$$\psi(k) = a\, \psi(k-1) + b \left[ e(k) - e(k-1) \right], \qquad a = \frac{2\tau_d - \Delta t}{2\tau_d + \Delta t}, \qquad b = \frac{2}{2\tau_d + \Delta t} \tag{14}$$
where ψ(k) is the discrete filtered derivative; a and b are bilinear (Tustin) coefficients ensuring a stable first-order low-pass on the derivative of e(k).
$$u_D(k) = K_d\, \psi(k) \tag{15}$$
where u_D(k) is the derivative contribution to the command [-].
$$u_{\mathrm{raw}}(k) = u_P(k) + I(k) + u_D(k) \tag{16}$$
where u_raw(k) is the unsaturated command [-].
$$u_{\mathrm{sat}}(k) = \min \left\{ \max \left\{ u_{\mathrm{raw}}(k),\, u_{\min} \right\},\, u_{\max} \right\} \tag{17}$$
enforcing the physical valve limits u_min ≤ u_sat(k) ≤ u_max.
$$u(k) = u(k-1) + \mathrm{clip}\!\left( u_{\mathrm{sat}}(k) - u(k-1),\, -r_{\max}\Delta t,\, r_{\max}\Delta t \right) \tag{18}$$
Post-saturation rate limiting ensures |u(k) − u(k − 1)| ≤ r_max·Δt.
The PID gains and limits used in all deterministic runs are fixed and listed below in Table 6.
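Assembled into code, the governor above reads as follows. The gains used here are illustrative placeholders rather than the DE-tuned values of Table 6, and the conditional anti-windup is implemented as "commit the integral only when the command is unsaturated", one common reading of the text.

```python
class PIDGovernor:
    """Discrete PID with Tustin-filtered derivative, output saturation,
    then rate limiting (saturation precedes rate limiting, as stated).
    Gains are illustrative; the paper's DE-tuned values are in its Table 6."""

    def __init__(self, kp, ki, kd, tau_d, dt=0.05,
                 u_min=0.0, u_max=1.0, r_max=0.15):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.dt, self.u_min, self.u_max, self.r_max = dt, u_min, u_max, r_max
        self.a = (2 * tau_d - dt) / (2 * tau_d + dt)  # bilinear coefficients
        self.b = 2.0 / (2 * tau_d + dt)
        self.I = 0.0        # integral state
        self.psi = 0.0      # filtered derivative state
        self.e_prev = 0.0
        self.u_prev = 0.0

    def step(self, r, y):
        e = r - y
        I_cand = self.I + self.ki * self.dt * e            # integral update
        self.psi = self.a * self.psi + self.b * (e - self.e_prev)
        u_raw = self.kp * e + I_cand + self.kd * self.psi  # raw command
        u_sat = min(max(u_raw, self.u_min), self.u_max)    # saturation
        if u_sat == u_raw:
            self.I = I_cand   # conditional anti-windup: commit only if unsaturated
        du = max(-self.r_max * self.dt,
                 min(self.r_max * self.dt, u_sat - self.u_prev))
        u = self.u_prev + du                               # rate limiting
        self.e_prev, self.u_prev = e, u
        return u
```

Under a sustained unit error the command ramps at exactly r_max·Δt per step and never leaves [u_min, u_max], while the uncommitted integral prevents windup during saturation.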

3.2.2. Mamdani Fuzzy-Logic Governor (FLC)

The FLC uses two antecedents—error e and error-rate Δe—and one consequent Δu. Each linguistic variable employs five triangular/trapezoidal sets {NB, NS, ZE, PS, PB}. Inputs are scaled by s_e and s_Δe; the consequent is scaled by s_u. Implication is min(·), aggregation is max(·), and defuzzification is the centroid.
$$e_s = e / s_e, \qquad \Delta e_s = \Delta e / s_{\Delta e} \tag{19}$$
where e_s and Δe_s are the scaled antecedents [-], and s_e and s_Δe are their scales [-].
$$\mu_{C_{ij}}(z) = \min \left\{ \mu_{A_i}(e_s),\, \mu_{B_j}(\Delta e_s) \right\} \tag{20}$$
where the rules are R_ij: (A_i, B_j) → C_ij; implication uses min(·).
$$\mu_C(z) = \max_{i,j}\, \mu_{C_{ij}}(z) \tag{21}$$
where μ_C(z) is the aggregated consequent membership [-].
$$\Delta u = s_u\, \frac{\int z\, \mu_C(z)\, dz}{\int \mu_C(z)\, dz} \tag{22}$$
where s_u is the output scale [-]; Δu is the defuzzified valve increment [-]; and the final command u is obtained by accumulating Δu and enforcing (17) and (18).
$$\mu_{\mathrm{TRI}}(z;\, a, b, c) = \max \left\{ \min \left\{ \frac{z - a}{b - a},\, \frac{c - z}{c - b} \right\},\, 0 \right\} \tag{23}$$
is the normalized triangular membership function used for NB, NS, ZE, PS, PB, with breakpoints (a, b, c) clamped to z ∈ [−1, 1].
The 5 × 5 rule base maps (e, Δe) to Δu linguistic labels as follows, in Table 7.
Input/output scales are fixed and listed next in Table 8.
To ensure exact reproducibility of the FLC, normalized triangular membership breakpoints for es and Δes are provided in Table 9.
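The Mamdani pipeline above can be sketched end to end—scaling, min implication, max aggregation, and centroid defuzzification. For brevity, the sketch uses a reduced three-set rule base (NB/ZE/PB) with a simple diagonal rule mapping and illustrative breakpoints, not the five-set base of Table 7 or the breakpoints of Table 9.

```python
import numpy as np

def tri(z, a, b, c):
    """Normalized triangular membership mu_TRI(z; a, b, c)."""
    return np.maximum(np.minimum((z - a) / (b - a), (c - z) / (c - b)), 0.0)

def mamdani_du(e, de, s_e=1.0, s_de=1.0, s_u=1.0):
    """Scaled inputs -> min implication -> max aggregation -> centroid.
    Three-set illustrative rule base; the paper uses five sets."""
    es, des = np.clip(e / s_e, -1.0, 1.0), np.clip(de / s_de, -1.0, 1.0)
    z = np.linspace(-1.0, 1.0, 201)                # consequent universe
    brk = {"NB": (-2.0, -1.0, 0.0), "ZE": (-1.0, 0.0, 1.0), "PB": (0.0, 1.0, 2.0)}
    order = ["NB", "ZE", "PB"]
    agg = np.zeros_like(z)
    for i, A in enumerate(order):
        for j, B in enumerate(order):
            w = min(tri(es, *brk[A]), tri(des, *brk[B]))     # min implication
            k = int(np.clip(i + j - 2, -1, 1)) + 1           # diagonal rule base
            agg = np.maximum(agg, np.minimum(w, tri(z, *brk[order[k]])))
    if not np.any(agg):
        return 0.0
    return s_u * float(np.sum(z * agg) / np.sum(agg))        # discrete centroid
```

At the origin only the ZE×ZE rule fires and the centroid is zero; a large positive error pushes Δu toward the PB consequent.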

3.2.3. Observation Normalization and Safety Bounds

Continuous variables are normalized by fixed scales S i and clipped to admissible bounds. Safety gates are deterministically applied and terminate an episode when violated.
x_norm,i = clip(x_i / S_i; x_i,min, x_i,max)
where x_norm,i denotes the normalized ith component of the observation vector.
The normalization scales and safety thresholds used in all runs are listed below in Table 10.

3.2.4. Reward Shaping

Reward is defined once here and referenced in downstream Section 3.3 and Section 3.4 without redefinition. A calm-state multiplier attenuates penalties inside tight frequency/power bands to reduce chattering.
r_t = −κ_t (w_f Δf_t² + w_move |Δu_t| + w_jerk |Δu_t − Δu_{t−1}|) + b_t
where κ_t attenuates penalties during calm periods (Table 10) and b_t is the bonus term (Table 11). All weights are dimensionless because signals are normalized.
Reward weights and bonuses (dimensionless) are fixed as follows in Table 11.
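The shaped reward reduces to one function of the frequency error and the last two valve increments. The weights, bonus, and calm-state multiplier below are illustrative stand-ins for the fixed values of Table 11.

```python
def shaped_reward(df, du, du_prev, calm,
                  w_f=1.0, w_move=0.1, w_jerk=0.05,
                  bonus=0.01, kappa_calm=0.2):
    """Sketch of r_t = -kappa_t*(w_f*df^2 + w_move*|du| + w_jerk*|du - du_prev|) + b_t.
    Weights/bonus are illustrative; the fixed values live in Table 11."""
    kappa = kappa_calm if calm else 1.0          # calm-state multiplier
    penalty = w_f * df**2 + w_move * abs(du) + w_jerk * abs(du - du_prev)
    return -kappa * penalty + (bonus if calm else 0.0)
```

The calm multiplier shrinks the movement and jerk penalties inside the tight band, which is what suppresses valve chattering near steady state.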

3.2.5. Five-Phase Curriculum

Training progresses through five deterministic phases. Advancement requires zero safety breaches across the current phase’s evaluation bundle and non-decreasing mean reward.
N_unsafe = 0 ∧ r_avg^(k) ≥ r_avg^(k−1)
where N_unsafe is the count of safety-gate violations in the phase evaluation bundle, r_avg^(k) is the mean return for phase k, and ∧ denotes logical AND.
Curriculum phases, scenario bundles, and promotion conditions are summarized below in Table 12.

3.2.6. Soft Actor–Critic (SAC) Governor

We use a tanh-squashed Gaussian policy with twin Q-critics and target networks. During evaluation, the action is mapped to the valve command and constrained by the same saturation and rate-limit logic in (17) and (18).
a = tanh(μ_θ(s) + σ_θ(s) ⊙ ε),  ε ~ N(0, I)
where ⊙ denotes the elementwise product between σ_θ(s) and ε, with ε drawn from a standard normal.
log π_θ(a|s) = log N(z; μ_θ, σ_θ) − Σ_i log(1 − tanh²(z_i)),  z = artanh(a)
y = r + (1 − d) γ [min_{j∈{1,2}} Q_j^targ(s′, a′) − α log π_θ(a′|s′)],  a′ ~ π_θ(·|s′)
L_Q = E[(Q_i(s, a) − y)²],  i ∈ {1, 2}
L_π = E[α log π_θ(a|s) − min_i Q_i(s, a)]
L_α = E[−α (log π_θ(a|s) + H̄)]
φ_targ ← τ φ + (1 − τ) φ_targ
where s′ is the successor state, d the terminal flag, H̄ the target entropy, φ the critic parameters, and φ_targ the target-critic parameters updated by Polyak averaging with rate τ.
The SAC hyperparameters used in this work are listed below in Table 13.
Replay/evaluation cadence is deterministic and fixed as follows in Table 14.
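The tanh squashing and its change-of-variables log-probability correction can be checked numerically with NumPy alone; in the deployed agent, μ_θ and σ_θ are network outputs, and deterministic evaluation corresponds to ε = 0 so that a = tanh(μ_θ(s)).

```python
import numpy as np

def squashed_gaussian_action(mu, log_std, eps):
    """Tanh-squashed Gaussian sample and its log-probability, following
    the SAC equations above. Deterministic evaluation uses eps = 0."""
    std = np.exp(log_std)
    z = mu + std * eps                           # pre-squash Gaussian sample
    a = np.tanh(z)                               # bounded action in (-1, 1)
    # Diagonal-Gaussian log-density of z
    log_gauss = -0.5 * np.sum(((z - mu) / std) ** 2
                              + 2.0 * log_std + np.log(2.0 * np.pi))
    # Change-of-variables correction: sum_i log(1 - tanh(z_i)^2)
    log_pi = log_gauss - np.sum(np.log(1.0 - np.tanh(z) ** 2 + 1e-9))
    return a, log_pi
```

The small 1e-9 floor guards the logarithm when the squash saturates; actions always remain strictly inside the (−1, 1) box before being mapped to the valve command and passed through the saturation/rate-limit logic.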

3.2.7. Differential Evolution (DE) for Deterministic Tuning

We tune the PID parameters {K_p, K_i, K_d, τ_d} and the FLC scales {s_e, s_Δe, s_u} using SciPy’s DE (strategy = ‘best1bin’) on a fixed catalogue of scenarios. The objective aggregates metrics to minimize (M↓) and to maximize (M↑), with a failure penalty per gate violation.
J(θ) = Σ_{m∈M↓} w_m m(θ) − Σ_{h∈M↑} w_h h(θ) + λ N_fail(θ)
where N_fail(θ) counts gate violations across the catalogue and λ is the penalty weight.
The metric weights used in the DE objective are listed below in Table 15.
The bounds and algorithmic settings for DE are shown next in Table 16 and Algorithm 1.
Algorithm 1. DE-based multi-scenario tuning.
0 procedure DA_LGE(DT, C = {PID_DE, FLC_DE, SAC}, S, K, G, primary_seed)
1 ▷ Inputs: digital twin DT(Δt), controllers C, scenario portfolio S,
1a  KPI set K, licensing gates G, primary_seed
2 ▷ Outputs: ranking, PASS_c, CRS_c per controller, portfolio report, evidence bundle
3 set_random_seed(primary_seed); enable_global_determinism()
4 for each controller c in C do
5  if c ∈ {PID_DE, FLC_DE} then
6   θ_c ← DifferentialEvolution(J_multi_scenario, bounds, seed = primary_seed)
7   freeze(θ_c)
8  end if
9  R_c ← ∅ ▷ portfolio record for controller c
10  for each scenario s in S do
11   reset(DT, initial_state = s.x0, power_level = s.P)
12   schedule_disturbances(DT, s.disturbances)    ▷ deterministic
13   for k = 0 … ⌊s.T/Δt⌋ − 1 do
14    y_k ← sense(DT)
15    u_k ← policy_c(y_k)
16    u_k ← rate_limit(saturate(u_k))
17    x_{k + 1} ← step(DT, u_k, Δt)
18    log⟨k, x_k, y_k, u_k⟩
19   end for
20   m_{c,s} ← compute_KPIs(K, log)
21   g_{c,s} ← evaluate_gates(G, m_{c,s})    ▷ vector of PASS/FAIL
22   r_{c,s} ← composite_score(m_{c,s}, g_{c,s})
23   R_c ← R_c ∪ {(s, m_{c,s}, g_{c,s}, r_{c,s})}
24  end for
25  CRS_c ← aggregate_scores({r_{c,s}}_s; weights = s.weights)
26  PASS_c ← all_gates_pass({g_{c,s}}_s; hard_fail = ‘any’)
27 end for
28 ranking ← sort_by((PASS_c ↓, CRS_c ↓, var({r_{c,s}}) ↑))
29 export_evidence({R_c}_c, configs, seeds, figures, tables)
30 return ranking, {PASS_c, CRS_c}_c, {R_c}_c
31end procedure
Legend: Δt—simulation step; DE—Differential Evolution; KPIs—tracking/overshoot/settling/unsafe time/control effort; G—licensing gates (hard); CRS—composite rating score; PASS_c—portfolio pass; if any hard gate fails → FAIL. Note: PID_DE—Differential-Evolution-optimized PID; FLC_DE—Differential-Evolution-optimized FLC.
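Lines 6–7 of Algorithm 1 map directly onto a single SciPy call. The sketch below substitutes a toy quadratic for the multi-scenario objective J (which in the paper requires full scenario rollouts), so the bounds and optimum are illustrative only; the strategy and fixed seed match the deterministic tuning protocol of Section 3.2.7.

```python
import numpy as np
from scipy.optimize import differential_evolution

def j_multi_scenario(theta):
    """Stand-in objective: the paper's J runs the scenario catalogue and
    aggregates weighted KPIs plus a gate-failure penalty. A smooth toy
    surface keeps this sketch self-contained."""
    kp, ki, kd = theta
    return (kp - 2.0) ** 2 + (ki - 0.5) ** 2 + (kd - 0.1) ** 2

bounds = [(0.0, 10.0), (0.0, 2.0), (0.0, 1.0)]   # illustrative PID bounds
result = differential_evolution(
    j_multi_scenario, bounds,
    strategy="best1bin",   # matches the strategy stated in Section 3.2.7
    seed=42,               # fixed seed -> deterministic tuning run
    maxiter=50, tol=1e-8, polish=True,
)
theta_star = result.x      # frozen thereafter, as in line 7 of Algorithm 1
```

Because the seed and strategy are fixed, re-running the tuner reproduces θ* bit for bit, which is what lets the tuned baselines be frozen as licensing evidence.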

3.2.8. Evidence Capture and Reproducibility

All runs emit versioned logs, checkpoints, and manifests. The following artifacts are produced deterministically at the paths shown.
The artifacts generated by the pipeline are summarized below in Table 17.

3.3. Evaluation Scenarios—Deterministic Disturbance Models, Execution Protocol, and Metrics

This subsection defines the deterministic scenario catalogue, the disturbance models applied to the plant–grid interface, the execution protocol used to evaluate each controller, and the metric suite collected per scenario.
We first define two helper operators used throughout: the clipping operator in (35) and the time-window indicator in (36).
clip(x; a, b) = min(max(x, a), b)
where clip(x; a, b) saturates a scalar x to the closed interval [a, b].
w_[a,b](t) = 1 if a ≤ t ≤ b; 0 otherwise
where w_[a,b](t) is a deterministic 0–1 window that activates a disturbance on the interval [a, b] (seconds).

3.3.1. Deterministic Scenario Catalogue and Disturbance Models

Letting P_ref denote the nominal electrical load (MW) and ΔL_ref a reference load-change magnitude (MW), the commanded load profile L(t) for each scenario is defined below.
Baseline steady operation holds a constant demand level, as in (37).
L_base(t) = P_ref
where L_base(t) is the baseline demand (MW) and P_ref is the nominal power setpoint (MW).
A gradual load increase is modeled as a linear ramp over [t₁, t₂], clipped outside the interval, as in (38).
L_grad(t) = P_ref + ΔL_ref · clip((t − t₁)/(t₂ − t₁); 0, 1)
where 0 ≤ (t − t₁)/(t₂ − t₁) ≤ 1 over the ramp, and ΔL_ref > 0 (MW).
A sudden load increase is represented by a deterministic step at time t_s in (39).
L_step(t) = P_ref + ΔL_ref · w_[t_s,T](t)
where T is the evaluation horizon (s) and w_[t_s,T](t) activates the step at t = t_s.
Sensor-noise injection is deterministic and bounded: fixed multi-tone sinusoids are superposed on measured channels, as in (40) and (41).
n_f(t) = A_f,1 sin(2π f₁ t) + A_f,2 sin(2π f₂ t + φ₂)
f_m(t) = f(t) + n_f(t),  P_m(t) = P(t) + n_P(t)
where f(t) is the true grid frequency (Hz), P(t) is the plant electrical power (MW), f_m(t) and P_m(t) are their measured counterparts, n_f(t) and n_P(t) are deterministic noise signals, and A_f,1, A_f,2, f₁, f₂, and φ₂ are fixed constants provided in Table 18.
Parameter-ramp disturbances alter selected plant parameters deterministically over a window, as in (42).
θ_i(t) = θ_i,0 + r_i (t − t₁) w_[t₁,t₂](t)
where θ_i,0 is the nominal value of parameter i, r_i is the ramp rate (units of θ_i per second), and w_[t₁,t₂](t) gates the change.
A cascading-fault profile is defined by sequential windows on the load channel, as in (43).
L_cf(t) = P_ref + Δ₁ w_[t_a,t_b](t) − Δ₂ w_[t_c,t_d](t)
where Δ₁, Δ₂ > 0 (MW) and [t_a, t_b], [t_c, t_d] are non-overlapping windows that realize compounding stresses.
The combined scenario aggregates the foregoing signals in (44).
L_comb(t) = L_base(t) + Σ_q ΔL_q(t)
where ΔL_q(t) denotes each active disturbance component (ramp, step, noise, parameter ramp, and fault) defined above.
Numerical values used in the deterministic disturbance models are fixed for all evaluations, as seen in Table 18.
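The deterministic load models (37)–(39) and (43) follow directly from the clip and window operators. In the sketch below, P_ref, ΔL_ref, and the window times are illustrative placeholders, not the Table 18 values.

```python
import numpy as np

def window(t, a, b):
    """Deterministic 0-1 window w_[a,b](t) of Eq. (36)."""
    return np.where((t >= a) & (t <= b), 1.0, 0.0)

def load_profiles(t, p_ref=600.0, dl_ref=60.0):
    """Sketch of the load models (37)-(39) and (43). The numeric
    constants here are illustrative, not the Table 18 values."""
    base = np.full_like(t, p_ref)                                    # (37)
    grad = p_ref + dl_ref * np.clip((t - 100.0) / (200.0 - 100.0),
                                    0.0, 1.0)                        # (38)
    step = p_ref + dl_ref * window(t, 300.0, t[-1])                  # (39)
    cascade = (p_ref + 40.0 * window(t, 100.0, 200.0)
                     - 30.0 * window(t, 250.0, 350.0))               # (43)
    return base, grad, step, cascade
```

Because every profile is a closed-form function of t, replaying a scenario reproduces the identical disturbance trace sample for sample, which is what makes single-run licensing evaluations meaningful.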

3.3.2. Deterministic Execution Protocol

All controllers are evaluated with a single deterministic pass per scenario. For each scenario s, the environment is reset, the disturbance model L_s(t) is applied, safety gates are enforced, and the metric set is computed from the resulting closed-loop trajectory. The guard conditions and logging are deterministic and reproducible.
Evaluation proceeds as follows (high-level outline):
(1) Select a scenario (s) from the fixed catalogue; (2) reset state; (3) simulate for T seconds at step Δt under the deterministic disturbance; (4) enforce safety gates (frequency trip f t r i p , thermal and speed limits) during rollout; (5) compute and store all metrics defined in Section 3.3.3; and (6) persist logs and traces.
Safety limits and constants referenced by the protocol are listed here for completeness, as seen in Table 19.

3.3.3. Metrics Collected per Scenario

Letting f(t) be the grid frequency (Hz), e_f(t) = f(t) − f_nom the frequency error, u_k the valve command at discrete step k (per-unit), and N = T/Δt the number of samples, we define the metrics below; each equation is followed by the symbol explanations and units.
The frequency error is defined in (45), and the integral metrics I S E f and I A E f follow in (46) and (47).
e_f(t) = f(t) − f_nom
where e_f(t) (Hz) is the frequency deviation from the nominal f_nom (Hz).
ISE_f = ∫₀^T e_f(t)² dt
where ISE_f has units Hz²·s and aggregates the squared frequency error over the horizon [0, T].
IAE_f = ∫₀^T |e_f(t)| dt
where IAE_f has units Hz·s and aggregates the absolute frequency error over [0, T].
The peak deviation and nadir are defined via extremum operators in (48).
Δf_max = max_{t∈[0,T]} |e_f(t)|,  f_nadir = min_{t∈[0,T]} f(t)
where Δf_max (Hz) is the largest-magnitude deviation and f_nadir (Hz) is the minimum frequency attained.
Rotor-speed overshoot and undershoot (percent) relative to nominal are defined in (49).
OS_ω = 100% · max_t (ω(t) − 1)₊,  US_ω = 100% · max_t (1 − ω(t))₊
where ω(t) is the rotor speed in per-unit, and (x)₊ = max(x, 0) denotes the positive-part operator.
Cumulative actuation effort and valve reversals are computed from the discrete control sequence in (50) and (51).
CE_sum = Σ_{k=1}^N |Δu_k|,  Δu_k = u_k − u_{k−1}
where CE_sum (dimensionless) accumulates the absolute valve movements (per-unit commands).
V_rev = Σ_{k=2}^{N−1} 1{Δu_k Δu_{k−1} < 0}
where V_rev (count) is the number of sign reversals in the valve-movement sequence, and 1{·} is the indicator function.
Spectral damping is assessed from the Welch PSD S_ff(ω) of e_f(t); the band-averaged magnitude is given in (52) and (53).
S_ff(ω) = Welch{e_f(t)}
where Welch{·} denotes a deterministic PSD estimate (fixed window/overlap).
E_damp = (1/(ω₂ − ω₁)) ∫_{ω₁}^{ω₂} S_ff(ω) dω
where [ω₁, ω₂] is the evaluation band (rad/s).
The total unsafe time, used later in licensing gates, is the sum of dwell times in any violating condition, as in (54).
T_unsafe = T(T_fuel > T_fuel,max) + T(ω > ω_max,limit) + T(f ∉ [f_min, f_max])
where T(·) counts the time (s) spent in the indicated set, ω_max,limit is the rotor-speed limit (per-unit), and [f_min, f_max] is the accepted frequency band (Hz).
The Grid Load-Following Index (GLFI) is a bounded tracking score (higher is better), defined in discrete form in (55).
GLFI = 1 − (1/N) Σ_{k=1}^N |P_{e,k} − L_k| / (ΔL_ref + ε_P)
where P_{e,k} is the electrical power (MW) at sample k, L_k is the commanded load (MW), ΔL_ref is the reference load-change magnitude (MW), and ε_P = 1.0 MW is a physical denominator floor.
A composite Transient Severity Score (TSS) is formed as a weighted sum of normalized components in (56) and (57).
ĈE = CE_sum / CE_abs,max,  V̂_rev = V_rev / V_rev,max
where CE_abs,max = r_max · T provides a conservative bound on cumulative movement (dimensionless) and V_rev,max = N − 2 bounds reversal counts (both constants are listed in Table 20).
TSS = w_f · (IAE_f / IAE_f,lim) + w_ce · ĈE + w_vr · V̂_rev + w_os · (OS_ω / OS_ω,max)
where all weights are dimensionless and satisfy w_f + w_ce + w_vr + w_os = 1; IAE_f,lim (Hz·s) and OS_ω,max (%) are scenario-family limits used for normalization.
The fixed constants used by the metric normalizations are listed here in Table 20.
Default metric weights used for the composite TSS are shown below; all are dimensionless, as seen in Table 21.
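The discrete effort, reversal, and tracking metrics of (50), (51), and (55) reduce to a few NumPy operations on the logged traces. The reference magnitude below is an illustrative placeholder for the scenario value.

```python
import numpy as np

def control_metrics(u, p_e, load, dl_ref=60.0, eps_p=1.0):
    """CE_sum, V_rev, and GLFI from discrete traces (Eqs. (50), (51), (55)).
    dl_ref is an illustrative reference load-change magnitude."""
    du = np.diff(u)                                  # Delta u_k = u_k - u_{k-1}
    ce_sum = float(np.sum(np.abs(du)))               # cumulative valve movement
    v_rev = int(np.sum(du[1:] * du[:-1] < 0))        # sign reversals of Delta u
    glfi = 1.0 - float(np.mean(np.abs(p_e - load))) / (dl_ref + eps_p)
    return ce_sum, v_rev, glfi
```

A perfectly tracking controller (P_e,k = L_k everywhere) scores GLFI = 1, and every direction change of the valve-movement sequence adds one to V_rev.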

3.4. Deterministic Assurance Orchestration and Evidence Pipeline

This subsection formalizes the Deterministic Assurance Framework used to qualify grid-interactive nuclear controllers. The framework binds scenario-level stress testing (Section 3.3), quantitative indicators, and evidence packaging into a single licensable pipeline. It is organized around four pillars—Trustworthiness and Reliability, Interpretability and Defense-in-Depth, Regulatory Readiness and Quality, and Continual Assurance—and produces auditable artifacts and objective pass/fail gates aligned to nuclear software and AI guidance (e.g., IAEA SSR-2/1 Criterion 5, NRC RG 1.168, NUREG-2261 Section 3.2, IEC 60880 [25] Category A). All evaluations are deterministic: scenarios S1–S8 are executed exactly as defined in Section 3.3, and the RL policy is evaluated with deterministic action selection. As shown in Figure 3, the Deterministic Assurance Framework (DTAF) operationalizes these pillars by mapping stress-test evidence to licensable, regulator-aligned checks.

3.4.1. Control-Effort Indices and Hard Bounds

We first state the sample count used by all discrete-time metrics, and then derive hard bounds for cumulative movement and reversal counts.
N = T / Δt
where N is the number of samples per evaluation episode (—), T is the evaluation horizon (s), and Δt is the controller step size (s).
CE_abs,max = r_max T
where CE_abs,max (—) is a conservative upper bound on cumulative valve movement, r_max is the rate limit (s−1), and T is the horizon (s). For r_max = 0.15 s−1 and T = 600 s, CE_abs,max = 90.
V_rev,max = N − 2
where V_rev,max (count) is the maximum possible number of sign reversals in a sequence of length N (two samples are needed before a reversal can occur). With N = 12,000 (Δt = 0.05 s, T = 600 s), V_rev,max = 11,998.
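The two hard bounds are simple arithmetic, and a short check reproduces the quoted values:

```python
def hard_bounds(t_horizon, dt, r_max):
    """CE_abs,max = r_max * T and V_rev,max = N - 2, with N = T / dt."""
    n = int(round(t_horizon / dt))                  # samples per episode
    return r_max * t_horizon, n - 2

# Constants quoted in the text: r_max = 0.15 1/s, T = 600 s, dt = 0.05 s
ce_max, vrev_max = hard_bounds(600.0, 0.05, 0.15)
```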
The weights used later in the composite robustness score (CRS) and the limits used by licensing gates are constants for all runs, as seen in Table 22.

3.4.2. Composite Robustness Score (CRS)

For each scenario s, a composite score C R S s is computed from the safety, transient severity, control effort, and tracking components. Safety contributes 1 when no unsafe dwell occurs and 0 otherwise; the other terms are normalized to [0, 1] (Section 3.3).
S_s = 1{T_unsafe^(s) = 0}
where S_s (—) is the per-scenario safety indicator, 1{·} is the indicator function, and T_unsafe^(s) is the total unsafe time in scenario s.
CRS_s = w_safe S_s + w_tss (1 − TSS_s / TSS_lim) + w_eff (1 − ĈE_s) + w_glfi GLFI_s
where ĈE_s = CE_sum^(s) / CE_abs,max (—) is the normalized control effort, GLFI_s (—) is the tracking index, and TSS_s (—) is the composite transient severity. Weights w_safe, w_tss, w_eff, and w_glfi are listed in Table 22.

3.4.3. Per-Scenario Licensing Gates

Per-scenario licensing gates combine deterministic constraints on safety, severity, tracking, and overshoot. Letting TTU ≡ T u n s a f e s , the logical gate is given in (63) and must evaluate to 1 for the scenario to qualify.
Gate_s = 1{T_unsafe^(s) = 0 ∧ TSS_s ≤ TSS_lim ∧ GLFI_s ≥ GLFI_min ∧ OS_ω^(s) ≤ OS_ω,max}
where ∧ denotes logical AND; TSS_lim, GLFI_min, and OS_ω,max are constants from Table 22, and all quantities are computed deterministically from the rollout.
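The per-scenario gate and composite score can be sketched together. The weights and limits below are illustrative stand-ins for the Table 22 constants.

```python
def scenario_gate_and_crs(t_unsafe, tss, glfi, os_w, ce_norm,
                          tss_lim=1.0, glfi_min=0.90, os_max=5.0,
                          w_safe=0.4, w_tss=0.2, w_eff=0.2, w_glfi=0.2):
    """Per-scenario licensing gate and composite CRS score.
    Weights/limits are illustrative stand-ins for Table 22."""
    safe = 1.0 if t_unsafe == 0.0 else 0.0          # safety indicator S_s
    gate = int(t_unsafe == 0.0 and tss <= tss_lim
               and glfi >= glfi_min and os_w <= os_max)
    crs = (w_safe * safe + w_tss * (1.0 - tss / tss_lim)
           + w_eff * (1.0 - ce_norm) + w_glfi * glfi)
    return gate, crs
```

Note that the gate is all-or-nothing while the CRS degrades gracefully, which is why the two are reported side by side: a scenario can pass the gate yet still be ranked by score.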

3.4.4. Policy Interpretability Constraint (Entropy Band)

To guard against both saturated and erratic policies, we bound the average action entropy H p r o x y ¯ within a fixed band (nats), as shown in (64). Entropy is computed deterministically from the deployed policy’s squashed Gaussian outputs with a fixed sampling grid.
H̄_proxy ∈ [H_min, H_max]
where H_min and H_max (nats) define the acceptable interpretability band.
The entropy band used by the interpretability constraint is fixed for all evaluations, seen in Table 23.

3.4.5. Portfolio Aggregation and Licensing Decision

Letting S be the number of scenarios in the catalogue (Section 3.3), we aggregate CRS across scenarios by the arithmetic mean and track the worst and best individual outcomes, as in (65). The final licensing decision is a deterministic logical gate (66) that combines the mean score, per-scenario gates, and the entropy band.
CRS̄ = (1/S) Σ_{s=1}^S CRS_s,  CRS_worst = min_s CRS_s,  CRS_best = max_s CRS_s
Gate_portfolio = 1{CRS̄ ≥ CRS_min ∧ (Gate_s = 1 for all s) ∧ H̄_proxy ∈ [H_min, H_max]}
where CRS_min is the minimum acceptable mean CRS (—), Gate_s is the per-scenario gate from (63), and H̄_proxy is constrained by (64).
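The portfolio aggregation (65) and licensing decision (66) are a few lines. CRS_min and the entropy band below are illustrative stand-ins for the Table 23 and Table 24 constants.

```python
def portfolio_decision(crs_list, gates, h_proxy,
                       crs_min=0.80, h_band=(0.1, 1.5)):
    """Portfolio aggregation and final licensing gate; crs_min and the
    entropy band are illustrative stand-ins for Tables 23-24."""
    crs_mean = sum(crs_list) / len(crs_list)         # arithmetic mean over S
    ok = (crs_mean >= crs_min
          and all(g == 1 for g in gates)             # every per-scenario gate
          and h_band[0] <= h_proxy <= h_band[1])     # entropy band
    return crs_mean, min(crs_list), max(crs_list), int(ok)
```

A single failed scenario gate vetoes the whole portfolio regardless of how high the mean CRS is, which encodes the hard-fail semantics of line 26 in Algorithm 1.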
The portfolio-level constants are shown below and apply uniformly to all controllers evaluated under this framework, seen in Table 24.

3.4.6. Evidence Artifacts and Traceability

All evidence is produced deterministically and stored under version control. Paths are stable and reproducible across runs. Table 25 lists the artifacts and their roles; Table 26 records the fixed constants referenced in this subsection.
The assurance pack is organized as a deterministic bundle with stable paths and roles, seen in Table 25.
For completeness, the fixed numeric constants used throughout Section 3.4 are summarized here, seen in Table 26.

4. Results and Discussion

4.1. Overview of the Evaluation and How to Read the Results

This section quantifies how three governors—a DE-tuned PID, DE-tuned FLC, and an entropy-regularized SAC—perform under the same plant model, limits, and eight fixed disturbance scenarios defined in Section 3.1, Section 3.2, Section 3.3 and Section 3.4. The objective is to assess, in a single deterministic pass, whether a candidate governor can (i) respect safety envelopes, (ii) provide grid-support quality commensurate with SMR needs, and (iii) do so with actuation economy.
The evaluation protocol is strictly deterministic: controller parameters, scenario waveforms, sampling, and limits are fixed ex ante; no stochastic seeding, Monte Carlo sampling, or inferential statistics are used. All KPIs (e.g., GLFI, TSS, CEI, and TTU) are computed from the same traces and windows for each governor, and all artifacts are version-controlled, as described in Section 3.4.6.
The figures are organized to move from global pattern → representative behavior → mechanism → qualification.

4.2. Coherence of the Metric Suite

Figure 4 summarizes the deterministic Pearson associations among the key KPIs computed over the fixed catalog (all scenarios × all controllers). The structure is physically plausible: the GLFI is strongly anti-correlated with TSS (r ≈ −0.90), and the damping-band energy proxy E d a m p is anti-correlated with GLFI (r ≈ −0.80). Conversely, the TSS is positively associated with the cumulative actuation burden C E s u m (r ≈ 0.70) and valve reversals Vrev (r ≈ 0.80). These descriptive (non-inferential) associations justify the combined use of the KPIs without redundancy and motivate the CRS construction in Section 3.4.
For reference, frequency-error and effort quantities used by the KPIs are defined in (67) and (68).
e_f(t) = f(t) − f_nom
where e_f(t) is the grid-frequency deviation (Hz) from nominal f_nom (Hz).
CE_sum = Σ_{k=1}^N |Δu_k|,  Δu_k = u_k − u_{k−1}
where CE_sum is the cumulative valve movement (dimensionless per-unit commands), u_k is the valve command at step k (—), and N = T/Δt samples the T-second episode with step Δt.

4.3. Global Scenario × Controller Landscape

Figure 5 consolidates outcomes by scenario (rows) and controller (columns). The SAC governor consistently occupies the best-performing cells, while classical baselines degrade under Cascading Faults and Combined events. Cells marked “FAIL” indicate deterministic violations of the per-scenario licensing gate (Methods Equation (63)). Failures of the PID/FLC in the hardest scenarios coincide with elevated TSS and CE_sum (cf. Figure 4), reinforcing the mechanistic linkage between effort, reversals, and instability.

4.4. Representative Transient: Nadir, Settling, and Safety Margin

Figure 6 examines a severe frequency excursion. The SAC governor limits the nadir above the trip line and returns to nominal without overshoot. The PID exhibits a deeper nadir and longer recovery; the FLC approaches the trip boundary. The recalled definitions (67) and (68) frame the discussion, together with the steady-state band condition below.
|e_f(t)| ≤ ε_steady-state
The magnitude of the steady-state tracking error e_f(t) must remain below the allowable bound ε_steady-state, where ε denotes the acceptable steady-state band (Hz). A shallower nadir (higher f_nadir), faster re-entry into the ε-band, and small overshoot OS_ω collectively indicate a safer transient (see Figure 6 below).

4.5. Policy Geometry and Actuation Economy

The controller’s geometry in state–action space (Figure 7) explains actuation-burden differences. The PID surface is essentially planar (linear in error and derivative), the FLC surface is piecewise with steep cliffs at rule boundaries, and the SAC surface is smooth and adaptive—steep only along directions needed to arrest drift. This geometry yields disciplined increments Δu_k and reduced reversals V_rev.
We measure economy by a dimensionless performance-to-effort ratio Π in (70).
Π = J_perf / E_u
where J_perf is a bounded composite performance index (e.g., per-scenario CRS) and E_u is cumulative actuation effort (e.g., CE_sum). A higher Π indicates better performance per unit actuation. As shown in Figure 8, the SAC sustains markedly higher Π across all scenarios, especially under Sensor Noise, Parameter Ramp, Cascading Fault, and Combined events.

4.6. Provenance of the Released SAC Policy

Figure 9 documents the curriculum trajectory for the deployed SAC model. P1 establishes stability; P2 emphasizes efficiency; P3–P4 harden resilience; and P5 is the fixed-catalog ‘final exam’. All curves are computed deterministically on the same evaluation harness; improvements and plateaus are audit-ready and replayable.

4.7. Stability Forensics: Phase Portrait and PSD

Figure 10 contrasts phase portraits for the SAC and PID. SAC trajectories contract monotonically toward the origin, whereas the PID exhibits spiral loops—consistent with lightly damped poles and slower energy dissipation. In the frequency domain (Figure 11), the PSD of Δf shows pronounced suppression near the dominant mode for the SAC; the PID and FLC retain higher modal energy.
For reference, the band-energy proxy used in Methods Section 3.3 is recalled in (71).
E_band = (1/(ω₂ − ω₁)) ∫_{ω₁}^{ω₂} S_Δf(ω) dω
where S_Δf(ω) is the PSD of Δf(t), and [ω₁, ω₂] isolates the dominant plant mode (rad/s). A lower E_band implies better modal suppression.

4.8. Ancillary-Service Readiness and Licensability

Figure 12 aggregates the deterministic pass/fail gates—response speed, control efficiency, precision/stability—evaluated against explicit thresholds (Section 3.4: GLFI_min = 0.90, TSS_lim = 1.00, and OS_ω,max = 5%). Only the SAC governor clears all categories, demonstrating readiness for grid-service qualification under identical physics, scenarios, and limits. Because the pipeline is deterministic and version-controlled, every bar in Figure 12 is replayable and auditable.

4.9. Multi-Attribute Performance Profiling

This quantitative difference gives rise to distinct controller “personalities,” synthesized in the multi-attribute profile in Figure 13. The SAC agent’s large, well-balanced polygon demonstrates its holistic superiority, excelling in key areas like Robustness, Foresight, and Control Efficiency. In contrast, the classical controllers exhibit skewed, deficient profiles, visually representing their strategic limitations.

4.10. Controller Robustness and RL Policy Transparency

We evaluated three controllers—the Soft Actor–Critic (SAC), PID_DE, and FLC_DE—across eight scenarios at two power levels (100% and 80%), using three random seeds (42, 43, and 44). All controllers completed all scenarios at both power levels. The median frequency nadirs (Hz), aggregated by controller and level, were as follows: SAC 100% = 58.956, SAC 80% = 58.973; PID_DE 100% = 58.949, PID_DE 80% = 58.901; and FLC_DE 100% = 58.949, FLC_DE 80% = 58.901.
The CRS separates the controllers despite uniform pass/fail outcomes. The SAC exhibits a high mean CRS (≈0.95 at 100% and ≈0.78 at 80%), indicating robust performance with small dispersion. In contrast, the PID_DE and FLC_DE remain an order of magnitude lower at 100% and reach ≲0.25–0.26 at 80%, confirming the RL policy’s superior cross-scenario quality under equal constraints.
The surrogate isolates state channels that most strongly influence the SAC action near safety margins. The prominent negative weight on reactor_power_mw and positive weight on T_fuel are consistent with a policy that rapidly unloads when thermal inertia rises, while tracking grid_frequency_hz corrections. This behavior is absent or muted in the baseline controllers, explaining their lower CRS.
The rapid collapse of action variance after the initial transient signals a highly calibrated policy: the SAC explores only when needed, then locks into low-variance control. This is consistent with the high CRS and stable nadirs reported above.
Together, Figure 14, Figure 15, Figure 16 and Figure 17 substantiate the claim of RL superiority: (i) a higher and more stable CRS; (ii) transparent, mechanistically plausible sensitivities via a local surrogate; (iii) disciplined reduction of action variance after fast transients; and (iv) measured, risk-aware actuation in critical contexts. All artifacts were generated with fixed seeds, reproducible scripts, and 600 dpi export with tight layout to preclude label clipping or overlap.

4.11. Threats to Validity and Limitations

External validity: The current results use an infinite-bus abstraction; extension to networked grids and hardware-in-the-loop is planned. Controller set: Only the SAC, PID, and FLC are benchmarked; the harness is controller-agnostic and can admit additional baselines without changing the protocol. Determinism: By design, we report descriptive (non-inferential) evidence; uncertainty quantification and probabilistic stressors are out of scope here and will be addressed in future work.

4.12. Positioning of the Present Results Relative to Recent RL in Energy Systems

To contextualize the above findings, we relate them to two recent IEEE TII studies: a hierarchical RL framework for regional multi-energy markets [26] and a hybrid policy-based RL approach for island-group energy management under transmission constraints [27]. A multi-dimensional comparison with these works is given in Table 27 (“Concise positioning of this study relative to the two suggested works”). Although all three works employ reinforcement learning, they target different problem classes and evaluation criteria.

4.12.1. Regional Energy Market with Hierarchical RL (Zhang et al. [26])

Zhang et al. address market-clearing in a regional multi-energy system, proposing a hierarchical RL design to improve price-matching/clearing efficiency over large state–action spaces. Their evidence is provided via numerical market case studies with economic/market KPIs (e.g., matching efficiency and operator income effects). In contrast, the present study concerns a safety-critical plant-control loop (PWR governor for load-following) and reports licensing-aligned outcomes: deterministic pass/fail gates tied to plant limits; performance/severity indices (TTU, GLFI, TSS, overshoot, and control effort); and traceability artifacts (time-aligned local sensitivity and critical state–action analysis). Within this setting, our SAC governor satisfies the gates and outperforms a DE-tuned PID/FLC under identical physics and constraints, demonstrating robustness under adversarial, fixed-replay transients.

4.12.2. Island-Group Energy Management with Hybrid Policy RL (Yang et al. [27])

Yang et al. formulate system-level energy management for an island group with transmission constraints, introducing a hybrid policy-based RL capable of handling mixed discrete–continuous actions. Their evaluation emphasizes operational/economic KPIs (e.g., energy-balance satisfaction, and cost/efficiency) over system simulations. By contrast, our evaluation addresses component-level nuclear control under licensing expectations, where success is defined by gate compliance and robustness indices during adversarial transients, again complemented by auditable decision traceability.

4.12.3. Comparability Considerations

Because the objectives, domains, and KPIs in [26,27] (market/dispatch efficiency and system-operation economics) are not commensurate with the licensing-grade, safety-gate evaluation used here, a direct numerical head-to-head comparison is methodologically inappropriate. Instead, these strands are complementary: system-level RL advances (e.g., hierarchical or hybrid policies) inform algorithmic design, while our results provide a deterministic, regulator-facing evidence protocol for safety-critical plant control. To facilitate like-for-like future comparisons, we release the fixed scenario portfolio, gates, and scripts so alternative policies (including hierarchical/hybrid designs) can be evaluated under identical, licensing-aligned conditions.

4.12.4. Implication for the Present Results

Against strong, regulator-familiar baselines (DE-tuned PID/FLC), the SAC governor meets explicit safety gates and maintains favorable robustness/effort trade-offs under adversarial replay, while producing traceable decision evidence. This shifts the evaluation of RL-based control from performance-only reporting toward licensing-grade assurance, filling a gap not addressed by market or system-management studies such as [26,27].

4.12.5. Recent RL-in-Nuclear Studies

Finally, relative to recent RL-in-nuclear studies that primarily report nominal-scenario performance without licensing-grade artifacts [3,4,5,6], the present results contribute a regulator-aligned evidence set: deterministic safety gates, adversarial fixed-replay scenarios, controller-agnostic CRS synthesis, and auditable traceability (local sensitivity and critical state–action contexts). This complements feasibility-focused nuclear RL by supplying the licensing-oriented evaluation scaffolding.

4.13. Summary

Across Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12—and when the full portfolio is re-run from a reduced operating point at 0.8 P_ref (≈480 MW)—the conclusion is consistent: under identical physics, limits, and scenarios, the SAC governor achieves shallower frequency nadirs (smaller excursions), faster settling, stronger modal damping, and a lower actuation burden than the DE-tuned PID and FLC baselines. The geometry of the learned policy (Figure 7) together with the observed action economy (Figure 8) explains the global pattern in the scenario × controller landscape (Figure 5) and the representative transient (Figure 6). Stability forensics—phase portraits and spectral energy (Figure 10 and Figure 11)—are coherent with these behaviors. The deterministic qualification scorecard (Figure 12) consolidates these outcomes into an all-gates pass at both operating points under the same evaluation harness and thresholds, yielding replayable, audit-ready evidence; additionally, the traceability bundle (time-aligned local sensitivity, critical state–action mapping, and action-entropy envelopes) confirms decision regularity near gate boundaries.

5. Conclusions

This work advances a regulator-oriented path to licensable AI control by introducing the Deterministic Assurance Framework (DTAF) and demonstrating it on a high-fidelity pressurized-water-reactor (PWR) digital model. The DTAF converts controller behavior into audit-ready evidence by combining (i) fixed, pass/fail licensing gates tied to formal limits; (ii) a portfolio of adversarial stress scenarios; and (iii) an embedded traceability and explainability package, all executed under a single-run deterministic protocol.
Within this framework, three governor architectures—an entropy-regularized Soft Actor–Critic (SAC) agent, a Differential-Evolution-optimized Proportional–Integral–Derivative Controller, and a Differential-Evolution-optimized Fuzzy-Logic Controller—were evaluated on identical plant physics, limits, and fixed scenario waveforms. The gate suite and thresholds (for example, the minimum Grid Load-Following Index, upper bound on Transient Severity Score, and maximum rotor-speed overshoot) were established ex ante to reflect safety and grid-support expectations.
The evaluation protocol is strictly reproducible and non-inferential: parameters, scenarios, and sampling are fixed; claims flow from trace-level behavior to mechanism to portfolio gate outcome, and all licensing conclusions are drawn from deterministic single-run evidence. The only scoped exception is a clearly labeled auxiliary robustness check using three predetermined seeds, reported as non-licensing context.
Across the adversarial portfolio, the SAC governor satisfies the predeclared gates whenever the safe envelope is achievable, while the strong conventional baselines fail specific gates under high-severity disturbances; when all methods remain within bounds, the SAC provides higher grid-support quality with lower actuation burden and fewer control reversals, consistent with the metric associations observed over the full catalog. These outcomes arise under the same deterministic plant model and gating, and each claim is linked to transparent artifacts: scenario definitions, time-domain traces, mechanism diagnostics, and the final scorecard.

6. Future Work

Higher-fidelity physics and operating regimes. This would involve re-introducing conservative reactivity-feedback coefficients and extending thermal–hydraulic fidelity (e.g., secondary-side dynamics and multi-loop interactions) to test controller behavior under tighter physical coupling and broader operating points (including startup, low-load hot standby, and rapid dispatch ramps). This would deepen model realism while preserving DTAF determinism.
From software-in-the-loop to hardware-in-the-loop. This would involve migrating the fixed portfolio into a hardware-in-the-loop testbed to exercise I/O timing, actuator saturations, and sensor paths under the same gating logic, while maintaining identical scenarios and pass/fail criteria to keep the evidence comparable across SIL→HIL progression.
Expanded adversarial portfolio and parameter sweeps. This would involve systematically enlarging the stress catalog with worst-case composite events (e.g., ramp-with-noise, valve-rate-limited steps, and turbine lag excursions) and adversarial parametric sweeps over plant lags, gains, and limits. This would use the same deterministic harness to produce envelope-wide evidence and reveal policy brittleness modes before any probabilistic analyses are contemplated.
Runtime assurance and safety shields inside the DTAF. This would involve integrating deterministic safety layers (command governors, barrier-function filters, and invariant-set guards) as first-class DTAF components. They would be evaluated with the same gates to quantify how much margin they restore during off-nominal events, and ensure their actions are recorded in the traceability log.
Broader benchmarks under identical gating. This would involve adding model-predictive and robust control baselines (e.g., constrained MPC and H∞) implemented with identical plant models, limits, and scenarios to study trade-offs between explicit constraint handling and policy expressiveness—without changing the scorecard or evidence standard. This would isolate method effects from test-bench effects inside the DTAF.
Grid-service readiness and portfolio decisions. This would involve evolving the final scorecard toward service-qualification bundles (e.g., frequency containment and load-following with reserves) by composing existing gates into regulator-relevant portfolio decisions and documenting the traceability path from scenario to decision artifact.

7. Assumptions, Limitations, and Translational Roadmap

7.1. Study Assumptions

Deterministic evaluation: This assumes a fixed scenario × controller catalog, fixed physics, and fixed limits. Controller parametrizations, tuning procedures, and reward weights are frozen and versioned. Reproducibility artifacts (commits, configs, logs, and figure scripts) are bundled.
Plant/grid abstractions: A PWR governor-centric surrogate is used; the grid is modeled as an infinite bus with finite-band frequency disturbances defined per scenario.
Interfaces/devices: Sensors are ideal signals plus a fixed noise trace in the sensor-noise scenario. Actuation is a valve command with saturation and rate limits at the discrete period Δt.
Protocol/metrics: Pass/fail gates (trip avoidance, steady-state error, and ramp compliance) are deterministic thresholds. KPIs (GLFI, TSS, CE_sum, V_rev, and E_damp) are computed over fixed windows/filters. Learning curves are reported for provenance only.
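The deterministic pass/fail logic can be sketched as a pure threshold function over KPIs. In the sketch below, TTU = 0 is the hard gate named in the paper, while the numeric TSS and GLFI thresholds are illustrative placeholders, not the study's declared values.

```python
# Hedged sketch of a deterministic licensing-gate check. TTU = 0 is the
# hard gate named in the text; TSS_MAX and GLFI_MIN are assumed
# placeholder thresholds standing in for the paper's declared bounds.
TSS_MAX = 1.0    # assumed upper bound on Transient Severity Score
GLFI_MIN = 0.8   # assumed minimum Grid Load-Following Index

def gate_check(ttu: float, tss: float, glfi: float) -> dict:
    """Evaluate the three headline gates on one deterministic run."""
    gates = {
        "TTU_zero": ttu == 0.0,        # Total Time Unsafe must be exactly zero
        "TSS_bounded": tss <= TSS_MAX, # bounded transient severity
        "GLFI_min": glfi >= GLFI_MIN,  # minimum grid-support quality
    }
    gates["PASS"] = all(gates.values())
    return gates

result = gate_check(ttu=0.0, tss=0.42, glfi=0.93)
```

Because the check is a pure function of the run's KPIs, the same verdict is reproduced on every replay of a fixed scenario, which is what makes the evidence audit-ready.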

7.2. Limitations

Model and data: The infinite-bus abstraction omits low-inertia inter-area modes and grid-code logic; the thermal–hydraulic surrogate omits multi-node detail (e.g., DNBR margins). Aging/drift and cyber-physical faults are not time-evolving.
Controllers/training: Baselines are limited to the DE-tuned PID and FLC; MPC/H∞/LQR are not yet included. Evaluation is deterministic by design (no parameter randomization or Monte Carlo spreads). Reward terms target governor behavior and rely on external limits/safety proxies for full-plant protection.
Assurance/deployment: Human-in-the-loop procedures, runtime assurance (RTA/CBFs), and tool/quality audits are not yet implemented end-to-end.

7.3. Translational Roadmap

Objective: The objective is to turn the deterministic DTAF into an industry-deployable, regulator-ready program that remains controller-agnostic and plant-agnostic. All upgrades below preserve scenario determinism; uncertainty is represented via cataloged envelopes rather than probabilistic spreads.

7.3.1. System Realism and Test Environments

Networked-grid dynamics: This moves from the infinite-bus abstraction to reduced-order multi-area networks with finite inertia, governor/turbine models, and grid-code checks, and replays real PMU disturbances. Evidence: scenario replays, inter-area mode damping KPIs, and GLFI under code constraints.
High-fidelity plant twin: This couples the governor loop to multi-node thermal–hydraulic models with fuel/DNBR margins, validated against utility traces. Evidence: limit-envelope compliance and thermal margins per scenario.
Hardware-in-the-loop (HIL): This ports controllers to PLC/RTOS platforms with measured I/O latency, actuator nonlinearities, and watchdogs, maintaining the same deterministic gates. Evidence: closed-loop HIL logs, timing budgets, and zero watchdog trips.

7.3.2. Safety Assurance and Verification

Runtime assurance (RTA): This implements a Simplex-style supervisor with control-barrier functions and verified fallback envelopes (PID/FLC). Evidence: monitor verdicts, dwell times, and intervention logs integrated into the safety case.
Formal verification: This computes reachability-based invariants and signal temporal logic (STL) properties on linearized/hybridized models; the results feed test-oracle generation for scenario-catalog expansion. Evidence: certified safe sets and counterexample-guided scenarios.
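The Simplex-style supervision described above can be sketched as a small switching rule. The `SimplexSupervisor` name and the dwell-time handling below are illustrative assumptions, not the study's implementation; the safety monitor itself (e.g., a barrier-function check) is abstracted into a boolean input.

```python
class SimplexSupervisor:
    """Minimal Simplex-style runtime-assurance switch (illustrative sketch).

    If the safety monitor flags the current state/action as unsafe, the
    verified fallback controller (e.g., PID) takes over for `dwell_steps`
    steps before the advanced policy (e.g., SAC) is re-admitted.
    """

    def __init__(self, dwell_steps: int = 10):
        self.dwell_steps = dwell_steps
        self.cooldown = 0  # remaining steps on the fallback controller

    def select(self, safe: bool, u_advanced: float, u_fallback: float) -> float:
        if not safe:
            self.cooldown = self.dwell_steps  # (re)arm the dwell window
        if self.cooldown > 0:
            self.cooldown -= 1
            return u_fallback
        return u_advanced
```

Logging each `select` call (monitor verdict, active controller, dwell counter) yields exactly the intervention-log evidence named above.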

7.3.3. Controller Breadth and Robustness

Broaden baselines: This adds MPC (with constraints), H∞, and LQR with anti-windup under the same protocol to strengthen comparative claims. Evidence: controller-agnostic scorecards and gate outcomes.
Deterministic uncertainty envelopes: This adds structured parameter sweeps (plant constants, delays, and biases) as catalog variants, reporting envelopes (min/median/max) rather than probabilistic spreads. Evidence: worst-case KPI/gate tables.
Fault tolerance: This includes sensor dropout, actuator stiction, and stuck-valve events with recovery gates and trip-avoidance proofs. Evidence: fault-recovery timing and residual limits.
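Envelope reporting reduces each deterministic sweep to min/median/max summaries rather than probabilistic spreads. A minimal sketch, assuming a hypothetical `kpi_for` stand-in for one catalog run and illustrative variants of the turbine lag τm around its nominal 3.0 s:

```python
def envelope(values):
    """Deterministic min/median/max summary of one parameter sweep."""
    s = sorted(values)
    n = len(s)
    mid = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return {"min": s[0], "median": mid, "max": s[-1]}

# Illustrative sweep over assumed turbine-lag variants (nominal 3.0 s).
tau_variants = [2.4, 2.7, 3.0, 3.3, 3.6]

def kpi_for(tau):
    # Hypothetical stand-in: a real catalog run would replay the fixed
    # scenario with this plant constant and compute the KPI from the trace.
    return 1.0 / tau

kpi_envelope = envelope([kpi_for(t) for t in tau_variants])
```

The resulting tables report worst-case values directly, so gate compliance over the envelope can be verified by inspecting the `min`/`max` entries alone.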

7.3.4. Human Factors, Cybersecurity, and Compliance

Operator acceptance: This adds HSI artifacts (policy-intent visualization and alarm rationalization) and operator-in-the-loop drills, quantifying workload/trust metrics. Evidence: scenario completion with human-override windows.
Cybersecurity drills: This exercises spoofing/tampering/denial scenarios in a segmented testbed with detection/mitigation hooks tied to RTA logs. Evidence: attack detect/mitigate latencies and zero unsafe time under defended scenarios.
Quality and audits: This maps artifacts to safety-case structures, conducting independent V&V and configuration-management audits aligned with nuclear software practice. Evidence: audit checklists and tool/config qualification records.

7.3.5. Action Matrix: From Limitation to Industrial Remedy

Table 28 maps each limitation identified above to its industrial remedy and the corresponding target evidence.

7.4. Research Agenda

Hybrid formal methods: These involve verified CBF synthesis on reduced models and contract-based composition across plant and grid subsystems.
Catalog design: This involves coverage metrics for scenario sets and counterexample-guided expansion from failed certificates.
Interpretable policy analysis: This involves entropy bands plus input–output fragility, local Lipschitz estimates, and action-space curvature as explainability signals tied to gates.
Controller-agnostic benchmarking: This involves publishing an open Assurance Benchmark Suite with fixed physics, gates, and evidence schemas to enable independent replication and regulator pre-review.

7.5. Cross-Industry Parallels

Highly regulated sectors mature AI control via deterministic test harnesses, explicit gates, and auditable artifacts: aviation (software/tool qualification), automotive (ISO 26262) [28], and medical (software lifecycle and safety cases). The DTAF follows the same pattern—fixed models and limits, standard gates, and traceable artifacts—enabling independent replay and audit without seeding or Monte Carlo dependence. This aligns with nuclear licensing culture and scales to SMR-era plant–grid integration.

Author Contributions

A.A.I.: Conceptualization, Methodology, Software, Writing. H.-K.L.: Supervision, Resources, Review and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2025 Research Fund of the KEPCO International Nuclear Graduate School (KINGS), Republic of Korea.

Data Availability Statement

All simulation code, trained agent weights, and raw results are available upon request under the KEPCO International Nuclear Graduate School (KINGS) license.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. U.S. Nuclear Regulatory Commission. Artificial Intelligence Strategic Plan: Fiscal Years 2023–2027, NUREG-2261. 2023. Available online: https://www.nrc.gov/docs/ML2313/ML23132A305.pdf (accessed on 9 November 2025).
  2. Siserman-Gray, C.; Barr, J.; Burniske, J.; Eftekhari, P.E.; Marek, R.; Means, A. Regulatory Challenges Related to the Use of Artificial Intelligence for IAEA Safeguards Verification. In Proceedings of the 64th Annual Meeting of the Institute of Nuclear Materials Management (INMM), Orlando, FL, USA, 21–25 July 2023. [Google Scholar]
  3. Gong, Y.; Chen, Y.; Zhang, J.; Li, X. Possibilities of Reinforcement Learning for Nuclear Power Plants: Evidence on Current Applications and Beyond. Nucl. Eng. Technol. 2024, 56, 1959–1974. [Google Scholar] [CrossRef]
  4. Nguyen, K.H.N.; Rivas, A.; Delipei, G.K.; Hou, J. Reinforcement Learning-Based Control Sequence Optimization for Advanced Reactors. J. Nucl. Eng. 2024, 5, 209–225. [Google Scholar] [CrossRef]
  5. Tunkle, L.; Abdulraheem, K.; Lin, L.; Radaideh, M.I. Nuclear Microreactor Control with Deep Reinforcement Learning. arXiv 2025, arXiv:2504.00156. [Google Scholar] [CrossRef]
  6. Kruthika, U.; Paneerselvam, S. Novel multiagent reinforcement learning framework using twin delayed deep deterministic policy gradient for adaptive PID control in boiler turbine systems. Sci. Rep. 2025, 15, 34558. [Google Scholar] [CrossRef] [PubMed]
  7. Sun, Y.; Khairy, S.; Vilim, R.B.; Hu, R.; Dave, A.J. A Safe Reinforcement Learning Algorithm for Supervisory Control of Power Plants. Knowl.-Based Syst. 2024, 301, 112312. [Google Scholar] [CrossRef]
  8. Feng, S.; Sun, H.; Yan, X.; Zhu, H.; Zou, Z.; Shen, S.; Liu, H.X. Dense reinforcement learning for safety validation of autonomous vehicles. Nature 2023, 615, 620–627. [Google Scholar] [CrossRef] [PubMed]
  9. Milani, S.; Topin, N.; Veloso, M.; Fang, F. Explainable Reinforcement Learning: A Survey and Comparative Review. ACM Comput. Surv. 2023, 56, 140. [Google Scholar] [CrossRef]
  10. Najar, M.; Wang, X. Explainable AI Models for Enhancing Operator Reliability During Reactor Design-Based Accidents Using Radionuclide Data. Nucl. Technol. 2025; early access. [Google Scholar] [CrossRef]
  11. Lim, S.T.; Kim, K.M.; Kang, J.-Y.; Kim, T.; Jerng, D.-W.; Ahn, H.S. A Digital Twin Framework for Generation-IV Reactors with Reinforcement Learning-Enabled Health-Aware Supervisory Control. Prog. Nucl. Energy 2025, in press. [Google Scholar] [CrossRef]
  12. International Atomic Energy Agency. Safety of Nuclear Power Plants: Design, IAEA Safety Standards Series No. SSR-2/1 (Rev. 1); IAEA: Vienna, Austria, 2016; Available online: https://www.iaea.org/publications/10885/safety-of-nuclear-power-plants-design (accessed on 9 November 2025).
  13. International Atomic Energy Agency. Design of Instrumentation and Control Systems for Nuclear Power Plants, IAEA Safety Standards Series No. SSG-39; IAEA: Vienna, Austria, 2016; Available online: https://www-pub.iaea.org/MTCD/Publications/PDF/Pub1694_web.pdf (accessed on 9 November 2025).
  14. Refaat, R.M.; Fahmy, R.A. Optimized Fractional-Order PID Controller Based on Nonlinear Point Kinetic Model for VVER-1000 Reactor. Kerntechnik 2022, 87, 104–114. [Google Scholar] [CrossRef]
  15. Hasan, R.; Masud, M.S.; Haque, N.; Abdussami, M.R. Frequency Control of Nuclear-Renewable Hybrid Energy Systems Using Optimal PID and FOPID Controllers. Heliyon 2022, 8, e11770. [Google Scholar] [CrossRef] [PubMed]
  16. Parada Iturria, F.F.; Martindale, N.A.; Reasor, A.L.; Stewart, S.L.; Ukishima, L.A. AI for Nuclear Safeguards Verification; ORNL/SPR-2024/01; Oak Ridge National Laboratory: Oak Ridge, TN, USA, 2024. Available online: https://www.ornl.gov/publication/ai-nuclear-safeguards-verification (accessed on 9 November 2025).
  17. Storn, R.; Price, K. Differential Evolution—A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. J. Glob. Optim. 1997, 11, 341–359. [Google Scholar] [CrossRef]
  18. Price, K.V.; Storn, R.M.; Lampinen, J.A. Differential Evolution: A Practical Approach to Global Optimization; Springer: Berlin/Heidelberg, Germany, 2005. [Google Scholar]
  19. Das, S.; Suganthan, P.N. Differential Evolution: A Survey of the State-of-the-Art. IEEE Trans. Evol. Comput. 2011, 15, 4–31. [Google Scholar] [CrossRef]
  20. Qin, A.K.; Huang, V.L.; Suganthan, P.N. Differential Evolution Algorithm with Strategy Adaptation for Global Numerical Optimization. IEEE Trans. Evol. Comput. 2009, 13, 398–417. [Google Scholar] [CrossRef]
  21. Piotrowski, A.P.; Napiorkowski, J.J.; Piotrowska, A.E. Particle Swarm Optimization or Differential Evolution—A Comparison. Eng. Appl. Artif. Intell. 2023, 121, 106008. [Google Scholar] [CrossRef]
  22. Biswal, A.; Dwivedi, P.; Bose, S. DE-Optimized IPIDF Controller for Management Frequency in a Networked Power System with SMES and HVDC Link. Front. Energy Res. 2022, 10, 1102898. [Google Scholar] [CrossRef]
  23. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc. IEEE 2016, 104, 148–175. [Google Scholar] [CrossRef]
  24. Makarova, A.; Bardenet, R.; Percival, L. Risk-Averse Heteroscedastic Bayesian Optimization. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 6–14 December 2021; pp. 1–13. [Google Scholar]
  25. IEC 60880; Nuclear Power Plants—Instrumentation and Control Systems Important to Safety—Software Aspects for Computer-Based Systems Performing Category A Functions. International Electrotechnical Commission: Geneva, Switzerland, 2019.
  26. Zhang, N.; Yan, J.; Hu, C.; Sun, Q.; Yang, L.; Gao, D.W.; Guerrero, J.M.; Li, Y. Price-Matching-Based Regional Energy Market with Hierarchical Reinforcement Learning Algorithm. IEEE Trans. Ind. Inform. 2024, 20, 11103–11114. [Google Scholar] [CrossRef]
  27. Yang, L.; Li, X.; Sun, M.; Sun, C. Hybrid Policy-Based Reinforcement Learning of Adaptive Energy Management for the Energy Transmission-Constrained Island Group. IEEE Trans. Ind. Inform. 2023, 19, 10751–10762. [Google Scholar] [CrossRef]
  28. ISO 26262 (All Parts); Road Vehicles—Functional Safety. International Organization for Standardization: Geneva, Switzerland, 2018.
Figure 1. System architecture.
Figure 2. System block diagram.
Figure 3. Digital Twin Assurance Framework (DTAF): Deterministic orchestration that binds stress scenarios, quantitative gates, and auditable artifacts into a regulator-ready licensing pipeline. The four pillars—Trustworthiness and Reliability, Interpretability and Defense-in-Depth, Regulatory Readiness and Quality, and Continual Assurance—anchor the evidence channels and pass/fail logic.
Figure 4. Deterministic Pearson association matrix across the fixed scenario × controller catalog. KPI codes: GLFI (Grid Load-Following Index), TSS (Transient Severity Score), CE_sum (cumulative actuation effort), V_rev (valve reversals), and E_damp (modal damping energy proxy).
Figure 5. Scenario × controller performance heatmap. Higher is better. “FAIL” indicates a violation of deterministic gates.
Figure 6. Frequency response under a hard event; the dotted line denotes the trip limit. The SAC maintains a higher nadir and converges faster.
Figure 7. Policy manifolds in a common (e, de/dt) projection: (a) PID (planar), (b) FLC (piecewise), and (c) SAC (smooth, adaptive).
Figure 8. Performance-to-effort ratio Π across scenarios. The SAC maintains the highest economy across the entire catalog.
Figure 9. Training provenance for the released SAC policy over curriculum phases P1–P5, evaluated on the fixed catalog.
Figure 10. Phase portrait (Δω vs. d(Δω)/dt) for the SAC and PID. The SAC contracts monotonically; the PID follows spiral trajectories indicative of under-damping.
Figure 11. Power spectral density of Δf. The SAC actively suppresses energy near the dominant mode compared with the PID and FLC.
Figure 12. Ancillary-services qualification scorecard (deterministic). The vertical dashed line denotes the qualification threshold.
Figure 13. Multi-attribute performance profile, with archetypal radar synthesis.
Figure 14. Cross-scenario Controller Robustness Scores (CRSs, mean ±95% CI) for the SAC, FLC_DE, and PID_DE at two power levels (100% blue; 80% orange). The SAC maintains a near-unity CRS at both levels, while baseline controllers remain below 0.3 on average at 100% power and improve modestly at 80%. Error bars reflect variability across seeds and scenarios (n = 24 per bar).
Figure 15. Local linear surrogate of the SAC policy around a critical operating context. Positive loadings (right) increase the SAC action; negative loadings (left) decrease it. The surrogate highlights the dominant channels: a large negative weight on reactor_power_mw and strong positive weight on T_fuel, with secondary contributions from grid_frequency_hz and speed_rpm. This mechanistic view explains the SAC’s stabilizing reactions without resorting to opaque end-to-end reasoning.
Figure 16. SAC action-entropy proxy over time during a representative disturbance. Exploration collapses within ~0.6 s, after a brief transient peak (≈0.40), indicating confident, low-variance actuation once the operating point re-enters the admissible band.
Figure 17. Critical contexts (T2): pairwise action comparison at time points with elevated system risk. Bars report normalized actuation magnitudes for the SAC, PID, and FLC at multiple timestamps. The SAC consistently applies the least aggressive input compatible with risk reduction, especially near t ≈ 4.0–4.06 s, while the PID/FLC remain saturated. This selective restraint aligns with the entropy trace and explains the SAC’s superior CRS.
Table 1. Neutron kinetics constants.
Symbol | Description | Value | Units
β | Total delayed neutron fraction | 0.006502 | -
Λ | Prompt neutron generation time | 1.0 × 10^−4 | s
λ1 | Precursor decay constant (group 1) | 0.0124 | s^−1
λ2 | Precursor decay constant (group 2) | 0.0305 | s^−1
λ3 | Precursor decay constant (group 3) | 0.111 | s^−1
λ4 | Precursor decay constant (group 4) | 0.301 | s^−1
λ5 | Precursor decay constant (group 5) | 1.14 | s^−1
λ6 | Precursor decay constant (group 6) | 3.01 | s^−1
Table 2. Delayed neutron fractions per group (six-group; sum to β).
Symbol | Description | Value | Units
β1 | Group-1 delayed neutron fraction | 0.000215 | -
β2 | Group-2 delayed neutron fraction | 0.001424 | -
β3 | Group-3 delayed neutron fraction | 0.001274 | -
β4 | Group-4 delayed neutron fraction | 0.002568 | -
β5 | Group-5 delayed neutron fraction | 0.000748 | -
β6 | Group-6 delayed neutron fraction | 0.000273 | -
Note: β1 + β2 + β3 + β4 + β5 + β6 = 0.006502 = β.
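The constants in Tables 1 and 2 parameterize the standard six-group point-kinetics model; for reference, the standard textbook form (stated here for orientation, not quoted from the study's code) is

```latex
\frac{dn}{dt} = \frac{\rho(t) - \beta}{\Lambda}\, n(t) + \sum_{i=1}^{6} \lambda_i C_i(t),
\qquad
\frac{dC_i}{dt} = \frac{\beta_i}{\Lambda}\, n(t) - \lambda_i C_i(t), \quad i = 1, \dots, 6,
```

with reactivity ρ(t), total delayed fraction β = Σ βi = 0.006502, generation time Λ = 1.0 × 10^−4 s, and the group constants λi, βi taken from Tables 1 and 2.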
Table 3. Thermal–hydraulic and power-mapping constants.
Symbol | Description | Value | Units
κP | Power scaling (n → MW_th) | 1000.0 | MW
Cf | Effective fuel thermal capacity | 30.0 | MJ/°C
Cc | Effective coolant thermal capacity | 50.0 | MJ/°C
Ufc | Fuel-coolant conductance | 2.0 | MW/°C
Ucs | Coolant-secondary conductance | 20.0 | MW/°C
Ts0 | Secondary-side sink temperature (fixed) | 270.0 | °C
Table 4. Turbine-governor and grid constants.
Symbol | Description | Value | Units
τv | Valve servo time constant | 0.30 | s
τm | Steam-path/turbine lag | 3.0 | s
Kt | Turbine gain (v → P_m) | 900.0 | MW/-
ηg | Generator efficiency | 0.98 | -
f_nom | Nominal grid frequency | 60.0 | Hz
P_ref | Nominal electrical power reference | 600.0 | MW
Table 5. Actuator limits and numerical step.
Symbol | Description | Value | Units
u_min | Valve lower bound | 0.0 | -
u_max | Valve upper bound | 1.0 | -
r_max | Valve rate limit | 0.15 | s^−1
Δt | Numerical integration step | 0.05 | s
Table 6. PID parameters.
Symbol | Description | Value | Units
Kp | Proportional gain | 1.800 | -
Ki | Integral gain | 0.300 | s^−1
Kd | Derivative gain | 0.050 | s
τd | Derivative filter time constant | 0.200 | s
u_min | Valve lower bound | 0.0 | -
u_max | Valve upper bound | 1.0 | -
r_max | Valve rate limit | 0.15 | s^−1
Δt | Loop period (numerical step) | 0.05 | s
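A minimal discrete-time realization consistent with Tables 5 and 6 might look as follows. The discretization scheme and the ordering of the rate limit before saturation are assumptions for illustration; the study's governor code is not reproduced here.

```python
# Sketch of a discrete-time PID with first-order derivative filter, rate
# limiting, and output saturation, using the Table 5/6 values. The
# discretization and limit ordering are assumptions, not the study's code.
KP, KI, KD, TAU_D = 1.800, 0.300, 0.050, 0.200   # Table 6 gains
U_MIN, U_MAX, R_MAX, DT = 0.0, 1.0, 0.15, 0.05   # Table 5 limits and step

class PIDGovernor:
    def __init__(self):
        self.integ = 0.0    # integral state
        self.d_filt = 0.0   # filtered derivative state
        self.e_prev = 0.0
        self.u_prev = 0.0

    def step(self, e: float) -> float:
        """Advance one loop period with error e; return valve command."""
        self.integ += KI * e * DT
        # first-order low-pass on the raw derivative (time constant τd)
        d_raw = (e - self.e_prev) / DT
        alpha = DT / (TAU_D + DT)
        self.d_filt += alpha * (d_raw - self.d_filt)
        u = KP * e + self.integ + KD * self.d_filt
        # rate limit relative to the previous command, then saturate
        u = max(self.u_prev - R_MAX * DT, min(self.u_prev + R_MAX * DT, u))
        u = max(U_MIN, min(U_MAX, u))
        self.e_prev, self.u_prev = e, u
        return u
```

With these limits, the valve can move at most r_max · Δt = 0.0075 per step, which is why cumulative effort (CE_sum) and reversal counts (V_rev) are meaningful wear proxies.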
Table 7. FLC rule base (rows: Δe; columns: e).
Δe \ e | NB | NS | ZE | PS | PB
NB | PB | PB | PS | ZE | ZE
NS | PB | PS | ZE | NS | ZE
ZE | PS | ZE | ZE | ZE | NS
PS | ZE | NS | ZE | NS | NB
PB | ZE | ZE | NS | NB | NB
Table 8. FLC scaling parameters.
Symbol | Description | Value | Units
s_e | Error scaling | 1.50 | -
s_Δe | Error-rate scaling | 0.80 | -
s_u | Output scaling | 0.35 | -
Table 9. Normalized triangular MF breakpoints for antecedents (centers at −1, −0.5, 0, 0.5, and 1; ~50% overlap; clamped to [−1, 1]).
Label | a | b | c
NB | −1.00 | −1.00 | −0.50
NS | −1.00 | −0.50 | 0.00
ZE | −0.50 | 0.00 | 0.50
PS | 0.00 | 0.50 | 1.00
PB | 0.50 | 1.00 | 1.00
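Tables 7–9 together define one complete fuzzy inference step. The sketch below assumes min-activation and weighted-average defuzzification over singleton output centers placed at the antecedent centers; the study's defuzzifier may differ, so this is an illustration of the rule base rather than the implemented controller.

```python
# Illustrative Mamdani-style evaluation of the Table 7 rule base with the
# Table 9 triangular MFs and Table 8 scalings. The singleton output
# centers and weighted-average defuzzifier are assumptions.
MF = {  # (a, b, c) breakpoints from Table 9
    "NB": (-1.0, -1.0, -0.5), "NS": (-1.0, -0.5, 0.0), "ZE": (-0.5, 0.0, 0.5),
    "PS": (0.0, 0.5, 1.0), "PB": (0.5, 1.0, 1.0),
}
CENTER = {"NB": -1.0, "NS": -0.5, "ZE": 0.0, "PS": 0.5, "PB": 1.0}
RULES = {  # rows: Δe label; columns: e label (Table 7)
    "NB": {"NB": "PB", "NS": "PB", "ZE": "PS", "PS": "ZE", "PB": "ZE"},
    "NS": {"NB": "PB", "NS": "PS", "ZE": "ZE", "PS": "NS", "PB": "ZE"},
    "ZE": {"NB": "PS", "NS": "ZE", "ZE": "ZE", "PS": "ZE", "PB": "NS"},
    "PS": {"NB": "ZE", "NS": "NS", "ZE": "ZE", "PS": "NS", "PB": "NB"},
    "PB": {"NB": "ZE", "NS": "ZE", "ZE": "NS", "PS": "NB", "PB": "NB"},
}
SE, SDE, SU = 1.50, 0.80, 0.35  # Table 8 scalings

def tri(x, a, b, c):
    """Triangular MF; handles the degenerate shoulders (a == b or b == c)."""
    if x < a or x > c:
        return 0.0
    if x < b:
        return (x - a) / (b - a) if b > a else 1.0
    if x > b:
        return (c - x) / (c - b) if c > b else 1.0
    return 1.0

def flc(e, de):
    """One inference step: scale, clamp, fire all 25 rules, defuzzify."""
    xe = max(-1.0, min(1.0, SE * e))
    xde = max(-1.0, min(1.0, SDE * de))
    num = den = 0.0
    for lde, row in RULES.items():
        for le, lout in row.items():
            w = min(tri(xde, *MF[lde]), tri(xe, *MF[le]))  # min activation
            num += w * CENTER[lout]
            den += w
    return SU * (num / den if den else 0.0)
```

At the origin only the ZE/ZE rule fires, giving zero output; for a large negative error with negative error rate, the NB-row rules drive the output to the positive saturation s_u = 0.35, matching the corrective sense of the rule table.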
Table 10. Normalization scales and safety thresholds (deterministic).
Symbol | Quantity | Value | Units | Notes
S_P | Reactor power (scale) | 1000.0 | MW | Normalization divisor for P
S_T | Fuel temperature (scale) | 1000.0 | °C | Normalization divisor for T_fuel
S_f | Grid frequency (scale) | 1.0 | Hz | Normalization divisor for f
S_ω | Rotor speed (scale) | 1.0 | pu | Normalization divisor for ω
f_trip | Under-frequency trip gate | 49.00 | Hz | Hard safety gate
Δf_calm | Calm band (frequency) | 0.02 | Hz | Calm multiplier applies if ≤ value
ΔP_calm | Calm band (power) | 2.0 | MW | Calm multiplier applies if ≤ value
Table 11. Reward weights and bonuses (dimensionless).
Symbol | Description | Value | Units | Notes
w_f | Frequency error penalty | 1.00 | - | Primary stability focus
w_move | Valve movement penalty | 0.010 | - | Economy and wear proxy
w_jerk | Valve jerk penalty | 0.020 | - | Penalizes reversals
w_bonus | Safe completion bonus | 5.00 | - | Applied once if no gates breached
w_unsafe | Unsafe penalty | 10.00 | - | Applied on any gate violation
c_calm | Calm-state multiplier | 0.50 | - | If |Δf| ≤ 0.02 Hz and |ΔP| ≤ 2 MW
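The Table 11 weights imply a per-step shaping reward of roughly the following form. The use of absolute errors and the placement of the calm multiplier are assumptions; per Table 11, the one-time safe-completion bonus w_bonus is applied at episode end rather than per step, so it appears only as a constant here.

```python
# Hedged sketch of the per-step shaping reward implied by Table 11.
# Absolute-error penalties and the calm-multiplier placement are
# assumptions; the study's exact functional form is not reproduced.
W_F, W_MOVE, W_JERK = 1.00, 0.010, 0.020
W_UNSAFE, C_CALM = 10.00, 0.50
W_BONUS = 5.00  # one-time bonus at episode end if no gates breached (not per step)

def step_reward(df_hz, dp_mw, du, du_prev, gate_violation):
    """df_hz: frequency error; dp_mw: power error; du: valve increment
    this step; du_prev: previous increment (jerk = du - du_prev)."""
    r = -(W_F * abs(df_hz) + W_MOVE * abs(du) + W_JERK * abs(du - du_prev))
    if abs(df_hz) <= 0.02 and abs(dp_mw) <= 2.0:
        r *= C_CALM          # calm-state multiplier (Table 10 bands)
    if gate_violation:
        r -= W_UNSAFE        # applied on any gate violation
    return r
```

The calm multiplier halves the (already small) penalties near the setpoint, discouraging needless valve motion once the operating point is inside the admissible band.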
Table 12. Deterministic curriculum phases and promotion conditions.
Symbol | Phase | Scenario Bundle | Promotion Threshold | Reward Overrides
P1 | Stability and limits | Baseline; ramp-in-place | N_unsafe = 0; non-decreasing r_avg | ↑ w_f; enable calm multiplier
P2 | Efficiency | Baseline; gradual load | N_unsafe = 0; non-decreasing r_avg | ↑ w_move
P3 | Disturbances I | Sensor-noise; parameter ramp (deterministic) | N_unsafe = 0; non-decreasing r_avg | ↑ w_jerk
P4 | Disturbances II | Cascading-fault | N_unsafe = 0; non-decreasing r_avg | Keep P2/P3 overrides
P5 | Final exam | Combined | N_unsafe = 0; non-decreasing r_avg | Freeze weights; evaluate only
Where "↑" denotes an increase in the corresponding reward weight.
Table 13. SAC hyperparameters (values used).
Symbol | Name | Value | Units | Notes
α | Entropy temperature | auto-tuned | - | Target entropy heuristic
γ | Discount factor | 0.99 | - | Stable defaults
τ | Polyak rate | 0.005 | - | Target critic averaging
η_Q | Critic learning rate | 3 × 10^−4 | - | Adam
η_π | Actor learning rate | 3 × 10^−4 | - | Adam
η_α | Temperature learning rate | 3 × 10^−4 | - | If α learnable
B | Batch size | 1024 | samples | Minibatch size
|D| | Replay capacity | 1,200,000 | transitions | FIFO
N0 | Learning starts | 120,000 | steps | Warm-up
T | Total timesteps | 15,000,000 | steps | Training budget
f_eval | Evaluation frequency | 80,000 | steps | Deterministic evals
policy | Policy/widths | MlpPolicy / [512, 512] | - | Hidden units
Table 14. Replay/batch schedule and evaluation settings.
Symbol | Quantity | Value | Units | Notes
t_step | Environment step time | Δt | s | Loop period from plant interface
N_update | Updates per env step | 1 | - | Once learning starts
N_target | Target update cadence | 1 | - | Per gradient step
eval_det | Deterministic evaluation | enabled | - | No exploration noise
Table 15. Metric weights used in J(θ).
| Symbol | Metric | Group | Weight | Notes |
|---|---|---|---|---|
| w_TSS | Transient Severity Score (TSS) | M ↓ | 1.00 | Primary stability objective |
| w_CE | Cumulative actuation effort (CE_sum) | M ↓ | 0.50 | Economy objective |
| w_Vrev | Valve reversals (V_rev) | M ↓ | 0.50 | Mechanical-wear proxy |
| w_GLFI | Grid Load-Following Index (GLFI) | M ↑ | 1.00 | Tracking quality |
| λ | Failure penalty multiplier | - | 100 | Scaled by N_fail(θ) |
Note: “↓” denotes a metric to be minimized (lower values are better); “↑” denotes a metric to be maximized (higher values are better).
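One plausible aggregation of these weights into the tuning objective J(θ) is sketched below: minimized metrics add, the maximized GLFI subtracts, and each failed scenario incurs the λ penalty. The exact signs and normalization are assumptions for illustration; only the weight and λ values come from Table 15.

```python
# Hedged sketch of the DE tuning objective J(theta) from the Table 15
# weights (lower is better). Metric inputs are assumed pre-normalized.
W_TSS, W_CE, W_VREV, W_GLFI, LAM = 1.00, 0.50, 0.50, 1.00, 100.0

def J(tss, ce_norm, vrev_norm, glfi, n_fail):
    """Minimized metrics (TSS, CE, V_rev) add; GLFI (maximized) subtracts;
    lambda * N_fail penalizes any scenario failure."""
    return (W_TSS * tss + W_CE * ce_norm + W_VREV * vrev_norm
            - W_GLFI * glfi + LAM * n_fail)
```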
Table 16. DE bounds and algorithm parameters.
| Symbol | Parameter | Bounds/Value | Units | Notes |
|---|---|---|---|---|
| K_p | PID proportional gain | [0.5, 3.0] | - | Search bound |
| K_i | PID integral gain | [0.05, 0.8] | s⁻¹ | Search bound |
| K_d | PID derivative gain | [0.00, 0.15] | s | Search bound |
| τ_d | Derivative time constant | [0.05, 0.50] | s | Search bound |
| s_e | FLC error scale | [0.5, 2.5] | - | Search bound |
| s_Δe | FLC error-rate scale | [0.3, 1.5] | - | Search bound |
| s_u | FLC output scale | [0.1, 0.8] | - | Search bound |
| F | DE mutation scale | [0.5, 1.0] | - | Differential weight |
| C_r | DE crossover probability | 0.7 | - | Crossover rate |
| G_max | Max iterations | 50 (PID)/30 (FLC) | generations | Stopping criterion |
| P | Population size | 15 | candidates | Per generation |
| tol | Convergence tolerance | 1 × 10⁻² | - | Early-stop threshold |
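The DE settings in Table 16 map onto the classic rand/1/bin scheme, sketched below with a toy sphere objective standing in for the closed-loop cost J(θ). The population size, crossover probability, and F-range come from the table; the specific variant (rand/1/bin), the greedy selection rule, and the fixed seed are assumptions.

```python
import random

# Minimal rand/1/bin differential evolution with the Table 16 settings
# (P = 15, Cr = 0.7, F drawn in [0.5, 1.0]). In the paper's workflow the
# objective would be the deterministic closed-loop cost J(theta).
def de_minimize(f, bounds, pop_size=15, cr=0.7, gens=50, seed=0):
    rng = random.Random(seed)                       # fixed seed: deterministic run
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [f(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            F = rng.uniform(0.5, 1.0)               # differential weight
            jrand = rng.randrange(dim)              # force one mutated gene
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == jrand:
                    v = pop[a][d] + F * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    v = min(max(v, lo), hi)         # clip to search bounds
                else:
                    v = pop[i][d]
                trial.append(v)
            fc = f(trial)
            if fc <= cost[i]:                       # greedy selection
                pop[i], cost[i] = trial, fc
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]
```

For the PID search, `bounds` would be the four rows [K_p, K_i, K_d, τ_d] of Table 16 and `gens` would be 50; for the FLC search, the three scale rows and 30 generations.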
Table 17. Evidence artifacts emitted by the pipeline.
| Symbol | Artifact | Path/Identifier | Frequency | Mechanism |
|---|---|---|---|---|
| A1 | Raw evaluation logs | results/logs/*.csv | Per eval | Auto-export |
| A2 | Training checkpoints | results/checkpoints/*.zip | Per save step | SB3 saver |
| A3 | Best model | results/best/*.zip | On improvement | Eval callback |
| A4 | Final model | results/final/*.zip | End of training | Export final |
| A5 | Reports/figures | results/reports/* | On demand | Figure scripts |
| A6 | Config manifests | results/config/*.yml | Per run | Hash-locked |
Note: “*” denotes a wildcard matching multiple files in the specified directory (e.g., results/logs/run1.csv).
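The "hash-locked" mechanism behind artifact A6 can be illustrated with a standard content fingerprint: the run configuration is serialized in a canonical order and digested with SHA-256, so any silent change to a constant is detectable at audit time. The serialization scheme and function name here are illustrative assumptions.

```python
import hashlib

# Sketch of hash-locking a config manifest (artifact A6). Sorting the keys
# makes the digest independent of dictionary insertion order, so identical
# configurations always yield identical fingerprints.
def manifest_hash(config: dict) -> str:
    canonical = "\n".join(f"{k}: {config[k]}" for k in sorted(config))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```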
Table 19. Safety gates and constants (deterministic).
| Symbol | Description | Value | Units |
|---|---|---|---|
| f_trip | Under-frequency trip threshold | 49.00 | Hz |
| ω_max,limit | Max rotor speed (per-unit) | 1.10 | pu |
| T_fuel,max | Fuel temperature limit | 1500 | °C |
Table 22. CRS weights and licensing thresholds (dimensionless unless noted).
| Symbol | Description | Value | Units/Notes |
|---|---|---|---|
| w_safe | Safety contribution in CRS | 0.40 | - |
| w_tss | Transient-severity contribution in CRS | 0.30 | - |
| w_eff | Control-effort contribution in CRS | 0.15 | - |
| w_glfi | Tracking contribution in CRS | 0.15 | - |
| GLFI_min | Minimum acceptable GLFI | 0.90 | - |
| TSS_lim | Upper bound on TSS | 1.00 | - |
| OS_ω,max | Max rotor-speed overshoot | 5 | % |
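A minimal sketch of how the Table 22 constants combine: the CRS is read here as a weighted sum of per-group sub-scores, each assumed pre-normalized to [0, 1], alongside the hard licensing gates. The weighted-sum form and sub-score normalization are assumptions; the weights and thresholds are from the table.

```python
# Table 22 weights and licensing thresholds.
W_SAFE, W_TSS, W_EFF, W_GLFI = 0.40, 0.30, 0.15, 0.15
GLFI_MIN, TSS_LIM, OS_MAX_PCT = 0.90, 1.00, 5.0

def crs(s_safe, s_tss, s_eff, s_glfi):
    """Composite score from sub-scores assumed in [0, 1] (form illustrative)."""
    return W_SAFE * s_safe + W_TSS * s_tss + W_EFF * s_eff + W_GLFI * s_glfi

def passes_gates(glfi, tss, os_pct):
    """Deterministic pass/fail against the Table 22 licensing thresholds."""
    return glfi >= GLFI_MIN and tss <= TSS_LIM and os_pct <= OS_MAX_PCT
```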
Table 23. Entropy band constants.
| Symbol | Description | Value | Units |
|---|---|---|---|
| H_min | Lower entropy bound | 0.10 | nats |
| H_max | Upper entropy bound | 2.00 | nats |
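The entropy band can be related to the closed-form differential entropy of a Gaussian policy head, H = ½ ln(2πeσ²). This assumes an unsquashed univariate Gaussian, which is a simplification of the tanh-squashed head SAC actually uses, so the sketch is illustrative only.

```python
import math

H_MIN, H_MAX = 0.10, 2.00  # nats (Table 23)

def gaussian_entropy(sigma: float) -> float:
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2).
    Illustrative proxy; SAC's squashed-Gaussian entropy differs."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma * sigma)

def in_band(h: float) -> bool:
    return H_MIN <= h <= H_MAX
```

For example, a unit-variance head sits near 1.42 nats (inside the band), while a very narrow σ = 0.1 head drops below H_min, signaling a policy that has collapsed toward determinism during training.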
Table 24. Portfolio constants (dimensionless unless noted).
| Symbol | Description | Value | Units |
|---|---|---|---|
| S | Number of scenarios | 8 | - |
| CRS_min | Minimum acceptable mean CRS | 0.90 | - |
Table 25. Evidence artifacts (deterministic assurance pack).
| ID | Path/Naming | Role |
|---|---|---|
| A1 | results/logs/*.csv | Raw time-series traces per scenario |
| A2 | results/metrics/*.csv | Per-scenario metric tables (GLFI, TSS, CE_sum, V_rev, E_damp, TTU) |
| A3 | results/checkpoints/best/*.zip | Best controller snapshot (by mean CRS) |
| A4 | results/checkpoints/final/*.zip | Final controller snapshot (end of training) |
| A5 | results/reports/*.md | Auto-generated Markdown reports and summaries |
| A6 | results/config/*.yml | Versioned configuration and constants manifest |
Table 26. Fixed constants referenced in Section 3.4.
| Symbol | Description | Value | Units/Derivation |
|---|---|---|---|
| T | Evaluation horizon | 600 | s (scenario constant) |
| Δt | Controller/eval step | 0.05 | s (scenario constant) |
| N | Samples per episode | 12,000 | - (T/Δt) |
| r_max | Valve rate limit | 0.15 | s⁻¹ (from Section 3.2) |
| CE_abs,max | Max cumulative movement | 90 | - (r_max·T) |
| V_rev,max | Max valve reversals | 11,998 | count (N − 2) |
| GLFI_min | Minimum acceptable GLFI | 0.90 | - (Table 22) |
| TSS_lim | Upper bound on TSS | 1.00 | - (Table 22) |
| OS_ω,max | Max rotor-speed overshoot | 5 | % (Table 22) |
| H_min, H_max | Entropy band | 0.10, 2.00 | nats (Table 23) |
| CRS_min | Minimum acceptable mean CRS | 0.90 | - (Table 24) |
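The derived rows of Table 26 follow arithmetically from the primitive constants, and that derivation can be reproduced mechanically, which is exactly the kind of cross-check the assurance pack is meant to make auditable:

```python
# Primitive constants from Table 26.
T, DT, R_MAX = 600.0, 0.05, 0.15

# Derived constants (round() guards against binary floating-point in T/DT).
N = round(T / DT)            # samples per episode: T / delta-t
CE_ABS_MAX = R_MAX * T       # max cumulative valve movement: r_max * T
V_REV_MAX = N - 2            # max possible valve reversals: N - 2
```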
Table 18. Scenario parameters (deterministic values).
| Symbol | Description | Value | Units |
|---|---|---|---|
| T | Evaluation horizon | 600 | s |
| Δt | Controller/eval step | 0.05 | s |
| f_nom | Nominal grid frequency | 60.0 | Hz |
| ΔL_ref | Reference load-change magnitude | 0.04·P_ref | MW |
| t_1, t_2 | Gradual ramp window | 120, 300 | s |
| t_s | Sudden step time | 60 | s |
| t_a, t_b | Fault window #1 | 120, 210 | s |
| t_c, t_d | Fault window #2 | 360, 420 | s |
| A_f,1, A_f,2 | Frequency-noise amplitudes | 0.003, 0.002 | Hz |
| f_1, f_2 | Noise tones (frequency) | 0.6, 1.2 | Hz |
| Φ_2 | Noise phase | 1.0472 | rad (≈60°) |
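One plausible reading of the deterministic two-tone frequency noise defined by the last three rows of Table 18 is a fixed superposition of sinusoids, so every replay of the scenario is bit-identical. The superposition form itself is an assumption; the amplitudes, tones, and phase are from the table.

```python
import math

A1, A2 = 0.003, 0.002      # Hz, frequency-noise amplitudes (Table 18)
F1, F2 = 0.6, 1.2          # Hz, noise tones
PHI2 = 1.0472              # rad (~60 deg), phase of the second tone

def freq_noise(t: float) -> float:
    """Deterministic two-tone perturbation of grid frequency at time t (s).
    Bounded by A1 + A2 = 0.005 Hz, well inside the 49.00 Hz trip margin."""
    return (A1 * math.sin(2.0 * math.pi * F1 * t)
            + A2 * math.sin(2.0 * math.pi * F2 * t + PHI2))
```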
Table 20. Metric normalization constants (fixed).
| Symbol | Description | Value | Units | Derivation |
|---|---|---|---|---|
| r_max | Valve rate limit | 0.15 | s⁻¹ | From Section 3.2 PID/limits |
| T | Evaluation horizon | 600 | s | Scenario constant |
| N | Samples per episode | 12,000 | - | T/Δt with Δt = 0.05 s |
| CE_abs,max | Max cumulative movement | 90 | - | r_max·T |
| V_rev,max | Max valve reversals | 11,998 | count | N − 2 |
| ε_P | GLFI denominator floor | 1.0 | MW | Physical floor |
| ω_1, ω_2 | PSD band | 3.14, 12.57 | rad/s | 0.5–2.0 Hz |
Table 21. TSS weights and limits (dimensionless).
SymbolDescriptionValueUnits
w f Frequency-error weight0.50-
wceControl-effort weight0.20-
w v r Valve-reversal weight0.20-
w o s Overshoot weight0.10-
I A E f , l i m IAE_f normalization60Hz·s
O S ω , m a x Overshoot limit5%
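Combining Tables 20 and 21, the TSS can be sketched as a weighted sum of normalized components. The weighted-sum form follows the weights in Table 21; clipping each ratio to [0, 1] so that TSS itself stays in [0, 1] is an assumption for illustration.

```python
# Table 21 weights and limits plus Table 20 normalizers.
W_F, W_CE, W_VR, W_OS = 0.50, 0.20, 0.20, 0.10
IAE_F_LIM = 60.0                         # Hz*s
OS_MAX = 5.0                             # percent
CE_ABS_MAX, V_REV_MAX = 90.0, 11998.0    # normalization constants

def tss(iae_f, ce_sum, v_rev, os_pct):
    """Transient Severity Score: weighted sum of normalized components,
    each clipped to [0, 1] (clipping assumed) so TSS lies in [0, 1]."""
    clip01 = lambda x: min(max(x, 0.0), 1.0)
    return (W_F * clip01(iae_f / IAE_F_LIM)
            + W_CE * clip01(ce_sum / CE_ABS_MAX)
            + W_VR * clip01(v_rev / V_REV_MAX)
            + W_OS * clip01(os_pct / OS_MAX))
```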
Table 27. Concise positioning of this study relative to the two suggested works.
| Study | System/Task | Primary KPIs | Safety/Licensing Gates | Deterministic Replay | Traceability Artifacts | Relevance to Present Results |
|---|---|---|---|---|---|---|
| Zhang et al. (2024) [26] | Regional multi-energy market clearing with hierarchical RL | Market matching efficiency; economic outcomes | Not reported | Case-study simulations | Not reported | Algorithmic/architectural RL advance for markets; different objective class |
| Yang et al. (2023) [27] | Island-group energy management under transmission constraints with hybrid-policy RL | Operational cost; energy-balance KPIs | Not reported | Simulation studies | Not reported | System-level management focus; different KPIs and constraints |
| This study | PWR governor control (load-following) with SAC vs. DE-tuned PID/FLC | Gate pass rate; TTU, GLFI, TSS; overshoot; control effort; CRS | Yes: explicit, licensing-aligned | Yes: adversarial, fixed-replay portfolio | Yes: critical state–action mapping, critical pairs, sensitivity | Licensing-grade assurance for safety-critical plant control |
Table 28. Matrixed mapping towards industrial adaptation.
| Limitation (Current) | Deterministic Upgrade | New Evidence Artifact | Gate/KPI Addition | Target Environment |
|---|---|---|---|---|
| Infinite-bus grid | Multi-area RO models + PMU replay | Mode-damping logs, ROCOF checks | Inter-area damping KPI | Real-time sim/HIL |
| TH surrogate | Multi-node TH + DNBR | Margin envelopes per scenario | Thermal-limit gates | High-fidelity twin |
| No HIL | PLC/RTOS with latency and watchdog | Timing budgets, watchdog logs | Timing-budget gate | HIL bench |
| Limited baselines | Add MPC/H∞/LQR | Controller-agnostic scorecards | Portfolio gates unchanged | Twin/HIL |
| No RTA/CBF | Simplex + control-barrier functions | Intervention/dwell logs | RTA-intervention gate | Twin/HIL |
| No formal proofs | Reachability/STL | Certificates + counterexamples | Certificate-presence gate | Twin |
| No uncertainty envelopes | Structured sweeps | Worst-case KPI tables | Envelope-completeness gate | Twin/HIL |
| No fault drills | Dropout/stiction/stuck-valve | Recovery timelines | Fault-recovery gate | Twin/HIL |
| No HSI drills | Operator-in-the-loop runs | Workload/trust metrics | Human-override gate | HIL |
| No cyber drills | Spoof/tamper/DoS tests | Detect/mitigate traces | Cyber-resilience gate | Segmented testbed |