Article

A Deterministic Assurance Framework for Licensable Explainable AI Grid-Interactive Nuclear Control

by
Ahmed Abdelrahman Ibrahim
* and
Hak-Kyu Lim
Department of Nuclear Engineering, KEPCO International Nuclear Graduate School (KINGS), Ulsan 45014, Republic of Korea
*
Author to whom correspondence should be addressed.
Energies 2025, 18(23), 6268; https://doi.org/10.3390/en18236268
Submission received: 5 October 2025 / Revised: 2 November 2025 / Accepted: 10 November 2025 / Published: 28 November 2025

Abstract

Deploying deep reinforcement learning (DRL) in safety-critical nuclear control is limited less by raw performance than by the absence of licensable, audit-ready evidence. We introduce a Deterministic Assurance Framework (DTAF) that converts controller behavior into licensing-grade proof by combining the following: (i) deterministic licensing gates tied to formal safety and performance limits (e.g., Total Time Unsafe (TTU) = 0; bounded Transient Severity Score (TSS); and minimum Grid Load-Following Index (GLFI)); (ii) a portfolio of adversarial stress tests representative of off-nominal operation; and (iii) a traceability and explainability package that renders every evaluated action auditable. The DTAF is demonstrated on a high-fidelity pressurized-water-reactor (PWR) simulation model used as a software-in-the-loop testbed. Three governor architectures are evaluated under identical, fixed scenarios: a curriculum-trained Soft Actor–Critic (SAC) agent, and Differential-Evolution-optimized Proportional–Integral–Derivative (PID-DE) and Fuzzy-Logic (FLC-DE) Controllers. Performance is assessed deterministically via gate-aligned metrics—TTU, TSS, GLFI, cumulative control effort (CE_sum), valve-reversal count (V_rev), and speed overshoot (OS_ω). Across the adversarial portfolio, the SAC controller meets the predeclared licensing gates in single-run evaluations, whereas the strong conventional baselines violate gates in specific high-severity cases; where all methods remain within the safe envelope, the SAC delivers a higher GLFI and lower CE_sum, with fewer reversals and reduced overshoot. All licensing conclusions derive from deterministic single-run tests; a small, fixed-seed check (three seeds with descriptive intervals) is reported separately as non-licensing supplementary analysis. By producing transparent, reproducible artifacts, the DTAF offers a regulator-oriented pathway for qualifying DRL controllers in grid-interactive nuclear operations.

1. Introduction

Deep reinforcement learning (DRL) has great potential to improve nuclear reactor control performance, especially for load-following in Pressurized Water Reactors (PWRs). However, a major gap separates this potential from practical deployment: current DRL controllers cannot be licensed under today’s nuclear safety regulations. The U.S. Nuclear Regulatory Commission (NRC) and others have identified verification and validation (V&V), trustworthiness, and assurance as the primary hurdles to adopting AI-based control in safety-critical nuclear systems [1,2]. In essence, black-box control policies that lack deterministic, auditable guarantees are disqualified from licensing. PWR grid load-following demands extremely high reliability and predictability—yet DRL agents, despite their technical prowess, face a regulatory stalemate without a suitable testing and assurance protocol. The missing piece is an auditable deterministic test protocol that can demonstrate an AI controller’s safety to licensing authorities beyond any statistical doubt.
To address this barrier, we propose a Deterministic Assurance Framework (DTAF) tailored for licensing-grade evaluation of AI controllers. The DTAF introduces the following: (i) explicit deterministic licensing gates tied to hard safety limits (for example, preventing frequency trips and bounding speed overshoot OS_ω within allowable margins); (ii) an adversarial stress-test portfolio comprising worst-case grid disturbances and transients designed to probe the controller’s safety envelope; and (iii) a comprehensive traceability and explainability package (integrating eXplainable Reinforcement Learning, XRL) that produces human-interpretable evidence of the controller’s decisions. Under the DTAF, an AI governor must provably keep all key performance indicators within pre-defined deterministic limits during these stress tests. The framework thereby transforms black-box DRL behavior into a licensable pipeline of quantitative gates, stress scenarios, and explainable artifacts that regulators can audit.
We demonstrate the DTAF using a high-fidelity PWR digital model (DM) as a realistic simulation testbed. Three governor controllers are evaluated under identical conditions: a Soft Actor–Critic (SAC) DRL agent, a PID controller optimized by Differential Evolution (PID-DE), and a Fuzzy-Logic Controller optimized by Differential Evolution (FLC-DE). The evaluation metrics include the Total Time Unsafe (TTU), Transient Severity Score (TSS), Grid Load-Following Index (GLFI), cumulative control effort (CE_sum), valve reversal count (V_rev), and speed overshoot (OS_ω). Each metric is mapped to a specific licensing gate with a fixed threshold. For example, the TTU must remain zero for all protected variables, the TSS must not exceed defined severity limits, and the GLFI must stay above minimum grid-following requirements. By design, if any single gate is violated in a test scenario, the candidate controller is disqualified—a stringent criterion aligned with nuclear safety standards.
Our research addresses three core questions under this deterministic protocol: (Q1) Can the SAC agent meet all deterministic licensing gates across an array of adversarial grid disturbance scenarios? (Q2) How do the SAC’s safety, robustness, and performance compare to the PID-DE and FLC-DE baselines under identical stress conditions? (Q3) Can the integrated XRL traceability package explain the SAC’s control actions during severe transients in a way that supports regulatory auditability? To ensure absolute clarity, we declare upfront that all evaluations are fully deterministic. Every stress-test scenario is run in a non-stochastic manner with fixed initial conditions and no random disturbances. The only exception is an auxiliary robustness check (Section 4.10) where we repeat tests with three different fixed random seeds to compute descriptive 95% confidence intervals—and even there, the seeds are predetermined for reproducibility. Crucially, all licensing conclusions in this paper are drawn solely from deterministic results, not from any statistical or probabilistic analysis.
In summary, this work offers three main contributions. (1) We develop the Deterministic Assurance Framework (DTAF)—a regulator-aligned evaluation framework that combines deterministic performance gates, adversarial stress scenarios, and XRL-based traceability into a unified, licensable test pipeline. (2) We present a comprehensive benchmark comparison of a state-aware SAC controller against optimally tuned conventional controllers (PID-DE and FLC-DE) on a high-fidelity PWR simulation model, using identical operating conditions and disturbance scenarios for a fair, rigorous assessment. (3) We provide a complete reproducibility and audit package for public release (containing all training scripts, configuration files, logged time series, and agent checkpoints), enabling independent validation of results. Together, these contributions demonstrate a viable path to make DRL controllers licensable for grid-interactive nuclear plant control by design—resolving the DRL licensing barrier through determinism, stress-testing, and explainability.

2. Related Work

2.1. Reinforcement Learning in Nuclear, Power, and Turbomachinery Control (2019–2025)

In recent years, researchers have explored DRL-based control across nuclear and energy domains to test feasibility. In the nuclear sector, deep RL agents have been applied to reactor operation tasks with encouraging results. For example, Gong et al. survey numerous implementations and conclude that RL can handle complex multi-objective control scenarios in nuclear power plants [3]. Specific case studies have demonstrated that DRL governors can perform continuous reactor coolant and power regulation while meeting multiple objectives [4]. Even microreactors—compact reactors with fast dynamics—have seen prototype RL controllers achieving improved load-following performance over baseline PID tuning [5]. Similarly, in the broader power and turbomachinery arena, DRL techniques have shown promise. A notable example is the use of multi-agent deep RL to optimize boiler–turbine control systems, where adaptive RL-tuned PID policies outperformed conventional tuning in managing a nonlinear multi-input system [6]. These works collectively indicate that DRL can indeed learn effective control policies for complex energy systems. However, they stop at demonstrating performance and do not furnish the licensing-grade evidence (deterministic guarantees, formal safety checks, etc.) that regulators require. In other words, while DRL feasibility in nuclear and turbomachinery control has been established in the 2019–2025 literature, none of these studies deliver the audit-ready determinism or comprehensive safety assurance needed for actual deployment.

2.2. Safety-Aware, Constrained, and Robust RL; Adversarial Evaluation

Concurrently, the DRL research community has developed various techniques to enhance the safety and robustness of learning agents, as well as methods to adversarially test them. One line of work focuses on constrained and safe RL algorithms that enforce safety criteria during training. For instance, Sun et al. propose a chance-constrained RL controller for power plant supervision that uses Lagrange multipliers to strictly respect state constraints (e.g., reactor thermal limits) throughout the learning process [7]. Such approaches embed nuclear engineering knowledge (safety setpoints and margins) directly into the RL optimization, yielding agents that never experience unsafe excursions even while exploring. Another important direction is the adversarial stress-testing of RL policies. Here, the idea is to actively generate worst-case scenarios to probe an agent’s reliability. In the autonomous driving domain, Feng et al. demonstrate a “dense” deep-RL approach that trains adversarial background agents to expose rare failure modes of a vehicle’s policy [8]. This concept—using AI to test AI—highlights how critical safety scenarios can be systematically uncovered. Despite these advances in safe RL and adversarial evaluation, few efforts have integrated them into a unified, regulator-oriented pipeline. In prior nuclear control studies, RL controllers were typically evaluated on nominal or randomly sampled scenarios, rather than an exhaustive set of adversarial drills. Moreover, while robust RL algorithms can limit certain risks (e.g., by adding noise or using domain randomization), regulators ultimately demand deterministic evidence of safety. To date, there is no standard framework in the nuclear domain that combines constrained RL training, adversarial scenario generation, and formal gate-checking of results. Our work fills this gap by incorporating safety constraints and adversarial tests within the DTAF’s deterministic evaluation regime.

2.3. Explainability and Traceability for Deep RL

As DRL agents make increasingly complex control decisions, explainability and traceability have become critical for acceptance in safety-critical systems. The subfield of eXplainable RL (XRL) has produced a variety of techniques to interpret an agent’s behavior [9]. These include feature local sensitivity methods, which highlight the most influential state variables behind an action (e.g., identifying that a reactor RL agent mainly responds to power level and fuel temperature deviations), and policy simplification methods such as surrogate modeling or rule extraction, which approximate the agent’s policy with a human-readable model. For instance, one can train a simple decision tree or linear model to mimic the DRL policy locally, thereby revealing the policy’s decision logic in specific scenarios. Another approach is behavioral cloning or trajectory analysis, where the RL agent’s state-action trajectories under various disturbances are recorded and compared to those of classical controllers to pinpoint differences in strategy. In the nuclear domain, initial steps have been taken to apply XAI techniques for operator support—for example, M. Najar and X. Wang develop explainable AI models to aid reactor operators during accidents [10]. However, achieving full traceability of a DRL controller’s decisions under extreme transients remains an open challenge. Most XRL studies to date focus on either relatively simple environments or post-hoc visualization tools, often divorced from formal verification needs. In our framework, explainability is not an afterthought but a built-in component: the DTAF’s traceability package generates artifacts (such as annotated time-series plots, policy local sensitivity maps, and scenario-wise action comparisons) that accompany each stress-test result. This ensures that for every deterministic pass/fail outcome, there is a corresponding human-interpretable explanation. 
By aligning XRL techniques with the adversarial test scenarios, we aim to make the DRL agent’s behavior transparent and auditable—satisfying the regulator’s need to know why the AI acted as it did, especially in borderline conditions near safety limits.

2.4. High-Fidelity Simulation Models (Digital Models) for Nuclear/Power Control

High-fidelity simulation models, also referred to as digital models (DMs), serve as indispensable tools for developing and validating control logic in nuclear and power systems. These models emulate the real plant’s physics with sufficient detail to capture transient behaviors, making them ideal for rigorous software-in-the-loop testing. In our context, a PWR simulation DM is used as the core testbed within the DTAF to evaluate all controllers under identical conditions. The advantage of a DM is that extreme scenarios—including rapid load ramps, large disturbances, and equipment faults—can be safely executed and repeated deterministically. Researchers have increasingly employed such digital platforms for AI-driven control studies. For example, Lim et al. describe a high-fidelity PWR simulation framework to train and test an RL-based supervisory controller for advanced reactors [11]. By leveraging a detailed simulator of a Generation-IV plant, they could evaluate the RL agent’s long-term performance and maintenance decisions without any risk to real equipment. In general, DMs allow controllers to be stress-tested in silico against scenarios that might be too dangerous or rare to test on actual reactors or turbines. This not only accelerates development but is also a prerequisite for licensing—regulators will not consider AI control strategies that have not been thoroughly vetted on validated simulation models. Today, such evaluations are typically confined to software-in-the-loop experiments, but the same models can be extended to hardware-in-the-loop setups (e.g., connecting the simulator to physical controller hardware or plant interfaces) under the DTAF methodology. The key point is that the DTAF’s deterministic protocol is model-agnostic: it can be applied to any high-fidelity DM to produce evidence (logs, safety gate outcomes, and traceability reports) that is replayable and reviewable.
This approach ensures that by the time an AI controller is a candidate for on-site trials, it comes with a complete simulation-backed safety dossier. In summary, high-fidelity DMs act as the proving ground where modern control algorithms can earn trust by demonstrating compliance with all operational limits and safety requirements in a virtual yet realistic environment [12,13].

2.5. Comparative Synthesis and Benchmark Rationale

Our study synthesizes insights from the above strands into a cohesive assurance framework, with an emphasis on comparing DRL against strong conventional baselines. In commercial reactor operations, traditional controllers like PID and fuzzy logic remain the dominant solutions due to their stability and regulatory acceptance [14,15]. Over the past decade, many researchers have improved these classical controllers using modern optimization techniques (genetic algorithms, particle swarm, and differential evolution) to enhance performance for complex reactor maneuvers [14,15]. Despite this, prior RL works in nuclear control have rarely, if ever, pitted a DRL agent against an equally well-tuned classical controller under stress conditions. Most feasibility studies compared RL to either a default PID or no baseline at all, leaving open the question of whether the AI truly excels beyond what a properly optimized conventional controller could do. By contrast, we adopt the “Strong Benchmark” philosophy: the SAC agent must demonstrably surpass a PID-DE and FLC-DE that have been optimized across many scenarios. This stringent baseline provides a higher confidence threshold for safety-critical acceptance—an AI that only matches a mediocre controller would not justify the licensing risk. Furthermore, our work combines safety, robustness, and explainability in one deterministic framework, whereas previous research typically addressed these aspects in isolation. For instance, some studies incorporate safety constraints or robust training, and others propose XAI methods, but none have unified them into a single pipeline oriented toward regulator review. It is worth noting that we do not include advanced model-based controllers like MPC or H∞ in our benchmarks; this is a deliberate choice to keep the evaluation controller-centric. 
While methods such as MPC can yield excellent performance, they introduce model-dependent tuning and complexity that are beyond our scope—our focus is on comparing a learning-based policy with human-engineered policies under identical conditions. Importantly, excluding MPC/H∞ also reflects practical considerations: nuclear plants today still rely on PID-family controllers [14], so demonstrating AI superiority over this familiar baseline is a more direct and convincing argument for stakeholders. Finally, we emphasize that our DTAF approach aligns with formal safety expectations. Nuclear design standards like IAEA No. SSR-2/1 (Rev. 1) mandate that all operational transients remain within defined, bounded limits [12], and software safety guidelines (IAEA Safety Standards Series No. SSG-39) stress the importance of deterministic behavior in systems important to safety [13]. However, as a recent review pointed out, most AI control research lacks comprehensive adversarial validation frameworks to ensure these criteria are met [16]. By integrating optimized baseline comparisons, adversarial scenario testing, and XRL-driven transparency, we provide a template for deterministic assurance that addresses this gap. In short, our benchmark rationale and synthesis highlight the novelty of the DTAF: it is not about pushing an RL agent in isolation but about proving that the agent can reliably and explainably beat the best conventional solutions under the exacting conditions regulators care about.

2.6. Rationale for Differential Evolution (DE) in Strong Baseline Optimization

We adopt Differential Evolution (DE) to optimize the Proportional–Integral–Derivative (PID) and Fuzzy-Logic Controller (FLC) baselines, so that the conventional comparators in the Deterministic Assurance Framework (DTAF) are truly “strong.” DE is a derivative-free, population-based global optimizer with few hyperparameters; it is simple to implement and reproduce, and it is effective on nonconvex, multimodal, and noisy or piecewise-discontinuous closed-loop objectives typical of controller tuning [17,18,19]. Strategy-adaptive DE variants further improve robustness across heterogeneous problems [20], and large comparative studies show DE variants to be highly competitive—often superior on average—to Particle Swarm Optimization (PSO) across numerical benchmarks and real-world tasks [21]. In power-system control specifically, DE has been applied successfully to tune load-frequency and governor-related controllers, providing a domain-proximal precedent for our use in baseline construction [22]. Alternative optimizers are less suitable for our gate-aware objective: exhaustive grid searches are inefficient in high-dimensional, constrained spaces; gradient-based methods require smoothness and reliable derivatives that closed-loop plants and penalty-augmented objectives generally lack; and Bayesian optimization introduces surrogate-modeling overhead and typically needs specialized treatments for heteroscedastic/noisy evaluations—factors that complicate transparency and reproducibility for licensing evidence [23,24]. In the DTAF, we formulate a composite cost that aggregates the licensing metrics—the Total Time Unsafe (TTU), Transient Severity Score (TSS), Grid Load-Following Index (GLFI), cumulative control effort (CE_sum), valve-reversal count (V_rev), and speed overshoot (OS_ω)—with hard (infinite) penalties for any gate violation and soft penalties within the safe envelope. 
For comparability, the population size, mutation factor, crossover rate, and generation budget are fixed across governors; all settings and seeds are released with the reproducibility package.
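To make the tuning procedure concrete, the sketch below hand-rolls a DE/rand/1/bin loop that minimizes a gate-aware composite cost. The plant here is a deliberately simple stand-in (a first-order lag under PI control tracking a unit step), and the gains, weights, and 10% overshoot gate are illustrative assumptions; the paper's actual cost aggregates TTU, TSS, GLFI, CE_sum, V_rev, and OS_ω over the full PWR model. The hard (infinite) penalty on a gate violation and the soft penalties inside the safe envelope follow the formulation described above.

```python
import random

def closed_loop_metrics(kp, ki, dt=0.05, steps=400):
    """Toy stand-in for the plant simulation: a first-order lag under PI
    control tracking a unit step. Returns gate-aligned summary metrics."""
    y = integ = effort = overshoot = ise = 0.0
    for _ in range(steps):
        e = 1.0 - y
        integ += ki * dt * e               # integral action
        u = kp * e + integ                 # PI command
        effort += abs(u) * dt              # crude CE_sum analogue
        y += dt * (u - y)                  # plant: tau = 1 s
        overshoot = max(overshoot, y - 1.0)
        ise += e * e * dt
    return {"OS": overshoot, "ISE": ise, "CE_sum": effort}

def composite_cost(params, os_gate=0.10):
    """Gate-aware cost: a hard (infinite) penalty on any gate violation,
    soft weighted penalties inside the safe envelope."""
    m = closed_loop_metrics(*params)
    if m["OS"] > os_gate:                  # deterministic licensing gate
        return float("inf")
    return m["ISE"] + 0.01 * m["CE_sum"]

def differential_evolution(cost, bounds, pop_size=20, F=0.7, CR=0.9,
                           gens=40, seed=1):
    """Minimal DE/rand/1/bin with greedy selection and bound clamping."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [cost(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            trial = [pop[i][d] if rng.random() > CR
                     else min(max(pop[a][d] + F * (pop[b][d] - pop[c][d]),
                                  bounds[d][0]), bounds[d][1])
                     for d in range(dim)]
            f = cost(trial)
            if f <= fit[i]:                # greedy one-to-one replacement
                pop[i], fit[i] = trial, f
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]
```

Because selection never accepts a gate-violating trial over a feasible incumbent, the returned tuning is feasible by construction whenever any feasible point is found, which mirrors the disqualification semantics of the licensing gates.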

3. Methodology

3.1. Plant Simulation Environment

We model a Pressurized Water Reactor (PWR) as a deterministic, single-loop surrogate coupling six-group point kinetics, lumped thermal–hydraulics, a first-order valve servo with rate limiting, a first-order turbine path, and a synchronous generator tied to an infinite bus. The overall architecture and signal flow are shown in Figure 1 and Figure 2, respectively. All symbols are defined at point of use; all parameter values appear in-section to ensure exact reproducibility.
Figure 1 illustrates the complete architecture of the proposed Deterministic Assurance Framework (DTAF), which is engineered to rigorously train, test, and formally verify AI controllers within a safe, high-fidelity simulation environment. The framework is structured as a four-layer hierarchy, with each layer encapsulating a distinct set of functions. At the core, Layer 4 (agent) comprises the Soft Actor-Critic (SAC) reinforcement learning controller, which is responsible for generating real-time control commands, or actions. The agent interacts directly with Layer 3 (environment), the PWR-simulation model, which is wrapped in a PWR Unified Gym Environment interface. This environment executes the agent’s action, calculates the resultant system dynamics, and returns the new state vector and a scalar reward signal, thus closing the standard RL feedback loop. This core loop is governed by Layer 2 (The Analysis Engine), which orchestrates the entire experimental and validation process. This layer is responsible for executing the suite of adversarial stress tests, performing multi-run robustness checks, and conducting the explainability (XAI) analyses required to interpret the agent’s decision-making logic.
Finally, Layer 1 (The Deterministic Assurance Engine) serves as the highest level of oversight. It defines the formal safety and control objectives, such as the operational limits for grid frequency (ftrip) and average reactor temperature (Tavg). This engine performs the ultimate safety verification, systematically ensuring that all agent-driven behaviors, even under stress, remain within the pre-defined safe operating envelope.
Figure 2 presents the schematic of the high-fidelity Pressurized Water Reactor (PWR) simulation model, which functions as the core testbed (Layer 3) within the DTAF. The model accurately captures the plant’s essential dual-loop thermodynamic processes critical for load-following simulations. The Primary Loop circulates pressurized water to transfer thermal energy from the Reactor Core to the Steam Generator. The Pressurizer maintains the high system pressure required to prevent the primary coolant from boiling. In the Secondary Loop, this thermal energy is converted into electrical power. Water in the Steam Generator boils, producing high-pressure steam that drives the turbine. The turbine’s rotational energy is converted into electricity by the generator, which is synchronized with the Electrical Grid. This diagram highlights the primary control interface for the AI agent. The agent’s scalar action output directly actuates the Governor Valve (Av), which regulates the mass flow rate of steam to the turbine. The agent’s control objective is to modulate this valve to maintain the stability of the grid frequency (f) by balancing power generation with grid demand, particularly during challenging load-following transients.
Clarification on the Digital Model and Framework Scope:
It is important to clarify the precise role of the PWR simulation within the DTAF. In this study, the high-fidelity model serves as a digital model—a validated, self-contained simulation environment that acts as a representative proxy for the physical plant. While the term ‘digital twin’ often implies a persistent, bidirectional data link to a specific physical asset, our use here refers to a high-fidelity simulation testbed.
The primary novelty of this paper is the DTAF itself: a model-agnostic assurance workflow. The ‘bidirectional data exchange’ and ‘communication protocols’ are therefore software-in-the-loop (SIL) interactions, representing the data passed between the DTAF’s analysis engine and the simulation environment (i.e., states, actions, and rewards). The DTAF is architected to be equally applicable to a fully-fledged digital twin or even a physical system in a hardware-in-the-loop (HIL) configuration. This study uses the high-fidelity digital model to formally validate the framework’s ability to assess, stress-test, and certify controller behavior against established benchmarks. The DTAF framework can be extended in future work to a full-scale digital twin.

3.1.1. Neutron Point Kinetics (Six Delayed Groups)

$$\dot{n}(t) = \frac{\rho(t) - \beta}{\Lambda}\, n(t) + \sum_{i=1}^{6} \lambda_i C_i(t) + S(t) \tag{1}$$
where n(t) is the normalized neutron density [-]; Λ is the prompt neutron generation time [s]; ρ(t) is the total reactivity [Δk/k]; β is the total delayed neutron fraction [-]; λ_i is the decay constant of precursor group i [s−1]; C_i(t) is the concentration of group-i precursors [-]; and S(t) is an external source (set to 0 in nominal runs).
$$\dot{C}_i(t) = \frac{\beta_i}{\Lambda}\, n(t) - \lambda_i C_i(t), \qquad i = 1, \ldots, 6 \tag{2}$$
where β_i is the fraction of delayed neutrons from group i [-], satisfying $\sum_{i=1}^{6} \beta_i = \beta$.
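A minimal explicit-Euler integrator for the kinetics pair can be sketched as follows. The group constants and generation time are typical PWR benchmark values, not the paper's Table 1 and Table 2 entries, which are not reproduced here. Note that the prompt-kinetics term is stiff, so an explicit step at this time-scale needs a Δt much smaller than the 0.05 s plant step (or an exact-equilibrium start).

```python
# Illustrative six-group PWR constants (typical benchmark values; the
# paper's exact Table 1 and Table 2 entries are not reproduced here).
BETA_I   = [0.000215, 0.001424, 0.001274, 0.002568, 0.000748, 0.000273]
LAMBDA_I = [0.0124, 0.0305, 0.111, 0.301, 1.14, 3.01]  # decay constants [1/s]
BETA     = sum(BETA_I)            # total delayed-neutron fraction
GEN_TIME = 2.0e-5                 # prompt generation time Lambda [s], assumed

def equilibrium_precursors(n=1.0):
    """Steady-state C_i from the precursor balance: C_i = beta_i*n/(Lambda*lambda_i)."""
    return [b * n / (GEN_TIME * l) for b, l in zip(BETA_I, LAMBDA_I)]

def kinetics_step(n, C, rho, dt, S=0.0):
    """One explicit-Euler step of the six-group point-kinetics equations.
    The prompt term is stiff: dt must be far below the 0.05 s plant step."""
    dn = ((rho - BETA) / GEN_TIME) * n \
         + sum(l * c for l, c in zip(LAMBDA_I, C)) + S
    dC = [(b / GEN_TIME) * n - l * c for b, l, c in zip(BETA_I, LAMBDA_I, C)]
    return n + dt * dn, [c + dt * dc for c, dc in zip(C, dC)]
```

Starting from the equilibrium precursor concentrations, a zero-reactivity step leaves the neutron density unchanged, while a small positive insertion produces the expected prompt jump followed by slow delayed-neutron growth.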

3.1.2. Reactivity Feedback and Power Mapping

$$\rho(t) = \rho_{\mathrm{ext}}(t) + \alpha_f \left[ T_f(t) - T_{f0} \right] + \alpha_c \left[ T_c(t) - T_{c0} \right] \tag{3}$$
where ρ_ext(t) is the exogenous reactivity [Δk/k] (0 in this study); α_f and α_c are the fuel and coolant temperature coefficients [Δk/k·°C−1]; T_f(t) and T_c(t) are the lumped fuel and coolant temperatures [°C]; and T_f0 and T_c0 are the reference temperatures [°C]. Here α_f = 0 and α_c = 0 to isolate the controller behavior deterministically.
$$P_{th}(t) = \kappa_P\, n(t) \tag{4}$$
where P_th(t) is the reactor thermal power [MW] and κ_P = 1000.0 MW maps the normalized neutron density to thermal power.

3.1.3. Lumped Thermal–Hydraulic Model

$$C_f\, \dot{T}_f(t) = P_{th}(t) - U_{fc} \left[ T_f(t) - T_c(t) \right] \tag{5}$$
where C_f = 30.0 MJ/°C is the effective fuel heat capacity and U_fc = 2.0 MW/°C is the fuel–coolant conductance.
$$C_c\, \dot{T}_c(t) = U_{fc} \left[ T_f(t) - T_c(t) \right] - U_{cs} \left[ T_c(t) - T_{s0} \right] \tag{6}$$
where C_c = 50.0 MJ/°C is the effective coolant heat capacity; U_cs = 20.0 MW/°C is the coolant–secondary conductance; and T_s0 = 270.0 °C is the fixed secondary-side sink temperature.
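With the stated constants, the two lumped balances reduce to a short Euler update, and the equilibrium temperatures follow in closed form (T_c = T_s0 + P_th/U_cs, T_f = T_c + P_th/U_fc). A sketch, using the in-text parameter values:

```python
C_F, U_FC = 30.0, 2.0    # fuel heat capacity [MJ/degC], fuel-coolant conductance [MW/degC]
C_C, U_CS = 50.0, 20.0   # coolant heat capacity [MJ/degC], coolant-secondary conductance
T_S0 = 270.0             # fixed secondary-side sink temperature [degC]

def thermal_step(Tf, Tc, P_th, dt=0.05):
    """One forward-Euler step of the lumped fuel/coolant energy balances."""
    dTf = (P_th - U_FC * (Tf - Tc)) / C_F
    dTc = (U_FC * (Tf - Tc) - U_CS * (Tc - T_S0)) / C_C
    return Tf + dt * dTf, Tc + dt * dTc

def thermal_steady_state(P_th):
    """Analytic equilibrium at constant thermal power."""
    Tc = T_S0 + P_th / U_CS
    Tf = Tc + P_th / U_FC
    return Tf, Tc
```

At P_th = 1000 MW the equilibrium is T_c = 320 °C and T_f = 820 °C, and the Euler iteration relaxes to these values from any nearby initial condition; the slowest coupled mode has a time constant on the order of tens of seconds, well resolved by the 0.05 s step.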

3.1.4. Valve, Turbine-Governor, and Generator

$$\tau_v\, \dot{v}(t) = u(t) - v(t), \qquad v \in [0, 1] \tag{7}$$
where v(t) is the valve position [-]; u(t) is the controller command [-]; and τ_v = 0.30 s is the valve-servo time constant. A deterministic rate limiter $\lvert \dot{v} \rvert \le r_{\max}$ with r_max = 0.15 s−1 is applied to the commanded motion.
$$\tau_m\, \dot{P}_m(t) = K_t\, v(t) - P_m(t) \tag{8}$$
where P_m(t) is the mechanical power into the turbine [MW]; τ_m = 3.0 s is the steam-path/turbine lag; and K_t = 900.0 MW/- maps the valve position to mechanical power.
$$P_e(t) = \eta_g\, P_m(t) \tag{9}$$
where P_e(t) is the electrical power to the grid [MW] and η_g = 0.98 is the generator efficiency.
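The valve servo with its deterministic rate limiter, the turbine lag, and the generator map chain into a single per-step update. A sketch with the stated parameter values:

```python
TAU_V, R_MAX = 0.30, 0.15   # valve-servo time constant [s], rate limit [1/s]
TAU_M, K_T   = 3.0, 900.0   # turbine lag [s], valve-to-mechanical-power gain [MW]
ETA_G        = 0.98         # generator efficiency

def actuator_step(v, Pm, u, dt=0.05):
    """One Euler step of the valve servo (rate-limited, position-bounded),
    the first-order turbine path, and the algebraic generator output."""
    dv = (u - v) / TAU_V
    dv = max(-R_MAX, min(R_MAX, dv))     # deterministic rate limiter
    v = max(0.0, min(1.0, v + dt * dv))  # physical position bounds [0, 1]
    Pm += dt * (K_T * v - Pm) / TAU_M    # turbine lag
    return v, Pm, ETA_G * Pm             # electrical power out
```

Holding the command at the implied nominal valve position v_0 = 600/(0.98 × 900) keeps the chain at the 600 MW operating point, while a full step command moves the valve by at most r_max·Δt per step.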

3.1.5. Measured Outputs and Signals

$$y(t) = \left[ P_e(t),\; v(t),\; T_f(t),\; T_c(t) \right]^{\mathsf{T}}, \qquad f(t) = f_{\mathrm{grid}}(t), \qquad f_{\mathrm{nom}} = 60\ \mathrm{Hz} \tag{10}$$
where y(t) collects the outputs used by controllers and metrics; f(t) is the measured grid frequency [Hz] from the infinite bus; and f_nom = 60 Hz is nominal.

3.1.6. Parameters and Numerics

Numerical integration uses a fixed step Δt = 0.05 s (forward Euler). Initial conditions correspond to the steady state at P_e = P_ref = 600 MW with infinite-bus synchronism. The implied nominal valve position is v_0 ≈ P_ref/(η_g K_t) = 600/(0.98 × 900) ≈ 0.680, and the thermal states satisfy (5) and (6) at steady power with T_s0.
The six-group point-kinetics constants used in (1) and (2) are drawn from a standard PWR benchmark set, and can be seen in Table 1.
For reproducibility of the precursor balance in (2), the delayed-neutron fractions are provided per group, as seen in Table 2.
The thermal blocks used in (5) and (6) and the power-mapping in (4) use the constants listed below in Table 3.
The valve servo (7), turbine path (8), and electrical output (9) are parameterized as follows in Table 4.
Actuator bounds and the integration step used throughout are summarized next, as seen in Table 5.

3.2. Controllers

3.2.1. Proportional–Integral–Derivative (PID) Governor

We employ a discrete-time PID with a filtered derivative and conditional anti-windup. Letting e(k) = r(k) − y(k) be the tracking error, Δt the loop period, u(k) ∈ [u_min, u_max] the valve command, and r_max the rate limit, saturation precedes rate limiting.
$$u_P(k) = K_p\, e(k) \tag{11}$$
where K_p is the proportional gain [-], and e(k) is the tracking error (setpoint minus measured output).
$$I(k) = I(k-1) + K_i\, \Delta t\, e(k) \tag{12}$$
where K_i is the integral gain [s−1], and Δt is the loop period [s].
$$\tau_d\, \dot{\psi}(t) + \psi(t) = \dot{e}(t), \qquad u_D(t) = K_d\, \psi(t) \tag{13}$$
where ψ(t) is the filtered derivative state [1/s]; τ_d is the derivative filter time constant [s]; and K_d is the derivative gain [s].
$$\psi(k) = a\, \psi(k-1) + b \left[ e(k) - e(k-1) \right], \qquad a = \frac{2\tau_d - \Delta t}{2\tau_d + \Delta t}, \qquad b = \frac{2}{2\tau_d + \Delta t} \tag{14}$$
where ψ(k) is the discrete filtered derivative; a and b are bilinear (Tustin) coefficients ensuring a stable first-order low-pass on the derivative of e(k).
$$u_D(k) = K_d\, \psi(k) \tag{15}$$
where u_D(k) is the derivative contribution to the command [-].
$$u_{\mathrm{raw}}(k) = u_P(k) + I(k) + u_D(k) \tag{16}$$
where u_raw(k) is the unsaturated command [-].
$$u_{\mathrm{sat}}(k) = \min \left\{ \max \left\{ u_{\mathrm{raw}}(k),\, u_{\min} \right\},\, u_{\max} \right\} \tag{17}$$
enforcing the physical valve limits u_min ≤ u_sat(k) ≤ u_max.
$$u(k) = u(k-1) + \mathrm{clip}\!\left( u_{\mathrm{sat}}(k) - u(k-1),\, -r_{\max}\Delta t,\, r_{\max}\Delta t \right) \tag{18}$$
Post-saturation rate limiting ensures |u(k) − u(k − 1)| ≤ r_max·Δt.
The PID gains and limits used in all deterministic runs are fixed and listed below in Table 6.
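Assembled into code, the governor above reads as follows. The gains used here are illustrative placeholders rather than the DE-tuned values of Table 6, and the conditional anti-windup is implemented as "commit the integral only when the command is unsaturated", one common reading of the text.

```python
class PIDGovernor:
    """Discrete PID with Tustin-filtered derivative, output saturation,
    then rate limiting (saturation precedes rate limiting, as stated).
    Gains are illustrative; the paper's DE-tuned values are in its Table 6."""

    def __init__(self, kp, ki, kd, tau_d, dt=0.05,
                 u_min=0.0, u_max=1.0, r_max=0.15):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.dt, self.u_min, self.u_max, self.r_max = dt, u_min, u_max, r_max
        self.a = (2 * tau_d - dt) / (2 * tau_d + dt)  # bilinear coefficients
        self.b = 2.0 / (2 * tau_d + dt)
        self.I = 0.0        # integral state
        self.psi = 0.0      # filtered derivative state
        self.e_prev = 0.0
        self.u_prev = 0.0

    def step(self, r, y):
        e = r - y
        I_cand = self.I + self.ki * self.dt * e            # integral update
        self.psi = self.a * self.psi + self.b * (e - self.e_prev)
        u_raw = self.kp * e + I_cand + self.kd * self.psi  # raw command
        u_sat = min(max(u_raw, self.u_min), self.u_max)    # saturation
        if u_sat == u_raw:
            self.I = I_cand   # conditional anti-windup: commit only if unsaturated
        du = max(-self.r_max * self.dt,
                 min(self.r_max * self.dt, u_sat - self.u_prev))
        u = self.u_prev + du                               # rate limiting
        self.e_prev, self.u_prev = e, u
        return u
```

Under a sustained unit error the command ramps at exactly r_max·Δt per step and never leaves [u_min, u_max], while the uncommitted integral prevents windup during saturation.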

3.2.2. Mamdani Fuzzy-Logic Governor (FLC)

The FLC uses two antecedents—error e and error-rate Δe—and one consequent Δu. Each linguistic variable employs five triangular/trapezoidal sets {NB, NS, ZE, PS, PB}. Inputs are scaled by s_e and s_Δe; the consequent is scaled by s_u. Implication is min(·), aggregation is max(·), and defuzzification is the centroid.
$$e_s = e / s_e, \qquad \Delta e_s = \Delta e / s_{\Delta e} \tag{19}$$
where e_s and Δe_s are the scaled antecedents [-], and s_e and s_Δe are their scales [-].
$$\mu_{C_{ij}}(z) = \min \left\{ \mu_{A_i}(e_s),\, \mu_{B_j}(\Delta e_s) \right\} \tag{20}$$
where the rules are R_ij: (A_i, B_j) → C_ij; implication uses min(·).
$$\mu_C(z) = \max_{i,j}\, \mu_{C_{ij}}(z) \tag{21}$$
where μ_C(z) is the aggregated consequent membership [-].
$$\Delta u = s_u\, \frac{\int z\, \mu_C(z)\, dz}{\int \mu_C(z)\, dz} \tag{22}$$
where s_u is the output scale [-]; Δu is the defuzzified valve increment [-]; and the final command u is obtained by accumulating Δu and enforcing (17) and (18).
$$\mu_{\mathrm{TRI}}(z;\, a, b, c) = \max \left\{ \min \left\{ \frac{z - a}{b - a},\, \frac{c - z}{c - b} \right\},\, 0 \right\} \tag{23}$$
is the normalized triangular membership function used for NB, NS, ZE, PS, PB, with breakpoints (a, b, c) clamped to z ∈ [−1, 1].
The 5 × 5 rule base maps (e, Δe) to Δu linguistic labels as follows, in Table 7.
Input/output scales are fixed and listed next in Table 8.
To ensure exact reproducibility of the FLC, normalized triangular membership breakpoints for es and Δes are provided in Table 9.
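The Mamdani pipeline above can be sketched end to end—scaling, min implication, max aggregation, and centroid defuzzification. For brevity, the sketch uses a reduced three-set rule base (NB/ZE/PB) with a simple diagonal rule mapping and illustrative breakpoints, not the five-set base of Table 7 or the breakpoints of Table 9.

```python
import numpy as np

def tri(z, a, b, c):
    """Normalized triangular membership mu_TRI(z; a, b, c)."""
    return np.maximum(np.minimum((z - a) / (b - a), (c - z) / (c - b)), 0.0)

def mamdani_du(e, de, s_e=1.0, s_de=1.0, s_u=1.0):
    """Scaled inputs -> min implication -> max aggregation -> centroid.
    Three-set illustrative rule base; the paper uses five sets."""
    es, des = np.clip(e / s_e, -1.0, 1.0), np.clip(de / s_de, -1.0, 1.0)
    z = np.linspace(-1.0, 1.0, 201)                # consequent universe
    brk = {"NB": (-2.0, -1.0, 0.0), "ZE": (-1.0, 0.0, 1.0), "PB": (0.0, 1.0, 2.0)}
    order = ["NB", "ZE", "PB"]
    agg = np.zeros_like(z)
    for i, A in enumerate(order):
        for j, B in enumerate(order):
            w = min(tri(es, *brk[A]), tri(des, *brk[B]))     # min implication
            k = int(np.clip(i + j - 2, -1, 1)) + 1           # diagonal rule base
            agg = np.maximum(agg, np.minimum(w, tri(z, *brk[order[k]])))
    if not np.any(agg):
        return 0.0
    return s_u * float(np.sum(z * agg) / np.sum(agg))        # discrete centroid
```

At the origin only the ZE×ZE rule fires and the centroid is zero; a large positive error pushes Δu toward the PB consequent.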

3.2.3. Observation Normalization and Safety Bounds

Continuous variables are normalized by fixed scales S i and clipped to admissible bounds. Safety gates are deterministically applied and terminate an episode when violated.
x_norm,i = clip(x_i / S_i; x_i,min, x_i,max)
where x_norm,i denotes the normalized ith component of the observation vector.
The normalization scales and safety thresholds used in all runs are listed below in Table 10.

3.2.4. Reward Shaping

Reward is defined once here and referenced in downstream Section 3.3 and Section 3.4 without redefinition. A calm-state multiplier attenuates penalties inside tight frequency/power bands to reduce chattering.
r_t = −κ_t (w_f Δf_t² + w_move |Δu_t| + w_jerk |Δu_t − Δu_{t−1}|) + b_t
where κ_t attenuates penalties during calm periods (Table 10) and b_t is the bonus term (Table 11). All weights are dimensionless because signals are normalized.
Reward weights and bonuses (dimensionless) are fixed as follows in Table 11.
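The shaped reward reduces to one function of the frequency error and the last two valve increments. The weights, bonus, and calm-state multiplier below are illustrative stand-ins for the fixed values of Table 11.

```python
def shaped_reward(df, du, du_prev, calm,
                  w_f=1.0, w_move=0.1, w_jerk=0.05,
                  bonus=0.01, kappa_calm=0.2):
    """Sketch of r_t = -kappa_t*(w_f*df^2 + w_move*|du| + w_jerk*|du - du_prev|) + b_t.
    Weights/bonus are illustrative; the fixed values live in Table 11."""
    kappa = kappa_calm if calm else 1.0          # calm-state multiplier
    penalty = w_f * df**2 + w_move * abs(du) + w_jerk * abs(du - du_prev)
    return -kappa * penalty + (bonus if calm else 0.0)
```

The calm multiplier shrinks the movement and jerk penalties inside the tight band, which is what suppresses valve chattering near steady state.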

3.2.5. Five-Phase Curriculum

Training progresses through five deterministic phases. Advancement requires zero safety breaches across the current phase’s evaluation bundle and non-decreasing mean reward.
N_unsafe = 0 ∧ r_avg^(k) ≥ r_avg^(k−1)
where N_unsafe is the count of safety-gate violations in the phase evaluation bundle, r_avg^(k) is the mean return for phase k, and ∧ denotes logical AND.
Curriculum phases, scenario bundles, and promotion conditions are summarized below in Table 12.

3.2.6. Soft Actor–Critic (SAC) Governor

We use a tanh-squashed Gaussian policy with twin Q-critics and target networks. During evaluation, the action is mapped to the valve command and constrained by the same saturation and rate-limit logic in (17) and (18).
a = tanh(μ_θ(s) + σ_θ(s) ⊙ ε),  ε ~ N(0, I)
where ⊙ denotes the elementwise product between σ_θ(s) and ε, with ε drawn from a standard normal.
log π_θ(a|s) = log N(z; μ_θ, σ_θ) − Σ_i log(1 − tanh²(z_i)),  z = artanh(a)
y = r + (1 − d) γ [min_{j∈{1,2}} Q_j^targ(s′, a′) − α log π_θ(a′|s′)],  a′ ~ π_θ(·|s′)
L_Q = E[(Q_i(s, a) − y)²],  i ∈ {1, 2}
L_π = E[α log π_θ(a|s) − min_i Q_i(s, a)]
L_α = E[−α (log π_θ(a|s) + H̄)]
φ_targ ← τ φ + (1 − τ) φ_targ
where s′ is the successor state, d the terminal flag, H̄ the target entropy, φ the critic parameters, and φ_targ the target-critic parameters updated by Polyak averaging with rate τ.
The SAC hyperparameters used in this work are listed below in Table 13.
Replay/evaluation cadence is deterministic and fixed as follows in Table 14.
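The tanh squashing and its change-of-variables log-probability correction can be checked numerically with NumPy alone; in the deployed agent, μ_θ and σ_θ are network outputs, and deterministic evaluation corresponds to ε = 0 so that a = tanh(μ_θ(s)).

```python
import numpy as np

def squashed_gaussian_action(mu, log_std, eps):
    """Tanh-squashed Gaussian sample and its log-probability, following
    the SAC equations above. Deterministic evaluation uses eps = 0."""
    std = np.exp(log_std)
    z = mu + std * eps                           # pre-squash Gaussian sample
    a = np.tanh(z)                               # bounded action in (-1, 1)
    # Diagonal-Gaussian log-density of z
    log_gauss = -0.5 * np.sum(((z - mu) / std) ** 2
                              + 2.0 * log_std + np.log(2.0 * np.pi))
    # Change-of-variables correction: sum_i log(1 - tanh(z_i)^2)
    log_pi = log_gauss - np.sum(np.log(1.0 - np.tanh(z) ** 2 + 1e-9))
    return a, log_pi
```

The small 1e-9 floor guards the logarithm when the squash saturates; actions always remain strictly inside the (−1, 1) box before being mapped to the valve command and passed through the saturation/rate-limit logic.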

3.2.7. Differential Evolution (DE) for Deterministic Tuning

We tune the PID parameters {K_p, K_i, K_d, τ_d} and the FLC scales {s_e, s_Δe, s_u} using SciPy’s DE (strategy = ‘best1bin’) on a fixed catalogue of scenarios. The objective aggregates metrics to minimize (M↓) and to maximize (M↑), with a failure penalty per gate violation.
J(θ) = Σ_{m∈M↓} w_m m(θ) − Σ_{h∈M↑} w_h h(θ) + λ N_fail(θ)
where N_fail(θ) counts gate violations across the catalogue and λ is the penalty weight.
The metric weights used in the DE objective are listed below in Table 15.
The bounds and algorithmic settings for DE are shown next in Table 16 and Algorithm 1.
Algorithm 1. DE-based multi-scenario tuning.
0 procedure DA_LGE(DT, C = {PID_DE, FLC_DE, SAC}, S, K, G, primary_seed)
1 ▷ Inputs: digital twin DT(Δt), controllers C, scenario portfolio S,
1a  KPI set K, licensing gates G, primary_seed
2 ▷ Outputs: ranking, PASS_c, CRS_c per controller, portfolio report, evidence bundle
3 set_random_seed(primary_seed); enable_global_determinism()
4 for each controller c in C do
5  if c ∈ {PID_DE, FLC_DE} then
6   θ_c ← DifferentialEvolution(J_multi_scenario, bounds, seed = primary_seed)
7   freeze(θ_c)
8  end if
9  R_c ← ∅ ▷ portfolio record for controller c
10  for each scenario s in S do
11   reset(DT, initial_state = s.x0, power_level = s.P)
12   schedule_disturbances(DT, s.disturbances)    ▷ deterministic
13   for k = 0 … ⌊s.T/Δt⌋ − 1 do
14    y_k ← sense(DT)
15    u_k ← policy_c(y_k)
16    u_k ← rate_limit(saturate(u_k))
17    x_{k + 1} ← step(DT, u_k, Δt)
18    log⟨k, x_k, y_k, u_k⟩
19   end for
20   m_{c,s} ← compute_KPIs(K, log)
21   g_{c,s} ← evaluate_gates(G, m_{c,s})    ▷ vector of PASS/FAIL
22   r_{c,s} ← composite_score(m_{c,s}, g_{c,s})
23   R_c ← R_c ∪ {(s, m_{c,s}, g_{c,s}, r_{c,s})}
24  end for
25  CRS_c ← aggregate_scores({r_{c,s}}_s; weights = s.weights)
26  PASS_c ← all_gates_pass({g_{c,s}}_s; hard_fail = ‘any’)
27 end for
28 ranking ← sort_by((PASS_c ↓, CRS_c ↓, var({r_{c,s}}) ↑))
29 export_evidence({R_c}_c, configs, seeds, figures, tables)
30 return ranking, {PASS_c, CRS_c}_c, {R_c}_c
31end procedure
Legend: Δt—simulation step; DE—Differential Evolution; KPIs—tracking/overshoot/settling/unsafe time/control effort; G—licensing gates (hard); CRS—composite rating score; PASS_c—portfolio pass; if any hard gate fails → FAIL. Note: PID_DE—Differential-Evolution-optimized PID; FLC_DE—Differential-Evolution-optimized FLC.
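Lines 6–7 of Algorithm 1 map directly onto a single SciPy call. The sketch below substitutes a toy quadratic for the multi-scenario objective J (which in the paper requires full scenario rollouts), so the bounds and optimum are illustrative only; the strategy and fixed seed match the deterministic tuning protocol of Section 3.2.7.

```python
import numpy as np
from scipy.optimize import differential_evolution

def j_multi_scenario(theta):
    """Stand-in objective: the paper's J runs the scenario catalogue and
    aggregates weighted KPIs plus a gate-failure penalty. A smooth toy
    surface keeps this sketch self-contained."""
    kp, ki, kd = theta
    return (kp - 2.0) ** 2 + (ki - 0.5) ** 2 + (kd - 0.1) ** 2

bounds = [(0.0, 10.0), (0.0, 2.0), (0.0, 1.0)]   # illustrative PID bounds
result = differential_evolution(
    j_multi_scenario, bounds,
    strategy="best1bin",   # matches the strategy stated in Section 3.2.7
    seed=42,               # fixed seed -> deterministic tuning run
    maxiter=50, tol=1e-8, polish=True,
)
theta_star = result.x      # frozen thereafter, as in line 7 of Algorithm 1
```

Because the seed and strategy are fixed, re-running the tuner reproduces θ* bit for bit, which is what lets the tuned baselines be frozen as licensing evidence.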

3.2.8. Evidence Capture and Reproducibility

All runs emit versioned logs, checkpoints, and manifests. The following artifacts are produced deterministically at the paths shown.
The artifacts generated by the pipeline are summarized below in Table 17.

3.3. Evaluation Scenarios—Deterministic Disturbance Models, Execution Protocol, and Metrics

This subsection defines the deterministic scenario catalogue, the disturbance models applied to the plant–grid interface, the execution protocol used to evaluate each controller, and the metric suite collected per scenario.
We first define two helper operators used throughout: the clipping operator in (35) and the time-window indicator in (36).
clip(x; a, b) = min(max(x, a), b)
where clip(x; a, b) saturates a scalar x to the closed interval [a, b].
w_[a,b](t) = 1 if a ≤ t ≤ b; 0 otherwise
where w_[a,b](t) is a deterministic 0–1 window that activates a disturbance on the interval [a, b] (seconds).

3.3.1. Deterministic Scenario Catalogue and Disturbance Models

Letting P_ref denote the nominal electrical load (MW) and ΔL_ref a reference load-change magnitude (MW), the commanded load profile L(t) for each scenario is defined below.
Baseline steady operation holds a constant demand level, as in (37).
L_base(t) = P_ref
where L_base(t) is the baseline demand (MW) and P_ref is the nominal power setpoint (MW).
A gradual load increase is modeled as a linear ramp over [t₁, t₂], clipped outside the interval, as in (38).
L_grad(t) = P_ref + ΔL_ref · clip((t − t₁)/(t₂ − t₁); 0, 1)
where 0 ≤ (t − t₁)/(t₂ − t₁) ≤ 1 over the ramp, and ΔL_ref > 0 (MW).
A sudden load increase is represented by a deterministic step at time t_s in (39).
L_step(t) = P_ref + ΔL_ref · w_[t_s,T](t)
where T is the evaluation horizon (s) and w_[t_s,T](t) activates the step at t = t_s.
Sensor-noise injection is deterministic and bounded: fixed multi-tone sinusoids are superposed on measured channels, as in (40) and (41).
n_f(t) = A_f,1 sin(2π f₁ t) + A_f,2 sin(2π f₂ t + φ₂)
f_m(t) = f(t) + n_f(t),  P_m(t) = P(t) + n_P(t)
where f(t) is the true grid frequency (Hz), P(t) is the plant electrical power (MW), f_m(t) and P_m(t) are their measured counterparts, n_f(t) and n_P(t) are deterministic noise signals, and A_f,1, A_f,2, f₁, f₂, and φ₂ are fixed constants provided in Table 18.
Parameter-ramp disturbances alter selected plant parameters deterministically over a window, as in (42).
θ_i(t) = θ_i,0 + r_i (t − t₁) w_[t₁,t₂](t)
where θ_i,0 is the nominal value of parameter i, r_i is the ramp rate (units of θ_i per second), and w_[t₁,t₂](t) gates the change.
A cascading-fault profile is defined by sequential windows on the load channel, as in (43).
L_cf(t) = P_ref + Δ₁ w_[t_a,t_b](t) − Δ₂ w_[t_c,t_d](t)
where Δ₁, Δ₂ > 0 (MW) and [t_a, t_b], [t_c, t_d] are non-overlapping windows that realize compounding stresses.
The combined scenario aggregates the foregoing signals in (44).
L_comb(t) = L_base(t) + Σ_q ΔL_q(t)
where ΔL_q(t) denotes each active disturbance component (ramp, step, noise, parameter ramp, and fault) defined above.
Numerical values used in the deterministic disturbance models are fixed for all evaluations, as seen in Table 18.
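The deterministic load models (37)–(39) and (43) follow directly from the clip and window operators. In the sketch below, P_ref, ΔL_ref, and the window times are illustrative placeholders, not the Table 18 values.

```python
import numpy as np

def window(t, a, b):
    """Deterministic 0-1 window w_[a,b](t) of Eq. (36)."""
    return np.where((t >= a) & (t <= b), 1.0, 0.0)

def load_profiles(t, p_ref=600.0, dl_ref=60.0):
    """Sketch of the load models (37)-(39) and (43). The numeric
    constants here are illustrative, not the Table 18 values."""
    base = np.full_like(t, p_ref)                                    # (37)
    grad = p_ref + dl_ref * np.clip((t - 100.0) / (200.0 - 100.0),
                                    0.0, 1.0)                        # (38)
    step = p_ref + dl_ref * window(t, 300.0, t[-1])                  # (39)
    cascade = (p_ref + 40.0 * window(t, 100.0, 200.0)
                     - 30.0 * window(t, 250.0, 350.0))               # (43)
    return base, grad, step, cascade
```

Because every profile is a closed-form function of t, replaying a scenario reproduces the identical disturbance trace sample for sample, which is what makes single-run licensing evaluations meaningful.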

3.3.2. Deterministic Execution Protocol

All controllers are evaluated with a single deterministic pass per scenario. For each scenario s, the environment is reset, the disturbance model L_s(t) is applied, safety gates are enforced, and the metric set is computed from the resulting closed-loop trajectory. The guard conditions and logging are deterministic and reproducible.
Evaluation proceeds as follows (high-level outline):
(1) Select a scenario (s) from the fixed catalogue; (2) reset state; (3) simulate for T seconds at step Δt under the deterministic disturbance; (4) enforce safety gates (frequency trip f t r i p , thermal and speed limits) during rollout; (5) compute and store all metrics defined in Section 3.3.3; and (6) persist logs and traces.
Safety limits and constants referenced by the protocol are listed here for completeness, as seen in Table 19.

3.3.3. Metrics Collected per Scenario

Letting f(t) be the grid frequency (Hz), e_f(t) = f(t) − f_nom the frequency error, u_k the valve command at discrete step k (per-unit), and N = T/Δt the number of samples, we define the metrics below; each equation is followed by the symbol explanations and units.
The frequency error is defined in (45), and the integral metrics I S E f and I A E f follow in (46) and (47).
e_f(t) = f(t) − f_nom
where e_f(t) (Hz) is the frequency deviation from the nominal f_nom (Hz).
ISE_f = ∫₀^T e_f(t)² dt
where ISE_f has units Hz²·s and aggregates the squared frequency error over the horizon [0, T].
IAE_f = ∫₀^T |e_f(t)| dt
where IAE_f has units Hz·s and aggregates the absolute frequency error over [0, T].
The peak deviation and nadir are defined via extremum operators in (48).
Δf_max = max_{t∈[0,T]} |e_f(t)|,  f_nadir = min_{t∈[0,T]} f(t)
where Δf_max (Hz) is the largest-magnitude deviation and f_nadir (Hz) is the minimum frequency attained.
Rotor-speed overshoot and undershoot (percent) relative to nominal are defined in (49).
OS_ω = 100% · max_t (ω(t) − 1)₊,  US_ω = 100% · max_t (1 − ω(t))₊
where ω(t) is the rotor speed in per-unit, and (x)₊ = max(x, 0) denotes the positive-part operator.
Cumulative actuation effort and valve reversals are computed from the discrete control sequence in (50) and (51).
CE_sum = Σ_{k=1}^N |Δu_k|,  Δu_k = u_k − u_{k−1}
where CE_sum (dimensionless) accumulates the absolute valve movements (per-unit commands).
V_rev = Σ_{k=2}^{N−1} 1{Δu_k Δu_{k−1} < 0}
where V_rev (count) is the number of sign reversals in the valve-movement sequence, and 1{·} is the indicator function.
Spectral damping is assessed from the Welch PSD S_ff(ω) of e_f(t); the band-averaged magnitude is given in (52) and (53).
S_ff(ω) = Welch{e_f(t)}
where Welch{·} denotes a deterministic PSD estimate (fixed window/overlap).
E_damp = (1/(ω₂ − ω₁)) ∫_{ω₁}^{ω₂} S_ff(ω) dω
where [ω₁, ω₂] is the evaluation band (rad/s).
The total unsafe time, used later in licensing gates, is the sum of dwell times in any violating condition, as in (54).
T_unsafe = T(T_fuel > T_fuel,max) + T(ω > ω_max,limit) + T(f ∉ [f_min, f_max])
where T(·) counts the time (s) spent in the indicated set, ω_max,limit is the rotor-speed limit (per-unit), and [f_min, f_max] is the accepted frequency band (Hz).
The Grid Load-Following Index (GLFI) is a bounded tracking score (higher is better), defined in discrete form in (55).
GLFI = 1 − (1/N) Σ_{k=1}^N |P_{e,k} − L_k| / (ΔL_ref + ε_P)
where P_{e,k} is the electrical power (MW) at sample k, L_k is the commanded load (MW), ΔL_ref is the reference load-change magnitude (MW), and ε_P = 1.0 MW is a physical denominator floor.
A composite Transient Severity Score (TSS) is formed as a weighted sum of normalized components in (56) and (57).
ĈE = CE_sum / CE_abs,max,  V̂_rev = V_rev / V_rev,max
where CE_abs,max = r_max · T provides a conservative bound on cumulative movement (dimensionless) and V_rev,max = N − 2 bounds reversal counts (both constants are listed in Table 20).
TSS = w_f · (IAE_f / IAE_f,lim) + w_ce · ĈE + w_vr · V̂_rev + w_os · (OS_ω / OS_ω,max)
where all weights are dimensionless and satisfy w_f + w_ce + w_vr + w_os = 1; IAE_f,lim (Hz·s) and OS_ω,max (%) are scenario-family limits used for normalization.
The fixed constants used by the metric normalizations are listed here in Table 20.
Default metric weights used for the composite TSS are shown below; all are dimensionless, as seen in Table 21.
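The discrete effort, reversal, and tracking metrics of (50), (51), and (55) reduce to a few NumPy operations on the logged traces. The reference magnitude below is an illustrative placeholder for the scenario value.

```python
import numpy as np

def control_metrics(u, p_e, load, dl_ref=60.0, eps_p=1.0):
    """CE_sum, V_rev, and GLFI from discrete traces (Eqs. (50), (51), (55)).
    dl_ref is an illustrative reference load-change magnitude."""
    du = np.diff(u)                                  # Delta u_k = u_k - u_{k-1}
    ce_sum = float(np.sum(np.abs(du)))               # cumulative valve movement
    v_rev = int(np.sum(du[1:] * du[:-1] < 0))        # sign reversals of Delta u
    glfi = 1.0 - float(np.mean(np.abs(p_e - load))) / (dl_ref + eps_p)
    return ce_sum, v_rev, glfi
```

A perfectly tracking controller (P_e,k = L_k everywhere) scores GLFI = 1, and every direction change of the valve-movement sequence adds one to V_rev.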

3.4. Deterministic Assurance Orchestration and Evidence Pipeline

This subsection formalizes the Deterministic Assurance Framework used to qualify grid-interactive nuclear controllers. The framework binds scenario-level stress testing (Section 3.3), quantitative indicators, and evidence packaging into a single licensable pipeline. It is organized around four pillars—Trustworthiness and Reliability, Interpretability and Defense-in-Depth, Regulatory Readiness and Quality, and Continual Assurance—and produces auditable artifacts and objective pass/fail gates aligned to nuclear software and AI guidance (e.g., IAEA SSR-2/1 Criterion 5, NRC RG 1.168, NUREG-2261 Section 3.2, IEC 60880 [25] Category A). All evaluations are deterministic: scenarios S1–S8 are executed exactly as defined in Section 3.3, and the RL policy is evaluated with deterministic action selection. As shown in Figure 3, the Deterministic Assurance Framework (DTAF) operationalizes these pillars by mapping stress-test evidence to licensable, regulator-aligned checks.

3.4.1. Control-Effort Indices and Hard Bounds

We first state the sample count used by all discrete-time metrics, and then derive hard bounds for cumulative movement and reversal counts.
N = T / Δt
where N is the number of samples per evaluation episode (—), T is the evaluation horizon (s), and Δt is the controller step size (s).
CE_abs,max = r_max T
where CE_abs,max (—) is a conservative upper bound on cumulative valve movement, r_max is the rate limit (s−1), and T is the horizon (s). For r_max = 0.15 s−1 and T = 600 s, CE_abs,max = 90.
V_rev,max = N − 2
where V_rev,max (count) is the maximum possible number of sign reversals in a sequence of length N (two samples are needed before a reversal can occur). With N = 12,000 (Δt = 0.05 s, T = 600 s), V_rev,max = 11,998.
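The two hard bounds are simple arithmetic, and a short check reproduces the quoted values:

```python
def hard_bounds(t_horizon, dt, r_max):
    """CE_abs,max = r_max * T and V_rev,max = N - 2, with N = T / dt."""
    n = int(round(t_horizon / dt))                  # samples per episode
    return r_max * t_horizon, n - 2

# Constants quoted in the text: r_max = 0.15 1/s, T = 600 s, dt = 0.05 s
ce_max, vrev_max = hard_bounds(600.0, 0.05, 0.15)
```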
The weights used later in the composite robustness score (CRS) and the limits used by licensing gates are constants for all runs, as seen in Table 22.

3.4.2. Composite Robustness Score (CRS)

For each scenario s, a composite score C R S s is computed from the safety, transient severity, control effort, and tracking components. Safety contributes 1 when no unsafe dwell occurs and 0 otherwise; the other terms are normalized to [0, 1] (Section 3.3).
S_s = 1{T_unsafe^(s) = 0}
where S_s (—) is the per-scenario safety indicator, 1{·} is the indicator function, and T_unsafe^(s) is the total unsafe time in scenario s.
CRS_s = w_safe S_s + w_tss (1 − TSS_s / TSS_lim) + w_eff (1 − ĈE_s) + w_glfi GLFI_s
where ĈE_s = CE_sum^(s) / CE_abs,max (—) is the normalized control effort, GLFI_s (—) is the tracking index, and TSS_s (—) is the composite transient severity. Weights w_safe, w_tss, w_eff, and w_glfi are listed in Table 22.

3.4.3. Per-Scenario Licensing Gates

Per-scenario licensing gates combine deterministic constraints on safety, severity, tracking, and overshoot. Letting TTU ≡ T u n s a f e s , the logical gate is given in (63) and must evaluate to 1 for the scenario to qualify.
Gate_s = 1{T_unsafe^(s) = 0 ∧ TSS_s ≤ TSS_lim ∧ GLFI_s ≥ GLFI_min ∧ OS_ω^(s) ≤ OS_ω,max}
where ∧ denotes logical AND; TSS_lim, GLFI_min, and OS_ω,max are constants from Table 22, and all quantities are computed deterministically from the rollout.
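The per-scenario gate and composite score can be sketched together. The weights and limits below are illustrative stand-ins for the Table 22 constants.

```python
def scenario_gate_and_crs(t_unsafe, tss, glfi, os_w, ce_norm,
                          tss_lim=1.0, glfi_min=0.90, os_max=5.0,
                          w_safe=0.4, w_tss=0.2, w_eff=0.2, w_glfi=0.2):
    """Per-scenario licensing gate and composite CRS score.
    Weights/limits are illustrative stand-ins for Table 22."""
    safe = 1.0 if t_unsafe == 0.0 else 0.0          # safety indicator S_s
    gate = int(t_unsafe == 0.0 and tss <= tss_lim
               and glfi >= glfi_min and os_w <= os_max)
    crs = (w_safe * safe + w_tss * (1.0 - tss / tss_lim)
           + w_eff * (1.0 - ce_norm) + w_glfi * glfi)
    return gate, crs
```

Note that the gate is all-or-nothing while the CRS degrades gracefully, which is why the two are reported side by side: a scenario can pass the gate yet still be ranked by score.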

3.4.4. Policy Interpretability Constraint (Entropy Band)

To guard against both saturated and erratic policies, we bound the average action entropy H p r o x y ¯ within a fixed band (nats), as shown in (64). Entropy is computed deterministically from the deployed policy’s squashed Gaussian outputs with a fixed sampling grid.
H̄_proxy ∈ [H_min, H_max]
where H_min and H_max (nats) define the acceptable interpretability band.
The entropy band used by the interpretability constraint is fixed for all evaluations, seen in Table 23.

3.4.5. Portfolio Aggregation and Licensing Decision

Letting S be the number of scenarios in the catalogue (Section 3.3), we aggregate CRS across scenarios by the arithmetic mean and track the worst and best individual outcomes, as in (65). The final licensing decision is a deterministic logical gate (66) that combines the mean score, per-scenario gates, and the entropy band.
CRS̄ = (1/S) Σ_{s=1}^S CRS_s,  CRS_worst = min_s CRS_s,  CRS_best = max_s CRS_s
Gate_portfolio = 1{CRS̄ ≥ CRS_min ∧ (Gate_s = 1 for all s) ∧ H̄_proxy ∈ [H_min, H_max]}
where CRS_min is the minimum acceptable mean CRS (—), Gate_s is the per-scenario gate from (63), and H̄_proxy is constrained by (64).
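The portfolio aggregation (65) and licensing decision (66) are a few lines. CRS_min and the entropy band below are illustrative stand-ins for the Table 23 and Table 24 constants.

```python
def portfolio_decision(crs_list, gates, h_proxy,
                       crs_min=0.80, h_band=(0.1, 1.5)):
    """Portfolio aggregation and final licensing gate; crs_min and the
    entropy band are illustrative stand-ins for Tables 23-24."""
    crs_mean = sum(crs_list) / len(crs_list)         # arithmetic mean over S
    ok = (crs_mean >= crs_min
          and all(g == 1 for g in gates)             # every per-scenario gate
          and h_band[0] <= h_proxy <= h_band[1])     # entropy band
    return crs_mean, min(crs_list), max(crs_list), int(ok)
```

A single failed scenario gate vetoes the whole portfolio regardless of how high the mean CRS is, which encodes the hard-fail semantics of line 26 in Algorithm 1.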
The portfolio-level constants are shown below and apply uniformly to all controllers evaluated under this framework, seen in Table 24.

3.4.6. Evidence Artifacts and Traceability

All evidence is produced deterministically and stored under version control. Paths are stable and reproducible across runs. Table 25 lists the artifacts and their roles; Table 26 records the fixed constants referenced in this subsection.
The assurance pack is organized as a deterministic bundle with stable paths and roles, seen in Table 25.
For completeness, the fixed numeric constants used throughout Section 3.4 are summarized here, seen in Table 26.

4. Results and Discussion

4.1. Overview of the Evaluation and How to Read the Results

This section quantifies how three governors—a DE-tuned PID, DE-tuned FLC, and an entropy-regularized SAC—perform under the same plant model, limits, and eight fixed disturbance scenarios defined in Section 3.1, Section 3.2, Section 3.3 and Section 3.4. The objective is to assess, in a single deterministic pass, whether a candidate governor can (i) respect safety envelopes, (ii) provide grid-support quality commensurate with SMR needs, and (iii) do so with actuation economy.
The evaluation protocol is strictly deterministic: controller parameters, scenario waveforms, sampling, and limits are fixed ex ante; no stochastic seeding, Monte Carlo sampling, or inferential statistics are used. All KPIs (e.g., GLFI, TSS, CEI, and TTU) are computed from the same traces and windows for each governor, and all artifacts are version-controlled, as described in Section 3.4.6.
The figures are organized to move from global pattern → representative behavior → mechanism → qualification.

4.2. Coherence of the Metric Suite

Figure 4 summarizes the deterministic Pearson associations among the key KPIs computed over the fixed catalog (all scenarios × all controllers). The structure is physically plausible: the GLFI is strongly anti-correlated with TSS (r ≈ −0.90), and the damping-band energy proxy E d a m p is anti-correlated with GLFI (r ≈ −0.80). Conversely, the TSS is positively associated with the cumulative actuation burden C E s u m (r ≈ 0.70) and valve reversals Vrev (r ≈ 0.80). These descriptive (non-inferential) associations justify the combined use of the KPIs without redundancy and motivate the CRS construction in Section 3.4.
For reference, frequency-error and effort quantities used by the KPIs are defined in (67) and (68).
e_f(t) = f(t) − f_nom
where e_f(t) is the grid-frequency deviation (Hz) from nominal f_nom (Hz).
CE_sum = Σ_{k=1}^N |Δu_k|,  Δu_k = u_k − u_{k−1}
where CE_sum is the cumulative valve movement (dimensionless per-unit commands), u_k is the valve command at step k (—), and N = T/Δt samples the T-second episode with step Δt.

4.3. Global Scenario × Controller Landscape

Figure 5 consolidates outcomes by scenario (rows) and controller (columns). The SAC governor consistently occupies the best-performing cells, while classical baselines degrade under Cascading Faults and Combined events. Cells marked “FAIL” indicate deterministic violations of the per-scenario licensing gate (Methods Equation (63)). Failures of the PID/FLC in the hardest scenarios coincide with elevated TSS and CE_sum (cf. Figure 4), reinforcing the mechanistic linkage between effort, reversals, and instability.

4.4. Representative Transient: Nadir, Settling, and Safety Margin

Figure 6 examines a severe frequency excursion. The SAC governor limits the nadir above the trip line and returns to nominal without overshoot. The PID exhibits a deeper nadir and longer recovery; the FLC approaches the trip boundary. The recalled definitions (67) and (68) frame the discussion, together with the steady-state band condition below.
|e_f(t)| ≤ ε_steady-state
The magnitude of the steady-state tracking error e_f(t) must remain below the allowable bound ε_steady-state, where ε denotes the acceptable steady-state band (Hz). A shallower nadir (higher f_nadir), faster re-entry into the ε-band, and small overshoot OS_ω collectively indicate a safer transient (see Figure 6 below).

4.5. Policy Geometry and Actuation Economy

The controller’s geometry in state–action space (Figure 7) explains actuation-burden differences. The PID surface is essentially planar (linear in error and derivative), the FLC surface is piecewise with steep cliffs at rule boundaries, and the SAC surface is smooth and adaptive—steep only along directions needed to arrest drift. This geometry yields disciplined increments Δu_k and reduced reversals V_rev.
We measure economy by a dimensionless performance-to-effort ratio Π in (70).
Π = J_perf / E_u
where J_perf is a bounded composite performance index (e.g., per-scenario CRS) and E_u is cumulative actuation effort (e.g., CE_sum). A higher Π indicates better performance per unit actuation. As shown in Figure 8, the SAC sustains markedly higher Π across all scenarios, especially under Sensor Noise, Parameter Ramp, Cascading Fault, and Combined events.

4.6. Provenance of the Released SAC Policy

Figure 9 documents the curriculum trajectory for the deployed SAC model. P1 establishes stability; P2 emphasizes efficiency; P3–P4 harden resilience; and P5 is the fixed-catalog ‘final exam’. All curves are computed deterministically on the same evaluation harness; improvements and plateaus are audit-ready and replayable.

4.7. Stability Forensics: Phase Portrait and PSD

Figure 10 contrasts phase portraits for the SAC and PID. SAC trajectories contract monotonically toward the origin, whereas the PID exhibits spiral loops—consistent with lightly damped poles and slower energy dissipation. In the frequency domain (Figure 11), the PSD of Δf shows pronounced suppression near the dominant mode for the SAC; the PID and FLC retain higher modal energy.
For reference, the band-energy proxy used in Methods Section 3.3 is recalled in (71).
E_band = (1/(ω₂ − ω₁)) ∫_{ω₁}^{ω₂} S_Δf(ω) dω
where S_Δf(ω) is the PSD of Δf(t), and [ω₁, ω₂] isolates the dominant plant mode (rad/s). A lower E_band implies better modal suppression.

4.8. Ancillary-Service Readiness and Licensability

Figure 12 aggregates the deterministic pass/fail gates—response speed, control efficiency, precision/stability—evaluated against explicit thresholds (Section 3.4: GLFI_min = 0.90, TSS_lim = 1.00, and OS_ω,max = 5%). Only the SAC governor clears all categories, demonstrating readiness for grid-service qualification under identical physics, scenarios, and limits. Because the pipeline is deterministic and version-controlled, every bar in Figure 12 is replayable and auditable.

4.9. Multi-Attribute Performance Profiling

This quantitative difference gives rise to distinct controller “personalities,” synthesized in the multi-attribute profile in Figure 13. The SAC agent’s large, well-balanced polygon demonstrates its holistic superiority, excelling in key areas like Robustness, Foresight, and Control Efficiency. In contrast, the classical controllers exhibit skewed, deficient profiles, visually representing their strategic limitations.

4.10. Controller Robustness and RL Policy Transparency

We evaluated three controllers—the Soft Actor–Critic (SAC), PID_DE, and FLC_DE—across eight scenarios at two power levels (100% and 80%), using three random seeds (42, 43, and 44). All controllers completed all scenarios at both power levels. The median frequency nadirs (Hz), aggregated by controller and level, were as follows: SAC 100% = 58.956, SAC 80% = 58.973; PID_DE 100% = 58.949, PID_DE 80% = 58.901; and FLC_DE 100% = 58.949, FLC_DE 80% = 58.901.
The CRS separates the controllers despite uniform pass/fail outcomes. The SAC exhibits a high mean CRS (≈0.95 at 100% and ≈0.78 at 80%), indicating robust performance with small dispersion. In contrast, the PID_DE and FLC_DE remain an order of magnitude lower at 100% and reach ≲0.25–0.26 at 80%, confirming the RL policy’s superior cross-scenario quality under equal constraints.
The surrogate isolates state channels that most strongly influence the SAC action near safety margins. The prominent negative weight on reactor_power_mw and positive weight on T_fuel are consistent with a policy that rapidly unloads when thermal inertia rises, while tracking grid_frequency_hz corrections. This behavior is absent or muted in the baseline controllers, explaining their lower CRS.
The rapid collapse of action variance after the initial transient signals a highly calibrated policy: the SAC explores only when needed, then locks into low-variance control. This is consistent with the high CRS and stable nadirs reported above.
Together, Figure 14, Figure 15, Figure 16 and Figure 17 substantiate the claim of RL superiority: (i) a higher and more stable CRS; (ii) transparent, mechanistically plausible sensitivities via a local surrogate; (iii) disciplined reduction of action variance after fast transients; and (iv) measured, risk-aware actuation in critical contexts. All artifacts were generated with fixed seeds, reproducible scripts, and 600 dpi export with tight layout to preclude label clipping or overlap.

4.11. Threats to Validity and Limitations

External validity: The current results use an infinite-bus abstraction; extension to networked grids and hardware-in-the-loop is planned. Controller set: Only the SAC, PID, and FLC are benchmarked; the harness is controller-agnostic and can admit additional baselines without changing the protocol. Determinism: By design, we report descriptive (non-inferential) evidence; uncertainty quantification and probabilistic stressors are out of scope here and will be addressed in future work.

4.12. Positioning of the Present Results Relative to Recent RL in Energy Systems

To contextualize the above findings, we relate them to two recent IEEE TII studies: a hierarchical RL framework for regional multi-energy markets [26] and a hybrid policy-based RL approach for island-group energy management under transmission constraints [27]. A multi-dimensional comparison with these works is given in Table 27 (“Concise positioning of this study relative to the two suggested works”). Although all three works employ reinforcement learning, they target different problem classes and evaluation criteria.

4.12.1. Regional Energy Market with Hierarchical RL (Zhang et al. [26])

Zhang et al. address market-clearing in a regional multi-energy system, proposing a hierarchical RL design to improve price-matching/clearing efficiency over large state–action spaces. Their evidence is provided via numerical market case studies with economic/market KPIs (e.g., matching efficiency and operator income effects). In contrast, the present study concerns a safety-critical plant-control loop (PWR governor for load-following) and reports licensing-aligned outcomes: deterministic pass/fail gates tied to plant limits; performance/severity indices (TTU, GLFI, TSS, overshoot, and control effort); and traceability artifacts (time-aligned local sensitivity and critical state–action analysis). Within this setting, our SAC governor satisfies the gates and outperforms a DE-tuned PID/FLC under identical physics and constraints, demonstrating robustness under adversarial, fixed-replay transients.

4.12.2. Island-Group Energy Management with Hybrid Policy RL (Yang et al. [27])

Yang et al. formulate system-level energy management for an island group with transmission constraints, introducing a hybrid policy-based RL capable of handling mixed discrete–continuous actions. Their evaluation emphasizes operational/economic KPIs (e.g., energy-balance satisfaction, and cost/efficiency) over system simulations. By contrast, our evaluation addresses component-level nuclear control under licensing expectations, where success is defined by gate compliance and robustness indices during adversarial transients, again complemented by auditable decision traceability.

4.12.3. Comparability Considerations

Because the objectives, domains, and KPIs in [26,27] (market/dispatch efficiency and system-operation economics) are not commensurate with the licensing-grade, safety-gate evaluation used here, a direct numerical head-to-head comparison is methodologically inappropriate. Instead, these strands are complementary: system-level RL advances (e.g., hierarchical or hybrid policies) inform algorithmic design, while our results provide a deterministic, regulator-facing evidence protocol for safety-critical plant control. To facilitate like-for-like future comparisons, we release the fixed scenario portfolio, gates, and scripts so alternative policies (including hierarchical/hybrid designs) can be evaluated under identical, licensing-aligned conditions.

4.12.4. Implication for the Present Results

Against strong, regulator-familiar baselines (DE-tuned PID/FLC), the SAC governor meets explicit safety gates and maintains favorable robustness/effort trade-offs under adversarial replay, while producing traceable decision evidence. This shifts the evaluation of RL-based control from performance-only reporting toward licensing-grade assurance, filling a gap not addressed by market or system-management studies such as [26,27].

4.12.5. Recent RL-in-Nuclear Studies

Finally, relative to recent RL-in-nuclear studies that primarily report nominal-scenario performance without licensing-grade artifacts [3,4,5,6], the present results contribute a regulator-aligned evidence set: deterministic safety gates, adversarial fixed-replay scenarios, controller-agnostic CRS synthesis, and auditable traceability (local sensitivity and critical state–action contexts). This complements feasibility-focused nuclear RL by supplying the licensing-oriented evaluation scaffolding.

4.13. Summary

Across Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12—and when the full portfolio is re-run from a reduced operating point at 0.8 P_ref (≈480 MW)—the conclusion is consistent: under identical physics, limits, and scenarios, the SAC governor achieves shallower frequency nadirs (smaller excursions), faster settling, stronger modal damping, and a lower actuation burden than the DE-tuned PID and FLC baselines. The geometry of the learned policy (Figure 7) together with the observed action economy (Figure 8) explains the global pattern in the scenario × controller landscape (Figure 5) and the representative transient (Figure 6). Stability forensics—phase portraits and spectral energy (Figure 10 and Figure 11)—are coherent with these behaviors. The deterministic qualification scorecard (Figure 12) consolidates these outcomes into an all-gates pass at both operating points under the same evaluation harness and thresholds, yielding replayable, audit-ready evidence; additionally, the traceability bundle (time-aligned local sensitivity, critical state–action mapping, and action-entropy envelopes) confirms decision regularity near gate boundaries.

5. Conclusions

This work advances a regulator-oriented path to licensable AI control by introducing the Deterministic Assurance Framework (DTAF) and demonstrating it on a high-fidelity pressurized-water-reactor (PWR) digital model. The DTAF converts controller behavior into audit-ready evidence by combining (i) fixed, pass/fail licensing gates tied to formal limits; (ii) a portfolio of adversarial stress scenarios; and (iii) an embedded traceability and explainability package, all executed under a single-run deterministic protocol.
Within this framework, three governor architectures—an entropy-regularized Soft Actor–Critic (SAC) agent, a Differential-Evolution-optimized Proportional–Integral–Derivative Controller, and a Differential-Evolution-optimized Fuzzy-Logic Controller—were evaluated on identical plant physics, limits, and fixed scenario waveforms. The gate suite and thresholds (for example, the minimum Grid Load-Following Index, upper bound on Transient Severity Score, and maximum rotor-speed overshoot) were established ex ante to reflect safety and grid-support expectations.
The evaluation protocol is strictly reproducible and non-inferential: parameters, scenarios, and sampling are fixed; claims flow from trace-level behavior to mechanism to portfolio gate outcome, and all licensing conclusions are drawn from deterministic single-run evidence. The only scoped exception is a clearly labeled auxiliary robustness check using three predetermined seeds, reported as non-licensing context.
Across the adversarial portfolio, the SAC governor satisfies the predeclared gates whenever the safe envelope is achievable, while the strong conventional baselines fail specific gates under high-severity disturbances; when all methods remain within bounds, the SAC provides higher grid-support quality with lower actuation burden and fewer control reversals, consistent with the metric associations observed over the full catalog. These outcomes arise under the same deterministic plant model and gating, and each claim is linked to transparent artifacts: scenario definitions, time-domain traces, mechanism diagnostics, and the final scorecard.

6. Future Work

Higher-fidelity physics and operating regimes. This would involve re-introducing conservative reactivity-feedback coefficients and extending thermal–hydraulic fidelity (e.g., secondary-side dynamics and multi-loop interactions) to test controller behavior under tighter physical coupling and broader operating points (including startup, low-load hot standby, and rapid dispatch ramps). This would deepen model realism while preserving DTAF determinism.
From software-in-the-loop to hardware-in-the-loop. This would involve migrating the fixed portfolio into a hardware-in-the-loop testbed to exercise I/O timing, actuator saturations, and sensor paths under the same gating logic, while maintaining identical scenarios and pass/fail criteria to keep the evidence comparable across SIL→HIL progression.
Expanded adversarial portfolio and parameter sweeps. This would involve systematically enlarging the stress catalog with worst-case composite events (e.g., ramp-with-noise, valve-rate-limited steps, and turbine lag excursions) and adversarial parametric sweeps over plant lags, gains, and limits. This would use the same deterministic harness to produce envelope-wide evidence and reveal policy brittleness modes before any probabilistic analyses are contemplated.
Runtime assurance and safety shields inside the DTAF. This would involve integrating deterministic safety layers (command governors, barrier-function filters, and invariant-set guards) as first-class DTAF components. They would be evaluated with the same gates to quantify how much margin they restore during off-nominal events, and ensure their actions are recorded in the traceability log.
Broader benchmarks under identical gating. This would involve adding model-predictive and robust control baselines (e.g., constrained MPC and H∞) implemented with identical plant models, limits, and scenarios to study trade-offs between explicit constraint handling and policy expressiveness—without changing the scorecard or evidence standard. This would isolate method effects from test-bench effects inside the DTAF.
Grid-service readiness and portfolio decisions. This would involve evolving the final scorecard toward service-qualification bundles (e.g., frequency containment and load-following with reserves) by composing existing gates into regulator-relevant portfolio decisions and documenting the traceability path from scenario to decision artifact.

7. Assumptions, Limitations, and Translational Roadmap

7.1. Study Assumptions

Deterministic evaluation: This assumes a fixed scenario × controller catalog, fixed physics, and fixed limits. Controller parametrizations, tuning procedures, and reward weights are frozen and versioned. Reproducibility artifacts (commits, configs, logs, and figure scripts) are bundled.
Plant/grid abstractions: A PWR governor-centric surrogate is used; the grid is modeled as an infinite bus with finite-band frequency disturbances defined per scenario.
Interfaces/devices: Sensors are ideal signals plus a fixed noise trace in the sensor-noise scenario. Actuation is a valve command with saturation and rate limits at the discrete period Δt.
Protocol/metrics: Pass/fail gates (trip avoidance, steady-state error, and ramp compliance) are deterministic thresholds. KPIs (GLFI, TSS, CE_sum, V_rev, and E_damp) are computed over fixed windows/filters. Learning curves are reported for provenance only.
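The deterministic pass/fail logic can be sketched as a pure threshold function over KPIs. In the sketch below, TTU = 0 is the hard gate named in the paper, while the numeric TSS and GLFI thresholds are illustrative placeholders, not the study's declared values.

```python
# Hedged sketch of a deterministic licensing-gate check. TTU = 0 is the
# hard gate named in the text; TSS_MAX and GLFI_MIN are assumed
# placeholder thresholds standing in for the paper's declared bounds.
TSS_MAX = 1.0    # assumed upper bound on Transient Severity Score
GLFI_MIN = 0.8   # assumed minimum Grid Load-Following Index

def gate_check(ttu: float, tss: float, glfi: float) -> dict:
    """Evaluate the three headline gates on one deterministic run."""
    gates = {
        "TTU_zero": ttu == 0.0,        # Total Time Unsafe must be exactly zero
        "TSS_bounded": tss <= TSS_MAX, # bounded transient severity
        "GLFI_min": glfi >= GLFI_MIN,  # minimum grid-support quality
    }
    gates["PASS"] = all(gates.values())
    return gates

result = gate_check(ttu=0.0, tss=0.42, glfi=0.93)
```

Because the check is a pure function of the run's KPIs, the same verdict is reproduced on every replay of a fixed scenario, which is what makes the evidence audit-ready.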

7.2. Limitations

Model and data: The infinite-bus abstraction omits low-inertia inter-area modes and grid-code logic; the thermal–hydraulic surrogate omits multi-node detail (e.g., DNBR margins). Aging/drift and cyber-physical faults are not time-evolving.
Controllers/training: Baselines are limited to the DE-tuned PID and FLC; MPC/H∞/LQR are not yet included. Evaluation is deterministic by design (no parameter randomization or Monte Carlo spreads). Reward terms target governor behavior and rely on external limits/safety proxies for full-plant protection.
Assurance/deployment: Human-in-the-loop procedures, runtime assurance (RTA/CBFs), and tool/quality audits are not yet implemented end-to-end.

7.3. Translational Roadmap

Objective: The objective is to turn the deterministic DTAF into an industry-deployable, regulator-ready program that remains controller-agnostic and plant-agnostic. All upgrades below preserve scenario determinism; uncertainty is represented via cataloged envelopes rather than probabilistic spreads.

7.3.1. System Realism and Test Environments

Networked-grid dynamics: This moves from the infinite-bus abstraction to reduced-order multi-area networks with finite inertia, governor/turbine models, and grid-code checks, and replays real PMU disturbances. Evidence: scenario replays, inter-area mode damping KPIs, and GLFI under code constraints.
High-fidelity plant twin: This couples the governor loop to multi-node thermal–hydraulic models with fuel/DNBR margins, validated against utility traces. Evidence: limit-envelope compliance and thermal margins per scenario.
Hardware-in-the-loop (HIL): This ports controllers to PLC/RTOS platforms with measured I/O latency, actuator nonlinearities, and watchdogs, maintaining the same deterministic gates. Evidence: closed-loop HIL logs, timing budgets, and zero watchdog trips.

7.3.2. Safety Assurance and Verification

Runtime assurance (RTA): This implements a Simplex-style supervisor with control-barrier functions and verified fallback envelopes (PID/FLC). Evidence: monitor verdicts, dwell times, and intervention logs integrated into the safety case.
Formal verification: This computes reachability-based invariants and signal temporal logic (STL) properties on linearized/hybridized models; the results feed test-oracle generation for scenario-catalog expansion. Evidence: certified safe sets and counterexample-guided scenarios.
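The Simplex-style supervision described above can be sketched as a small switching rule. The `SimplexSupervisor` name and the dwell-time handling below are illustrative assumptions, not the study's implementation; the safety monitor itself (e.g., a barrier-function check) is abstracted into a boolean input.

```python
class SimplexSupervisor:
    """Minimal Simplex-style runtime-assurance switch (illustrative sketch).

    If the safety monitor flags the current state/action as unsafe, the
    verified fallback controller (e.g., PID) takes over for `dwell_steps`
    steps before the advanced policy (e.g., SAC) is re-admitted.
    """

    def __init__(self, dwell_steps: int = 10):
        self.dwell_steps = dwell_steps
        self.cooldown = 0  # remaining steps on the fallback controller

    def select(self, safe: bool, u_advanced: float, u_fallback: float) -> float:
        if not safe:
            self.cooldown = self.dwell_steps  # (re)arm the dwell window
        if self.cooldown > 0:
            self.cooldown -= 1
            return u_fallback
        return u_advanced
```

Logging each `select` call (monitor verdict, active controller, dwell counter) yields exactly the intervention-log evidence named above.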

7.3.3. Controller Breadth and Robustness

Broaden baselines: This adds MPC (with constraints), H∞, and LQR with anti-windup under the same protocol to strengthen comparative claims. Evidence: controller-agnostic scorecards and gate outcomes.
Deterministic uncertainty envelopes: This adds structured parameter sweeps (plant constants, delays, and biases) as catalog variants, reporting envelopes (min/median/max) rather than probabilistic spreads. Evidence: worst-case KPI/gate tables.
Fault tolerance: This includes sensor dropout, actuator stiction, and stuck-valve events with recovery gates and trip-avoidance proofs. Evidence: fault-recovery timing and residual limits.
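Envelope reporting reduces each deterministic sweep to min/median/max summaries rather than probabilistic spreads. A minimal sketch, assuming a hypothetical `kpi_for` stand-in for one catalog run and illustrative variants of the turbine lag τm around its nominal 3.0 s:

```python
def envelope(values):
    """Deterministic min/median/max summary of one parameter sweep."""
    s = sorted(values)
    n = len(s)
    mid = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return {"min": s[0], "median": mid, "max": s[-1]}

# Illustrative sweep over assumed turbine-lag variants (nominal 3.0 s).
tau_variants = [2.4, 2.7, 3.0, 3.3, 3.6]

def kpi_for(tau):
    # Hypothetical stand-in: a real catalog run would replay the fixed
    # scenario with this plant constant and compute the KPI from the trace.
    return 1.0 / tau

kpi_envelope = envelope([kpi_for(t) for t in tau_variants])
```

The resulting tables report worst-case values directly, so gate compliance over the envelope can be verified by inspecting the `min`/`max` entries alone.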

7.3.4. Human Factors, Cybersecurity, and Compliance

Operator acceptance: This adds HSI artifacts (policy-intent visualization and alarm rationalization) and operator-in-the-loop drills, quantifying workload/trust metrics. Evidence: scenario completion with human-override windows.
Cybersecurity drills: This exercises spoofing/tampering/denial scenarios in a segmented testbed with detection/mitigation hooks tied to RTA logs. Evidence: attack detect/mitigate latencies and zero unsafe time under defended scenarios.
Quality and audits: This maps artifacts to safety-case structures, conducting independent V&V and configuration-management audits aligned with nuclear software practice. Evidence: audit checklists and tool/config qualification records.

7.3.5. Action Matrix: From Limitation to Industrial Remedy

Table 28 maps each limitation identified above to its industrial remedy and the corresponding target evidence.

7.4. Research Agenda

Hybrid formal methods: These involve verified CBF synthesis on reduced models and contract-based composition across plant and grid subsystems.
Catalog design: This involves coverage metrics for scenario sets and counterexample-guided expansion from failed certificates.
Interpretable policy analysis: This involves entropy bands plus input–output fragility, local Lipschitz estimates, and action-space curvature as explainability signals tied to gates.
Controller-agnostic benchmarking: This involves publishing an open Assurance Benchmark Suite with fixed physics, gates, and evidence schemas to enable independent replication and regulator pre-review.

7.5. Cross-Industry Parallels

Highly regulated sectors mature AI control via deterministic test harnesses, explicit gates, and auditable artifacts: aviation (software/tool qualification), automotive (ISO 26262) [28], and medical (software lifecycle and safety cases). The DTAF follows the same pattern—fixed models and limits, standard gates, and traceable artifacts—enabling independent replay and audit without seeding or Monte Carlo dependence. This aligns with nuclear licensing culture and scales to SMR-era plant–grid integration.

Author Contributions

A.A.I.: Conceptualization, Methodology, Software, Writing. H.-K.L.: Supervision, Resources, Review and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2025 Research Fund of the KEPCO International Nuclear Graduate School (KINGS), Republic of Korea.

Data Availability Statement

All simulation code, trained agent weights, and raw results are available upon request under the KEPCO International Nuclear Graduate School (KINGS) license.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. U.S. Nuclear Regulatory Commission. Artificial Intelligence Strategic Plan: Fiscal Years 2023–2027, NUREG-2261. 2023. Available online: https://www.nrc.gov/docs/ML2313/ML23132A305.pdf (accessed on 9 November 2025).
  2. Siserman-Gray, C.; Barr, J.; Burniske, J.; Eftekhari, P.E.; Marek, R.; Means, A. Regulatory Challenges Related to the Use of Artificial Intelligence for IAEA Safeguards Verification. In Proceedings of the 64th Annual Meeting of the Institute of Nuclear Materials Management (INMM), Orlando, FL, USA, 21–25 July 2023. [Google Scholar]
  3. Gong, Y.; Chen, Y.; Zhang, J.; Li, X. Possibilities of Reinforcement Learning for Nuclear Power Plants: Evidence on Current Applications and Beyond. Nucl. Eng. Technol. 2024, 56, 1959–1974. [Google Scholar] [CrossRef]
  4. Nguyen, K.H.N.; Rivas, A.; Delipei, G.K.; Hou, J. Reinforcement Learning-Based Control Sequence Optimization for Advanced Reactors. J. Nucl. Eng. 2024, 5, 209–225. [Google Scholar] [CrossRef]
  5. Tunkle, L.; Abdulraheem, K.; Lin, L.; Radaideh, M.I. Nuclear Microreactor Control with Deep Reinforcement Learning. arXiv 2025, arXiv:2504.00156. [Google Scholar] [CrossRef]
  6. Kruthika, U.; Paneerselvam, S. Novel multiagent reinforcement learning framework using twin delayed deep deterministic policy gradient for adaptive PID control in boiler turbine systems. Sci. Rep. 2025, 15, 34558. [Google Scholar] [CrossRef] [PubMed]
  7. Sun, Y.; Khairy, S.; Vilim, R.B.; Hu, R.; Dave, A.J. A Safe Reinforcement Learning Algorithm for Supervisory Control of Power Plants. Knowl.-Based Syst. 2024, 301, 112312. [Google Scholar] [CrossRef]
  8. Feng, S.; Sun, H.; Yan, X.; Zhu, H.; Zou, Z.; Shen, S.; Liu, H.X. Dense reinforcement learning for safety validation of autonomous vehicles. Nature 2023, 615, 620–627. [Google Scholar] [CrossRef] [PubMed]
  9. Milani, S.; Topin, N.; Veloso, M.; Fang, F. Explainable Reinforcement Learning: A Survey and Comparative Review. ACM Comput. Surv. 2023, 56, 140. [Google Scholar] [CrossRef]
  10. Najar, M.; Wang, X. Explainable AI Models for Enhancing Operator Reliability During Reactor Design-Based Accidents Using Radionuclide Data. Nucl. Technol. 2025; early access. [Google Scholar] [CrossRef]
  11. Lim, S.T.; Kim, K.M.; Kang, J.-Y.; Kim, T.; Jerng, D.-W.; Ahn, H.S. A Digital Twin Framework for Generation-IV Reactors with Reinforcement Learning-Enabled Health-Aware Supervisory Control. Prog. Nucl. Energy 2025, in press. [Google Scholar] [CrossRef]
  12. International Atomic Energy Agency. Safety of Nuclear Power Plants: Design, IAEA Safety Standards Series No. SSR-2/1 (Rev. 1); IAEA: Vienna, Austria, 2016; Available online: https://www.iaea.org/publications/10885/safety-of-nuclear-power-plants-design (accessed on 9 November 2025).
  13. International Atomic Energy Agency. Design of Instrumentation and Control Systems for Nuclear Power Plants, IAEA Safety Standards Series No. SSG-39; IAEA: Vienna, Austria, 2016; Available online: https://www-pub.iaea.org/MTCD/Publications/PDF/Pub1694_web.pdf (accessed on 9 November 2025).
  14. Refaat, R.M.; Fahmy, R.A. Optimized Fractional-Order PID Controller Based on Nonlinear Point Kinetic Model for VVER-1000 Reactor. Kerntechnik 2022, 87, 104–114. [Google Scholar] [CrossRef]
  15. Hasan, R.; Masud, M.S.; Haque, N.; Abdussami, M.R. Frequency Control of Nuclear-Renewable Hybrid Energy Systems Using Optimal PID and FOPID Controllers. Heliyon 2022, 8, e11770. [Google Scholar] [CrossRef] [PubMed]
  16. Parada Iturria, F.F.; Martindale, N.A.; Reasor, A.L.; Stewart, S.L.; Ukishima, L.A. AI for Nuclear Safeguards Verification; ORNL/SPR-2024/01; Oak Ridge National Laboratory: Oak Ridge, TN, USA, 2024. Available online: https://www.ornl.gov/publication/ai-nuclear-safeguards-verification (accessed on 9 November 2025).
  17. Storn, R.; Price, K. Differential Evolution—A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. J. Glob. Optim. 1997, 11, 341–359. [Google Scholar] [CrossRef]
  18. Price, K.V.; Storn, R.M.; Lampinen, J.A. Differential Evolution: A Practical Approach to Global Optimization; Springer: Berlin/Heidelberg, Germany, 2005. [Google Scholar]
  19. Das, S.; Suganthan, P.N. Differential Evolution: A Survey of the State-of-the-Art. IEEE Trans. Evol. Comput. 2011, 15, 4–31. [Google Scholar] [CrossRef]
  20. Qin, A.K.; Huang, V.L.; Suganthan, P.N. Differential Evolution Algorithm with Strategy Adaptation for Global Numerical Optimization. IEEE Trans. Evol. Comput. 2009, 13, 398–417. [Google Scholar] [CrossRef]
  21. Piotrowski, A.P.; Napiorkowski, J.J.; Piotrowska, A.E. Particle Swarm Optimization or Differential Evolution—A Comparison. Eng. Appl. Artif. Intell. 2023, 121, 106008. [Google Scholar] [CrossRef]
  22. Biswal, A.; Dwivedi, P.; Bose, S. DE-Optimized IPIDF Controller for Management Frequency in a Networked Power System with SMES and HVDC Link. Front. Energy Res. 2022, 10, 1102898. [Google Scholar] [CrossRef]
  23. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc. IEEE 2016, 104, 148–175. [Google Scholar] [CrossRef]
  24. Makarova, A.; Bardenet, R.; Percival, L. Risk-Averse Heteroscedastic Bayesian Optimization. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 6–14 December 2021; pp. 1–13. [Google Scholar]
  25. IEC 60880; Nuclear Power Plants—Instrumentation and Control Systems Important to Safety—Software Aspects for Computer-Based Systems Performing Category A Functions. International Electrotechnical Commission: Geneva, Switzerland, 2019.
  26. Zhang, N.; Yan, J.; Hu, C.; Sun, Q.; Yang, L.; Gao, D.W.; Guerrero, J.M.; Li, Y. Price-Matching-Based Regional Energy Market with Hierarchical Reinforcement Learning Algorithm. IEEE Trans. Ind. Inform. 2024, 20, 11103–11114. [Google Scholar] [CrossRef]
  27. Yang, L.; Li, X.; Sun, M.; Sun, C. Hybrid Policy-Based Reinforcement Learning of Adaptive Energy Management for the Energy Transmission-Constrained Island Group. IEEE Trans. Ind. Inform. 2023, 19, 10751–10762. [Google Scholar] [CrossRef]
  28. ISO 26262 (All Parts); Road Vehicles—Functional Safety. International Organization for Standardization: Geneva, Switzerland, 2018.
Figure 1. System architecture.
Figure 2. System block diagram.
Figure 3. Digital Twin Assurance Framework (DTAF): Deterministic orchestration that binds stress scenarios, quantitative gates, and auditable artifacts into a regulator-ready licensing pipeline. The four pillars—Trustworthiness and Reliability, Interpretability and Defense-in-Depth, Regulatory Readiness and Quality, and Continual Assurance—anchor the evidence channels and pass/fail logic.
Figure 4. Deterministic Pearson association matrix across the fixed scenario × controller catalog. KPI codes: GLFI (Grid Load-Following Index), TSS (Transient Severity Score), CE_sum (cumulative actuation effort), V_rev (valve reversals), and E_damp (modal damping energy proxy).
Figure 5. Scenario × controller performance heatmap. Higher is better. “FAIL” indicates a violation of deterministic gates.
Figure 6. Frequency response under a hard event; the dotted line denotes the trip limit. The SAC maintains a higher nadir and converges faster.
Figure 7. Policy manifolds in a common (e, de/dt) projection: (a) PID (planar), (b) FLC (piecewise), and (c) SAC (smooth, adaptive).
Figure 8. Performance-to-effort ratio Π across scenarios. The SAC maintains the highest economy across the entire catalog.
Figure 9. Training provenance for the released SAC policy over curriculum phases P1–P5, evaluated on the fixed catalog.
Figure 10. Phase portrait (Δω vs. d(Δω)/dt) for the SAC and PID. The SAC contracts monotonically; the PID follows spiral trajectories indicative of under-damping.
Figure 11. Power spectral density of Δf. The SAC actively suppresses energy near the dominant mode compared with the PID and FLC.
Figure 12. Ancillary-services qualification scorecard (deterministic). The vertical dashed line denotes the qualification threshold.
Figure 13. Multi-attribute performance profile, with archetypal radar synthesis.
Figure 14. Cross-scenario Controller Robustness Scores (CRSs, mean ±95% CI) for the SAC, FLC_DE, and PID_DE at two power levels (100% blue; 80% orange). The SAC maintains a near-unity CRS at both levels, while baseline controllers remain below 0.3 on average at 100% power and improve modestly at 80%. Error bars reflect variability across seeds and scenarios (n = 24 per bar).
Figure 15. Local linear surrogate of the SAC policy around a critical operating context. Positive loadings (right) increase the SAC action; negative loadings (left) decrease it. The surrogate highlights the dominant channels: a large negative weight on reactor_power_mw and strong positive weight on T_fuel, with secondary contributions from grid_frequency_hz and speed_rpm. This mechanistic view explains the SAC’s stabilizing reactions without resorting to opaque end-to-end reasoning.
Figure 16. SAC action-entropy proxy over time during a representative disturbance. Exploration collapses within ~0.6 s, after a brief transient peak (≈0.40), indicating confident, low-variance actuation once the operating point re-enters the admissible band.
Figure 17. Critical contexts (T2): pairwise action comparison at time points with elevated system risk. Bars report normalized actuation magnitudes for the SAC, PID, and FLC at multiple timestamps. The SAC consistently applies the least aggressive input compatible with risk reduction, especially near t ≈ 4.0–4.06 s, while the PID/FLC remain saturated. This selective restraint aligns with the entropy trace and explains the SAC’s superior CRS.
Table 1. Neutron kinetics constants.
Symbol | Description | Value | Units
β | Total delayed neutron fraction | 0.006502 | -
Λ | Prompt neutron generation time | 1.0 × 10^−4 | s
λ1 | Precursor decay constant (group 1) | 0.0124 | s^−1
λ2 | Precursor decay constant (group 2) | 0.0305 | s^−1
λ3 | Precursor decay constant (group 3) | 0.111 | s^−1
λ4 | Precursor decay constant (group 4) | 0.301 | s^−1
λ5 | Precursor decay constant (group 5) | 1.14 | s^−1
λ6 | Precursor decay constant (group 6) | 3.01 | s^−1
Table 2. Delayed neutron fractions per group (six-group; sum to β).
Symbol | Description | Value | Units
β1 | Group-1 delayed neutron fraction | 0.000215 | -
β2 | Group-2 delayed neutron fraction | 0.001424 | -
β3 | Group-3 delayed neutron fraction | 0.001274 | -
β4 | Group-4 delayed neutron fraction | 0.002568 | -
β5 | Group-5 delayed neutron fraction | 0.000748 | -
β6 | Group-6 delayed neutron fraction | 0.000273 | -
Note: β1 + β2 + β3 + β4 + β5 + β6 = 0.006502 = β.
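The constants in Tables 1 and 2 parameterize the standard six-group point-kinetics model; for reference, the standard textbook form (stated here for orientation, not quoted from the study's code) is

```latex
\frac{dn}{dt} = \frac{\rho(t) - \beta}{\Lambda}\, n(t) + \sum_{i=1}^{6} \lambda_i C_i(t),
\qquad
\frac{dC_i}{dt} = \frac{\beta_i}{\Lambda}\, n(t) - \lambda_i C_i(t), \quad i = 1, \dots, 6,
```

with reactivity ρ(t), total delayed fraction β = Σ βi = 0.006502, generation time Λ = 1.0 × 10^−4 s, and the group constants λi, βi taken from Tables 1 and 2.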
Table 3. Thermal–hydraulic and power-mapping constants.
Symbol | Description | Value | Units
κP | Power scaling (n → MW_th) | 1000.0 | MW
Cf | Effective fuel thermal capacity | 30.0 | MJ/°C
Cc | Effective coolant thermal capacity | 50.0 | MJ/°C
Ufc | Fuel-coolant conductance | 2.0 | MW/°C
Ucs | Coolant-secondary conductance | 20.0 | MW/°C
Ts0 | Secondary-side sink temperature (fixed) | 270.0 | °C
Table 4. Turbine-governor and grid constants.
Symbol | Description | Value | Units
τv | Valve servo time constant | 0.30 | s
τm | Steam-path/turbine lag | 3.0 | s
Kt | Turbine gain (v → P_m) | 900.0 | MW/-
ηg | Generator efficiency | 0.98 | -
f_nom | Nominal grid frequency | 60.0 | Hz
P_ref | Nominal electrical power reference | 600.0 | MW
Table 5. Actuator limits and numerical step.
Symbol | Description | Value | Units
u_min | Valve lower bound | 0.0 | -
u_max | Valve upper bound | 1.0 | -
r_max | Valve rate limit | 0.15 | s^−1
Δt | Numerical integration step | 0.05 | s
Table 6. PID parameters.
Symbol | Description | Value | Units
Kp | Proportional gain | 1.800 | -
Ki | Integral gain | 0.300 | s^−1
Kd | Derivative gain | 0.050 | s
τd | Derivative filter time constant | 0.200 | s
u_min | Valve lower bound | 0.0 | -
u_max | Valve upper bound | 1.0 | -
r_max | Valve rate limit | 0.15 | s^−1
Δt | Loop period (numerical step) | 0.05 | s
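A minimal discrete-time realization consistent with Tables 5 and 6 might look as follows. The discretization scheme and the ordering of the rate limit before saturation are assumptions for illustration; the study's governor code is not reproduced here.

```python
# Sketch of a discrete-time PID with first-order derivative filter, rate
# limiting, and output saturation, using the Table 5/6 values. The
# discretization and limit ordering are assumptions, not the study's code.
KP, KI, KD, TAU_D = 1.800, 0.300, 0.050, 0.200   # Table 6 gains
U_MIN, U_MAX, R_MAX, DT = 0.0, 1.0, 0.15, 0.05   # Table 5 limits and step

class PIDGovernor:
    def __init__(self):
        self.integ = 0.0    # integral state
        self.d_filt = 0.0   # filtered derivative state
        self.e_prev = 0.0
        self.u_prev = 0.0

    def step(self, e: float) -> float:
        """Advance one loop period with error e; return valve command."""
        self.integ += KI * e * DT
        # first-order low-pass on the raw derivative (time constant τd)
        d_raw = (e - self.e_prev) / DT
        alpha = DT / (TAU_D + DT)
        self.d_filt += alpha * (d_raw - self.d_filt)
        u = KP * e + self.integ + KD * self.d_filt
        # rate limit relative to the previous command, then saturate
        u = max(self.u_prev - R_MAX * DT, min(self.u_prev + R_MAX * DT, u))
        u = max(U_MIN, min(U_MAX, u))
        self.e_prev, self.u_prev = e, u
        return u
```

With these limits, the valve can move at most r_max · Δt = 0.0075 per step, which is why cumulative effort (CE_sum) and reversal counts (V_rev) are meaningful wear proxies.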
Table 7. FLC rule base (rows: Δe; columns: e).
Δe \ e | NB | NS | ZE | PS | PB
NB | PB | PB | PS | ZE | ZE
NS | PB | PS | ZE | NS | ZE
ZE | PS | ZE | ZE | ZE | NS
PS | ZE | NS | ZE | NS | NB
PB | ZE | ZE | NS | NB | NB
Table 8. FLC scaling parameters.
Symbol | Description | Value | Units
s_e | Error scaling | 1.50 | -
s_Δe | Error-rate scaling | 0.80 | -
s_u | Output scaling | 0.35 | -
Table 9. Normalized triangular MF breakpoints for antecedents (centers at −1, −0.5, 0, 0.5, and 1; ~50% overlap; clamped to [−1, 1]).
Label | a | b | c
NB | −1.00 | −1.00 | −0.50
NS | −1.00 | −0.50 | 0.00
ZE | −0.50 | 0.00 | 0.50
PS | 0.00 | 0.50 | 1.00
PB | 0.50 | 1.00 | 1.00
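Tables 7–9 together define one complete fuzzy inference step. The sketch below assumes min-activation and weighted-average defuzzification over singleton output centers placed at the antecedent centers; the study's defuzzifier may differ, so this is an illustration of the rule base rather than the implemented controller.

```python
# Illustrative Mamdani-style evaluation of the Table 7 rule base with the
# Table 9 triangular MFs and Table 8 scalings. The singleton output
# centers and weighted-average defuzzifier are assumptions.
MF = {  # (a, b, c) breakpoints from Table 9
    "NB": (-1.0, -1.0, -0.5), "NS": (-1.0, -0.5, 0.0), "ZE": (-0.5, 0.0, 0.5),
    "PS": (0.0, 0.5, 1.0), "PB": (0.5, 1.0, 1.0),
}
CENTER = {"NB": -1.0, "NS": -0.5, "ZE": 0.0, "PS": 0.5, "PB": 1.0}
RULES = {  # rows: Δe label; columns: e label (Table 7)
    "NB": {"NB": "PB", "NS": "PB", "ZE": "PS", "PS": "ZE", "PB": "ZE"},
    "NS": {"NB": "PB", "NS": "PS", "ZE": "ZE", "PS": "NS", "PB": "ZE"},
    "ZE": {"NB": "PS", "NS": "ZE", "ZE": "ZE", "PS": "ZE", "PB": "NS"},
    "PS": {"NB": "ZE", "NS": "NS", "ZE": "ZE", "PS": "NS", "PB": "NB"},
    "PB": {"NB": "ZE", "NS": "ZE", "ZE": "NS", "PS": "NB", "PB": "NB"},
}
SE, SDE, SU = 1.50, 0.80, 0.35  # Table 8 scalings

def tri(x, a, b, c):
    """Triangular MF; handles the degenerate shoulders (a == b or b == c)."""
    if x < a or x > c:
        return 0.0
    if x < b:
        return (x - a) / (b - a) if b > a else 1.0
    if x > b:
        return (c - x) / (c - b) if c > b else 1.0
    return 1.0

def flc(e, de):
    """One inference step: scale, clamp, fire all 25 rules, defuzzify."""
    xe = max(-1.0, min(1.0, SE * e))
    xde = max(-1.0, min(1.0, SDE * de))
    num = den = 0.0
    for lde, row in RULES.items():
        for le, lout in row.items():
            w = min(tri(xde, *MF[lde]), tri(xe, *MF[le]))  # min activation
            num += w * CENTER[lout]
            den += w
    return SU * (num / den if den else 0.0)
```

At the origin only the ZE/ZE rule fires, giving zero output; for a large negative error with negative error rate, the NB-row rules drive the output to the positive saturation s_u = 0.35, matching the corrective sense of the rule table.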
Table 10. Normalization scales and safety thresholds (deterministic).
Symbol | Quantity | Value | Units | Notes
S_P | Reactor power (scale) | 1000.0 | MW | Normalization divisor for P
S_T | Fuel temperature (scale) | 1000.0 | °C | Normalization divisor for T_fuel
S_f | Grid frequency (scale) | 1.0 | Hz | Normalization divisor for f
S_ω | Rotor speed (scale) | 1.0 | pu | Normalization divisor for ω
f_trip | Under-frequency trip gate | 49.00 | Hz | Hard safety gate
Δf_calm | Calm band (frequency) | 0.02 | Hz | Calm multiplier applies if ≤ value
ΔP_calm | Calm band (power) | 2.0 | MW | Calm multiplier applies if ≤ value
Table 11. Reward weights and bonuses (dimensionless).
Symbol | Description | Value | Units | Notes
w_f | Frequency error penalty | 1.00 | - | Primary stability focus
w_move | Valve movement penalty | 0.010 | - | Economy and wear proxy
w_jerk | Valve jerk penalty | 0.020 | - | Penalizes reversals
w_bonus | Safe completion bonus | 5.00 | - | Applied once if no gates breached
w_unsafe | Unsafe penalty | 10.00 | - | Applied on any gate violation
c_calm | Calm-state multiplier | 0.50 | - | If |Δf| ≤ 0.02 Hz and |ΔP| ≤ 2 MW
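The Table 11 weights imply a per-step shaping reward of roughly the following form. The use of absolute errors and the placement of the calm multiplier are assumptions; per Table 11, the one-time safe-completion bonus w_bonus is applied at episode end rather than per step, so it appears only as a constant here.

```python
# Hedged sketch of the per-step shaping reward implied by Table 11.
# Absolute-error penalties and the calm-multiplier placement are
# assumptions; the study's exact functional form is not reproduced.
W_F, W_MOVE, W_JERK = 1.00, 0.010, 0.020
W_UNSAFE, C_CALM = 10.00, 0.50
W_BONUS = 5.00  # one-time bonus at episode end if no gates breached (not per step)

def step_reward(df_hz, dp_mw, du, du_prev, gate_violation):
    """df_hz: frequency error; dp_mw: power error; du: valve increment
    this step; du_prev: previous increment (jerk = du - du_prev)."""
    r = -(W_F * abs(df_hz) + W_MOVE * abs(du) + W_JERK * abs(du - du_prev))
    if abs(df_hz) <= 0.02 and abs(dp_mw) <= 2.0:
        r *= C_CALM          # calm-state multiplier (Table 10 bands)
    if gate_violation:
        r -= W_UNSAFE        # applied on any gate violation
    return r
```

The calm multiplier halves the (already small) penalties near the setpoint, discouraging needless valve motion once the operating point is inside the admissible band.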
Table 12. Deterministic curriculum phases and promotion conditions.
Symbol | Phase | Scenario Bundle | Promotion Threshold | Reward Overrides
P1 | Stability and limits | Baseline; ramp-in-place | N_unsafe = 0; non-decreasing r_avg | ↑ w_f; enable calm multiplier
P2 | Efficiency | Baseline; gradual load | N_unsafe = 0; non-decreasing r_avg | ↑ w_move
P3 | Disturbances I | Sensor-noise; parameter ramp (deterministic) | N_unsafe = 0; non-decreasing r_avg | ↑ w_jerk
P4 | Disturbances II | Cascading-fault | N_unsafe = 0; non-decreasing r_avg | Keep P2/P3 overrides
P5 | Final exam | Combined | N_unsafe = 0; non-decreasing r_avg | Freeze weights; evaluate only
Where "↑" denotes an increase in the corresponding reward weight.
Table 13. SAC hyperparameters (values used).
Symbol | Name | Value | Units | Notes
α | Entropy temperature | auto-tuned | - | Target entropy heuristic
γ | Discount factor | 0.99 | - | Stable defaults
τ | Polyak rate | 0.005 | - | Target critic averaging
η_Q | Critic learning rate | 3 × 10^−4 | - | Adam
η_π | Actor learning rate | 3 × 10^−4 | - | Adam
η_α | Temperature learning rate | 3 × 10^−4 | - | If α learnable
B | Batch size | 1024 | samples | Minibatch size
|D| | Replay capacity | 1,200,000 | transitions | FIFO
N0 | Learning starts | 120,000 | steps | Warm-up
T | Total timesteps | 15,000,000 | steps | Training budget
f_eval | Evaluation frequency | 80,000 | steps | Deterministic evals
policy | Policy/widths | MlpPolicy / [512, 512] | - | Hidden units
Table 14. Replay/batch schedule and evaluation settings.
Symbol | Quantity | Value | Units | Notes
t_step | Environment step time | Δt | s | Loop period from plant interface
N_update | Updates per env step | 1 | - | Once learning starts
N_target | Target update cadence | 1 | - | Per gradient step
eval_det | Deterministic evaluation | enabled | - | No exploration noise
Table 15. Metric weights used in J(θ).
| Symbol | Metric | Group | Weight | Notes |
|---|---|---|---|---|
| w_TSS | Transient Severity Score (TSS) | M ↓ | 1.00 | Primary stability objective |
| w_CE | Cumulative actuation effort (CE_sum) | M ↓ | 0.50 | Economy objective |
| w_Vrev | Valve reversals (V_rev) | M ↓ | 0.50 | Mechanical-wear proxy |
| w_GLFI | Grid Load-Following Index (GLFI) | M ↑ | 1.00 | Tracking quality |
| λ | Failure penalty multiplier | - | 100 | Scaled by N_fail(θ) |
Note: “↓” denotes a metric to be minimized (lower values are better); “↑” denotes a metric to be maximized (higher values are better).
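One plausible aggregation of these weights into the tuning objective J(θ) is sketched below: minimized metrics add, the maximized GLFI subtracts, and each failed scenario incurs the λ penalty. The exact signs and normalization are assumptions for illustration; only the weight and λ values come from Table 15.

```python
# Hedged sketch of the DE tuning objective J(theta) from the Table 15
# weights (lower is better). Metric inputs are assumed pre-normalized.
W_TSS, W_CE, W_VREV, W_GLFI, LAM = 1.00, 0.50, 0.50, 1.00, 100.0

def J(tss, ce_norm, vrev_norm, glfi, n_fail):
    """Minimized metrics (TSS, CE, V_rev) add; GLFI (maximized) subtracts;
    lambda * N_fail penalizes any scenario failure."""
    return (W_TSS * tss + W_CE * ce_norm + W_VREV * vrev_norm
            - W_GLFI * glfi + LAM * n_fail)
```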
Table 16. DE bounds and algorithm parameters.
| Symbol | Parameter | Bounds/Value | Units | Notes |
|---|---|---|---|---|
| K_p | PID proportional gain | [0.5, 3.0] | - | Search bound |
| K_i | PID integral gain | [0.05, 0.8] | s⁻¹ | Search bound |
| K_d | PID derivative gain | [0.00, 0.15] | s | Search bound |
| τ_d | Derivative time constant | [0.05, 0.50] | s | Search bound |
| s_e | FLC error scale | [0.5, 2.5] | - | Search bound |
| s_Δe | FLC error-rate scale | [0.3, 1.5] | - | Search bound |
| s_u | FLC output scale | [0.1, 0.8] | - | Search bound |
| F | DE mutation scale | [0.5, 1.0] | - | Differential weight |
| C_r | DE crossover probability | 0.7 | - | Crossover rate |
| G_max | Max iterations | 50 (PID)/30 (FLC) | generations | Stopping criterion |
| P | Population size | 15 | candidates | Per generation |
| tol | Convergence tolerance | 1 × 10⁻² | - | Early-stop threshold |
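The DE settings in Table 16 map onto the classic rand/1/bin scheme, sketched below with a toy sphere objective standing in for the closed-loop cost J(θ). The population size, crossover probability, and F-range come from the table; the specific variant (rand/1/bin), the greedy selection rule, and the fixed seed are assumptions.

```python
import random

# Minimal rand/1/bin differential evolution with the Table 16 settings
# (P = 15, Cr = 0.7, F drawn in [0.5, 1.0]). In the paper's workflow the
# objective would be the deterministic closed-loop cost J(theta).
def de_minimize(f, bounds, pop_size=15, cr=0.7, gens=50, seed=0):
    rng = random.Random(seed)                       # fixed seed: deterministic run
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [f(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            F = rng.uniform(0.5, 1.0)               # differential weight
            jrand = rng.randrange(dim)              # force one mutated gene
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == jrand:
                    v = pop[a][d] + F * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    v = min(max(v, lo), hi)         # clip to search bounds
                else:
                    v = pop[i][d]
                trial.append(v)
            fc = f(trial)
            if fc <= cost[i]:                       # greedy selection
                pop[i], cost[i] = trial, fc
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]
```

For the PID search, `bounds` would be the four rows [K_p, K_i, K_d, τ_d] of Table 16 and `gens` would be 50; for the FLC search, the three scale rows and 30 generations.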
Table 17. Evidence artifacts emitted by the pipeline.
| Symbol | Artifact | Path/Identifier | Frequency | Mechanism |
|---|---|---|---|---|
| A1 | Raw evaluation logs | results/logs/*.csv | Per eval | Auto-export |
| A2 | Training checkpoints | results/checkpoints/*.zip | Per save step | SB3 saver |
| A3 | Best model | results/best/*.zip | On improvement | Eval callback |
| A4 | Final model | results/final/*.zip | End of training | Export final |
| A5 | Reports/figures | results/reports/* | On demand | Figure scripts |
| A6 | Config manifests | results/config/*.yml | Per run | Hash-locked |
Note: “*” denotes a wildcard matching multiple files in the specified directory (e.g., results/logs/run1.csv).
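The "hash-locked" mechanism behind artifact A6 can be illustrated with a standard content fingerprint: the run configuration is serialized in a canonical order and digested with SHA-256, so any silent change to a constant is detectable at audit time. The serialization scheme and function name here are illustrative assumptions.

```python
import hashlib

# Sketch of hash-locking a config manifest (artifact A6). Sorting the keys
# makes the digest independent of dictionary insertion order, so identical
# configurations always yield identical fingerprints.
def manifest_hash(config: dict) -> str:
    canonical = "\n".join(f"{k}: {config[k]}" for k in sorted(config))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```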
Table 19. Safety gates and constants (deterministic).
| Symbol | Description | Value | Units |
|---|---|---|---|
| f_trip | Under-frequency trip threshold | 49.00 | Hz |
| ω_max,limit | Max rotor speed (per-unit) | 1.10 | pu |
| T_fuel,max | Fuel temperature limit | 1500 | °C |
Table 22. CRS weights and licensing thresholds (dimensionless unless noted).
| Symbol | Description | Value | Units/Notes |
|---|---|---|---|
| w_safe | Safety contribution in CRS | 0.40 | - |
| w_tss | Transient-severity contribution in CRS | 0.30 | - |
| w_eff | Control-effort contribution in CRS | 0.15 | - |
| w_glfi | Tracking contribution in CRS | 0.15 | - |
| GLFI_min | Minimum acceptable GLFI | 0.90 | - |
| TSS_lim | Upper bound on TSS | 1.00 | - |
| OS_ω,max | Max rotor-speed overshoot | 5 | % |
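A minimal sketch of how the Table 22 constants combine: the CRS is read here as a weighted sum of per-group sub-scores, each assumed pre-normalized to [0, 1], alongside the hard licensing gates. The weighted-sum form and sub-score normalization are assumptions; the weights and thresholds are from the table.

```python
# Table 22 weights and licensing thresholds.
W_SAFE, W_TSS, W_EFF, W_GLFI = 0.40, 0.30, 0.15, 0.15
GLFI_MIN, TSS_LIM, OS_MAX_PCT = 0.90, 1.00, 5.0

def crs(s_safe, s_tss, s_eff, s_glfi):
    """Composite score from sub-scores assumed in [0, 1] (form illustrative)."""
    return W_SAFE * s_safe + W_TSS * s_tss + W_EFF * s_eff + W_GLFI * s_glfi

def passes_gates(glfi, tss, os_pct):
    """Deterministic pass/fail against the Table 22 licensing thresholds."""
    return glfi >= GLFI_MIN and tss <= TSS_LIM and os_pct <= OS_MAX_PCT
```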
Table 23. Entropy band constants.
| Symbol | Description | Value | Units |
|---|---|---|---|
| H_min | Lower entropy bound | 0.10 | nats |
| H_max | Upper entropy bound | 2.00 | nats |
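The entropy band can be related to the closed-form differential entropy of a Gaussian policy head, H = ½ ln(2πeσ²). This assumes an unsquashed univariate Gaussian, which is a simplification of the tanh-squashed head SAC actually uses, so the sketch is illustrative only.

```python
import math

H_MIN, H_MAX = 0.10, 2.00  # nats (Table 23)

def gaussian_entropy(sigma: float) -> float:
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2).
    Illustrative proxy; SAC's squashed-Gaussian entropy differs."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma * sigma)

def in_band(h: float) -> bool:
    return H_MIN <= h <= H_MAX
```

For example, a unit-variance head sits near 1.42 nats (inside the band), while a very narrow σ = 0.1 head drops below H_min, signaling a policy that has collapsed toward determinism during training.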
Table 24. Portfolio constants (dimensionless unless noted).
| Symbol | Description | Value | Units |
|---|---|---|---|
| S | Number of scenarios | 8 | - |
| CRS_min | Minimum acceptable mean CRS | 0.90 | - |
Table 25. Evidence artifacts (deterministic assurance pack).
| ID | Path/Naming | Role |
|---|---|---|
| A1 | results/logs/*.csv | Raw time-series traces per scenario |
| A2 | results/metrics/*.csv | Per-scenario metric tables (GLFI, TSS, CE_sum, V_rev, E_damp, TTU) |
| A3 | results/checkpoints/best/*.zip | Best controller snapshot (by mean CRS) |
| A4 | results/checkpoints/final/*.zip | Final controller snapshot (end of training) |
| A5 | results/reports/*.md | Auto-generated Markdown reports and summaries |
| A6 | results/config/*.yml | Versioned configuration and constants manifest |
Table 26. Fixed constants referenced in Section 3.4.
| Symbol | Description | Value | Units/Derivation |
|---|---|---|---|
| T | Evaluation horizon | 600 | s (scenario constant) |
| Δt | Controller/eval step | 0.05 | s (scenario constant) |
| N | Samples per episode | 12,000 | - (T/Δt) |
| r_max | Valve rate limit | 0.15 | s⁻¹ (from Section 3.2) |
| CE_abs,max | Max cumulative movement | 90 | - (r_max·T) |
| V_rev,max | Max valve reversals | 11,998 | count (N − 2) |
| GLFI_min | Minimum acceptable GLFI | 0.90 | - (Table 22) |
| TSS_lim | Upper bound on TSS | 1.00 | - (Table 22) |
| OS_ω,max | Max rotor-speed overshoot | 5 | % (Table 22) |
| H_min, H_max | Entropy band | 0.10, 2.00 | nats (Table 23) |
| CRS_min | Minimum acceptable mean CRS | 0.90 | - (Table 24) |
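The derived rows of Table 26 follow arithmetically from the primitive constants, and that derivation can be reproduced mechanically, which is exactly the kind of cross-check the assurance pack is meant to make auditable:

```python
# Primitive constants from Table 26.
T, DT, R_MAX = 600.0, 0.05, 0.15

# Derived constants (round() guards against binary floating-point in T/DT).
N = round(T / DT)            # samples per episode: T / delta-t
CE_ABS_MAX = R_MAX * T       # max cumulative valve movement: r_max * T
V_REV_MAX = N - 2            # max possible valve reversals: N - 2
```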
Table 18. Scenario parameters (deterministic values).
| Symbol | Description | Value | Units |
|---|---|---|---|
| T | Evaluation horizon | 600 | s |
| Δt | Controller/eval step | 0.05 | s |
| f_nom | Nominal grid frequency | 60.0 | Hz |
| ΔL_ref | Reference load-change magnitude | 0.04·P_ref | MW |
| t_1, t_2 | Gradual ramp window | 120, 300 | s |
| t_s | Sudden step time | 60 | s |
| t_a, t_b | Fault window #1 | 120, 210 | s |
| t_c, t_d | Fault window #2 | 360, 420 | s |
| A_f,1, A_f,2 | Frequency-noise amplitudes | 0.003, 0.002 | Hz |
| f_1, f_2 | Noise tones (frequency) | 0.6, 1.2 | Hz |
| Φ_2 | Noise phase | 1.0472 | rad (≈60°) |
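One plausible reading of the deterministic two-tone frequency noise defined by the last three rows of Table 18 is a fixed superposition of sinusoids, so every replay of the scenario is bit-identical. The superposition form itself is an assumption; the amplitudes, tones, and phase are from the table.

```python
import math

A1, A2 = 0.003, 0.002      # Hz, frequency-noise amplitudes (Table 18)
F1, F2 = 0.6, 1.2          # Hz, noise tones
PHI2 = 1.0472              # rad (~60 deg), phase of the second tone

def freq_noise(t: float) -> float:
    """Deterministic two-tone perturbation of grid frequency at time t (s).
    Bounded by A1 + A2 = 0.005 Hz, well inside the 49.00 Hz trip margin."""
    return (A1 * math.sin(2.0 * math.pi * F1 * t)
            + A2 * math.sin(2.0 * math.pi * F2 * t + PHI2))
```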
Table 20. Metric normalization constants (fixed).
| Symbol | Description | Value | Units | Derivation |
|---|---|---|---|---|
| r_max | Valve rate limit | 0.15 | s⁻¹ | From Section 3.2 PID/limits |
| T | Evaluation horizon | 600 | s | Scenario constant |
| N | Samples per episode | 12,000 | - | T/Δt with Δt = 0.05 s |
| CE_abs,max | Max cumulative movement | 90 | - | r_max·T |
| V_rev,max | Max valve reversals | 11,998 | count | N − 2 |
| ε_P | GLFI denominator floor | 1.0 | MW | Physical floor |
| ω_1, ω_2 | PSD band | 3.14, 12.57 | rad/s | 0.5–2.0 Hz |
Table 21. TSS weights and limits (dimensionless).
SymbolDescriptionValueUnits
w f Frequency-error weight0.50-
wceControl-effort weight0.20-
w v r Valve-reversal weight0.20-
w o s Overshoot weight0.10-
I A E f , l i m IAE_f normalization60Hz·s
O S ω , m a x Overshoot limit5%
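Combining Tables 20 and 21, the TSS can be sketched as a weighted sum of normalized components. The weighted-sum form follows the weights in Table 21; clipping each ratio to [0, 1] so that TSS itself stays in [0, 1] is an assumption for illustration.

```python
# Table 21 weights and limits plus Table 20 normalizers.
W_F, W_CE, W_VR, W_OS = 0.50, 0.20, 0.20, 0.10
IAE_F_LIM = 60.0                         # Hz*s
OS_MAX = 5.0                             # percent
CE_ABS_MAX, V_REV_MAX = 90.0, 11998.0    # normalization constants

def tss(iae_f, ce_sum, v_rev, os_pct):
    """Transient Severity Score: weighted sum of normalized components,
    each clipped to [0, 1] (clipping assumed) so TSS lies in [0, 1]."""
    clip01 = lambda x: min(max(x, 0.0), 1.0)
    return (W_F * clip01(iae_f / IAE_F_LIM)
            + W_CE * clip01(ce_sum / CE_ABS_MAX)
            + W_VR * clip01(v_rev / V_REV_MAX)
            + W_OS * clip01(os_pct / OS_MAX))
```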
Table 27. Concise positioning of this study relative to the two suggested works.
| Study | System/Task | Primary KPIs | Safety/Licensing Gates | Deterministic Replay | Traceability Artifacts | Relevance to Present Results |
|---|---|---|---|---|---|---|
| Zhang et al. (2024) [26] | Regional multi-energy market clearing with hierarchical RL | Market matching efficiency; economic outcomes | Not reported | Case-study simulations | Not reported | Algorithmic/architectural RL advance for markets; different objective class |
| Yang et al. (2023) [27] | Island-group energy management under transmission constraints with hybrid-policy RL | Operational cost; energy-balance KPIs | Not reported | Simulation studies | Not reported | System-level management focus; different KPIs and constraints |
| This study | PWR governor control (load-following) with SAC vs. DE-tuned PID/FLC | Gate pass rate; TTU, GLFI, TSS; overshoot; control effort; CRS | Yes: explicit, licensing-aligned | Yes: adversarial, fixed-replay portfolio | Yes: critical state–action mapping, critical pairs, sensitivity | Licensing-grade assurance for safety-critical plant control |
Table 28. Matrixed mapping towards industrial adaptation.
| Limitation (Current) | Deterministic Upgrade | New Evidence Artifact | Gate/KPI Addition | Target Environment |
|---|---|---|---|---|
| Infinite-bus grid | Multi-area RO models + PMU replay | Mode-damping logs, ROCOF checks | Inter-area damping KPI | Real-time sim/HIL |
| TH surrogate | Multi-node TH + DNBR | Margin envelopes per scenario | Thermal-limit gates | High-fidelity twin |
| No HIL | PLC/RTOS with latency and watchdog | Timing budgets, watchdog logs | Timing-budget gate | HIL bench |
| Limited baselines | Add MPC/H∞/LQR | Controller-agnostic scorecards | Portfolio gates unchanged | Twin/HIL |
| No RTA/CBF | Simplex + control-barrier functions | Intervention/dwell logs | RTA-intervention gate | Twin/HIL |
| No formal proofs | Reachability/STL | Certificates + counterexamples | Certificate-presence gate | Twin |
| No uncertainty envelopes | Structured sweeps | Worst-case KPI tables | Envelope-completeness gate | Twin/HIL |
| No fault drills | Dropout/stiction/stuck-valve | Recovery timelines | Fault-recovery gate | Twin/HIL |
| No HSI drills | Operator-in-the-loop runs | Workload/trust metrics | Human-override gate | HIL |
| No cyber drills | Spoof/tamper/DoS tests | Detect/mitigate traces | Cyber-resilience gate | Segmented testbed |