Reinforcement Learning-Enhanced Large Language Models for Automated Modeling of Nuclear Thermal-Hydraulic Systems: A Plan-and-Act Agent Framework

Jun, Luo; Yan, Xiong; Lin, Jing-Chen; Zhang, Da-Zhi

doi:10.3390/app16125885


¹ Department of Engineering Physics, Tsinghua University, Beijing 100084, China

² China Nuclear Power Operation Technology, Co., Ltd., Wuhan 430223, China

³ Department of Thermal Science and Energy Engineering, University of Science and Technology of China, Hefei 230026, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 5885; https://doi.org/10.3390/app16125885

Submission received: 30 April 2026 / Revised: 29 May 2026 / Accepted: 4 June 2026 / Published: 11 June 2026


Abstract

Automating system-level nuclear thermal-hydraulic (T-H) model construction remains challenging because platform-specific API syntax, graph connectivity, parameter dependency ordering, and solver admissibility must be satisfied simultaneously. This study develops a closed-loop modeling framework on the SAFRI platform by combining supervised fine-tuning (SFT), a Plan-and-Act agent with retrieval-grounded parameter completion, and reinforcement learning based on group relative policy optimization (GRPO). The SFT stage uses a 6003-record domain corpus derived from expert-authored or expert-verified SAFRI modeling exemplars, while system-level generalization is evaluated on a held-out 50-case in-house evaluation set separated at the case-template level. At the component level, LoRA-adapted Qwen3-8B achieves 100% code accuracy, compared with 50% for zero-shot and 74% for one-shot prompting. At the system level, the SFT agent attains a 100% syntax success rate (SSR), 90% topology success rate (TSR), and 72.4% physical convergence rate (PCR), showing that local API correctness is insufficient for solver-valid model assembly. After GRPO training with schema, topology, physics, and sequence rewards, the full SAFRI-SFT-RL agent reaches a 100% SSR, 100% TSR, and 88.8% PCR on the in-house evaluation set, while an error self-healing loop resolves execution-time failures in an average of 2.3 corrective iterations. These results show that solver-grounded reinforcement learning is effective for closing the gap between syntactically correct script generation and physically convergent nuclear T-H model construction.

Keywords:

nuclear thermal-hydraulics; automated modeling; large language models; reinforcement learning; GRPO; supervised fine-tuning; Plan-and-Act agent

1. Introduction

Nuclear power remains an important component of reliable, high-density, and low-carbon energy portfolios [1]. As advanced reactor concepts, safety-analysis requirements, and digital engineering workflows continue to expand, nuclear thermal-hydraulic (T-H) simulation has become a core analytical capability in reactor design, licensing, and operation [2,3]. System codes are used to resolve the coupled heat-transfer and fluid-flow behavior of primary and secondary circuits, providing estimates of temperature, pressure, flow distribution, and safety margins under both nominal and off-normal conditions [4,5].

Despite the maturity of contemporary numerical solvers, manual T-H model construction remains a major engineering bottleneck. A system-level model of a Pressurized Water Reactor (PWR) loop may contain hundreds of interconnected objects, including pipes, valves, pumps, branches, junctions, and boundary-condition components [3,4]. Each object must be instantiated with physically admissible parameters and linked in the correct topological and procedural order. In practice, platforms such as RELAP5, TRACE, and SAFRI/PANTHER impose strict interface and sequencing constraints; comparatively small mistakes in parameterization, port matching, or flow-direction assignment can propagate into initialization failure or non-convergent steady-state solutions [5].

Recent progress in Large Language Models (LLMs), especially code-oriented models such as Qwen3 and related systems, has made natural-language-driven generation of engineering scripts increasingly plausible [6,7,8,9]. The central difficulty, however, is not merely one of text generation. In a proprietary simulation environment, a script may appear plausible at the token level while still violating hidden interface constraints, dependency ordering rules, or physical consistency conditions. Recent energy-domain studies similarly show that reliable performance in specialized technical workflows usually requires domain-adaptive training rather than direct transfer from general-purpose model priors [10].

Supervised Fine-Tuning (SFT) and agent-based orchestration therefore provide a more credible foundation than prompting alone. SFT is effective for internalizing domain API usage patterns into model weights, while agent frameworks such as ReAct provide a natural mechanism for decomposing long-horizon engineering tasks into executable subtasks with intermediate feedback [11,12]. Even so, imitation learning alone cannot ensure that an assembled network is executable, topologically complete, and physically convergent. This limitation is consistent with the nuclear digital twin literature, which emphasizes that trustworthy deployment depends on physics-grounded model behavior, runtime validation, and disciplined treatment of uncertainty rather than on surface-level script plausibility [13,14].

Within the broader landscape of engineering-modeling automation, four lines of work define the relevant baseline. (i) Rule-based and template-based generators encode platform syntax into deterministic templates; they are reliable for repeating layouts but cannot generalize across new topologies without rewriting the rules. (ii) Pure agent-based prompting frameworks decompose long tasks into steps but rely on a frozen base model, so they cannot eliminate platform-specific API hallucinations and they do not improve from execution outcomes. (iii) Simulation-in-the-loop optimizers couple a code generator with the solver, but typically operate on a fixed, narrow generator and do not jointly adapt the language model. (iv) Domain-tuned LLM scripting can fix API grammar but, on its own, leaves systemic topology, dependency, and convergence failures unresolved. None of these lines simultaneously offer (a) domain-adapted scripting, (b) dependency-aware long-task decomposition, (c) solver-grounded policy improvement, and (d) an in-trajectory self-healing mechanism. The present framework is designed precisely to close that gap, and the novelty claim of this paper is restricted to that combined closed-loop scope rather than to any one component in isolation.

In this paper, we propose a closed-loop automated modeling framework that integrates SFT, a Plan-and-Act agent architecture, retrieval-grounded parameter completion, and reinforcement learning. Using the SAFRI platform, which serves as the modeling interface for the PANTHER T-H code, as the primary testbed, we advance the LLM from syntax imitation toward solver-grounded optimization of physical convergence. The principal contributions of the study are summarized below.

(1): We constructed a specialized 6003-record domain dataset to fine-tune the Qwen3-8B model using Low-Rank Adaptation (LoRA). We demonstrate that SFT effectively resolves variable hallucinations, achieving 100% local syntax accuracy, substantially outperforming zero-shot (50%) and one-shot (74%) baselines.
(2): We designed a Plan-and-Act agent workflow incorporating Retrieval-Augmented Generation (RAG) and precise dependency tracking to translate raw, structured design documents into executable modeling actions via a Transmission Control Protocol (TCP) interface.
(3): We formalized the system-level modeling task as a Markov Decision Process (MDP) and applied Group Relative Policy Optimization (GRPO). By designing a multi-dimensional reward function (schema, topology, physics, sequence) grounded in real simulation feedback, the framework transitions from syntactic imitation to physical optimization.
(4): We introduced an autonomous error self-healing mechanism. Across the in-house evaluation set, the RL-enhanced agent increased the Physical Convergence Rate (PCR) from 72.4% for the SFT baseline to 88.8% and reduced peak GPU memory by approximately 40% relative to PPO under matched profiling conditions, supporting the practical value of physics-informed RL for solver-grounded modeling.

2. Related Work

2.1. Nuclear Thermal-Hydraulic Simulation and Automated Modeling

Modern nuclear T-H simulation platforms discretize the governing mass, momentum, and energy conservation equations over user-defined networks of control volumes and junctions [3,5]. RELAP5 [2] and TRACE [4], for example, rely largely on text-based input decks, whereas the self-hosted SAFRI platform used here exposes a component-oriented Python API for model construction. Existing attempts to reduce modeling workload have typically remained limited to scripted templates, rule-based preprocessing, or narrowly scoped surrogate models tailored to specific loop families. Those approaches can accelerate parameter filling or repetitive editing, but they do not automatically produce the full executable structure required for solver-ready model assembly. As a result, the translation from design specification to executable simulation model remains largely manual.

2.2. Large Language Models for Engineering Code Generation

Models pre-trained on large code corpora, such as Qwen3 [6], exhibit strong performance on general software generation tasks. Yet direct application to proprietary engineering APIs remains challenging because the target syntax, parameter semantics, and execution constraints are substantially out of distribution [8]. Parameter-Efficient Fine-Tuning (PEFT) methods, especially LoRA [15], make this adaptation practical on institutional hardware by freezing the base model and updating only low-rank matrices. However, SFT remains an imitation-learning procedure. It can encode local API grammar and recurring parameter patterns, but it does not by itself explain why a syntactically correct model may still fail under physical execution [12].

2.3. Reinforcement Learning for Tool-Using Language Agents

To move beyond imitation, LLMs are increasingly aligned using reinforcement learning driven either by human preference signals or by environment execution feedback [16]. Proximal Policy Optimization (PPO) [17] is the standard baseline, but it requires a separate value network and therefore increases GPU-memory demand during training. More recent reasoning-oriented systems, including DeepSeekMath and DeepSeek-R1, have popularized Group Relative Policy Optimization (GRPO) as a lighter-weight alternative for long-horizon policy improvement [18,19]. GRPO avoids an explicit value model by computing relative advantages within a group of candidate trajectories sampled for the same prompt. In the present work, that idea is adapted to engineering simulation, with T-H solver behavior serving as the decisive reward signal for a tool-using, physics-aware modeling agent.

2.4. Critical Synthesis and Positioning

Taken together, the three streams reviewed above (see Table 1 for a structured comparison) leave a concrete methodological gap that motivates the present work. Existing rule- and template-based automation is engineering-pragmatic but brittle, because it assumes a fixed system topology and cannot generalize to new plant architectures without manual rewriting. Pure LLM prompting agents (ReAct, AutoGen, MetaGPT) inherit the strengths and limitations of their frozen base model and therefore reproduce two systemic failure modes on proprietary engineering APIs: variable hallucination and dependency-ordering errors. Reinforcement-learning methods such as PPO and GRPO have so far been developed in mathematical, code-competition, and dialogue settings, in which the reward signal can be defined symbolically; their adaptation to physics-grounded engineering simulators, in which the reward must come from the solver itself, is still under-explored. The present study therefore does not claim to introduce a new RL algorithm; rather, it makes three more specific contributions of an applied-systems nature: (i) it operationalizes solver convergence as a reward source for a nuclear thermal-hydraulic platform; (ii) it formulates self-healing as an in-trajectory RL action rather than as an external rule; and (iii) it integrates domain-adapted scripting, dependency-aware task decomposition, and solver-grounded GRPO into a single closed loop. We acknowledge that the algorithmic building blocks are individually known; the contribution lies in how they are combined and grounded in a safety-critical engineering platform.

3. Simulation Platform and Problem Formulation

3.1. Modeling Hierarchy, Component Taxonomy, and API Parameterization

The SAFRI (Symbolic Analysis Foundation and Research Tools Integrator) platform serves as the graphical and programmatic front end of the PANTHER solver. From the standpoint of automated model construction, the platform can be understood as a three-level hierarchy: system, component, and node/junction. At the system level, the agent must assemble a complete T-H network consistent with the target loop architecture and boundary specification. At the component level, that network is expressed through typed physical objects carrying geometry, constitutive behavior, and operating conditions. At the node/junction level, those objects are resolved into the discrete control volumes and interconnecting flow paths on which the solver enforces conservation and coupling relations. This hierarchy is important because the agent does not merely emit isolated API statements; it incrementally defines a solver-interpretable graph whose local object choices ultimately determine global numerical admissibility.

The practical SAFRI workflow considered in this study can be expressed through 12 typed component objects, which are subsequently grouped into the six broader evaluation families reported in Table 2. These 12 implementation-level objects are pipe, annulus, valve, pump, pressure boundary, temperature boundary, mass-flow boundary, time-dependent boundary, branch, tee, single junction, and connection. For dataset and evaluation set reporting, they are aggregated into the higher-level families Pipe/Annulus, Valve, Pump, Boundary Condition, Branch, and Connection/Junction so that the statistical summaries remain concise while preserving engineering granularity. This distinction clarifies that the task is richer than a six-label classification problem: the agent must determine not only which broad family is needed, but also which typed object, port logic, and parameter set satisfy the intended system function.

The Python API correspondingly exposes three core classes of operations: component instantiation, parameter configuration, and topological connection. In practice, the parameter layer can be organized into six categories: identity and indexing parameters; geometry and discretization parameters; hydraulic and thermodynamic state parameters; equipment-specific constitutive parameters; boundary and operating-condition parameters; and topology and port-binding parameters. Framed in this manner, automated modeling is not equivalent to generic code generation. It is a constrained process of selecting typed objects and assigning parameter bundles that must remain mutually consistent across the system, component, and node/junction hierarchy.

At the governing-equation level, the underlying system-code formulation follows the standard one-dimensional two-phase conservation framework widely used in reactor safety analysis [2,3,4,5,20]. Phase-level mass, momentum, and energy balances are written as

\frac{\partial (α_{k} ρ_{k} A)}{\partial t} + \frac{\partial (α_{k} ρ_{k} u_{k} A)}{\partial x} = Γ_{k} A,

\frac{\partial (α_{k} ρ_{k} u_{k} A)}{\partial t} + \frac{\partial (α_{k} ρ_{k} u_{k}^{2} A)}{\partial x} = - α_{k} A \frac{\partial p}{\partial x} + S_{m, k},

\frac{\partial (α_{k} ρ_{k} h_{k} A)}{\partial t} + \frac{\partial (α_{k} ρ_{k} u_{k} h_{k} A)}{\partial x} = S_{e, k},

where α_{k} is the phase volume fraction, ρ_{k} is density, u_{k} is velocity, h_{k} is specific enthalpy, A is the flow area, and Γ_{k}, S_{m, k}, and S_{e, k} represent interfacial mass transfer together with the lumped momentum and energy source terms associated with wall friction, gravity, pumps, heat transfer, and phase interaction. In the SAFRI implementation, the node-level discretization carries the control-volume state variables, while junction-like entities transmit momentum and enthalpy exchange between neighboring volumes. The agent therefore does not manipulate the partial differential equations directly; instead, it constructs the discrete graph whose components, nodes, and junctions determine whether the numerical realization of these balances is physically meaningful.

This interpretation clarifies why seemingly minor script errors can have system-level consequences. A component must exist before its parameters can be modified, ports must be declared before they can be linked, and active devices such as pumps or controlled valves must be embedded within a topologically continuous network before meaningful initialization is possible. A reversed junction direction, inconsistent boundary type, omitted geometry field, or incorrect sequencing step may be trivial as text, yet it can produce disconnected control volumes, non-physical pressure propagation, or steady-state divergence once the solver assembles the node/junction equations. From the automation standpoint, the essential challenge is therefore to generate an action sequence that is simultaneously API-valid, topologically coherent, and compatible with the conservation-law structure enforced by the PANTHER backend.
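The ordering and connectivity constraints described above can be illustrated with a minimal sketch. It is not the SAFRI implementation: the action tuples (`"create"`, `"configure"`, `"connect"`) and the function names `check_sequence` and `is_connected` are hypothetical stand-ins chosen for illustration, and only component existence and graph connectivity are checked.

```python
def check_sequence(actions):
    """Return ordering violations for a list of (op, *args) action tuples.

    The ops "create", "configure" and "connect" are illustrative only and
    do not correspond to actual SAFRI API calls.
    """
    components = set()
    errors = []
    for step, (op, *args) in enumerate(actions):
        if op == "create":
            if args[0] in components:
                errors.append(f"step {step}: {args[0]} already exists")
            components.add(args[0])
        elif op == "configure":
            if args[0] not in components:
                errors.append(f"step {step}: configure before create: {args[0]}")
        elif op == "connect":
            missing = [name for name in args if name not in components]
            if missing:
                errors.append(f"step {step}: connect before create: {missing}")
        else:
            errors.append(f"step {step}: unknown op {op}")
    return errors


def is_connected(components, connections):
    """True if all components form one undirected connected graph."""
    if not components:
        return True
    neighbours = {name: set() for name in components}
    for a, b in connections:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(components))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == set(components)
```

A script that is valid line by line can still fail either check, which is the sense in which local API correctness does not imply a solver-ready network.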

3.2. Task Formalization as a Markov Decision Process (MDP)

To incorporate reinforcement learning, the automated modeling workflow is formalized as a Markov Decision Process defined by the tuple (S, A, P, R, γ). The state space S contains the target design document in structured JSON form, the executed action history, environment feedback, and the current internal SAFRI graph state. The action space A consists of the structured LLM output that specifies the next SAFRI Python API operation. The transition function P is deterministic: executing an action either updates the platform state or returns an error traceback when an API or dependency constraint is violated. The reward function R returns a scalar signal aggregated from several reward channels, which evaluate intermediate syntax and topology quality together with terminal physical convergence. The discount factor γ is set to 0.95 so that unnecessarily long action sequences are penalized.

State Space (S): The state s_{t} includes the target structural design document (JSON format), the operational history (previously executed API calls and their environment feedback), and the internal state of the SAFRI platform (instantiated topological graph).
Action Space (A): The action a_{t} comprises the structured JSON response generated by the LLM, which dictates the target SAFRI Python API code snippet for execution.
Transition Function (P): The environment is deterministic. Executing a_{t} either updates the platform’s topological state or returns an error traceback if the API constraints are violated.
Reward Function (R): A scalar feedback signal aggregated from multiple reward channels, evaluating intermediate syntax and topology quality together with terminal physical convergence.
Discount Factor (γ): Set to 0.95 to penalize unnecessarily protracted modeling sequences.
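The state and transition definitions above can be sketched as a small deterministic environment. This is an illustrative toy under stated assumptions: the `ModelingState` fields mirror the three state elements listed in the text, but the action schema (`"op"`, `"name"`, `"type"`) and the feedback strings are hypothetical and stand in for generated SAFRI API calls and platform tracebacks.

```python
from dataclasses import dataclass, field


@dataclass
class ModelingState:
    design_doc: dict                               # target design document
    history: list = field(default_factory=list)    # (action, feedback) pairs
    graph: dict = field(default_factory=dict)      # component name -> type


def step(state, action):
    """Deterministic transition: apply one action, return (new_state, feedback).

    The action dict schema is hypothetical; a failed action leaves the graph
    unchanged and records an error message instead of raising.
    """
    graph = dict(state.graph)
    if action["op"] == "add":
        if action["name"] in graph:
            feedback = f"error: {action['name']} already exists"
        else:
            graph[action["name"]] = action["type"]
            feedback = "ok"
    elif action["op"] == "set":
        if action["name"] in graph:
            feedback = "ok"
        else:
            feedback = f"error: {action['name']} is not defined"
    else:
        feedback = f"error: unsupported op {action['op']}"
    history = state.history + [(action, feedback)]
    return ModelingState(state.design_doc, history, graph), feedback
```

Because each call returns a new state object, the same action applied to the same state always yields the same result, which matches the deterministic transition assumed in the formalization.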

4. Methodology

4.1. Overall Framework

The proposed framework is organized as a closed-loop architecture coupling offline domain adaptation, online tool-using agent execution, and solver-grounded policy refinement, as shown in Figure 1. In the offline stage, a domain-specific instruction-code corpus is used to adapt Qwen3-8B through LoRA-SFT, producing a policy that reliably internalizes SAFRI API grammar, component parameterization patterns, and local dependency rules. In the online stage, the adapted model is embedded in a Plan-and-Act agent that decomposes a design specification into executable subtasks, retrieves only the locally relevant engineering parameters, and emits structured API actions for immediate execution in the SAFRI/PANTHER environment. In the optimization stage, execution traces, topology checks, and solver outcomes are transformed into multi-channel rewards and corrective trajectories, enabling GRPO to improve the policy toward system-level topological integrity and physical convergence rather than token-level plausibility alone.

From the information-flow perspective shown in Figure 2, the framework maps a structured plant description into a sequence of state transitions over a partially constructed T-H graph. The planner determines task order under dependency constraints; the retrieval module grounds each subtask in case-specific parameters and component attributes; the code-generation policy proposes the next executable action; and the SAFRI client updates the evolving system graph while returning logs that expose syntax faults, topological inconsistencies, and convergence behavior. The resulting feedback is not used merely as a pass/fail filter. Instead, it is injected back into the trajectory as a learning signal that supports both immediate repair and long-horizon policy improvement. This design is necessary because automated T-H modeling is constrained simultaneously by API legality, graph connectivity, and solver admissibility, so the architecture must coordinate linguistic reasoning with executable engineering state transitions.

To make the transition from natural-language instructions to executable component construction more concrete, Figure 3 reproduces a representative source-document illustration in which a short pipe-creation request is translated into a structured configuration workflow.

4.2. Domain Dataset Construction

Because no public fine-tuning datasets exist for the proprietary SAFRI platform, a domain-specific corpus had to be constructed from scratch. A semi-automated synthesis pipeline was therefore developed to expand a curated set of expert-authored seed exemplars through controlled instruction variation and parameter instantiation. The resulting supervised dataset contains 6003 unique records.

Each record follows an Instruction-Input-Output format. The Instruction field specifies the natural-language engineering intent, the Input field stores a structured JSON parameter dictionary, and the Output field contains the exact SAFRI Python API sequence expected for execution. The corpus is dominated by component-generation and parameter-modification tasks. To preserve engineering realism, physical parameters were sampled from ranges consistent with nominal PWR operating conditions. The overall composition of the dataset is summarized in Table 2.

The 6003-record supervised corpus was curated as a derived engineering instruction-response dataset rather than as a raw export of production plant models. Seed records were first written or manually checked by domain personnel against SAFRI component semantics, argument conventions, and connection rules. Controlled variation was then applied to diversify component names, parameter combinations, and operation sequences without altering the intended engineering meaning of each sample. Before training, all samples were screened through parser validation, malformed-code rejection, parameter-range checks, near-duplicate template removal, and instruction-output consistency review. To reduce leakage, cases used in the 50-case in-house evaluation set were excluded at the case-template level rather than only at the prompt-string level. Accordingly, the SFT corpus teaches component- and operation-level SAFRI syntax, whereas the evaluation set tests a distinct generalization task: multi-component assembly that must remain executable, topologically complete, and physically convergent under solver feedback.

To make the curation protocol concrete and auditable, the following additional information is reported. (i) Two domain experts independently scored a random 200-record subset on a four-axis rubric (syntax legality, parameter physical plausibility, dependency order, intent fidelity); inter-expert agreement, measured by Cohen’s kappa, indicated substantial reliability. Disagreements were resolved by a third reviewer before the record was accepted. (ii) Acceptance criteria for an SFT record required all of: executable on a clean SAFRI sandbox, all required arguments present, parameter values within the prescribed PWR-nominal sampling intervals, dependencies declared, and instruction text consistent with the produced code. Records failing any criterion were discarded or repaired. (iii) Representative example records covering creation and modification tasks for pipe, pump, boundary, and connection objects are reported below. (iv) Parameter sampling ranges follow the PWR-nominal envelope: pressure [0.10, 17.5] MPa, temperature [285, 595] K, mass-flow [0.5, 21,000] kg/s, hydraulic diameter [0.005, 0.80] m, length [0.05, 30.0] m. (v) To guard against near-template leakage, a Jaccard token-similarity check (threshold 0.85 over normalized API tokens) was applied across the SFT corpus and against every evaluation-case template, and any record exceeding the threshold against an evaluation-case template was removed from training.
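The near-template leakage check in item (v) can be sketched as follows. The 0.85 threshold is taken from the text; the token normalization is an assumption, since the exact procedure is not specified. Here a script is reduced to its set of identifier-like tokens, and the helper names `api_tokens`, `jaccard`, and `filter_leaky` are hypothetical.

```python
import re

# Assumed normalization: keep identifier-like tokens only.
TOKEN = re.compile(r"[A-Za-z_]\w*")


def api_tokens(code):
    """Reduce a script to its set of identifier-like tokens."""
    return set(TOKEN.findall(code))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_leaky(records, eval_templates, threshold=0.85):
    """Keep records whose similarity to every evaluation template
    does not exceed the threshold."""
    templates = [api_tokens(t) for t in eval_templates]
    return [
        record for record in records
        if all(jaccard(api_tokens(record), t) <= threshold for t in templates)
    ]
```

Under this normalization, two scripts that differ only in numeric arguments have a similarity of 1.0 and are treated as the same template, which is the behavior a case-template-level exclusion requires.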

Representative instruction-input-output records are provided below to illustrate the structure of the curated corpus and the corresponding executable SAFRI Python API sequences used for supervised learning and agent execution. The functions shown here are part of the platform wrapper interface; their signatures and brief docstrings are listed in Appendix B.

Example 1—Pipe component creation (success case).

Instruction: (“Generate code for a single pipe component”).

Input (JSON): {"entityName": "pipe_7", "nvols": 1, "x": -10.00, "y": 20.00, "length": 5.0, "area": 1.0}

Output (SAFRI Python API): pipe_7 = addPipe(); setPipeVolNumber(pipe_7, 1); setEntityPosition(pipe_7, -10.00, 20.00); setPipeVolLength(pipe_7, 0, 5.0); setPipeVolArea(pipe_7, 0, 1.0).

Example 2—Two-pipe assembly with hydraulic connection (success case).

Instruction: (“Generate two serially connected pipes and the linking connection”).

Input (JSON): {"pipes": [{"name": "p1", "nvols": 3, "x": 0, "y": 0}, {"name": "p2", "nvols": 3, "x": 10, "y": 0}], "connection": {"from": "p1", "fromCell": 2, "fromFace": 2, "to": "p2", "toCell": 0, "toFace": 1}}

Output (SAFRI Python API): p1 = addPipe(); setPipeVolNumber(p1, 3); setEntityPosition(p1, 0.0, 0.0); p2 = addPipe(); setPipeVolNumber(p2, 3); setEntityPosition(p2, 10.0, 0.0); createConnection(p1, 2, 2, p2, 0, 1).

Example 3—Pipe with heat structure (representative failure/repair case used to train the self-healing policy).

Instruction: (“Attach a left-surface heat structure to the pipe”).

Input (JSON): {"entityName": "pipe_3", "nvols": 2, "heatSurface": 1, "attachCells": [1, 2]}

Output (initial draft, fails): pipe_3 = addPipe(); setPipeVolNumber(pipe_3, 2); he = createHeatstructure(1); addHeatstructureCell(he, 1, pipe_3, 0) # rejected by SAFRI: cellNumber starts from 1, not 0.

Output (after one self-healing step): pipe_3 = addPipe(); setPipeVolNumber(pipe_3, 2); he = createHeatstructure(1); addHeatstructureCell(he, 1, pipe_3, 1); addHeatstructureCell(he, 1, pipe_3, 2) # accepted.
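The execute-and-repair pattern of Example 3 can be sketched as a bounded retry loop. This is a simplified illustration rather than the trained self-healing policy: `execute` and `repair` are hypothetical callables, and the two mock functions only imitate the 1-based cell-index rule quoted in the rejection message above. In the actual framework the repair step is produced by the language-model policy, not by string replacement.

```python
def run_with_self_healing(draft, execute, repair, max_iters=3):
    """Execute `draft`; on failure request a revised draft and retry.

    `execute(code)` returns (ok, message) and `repair(code, message)` returns
    revised code; both are hypothetical callables supplied by the caller.
    Returns (final_code, corrective_iterations, ok).
    """
    code = draft
    for iteration in range(max_iters + 1):
        ok, message = execute(code)
        if ok:
            return code, iteration, True
        if iteration < max_iters:
            code = repair(code, message)
    return code, max_iters, False


def mock_execute(code):
    """Toy checker imitating the 1-based cell rule from Example 3."""
    if "pipe_3, 0)" in code:
        return False, "cellNumber starts from 1, not 0"
    return True, "accepted"


def mock_repair(code, message):
    """Toy repair: move the offending cell index from 0 to 1."""
    return code.replace("pipe_3, 0)", "pipe_3, 1)")
```

The iteration count returned by the loop corresponds to the number of corrective steps, the quantity that the abstract averages over execution-time failures.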

Prompt template used during Plan-and-Act execution.

All component-generation prompts are wrapped in the following standardized template (verbatim, used at both SFT and GRPO stages so that training and deployment distributions match):

# User Requirement

{user_requirement}

# Plan Status

## Finished Tasks

### code

```python

{code_written}

```

### execution result

{task_results}

## Current Task

{current_task}

### component information

{component_information}

# Constraints

- Take on Current Task if it is in Plan Status, otherwise, tackle User Requirement directly.

- Ensure the output new code is executable in the same Jupyter notebook as the previous executed code.

# Output

While some concise thoughts are helpful, code is absolutely required. Always output one and only one code block in your response. Output code in the following format:

```python

your code

```

The five placeholders {user_requirement}, {code_written}, {task_results}, {current_task}, and {component_information} are filled at run time by the Plan-and-Act controller (Section 4.4) using, respectively, the parsed design intent, the executed-so-far script, the SAFRI execution log, the next planned sub-task, and the retrieval-augmented parameter block returned by the RAG module.
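A minimal sketch of how such a template might be filled at run time is given below. It assumes plain Python string formatting and an abridged version of the template (the constraints and output sections, and the inner code fence, are omitted for brevity); the function name `build_prompt` and the JSON serialization of the component block are illustrative assumptions, not the controller's documented behavior.

```python
import json

# Abridged form of the template above; section headings are kept verbatim.
PROMPT_TEMPLATE = """# User Requirement
{user_requirement}

# Plan Status
## Finished Tasks
### code
{code_written}

### execution result
{task_results}

## Current Task
{current_task}

### component information
{component_information}
"""


def build_prompt(user_requirement, code_written, task_results,
                 current_task, component_information):
    """Fill the placeholders; the component block is serialized as JSON."""
    return PROMPT_TEMPLATE.format(
        user_requirement=user_requirement,
        code_written=code_written,
        task_results=task_results,
        current_task=current_task,
        component_information=json.dumps(component_information, indent=2),
    )
```

Only the template string is parsed for placeholders, so braces inside the substituted values (for example in the JSON parameter block) do not need escaping.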

4.3. Supervised Fine-Tuning with LoRA

Qwen3-8B [6] was selected as the foundation model because it provides strong baseline code-generation capability while remaining deployable on institutional hardware. To avoid the cost of full-parameter optimization, Low-Rank Adaptation (LoRA) [15] was adopted as the supervised adaptation method.

LoRA freezes the pre-trained backbone weights and injects trainable rank-decomposition matrices into the Transformer’s attention projection layers. In the present implementation, the LoRA configuration used rank r = 8, alpha = 16, dropout = 0.05, batch size = 16, and learning rate 2 × 10⁻⁴. Training stabilized after approximately 150 optimization steps. The learned adapters were then merged into the base model and deployed through vLLM for high-throughput inference, providing the syntax-competent initialization used by the subsequent RL stage. The corresponding optimization trajectory is reported in Figure 4.
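To give a sense of scale for the reported configuration, the sketch below counts the trainable parameters that one LoRA pair adds to a single square projection matrix and computes the scaling factor alpha / r used in the standard LoRA formulation. The rank and alpha are the values stated above; the hidden width of 4096 is an assumption made for illustration only, and the helper names are hypothetical.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha, rank):
    """Scale applied to the low-rank update B @ A in the standard formulation."""
    return alpha / rank


RANK, ALPHA = 8, 16   # values reported in the text
D_MODEL = 4096        # assumed hidden width, for illustration only

full = D_MODEL * D_MODEL
added = lora_param_count(D_MODEL, D_MODEL, RANK)
print(f"scaling factor: {lora_scaling(ALPHA, RANK)}")
print(f"trainable fraction per projection: {added / full:.4%}")
```

Under these assumptions each adapted projection trains well under one percent of the weights it modifies, which is why the adaptation remains feasible on institutional hardware.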

4.4. Agent Workflow: Planning, Retrieval, and Execution

Comprehensive system-level design documents often exceed 10,000 lines of JSON; processing them within a single LLM prompt overruns effective context limits and degrades attention allocation [9]. To address this, a Plan-and-Act workflow was implemented, drawing on ReAct-style reasoning and multi-agent software orchestration ideas such as MetaGPT [11,21].

Plan: the agent receives the global instruction, parses the design-document structure, and generates a hierarchical task list with explicit identifiers and dependency annotations. Tasks that create components therefore precede tasks that modify them, and connection operations are constrained to occur only after the relevant ports exist.
Retrieve: using Retrieval-Augmented Generation (RAG) [7], an embedding-based retriever isolates only the physical parameters and local structural context relevant to the current subtask. This step compresses the effective prompt context and reduces interference from irrelevant portions of the full design document.
Act: the SFT-initialized policy generates the corresponding API sequence. That code is transmitted via TCP to the SAFRI client, executed immediately, and the resulting platform state and error logs are returned to the agent memory for subsequent planning or repair.
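The dependency-aware ordering produced in the Plan step can be illustrated with a topological sort over the task list. This is a sketch under assumptions: the task dictionary layout (`"action"`, `"depends_on"`) and the function name `order_tasks` are hypothetical, and the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) is used in place of whatever scheduler the agent actually employs.

```python
from graphlib import TopologicalSorter


def order_tasks(tasks):
    """Return task ids in an order that respects `depends_on` annotations.

    `tasks` maps a task id to {"action": str, "depends_on": [task ids]}.
    graphlib.CycleError is raised if the plan has circular dependencies.
    """
    graph = {task_id: task["depends_on"] for task_id, task in tasks.items()}
    return list(TopologicalSorter(graph).static_order())


plan = {
    "T3": {"action": "connect p1 to p2", "depends_on": ["T1", "T2"]},
    "T1": {"action": "create pipe p1", "depends_on": []},
    "T2": {"action": "create pipe p2", "depends_on": []},
    "T4": {"action": "set p1 geometry", "depends_on": ["T1"]},
}
order = order_tasks(plan)
```

Any order returned this way places creation tasks before the modification and connection tasks that depend on them, regardless of the order in which the planner listed them.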

Representative implementation examples are shown in Figure 5 and Figure 6. Figure 5 contrasts an error-prone pump-generation script with its corrected API-compliant counterpart, while Figure 6 presents a code-driven system-construction case in which the graphical topology and executable script are displayed together.

4.5. GRPO-Based Policy Optimization

Although the SFT agent is highly reliable at the level of local API syntax, system-level loop assembly still exposes failure modes associated with global topology and physical convergence. To move the policy beyond syntax imitation toward solver-level validity, GRPO is adopted as the reinforcement-learning optimizer [18,19].

Unlike PPO [17], which introduces an explicit value network to estimate a baseline, GRPO computes relative advantages within a group of candidate responses sampled for the same modeling prompt. For a prompt q, the current policy samples a trajectory group; in the present implementation, the group size is G = 8. Each trajectory contains a complete sequence of planning, retrieval, code generation, execution, and, when necessary, corrective actions. The discounted return of the i-th trajectory is

R_{i} = \sum_{t = 1}^{T} γ^{t - 1} r_{t},

where r_{t} is the step-level reward, T is the trajectory length, and γ = 0.95 is the discount factor defined in Section 3.2. After all sampled trajectories have been executed, the environment returns one scalar value for each trajectory according to the weighted multi-objective design introduced in Section 4.6. The group-normalized advantage is then computed from within-group reward statistics, so that policy updates are less sensitive to task-specific reward scale and more sensitive to relative quality among competing assembly strategies for the same case.

A_{i} = (R_{i} - μ_{G}) / (σ_{G} + ε),

where μ_{G} = \frac{1}{G} \sum_{j = 1}^{G} R_{j} and σ_{G} = \sqrt{\frac{1}{G} \sum_{j = 1}^{G} (R_{j} - μ_{G})^{2}} are the within-group mean and standard deviation of the returns, and ε is a small positive constant that prevents division by zero. In practice, this normalization enables a more stable comparison among alternative model-construction strategies that target the same design case.
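The discounted return and the group-normalized advantage defined above can be sketched in a few lines of Python. This is an illustrative sketch only: the value of `eps` is an assumption (the text does not report ε), and the function names are hypothetical.

```python
import math

GAMMA = 0.95  # discount factor stated in the text

def discounted_return(step_rewards, gamma=GAMMA):
    """R_i = sum_t gamma^(t-1) * r_t, with t starting at 1."""
    # enumerate() starts at 0, which matches the exponent t - 1.
    return sum(gamma ** t * r for t, r in enumerate(step_rewards))

def group_advantages(returns, eps=1e-8):
    """A_i = (R_i - mu_G) / (sigma_G + eps), using the 1/G population variance."""
    g = len(returns)
    mu = sum(returns) / g
    sigma = math.sqrt(sum((r - mu) ** 2 for r in returns) / g)
    return [(r - mu) / (sigma + eps) for r in returns]
```

When every trajectory in a group earns the same return, all advantages are zero, so such a group contributes no policy-gradient signal; this is why within-group diversity of outcomes matters for the update.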

The policy is updated through a clipped importance-ratio objective analogous to PPO, except that the normalized group advantage replaces value-model-based advantage estimation. The clipped term suppresses destabilizing policy jumps, whereas the KL penalty keeps the optimized policy close to the syntax-competent SFT initialization. This balance is important because the RL stage is intended to improve system-level validity without eroding the platform-specific API behavior already acquired during supervised training. From a theoretical standpoint, the GRPO update can be interpreted as an unbiased estimator of a group-baselined policy gradient: replacing the learned value function with a within-group sample mean does not change the gradient direction in expectation, but it reduces variance whenever responses to the same prompt have correlated returns. In the SAFRI setting this correlation is structural, since trajectories sampled for the same case share a common dependency graph; hence the group baseline naturally suppresses reward variance that is not informative about the policy improvement direction. This perspective also clarifies why GRPO benefits the most from sparse, terminal solver feedback, which is precisely the regime encountered in physics-grounded engineering automation.

L_{GRPO}(θ) = \frac{1}{G} \sum_{i = 1}^{G} \min [r_{i}(θ) A_{i}, \mathrm{clip}(r_{i}(θ), 1 - ε_{c}, 1 + ε_{c}) A_{i}] - β D_{KL}(π_{θ} ‖ π_{ref}),

with

r_{i}(θ) = \frac{π_{θ}(τ_{i} ∣ q)}{π_{θ_{old}}(τ_{i} ∣ q)},

where ε_{c} is the clipping threshold, β is the KL-regularization coefficient, and π_{ref} denotes the SFT reference policy.
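To make the role of the clipping term concrete, the objective can be evaluated numerically for one group from trajectory-level log-probabilities. The sketch below is not the training code of this study: the KL term is passed in as a precomputed scalar, and the values `eps_c = 0.2` and `beta = 0.04` are placeholders, since the text does not state them here.

```python
import math

def grpo_objective(logp_new, logp_old, advantages, kl_to_ref, eps_c, beta):
    """Evaluate L_GRPO for one group from trajectory-level log-probabilities."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                     # r_i(theta)
        clipped = min(max(ratio, 1.0 - eps_c), 1.0 + eps_c)   # clip(r_i, 1-eps, 1+eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms) - beta * kl_to_ref

# Placeholder hyperparameters; identical old and new policies give ratio = 1.
obj = grpo_objective(
    logp_new=[-10.0, -12.0], logp_old=[-10.0, -12.0],
    advantages=[1.0, -1.0], kl_to_ref=0.0, eps_c=0.2, beta=0.04,
)
```

With a positive advantage, a ratio above 1 + ε_c earns no additional objective value, which removes the incentive for large jumps away from the sampling policy; with a negative advantage, the unclipped (more pessimistic) term is retained.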

GRPO is particularly suitable for the present task for two practical reasons. First, the reward signal is sparse and partly terminal, because full physical validity is observed only after a sequence of API calls has been executed and the solver has attempted initialization. Second, multiple candidate trajectories for the same case may differ only in sequencing, port orientation, or initialization strategy, so within-group comparison is more informative than absolute reward magnitude alone. In implementation, each rollout terminates when the system reaches physical convergence, exceeds the maximum allowed correction cycles, or encounters an unrecoverable execution failure. Intermediate rewards are assigned after schema checking, dependency validation, and topological verification, whereas the physics reward is finalized only after solver execution. Under the current implementation, removing the value network reduces memory consumption by approximately 40% relative to PPO, which is a meaningful deployment advantage for 8B-class models. The 50-case evaluation set used throughout Section 5 and Section 6 is summarized in Table 3. The controlled profiling protocol and the per-configuration peak-memory comparison between PPO and GRPO are reported in Section 5.2 and Table 4. The training dynamics of total reward and policy entropy under GRPO are illustrated in Figure 7.

4.6. Four-Dimensional Reward Design

The reinforcement-learning signal is defined through a weighted four-dimensional reward function. During roll-out, the step-level reward contributes to the discounted return defined in Section 4.5. The schema, topology, and sequence channels become available immediately after execution checks, whereas the physics channel is finalized only after solver initialization. To make the reward operational rather than merely conceptual, each channel is implemented as a normalized score in the interval [0, 1].

R_{total} = ω_{1} R_{schema} + ω_{2} R_{topo} + ω_{3} R_{phys} + ω_{4} R_{seq},

where ω_{1} = 0.20, ω_{2} = 0.30, ω_{3} = 0.35, and ω_{4} = 0.15. The four channels are defined as follows:

Schema Reward: this channel verifies that generated parameters lie within physically admissible bounds, such as pressure between 0.1 and 20 MPa, and that required argument fields are not omitted. Missing-field, out-of-range, and datatype violations are explicitly penalized within the normalized score.

R_{schema} = \max (0, 1 - (N_{missing} + N_{range} + N_{type}) / \max (1, N_{schema})),

where N_{missing}, N_{range}, and N_{type} denote missing-field, out-of-range, and datatype violations, respectively.
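As an illustration of how the schema channel might be scored, the sketch below checks a parameter dictionary against a small schema. The field names and the flow-rate bound are hypothetical, only the 0.1 to 20 MPa pressure range is taken from the text, and treating N_schema as one check per schema field is an assumption.

```python
def schema_reward(params, schema):
    """R_schema = max(0, 1 - (N_missing + N_range + N_type) / max(1, N_schema))."""
    n_missing = n_range = n_type = 0
    for name, (expected_type, low, high) in schema.items():
        if name not in params:
            n_missing += 1                      # required field omitted
        elif not isinstance(params[name], expected_type):
            n_type += 1                         # wrong datatype
        elif not (low <= params[name] <= high):
            n_range += 1                        # outside admissible bounds
    n_schema = len(schema)  # assumption: one check per schema field
    return max(0.0, 1.0 - (n_missing + n_range + n_type) / max(1, n_schema))

# Hypothetical schema: field -> (type, lower bound, upper bound)
PUMP_SCHEMA = {
    "pressure_mpa": (float, 0.1, 20.0),       # bound quoted in the text
    "rated_flow_kg_s": (float, 0.0, 1.0e5),   # illustrative bound only
}
```

Each field contributes at most one violation in this sketch, so the score degrades linearly from 1 (all fields valid) to 0 (every field missing, mistyped, or out of range).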

Topology Reward: this channel evaluates whether the generated network matches the intended graph connectivity by penalizing isolated nodes, duplicated links, unresolved ports, and reversed flow directions. Its normalized form is computed from the number of topology faults detected for the current case.

R_{topo} = \max (0, 1 - (N_{iso} + N_{dup} + N_{open} + N_{rev}) / \max (1, N_{topo})),

where the numerator counts the detected topology faults (isolated nodes, duplicated links, unresolved ports, and reversed flow directions, respectively) and N_{topo} is the number of topology checks for the current case.

Physics Reward: this is the decisive terminal reward that interfaces directly with the PANTHER solver. A score of 1 is assigned when the model reaches full steady-state convergence, an intermediate score is assigned when initialization succeeds but convergence criteria are only partially satisfied, and a score of 0 is assigned when the solver fails before meaningful physical evaluation.

R_{phys} = \begin{cases} 1, & \text{if the model reaches full steady-state convergence;} \\ η_{init}, & \text{if initialization succeeds but convergence criteria are not fully met;} \\ 0, & \text{if the solver fails before meaningful physical evaluation,} \end{cases}

where 0 < η_{init} < 1 denotes partial credit for physically plausible but non-converged models.

Sequence Reward: this channel incentivizes adherence to API logic, ensuring that components are created before they are modified and that dependent objects or boundaries are defined before connection steps are attempted. Ordering and dependency violations are penalized through a normalized count-based score.

R_{seq} = \max (0, 1 - N_{order} / \max (1, N_{seq})),

where N_{order} counts ordering/dependency violations and N_{seq} is the number of sequencing checks in the current trajectory.
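The four channel definitions combine into the total reward as sketched below. The weights are those stated in Section 4.6; the value `eta_init = 0.5` is a placeholder (the text only requires 0 < η_init < 1), and the status labels and function names are hypothetical.

```python
WEIGHTS = {"schema": 0.20, "topo": 0.30, "phys": 0.35, "seq": 0.15}

def count_reward(n_faults, n_checks):
    """Shared form of the count-based channels: max(0, 1 - faults / max(1, checks))."""
    return max(0.0, 1.0 - n_faults / max(1, n_checks))

def physics_reward(status, eta_init=0.5):
    """Piecewise R_phys; eta_init is a placeholder for the partial-credit value."""
    return {"converged": 1.0, "initialized": eta_init, "failed": 0.0}[status]

def total_reward(r_schema, r_topo, r_phys, r_seq, w=WEIGHTS):
    """R_total as the weighted sum of the four normalized channels."""
    return (w["schema"] * r_schema + w["topo"] * r_topo
            + w["phys"] * r_phys + w["seq"] * r_seq)

# Example: one topology fault in ten checks, solver initialized but not converged.
r_total = total_reward(
    r_schema=1.0,
    r_topo=count_reward(1, 10),
    r_phys=physics_reward("initialized"),
    r_seq=count_reward(0, 5),
)
```

Under these placeholder inputs the physics channel accounts for most of the lost reward, which reflects its larger weight relative to the other three channels.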

This decomposition separates error classes that may appear similar in text form but differ substantially in engineering consequence. A misplaced argument and an incorrect pump-boundary ordering can both invalidate a script, yet they demand different corrective behavior from the policy. The multi-channel reward therefore provides denser supervision than a single terminal convergence label and supports credit assignment across long action sequences. It also explains why the TSR and PCR do not evolve identically during training: topology faults are often recoverable through reconnection actions, whereas physics faults may become visible only after full network assembly and initialization. The evolution of the four reward components and success rates across case difficulty during GRPO training is shown in Figure 8.

The four reward weights follow an engineering-failure prior: among observed SFT-baseline failures, more than half were attributable to topology and physics inconsistencies (graph connectivity, port direction, boundary coupling, initialization instability), whereas schema and sequence issues were already largely suppressed by SFT. The default weights (ω_{1} = 0.20 schema, ω_{2} = 0.30 topology, ω_{3} = 0.35 physics, ω_{4} = 0.15 sequence) therefore allocate more credit to the channels driving the residual failure modes. A one-dimensional sensitivity study was conducted in which each weight was perturbed by approximately ±50% (absolute step 0.10) with the remaining weights renormalized. The PCR varies within ±2.6 percentage points across all perturbations, indicating that the framework is robust to the exact weight values within reasonable ranges.
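One way to realize a single perturbation of the sensitivity study is sketched below. The text states only that the remaining weights are renormalized; the proportional rescaling used here is an assumption, and the function name is hypothetical.

```python
DEFAULT_WEIGHTS = {"schema": 0.20, "topo": 0.30, "phys": 0.35, "seq": 0.15}

def perturb_weight(weights, key, delta):
    """Shift one weight by delta and rescale the others so the total stays 1."""
    new_value = weights[key] + delta
    rest_total = sum(v for k, v in weights.items() if k != key)
    scale = (1.0 - new_value) / rest_total   # proportional rescaling (assumed)
    return {k: (new_value if k == key else v * scale) for k, v in weights.items()}

# Raise the physics weight by the absolute step of 0.10 used in the study.
perturbed = perturb_weight(DEFAULT_WEIGHTS, "phys", +0.10)
```

Proportional rescaling preserves the ratios among the untouched weights, so each perturbation changes only the balance between the selected channel and the rest.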

4.7. Error Self-Healing Mechanism

Multi-component model construction is rarely successful in a single pass, especially when the first failure becomes visible only after execution. For that reason, the agent includes an autonomous self-healing mechanism within the action loop rather than treating debugging as an external post-processing stage. When SAFRI returns an execution exception or a non-convergence log, the environment packages the failed API sequence, the error traceback, and the intended structural context into a repair prompt. The same policy then proposes a corrective action, typically involving parameter revision, action reordering, port remapping, or local reconnection. Each trajectory is allowed up to three corrective attempts. From the reinforcement-learning perspective, these repair steps remain part of the same trajectory, so successful recovery is reinforced directly.

To make the self-healing mechanism learnable rather than purely heuristic, each repair episode is represented as a continuation of the original trajectory under an augmented state description. After an execution failure, the next state contains the current partial system graph, the most recent API sequence, the structured error message returned by SAFRI/PANTHER, and the unresolved task specification for the failed subgraph. The policy therefore conditions not only on the original design intent but also on explicit evidence of how the previous action violated schema, dependency, topology, or solver constraints. This design converts debugging from an external engineering intervention into a state transition compatible with the MDP formalization in Section 3.2.

At the action level, the repair policy does not regenerate the full model from scratch unless local recovery becomes impossible. Instead, it selects a targeted correction operator over the active subgraph, including parameter revision, component reordering, port remapping, junction-direction adjustment, boundary replacement, or local rollback followed by re-execution. This locality is important in practice because many failures are graph-local rather than case-global: an incorrect branch connection or inconsistent boundary assignment can often be repaired by editing only a small neighborhood of the assembled network.

a_{t}^{repair} \sim π_{θ} (a ∣ q, G_{t}, e_{t}, h_{t}), a \in A,

where q is the case instruction, G_{t} is the current partial system graph, e_{t} is the structured error message returned by SAFRI/PANTHER, and h_{t} is the trajectory history. By constraining repair to the failure-relevant subgraph, the agent reduces unnecessary action length and preserves valid upstream construction steps.

The reward design introduced in Section 4.6 is applied across both the original and repaired segments of the same rollout. Early repair actions may recover schema or topology validity before physical convergence can be re-evaluated, so credit assignment is delayed but remains trajectory-consistent. In implementation, a rollout terminates when one of three conditions is met: the corrected model reaches steady-state convergence, the maximum number of repair cycles is exhausted, or an unrecoverable failure indicates that the current exploration branch should be abandoned. This termination logic distinguishes between repairable and non-repairable errors and helps GRPO favor action sequences that preserve recoverability even when the first attempt is imperfect.
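The three termination conditions can be expressed as a bounded repair loop. The sketch below is illustrative only: `execute` and `repair` are hypothetical callables standing in for the SAFRI client and the policy, and `ExecResult` is an invented container, not part of any real API.

```python
from dataclasses import dataclass

MAX_REPAIRS = 3  # "up to three corrective attempts" per trajectory

@dataclass
class ExecResult:
    converged: bool
    recoverable: bool = True
    error: str = ""

def run_with_self_healing(script, execute, repair, max_repairs=MAX_REPAIRS):
    """Execute a script, requesting a corrected script after each failure.

    Returns (status, number_of_repairs_used).
    """
    repairs = 0
    while True:
        result = execute(script)
        if result.converged:
            return "converged", repairs
        if not result.recoverable:
            return "unrecoverable", repairs
        if repairs >= max_repairs:
            return "budget_exhausted", repairs
        script = repair(script, result.error)
        repairs += 1

# Toy stand-ins: the "solver" converges once the script has been patched twice.
def fake_execute(script):
    return ExecResult(converged=script.count("patch") >= 2, error="non-convergence")

def fake_repair(script, error):
    return script + " patch"

status, used = run_with_self_healing("build_loop", fake_execute, fake_repair)
```

Because the loop returns the number of repairs used, the same trajectory record can carry both the terminal outcome and the repair cost, which is what allows successful recovery to be reinforced within a single rollout.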

This trajectory-level treatment explains why the self-healing loop contributes more than superficial robustness. The policy is not rewarded merely for issuing another response after failure; it is rewarded for transforming execution feedback into graph edits that move the model toward solver admissibility. Environment error messages, local graph correction, and final PCR improvement therefore form a single methodological chain rather than three independent stages. A full quantitative breakdown of the self-healing mechanism, including triggering rate, mean repair attempts per triggered case, dominant failure classes intercepted, and per-difficulty net PCR contribution, is provided in Appendix A.

5. Experimental Setup

5.1. Benchmark Cases and Evaluation Protocol

The proposed framework was evaluated on an evaluation set of 50 nuclear modeling cases designed to stress different layers of the assembly problem. The evaluation set spans isolated component checks, short serial assemblies, branch-containing loop fragments, and integrated Pressurized Water Reactor (PWR) subsystem configurations involving pumps, valves, branches, boundary conditions, steam-generator-related structures, and pressurizer-linked connections. To preserve evaluation credibility, the evaluation set was separated from both the SFT corpus and the RL roll-out distribution at the case-template level rather than only at the prompt-text level. This choice reduces the possibility that high system-level performance is driven by memorized assembly blueprints instead of genuine policy generalization over component ordering, graph closure, and physical initialization constraints.

The evaluation protocol distinguishes two levels of difficulty and two corresponding output spaces. The first level is component-level code generation, where the model must translate a structured instruction into a valid SAFRI API fragment with correct object type, argument set, and parameter assignment. The second level is system-level model assembly, where the agent must generate an ordered sequence of planning, retrieval, coding, execution, and repair actions that produce a connected and solver-admissible T-H network. This separation is methodologically important because exact local code correctness is necessary but not sufficient for global model validity: a script can be free of API errors while still violating hidden port dependencies, flow-direction assumptions, or boundary-condition consistency at run time.

For consistency across the manuscript, the 50 system-level cases are partitioned into 15 simple, 20 medium-difficulty, and 15 complex cases, as summarized in Table 3. The simple subset contains isolated components and short serial chains; the medium subset emphasizes branch-containing loop segments and active-passive couplings; and the complex subset contains integrated PWR subsystem configurations. This stratification is used consistently for aggregate reporting, checkpoint comparison, and the difficulty-resolved interpretation reported later in Section 6.3.

Although the evaluation-set size remains modest, it was intentionally designed to expose failure modes that are operationally meaningful in engineering practice, including dependency violations, topological discontinuity, and solver non-convergence. The fixed simple/medium/complex partition is used consistently in the later stratified discussion so that every reported result can be traced back to a stable case distribution. The evaluation set is not claimed to be exhaustive; rather, it serves as a controlled stress test for whether the agent can move from local code correctness to solver-level model validity.

5.2. Implementation Details

Memory-efficiency measurements were obtained through controlled profiling on a single NVIDIA A100 80 GB GPU using PyTorch built-in CUDA peak-memory counters (torch.cuda.max_memory_allocated). The runtime stack was PyTorch 2.3.0 with CUDA 12.1, bf16 mixed precision, FlashAttention-2 as the attention backend, and gradient checkpointing enabled for all transformer blocks; DeepSpeed ZeRO-2 was disabled and gradient accumulation was fixed to one. Optimization used AdamW with the fused kernel, a dataloader prefetch of two, and Python garbage collection disabled during measurement windows. The actor policy was Qwen3-8B with LoRA adapters (rank r = 8), held identical across PPO and GRPO; only the auxiliary component differed, namely a separate value head for PPO versus the group-baseline estimator for GRPO. The effective update batch comprised 4 trajectories × 8 candidates (32 samples per update) with a 1024-token prompt length and a 1024-token maximum response length. Each peak-memory value reported in Table 4 is the mean of five consecutive optimization steps after a 20-step warmup, with a standard deviation below 0.4 GB in every condition. Across the three matched operating points, GRPO reduces peak allocated memory by 47.5%, 38.6%, and 35.4% relative to PPO, which averages to the approximately 40% reduction quoted in Section 4.5.
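As an arithmetic check on the quoted average, using only the three per-configuration reductions stated above:

```latex
\frac{47.5\% + 38.6\% + 35.4\%}{3} = \frac{121.5\%}{3} = 40.5\% \approx 40\%
```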

Experiments were conducted on a private GPU cluster equipped with NVIDIA A100 accelerators. The supervised stage used Qwen3-8B as the base model and LoRA adapters with rank r = 8, alpha = 16, dropout = 0.05, batch size = 16, and learning rate 2 × 10⁻⁴; convergence stabilized after approximately 150 optimization steps and the resulting adapters were merged into the base model for vLLM inference. The GRPO stage used eight A100 GPUs and 300 policy-optimization steps, with a group size of G = 8 sampled trajectories per prompt. During roll-out, each trajectory contained the full Plan-and-Act loop, including retrieval, code generation, SAFRI execution, and corrective retries when execution failed. The PANTHER/SAFRI execution client ran on a dedicated Windows workstation and communicated with the Linux-hosted LLM inference service through a TCP interface. On the solver side, the steady-state calculation followed a RELAP5-style semi-implicit formulation of the coupled mass, momentum, and energy equations, and the linearized matrix system at each iteration was solved with a direct solver. This numerical configuration underlies the PCR definition reported in Section 5.3. The complete set of GRPO hyperparameters (rollout, optimization, KL control, reward weights, decoding, environment, and resource settings; 30 entries in total) is reported in Appendix A.

5.3. Evaluation Metrics

We assess modeling performance using four targeted metrics:

Code Accuracy: the fraction of component-level tasks for which the generated SAFRI script fragment is directly executable, contains no hallucinated variables or unsupported arguments, and satisfies the expected component-specific parameter schema.
Syntax Success Rate (SSR): the percentage of system-level tasks whose generated scripts execute through the SAFRI Python interface without raising unrecovered API, parser, or runtime exceptions during model construction.
Topology Success Rate (TSR): the percentage of generated models that form a complete and admissible network graph, with intended ports connected, no isolated nodes, no unresolved open boundaries, and no duplicated or contradictory links that invalidate connectivity.
Physical Convergence Rate (PCR): the fraction of generated models that not only pass syntax and topology checks but also initialize the PANTHER solver successfully and satisfy the steady-state convergence test used in this study. Following RELAP5-style steady-state practice, the coupled mass, momentum, and energy equations are advanced with a semi-implicit global formulation, and the linearized matrix system at each iteration is solved directly. A case is counted as PCR-successful only when (i) initialization completes without a fatal solver error, (ii) the residuals of the mass and momentum equations are both ≤1 × 10⁻⁵ and the energy residual is ≤1 × 10⁻⁴, (iii) global mass and energy imbalances are each below 0.5%, and (iv) all criteria are reached within 5000 iterations.
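The solver-side criteria (i) to (iv) of the PCR definition can be written as a single predicate. This is a sketch under stated assumptions: the `SolverReport` fields are hypothetical names for quantities the solver is assumed to expose, and the syntax and topology prerequisites are checked elsewhere.

```python
from dataclasses import dataclass

@dataclass
class SolverReport:
    initialized: bool            # (i) no fatal error during initialization
    mass_residual: float
    momentum_residual: float
    energy_residual: float
    mass_imbalance_pct: float    # global imbalance, in percent
    energy_imbalance_pct: float  # global imbalance, in percent
    iterations: int

def meets_convergence_criteria(r: SolverReport) -> bool:
    """Apply thresholds (i)-(iv) quoted in the PCR definition."""
    return (
        r.initialized
        and r.mass_residual <= 1e-5
        and r.momentum_residual <= 1e-5
        and r.energy_residual <= 1e-4
        and r.mass_imbalance_pct < 0.5
        and r.energy_imbalance_pct < 0.5
        and r.iterations <= 5000
    )
```

A report that satisfies every residual threshold but needs more than 5000 iterations is still rejected, which matches the statement that stalled cases are not counted as physically converged.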

These metrics are hierarchical rather than interchangeable. Code Accuracy is a local generation metric computed on isolated tasks; it should not be read as a direct proxy for system-level performance. The SSR requires executable syntax, the TSR additionally requires a connected and directionally coherent network, and the PCR is counted only when the solver meets the above numerical criteria under the current evaluation protocol; cases that execute but stall, diverge, or exceed the iteration limit are not counted as physically converged. Accordingly, increases in the PCR are more meaningful than increases in Code Accuracy alone, because the PCR captures the combined effect of syntax, topology, parameter compatibility, execution sequence, and numerical stability.
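The nesting of the three system-level metrics can be made explicit in a short aggregation routine. The per-case record layout below is hypothetical and the four records are toy data, not results from this study.

```python
def hierarchical_rates(cases):
    """Compute SSR, TSR, and PCR (in percent) with each metric gated by the previous one."""
    n = len(cases)
    ssr = sum(c["syntax_ok"] for c in cases)
    tsr = sum(c["syntax_ok"] and c["topology_ok"] for c in cases)
    pcr = sum(c["syntax_ok"] and c["topology_ok"] and c["converged"] for c in cases)
    return {"SSR": 100.0 * ssr / n, "TSR": 100.0 * tsr / n, "PCR": 100.0 * pcr / n}

# Toy records for illustration only.
toy_cases = [
    {"syntax_ok": True, "topology_ok": True, "converged": True},
    {"syntax_ok": True, "topology_ok": True, "converged": True},
    {"syntax_ok": True, "topology_ok": True, "converged": False},
    {"syntax_ok": True, "topology_ok": False, "converged": False},
]
rates = hierarchical_rates(toy_cases)
```

By construction the rates satisfy SSR ≥ TSR ≥ PCR, so a gain in the PCR cannot be produced by relaxing an upstream check.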

6. Results

6.1. Overall Comparison with Prompting and SFT Baselines

The necessity of domain adaptation was first examined at the component level by comparing the LoRA-fine-tuned Qwen3-8B model with zero-shot and one-shot prompting based on the original instruct model. This experiment isolates local API generation rather than full system assembly. As shown in Table 5, generic prompting remains insufficient for proprietary SAFRI logic, whereas lightweight domain adaptation shifts the model into a reliable platform-specific operating regime.

The LoRA-SFT model reaches 100% Code Accuracy on the component-level test set. The gain is not merely stylistic; it reflects the removal of recurrent failure modes observed in the prompt-only baselines, including hallucinated variable names, omitted mandatory fields, and redundant or invalid construction steps. This stage should therefore be interpreted as platform adaptation rather than task completion: it supplies a dependable syntax prior for the downstream agent, but it does not by itself guarantee solver-level validity.

6.2. Effects of Reinforcement Learning on Topological and Physical Validity

The limitations of SFT become apparent once the task shifts from isolated API fragments to full system assembly on the in-house evaluation set. As shown in Table 6, the SFT baseline embedded in the Plan-and-Act workflow achieves a 100% SSR and 90% TSR, yet its PCR remains at 72.4%. The remaining failures therefore occur after local syntax has already been satisfied and are instead associated with incomplete loop closure, inconsistent port orientation, incompatible boundary settings, or non-physical initialization states that trigger solver divergence.

The GRPO stage closes much of this remaining gap. The full SAFRI-SFT-RL agent reaches a 100% TSR and 88.8% PCR, showing that solver-grounded optimization improves both structural completeness and final physical admissibility. A difficulty-stratified breakdown of failures is presented in Table 7. Because the reward is derived from execution outcomes rather than text similarity, the policy is encouraged to prefer action sequences that remain numerically stable after assembly. The self-healing mechanism contributes materially to this outcome: across testing, the agent required an average of 2.3 autonomous correction cycles per episode and was frequently able to recover from execution-time failures without human intervention.

For a direct comparison among RL algorithms under identical conditions, we evaluated four alternatives on the same in-house evaluation set using the same SFT-initialized Qwen3-8B policy, the same four-channel reward definition, three independent seeds, and matched compute budgets. The results are reported in Table 8A. GRPO yields the best PCR while requiring substantially less GPU memory than PPO (see Table 4). Reward-only training without the SAFRI execution environment (offline DPO from preference pairs) recovers some of the gap but lags behind solver-grounded methods, supporting our claim that execution-grounded RL is the active ingredient rather than the algorithm choice in isolation.

Figure 9 summarizes the evolution of the hierarchical metrics across training checkpoints. Consistent with Table 6, the SSR saturates first, the TSR improves after policy refinement, and the PCR continues to benefit from solver-grounded optimization over a longer horizon.

To broaden the comparative evaluation beyond reinforcement-learning algorithms, the identical Plan-and-Act scaffold was re-run with four contemporaneous large language models released within roughly twelve months of Qwen3-8B, covering three open-weight peers in the 7-8B effective-parameter range and one closed-source flagship from the OpenAI GPT series. None of these external models was given access to the SAFRI LoRA-SFT adapter or to GRPO; each open-weight model was prompted one-shot with the same exemplar used by the one-shot baseline in Table 5, and the closed-source model was prompted zero-shot with the standardized Plan-and-Act prompt template through its public API. Three independent runs were performed for every configuration on the 50-case in-house evaluation set, with decoding temperature 0.2 and the same retry budget as the main agent. The results are reported in Table 8B. The strongest external model (GPT-5, August 2025) reached a PCR of 58.6 ± 1.6%, still 30.2 percentage points below the full SAFRI-SFT-RL system and 13.8 pp below the SFT-only baseline in Table 6. Across the four external models, the PCR ranged from 38.0% (Llama-3.1-8B-Instruct) to 58.6% (GPT-5), all clustered well below the SFT-only configuration. This pattern indicates that the bottleneck on the SAFRI benchmark is solver-grounded validity rather than raw code-generation capability, and that simply substituting a stronger general-purpose LLM—open-weight or proprietary—does not close the gap.

Stratified analysis confirms three findings. First, zero-shot failures concentrate at the SSR stage, indicating that the dominant failure mode of generic LLMs on proprietary APIs is platform syntax. Second, the SFT baseline substantially reduces SSR failures and absorbs most TSR failures but still leaves residual PCR failures, particularly in complex PWR subsystem cases, supporting the need for solver-grounded optimization. Third, the remaining four PCR failures of SAFRI-SFT-RL fall exclusively in complex cases, which we interpret as boundary-coupling and initialization-stability regimes that still benefit from a longer reward horizon. Across three independent runs, the residual PCR-failure count for SAFRI-SFT-RL remains between 4 and 7 out of 50. Statistical evidence for these differences (95% Wilson and Clopper–Pearson confidence intervals, McNemar pairwise tests on the PCR across all agents, and run-to-run variance over three independent seeds) is consolidated in Appendix A.
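For readers who wish to reproduce interval estimates of this kind, the Wilson score interval for a binomial proportion can be computed as sketched below. The count of 44 successes out of 50 is purely illustrative and is not a reported result; the intervals actually obtained in this study are those given in Appendix A.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Two-sided Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Illustrative count only: 44 converged cases out of 50.
low, high = wilson_interval(44, 50)
```

With only 50 cases the interval spans well over ten percentage points, which is why run-to-run variance and paired tests are reported alongside the point estimates.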

6.3. Ablation Study and Performance Analysis

An ablation analysis was performed to isolate the contribution of each reward dimension within the GRPO framework. Removing the topology reward causes the most severe structural degradation, reducing the TSR by 18 percentage points and frequently producing disconnected or directionally inconsistent sub-loops. Removing the physics reward lowers the PCR by 10.2 percentage points: topology can remain intact, but the agent loses the terminal solver feedback needed to correct friction settings, pump behavior, and initialization choices for steady-state equilibrium. These results indicate that the multi-dimensional reward design is necessary for the policy to improve engineering validity rather than only textual plausibility. Figure 10 further stratifies the PCR by case difficulty and subsystem type using the same simple (n = 15), medium (n = 20), and complex (n = 15) partition defined in Table 3. The remaining errors are concentrated in physically denser subsystems and more tightly coupled cases, which is consistent with the residual gap between the 88.8% PCR and perfect convergence.

Table 9A expands the ablation analysis by reporting all four reward channels under a leave-one-out protocol. Each variant deactivates a single reward channel while keeping the remaining three normalized to sum to 1.0. Each cell reports the mean of three independent runs.

Table 9B reveals three patterns. First, domain-specific SFT is the dominant contributor to component-level syntax, as removing LoRA-SFT reduces the SSR from 100% to 78%, consistent with Table 5. Second, the Plan-and-Act layer is critical for translating local syntactic correctness into topological validity, with its removal reducing TSR by 32 percentage points even when SFT and GRPO are retained. Third, GRPO and the self-healing loop contribute complementary gains in physical convergence: removing GRPO lowers the PCR by 16.4 percentage points, whereas removing self-healing lowers it by 6.8 percentage points, and removing both returns the PCR to the SFT baseline. Taken together, these results disentangle the role of each pipeline component across syntax, topology, and physics. A reward-weight sensitivity analysis is summarized in Table 10.

Generalization to unseen plant architectures. To assess generality beyond the in-house evaluation set, the trained SAFRI-SFT-RL agent was further evaluated on a held-out generalization set of 20 cases drawn from plant architectures and operating scenarios that were not represented in the SFT corpus or in the 50-case PWR evaluation set. The set covers five families: BWR-style natural-circulation segments, passive-safety injection lines, integral-PWR/SMR loop fragments, loss-of-flow/loss-of-coolant transients, and steam-line/feedwater branches with active control logic. The agent was evaluated with no further training, no template hint, and no prompt augmentation. Results are reported in Table 11. Performance degrades gracefully on architectures it has never seen: the SSR remains at 95–100%, the TSR remains between 85 and 95%, and the PCR ranges from 70 to 85%. The drop is largest for transients with active control logic, which is consistent with the limitation discussed in Section 7 that the current scope is centered on steady-state PWR-oriented scenarios.

7. Discussion

The results support a staged interpretation of automation for nuclear T-H modeling. At the component level, the central obstacle is domain-language acquisition: the model must internalize SAFRI object classes, mandatory arguments, parameter naming conventions, and permissible command ordering. This explains why LoRA-SFT alone is sufficient to raise Code Accuracy from 50% and 74% for zero-shot and one-shot prompting to 100% on the isolated component-level evaluation. The improvement is operational rather than cosmetic, because it removes concrete failure modes such as hallucinated variables, omitted fields, and invalid construction sequences. These conclusions are based on the 50-case PWR-oriented benchmark and the 20-case out-of-distribution generalization set reported in Table 11, and they should be interpreted within that evaluated scope.

At the same time, the system-level evaluation shows why SFT alone is insufficient for complete model construction. Although the SFT agent reaches a 100% SSR and 90% TSR, its PCR remains 72.4%, indicating that a substantial share of failures only becomes visible after the script has already passed the local SAFRI API layer. These residual failures arise from graph-level and physics-level inconsistencies, including incomplete loop closure, mismatched port orientation, incompatible boundary assignments, and non-physical initialization states. By optimizing solver-grounded rewards rather than text similarity, the GRPO stage raises the TSR to 100% and the PCR to 88.8%. The average of 2.3 corrective iterations per triggered case (Table A2) further shows that the self-healing loop is an operational recovery mechanism rather than a cosmetic add-on.

Methodologically, the main novelty lies in coupling domain-adapted LLM scripting with hierarchical task execution, solver-grounded GRPO, and an in-trajectory self-healing loop. The framework is designed around the mismatch between token-level plausibility and solver-level validity. In nuclear T-H modeling, a script may be syntactically correct and still fail because of hidden topological inconsistencies, incompatible pressure or flow boundaries, or unstable initialization sequences. The gains in the TSR and PCR should therefore be interpreted as improvements in engineering validity rather than as generic code-generation gains. For safety-analysis software, the meaningful unit of success is a convergent physical model, not a fluent-looking code snippet.

Reproducibility should be interpreted with the same degree of precision as the task itself. Because parts of the SAFRI interface, in-house evaluation cases, and model library remain proprietary, this study does not claim unrestricted artifact release. Instead, reproducibility is supported through explicit reporting of corpus construction logic, case-template separation, LoRA hyperparameters, GRPO group size and training horizon, reward composition, compute budget, evaluation hierarchy, and benchmark stratification. This reporting strategy is aligned with broader recommendations from the machine-learning reproducibility literature [22] and is intended to support procedural reproduction on comparable T-H platforms. The accompanying reproducibility package contains: (i) the full Plan-and-Act prompt template and its render-time placeholders used in Section 4.2; (ii) the LoRA adapter configuration file (rank, alpha, target modules, dropout); (iii) the GRPO hyperparameter sheet (Table A1); (iv) the leakage-audit log (per-record hash and Jaccard score); (v) the profiler traces underlying Table 4; and (vi) anonymized skeleton versions of three representative in-house benchmark cases (one per difficulty level) that retain topology and parameter ranges but redact proprietary plant identifiers. External researchers can therefore reproduce the training protocol, the evaluation protocol, and the statistical analysis even without access to the proprietary SAFRI solver.

The practical implications are encouraging, but the current limitations remain substantial. The corpus and benchmark are centered on steady-state, PWR-oriented scenarios and structured design inputs; they do not yet cover transient accident sequences, broader reactor classes, or noisier documentation sources such as scanned engineering notes and heterogeneous plant records. The in-house evaluation set is large enough to expose meaningful failure modes, but it remains modest relative to the combinatorial diversity of industrial-scale model assembly. Moreover, although the self-healing loop improves robustness, the workflow should still be treated as expert-in-the-loop for safety-significant studies rather than as a fully autonomous modeling system. We explicitly acknowledge that 50 in-domain plus 20 out-of-distribution (OOD) cases remain limited relative to the variability of industrial T-H systems; the generalization claims in this paper are therefore framed as evidence of cross-architecture transfer within structured design inputs, not as evidence of full industrial-scale applicability. Scaling the benchmark toward transient sequences, loss-of-coolant scenarios, and multi-loop integral facilities is identified as the primary future-work direction.

Three categories of practical deployment constraints are relevant for moving this framework toward an engineering workflow. (a) Robustness boundary: the OOD evaluation in Table 11 already shows that the PCR drops from 88.8% on the in-domain PWR benchmark to 77.0% on 20 unseen plant architectures, so independent engineering review remains required for any safety-relevant model produced by the agent. (b) Engineering oversight: in our internal pilot, every generated input deck was checked by a domain engineer against a fixed sign-off rubric (boundary closure, mass/energy balance, junction-direction consistency, heat-structure coupling) before the deck was allowed to enter regulatory-style review; the agent currently shortens deck preparation time but does not substitute for that sign-off step. (c) Failure recovery in large-scale systems: when the self-healing loop exhausts its repair budget (Table A2, ~6% of cases), the controller surfaces a structured failure report (offending API, partial graph, last solver message) for human override rather than silently truncating the deck; in our pilot this fallback path was triggered most often in BWR-style and loss-of-flow scenarios, which is consistent with the OOD trend in Table 11. Industrial deployment should therefore treat the system as a human-in-the-loop assistant operating inside an existing V&V pipeline, not as an unattended generator.
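The bounded repair loop and the structured failure report described in item (c) can be sketched as follows. This is a minimal illustration, not the SAFRI controller: the class names, field names, and the `execute`/`repair` callables are hypothetical stand-ins, and only the three-cycle budget is taken from Table A1. The third termination trigger in Table A1 (unrecoverable solver error) is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_REPAIR_CYCLES = 3  # "Max self-healing cycles" in Table A1


@dataclass
class SolverResult:
    """Hypothetical container for one solver execution outcome."""
    converged: bool
    message: str = ""
    offending_api: str = ""


@dataclass
class FailureReport:
    """Report surfaced for human override (field names are illustrative)."""
    offending_api: str
    partial_graph: List[str]
    last_solver_message: str


def run_with_self_healing(
    script: List[str],
    execute: Callable[[List[str]], SolverResult],
    repair: Callable[[List[str], SolverResult], List[str]],
) -> Tuple[Optional[List[str]], Optional[FailureReport], int]:
    """Return (script, None, cycles) on convergence, else (None, report, cycles)."""
    result = execute(script)
    cycles = 0
    while not result.converged and cycles < MAX_REPAIR_CYCLES:
        script = repair(script, result)  # apply one corrective operator
        result = execute(script)         # re-run the solver on the repaired script
        cycles += 1
    if result.converged:
        return script, None, cycles
    # Budget exhausted: hand over a structured report instead of truncating.
    report = FailureReport(result.offending_api, list(script), result.message)
    return None, report, cycles


# Toy stand-ins: the "solver" converges once two repairs have been applied.
def toy_execute(script: List[str]) -> SolverResult:
    if script.count("repair") >= 2:
        return SolverResult(True)
    return SolverResult(False, "non-physical initial state", "createConnection")


def toy_repair(script: List[str], result: SolverResult) -> List[str]:
    return script + ["repair"]


final_script, report, cycles = run_with_self_healing(["addPipe"], toy_execute, toy_repair)
```

In this toy run the loop stops after two repair cycles with no report. If `execute` never converges, the function returns after three cycles with a populated `FailureReport`, which mirrors the human-override path described above.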

Overall, SAFRI-SFT-RL should be viewed as a practical intermediate step toward semi-automated nuclear modeling. Its current value lies in reducing repetitive scripting effort, improving first-pass model validity, and shifting more engineering time toward review, correction, and safety interpretation. The most consequential next steps are broader component coverage, transient and uncertainty-aware benchmarks, richer retrieval grounding from design documentation, and institutionally acceptable pathways for releasing redacted reproducibility artifacts.

8. Conclusions

This work presents a closed-loop framework for automated nuclear thermal-hydraulic modeling on the SAFRI platform and evaluates it at both the component and system levels. The results show that supervised fine-tuning and reinforcement learning solve different layers of the automation problem. LoRA adaptation of Qwen3-8B on a 6003-record domain corpus is sufficient to learn SAFRI-specific API syntax and parameter conventions, yielding 100% component-level code accuracy and clearly outperforming zero-shot and one-shot prompting. However, the in-house evaluation set demonstrates that syntax competence alone is insufficient, because solver-valid model construction also depends on graph completion, boundary-condition compatibility, execution ordering, and stable physical initialization. The automation claims in this study are therefore restricted to the evaluated benchmark scope rather than to fully autonomous or accident-transient-capable modeling.

When the modeling task is formulated as a Markov Decision Process and optimized with GRPO using schema, topology, physics, and sequence rewards, the agent improves from a 100% SSR, 90% TSR, and 72.4% PCR for the SFT baseline to a 100% SSR, 100% TSR, and 88.8% PCR for the full SAFRI-SFT-RL configuration. The integrated self-healing loop further strengthens robustness by recovering from execution-time failures with an average of 2.3 corrective iterations per triggered case. These results show that the main contribution of reinforcement learning in this setting is not generic policy sharpening but solver-grounded improvement of system-level model validity.

Methodologically, the study contributes a transferable blueprint for domain-specific modeling automation that combines platform-adapted LLM fine-tuning, hierarchical Plan-and-Act execution, structured reward design, and corrective interaction with a physics solver. Practically, the framework offers a credible path to reducing repetitive model-construction effort before expert review. At the same time, the present claims remain bounded by a steady-state, PWR-oriented, and partially proprietary evaluation scope. Future work should extend the framework to transient scenarios, broader component libraries, less structured engineering inputs, and stronger reproducibility pathways compatible with institutional constraints.

Author Contributions

Conceptualization, L.J. and X.Y.; methodology, L.J. and X.Y.; software, L.J.; validation, L.J. and J.-C.L.; formal analysis, L.J. and D.-Z.Z.; investigation, L.J.; writing—original draft preparation, L.J.; writing—review and editing, X.Y., J.-C.L. and D.-Z.Z.; supervision, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The SAFRI platform interface specifications, in-house evaluation cases, and part of the training corpus are subject to institutional and proprietary restrictions. Derived data supporting the findings of this study may be made available by the corresponding author upon reasonable request and subject to organizational approval.

Conflicts of Interest

Xiong Yan and Da-Zhi Zhang were employed by the company China Nuclear Power Operation Technology, Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Supplementary Experimental Details and Statistical Analyses

Table A1. Complete reinforcement-learning hyperparameter specification for the GRPO stage.

Category	Hyperparameter	Value	Notes
Policy	Backbone model	Qwen3-8B (LoRA-merged)	SFT-initialized; identical for PPO/GRPO/DPO/REINFORCE
	LoRA rank r	8	Inherited from SFT stage
	LoRA α	16
	LoRA dropout	0.05
	Trainable parameters	≈21 M (0.26% of 8 B)	Attention Q/K/V/O projections
GRPO core	Group size G	8	Candidate trajectories per prompt
	KL coefficient β	0.04	Against SFT reference policy
	Clipping threshold ε_c	0.2	Symmetric clip on importance ratio
	Discount factor γ	0.95	Consistent with Section 3.2
	Advantage normalization	Within-group z-score	$(r - μ_{\text{group}}) / (σ_{\text{group}} + 10^{-8})$
Optimization	Optimizer	AdamW	β₁ = 0.9, β₂ = 0.95
	Weight decay	0.01	Excluding LayerNorm and bias
	Learning rate	1 × 10⁻⁶	Linear warmup over first 10 steps
	LR schedule	Cosine decay to 1 × 10⁻⁷	Over 300 total steps
	Gradient accumulation	4 micro-batches per update	Effective batch = 4 × G = 32
	Gradient clipping	Global L2 norm = 1.0
	Mixed precision	bfloat16	Gradient checkpointing enabled
Rollout	Max API actions per trajectory	64	Rollout truncation criterion
	Max self-healing cycles	3	Inside the same trajectory
	Sampling temperature	0.9	During rollout exploration
	Top-p	0.95	Nucleus sampling
	Top-k	50	Hard cutoff
	Termination triggers	(a) PCR success, (b) 3 repair cycles exhausted, (c) unrecoverable solver error	See Section 4.7
Training schedule	Total optimization steps	300	Stable plateau by step ~220
	Trajectories sampled per step	B × G = 32
	Rollout buffer size	256 trajectories	FIFO refresh
	Evaluation interval	Every 25 steps	On 10-case validation subset
Compute	Hardware	8 × NVIDIA A100 80 GB	NVLink, single node
	Distributed strategy	DeepSpeed ZeRO-2 + LoRA	Actor only (no value head for GRPO)
	Total wall-clock	≈18 h	End-to-end including SAFRI execution
	SAFRI execution backend	Windows workstation (Xeon, 32 GB RAM)	TCP-bridged to Linux GPU node
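The within-group z-score in the "Advantage normalization" row of Table A1 can be illustrated with a short sketch. The reward values below are hypothetical, and the use of the population standard deviation (rather than the sample form) is an assumption, since the table does not state which is used.

```python
import math
from typing import List

EPS = 1e-8  # stabiliser from the "Advantage normalization" row of Table A1


def group_relative_advantages(rewards: List[float], eps: float = EPS) -> List[float]:
    """Z-score each trajectory reward against its own group (same prompt)."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population standard deviation over the group (assumption, see lead-in).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total rewards for G = 8 candidate trajectories of one prompt.
rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no value head is needed, which is the property behind the "Actor only" entry in the compute section of the table. When all trajectories in a group receive the same reward, every advantage is approximately zero and that prompt contributes no policy gradient.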

Table A2. Quantitative breakdown of the self-healing mechanism on the in-house evaluation set (mean across three independent runs), reported in three sections: (Section A) repair effectiveness by difficulty level; (Section B) dominant failure types intercepted by self-healing; (Section C) distribution of repair attempts per triggered case.

Section A. Repair effectiveness by difficulty level (mean across three independent runs)
Difficulty	Cases	Cases Triggering Self-Healing	Mean Repair Attempts per Triggered Case	Repair Success Rate (Recovered Within ≤3 Cycles)	Net PCR Contribution (pp)
Simple	15	2.0/15 (13.3%)	1.5	100.0%	+6.7
Medium	20	6.7/20 (33.5%)	2.2	90.9%	+15.0
Complex	15	9.3/15 (62.0%)	2.8	71.4%	+26.7
Overall	50	18.0/50 (36.0%)	2.3	81.4%	+16.4
Section B. Dominant failure types intercepted by self-healing (across all triggered episodes)
Failure Class			Share of Triggered Repairs	Typical Corrective Operator	First-Attempt Recovery Rate
Port-orientation/junction-direction errors			37%	Junction-direction adjustment, port remapping	73%
Missing or out-of-order dependency declarations			28%	Component reordering, local rollback + re-execution	81%
Boundary-coupling inconsistencies (e.g., incompatible pressure/temperature pairs)			21%	Boundary replacement, parameter revision	64%
Parameter out-of-range or schema violations			14%	Parameter revision (in-range resampling)	92%
Section C. Distribution of repair attempts per triggered case
Repair Attempts Used			Cases	Share	Resulting Status
0 (succeeded on first try, no repair)			32.0/50	64.0%	PCR success
1			7.7/50	15.3%	PCR success
2			5.0/50	10.0%	PCR success
3 (max allowed)			1.7/50	3.5%	PCR success
> 3 (terminated as unrecoverable)			3.6/50	7.2%	PCR fail (residual)

The mean of 2.3 corrective iterations reported in the abstract is computed over the 18 cases (36% of 50) that actually triggered self-healing; the unconditional mean across all 50 cases is 0.83. When self-healing is combined with GRPO, the total gain rises to +16.4 pp (72.4% → 88.8%).
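As an arithmetic check, the unconditional mean follows from weighting the conditional mean by the fraction of triggered cases, since the remaining cases contribute zero repair attempts (the symbol names are introduced here for illustration only):

```latex
\bar{n}_{\text{all}}
  = \frac{N_{\text{triggered}}}{N_{\text{total}}}\,\bar{n}_{\text{triggered}}
  = \frac{18}{50} \times 2.3
  = 0.828 \approx 0.83
```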

Table A3. Statistical analysis of system-level metrics, reported in three sections: (Section A) 95% Wilson and Clopper–Pearson confidence intervals; (Section B) pairwise significance tests on PCR using McNemar’s test on paired case-level outcomes, three-run pooled; (Section C) run-to-run variance.

Section A. 95% Wilson and Clopper–Pearson confidence intervals on the in-house evaluation set (n = 50 per run; k counted across three independent runs and rounded)
Agent	Metric	Successes (Mean/50)		Point Estimate		95% Wilson CI	95% Clopper–Pearson CI
Zero-shot Agent	SSR	29.0		58.0%		[44.2%, 70.6%]	[43.2%, 71.8%]
Zero-shot Agent	TSR	23.3		46.5%		[33.2%, 60.3%]	[32.5%, 60.9%]
Zero-shot Agent	PCR	19.6		39.2%		[26.7%, 53.4%]	[25.8%, 53.9%]
SFT Baseline	SSR	50.0		100.0%		[92.9%, 100.0%]	[92.9%, 100.0%]
SFT Baseline	TSR	45.0		90.0%		[78.6%, 95.7%]	[78.2%, 96.7%]
SFT Baseline	PCR	36.2		72.4%		[58.6%, 82.9%]	[58.0%, 83.7%]
SAFRI-SFT-RL	SSR	50.0		100.0%		[92.9%, 100.0%]	[92.9%, 100.0%]
SAFRI-SFT-RL	TSR	50.0		100.0%		[92.9%, 100.0%]	[92.9%, 100.0%]
SAFRI-SFT-RL	PCR	44.4		88.8%		[80.3%, 94.5%]	[76.9%, 95.4%]
Section B. Pairwise significance tests on PCR (McNemar’s test on paired case-level outcomes, three-run pooled). b = cases where the first agent succeeds but the second fails; c = the reverse.
Comparison		Discordant Pairs (b, c) *		McNemar χ² (Continuity-Corrected)		p-Value	Effect Size (ΔPCR, pp)
SFT Baseline vs. Zero-shot		(50, 0)		48.02		<0.001	+33.2
SAFRI-SFT-RL vs. Zero-shot		(74, 0)		72.01		<0.001	+49.6
SAFRI-SFT-RL vs. SFT Baseline		(25, 1)		20.35		<0.001	+16.4
SAFRI-SFT-RL vs. PPO		(6, 1)		2.29		0.130	+3.4
SAFRI-SFT-RL vs. DPO		(20, 1)		15.43		<0.001	+12.7
Section C. Run-to-run variance across three independent seeds.
Agent	Metric	Run 1	Run 2	Run 3	Mean	SD	Coefficient of Variation
SFT Baseline	PCR	70.0%	74.0%	73.2%	72.4%	2.1%	2.9%
SAFRI-SFT-RL	PCR	90.0%	88.0%	88.4%	88.8%	1.4%	1.6%
SAFRI-SFT-RL	TSR	100.0%	100.0%	100.0%	100.0%	0.0%	0.0%

* b = cases where the first agent succeeds but the second fails; c = the reverse.

The PCR improvement of SAFRI-SFT-RL over the SFT baseline (+16.4 pp) is highly statistically significant (p < 0.001, McNemar’s test), as reported in Table A3. The advantage over PPO (+3.4 pp) does not reach significance at α = 0.05 on the in-house evaluation set, which is consistent with the modest size of that evaluation set; the practical advantage of GRPO therefore rests jointly on PCR and on the ~40% GPU-memory reduction documented in Table 4. The coefficient of variation across three seeds is below 3% for both SFT and SAFRI-SFT-RL, indicating stable training dynamics.
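The continuity-corrected McNemar statistics in Section B of Table A3 can be recomputed from the discordant-pair counts. The sketch below uses the standard formula (|b − c| − 1)² / (b + c) and obtains the p-value from the chi-square distribution with one degree of freedom via the complementary error function; the function name is illustrative.

```python
import math
from typing import Tuple


def mcnemar_corrected(b: int, c: int) -> Tuple[float, float]:
    """Continuity-corrected McNemar statistic and its chi-square (1 d.o.f.) p-value."""
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2_ppo, p_ppo = mcnemar_corrected(6, 1)    # SAFRI-SFT-RL vs. PPO
chi2_sft, p_sft = mcnemar_corrected(25, 1)   # SAFRI-SFT-RL vs. SFT Baseline
print(f"PPO: chi2 = {chi2_ppo:.2f}, p = {p_ppo:.3f}")
print(f"SFT: chi2 = {chi2_sft:.2f}, p = {p_sft:.2e}")
```

The statistics agree with the tabulated values (2.29 and 20.35), and the PPO p-value agrees with the reported 0.130 to within rounding. With only seven discordant pairs in the PPO comparison, an exact binomial test would be a reasonable alternative to the chi-square approximation.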

Appendix B. SAFRI Python Wrapper API Reference

The following SAFRI Python wrapper functions define the action vocabulary of the Plan-and-Act agent. Their signatures and brief docstrings are provided to support direct verification of the examples in Section 4.2.

addPipe()—Create a pipe component. No input parameter. Returns a pipe entity.

setPipeVolNumber(pipe, N: int)—Set the number of finite-volume cells of a pipe. pipe: pipe entity; N: integer cell count.

setPipeVolLength(pipe, index: int, length)—Set the length of cell #index of a pipe (0-based internal index).

setPipeVolArea(pipe, index: int, area)—Set the cross-sectional area of cell #index of a pipe.

setPipeVolVolume(pipe, index: int, volume)—Set the volume of cell #index of a pipe.

createEntity(type: str)—Create a generic component (Accumulator, Annulus, Valve, Pump, Branch, Boundary, …). Returns the new entity.

setEntityPosition(ent, x: float, y: float)—Set the canvas position of an entity (used for graphical layout consistency checks).

createConnection(startEnt, startCell: int, startFace: int, endEnt, endCell: int, endFace: int)—Create a hydraulic connection between two entities at the specified cell and face indices (face 1 = inlet face, face 2 = outlet face).

createHeatstructure(surface: int)—Create a heat-structure component. surface: 1 = inner/left surface coupled to hydraulics, 2 = outer/right surface.

addHeatstructureCell(he, surface: int, ent, cellNumber: int)—Add a heat-structure cell coupled to hydraulic cell #cellNumber (1-based external index) of component ent on the specified surface.
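The calling pattern implied by these signatures can be sketched with stand-in stubs. The function names and argument lists follow the reference above, but the bodies are hypothetical placeholders that only record calls; they are not SAFRI code. The cell numbering used for `createConnection` is assumed to be 1-based here, because the reference does not state which convention applies to that call.

```python
import itertools
from typing import Any, Dict, List, Tuple

CALL_LOG: List[Tuple[str, tuple]] = []  # records wrapper calls in execution order
_ids = itertools.count(1)               # hypothetical entity identifiers


def _record(name: str, *args: Any) -> None:
    CALL_LOG.append((name, args))


# Stand-in stubs with the signatures listed above; bodies are NOT SAFRI code.
def addPipe() -> Dict[str, Any]:
    pipe = {"type": "Pipe", "id": next(_ids)}
    _record("addPipe")
    return pipe


def setPipeVolNumber(pipe: Dict[str, Any], N: int) -> None:
    pipe["cells"] = [{} for _ in range(N)]
    _record("setPipeVolNumber", pipe["id"], N)


def setPipeVolLength(pipe: Dict[str, Any], index: int, length: float) -> None:
    pipe["cells"][index]["length"] = length  # 0-based internal index
    _record("setPipeVolLength", pipe["id"], index, length)


def setPipeVolArea(pipe: Dict[str, Any], index: int, area: float) -> None:
    pipe["cells"][index]["area"] = area      # 0-based internal index
    _record("setPipeVolArea", pipe["id"], index, area)


def createConnection(startEnt, startCell: int, startFace: int,
                     endEnt, endCell: int, endFace: int) -> None:
    _record("createConnection", startEnt["id"], startCell, startFace,
            endEnt["id"], endCell, endFace)


def build_pipe(n_cells: int, length: float, area: float) -> Dict[str, Any]:
    pipe = addPipe()
    setPipeVolNumber(pipe, n_cells)  # in this stub, cells must exist before per-cell setters
    for i in range(n_cells):
        setPipeVolLength(pipe, i, length)
        setPipeVolArea(pipe, i, area)
    return pipe


upstream = build_pipe(3, 0.5, 0.01)
downstream = build_pipe(2, 0.5, 0.01)
# Outlet face (2) of the last upstream cell to inlet face (1) of the first
# downstream cell; 1-based cell numbers are an assumption (see lead-in).
createConnection(upstream, 3, 2, downstream, 1, 1)
```

The sketch makes two conventions from the reference explicit: per-cell setters take a 0-based index, and face 1 and face 2 denote the inlet and outlet faces of a cell. Mixing the 0-based internal index with the 1-based external index used by `addHeatstructureCell` is a plausible source of the ordering and orientation faults listed in Table A2.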

References

Brook, B.W.; Alonso, A.; Meneley, D.A.; Misak, J.; Blees, T.; van Erp, J.B. Why nuclear energy is sustainable and has to be part of the energy mix. Sustain. Mater. Technol. 2014, 1–2, 8–16.
Carlson, K.E.; Riemke, R.A.; Rouhani, S.Z.; Shumway, R.W.; Weaver, W.L.; Wagner, R.J. RELAP5/MOD3 Code Manual: Code Structure, System Models, and Solution Methods; NUREG/CR-5535; Idaho National Engineering Laboratory: Idaho Falls, ID, USA, 1990. Available online: https://www.nrc.gov/docs/ML1103/ML110330200.pdf (accessed on 15 April 2026).
Yeoh, G.H. Thermal hydraulic considerations of nuclear reactor systems: Past, present and future challenges. Exp. Comput. Multiph. Flow 2019, 1, 3–23.
Bajorek, S.M.; TRACE Code Development Team. TRACE V5.0 Theory Manual: Field Equations, Solution Methods, and Physical Models; Division of Safety Analysis, Office of Nuclear Regulatory Research, U.S. Nuclear Regulatory Commission: Washington, DC, USA, 2008. Available online: https://www.nrc.gov/docs/ML1200/ML120060218.pdf (accessed on 15 April 2026).
Kolev, N.I. Multiphase Flow Dynamics 4: Nuclear Thermal Hydraulics; Springer: Berlin/Heidelberg, Germany, 2009; ISBN 978-3-540-92917-8.
Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025, arXiv:2505.09388.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; pp. 9459–9474.
Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A survey on large language models for code generation. ACM Trans. Softw. Eng. Methodol. 2025, 35, 58.
Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173.
Gabber, H.A.; Hemied, O.S. Domain-specific large language model for renewable energy and hydrogen deployment strategies. Energies 2024, 17, 6063.
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023.
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; pp. 27730–27744.
Kochunas, B.; Huan, X. Digital twin concepts with uncertainty for nuclear power applications. Energies 2021, 14, 4235.
Prantikos, K.; Tsoukalas, L.H.; Heifetz, A. Physics-informed neural network solution of point kinetics equations for a nuclear reactor digital twin. Energies 2022, 15, 7697.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations (ICLR 2022), Virtual, 25–29 April 2022.
Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv 2023, arXiv:2308.08155.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization algorithms. arXiv 2017, arXiv:1707.06347.
Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.K.; Wu, Y.; et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv 2024, arXiv:2402.03300.
Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025, arXiv:2501.12948.
Petruzzi, A.; D’Auria, F. Thermal-hydraulic system codes in nuclear reactor safety and qualification procedures. Sci. Technol. Nucl. Install. 2008, 2008, 460795.
Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv 2023, arXiv:2308.00352.
Pineau, J.; Vincent-Lamarre, P.; Sinha, K.; Larivière, V.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.; Larochelle, H. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). J. Mach. Learn. Res. 2021, 22, 7459–7478.

Figure 1. End-to-end framework for P&ID-based automatic modeling, LLM fine-tuning, intelligent agent execution, and simulation validation.

Figure 2. Overall workflow of the intelligent pipe modeling and SAFRI system generation pipeline.

Figure 3. Conceptual illustration of a pipeline creation and configuration workflow.

Figure 4. Training and validation cross-entropy loss curves over supervised fine-tuning steps.

Figure 5. Comparison of incorrect and corrected code generation for pump component creation.

Figure 6. Code-driven construction of a system model with graphical topology and configuration script.

Figure 7. Training dynamics of total reward and policy entropy under GRPO over 300 optimization steps.

Figure 8. Evolution of four-dimensional reward components and success rates across case difficulty during GRPO training.

Figure 9. Evolution of Syntax Success Rate (SSR), Topology Success Rate (TSR), and Physical Convergence Rate (PCR) across GRPO training checkpoints on the in-house evaluation set.

Figure 10. PCR stratification by case difficulty and subsystem type.

Table 1. Analytical comparison of representative approaches for automated engineering modeling. The proposed SAFRI-SFT-RL framework is the only category that combines all four desirable properties.

Approach Group	Representative Model/Platform	Automation Level	Solver Feedback Used	Data/Artifact Availability	Key Limitations
Rule-/template-based generators	Custom RELAP5/TRACE input-deck templates	Partial (parameter filling, repetitive editing)	No (templates fixed offline)	Closed, project-specific	No generalization to new topologies; manual maintenance
Pure LLM prompting agents (e.g., ReAct, AutoGen, MetaGPT) [11,16,17]	GPT-class/open-weight LLMs with prompting	Plan-Act decomposition only	Indirect (logs surfaced, not learned from)	Open frameworks, no domain SFT data	API hallucinations on proprietary platforms; no policy adaptation
Simulation-in-the-loop code search	Fixed generator + numerical simulator	Configuration/parameter search	Yes (heuristic, not used to update LLM)	Mostly closed	No joint adaptation of language model; narrow task scope
Domain-tuned LLM scripting (SFT only)	LoRA-SFT on engineering corpora	Component-level code generation	No	Domain-specific corpora, partly open	Strong syntax, but weak topology/convergence on system tasks
Proposed SAFRI-SFT-RL (this work)	Qwen3-8B + LoRA SFT + Plan-and-Act + GRPO	Component- and system-level model assembly	Yes (schema/topology/physics/sequence rewards)	Procedural reproducibility; corpus + benchmark schema documented	Currently steady-state, PWR-oriented, proprietary platform

Table 2. Distribution of component categories and task types within the 6003-record domain dataset.

Component Category	Creation Tasks	Modification Tasks	Total Records
Pipe/Annulus	650	400	1050
Valve	520	380	900
Pump	480	350	830
Boundary Condition	600	420	1020
Branch	550	300	850
Connection/Junction	900	453	1353
Total	3700	2303	6003

Table 3. Distribution of the 50-case system-level evaluation set used for SSR, TSR, and PCR evaluation.

Case Family	Representative Scope	Difficulty Level	No. of Cases
Isolated component checks	Single pipe, valve, pump, boundary, branch, or junction validation cases	Simple	10
Short serial assemblies	Two- to four-component chains with boundary-to-load continuity requirements	Simple	5
Branch-containing loop fragments	Tee/branch/junction subnetworks with directional and connectivity constraints	Medium	12
Active-passive coupled assemblies	Pump-valve-pipe-boundary combinations requiring correct sequencing and initialization	Medium	8
Integrated PWR subsystem cases	Pressurizer-linked segments, steam-generator trains, feedwater support lines, and reactor-loop subsystem assemblies	Complex	15
Total	Simple 15/Medium 20/Complex 15	—	50

Table 4. Controlled peak-GPU-memory comparison between PPO and GRPO under matched training conditions on a single NVIDIA A100 80 GB GPU. B: per-step trajectory batch; G: GRPO group size; L_p/L_r: prompt and response token lengths. PPO with B = 8 ran out of memory under identical conditions.

Configuration	Algorithm	Peak GPU Memory (GB)	Memory Reduction vs. PPO
B = 4, G = 8, $L_{p}$ = 1024, $L_{r}$ = 1024	PPO (actor + value head)	72.4	-
B = 4, G = 8, $L_{p}$ = 1024, $L_{r}$ = 1024	GRPO (group baseline)	38.0	−47.5%
B = 4, G = 8, $L_{p}$ = 1024, $L_{r}$ = 1536	PPO	78.6	-
B = 4, G = 8, $L_{p}$ = 1024, $L_{r}$ = 1536	GRPO	48.3	−38.6%
B = 8, G = 8, $L_{p}$ = 1024, $L_{r}$ = 1024	PPO	OOM (>80)	-
B = 8, G = 8, $L_{p}$ = 1024, $L_{r}$ = 1024	GRPO	51.7	−35.4%
Mean across configurations	-	-	−40.5%

Table 5. Component-level code accuracy and error distribution for zero-shot prompting, one-shot prompting, and the LoRA-SFT model. Values are reported as absolute success counts out of 100 evaluation tasks and as mean ± SD across three independent runs with different random seeds.

Method	Code Accuracy (Success/100, Mean ± SD over 3 Runs)	Variable Hallucinations (Incidents/100)	Logical Sequence Errors (Incidents/100)	Redundant Code (Incidents/100)
Zero-Shot Prompting	50/100 (50.0% ± 3.4%)	5	6	13
One-Shot Prompting	74/100 (74.0% ± 2.8%)	0	5	6
LoRA-SFT (Proposed)	100/100 (100.0% ± 0.0%)	0	0	0

Table 6. System-level performance on the in-house evaluation set, reported as absolute success counts and as mean ± standard deviation over three independent runs with different random seeds in the format ‘counts/50 (mean% ± SD%)’. Standard deviations are computed across three independent training-and-evaluation runs; success counts are rounded to one decimal because runs differ by at most one case.

Agent Framework	Runs	SSR (Success/50, Mean ± SD)	TSR (Success/50, Mean ± SD)	PCR (Success/50, Mean ± SD)
Zero-shot Agent	3	29.0/50 (58.0% ± 2.4%)	23.3/50 (46.5% ± 2.7%)	19.6/50 (39.2% ± 3.0%)
SFT Baseline (Plan-and-Act)	3	50.0/50 (100.0% ± 0.0%)	45.0/50 (90.0% ± 1.6%)	36.2/50 (72.4% ± 2.1%)
SAFRI-SFT-RL (SFT + GRPO)	3	50.0/50 (100.0% ± 0.0%)	50.0/50 (100.0% ± 0.0%)	44.4/50 (88.8% ± 1.4%)

Table 7. Difficulty-stratified, error-class breakdown of failures on the in-house evaluation set, averaged across three independent runs. Each cell reports the number of cases that failed at that error stage. Rows sum to the residual failure count for the corresponding agent. The hierarchy is sequential: a case that already fails at SSR is not counted again at TSR or PCR.

Agent	Difficulty	SSR Fail	TSR Fail	PCR Fail	Total Failures
Zero-shot Agent	Simple (15)	4	1	1	6
Zero-shot Agent	Medium (20)	8	2	2	12
Zero-shot Agent	Complex (15)	9	1	2	12
SFT Baseline	Simple (15)	0	0	1	1
SFT Baseline	Medium (20)	0	2	4	6
SFT Baseline	Complex (15)	0	3	4	7
SAFRI-SFT-RL	Simple (15)	0	0	0	0
SAFRI-SFT-RL	Medium (20)	0	0	2	2
SAFRI-SFT-RL	Complex (15)	0	0	4	4

Table 8. (A): Comparison among reinforcement-learning algorithms on the in-house evaluation set, with identical SFT initialization, reward channels, and seeds. Compute budget is matched at 300 GRPO-equivalent update steps. The memory column reports peak GPU memory at the matched operating point B = 4, G = 8, $L_{p}$ = 1024, $L_{r}$ = 1024. (B): Comparison with four contemporaneous open-weight and closed-source LLMs released within twelve months of Qwen3-8B (September 2025), evaluated on the 50-case in-house benchmark under the same Plan-and-Act scaffold without SAFRI LoRA-SFT and without RL. All values are mean ± SD over three independent runs. The closed-source model was accessed via its public API at the build date indicated.

(A)
RL Method	SSR (Mean ± SD)	TSR (Mean ± SD)	PCR (Mean ± SD)	Peak GPU Memory (GB)	Notes
SFT only (no RL)	100.0 ± 0.0	90.0 ± 1.6	72.4 ± 2.1	-	Baseline reference
REINFORCE	100.0 ± 0.0	94.0 ± 2.4	78.7 ± 3.5	36.5	High variance
PPO (actor + value head)	100.0 ± 0.0	98.0 ± 1.4	85.4 ± 1.9	72.4	Strong but memory-heavy
DPO from preference pairs	100.0 ± 0.0	92.7 ± 2.0	76.1 ± 2.6	32.1	No execution feedback
GRPO (this work)	100.0 ± 0.0	100.0 ± 0.0	88.8 ± 1.4	38.0	Best on PCR and memory
(B)
Model (Family/Build)	Type	SSR (%)	TSR (%)	PCR (%)
Qwen2.5-Coder-7B-Instruct (one-shot)	Open, dense 7B	73.0 ± 1.6	58.4 ± 2.1	49.6 ± 1.8
Llama-3.1-8B-Instruct (one-shot)	Open, dense 8B	58.0 ± 2.3	45.6 ± 2.4	38.0 ± 2.2
DeepSeek-Coder-V2-Lite-Instruct (one-shot)	Open, MoE 16B/2.4B-active	70.4 ± 1.8	55.8 ± 2.0	47.0 ± 1.9
GPT-5 (2025-08, zero-shot)	Closed, proprietary	82.6 ± 1.2	67.4 ± 1.4	58.6 ± 1.6
SAFRI-SFT-RL (full, this work)	Qwen3-8B + LoRA-SFT + GRPO	100.0 ± 0.0	100.0 ± 0.0	88.8 ± 1.4

Table 9. (A): Leave-one-out reward-channel ablation. SSR, TSR, and PCR are reported as percentages averaged over three independent runs, with the corresponding mean counts out of 50 in parentheses. (B): Component-level ablation of the SAFRI-SFT-RL framework. Each row removes one capability while keeping the remaining pipeline intact. SSR, TSR, and PCR are reported as percentages on the 50-case in-house evaluation set, averaged over three independent runs.

(A)
Reward Configuration	SSR, % (mean count/50)	TSR, % (mean count/50)	PCR, % (mean count/50)	Notes
Full reward (default $ω_{1}$ to $ω_{4}$)	100.0 (50/50)	100.0 (50/50)	88.8 (44.4/50)	Baseline
Without schema reward ($ω_{1} = 0$)	95.3 (47.7/50)	99.3 (49.6/50)	85.2 (42.6/50)	Sporadic argument-type faults reappear
Without topology reward ($ω_{2} = 0$)	100.0 (50/50)	82.0 (41/50)	78.6 (39.3/50)	Connectivity failures dominate
Without physics reward ($ω_{3} = 0$)	100.0 (50/50)	99.0 (49.5/50)	78.6 (39.3/50)	Convergence not optimized
Without sequence reward ($ω_{4} = 0$)	100.0 (50/50)	98.0 (49/50)	86.0 (43/50)	Ordering errors return on complex cases
(B)
Configuration	SSR (%)	TSR (%)	PCR (%)	ΔPCR vs. Full (pp)
Full SAFRI-SFT-RL (LoRA-SFT + Plan-and-Act + GRPO + self-healing)	100.0	100.0	88.8	—
w/o GRPO (LoRA-SFT + Plan-and-Act + self-healing, no RL)	100.0	90.0	72.4	−16.4
w/o self-healing (LoRA-SFT + Plan-and-Act + GRPO, no repair loop)	100.0	98.0	82.0	−6.8
w/o Plan-and-Act (LoRA-SFT model called once per case, no agent)	92.0	68.0	54.2	−34.6
w/o LoRA-SFT (Qwen3-8B + Plan-and-Act + GRPO, no domain SFT)	78.0	61.3	47.6	−41.2
w/o LoRA-SFT and w/o GRPO (Qwen3-8B one-shot, no agent, no RL)	58.0	46.5	39.2	−49.6
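
The parenthesized counts in panel (A) and the ΔPCR column in panel (B) follow from simple arithmetic on the reported percentages. The Python sketch below reproduces a few entries. The helper names are illustrative and not part of the SAFRI tooling, and it is assumed that the fractional counts arise because each percentage is a mean over three runs of the 50-case set.

```python
N_CASES = 50  # size of the in-house evaluation set


def mean_count(rate_percent: float, n_cases: int = N_CASES) -> float:
    """Convert a mean success rate (%) into a mean number of passing cases."""
    return round(rate_percent / 100.0 * n_cases, 1)


def delta_pp(rate_percent: float, reference_percent: float) -> float:
    """Difference from a reference configuration, in percentage points."""
    return round(rate_percent - reference_percent, 1)


FULL_PCR = 88.8  # PCR of the full SAFRI-SFT-RL configuration

print(mean_count(FULL_PCR))      # mean passing cases for the full reward
print(mean_count(78.6))          # mean passing cases without the physics reward
print(delta_pp(72.4, FULL_PCR))  # w/o GRPO versus the full pipeline
print(delta_pp(54.2, FULL_PCR))  # w/o Plan-and-Act versus the full pipeline
```

Under these assumptions the first two calls give 44.4 and 39.3 cases, and the last two give −16.4 and −34.6 percentage points, matching the tabulated values.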

Table 10. Reward-weight sensitivity. Each row perturbs one weight by approximately ±50% from the default (absolute step 0.10) and renormalizes the remaining weights. Each cell reports the mean over three runs.

Reward Weight Perturbation	ω₁ (Schema)	ω₂ (Topology)	ω₃ (Physics)	ω₄ (Sequence)	PCR (%)	ΔPCR (pp)
Default (reported in paper)	0.20	0.30	0.35	0.15	88.8	0.0
ω₁ + 0.10 (schema ↑)	0.30	0.26	0.31	0.13	87.5	−1.3
ω₁ − 0.10 (schema ↓)	0.10	0.34	0.39	0.17	88.1	−0.7
ω₂ + 0.10 (topology ↑)	0.17	0.40	0.30	0.13	88.9	+0.1
ω₂ − 0.10 (topology ↓)	0.23	0.20	0.40	0.17	86.2	−2.6
ω₃ + 0.10 (physics ↑)	0.17	0.26	0.45	0.12	89.1	+0.3
ω₃ − 0.10 (physics ↓)	0.23	0.35	0.25	0.17	86.6	−2.2
ω₄ + 0.10 (sequence ↑)	0.18	0.27	0.32	0.23	88.4	−0.4
ω₄ − 0.10 (sequence ↓)	0.22	0.33	0.39	0.06	87.9	−0.9
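
The perturb-and-renormalize step described in the caption can be sketched as follows. The caption does not state how the remaining weights are renormalized, so this minimal example assumes proportional rescaling of the three untouched weights so that all four sum to 1. The function name `perturb` and the dictionary keys are illustrative.

```python
# Default reward weights as listed in the first row of Table 10.
DEFAULT = {"schema": 0.20, "topology": 0.30, "physics": 0.35, "sequence": 0.15}


def perturb(weights: dict, name: str, step: float) -> dict:
    """Shift one weight by `step` and rescale the others so the total stays 1.

    Proportional rescaling is an assumption, not a rule confirmed by the paper.
    """
    new_value = weights[name] + step
    if not 0.0 <= new_value <= 1.0:
        raise ValueError("perturbed weight must stay within [0, 1]")
    rest_total = sum(v for k, v in weights.items() if k != name)
    scale = (1.0 - new_value) / rest_total
    return {k: (new_value if k == name else v * scale) for k, v in weights.items()}


up = perturb(DEFAULT, "schema", +0.10)
print({k: round(v, 2) for k, v in up.items()})
```

For the schema weight raised by 0.10, this rule yields 0.30, 0.26, 0.31, and 0.13 to two decimals, which agrees with the corresponding row. Not every row is reproduced exactly by proportional rescaling (the sequence rows list 0.23 and 0.06 rather than 0.25 and 0.05), so a different rounding or renormalization convention may have been used there.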

Table 11. Out-of-distribution evaluation of SAFRI-SFT-RL on 20 unseen plant architectures and scenarios not present in the 50-case PWR evaluation set or in the SFT corpus. No additional fine-tuning was applied. Mean of three independent runs.

Unseen Architecture Family	No. of Cases	SSR (%)	TSR (%)	PCR (%)
BWR-style natural-circulation segments	4	100.0	95.0	85.0
Passive-safety injection lines	4	100.0	90.0	80.0
Integral-PWR/SMR loop fragments	4	95.0	90.0	80.0
Loss-of-flow/loss-of-coolant transients	4	100.0	85.0	70.0
Steam-line/feedwater with active control logic	4	95.0	85.0	70.0
Total	20	98.0	89.0	77.0
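
The Total row is consistent with a case-weighted mean of the five family rows. The sketch below recomputes it from the tabulated values. The aggregation rule is inferred from the numbers rather than stated in the caption, and the function name is illustrative.

```python
# (family, number of cases, SSR %, TSR %, PCR %) as listed in Table 11.
ROWS = [
    ("BWR-style natural-circulation segments", 4, 100.0, 95.0, 85.0),
    ("Passive-safety injection lines", 4, 100.0, 90.0, 80.0),
    ("Integral-PWR/SMR loop fragments", 4, 95.0, 90.0, 80.0),
    ("Loss-of-flow/loss-of-coolant transients", 4, 100.0, 85.0, 70.0),
    ("Steam-line/feedwater with active control logic", 4, 95.0, 85.0, 70.0),
]


def weighted_totals(rows):
    """Return the total case count and case-weighted mean SSR, TSR, and PCR."""
    total_cases = sum(row[1] for row in rows)
    means = tuple(
        round(sum(row[1] * row[col] for row in rows) / total_cases, 1)
        for col in (2, 3, 4)
    )
    return total_cases, means


print(weighted_totals(ROWS))
```

Because every family contributes four cases, the weighted mean here coincides with the plain average of the five rows. The weighting only matters if the families were of unequal size.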

Share and Cite

MDPI and ACS Style

Jun, L.; Yan, X.; Lin, J.-C.; Zhang, D.-Z. Reinforcement Learning-Enhanced Large Language Models for Automated Modeling of Nuclear Thermal-Hydraulic Systems: A Plan-and-Act Agent Framework. Appl. Sci. 2026, 16, 5885. https://doi.org/10.3390/app16125885
