Reinforcement Learning-Enhanced Large Language Models for Automated Modeling of Nuclear Thermal-Hydraulic Systems: A Plan-and-Act Agent Framework
Abstract
1. Introduction
- (1) We constructed a specialized 6003-record domain dataset and used it to fine-tune the Qwen3-8B model with Low-Rank Adaptation (LoRA). SFT resolved variable hallucinations and reached 100% local syntax accuracy, compared with 50% for zero-shot and 74% for one-shot prompting.
- (2) We designed a Plan-and-Act agent workflow incorporating Retrieval-Augmented Generation (RAG) and precise dependency tracking to translate raw, structured design documents into executable modeling actions via a Transmission Control Protocol (TCP) interface.
- (3) We formalized the system-level modeling task as a Markov Decision Process (MDP) and applied Group Relative Policy Optimization (GRPO). By designing a multi-dimensional reward function (schema, topology, physics, sequence) grounded in real simulation feedback, the framework transitions from syntactic imitation to physical optimization.
- (4) We introduced an autonomous error self-healing mechanism. On the in-house evaluation set, the RL-enhanced agent raised the Physical Convergence Rate (PCR) from 72.4% for the SFT baseline to 88.8%, and GRPO reduced peak GPU memory by approximately 40% relative to PPO under matched profiling conditions, supporting the practical value of physics-informed RL for solver-grounded modeling.
2. Related Work
2.1. Nuclear Thermal-Hydraulic Simulation and Automated Modeling
2.2. Large Language Models for Engineering Code Generation
2.3. Reinforcement Learning for Tool-Using Language Agents
2.4. Critical Synthesis and Positioning
3. Simulation Platform and Problem Formulation
3.1. Modeling Hierarchy, Component Taxonomy, and API Parameterization
3.2. Task Formalization as a Markov Decision Process (MDP)
- State Space (S): The state s_t includes the target structural design document (JSON format), the operational history (previously executed API calls and their environment feedback), and the internal state of the SAFRI platform (instantiated topological graph).
- Action Space (A): The action a_t comprises the structured JSON response generated by the LLM, which dictates the target SAFRI Python API code snippet for execution.
- Transition Function (P): The environment is deterministic. Executing a_t either updates the platform’s topological state or returns an error traceback if the API constraints are violated.
- Reward Function (R): A multi-dimensional scalar feedback mechanism evaluating intermediate syntax and terminal physical convergence.
- Discount Factor (γ): Set to 0.95 to penalize unnecessarily protracted modeling sequences.
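The effect of the discount factor can be illustrated with a minimal sketch. The `ModelingState` container and its field names are hypothetical stand-ins for s_t and are not taken from the SAFRI platform; only γ = 0.95 comes from the definition above.

```python
from dataclasses import dataclass, field

GAMMA = 0.95  # discount factor stated in the MDP definition


@dataclass
class ModelingState:
    """Hypothetical container for s_t; field names are illustrative only."""
    design_doc: dict                                # target design document (JSON)
    history: list = field(default_factory=list)     # (action, feedback) pairs
    topology: dict = field(default_factory=dict)    # component id -> connected ids


def discounted_return(rewards, gamma=GAMMA):
    """Return sum over t of gamma**t * r_t for one trajectory."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Same terminal reward of 1.0, reached after 3 steps versus 10 steps.
short_return = discounted_return([0.0, 0.0, 1.0])
long_return = discounted_return([0.0] * 9 + [1.0])
```

With the same terminal reward, the three-step trajectory keeps 0.95² of it while the ten-step trajectory keeps only 0.95⁹, which is how the discount discourages protracted modeling sequences.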
4. Methodology
4.1. Overall Framework
4.2. Domain Dataset Construction
4.3. Supervised Fine-Tuning with LoRA
4.4. Agent Workflow: Planning, Retrieval, and Execution
- Plan: the agent receives the global instruction, parses the design-document structure, and generates a hierarchical task list with explicit identifiers and dependency annotations. Tasks that create components therefore precede tasks that modify them, and connection operations are constrained to occur only after the relevant ports exist.
- Retrieve: using Retrieval-Augmented Generation (RAG) [7], an embedding-based retriever isolates only the physical parameters and local structural context relevant to the current subtask. This step compresses the effective prompt context and reduces interference from irrelevant portions of the full design document.
- Act: the SFT-initialized policy generates the corresponding API sequence. That code is transmitted via TCP to the SAFRI client, executed immediately, and the resulting platform state and error logs are returned to the agent memory for subsequent planning or repair.
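The dependency constraint in the Plan step (create before modify, ports before connections) amounts to a topological ordering of the task list. The sketch below uses Python's standard `graphlib`; the task identifiers are hypothetical and do not correspond to actual SAFRI API names.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical plan: task id -> set of task ids that must run first.
plan = {
    "create_pipe_1": set(),
    "create_pump_1": set(),
    "set_pump_1_speed": {"create_pump_1"},
    "connect_pipe_1_pump_1": {"create_pipe_1", "create_pump_1"},
}


def execution_order(tasks):
    """Return a dependency-respecting task order; raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(tasks).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"cyclic task dependencies: {exc.args[1]}") from exc


order = execution_order(plan)
```

Any order returned this way places `create_pump_1` before `set_pump_1_speed` and both creation tasks before the connection task; a cyclic plan is rejected before anything is sent to the platform.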
4.5. GRPO-Based Policy Optimization
4.6. Four-Dimensional Reward Design
- Schema Reward: this channel verifies that generated parameters lie within physically admissible bounds, such as pressure between 0.1 and 20 MPa, and that required argument fields are not omitted. Missing-field, out-of-range, and datatype violations are explicitly penalized within the normalized score.
- Topology Reward: this channel evaluates whether the generated network matches the intended graph connectivity by penalizing isolated nodes, duplicated links, unresolved ports, and reversed flow directions. Its normalized form is computed from the number of topology faults detected for the current case.
- Physics Reward: this is the decisive terminal reward that interfaces directly with the PANTHER solver. A score of 1 is assigned when the model reaches full steady-state convergence, an intermediate score is assigned when initialization succeeds but convergence criteria are only partially satisfied, and a score of 0 is assigned when the solver fails before meaningful physical evaluation.
- Sequence Reward: this channel incentivizes adherence to API logic, ensuring that components are created before they are modified and that dependent objects or boundaries are defined before connection steps are attempted. Ordering and dependency violations are penalized through a normalized count-based score.
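One way the four channels could be combined is sketched below. The weights are the default values listed in the reward-weight sensitivity table (0.20, 0.30, 0.35, 0.15). The linear weighted sum, the `1 - violations / checks` normalization, and the value 0.5 for the intermediate physics score are assumptions made for illustration; the text above does not specify them.

```python
# Default weights as listed in the reward-weight sensitivity table.
WEIGHTS = {"schema": 0.20, "topology": 0.30, "physics": 0.35, "sequence": 0.15}

# Assumed mapping: the text only says the partial score is "intermediate".
PHYSICS_SCORE = {"converged": 1.0, "partial": 0.5, "failed": 0.0}


def count_score(violations, checks):
    """Assumed normalization: fraction of checks passed, clamped to [0, 1]."""
    if checks <= 0:
        return 1.0
    return max(0.0, 1.0 - violations / checks)


def combined_reward(schema, topology, sequence, solver_status, weights=WEIGHTS):
    """schema, topology, sequence are (violations, checks) pairs."""
    scores = {
        "schema": count_score(*schema),
        "topology": count_score(*topology),
        "physics": PHYSICS_SCORE[solver_status],
        "sequence": count_score(*sequence),
    }
    return sum(weights[name] * scores[name] for name in weights)


clean_case = combined_reward((0, 12), (0, 8), (0, 10), "converged")
one_topology_fault = combined_reward((0, 12), (1, 8), (0, 10), "failed")
```

In this sketch a single topology fault that also prevents the solver from running costs both a share of the topology weight and the entire physics weight, which reflects the decisive role assigned to the terminal physics channel.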
4.7. Error Self-Healing Mechanism
5. Experimental Setup
5.1. Benchmark Cases and Evaluation Protocol
5.2. Implementation Details
5.3. Evaluation Metrics
- Code Accuracy: the fraction of component-level tasks for which the generated SAFRI script fragment is directly executable, contains no hallucinated variables or unsupported arguments, and satisfies the expected component-specific parameter schema.
- Syntax Success Rate (SSR): the percentage of system-level tasks whose generated scripts execute through the SAFRI Python interface without raising unrecovered API, parser, or runtime exceptions during model construction.
- Topology Success Rate (TSR): the percentage of generated models that form a complete and admissible network graph, with intended ports connected, no isolated nodes, no unresolved open boundaries, and no duplicated or contradictory links that invalidate connectivity.
- Physical Convergence Rate (PCR): the fraction of generated models that not only pass syntax and topology checks but also initialize the PANTHER solver successfully and satisfy the steady-state convergence test used in this study. Following RELAP5-style steady-state practice, the coupled mass, momentum, and energy equations are advanced with a semi-implicit global formulation, and the linearized matrix system at each iteration is solved directly. A case is counted as PCR-successful only when (i) initialization completes without a fatal solver error, (ii) the residuals of the mass and momentum equations are both ≤ 1 × 10⁻⁵ and the energy residual is ≤ 1 × 10⁻⁴, (iii) global mass and energy imbalances are each below 0.5%, and (iv) all criteria are reached within 5000 iterations.
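The four PCR conditions can be written as a single predicate. The `SolverReport` structure and its field names are hypothetical (the PANTHER output format is not described here); the thresholds are the ones stated in the PCR definition.

```python
from dataclasses import dataclass


@dataclass
class SolverReport:
    """Hypothetical summary of one solver run; field names are illustrative."""
    initialized: bool
    mass_residual: float
    momentum_residual: float
    energy_residual: float
    mass_imbalance_pct: float
    energy_imbalance_pct: float
    iterations: int


def is_pcr_success(r: SolverReport) -> bool:
    """Apply criteria (i)-(iv) of the PCR definition."""
    return (
        r.initialized                               # (i) no fatal init error
        and r.mass_residual <= 1e-5                 # (ii) residual thresholds
        and r.momentum_residual <= 1e-5
        and r.energy_residual <= 1e-4
        and abs(r.mass_imbalance_pct) < 0.5         # (iii) global imbalances
        and abs(r.energy_imbalance_pct) < 0.5
        and r.iterations <= 5000                    # (iv) iteration budget
    )


def convergence_rate(reports) -> float:
    """PCR as a percentage over a list of reports."""
    return 100.0 * sum(is_pcr_success(r) for r in reports) / len(reports)


good = SolverReport(True, 8e-6, 9e-6, 5e-5, 0.1, 0.2, 1200)
slow = SolverReport(True, 8e-6, 9e-6, 5e-5, 0.1, 0.2, 5001)
```

Here `good` satisfies every criterion, while `slow` has identical residuals but exceeds the iteration budget and is therefore counted as a PCR failure.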
6. Results
6.1. Overall Comparison with Prompting and SFT Baselines
6.2. Effects of Reinforcement Learning on Topological and Physical Validity
6.3. Ablation Study and Performance Analysis
7. Discussion
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Supplementary Experimental Details and Statistical Analyses
| Category | Hyperparameter | Value | Notes |
|---|---|---|---|
| Policy | Backbone model | Qwen3-8B (LoRA-merged) | SFT-initialized; identical for PPO/GRPO/DPO/REINFORCE |
| Policy | LoRA rank r | 8 | Inherited from SFT stage |
| Policy | LoRA α | 16 | |
| Policy | LoRA dropout | 0.05 | |
| Policy | Trainable parameters | ≈21 M (0.26% of 8 B) | Attention Q/K/V/O projections |
| GRPO core | Group size G | 8 | Candidate trajectories per prompt |
| GRPO core | KL coefficient β | 0.04 | Against SFT reference policy |
| GRPO core | Clipping threshold ε_c | 0.2 | Symmetric clip on importance ratio |
| GRPO core | Discount factor γ | 0.95 | Consistent with Section 3.2 |
| GRPO core | Advantage normalization | Within-group z-score | |
| Optimization | Optimizer | AdamW | β1 = 0.9, β2 = 0.95 |
| Optimization | Weight decay | 0.01 | Excluding LayerNorm and bias |
| Optimization | Learning rate | 1 × 10⁻⁶ | Linear warmup over first 10 steps |
| Optimization | LR schedule | Cosine decay to 1 × 10⁻⁷ | Over 300 total steps |
| Optimization | Gradient accumulation | 4 micro-batches per update | Effective batch = 4 × G = 32 |
| Optimization | Gradient clipping | Global L2 norm = 1.0 | |
| Optimization | Mixed precision | bfloat16 | Gradient checkpointing enabled |
| Rollout | Max API actions per trajectory | 64 | Rollout truncation criterion |
| Rollout | Max self-healing cycles | 3 | Inside the same trajectory |
| Rollout | Sampling temperature | 0.9 | During rollout exploration |
| Rollout | Top-p | 0.95 | Nucleus sampling |
| Rollout | Top-k | 50 | Hard cutoff |
| Rollout | Termination triggers | (a) PCR success, (b) 3 repair cycles exhausted, (c) unrecoverable solver error | See Section 4.7 |
| Training schedule | Total optimization steps | 300 | Stable plateau by step ~220 |
| Training schedule | Trajectories sampled per step | B × G = 32 | |
| Training schedule | Rollout buffer size | 256 trajectories | FIFO refresh |
| Training schedule | Evaluation interval | Every 25 steps | On 10-case validation subset |
| Compute | Hardware | 8 × NVIDIA A100 80 GB | NVLink, single node |
| Compute | Distributed strategy | DeepSpeed ZeRO-2 + LoRA | Actor only (no value head for GRPO) |
| Compute | Total wall-clock | ≈18 h | End-to-end including SAFRI execution |
| Compute | SAFRI execution backend | Windows workstation (Xeon, 32 GB RAM) | TCP-bridged to Linux GPU node |
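Three of the settings in the hyperparameter table can be made concrete with a short sketch: within-group z-score advantages (G = 8), the symmetric clip on the importance ratio (ε_c = 0.2), and linear warmup followed by cosine decay of the learning rate. The function names are illustrative. The use of the population standard deviation and the exact shape of the warmup are assumptions; the table does not specify them.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Z-score each trajectory reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population SD; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate term, also used by GRPO."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def learning_rate(step, peak=1e-6, floor=1e-7, warmup=10, total=300):
    """Linear warmup to `peak`, then cosine decay to `floor` at step `total`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# One group of G = 8 trajectory rewards (made-up values for illustration).
group = [0.35, 0.60, 0.60, 0.85, 1.00, 0.20, 0.65, 1.00]
advantages = group_advantages(group)
```

Because the baseline is the group mean, no separate value head is needed, which is consistent with the "Actor only" note in the Compute rows.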
Section A. Repair effectiveness by difficulty level (mean across three independent runs)

| Difficulty | Cases | Cases Triggering Self-Healing | Mean Repair Attempts per Triggered Case | Repair Success Rate (Recovered Within ≤3 Cycles) | Net PCR Contribution (pp) |
|---|---|---|---|---|---|
| Simple | 15 | 2.0/15 (13.3%) | 1.5 | 100.0% | +6.7 |
| Medium | 20 | 6.7/20 (33.5%) | 2.2 | 90.9% | +15.0 |
| Complex | 15 | 9.3/15 (62.0%) | 2.8 | 71.4% | +26.7 |
| Overall | 50 | 18.0/50 (36.0%) | 2.3 | 81.4% | +16.4 |

Section B. Dominant failure types intercepted by self-healing (across all triggered episodes)

| Failure Class | Share of Triggered Repairs | Typical Corrective Operator | First-Attempt Recovery Rate |
|---|---|---|---|
| Port-orientation/junction-direction errors | 37% | Junction-direction adjustment, port remapping | 73% |
| Missing or out-of-order dependency declarations | 28% | Component reordering, local rollback + re-execution | 81% |
| Boundary-coupling inconsistencies (e.g., incompatible pressure/temperature pairs) | 21% | Boundary replacement, parameter revision | 64% |
| Parameter out-of-range or schema violations | 14% | Parameter revision (in-range resampling) | 92% |

Section C. Distribution of repair attempts per triggered case

| Repair Attempts Used | Cases | Share | Resulting Status |
|---|---|---|---|
| 0 (succeeded on first try, no repair) | 32.0/50 | 64.0% | PCR success |
| 1 | 7.7/50 | 15.3% | PCR success |
| 2 | 5.0/50 | 10.0% | PCR success |
| 3 (max allowed) | 1.7/50 | 3.5% | PCR success |
| > 3 (terminated as unrecoverable) | 3.6/50 | 7.2% | PCR fail (residual) |
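The bounded repair behavior summarized in these tables (at most three repair cycles inside one trajectory, with early exit on success or on an unrecoverable error) can be sketched as a simple loop. The `execute` and `repair` callables and the toy stand-ins below are hypothetical; they are not the SAFRI client or the authors' corrective operators.

```python
MAX_REPAIR_CYCLES = 3  # "Max self-healing cycles" in the hyperparameter table


def run_with_self_healing(script, execute, repair, max_cycles=MAX_REPAIR_CYCLES):
    """Execute a script, repairing it up to `max_cycles` times.

    execute(script) -> (ok, error_message)
    repair(script, error_message) -> repaired script, or None if unrecoverable
    Returns (ok, repair_attempts_used, final_script).
    """
    attempts = 0
    ok, error = execute(script)
    while not ok and attempts < max_cycles:
        fixed = repair(script, error)
        if fixed is None:  # unrecoverable: stop without further attempts
            break
        script = fixed
        attempts += 1
        ok, error = execute(script)
    return ok, attempts, script


# Toy stand-ins: each "BAD" token is one fault; a repair fixes one fault.
def toy_execute(script):
    ok = "BAD" not in script
    return ok, None if ok else "junction direction error"


def toy_repair(script, error):
    return script.replace("BAD", "OK", 1)


recovered = run_with_self_healing("BAD BAD", toy_execute, toy_repair)
exhausted = run_with_self_healing("BAD BAD BAD BAD", toy_execute, toy_repair)
```

In the toy run, the two-fault script recovers after two repairs, while the four-fault script still fails once the three-cycle budget is exhausted.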
Section A. 95% Wilson and Clopper–Pearson confidence intervals on the in-house evaluation set (n = 50 per run; k counted across three independent runs and rounded)

| Agent | Metric | Successes (Mean/50) | Point Estimate | 95% Wilson CI | 95% Clopper–Pearson CI |
|---|---|---|---|---|---|
| Zero-shot Agent | SSR | 29.0 | 58.0% | [44.2%, 70.6%] | [43.2%, 71.8%] |
| Zero-shot Agent | TSR | 23.3 | 46.5% | [33.2%, 60.3%] | [32.5%, 60.9%] |
| Zero-shot Agent | PCR | 19.6 | 39.2% | [26.7%, 53.4%] | [25.8%, 53.9%] |
| SFT Baseline | SSR | 50.0 | 100.0% | [92.9%, 100.0%] | [92.9%, 100.0%] |
| SFT Baseline | TSR | 45.0 | 90.0% | [78.6%, 95.7%] | [78.2%, 96.7%] |
| SFT Baseline | PCR | 36.2 | 72.4% | [58.6%, 82.9%] | [58.0%, 83.7%] |
| SAFRI-SFT-RL | SSR | 50.0 | 100.0% | [92.9%, 100.0%] | [92.9%, 100.0%] |
| SAFRI-SFT-RL | TSR | 50.0 | 100.0% | [92.9%, 100.0%] | [92.9%, 100.0%] |
| SAFRI-SFT-RL | PCR | 44.4 | 88.8% | [80.3%, 94.5%] | [76.9%, 95.4%] |

Section B. Pairwise significance tests on PCR (McNemar test on paired case-level outcomes, three-run pooled). b = cases where the first agent succeeds but the second fails; c = the reverse.

| Comparison | Discordant Pairs (b, c) | McNemar χ² (Continuity-Corrected) | p-Value | Effect Size (ΔPCR, pp) |
|---|---|---|---|---|
| SFT Baseline vs. Zero-shot | (50, 0) | 48.02 | <0.001 | +33.2 |
| SAFRI-SFT-RL vs. Zero-shot | (74, 0) | 72.01 | <0.001 | +49.6 |
| SAFRI-SFT-RL vs. SFT Baseline | (25, 1) | 20.35 | <0.001 | +16.4 |
| SAFRI-SFT-RL vs. PPO | (6, 1) | 2.29 | 0.130 | +3.4 |
| SAFRI-SFT-RL vs. DPO | (20, 1) | 15.43 | <0.001 | +12.7 |

Section C. Run-to-run variance across three independent seeds.

| Agent | Metric | Run 1 | Run 2 | Run 3 | Mean | SD | Coefficient of Variation |
|---|---|---|---|---|---|---|---|
| SFT Baseline | PCR | 70.0% | 74.0% | 73.2% | 72.4% | 2.1% | 2.9% |
| SAFRI-SFT-RL | PCR | 90.0% | 88.0% | 88.4% | 88.8% | 1.4% | 1.6% |
| SAFRI-SFT-RL | TSR | 100.0% | 100.0% | 100.0% | 100.0% | 0.0% | 0.0% |
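The two statistics reported above follow standard formulas and can be reproduced with the Python standard library. The sketch below implements the Wilson score interval and the continuity-corrected McNemar test; the function names are illustrative, and the p-value uses the identity that the upper tail of a χ² distribution with one degree of freedom equals erfc(√(χ²/2)).

```python
import math

Z95 = 1.959963984540054  # two-sided 95% standard normal quantile


def wilson_interval(successes, n, z=Z95):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def mcnemar_cc(b, c):
    """Continuity-corrected McNemar chi-square and its p-value (1 d.o.f.)."""
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


tsr_sft_low, tsr_sft_high = wilson_interval(45, 50)   # SFT Baseline, TSR row
chi2_rl_vs_sft, p_rl_vs_sft = mcnemar_cc(25, 1)       # SAFRI-SFT-RL vs. SFT
chi2_rl_vs_ppo, p_rl_vs_ppo = mcnemar_cc(6, 1)        # SAFRI-SFT-RL vs. PPO
```

For integer success counts, these functions reproduce the tabulated values, for example the Wilson interval of the SFT Baseline TSR row (45/50) and the χ² values 20.35 and 2.29 for the (25, 1) and (6, 1) discordant pairs.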
Appendix B. SAFRI Python Wrapper API Reference
References
- Brook, B.W.; Alonso, A.; Meneley, D.A.; Misak, J.; Blees, T.; van Erp, J.B. Why nuclear energy is sustainable and has to be part of the energy mix. Sustain. Mater. Technol. 2014, 1–2, 8–16.
- Carlson, K.E.; Riemke, R.A.; Rouhani, S.Z.; Shumway, R.W.; Weaver, W.L.; Wagner, R.J. RELAP5/MOD3 Code Manual: Code Structure, System Models, and Solution Methods; NUREG/CR-5535; Idaho National Engineering Laboratory: Idaho Falls, ID, USA, 1990. Available online: https://www.nrc.gov/docs/ML1103/ML110330200.pdf (accessed on 15 April 2026).
- Yeoh, G.H. Thermal hydraulic considerations of nuclear reactor systems: Past, present and future challenges. Exp. Comput. Multiph. Flow 2019, 1, 3–23.
- Bajorek, S.M.; TRACE Code Development Team. TRACE V5.0 Theory Manual: Field Equations, Solution Methods, and Physical Models; Division of Safety Analysis, Office of Nuclear Regulatory Research, U.S. Nuclear Regulatory Commission: Washington, DC, USA, 2008. Available online: https://www.nrc.gov/docs/ML1200/ML120060218.pdf (accessed on 15 April 2026).
- Kolev, N.I. Multiphase Flow Dynamics 4: Nuclear Thermal Hydraulics; Springer: Berlin/Heidelberg, Germany, 2009; ISBN 978-3-540-92917-8.
- Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025, arXiv:2505.09388.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; pp. 9459–9474.
- Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A survey on large language models for code generation. ACM Trans. Softw. Eng. Methodol. 2025, 35, 58.
- Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173.
- Gabber, H.A.; Hemied, O.S. Domain-specific large language model for renewable energy and hydrogen deployment strategies. Energies 2024, 17, 6063.
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023.
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; pp. 27730–27744.
- Kochunas, B.; Huan, X. Digital twin concepts with uncertainty for nuclear power applications. Energies 2021, 14, 4235.
- Prantikos, K.; Tsoukalas, L.H.; Heifetz, A. Physics-informed neural network solution of point kinetics equations for a nuclear reactor digital twin. Energies 2022, 15, 7697.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations (ICLR 2022), Virtual, 25–29 April 2022.
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv 2023, arXiv:2308.08155.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization algorithms. arXiv 2017, arXiv:1707.06347.
- Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.K.; Wu, Y.; et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv 2024, arXiv:2402.03300.
- Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025, arXiv:2501.12948.
- Petruzzi, A.; D’Auria, F. Thermal-hydraulic system codes in nuclear reactor safety and qualification procedures. Sci. Technol. Nucl. Install. 2008, 2008, 460795.
- Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv 2023, arXiv:2308.00352.
- Pineau, J.; Vincent-Lamarre, P.; Sinha, K.; Larivière, V.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.; Larochelle, H. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). J. Mach. Learn. Res. 2021, 22, 7459–7478.

| Approach Group | Representative Model/Platform | Automation Level | Solver Feedback Used | Data/Artifact Availability | Key Limitations |
|---|---|---|---|---|---|
| Rule-/template-based generators | Custom RELAP5/TRACE input-deck templates | Partial (parameter filling, repetitive editing) | No (templates fixed offline) | Closed, project-specific | No generalization to new topologies; manual maintenance |
| Pure LLM prompting agents (e.g., ReAct, AutoGen, MetaGPT) [11,16,17] | GPT-class/open-weight LLMs with prompting | Plan-Act decomposition only | Indirect (logs surfaced, not learned from) | Open frameworks, no domain SFT data | API hallucinations on proprietary platforms; no policy adaptation |
| Simulation-in-the-loop code search | Fixed generator + numerical simulator | Configuration/parameter search | Yes (heuristic, not used to update LLM) | Mostly closed | No joint adaptation of language model; narrow task scope |
| Domain-tuned LLM scripting (SFT only) | LoRA-SFT on engineering corpora | Component-level code generation | No | Domain-specific corpora, partly open | Strong syntax, but weak topology/convergence on system tasks |
| Proposed SAFRI-SFT-RL (this work) | Qwen3-8B + LoRA SFT + Plan-and-Act + GRPO | Component- and system-level model assembly | Yes (schema/topology/physics/sequence rewards) | Procedural reproducibility; corpus + benchmark schema documented | Currently steady-state, PWR-oriented, proprietary platform |

| Component Category | Creation Tasks | Modification Tasks | Total Records |
|---|---|---|---|
| Pipe/Annulus | 650 | 400 | 1050 |
| Valve | 520 | 380 | 900 |
| Pump | 480 | 350 | 830 |
| Boundary Condition | 600 | 420 | 1020 |
| Branch | 550 | 300 | 850 |
| Connection/Junction | 900 | 453 | 1353 |
| Total | 3700 | 2303 | 6003 |

| Case Family | Representative Scope | Difficulty Level | No. of Cases |
|---|---|---|---|
| Isolated component checks | Single pipe, valve, pump, boundary, branch, or junction validation cases | Simple | 10 |
| Short serial assemblies | Two- to four-component chains with boundary-to-load continuity requirements | Simple | 5 |
| Branch-containing loop fragments | Tee/branch/junction subnetworks with directional and connectivity constraints | Medium | 12 |
| Active-passive coupled assemblies | Pump-valve-pipe-boundary combinations requiring correct sequencing and initialization | Medium | 8 |
| Integrated PWR subsystem cases | Pressurizer-linked segments, steam-generator trains, feedwater support lines, and reactor-loop subsystem assemblies | Complex | 15 |
| Total | Simple 15/Medium 20/Complex 15 | - | 50 |

| Configuration | Algorithm | Peak GPU Memory (GB) | Memory Reduction vs. PPO |
|---|---|---|---|
| B = 4, G = 8, = 1024, = 1024 | PPO (actor + value head) | 72.4 | - |
| B = 4, G = 8, = 1024, = 1024 | GRPO (group baseline) | 38.0 | −47.5% |
| B = 4, G = 8, = 1024, = 1536 | PPO | 78.6 | - |
| B = 4, G = 8, = 1024, = 1536 | GRPO | 48.3 | −38.6% |
| B = 8, G = 8, = 1024, = 1024 | PPO | OOM (>80) | - |
| B = 8, G = 8, = 1024, = 1024 | GRPO | 51.7 | −35.4% |
| Mean across configurations | - | - | −40.5% |

| Method | Code Accuracy (Success/100, Mean ± SD over 3 Runs) | Variable Hallucinations (Incidents/100) | Logical Sequence Errors (Incidents/100) | Redundant Code (Incidents/100) |
|---|---|---|---|---|
| Zero-Shot Prompting | 50/100 (50.0% ± 3.4%) | 5 | 6 | 13 |
| One-Shot Prompting | 74/100 (74.0% ± 2.8%) | 0 | 5 | 6 |
| LoRA-SFT (Proposed) | 100/100 (100.0% ± 0.0%) | 0 | 0 | 0 |

| Agent Framework | Runs | SSR (Success/50, Mean ± SD) | TSR (Success/50, Mean ± SD) | PCR (Success/50, Mean ± SD) |
|---|---|---|---|---|
| Zero-shot Agent | 3 | 29.0/50 (58.0% ± 2.4%) | 23.3/50 (46.5% ± 2.7%) | 19.6/50 (39.2% ± 3.0%) |
| SFT Baseline (Plan-and-Act) | 3 | 50.0/50 (100.0% ± 0.0%) | 45.0/50 (90.0% ± 1.6%) | 36.2/50 (72.4% ± 2.1%) |
| SAFRI-SFT-RL (SFT + GRPO) | 3 | 50.0/50 (100.0% ± 0.0%) | 50.0/50 (100.0% ± 0.0%) | 44.4/50 (88.8% ± 1.4%) |

| Agent | Difficulty | SSR Fail | TSR Fail | PCR Fail | Total Failures |
|---|---|---|---|---|---|
| Zero-shot Agent | Simple (15) | 4 | 1 | 1 | 6 |
| Zero-shot Agent | Medium (20) | 8 | 2 | 2 | 12 |
| Zero-shot Agent | Complex (15) | 9 | 1 | 2 | 12 |
| SFT Baseline | Simple (15) | 0 | 0 | 1 | 1 |
| SFT Baseline | Medium (20) | 0 | 2 | 4 | 6 |
| SFT Baseline | Complex (15) | 0 | 3 | 4 | 7 |
| SAFRI-SFT-RL | Simple (15) | 0 | 0 | 0 | 0 |
| SAFRI-SFT-RL | Medium (20) | 0 | 0 | 2 | 2 |
| SAFRI-SFT-RL | Complex (15) | 0 | 0 | 4 | 4 |

(A)

| RL Method | SSR (%, Mean ± SD) | TSR (%, Mean ± SD) | PCR (%, Mean ± SD) | Peak GPU Memory (GB) | Notes |
|---|---|---|---|---|---|
| SFT only (no RL) | 100.0 ± 0.0 | 90.0 ± 1.6 | 72.4 ± 2.1 | - | Baseline reference |
| REINFORCE | 100.0 ± 0.0 | 94.0 ± 2.4 | 78.7 ± 3.5 | 36.5 | High variance |
| PPO (actor + value head) | 100.0 ± 0.0 | 98.0 ± 1.4 | 85.4 ± 1.9 | 72.4 | Strong but memory-heavy |
| DPO from preference pairs | 100.0 ± 0.0 | 92.7 ± 2.0 | 76.1 ± 2.6 | 32.1 | No execution feedback |
| GRPO (this work) | 100.0 ± 0.0 | 100.0 ± 0.0 | 88.8 ± 1.4 | 38.0 | Best on PCR and memory |

(B)

| Model (Family/Build) | Type | SSR (%) | TSR (%) | PCR (%) |
|---|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct (one-shot) | Open, dense 7B | 73.0 ± 1.6 | 58.4 ± 2.1 | 49.6 ± 1.8 |
| Llama-3.1-8B-Instruct (one-shot) | Open, dense 8B | 58.0 ± 2.3 | 45.6 ± 2.4 | 38.0 ± 2.2 |
| DeepSeek-Coder-V2-Lite-Instruct (one-shot) | Open, MoE 16B/2.4B-active | 70.4 ± 1.8 | 55.8 ± 2.0 | 47.0 ± 1.9 |
| GPT-5 (2025-08, zero-shot) | Closed, proprietary | 82.6 ± 1.2 | 67.4 ± 1.4 | 58.6 ± 1.6 |
| SAFRI-SFT-RL (full, this work) | Qwen3-8B + LoRA-SFT + GRPO | 100.0 | 100.0 | 88.8 ± 1.4 |

(A)

| Reward Configuration | SSR (%/50) | TSR (%/50) | PCR (%/50) | Notes |
|---|---|---|---|---|
| Full reward (default weights) | 100.0 (50/50) | 100.0 (50/50) | 88.8 (44.4/50) | Baseline |
| Without schema reward | 95.3 (47.7/50) | 99.3 (49.6/50) | 85.2 (42.6/50) | Sporadic argument-type faults reappear |
| Without topology reward | 100.0 (50/50) | 82.0 (41/50) | 78.6 (39.3/50) | Connectivity failures dominate |
| Without physics reward | 100.0 (50/50) | 99.0 (49.5/50) | 78.6 (39.3/50) | Convergence not optimized |
| Without sequence reward | 100.0 (50/50) | 98.0 (49/50) | 86.0 (43/50) | Ordering errors return on complex cases |

(B)

| Configuration | SSR (%) | TSR (%) | PCR (%) | ΔPCR vs. Full (pp) |
|---|---|---|---|---|
| Full SAFRI-SFT-RL (LoRA-SFT + Plan-and-Act + GRPO + self-healing) | 100.0 | 100.0 | 88.8 | - |
| w/o GRPO (LoRA-SFT + Plan-and-Act + self-healing, no RL) | 100.0 | 90.0 | 72.4 | −16.4 |
| w/o self-healing (LoRA-SFT + Plan-and-Act + GRPO, no repair loop) | 100.0 | 98.0 | 82.0 | −6.8 |
| w/o Plan-and-Act (LoRA-SFT model called once per case, no agent) | 92.0 | 68.0 | 54.2 | −34.6 |
| w/o LoRA-SFT (Qwen3-8B + Plan-and-Act + GRPO, no domain SFT) | 78.0 | 61.3 | 47.6 | −41.2 |
| w/o LoRA-SFT and w/o GRPO (Qwen3-8B one-shot, no agent, no RL) | 58.0 | 46.5 | 39.2 | −49.6 |

| Reward Weight Perturbation | ω1 (Schema) | ω2 (Topology) | ω3 (Physics) | ω4 (Sequence) | PCR (%) | ΔPCR (pp) |
|---|---|---|---|---|---|---|
| Default (reported in paper) | 0.20 | 0.30 | 0.35 | 0.15 | 88.8 | 0.0 |
| ω1 + 0.10 (schema ↑) | 0.30 | 0.26 | 0.31 | 0.13 | 87.5 | −1.3 |
| ω1 − 0.10 (schema ↓) | 0.10 | 0.34 | 0.39 | 0.17 | 88.1 | −0.7 |
| ω2 + 0.10 (topology ↑) | 0.17 | 0.40 | 0.30 | 0.13 | 88.9 | +0.1 |
| ω2 − 0.10 (topology ↓) | 0.23 | 0.20 | 0.40 | 0.17 | 86.2 | −2.6 |
| ω3 + 0.10 (physics ↑) | 0.17 | 0.26 | 0.45 | 0.12 | 89.1 | +0.3 |
| ω3 − 0.10 (physics ↓) | 0.23 | 0.35 | 0.25 | 0.17 | 86.6 | −2.2 |
| ω4 + 0.10 (sequence ↑) | 0.18 | 0.27 | 0.32 | 0.23 | 88.4 | −0.4 |
| ω4 − 0.10 (sequence ↓) | 0.22 | 0.33 | 0.39 | 0.06 | 87.9 | −0.9 |

| Unseen Architecture Family | No. of Cases | SSR (%) | TSR (%) | PCR (%) |
|---|---|---|---|---|
| BWR-style natural-circulation segments | 4 | 100.0 | 95.0 | 85.0 |
| Passive-safety injection lines | 4 | 100.0 | 90.0 | 80.0 |
| Integral-PWR/SMR loop fragments | 4 | 95.0 | 90.0 | 80.0 |
| Loss-of-flow/loss-of-coolant transients | 4 | 100.0 | 85.0 | 70.0 |
| Steam-line/feedwater with active control logic | 4 | 95.0 | 85.0 | 70.0 |
| Total | 20 | 98.0 | 89.0 | 77.0 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Jun, L.; Yan, X.; Lin, J.-C.; Zhang, D.-Z. Reinforcement Learning-Enhanced Large Language Models for Automated Modeling of Nuclear Thermal-Hydraulic Systems: A Plan-and-Act Agent Framework. Appl. Sci. 2026, 16, 5885. https://doi.org/10.3390/app16125885

