Machine-Learning Algorithms for Remote-Control and Autonomous Operation of the Very-Small, Long-Life, Modular (VSLLIM) Microreactor

El-Genk, Mohamed S.; Schriener, Timothy M.; Shaheen, Ahmad N.

doi:10.3390/jne6040054


by Mohamed S. El-Genk *, Timothy M. Schriener and Ahmad N. Shaheen

Institute for Space and Nuclear Power Studies and Nuclear Engineering Department, The University of New Mexico, Albuquerque, NM 87131, USA

* Author to whom correspondence should be addressed.

J. Nucl. Eng. 2025, 6(4), 54; https://doi.org/10.3390/jne6040054

Submission received: 10 October 2025 / Revised: 27 November 2025 / Accepted: 28 November 2025 / Published: 2 December 2025


Abstract

This work investigated machine-learning algorithms for remote-control and autonomous operation of the Very-Small, Long-Life, Modular (VSLLIM) microreactor. This walk-away safe reactor can continuously generate 1.0–10 MW of thermal power for 92 and 5.6 full power years, respectively, is cooled by natural circulation of in-vessel liquid sodium, does not require on-site storage of either fresh or spent nuclear fuel, and offers redundant means of control and passive decay heat removal. The two ML algorithms investigated are Supervised Learning with Long Short-Term Memory networks (SL-LSTM) and Soft-Actor–Critic with Feedforward Neural Networks (SAC-FNN). They are trained to manage the movement of the control rods in the reactor core during various transients, including startup, shutdown, and changes in the reactor steady state power up to 10 MW. The trained algorithms are incorporated into a Programmable Logic Controller (PLC) coupled to a digital twin dynamic model of the VSLLIM microreactor. Although the SL-LSTM algorithms demonstrate high prediction accuracy of up to 99.95%, they show inferior performance when incorporated into the PLC. Conversely, the PLC with the SAC-FNN algorithm accurately adjusts the control rod positions during the reactor startup transients to within ±1.6% of target values.

Keywords:

machine learning; autonomous and remote control; modular small and microreactors; supervised and reinforcement learning algorithms; soft actor critic algorithm

1. Introduction

The recent growing interest in the development and deployment of small modular nuclear fission reactors (SMRs) and modular microreactors (MMRs) is due to the need for standalone sources to supply baseload electricity 24/7 to data and Artificial Intelligence centers. These reactors are also a practical option for providing both baseload electricity and process-heat to remote communities, military bases, and industrial and mining operations with limited or no access to an electrical grid. SMRs and MMRs could be factory fabricated, assembled, and sealed, offer passive operation and safety features, and short construction and deployment times. It is desirable that these reactors in remote communities and sites could be controlled and operated remotely with a high degree of local autonomy [1]. Such capabilities will minimize the need for onsite personnel and ensure safe plant operation in the event of a delay or loss of communication with the remote operators.

Training and implementing Machine-Learning (ML) algorithms can enable fail-safe operation and autonomous control of these reactors. The trained algorithms learn from operation data to produce generalized responses and perform tasks without explicit instructions. They would monitor and diagnose anomalous operating conditions, independently take corrective control actions [2], and monitor sensor data as well as detect, identify, and correct faults [3]. The success of ML algorithms in learning patterns in big data sets has led to further investigations of their application to direct and autonomous control of industrial processes and to predicting the performance of a nuclear power plant based on a digital twin. Among the ML algorithms investigated are those of Supervised Learning (SL) and Reinforcement Learning (RL) [4,5].

The training of the SL algorithms uses data of known input parameters (or features) and desired outputs (or targets) to build a function that can map new data and predict the correct output values [4]. Such training can employ pre-existing labeled data, such as that of historical operations of existing nuclear plants, and high-quality simulation data. The latter could be for operating conditions absent from the historical monitoring data of physical nuclear plants [6].

The Reinforcement-Learning (RL) algorithms train agents to receive a high cumulative reward for making correct actions while actively controlling a dynamic process [5]. Thus, training the RL algorithms neither relies on pre-generated labeled input/output pairs of control actions and state variables nor requires that the network’s sub-optimal actions be corrected [5,7]. Instead, during training, the RL algorithms seek balance between exploring the action space and exploiting the current knowledge of the controller’s responses. This exploration feature allows RL algorithms to examine different control actions in response to a received input of the state variables and identify the most advantageous response.

1.1. Prior Investigation of ML for Nuclear Reactor Instrumentation and Control Systems

Trained SL algorithms have been investigated for fault detection and diagnosis and data forecast for nuclear plants. Wang et al. [6] investigated a trained fault diagnosis system for nuclear plants using SL and Support Vector Machine algorithms. The system employed a knowledge-based module of plant historical operation data to identify potential faults and their causes using data generated by a fast-running integrated thermal-hydraulic model of the plant [6]. Radaideh et al. [8] investigated an SL algorithm for forecasting the operation variables of a commercial Light Water Reactor (LWR) plant during simulated Loss of Coolant Accidents (LOCAs). The trained algorithm predicted the operation parameters with an accuracy of 92–99% during testing. Xiao, et al. [9] trained an ML global Neural Network Predictive Controller for the Westinghouse IRIS reactor [10] to serve as a transfer function in the Model Predictive Control (MPC) system for the control rods. Compared to a conventional Proportional-Integral-Differential (PID) controller, the performance of the developed ML predictive controller decreased overshoots in reactor power and core temperatures, following a reactivity perturbation.

Other researchers have investigated RL algorithms for fault detection and state prediction, as well as for direct control of different systems within nuclear plants. Qian and Liu [11] have applied TensorFlow to RL algorithms for fault diagnostics of a simulated nuclear power plant. They compared the results of a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) neural network. The trained RL agent attempted to identify different faults from the supplied plant state variables data, with an accuracy > ~95% using either network. Wei, et al. [12] trained a Twin Delayed Deep Deterministic Policy Gradient (TD3) RL algorithm to control a RELAP5 code system model of the Qinshan PWR plant in China. An actor–critic arrangement trained the neural networks to perform the prediction function in an MPC controller. The combined trained algorithm with a PID controller simulated 10% and 20% step changes in the nominal thermal power of the reactor. Results showed a slight decrease in power overshoot compared to a PID controller [12].

Park, et al. [13] have investigated an Asynchronous Advantage Actor Critic (A3C) RL algorithm for the autonomous control of a Westinghouse 3-loop PWR during a simulated heat-up transient. This scenario increased the primary coolant temperature to a hot zero power state prior to the ensuing reactor startup transient. During the simulated transient, the trained A3C agent maintained the pressure and the water level in the pressurizer within defined limits.

Lee et al. [14] have applied a Soft-Actor–Critic (SAC) ML algorithm to control the startup and emergency operation of a simulated Westinghouse 3-loop PWR. During the startup transient, the trained SAC algorithm predicted the rate of increase of the reactor power but did not directly control the position of the control rods or the concentration of the soluble boron poison in the reactor core coolant [14]. In an emergency operation, the SAC algorithm controlled the actuation of the pressurizer’s spray nozzle and the power to the submerged electrical heaters, the charging and letdown valves, the primary coolant pumps, and the safety injection pumps. The trained SAC model successfully increased the reactor power within an allowable range of <3%/h [14]. The trained controller also reduced the pressure and temperature within the primary loops to within the criteria for a reactor shutdown following a simulated small break LOCA.

Nguyen et al. [15] have used simulation data to train an SAC RL algorithm for the controller of a Pebble Bed-High Temperature Gas-cooled Reactor (PB-HTGR). They used the System Analysis Model (SAM) code [16] and a developed balance of plant steam Rankine model in MATLAB Simulink 2020 to generate the training data. The controller with a SAC algorithm coupled to a surrogate model adjusted the rate and magnitude of the external reactivity insertion, the speeds of the gas circulator, the feedwater and the condenser pumps, and the turbine control valve. The controller maintained a smooth change of the reactor power and temperatures within 1.5 °C; however, the secondary side of the plant failed to return to the conditions at the beginning of the transient. Chen and Ray [17] applied a Deep Deterministic Policy Gradient (DDPG) RL algorithm to the control of a Boiling Water Reactor (BWR) simulator. The trained DDPG actor–critic algorithm used Feedforward Neural Networks (FNN) for the actor and critic. It was trained to maintain the reactor thermal power at a specified setpoint for a simulated BWR experiencing random system perturbations. The DDPG controller settled the thermal power of the reactor within ~2 s compared to ~10 s using an H_∞ control system and reduced the power oscillations during the transient.

Radaideh et al. [18] have trained Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO) algorithms to predict the position of the rotating control drums for a simulated microreactor model based on the Westinghouse eVinci design [19]. They used training data sets for different fuel burnup levels to determine the control drums’ angular position for critical operation of the reactor. The PPO algorithm outperformed the A2C algorithm, which did not converge to an optimal policy during training [18].

Trunkle et al. [20] investigated an RL PPO model for controlling the Holos-Quad high-temperature gas-cooled microreactor concept. The trained PPO algorithm rotated control drums coupled to a simplified reactor kinetics and thermal-hydraulics model during power change transients. The trained PPO algorithm outperformed a conventional PID controller for increasing and decreasing the reactor power to the target values with less overshooting [20].

In summary, trained SL and RL algorithms can perform diagnostic and control functions of large nuclear reactor plants and microreactors using data generated either by simulators or integrated physics-based models of the plants. However, to the best of these authors’ knowledge, little work has investigated the performance of the trained ML algorithms for real-time reactor control during startup and shutdown transients. These reactor control tasks are challenging owing to the highly nonlinear reactor kinetics and the sensitivity of the external reactivity insertion to the displacements of the control rods within the reactor core. While some researchers have applied ML algorithms for aspects of reactor control, the algorithms did not directly control the positions of the reactivity control elements [14,15,17], or they did not test their algorithms for real-time control of a simulated reactor or a reactor digital twin [18,20]. Therefore, it is desirable to evaluate the performance of trained ML algorithms for reactor control, while incorporated into a digital Programmable Logic Controller (PLC) coupled to a real-time, physics-based transient model of a nuclear power system or a digital twin.

1.2. Objectives

The objective of this research is to train and compare the performance of two distinct ML algorithms, namely, Supervised Learning with Long Short-Term Memory networks (SL-LSTM) and Soft-Actor–Critic with Feedforward Neural Networks (SAC-FNN). The SL-LSTM algorithm is an appropriate choice for using the time-series reactor plant data for training. The trained SAC algorithm with FNNs demonstrated good performance in control processes [14,21]. The off-policy algorithm updates the Actor and Critic networks with data throughout the entire transient to prevent local overfitting of the weights and the biases of the neural networks [5,21].

The SL-LSTM and SAC-FNN algorithms are trained separately to manage the movement of the control rods in the core of the Very-Small, Long-Life, Modular (VSLLIM) microreactor [22,23] during simulated startup transients to steady thermal power levels of 1.0–10 MW_th. The trained ML algorithms perform the reactor control function of the Programmable Logic Controller (PLC) for the control rods, by adjusting their positions or vertical displacements in the VSLLIM reactor core during various simulated startup transients.

The implemented SL-LSTM algorithm processes, in sequential sets of time series, the training data generated using a developed digital twin of the VSLLIM microreactor in the MATLAB Simulink platform [24]. The parametric analyses of the SL-LSTM algorithm help identify the combinations of the hyperparameter values for high accuracy and low variation of the predicted positions of the control rods during the simulated startup scenarios. The SAC-FNN trains while coupled directly to the VSLLIM microreactor digital twin. The trained SL-LSTM and SAC-FNN algorithms are then integrated separately into a software PLC developed in house to evaluate and compare their performance for real-time control of the VSLLIM microreactor during the same simulated startup transients. The next section briefly describes the design features and control of the VSLLIM microreactor.

2. VSLLIM Microreactor Design Features and Control

The present work trained the SL-LSTM and SAC-FNN algorithms using data sets generated by a developed MATLAB-Simulink transient model, or a digital twin, of the VSLLIM microreactor. This data is for simulated startup transients to steady power levels of 1.0–10 MW_th. The fast spectrum walk-away safe VSLLIM MMR design (Figure 1 and Figure 2) is cooled by natural circulation of in-vessel liquid sodium (Na) during nominal operation and after shutdown aided by a 2-m tall chimney and a helically coiled-tube Na/Na heat exchanger (HEX) placed at the top entrance to the downcomer (Figure 1 and Figure 2) [22,23].

It offers redundant control and passive removal of decay heat after shutdown and employs liquid metal heat-pipe thermoelectric (LMHP-TE) conversion modules cooled by natural convection of ambient air. They generate auxiliary DC power 24/7 during reactor operation and after shutdown, and in the unlikely case of a loss of both off-site and on-site power sources. This factory fabricated, assembled, and sealed modular microreactor can continuously generate 1.0 MW of thermal power for 92 full power years (FPY) and up to 10.0 MW for ~5.9 FPY, without refueling [22]. It arrives at the operating site on an 18-wheeler truck, by rail, or on a barge and is mounted underground on seismic isolation bearings to protect against earthquakes and an airplane crash or a missile impact (Figure 2).

Owing to the low vapor pressure of liquid sodium, the VSLLIM microreactor operates below atmospheric pressure, which eliminates the need for a pressure vessel. The primary and guard containments for the VSLLIM microreactor are separated by a small gap filled with argon gas, which houses sodium leak detectors. The low thermal conductivity argon gas decreases side heat losses during reactor operation. In the event of a loss of heat removal, due to a failure or malfunction of the in-vessel Na/Na HEX, the argon gas in the gap between the primary and guard vessels is replaced with liquid sodium. This facilitates the decay heat removal by in-vessel natural circulation of liquid sodium and by natural circulation of ambient air along the outer surface of the guard vessel (Figure 2) [26].

The VSLLIM microreactor core (Figure 1b) is loaded with hexagonal assemblies of 13.76 wt.% enriched UN fuel rods with HT-9 steel cladding and with scalloped BeO walls (Figure 3). The scalloped walls ensure that the liquid sodium flow is laterally uniform for cooling the fuel rods within the reactor core assemblies [22]. The fifty-four full hexagonal assemblies and the six partial assemblies of UN fuel rods in the reactor core are arranged in four concentric rings (Figure 1b). The full assemblies are loaded with 19 UN fuel rods, in a triangular lattice, and the partial corner assemblies contain 12 UN fuel rods each. The BeO wedges that surround the UN fuel assemblies in the core within the HT-9 steel core barrel serve as a radial neutron reflector (Figure 1b).

The VSLLIM microreactor has two independent and redundant means for the reactor control. The twelve B₄C Reactor Control (RC) rods located at the center of selected UN fuel assemblies in the second and third rings of the core (Figure 1b and Figure 3a,c) are for reactor control during operation and shutdown. These control rods fall into three groups, labeled A, B, and C, with separate drive motors (Figure 1b). Group A comprises three B₄C rods located at the center of fuel assemblies in the second ring of the reactor core. Group B comprises six B₄C rods in the fuel assemblies in the third ring of the core, and Group C comprises three B₄C rods in the fuel assemblies in the third ring of the core (Figure 1b).

The central Emergency Shut Down (ESD) assembly of 19 B₄C rods, 80% enriched in ¹⁰B, within scalloped HT-9 steel wall (Figure 1 and Figure 3b,d) provides independent shutdown of the reactor in case of an emergency. The next section briefly describes the VSLLIM microreactor digital twin dynamic model developed using the MATLAB Simulink platform [24]. This model generates the operation data sets for training and testing the trained SL-LSTM and SAC-FNN algorithms. They are implemented into the PLC controller of the VSLLIM microreactor during simulated startup and operation transients.

3. VSLLIM Digital Twin Model and Controller

The VSLLIM digital twin dynamic model couples a six-group point kinetics sub-model [26] that accounts for the temperature reactivity feedback to thermal-hydraulics sub-models of the VSLLIM microreactor and the in-vessel Na/Na HEX (Figure 1a and Figure 4). The digital twin model uses the versatile MATLAB Simulink platform [24] to solve the governing equations in the coupled sub-models. The determined values of the physics-based operation parameters during the simulated startup transients are used for training the ML algorithms. These parameters are the reactor thermal power; the average temperatures of the UN fuel, HT-9 steel cladding, and structure and the circulating in-vessel liquid sodium in the reactor core; the mass flow rate and the temperature of the liquid sodium exiting the reactor core; and the temperatures of the rising sodium in the chimney, in the upper and lower plenums and on the shell side of the in-vessel Na/Na heat exchanger (Figure 1, Figure 2 and Figure 3). The Na/Na HEX maintains the temperature of the in-vessel liquid sodium entering the core at 610 K while the exit temperature that varies with the reactor thermal power is <800 K. At these temperatures, liquid sodium is compatible with the HT-9 steel cladding of the UN fuel rods and core structure [27]. The VSLLIM digital twin model uses the ode23s modified Rosenbrock solver in the MATLAB-Simulink platform to numerically solve the coupled point-kinetics sub-model to the overall energy and momentum balance equations of the reactor (Figure 4) with 20 ms time step size during the simulated transients.

The six-group point-kinetics sub-model calculates the transient changes in the reactor fission power, P_Rx, as a function of the external reactivity insertion, Δρ_ex, and the temperature reactivity feedback, Δρ_fb. The simulated startup transient of the VSLLIM microreactor begins after fully withdrawing the ESD central assembly. The inserted external reactivity is due to partially withdrawing the Groups A, B, and C control rods in the core. The temperature reactivity feedback due to the decreases in the densities of the fuel, cladding, and liquid sodium in the reactor core and the Doppler broadening of the neutron cross sections in the UN fuel is highly negative. However, the temperature reactivity feedback of the BeO in the radial and axial reflectors and in the scalloped walls of the UN fuel assemblies is slightly positive [22]. In the simulated startup transients, the total reactivity, ρ_total, in the VSLLIM microreactor core is the sum of the inserted external reactivity and the total temperature reactivity feedback (Figure 4).
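For reference, the standard textbook form of the six-group point-kinetics equations, written without an external neutron source term, is given below. The symbols β_i, λ_i, C_i, and Λ (delayed neutron fractions, precursor decay constants, precursor concentrations, and neutron generation time) are the conventional ones and are assumptions here; the text does not list the exact form or the parameter values used in the digital twin.

```latex
\begin{aligned}
\frac{dP_{Rx}}{dt} &= \frac{\rho_{total} - \beta}{\Lambda}\,P_{Rx} + \sum_{i=1}^{6} \lambda_i C_i, \\
\frac{dC_i}{dt} &= \frac{\beta_i}{\Lambda}\,P_{Rx} - \lambda_i C_i, \qquad i = 1,\dots,6, \\
\rho_{total} &= \Delta\rho_{ex} + \Delta\rho_{fb}, \qquad \beta = \sum_{i=1}^{6} \beta_i .
\end{aligned}
```

Here the precursor concentrations C_i are expressed in power units so that the first equation is dimensionally consistent.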

The thermal-hydraulic sub-model of the reactor simultaneously solves the coupled energy balance equations in the UN fuel rods, core structure, and the in-vessel sodium and the momentum balance equation for natural circulation of the in-vessel liquid sodium coolant in the reactor core, chimney, and the downcomer. The sub-model of the Na/Na HEX (Figure 4) simultaneously solves the energy and momentum balance equations of the secondary liquid Na flowing inside the helically coiled tubes of the Na/Na heat exchanger and the in-vessel liquid sodium flow on the shell side of the HEX [22].

3.1. The VSLLIM Reactor Controllers

During simulated startup transients, the Reactor Control PLC commands the VSLLIM digital twin (Figure 4) and determines the rates and the magnitudes of the axial displacements of the ESD assembly and Groups A, B, and C control rods in the reactor core (Figure 1, Figure 2 and Figure 3). The PLC receives commands from the remote operator to start up or shut down the VSLLIM microreactor as well as specify the desired reactor power setpoint, P_SP (Figure 4). The digital twin model calculates the magnitude and the rate of the external reactivity insertion, Δρ_ex, in the reactor core as a function of the axial displacements of the control rods in the reactor core and passes it to the point-kinetics sub-model (Figure 4). The controller continues to adjust the axial displacement of the control rods until reaching the operator specified setpoint, P_SP, of the reactor thermal power.

The results in Figure 5 are those of the performed neutronics analyses using the MCNP6 code [28] to determine the reactivity worth of each of the control rod groups in the VSLLIM reactor core as a function of axial displacement from the bottom of the core and the calculated mean temperatures in the core (Figure 5). This figure plots the calculated reactivity worth of the control rods in Groups A, B, and C and of the center ESD assembly as functions of axial displacement at isothermal temperatures of 400 K and 800 K. The vertical line in the figure marks the limit set for the axial withdrawal of the B₄C control rods in the core, which corresponds to 2/3 the active core height to speed up reactor shutdown in case of an emergency. The results assume that the control rods are in thermal equilibrium with the in-vessel liquid sodium in the reactor core.

3.1.1. Reactor Control PLC

Two Reactor Control PLC programs are developed to determine the control element positions, using (a) a modified Proportional–Differential (PD) controller and (b) an ML controller using the trained neural networks. The Reactor Control PLC program runs with a scan cycle time of 50 ms, which is sufficiently short to capture the response of the PLC to changes in the reactor operation. At the start of each scan cycle, the PLC reads the Modbus input registers holding the values of the VSLLIM reactor state variables calculated by the digital twin and the commands received from the remote human operator. The state variables include the reactor thermal power, the in-vessel and HEX Na flow rates, the core Na inlet and exit temperatures, the HEX Na inlet and exit temperatures, the calculated core reactivity, and the axial positions of the control rod groups and the ESD assembly. The PLC then acts on the received commands from the remote operator to determine the displacement rates for the control elements. At the end of the scan cycle, the actions to move the control elements are written to the PLC’s Modbus output holding registers and communicated to the digital twin model of the VSLLIM microreactor.
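The read, compute, write order of one scan cycle can be sketched as follows. This is a minimal illustration, not the authors' PLC program: plain Python dictionaries stand in for the Modbus input and holding registers, the register names are hypothetical, and `toy_control_law` is a placeholder that is neither the PD function nor a trained network.

```python
SCAN_CYCLE_S = 0.050  # scan cycle time stated in the text (50 ms)


def toy_control_law(state):
    """Placeholder law (assumption): withdraw at a fixed rate while below setpoint."""
    rate = 0.125 if state["power_W"] < state["setpoint_W"] else 0.0
    return {"rate_A_mm_s": rate, "rate_C_mm_s": rate}


def scan_cycle(input_registers, holding_registers, control_law):
    """Run one scan: read inputs, compute rod rates, write outputs."""
    # 1) Start of scan: read state variables and operator commands.
    state = dict(input_registers)
    # 2) Determine the displacement rates for the control elements.
    rates = control_law(state)
    # 3) End of scan: write the commanded rates to the output registers.
    holding_registers.update(rates)
    return holding_registers


inputs = {"power_W": 4.0e5, "setpoint_W": 5.0e5}  # hypothetical register contents
outputs = {}
scan_cycle(inputs, outputs, toy_control_law)
```

In a real deployment the two dictionaries would be replaced by Modbus reads and writes, and the function would be called once every `SCAN_CYCLE_S` seconds.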

During the simulated startup transients, the Reactor Control PLC brings the digital twin from an initial cold subcritical condition at a mean core temperature of 500 K to a steady full power operation at the reactor power setpoint specified by the remote operator. The PLC adjusts the axial displacement of the control rods to bring the reactor power to the setpoint specified by the remote operator, P_SP.

The Reactor Control PLC with the modified PD controller moves the Group A and C control rods axially at a rate determined by the PD function from the input value of (P_SP − P_Rx). The displacement rate is limited to ≤0.125 mm/s to ensure a smooth increase in the reactor thermal power during the simulated startup transients. The modified PD controller uses a criterion derived from that proposed by Bernard, Lanning, and Ray [29] to adjust the axial withdrawal of the control rods to ensure a smooth and gradual increase in the total reactivity, ρ_total, and hence in the reactor power and temperatures, during the startup transient. This criterion is given as follows:

\rho_{total} < \frac{1}{\alpha} \left[ \frac{1}{\lambda_e} \left| \frac{d\rho}{dt} \right| + \left| \frac{d\rho}{dt} \right| \tau \ln \frac{P_{SP}}{P_{Rx}} \right], \qquad (1)

In this expression, α is a scaling coefficient, dρ/dt is the rate of change in the total reactivity, τ is the reactor period, and λ_e is the effective decay constant for the six delayed neutron groups in the reactor’s point-kinetics sub-model. The scaling coefficient provides adequate time for the total reactivity to account for the delayed negative temperature reactivity feedback due to the thermal inertia of the system before further displacing the control rods. A value of α = 25 is used in the present work, for a good balance between shortening the startup time and ensuring smooth increases in the reactor thermal power and the core temperatures during the simulated startup transients.
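The inequality in Equation (1) can be evaluated with a few lines of Python. The sketch below only checks whether the criterion holds for given inputs; the function names are hypothetical, the sample numbers are arbitrary, and how the PLC reacts when the criterion is not met is not specified in the text.

```python
import math


def reactivity_limit(drho_dt, tau, lambda_e, p_sp, p_rx, alpha=25.0):
    """Right-hand side of Equation (1).

    drho_dt  : rate of change of total reactivity (1/s)
    tau      : reactor period (s)
    lambda_e : effective delayed-neutron decay constant (1/s)
    p_sp     : power setpoint, same units as p_rx
    p_rx     : current reactor power
    alpha    : scaling coefficient (25 in the text)
    """
    rate = abs(drho_dt)
    return (rate / lambda_e + rate * tau * math.log(p_sp / p_rx)) / alpha


def criterion_satisfied(rho_total, drho_dt, tau, lambda_e, p_sp, p_rx, alpha=25.0):
    """True when rho_total is below the limit of Equation (1)."""
    return rho_total < reactivity_limit(drho_dt, tau, lambda_e, p_sp, p_rx, alpha)


# Arbitrary illustrative inputs (not values from the paper).
limit = reactivity_limit(drho_dt=1.0e-5, tau=100.0, lambda_e=0.1, p_sp=10.0, p_rx=10.0)
```

When the power equals the setpoint the logarithmic term vanishes, so the limit reduces to |dρ/dt| / (α λ_e).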

The Reactor Control PLC program incorporating the ML SL-LSTM and SAC-FNN algorithms inputs the current state variables to the trained neural network and determines the position of the reactor control elements from the network’s output. The displacement rate of the Group A and C control rods is determined from the difference between the desired control rods’ displacement and the present axial displacement, divided by the step period of 0.4 s. The obtained displacement rate is limited to ≤0.125 mm/s, the same as in the PLC with the PD controller. Unlike the PLC with the PD controller, the PLC program with the ML algorithms does not explicitly limit the control rod displacement using the restriction criterion in Equation (1). Instead, the PLC relies on the trained ML algorithm to adjust the control rod positions and ensure a smooth increase or decrease in the reactor power.
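The conversion from a network-predicted rod position to a rate command can be sketched as below. The function name is hypothetical, positions are taken in millimetres, and applying the 0.125 mm/s limit symmetrically to insertion and withdrawal is an assumption; the text states only the upper bound.

```python
STEP_PERIOD_S = 0.4    # step period stated in the text
MAX_RATE_MM_S = 0.125  # rate limit stated in the text


def displacement_rate(target_mm, current_mm,
                      step_period_s=STEP_PERIOD_S,
                      max_rate_mm_s=MAX_RATE_MM_S):
    """Rate needed to reach the target in one step, clipped to the limit."""
    raw_rate = (target_mm - current_mm) / step_period_s
    # Clip the magnitude; the sign keeps the direction of travel.
    return max(-max_rate_mm_s, min(max_rate_mm_s, raw_rate))


small_move = displacement_rate(100.02, 100.00)  # about 0.05 mm/s, within the limit
large_move = displacement_rate(110.0, 100.0)    # 25 mm/s requested, clipped
```

Because of the clipping, a network output far from the present rod position is approached gradually over many steps rather than in a single jump.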

3.1.2. HEX Secondary Flow PLC

In addition to the Reactor Control PLC, the VSLLIM microreactor has a PLC that adjusts the secondary Na flow through the helically coiled tubes of the Na/Na HEX using a Proportional–Integral (PI) control function (Figure 1). This function maintains the temperature of the in-vessel liquid Na entering the reactor core, T_in, constant at ~610 K. The input to the HEX PLC PI controller is the difference between the current in-vessel Na inlet temperature to the reactor core, T_in, and the setpoint of 610 K.
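A discrete PI function of this kind can be sketched as follows. The class name, the gains, and the output limits are hypothetical, since the text does not give them; the error sign is chosen so that an inlet temperature above 610 K requests more secondary flow.

```python
class PIController:
    """Minimal discrete PI controller with output clamping (illustrative)."""

    def __init__(self, kp, ki, setpoint, out_min, out_max):
        self.kp = kp
        self.ki = ki
        self.setpoint = setpoint
        self.out_min = out_min
        self.out_max = out_max
        self.integral = 0.0

    def update(self, measurement, dt):
        # Error is T_in minus the setpoint, as described in the text.
        error = measurement - self.setpoint
        self.integral += error * dt
        output = self.kp * error + self.ki * self.integral
        # Keep the commanded flow adjustment within the allowed range.
        return max(self.out_min, min(self.out_max, output))


# Hypothetical gains and limits; setpoint of 610 K is from the text.
hex_pi = PIController(kp=0.5, ki=0.1, setpoint=610.0, out_min=0.0, out_max=50.0)
first = hex_pi.update(612.0, dt=1.0)   # inlet 2 K too hot
second = hex_pi.update(610.0, dt=1.0)  # at setpoint, integral term remains
third = hex_pi.update(600.0, dt=1.0)   # too cold, output clamps at the minimum
```

This sketch omits anti-windup; a practical controller would typically stop integrating while the output is saturated.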

3.2. A Simulated Startup Transient of VSLLIM Microreactor

The VSLLIM microreactor digital twin model (Figure 4) simulates reactor startup from an initial subcritical condition to steady state operation at a user specified reactor thermal power setpoint. Figure 6 presents the results of a simulated startup transient based on the control rods’ reactivity worths in Figure 5. The startup transient in Figure 6 begins with the reactor initially subcritical with the in-vessel liquid sodium and the reactor core at 500 K. The startup procedure begins with the Reactor Control PLC fully withdrawing the ESD center assembly from the reactor core over a period of 240 s (Point 1 in Figure 6a). At this point, the reactor is still subcritical. Then the Reactor Control PLC axially withdraws the Group B control rods by 0.77 m over a period of 180 s for the reactor to achieve criticality (Point 2 in Figure 6a). Next, the PLC simultaneously withdraws the Group A and C control rods in the reactor core at a constant rate of 0.75 mm/s until the reactor power reaches a steady value of 100 kW_th (Point 3 in Figure 6a,b). Subsequently, the PD controller manages the withdrawal of the control rods to increase the reactor power to setpoint P_SP,1 = 0.5 MW_th. The PLC limits the movement rate of the Group A and C control rods to ≤0.125 mm/s to ensure a smooth rise of both the reactor power and the exit temperature of the liquid Na in the core (Figure 6b,c). The PLC for the Na/Na HEX increases the flow rate of the secondary liquid sodium in the helically coiled tubes to maintain the inlet temperature of the in-vessel sodium into the reactor core, T_in, at 610 K (Figure 6c,d). The reactor reaches steady state power of 0.5 MW_th at t = 2.38 h into the startup sequence (Figure 6b).
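The order of the startup steps described above can be summarized as a simple phase selector. This is an illustrative sketch only: the function and phase names are hypothetical, the selector is stateless, and the numerical thresholds are the ones quoted in this section.

```python
GROUP_B_TRAVEL_M = 0.77        # Group B withdrawal quoted in the text
GROUP_B_PERIOD_S = 180.0       # duration of the Group B withdrawal
POWER_HANDOVER_W = 100.0e3     # 100 kW_th, where PD control takes over

# Average Group B withdrawal rate implied by the two numbers above (mm/s).
GROUP_B_RATE_MM_S = GROUP_B_TRAVEL_M * 1000.0 / GROUP_B_PERIOD_S


def startup_phase(esd_withdrawn, group_b_withdrawn_m, power_w):
    """Return the name of the startup step that applies to the given state."""
    if not esd_withdrawn:
        return "withdraw_esd"            # ESD assembly out over 240 s
    if group_b_withdrawn_m < GROUP_B_TRAVEL_M:
        return "withdraw_group_b"        # approach to criticality
    if power_w < POWER_HANDOVER_W:
        return "constant_rate_A_C"       # Groups A and C at 0.75 mm/s
    return "pd_control"                  # PD controller, rate <= 0.125 mm/s


phase = startup_phase(esd_withdrawn=True, group_b_withdrawn_m=0.77, power_w=2.0e4)
```

The implied average Group B rate is about 4.3 mm/s, well above the 0.125 mm/s limit that applies once the PD controller takes over.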

The VSLLIM reactor operates at the power setpoint of 0.5 MW_th for a period allowing the remote operator and the on-site diagnostics to check out the systems prior to resuming the increase in the reactor power to 10 MW_th. The remote operator sends a command to the reactor controller to increase the reactor thermal power setpoint from 0.5 MW_th to 10 MW_th (Point 6 in Figure 6). The PD controller simultaneously displaces the Group A and C control rods to increase the external reactivity insertion and hence the reactor thermal power (Figure 6a,b), the circulation rate of the in-vessel liquid sodium, and the sodium exit temperature from the reactor core. The values of these parameters increase steadily over a period of 4.75 h until the reactor power reaches and levels off at 10 MW_th (Point 7 in Figure 6). The corresponding temperature and circulation rate of the in-vessel liquid sodium at the reactor core exit are 780.6 K and 46.0 kg/s, respectively.

3.3. Machine-Learning Training Data

The VSLLIM microreactor digital twin (Figure 4) generates the target data sets used to train the SL-LSTM and SAC-FNN algorithms. In the simulated startup transients, the Reactor Control PLC with PD controller manages the displacement of the Group A and C control rods in the core to increase the reactor thermal power in increments of 0.25 MW_th until reaching the low power setpoints P_SP,1 = 0.5–9.75 MW_th, and then the high setpoints P_SP,2 = 1.0–10.0 MW_th. The data sets in Figure 7 are those generated for different periods of 200 to 367 min during the simulated startup transients to change the reactor power from P_SP,1 to P_SP,2. Generating the data sets with the physics-based VSLLIM digital twin model ensures that the operation state variables used as features and targets are physically coupled as in the real reactor and are not independently varied. The VSLLIM digital twin model generated a total of 797 data sets comprising more than 956 million data points covering the VSLLIM microreactor power ranging from 0.5 to 10 MW_th (Figure 7). The SL-LSTM algorithm used the data sets for training, validation, and testing, while the SAC-FNN algorithm used the generated data as a target.

4. Training the SL-LSTM Algorithm

The training of the SL-LSTM algorithm uses the values of the input parameters (referred to as features) and of the desired output (referred to as the target) to build a function that predicts the output for unseen input data [4]. This algorithm uses LSTM recurrent neural networks [30] to process sequential time-series data inputs [31] and trains in iterative cycles referred to as epochs. Within each epoch the algorithm undergoes the five operations shown in Figure 8. First, it shuffles the order of the data sets provided and then randomly samples a mini batch of parameters (Point 1 in Figure 8). Shuffling the training data sets avoids learning biases associated with the order of the data within the sets.

The predictions generated by the LSTM network are based on the supplied input parameters in randomly selected mini batches from the training data sets (Point 2 in Figure 8). It then compares the predictions to the target values in the training data sets and calculates a loss function (Point 3 in Figure 8). Next, the algorithm performs backpropagation to calculate the gradients of the loss function with respect to the network’s parameters (Point 4 in Figure 8). These gradients are supplied to the optimizer module to update the weight and bias matrices of the LSTM network (Point 5 in Figure 8). These five steps are performed iteratively within each epoch for all the data in the training sets. The training process continues for sequential epochs until the value of the loss function converges and no longer changes with additional training epochs.
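The five operations within each epoch can be illustrated with a minimal sketch. To keep it self-contained, a one-parameter linear model stands in for the LSTM network and plain gradient descent stands in for the optimizer; the function name `train_epochs`, the learning rate, and the batch size are illustrative assumptions and are not taken from the present work.

```python
import math
import random


def train_epochs(samples, n_epochs=200, batch_size=4, lr=0.1, seed=0):
    """Fit y = w*x + b to (x, y) pairs; returns (w, b, rmse_per_epoch).

    Stand-in for the LSTM training loop: same five steps, much simpler model.
    """
    rng = random.Random(seed)
    data = list(samples)
    w, b = 0.0, 0.0
    rmse_per_epoch = []
    for _ in range(n_epochs):
        # Step 1: shuffle the data, then walk through it in mini batches.
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Step 2: predictions for the mini batch, expressed as errors.
            errors = [(w * x + b) - y for x, y in batch]
            # Step 3: loss. The mean squared error is differentiated here;
            # its square root (RMSE) is recorded once per epoch below.
            # Step 4: gradients of the mean squared error w.r.t. w and b.
            grad_w = 2.0 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            # Step 5: update the parameters (plain gradient descent).
            w -= lr * grad_w
            b -= lr * grad_b
        rmse = math.sqrt(sum(((w * x + b) - y) ** 2 for x, y in data) / len(data))
        rmse_per_epoch.append(rmse)
    return w, b, rmse_per_epoch
```

Training would stop once the recorded RMSE no longer changes between epochs, which mirrors the convergence criterion described above.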

The Python program that incorporates the SL-LSTM algorithm (Figure 9) uses the PyTorch 2.2 library [32]. Figure 9 is a flow diagram for two subsequent timesteps of the LSTM network in the present work, which has three layers of cells. The number of cells in each layer equals the size of the Lookback Window, n (Figure 9). In each timestep the Lookback Window includes the values of the features for the past (n) timesteps. For the LSTM network to learn the trends in the time-series data, the input features to a timestep (τ) include the values for the timesteps (τ − n) to (τ) [30].
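A sliding Lookback Window of this kind can be sketched with a fixed-length buffer. The class name `LookbackWindow` is hypothetical, and the sketch assumes that the window holds exactly n feature vectors ending at the present timestep; the exact indexing convention of the present work may differ.

```python
from collections import deque


class LookbackWindow:
    """Keeps the most recent n feature vectors, oldest first."""

    def __init__(self, n):
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self._buffer = deque(maxlen=n)

    def push(self, features):
        """Append the feature vector of the present timestep."""
        self._buffer.append(tuple(features))

    def ready(self):
        """True once n timesteps have been collected."""
        return len(self._buffer) == self._buffer.maxlen

    def as_sequence(self):
        """Return the window as a list ordered from oldest to newest."""
        return list(self._buffer)
```

Each new timestep shifts the window forward by one entry, so the sequence passed to the network always ends with the present values of the features.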

The trained SL-LSTM algorithm in the present work selects five primary features, namely: the reactor thermal power setpoint, and the transient values of the reactor thermal power, the reactor core inlet and exit temperatures, and the mass flow rate of the circulating liquid sodium through the core. Each data set contains the transient values of these parameters with a temporal discretization of 0.2 s. The LSTM network estimates a single target value of the normalized position of the Group A and C Control Rods, CRP*. PyTorch normalizes the values of the features to the first layer of cells (Figure 9) to the highest (x_max) and lowest (x_min) values in any of the supplied 797 training data sets. This ensures that the values of the normalized features fall within the interval from 0 to 1. For each feature value, x, the normalized value, x*, is calculated as follows:

x^{*} = \frac{x - x_{min}}{x_{max} - x_{min}} .

(2)

The features use the same values of x_max and x_min to train the SL-LSTM algorithm. The cells in the network receive and output data in two directions (Figure 9). For a given timestep (τ) the arrays of normalized features F* in the lookback window pass from the left in Figure 9 to the LSTM cells in the first layer. These cells also receive the hidden state, h, and the cell state, c, vectors passed from the top down. The algorithm randomly generates the initial values for the hidden state and cell state vectors, h_o and c_o. The values in the three layers in Figure 9 are numerated as h¹, h², and h³, and c¹, c², and c³, respectively.

The present values of the cells’ weight and bias matrices are used to compute the output vectors for the hidden states. The output hidden and cell state vectors pass downward to the cell within the same layer for the subsequent timestep in the lookback window (Figure 9). The hidden state vector also passes to the cell in the next layer for the same timestep in the lookback window (Figure 9).

The learned parameters within the cells control how it “remembers” or “forgets” information stored within the cell and to learn the time-dependent trends of the training data for calculating the output hidden state vectors. The hidden and state vectors pass along the network from left to right and from top to bottom. This is until the present timestep (τ) in the last layer (Layer 3 in Figure 9), where the cell calculates the output hidden state vector h_τ³. This vector passes to a linear node, which converts its output to a single scalar value between 0 and 1. This value is the predicted normalized position for the Group A and C control rods, CRP*.

The predicted axial displacement position, CRP, of the control rods is determined from de-normalizing the output value from the linear node by reversing the min–max normalizing in PyTorch as follows:

x = x^{*} (x_{max} - x_{min}) + x_{min},

(3)
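The min–max normalization of Equation (2) and its inverse in Equation (3) can be written as a short sketch. The function names are hypothetical, and the sketch assumes that x_min and x_max for a feature are taken over all the supplied data sets, as described above.

```python
def fit_min_max(data_sets):
    """Return (x_min, x_max) of one feature over all supplied data sets."""
    values = [v for data_set in data_sets for v in data_set]
    return min(values), max(values)


def normalize(x, x_min, x_max):
    """Equation (2): map x into the interval from 0 to 1."""
    return (x - x_min) / (x_max - x_min)


def denormalize(x_star, x_min, x_max):
    """Equation (3): recover the physical value from the normalized one."""
    return x_star * (x_max - x_min) + x_min
```

Because the same x_min and x_max are reused in both directions, de-normalizing a normalized value returns the original value up to floating-point rounding.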

The generated hidden and state vectors for the LSTM cells at the present timestep (τ) in the lookback window, h_g and c_g, are treated as the initial states to cells when calculating the CRP* for the next timestep (τ + 1) (Figure 9). The lookback window for this timestep shifts forward in time by one timestep and includes the input features for the timesteps (τ + 1) to (τ − (n − 1)).

The 797 data sets generated by the VSLLIM digital twin model (Figure 4) are divided into three groups, for training, validation, and testing. The SL-LSTM algorithm uses the Root Mean Square Error (RMSE) as the loss function for training, validation, and testing. This is calculated based on the difference between the predicted positions of the control rods, x, and the “true” values in the training datasets, as

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(x - x_{target})}^{2}},

(4)

In this expression, N is the total number of training data sets. The accuracy of the predictions is the percent relative error of the predicted position of the control rods from the target values in the data sets, expressed as

Accuracy (\%) = 100 \times \frac{(x - x_{target})}{x_{target}},

(5)
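Equations (4) and (5) can be evaluated with the following minimal sketch. The function names are hypothetical, and N is taken here as the number of compared values passed to the function, which is an assumption about how the sum in Equation (4) is accumulated.

```python
import math


def rmse(predicted, target):
    """Equation (4): root mean square error over N compared values."""
    n = len(predicted)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)


def percent_relative_error(x, x_target):
    """Equation (5): signed percent relative error of x from x_target."""
    return 100.0 * (x - x_target) / x_target
```

For example, `percent_relative_error(0.99, 1.0)` is approximately −1%. Reading a reported accuracy such as 99.93% as 100 minus the magnitude of this error is an interpretation made here and is not stated explicitly in the text.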

The SL-LSTM algorithm also employs the AdamW optimizer with a constant weight decay = 0.1. During training, the algorithm updates the weights and biases for the LSTM cells based on the calculated value of the loss function.

During the validation phase, the SL-LSTM algorithm calculates the RMSE of the predicted control rod position relative to the values in the validation datasets, called the validation loss, but does not update the weights and biases of the LSTM cells. The training loss determines how well the predictions of a trained model fit the provided data sets for training, while the validation loss indicates the expected performance for data not included in the training sets. The validation uses independent data sets to avoid underfitting and overfitting. The testing data sets are used to quantify the testing loss and accuracy of the trained SL-LSTM algorithm for predicting the displacements of the control rods in the VSLLIM reactor core, compared to those of the PLC with PD controller.

Results of the Trained SL-LSTM Algorithms

The performed parametric analyses optimize the hyperparameters of the trained SL-LSTM algorithm and investigate the effect of different parameters on accuracy and applicability to the controller of the VSLLIM microreactor. Appendix A details the investigated ranges and the effects of the hyperparameters on the accuracy of the trained SL-LSTM algorithm. Figure 10, Figure 11 and Figure 12 present example test results for a trained SL-LSTM algorithm with one layer of neurons, a hidden size of 15, and a learning rate of 0.001, using fifty-one randomly selected training data sets, nine validation sets, and one hundred testing sets. Figure 10 compares the calculated RMSE curves for training and validation losses. The training loss decreases to ~1 × 10⁻³ after only three epochs, and changes slightly thereafter (Figure 10). The validation loss oscillates but is of the same magnitude as the training loss (~3 × 10⁻³). These results confirm the successful training of the SL-LSTM algorithm after only a few epochs. The low testing loss of 1.56 × 10⁻³ confirms good predictive performance of the trained SL-LSTM algorithm for the 100 testing cases not included in its training data (Figure 10).

Figure 11 compares the predicted displacements of the Group A and C control rods in the VSLLIM reactor core with the target values in a testing case with a final power setpoint, P_sp,2, of 3.5 MW_th. The predictions of the trained SL-LSTM algorithms agree with the target values with an accuracy of 99.93%. Figure 12a,b show that the testing accuracies are similarly good for the other VSLLIM startup simulations. The determined accuracy displays a small spread between 99.43% and 99.93%, with an average weighted accuracy of 99.82% for the 100 testing data sets, with a testing loss of 1.56 × 10⁻³ (Figure 12a). These randomly selected testing data sets cover a range of final power setpoints from 3 to 10 MW_th (Figure 12b).

5. Training the SAC-FNN Algorithm

The FNN in the algorithm processes information in one direction, where the output values for a layer of neurons pass on to the inputs of the next layer of neurons (Figure 13). Unlike the SL-LSTM algorithm, the SAC-FNN algorithm does not make predictions based on previous timesteps’ data. Instead, the output is solely based on the present values of the features, F (Figure 13a,b). The Actor Network comprises an input layer with a single neuron, three hidden layers of many neurons each, and an output layer with two neurons (Figure 13a). The features are normalized using the same min–max normalization function used for the SL-LSTM algorithm Equation (2).

The array of normalized features, F*, passes through the input layer (Figure 13a), which passes the output values to the neurons in the first hidden layer. The output values, Y, are calculated from input values, X, based on the values of the neurons’ weight, w, and bias, b, and an activation function α, as follows:

Y = α (w X + b),

(6)
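Equation (6) applied to one fully connected layer can be sketched as follows. The activation function α is not specified in this section, so the rectified linear unit (ReLU) used below is an assumption, and the function names are hypothetical.

```python
def relu(z):
    """Rectified linear unit, used here as an assumed activation function."""
    return max(0.0, z)


def dense_layer(inputs, weights, biases, activation=relu):
    """Equation (6) for a layer: Y_j = activation(sum_k w[j][k] * X_k + b_j).

    weights[j][k] is the weight from input k to neuron j.
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]
```

Stacking several such layers, with the outputs of one layer passed as the inputs of the next, gives the one-directional flow of information that characterizes the FNN.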

The SAC-FNN algorithm updates the learned weight and bias parameters of the neurons in the FNNs during the training process. The mean (μ) and standard deviation (σ) output by the Actor Network define a normal distribution of the normalized control rod displacements (Figure 13a). The two neurons in the input layer of the Critic Network are for the array of the normalized state values, F*, and the corresponding normalized control rods position CRP* (Figure 13b). These values pass through the neurons in the State-Action layer and sequentially to each of the four hidden layers for the Critic Network. The output layer calculates the approximate action–value function, referred to as the Q-value.

The SAC-FNN algorithm is incorporated into a Python program using TensorFlow [33] with the Keras ML libraries [34] based on those proposed by Bae, Kim, and Lee [21]. These include a Training Environment and an Actor Network Update Algorithm (Figure 14). The Training Environment (Point 1 in Figure 14) couples the Episodic Actor Network to the Python Reactor Controller (described in Section 3.1). The environment links the controller to the VSLLIM digital twin model to control the movement of the control rods during the performed transient startup scenario (Figure 6). During each training episode the controller attempts to follow the startup scenario in the user supplied Target Data Set and bring the VSLLIM microreactor to the specified target power setpoint, P_sp,2. The trained SAC-FNN algorithm learns to reproduce the startup control actions of the PD controller displayed in the Target Data Sets (Figure 14). These sets are selected from among the 797 sets generated by the digital twin model of the VSLLIM microreactor (Figure 7).

In each timestep of the simulated startup transient (e.g., Figure 6), the Episodic Actor Network (Figure 13a) receives the features, F, from the VSLLIM digital twin model (Figure 4) and outputs the mean and standard deviation of the normalized control rod position. The Normal Distribution Sampler then samples a CRP* from the normal distribution defined by the calculated values of μ and σ. The sampled value is de-normalized using the min–max de-normalization Equation (3) and is passed on to the Python Controller.
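The sampling and de-normalization step can be sketched with the Python standard library. The function name `sample_crp` and the limits `crp_min` and `crp_max` are hypothetical, and clipping the sampled value to the interval from 0 to 1 before de-normalizing is an assumption that is not stated in the text.

```python
import random


def sample_crp(mu, sigma, crp_min, crp_max, rng=random):
    """Sample a normalized position CRP* ~ N(mu, sigma) and de-normalize it."""
    crp_star = rng.gauss(mu, sigma)
    # Assumption: keep the sample inside the normalized range [0, 1].
    crp_star = min(1.0, max(0.0, crp_star))
    # Equation (3): min-max de-normalization to a physical position.
    return crp_star * (crp_max - crp_min) + crp_min
```

Sampling from the distribution, instead of always taking the mean, is what lets the algorithm explore different control actions during training; a small σ concentrates the samples near μ.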

The controller calculates the displacement rates of the Group A and C control rods in the core of the VSLLIM microreactor (Figure 1). These rates are based on the difference between the predicted position of the control rods from the FNN and the present position in the digital twin model of the reactor. It then communicates the displacement rate of the control rods to the digital twin model to adjust the reactor operation parameters in the next simulation timestep. The Python Controller communicates with the digital twin using a POSIX shared memory function. The MATLAB Engine for Python [24] launches the digital twin model of the VSLLIM reactor at the start of each training episode (Figure 14).

At the end of the episode the reward value is calculated by the reward function for each timestep of the simulated transient (Point 2 in Figure 14). The reward function for the SAC-FNN algorithm uses a distance-based proportional reward Equation (7). A reward, R_i, is calculated separately for each of the three features, i, using the percent relative difference, E, between the value determined by the digital twin model of the microreactor and that in the target set, as follows:

R_{i} = 10 - \left( 100 \times \left| \frac{x - x_{target}}{x_{target}} \right| \right), \text{ for } E < 10\%

(7a)

R_{i} = - 10, \text{ for } E \geq 10\%

(7b)

The region in which E ≥ 10% defines the termination range. A reward of −10 is given for each timestep in which the features are within that range. The episode is terminated if any of the three features stays in the termination range for a continuous period of more than 60 s. This period provides the reactor controller time to self-correct back into the desirable region of E < 10%. The episode’s total reward is the sum of the cumulative reward for each of the three features over all the timesteps in the training episode. Therefore, episodes that terminate early receive a smaller total reward, while episodes where the controller maintains the features within the allowed range for a longer period will receive a higher total reward. The minimum total reward for an episode is limited to zero because negative rewards for training episodes resulted in poor learning for the Actor network.
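The reward of Equation (7), the 60 s termination rule, and the lower limit of zero on the total reward can be combined in a minimal sketch. The function names, the data layout (one tuple of feature values per timestep), and the timestep argument `dt` are illustrative assumptions.

```python
def percent_error(x, x_target):
    """Percent relative difference E between a value and its target."""
    return 100.0 * abs((x - x_target) / x_target)


def feature_reward(x, x_target):
    """Equation (7): 10 - E for E < 10%, otherwise -10."""
    e = percent_error(x, x_target)
    return 10.0 - e if e < 10.0 else -10.0


def episode_reward(trajectory, targets, dt, grace_s=60.0):
    """Sum the rewards of all features over an episode, floored at zero.

    trajectory and targets hold one tuple of feature values per timestep.
    The episode stops once any feature has stayed in the termination
    range (E >= 10%) continuously for more than grace_s seconds.
    """
    total = 0.0
    outside_s = [0.0] * len(targets[0])
    for values, target in zip(trajectory, targets):
        for i, (x, x_t) in enumerate(zip(values, target)):
            total += feature_reward(x, x_t)
            if percent_error(x, x_t) >= 10.0:
                outside_s[i] += dt
            else:
                outside_s[i] = 0.0  # the feature recovered; reset its timer
        if any(t > grace_s for t in outside_s):
            break  # early termination of the episode
    return max(0.0, total)
```

An episode that tracks the target closely accumulates close to 10 per feature per timestep, while one that terminates early contributes only the few timesteps completed before termination.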

At the end of each episode, the algorithm randomly selects sets of the features at different points of the episode, together with the corresponding predicted control rod positions and the calculated rewards. These values, referred to as experiences, are stored within the Replay Buffer (Point 3 in Figure 14), which holds the experiences of the current and all previous training episodes. The SAC-FNN algorithm randomly samples a batch of experiences from the Replay Buffer and passes them to the Actor Network Update function to update the Actor Network to improve its performance. Updates to the networks occur only at the end of the episode, and the algorithm does not update the Actor Network while it is controlling the digital twin model. The update function comprises the Actor Network, the Critic Network, and the Target Critic Network (Figure 14).
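A Replay Buffer of this kind can be sketched as follows. The names `Experience` and `ReplayBuffer` are hypothetical, and each experience holds only the three quantities named above; common SAC implementations also store the next state and a termination flag, which are omitted here.

```python
import random
from collections import deque, namedtuple

# One stored experience: normalized features, sampled CRP*, and its reward.
Experience = namedtuple("Experience", ["features", "crp_star", "reward"])


class ReplayBuffer:
    """Stores experiences from the current and all previous episodes."""

    def __init__(self, capacity=None, seed=None):
        # capacity=None keeps every experience; a number drops the oldest.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add_episode(self, experiences):
        """Append the experiences collected at the end of an episode."""
        self._items.extend(experiences)

    def sample(self, batch_size):
        """Draw a random batch without replacement for a network update."""
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)
```

Sampling at random from all stored episodes breaks the time correlation between consecutive timesteps, which is the usual motivation for a replay buffer in off-policy algorithms such as SAC.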

The Actor Network learns a policy to determine the control actions, the Critic Network learns the action–value function (called the Q-value function) to update the policy of the Actor Network, and the Target Critic Network helps stabilize the Critic Network by evaluating its performance in updating the policy of the Actor Network. The Target Critic Network calculates a target Q-value (Point 4 in Figure 14) that passes on to the Critic Objective Function to estimate the expected future reward for the Actor Network. This value is compared to past reward values to determine the updates for the weight and bias matrices in Critic Network (Point 5 in Figure 14).

The updated Critic Network uses sampled experiences from the Replay Buffer and the policy actions of the Actor Network to calculate the Q-value for the Actor Objective Function (Point 6 in Figure 14). It then updates the weights and biases in the Actor Network to maximize the episodic reward (Point 7 in Figure 14). The SAC algorithm copies the parameters of the Critic Network to the Target Critic Network to improve the controller’s behavior in the next episode (Points 8, 9 in Figure 14). The process continues in subsequent episodes until the SAC algorithm successfully trains the Episodic Actor Network. A successful episode is the one in which the Episodic Actor Network of the trained SAC-FNN algorithm successfully increases the thermal power of the VSLLIM microreactor during the simulated startup transient from an initial setpoint P_sp,1 = 0.5 MW_th, to the final setpoint P_sp,2 = 10.0 MW_th.

Implemented SAC-FNN Algorithm Results

The implemented SAC-FNN algorithm performed 25 different training runs or cases, labeled A–Y. The five cases labeled A–E are for troubleshooting and optimization prior to conducting the actual training cases labeled F through Y. The selected and varied hyperparameters in the SAC-FNN training cases are listed in Appendix B. As an example of the training results, Figure 15 plots the changes in the episodic reward during the training case Q with 256 neurons per hidden layer of the FNN. The small initial reward is due to the early termination of the episodes when one of the state variables, either the reactor thermal power, the reactor Na exit temperature, or the in-vessel Na mass flow rate, continuously exceeds the specified termination range for a period of more than 60 s. Eventually, the trained SAC-FNN algorithm successfully completes the simulated startup transient in episode 53, as indicated by the large total reward.

Not all performed training cases produced successful episodes in which the SAC-FNN algorithms complete the VSLLIM startup scenario to the final reactor power setpoint of 10.0 MW_th. Thirteen successfully trained algorithms are produced during 25 training cases of the SAC-FNN algorithm. The results of the performed parametric analyses of varying the number of neurons per layer showed that only the networks with 3 layers and 64 and 256 neurons per layer produced successful training cases. The training cases R, X, and Y with networks of 64 neurons per layer produced a total of nine successful episodes. The training cases K, P, and Q with networks of three layers and 256 neurons per layer produced four successful episodes. The training cases with networks of three layers of only 32 and 16 neurons per layer did not produce any successful episodes.

Figure 16 plots the predicted position of the Group A and C control rods in the Core of the VSLLIM microreactor by the trained SAC-FNN algorithm versus the target values in the simulated startup transients. The nine successfully trained SAC-FNN algorithms, each of three layers and 64 neurons per layer in cases R, X, and Y accurately predict the control rods’ position to within +0.3% and −1.6% of the target (Figure 16a–c). The four successfully trained SAC-FNN algorithms of three layers and 256 neurons per layer in cases K, P, and Q accurately predict the position of the control rods within +0.5% and −1.2% of the target values (Figure 16d).

To train successful models, the SAC-FNN algorithm requires more computational time than the SL-LSTM algorithm. Training the SAC-FNN algorithm with three layers and 64 neurons per layer to successfully complete the startup transient required an average of ~83 training episodes. This required an average of 54 h of computational time on an Ubuntu 20.04 LTS workstation with an AMD 3970X 32-core processor and 256 GB of RAM. Training the SAC-FNN algorithm with three layers and 256 neurons per layer required an average of ~80 training episodes with an average of 40 h of computational time. In contrast, on the same Ubuntu workstation the SL-LSTM algorithms required only 4–10 h of computational time to train a successful model. Results showed that increasing the number of neurons did not necessarily increase the required training time.

6. Evaluating the Real-Time Controller

The trained SL-LSTM and SAC-FNN algorithms are integrated into the developed Python program for the Reactor Control PLC to adjust the displacements of the control rods during the simulated startup transients using the VSLLIM digital twin model (Figure 4). During testing, the digital twin model runs synchronously to a real-time clock with a small timestep of 20 ms to produce a fine temporal discretization and a better approximation of a continuous data source. This allows the PLC to interact effectively and realistically with the VSLLIM digital twin dynamic model. The actions of the PLC are delayed by the time required for the signals of the reactor operating parameters to reach the controller, and for the generated control signals to reach the digital twin model of the reactor, providing a more realistic testing environment for the controller. In the present work the values of the state variables generated by the digital twin model of the reactor and passed on to the PLC do not include artificial sensor noise.
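Running a model synchronously to a real-time clock can be illustrated with a simple pacing loop. This is a generic sketch and not the mechanism used by the digital twin model; the function name `run_realtime` and the callback `step_fn` are hypothetical, and the clock and sleep functions are passed in as arguments so that the loop can be exercised without waiting.

```python
import time


def run_realtime(step_fn, dt=0.020, n_steps=5,
                 clock=time.monotonic, sleep=time.sleep):
    """Call step_fn once per dt seconds of wall-clock time.

    step_fn receives the simulated time at the end of the step.
    Returns the number of steps that finished after their deadline.
    """
    start = clock()
    overruns = 0
    for k in range(1, n_steps + 1):
        step_fn(k * dt)  # advance the model to simulated time k * dt
        remaining = start + k * dt - clock()
        if remaining > 0:
            sleep(remaining)  # wait for the wall clock to catch up
        else:
            overruns += 1  # the step took longer than dt
    return overruns
```

Scheduling each step against the absolute deadline `start + k * dt`, and not against the end of the previous step, avoids the accumulation of timing drift over a long transient.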

The LOBO Nuclear CyberSecurity (NCS) platform, developed by the University of New Mexico’s Institute for Space and Nuclear Power Studies (UNM-ISNPS) in collaboration with Sandia National Laboratories [35,36,37,38], links the Reactor Control PLC to the digital twin model of the VSLLIM reactor (Figure 4). This platform uses the Modbus Industrial Control System (ICS) protocol to manage communication over an isolated Ethernet test network between the PLC program and the server running the digital twin model of the VSLLIM microreactor in real time (Figure 17). The controller uses two data channels: a Modbus TCP channel, which communicates with the LOBO NCS platform, and a TCP/IP channel for communicating with the remote reactor operator.

The Modbus communication channel for the LOBO NCS data broker receives the calculated values of the state variables from the reactor digital twin model and stores them in Modbus holding registers (Figure 1). It also passes the Modbus control signals sent by the PLC to the digital twin model, which then enacts the transmitted control signals and displaces the control rods in the reactor core. The measured latency of the Modbus communication between the PLC and the VSLLIM digital twin model in the Ethernet testing network is ~0.2 ms on average. The TCP/IP channel receives commands from the remote operator to start up, shut down, or change the steady state setpoint for the reactor thermal power. The PLC transmits back the status of the present actions and the values of the state variables stored in the controller’s Modbus holding registers. The remote operator station has a large screen with a Graphical User Interface (GUI) for monitoring the values of state variables in real time during the simulated startup transients (Figure 17).

In each scan cycle the reactor control PLC reads the most recently received state variables in the holding registers, before passing them to the control logic program with the trained ML algorithms. They manage the movement of the Group A and C control rods in the core of the VSLLIM microreactor during the simulated startup transients using the digital twin model (Section 3.1).

The trained SL-LSTM algorithm receives an array of the values of the operation variables for the present and the previous scan cycles in the lookback window. In contrast, the trained SAC-FNN algorithm receives an array of only the present values of the operational variables. The PLC determines and implements the displacement rate of the Group A and C control rods in the core of the VSLLIM microreactor. These rates are commensurate with the positions of the rods determined by the trained algorithms. The program writes the commanded displacement rates to the Modbus holding registers of the PLC and passes them back to the LOBO NCS data broker. The data broker in turn passes the commanded movement rates to the digital twin model through a shared memory communication bridge for action. The next sections present the testing results of the trained SL-LSTM and SAC-FNN algorithms integrated with the PLC of the VSLLIM microreactor.

6.1. Results of the Control PLC with the Trained SAC-FNN Algorithm

The testing results presented in this subsection evaluate the performance of the trained SAC-FNN algorithm while integrated into the Reactor Control PLC and coupled to the digital twin model of the VSLLIM microreactor (Figure 4) using the LOBO NCS platform (Figure 17). Results are the predicted position of the control rods in the VSLLIM microreactor core by the successfully trained SAC-FNN algorithm and of the corresponding reactor thermal power determined by the digital twin model. These results are compared to those generated in the simulated startup transient of the VSLLIM microreactor connected to the PLC with the PD controller.

Figure 18a,b compare the predicted positions of the control rods in the cases R and K of the trained SAC-FNN algorithms of three layers of 64 and 256 neurons per layer, respectively, to those determined by the PLC with the PD controller. The PLC with the Case R trained SAC-FNN algorithm slightly underpredicts the control rod positions to within −0.6% of the values determined by the PLC with the PD controller (Figure 18a). At the end of the simulated startup transient, the reactor thermal power determined using the PLC with SAC-FNN algorithm is 9.8 MW_th compared to the target of 10.0 MW_th (Figure 18c).

The predicted positions of the control rods in the reactor core by the PLC with the Case K trained SAC-FNN algorithm of 256 neurons per layer are in good agreement, to within +0.7% and −0.5%, with the values calculated using the PLC with PD controller (Figure 18b). The reactor thermal power predicted using the PLC with the SAC-FNN algorithm levels off at a steady state value of 9.93 MW_th, slightly lower than the target of 10.0 MW_th (Figure 18d). The inserts in Figure 18c,d compare the small adjustments in the reactor power during the simulated startup transient. The PLC with the PD controller, which generates the target sets used during training, applies the rate limiting function Equation (1). This function restricts the displacement of the control rods in Group A and C to limit the change in the core external reactivity. Therefore, during the simulated startup transient the thermal power of the VSLLIM microreactor increases in small steps after accounting for the negative temperature reactivity feedback.

Although the PLC with the trained SAC-FNN algorithm does not have a rate limiting function Equation (1), it successfully learned to adjust the displacement of the control rods to increase the reactor power gradually without spikes (Figure 18c,d). The PLC with the trained SAC-FNN algorithm of 256 neurons per hidden layer experiences larger oscillations in the predicted reactor thermal power during the simulated startup transient up to 3000 s. These oscillations are smaller for the startup transient controlled by the PLC with the trained SAC-FNN algorithm of 64 neurons per hidden layer (Figure 18c,d). Nonetheless, both algorithms in Figure 18 predict similar rates of increase of the reactor power as the reference PLC with PD controller.

During the simulated startup transient of the VSLLIM microreactor, the results in Figure 18 show that the trained SAC-FNN algorithms successfully control the movement of the Group A and C control rods (Figure 1b) to increase the reactor power from 0.5 MW_th to a final setpoint of 10 MW_th. The other eleven successfully trained SAC-FNN algorithms incorporated into the PLC demonstrated similar behaviors, as those shown in Figure 18. They smoothly increase the thermal power of the VSLLIM microreactor during the simulated startup transient to the final steady state power setpoint.

Figure 19a–d compare the calculated changes in the thermal power of the reactor controlled by the PLC with the trained Case Q SAC-FNN algorithm of three layers and 256 neurons per layer in four simulated startup transients. They begin from an initial reactor power setpoint P_SP,1 = 1.0 MW_th and continue to final power setpoints, P_SP,2, of 10.0 MW_th, 7.5 MW_th, 5.0 MW_th, and 2.0 MW_th, respectively. For P_SP,2 = 10.0 MW_th the PLC with the trained SAC-FNN algorithm withdraws Group A and C control rods to increase the VSLLIM microreactor power during the simulated startup transient. During the first ~11,400 s of the simulated startup transient the reactor thermal power is close to that calculated for the reactor controlled by the PLC with PD controller. Beyond that time, and while approaching P_SP,2 = 10.0 MW_th, the predicted rate of displacement of the control rods of Group A and C in the reactor core is 1.3% lower than the value determined by the PD controller (Figure 19a).

In the simulated startup transients, the PLC with the trained SAC-FNN algorithm smoothly displaces the Group A and C control rods in the reactor core (Figure 19a,b). However, the predicted final reactor power by the PLC with the trained Case Q SAC-FNN algorithm is 2.8% above the target of 7.5 MW_th (Figure 19b). In the simulated startup transients of the VSLLIM microreactor to P_SP,2 = 5.0 MW_th, the predicted final reactor power is ~2.5% higher than the target value (Figure 19c). For the lowest power setpoint of P_SP,2 = 2.0 MW_th, the prediction of the PLC with the trained SAC-FNN algorithm matches the final target power to within 0.1% (Figure 19d). It is worth noting that the trained algorithms do not display a consistent bias to over- or underpredict the final reactor power.

Even though the SAC-FNN algorithms in this work are trained for a single target startup scenario of increasing the reactor power from 0.5 to 10.0 MW_th, the reactor control PLC generally performs well for lower P_SP,2 values. In conclusion, the PLC with the trained SAC-FNN algorithms performs well. The predicted displacements of the Group A and C control rods in the core of the VSLLIM microreactor during the simulated startup transients are comparable to those determined by the PLC with PD controller used to generate the training data for the SAC-FNN algorithms.

6.2. Performance of the Reactor Control PLC with the Trained SL-LSTM Algorithms

The VSLLIM Reactor Control PLC with the trained SAC-FNN algorithms performs well for real-time control of the VSLLIM microreactor. However, the performance of the PLC with the trained SL-LSTM algorithms is unsatisfactory. Figure 20 compares the predicted thermal power in a simulated startup transient using the PLC with both the trained SL-LSTM and SAC-FNN algorithms. The results presented for the trained SL-LSTM algorithms are for three different cases with a hidden size of 10 and a lookback window of 20. In the simulated startup transient, the PLC increases the reactor power from an initial value of 1.0 MW_th up to 10 MW_th.

During the first 6000 s of the simulated startup transient, the PLC with a trained SL-LSTM algorithm of two layers and the external reactivity, ρ_ex, as a feature (see Appendix A.6), increases the reactor power in good agreement with the predictions using the PD controller. However, the reactor power calculated by the digital twin model levels off at a steady state value of 6.37 MW_th, which is well below the target setpoint of 10 MW_th. The PLC with the trained SL-LSTM algorithms of two or three layers, and ρ_ex as a feature, reaches the correct reactor power setpoint of 10 MW_th. However, these algorithms rapidly displace the Group A and C control rods, causing the reactor power to rise faster than the target values calculated using the PLC with PD controller.

In contrast, the PLC with the trained Case Y SAC-FNN algorithm of three layers and sixty-four neurons per layer smoothly increases the reactor power in close agreement with the predictions using the PLC with PD controller. The reactor power levels off at a steady state value of 10.02 MW_th, only 0.2% higher than the target power setpoint. This acceptable performance for the trained Case Y SAC-FNN algorithm in Figure 20 is consistent with those displayed for the Cases K, R, and Q algorithms in Figure 18 and Figure 19 for real-time control during the simulated startup transients.
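
As a check on the quoted deviations, the percent relative difference between the final reactor power and the setpoint can be evaluated directly. The form of E below is an assumption (the conventional relative difference with respect to the setpoint), since this section does not restate the definition. The power values are those quoted in the text for the Case Y SAC-FNN algorithm and for the SL-LSTM case that levels off early.

```latex
E = \frac{P_{Rx} - P_{SP,2}}{P_{SP,2}} \times 100\%

% Case Y SAC-FNN: final power of 10.02 MW_th
E = \frac{10.02 - 10.0}{10.0} \times 100\% = +0.2\%

% SL-LSTM case that levels off at 6.37 MW_th
E = \frac{6.37 - 10.0}{10.0} \times 100\% = -36.3\%
```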

The results for the Reactor Control PLC with the trained SL-LSTM algorithms show that despite the high testing accuracy of the algorithms (Figure 12a), the real-time control performance is poor and inconsistent (Figure 20). The PLC with the trained SL-LSTM algorithms does not self-correct when the values of the input features differ from those in its training data sets. Owing to the highly nonlinear kinetics of the reactor, a small change in external reactivity due to displacement of the control rods in the core results in larger changes in the reactor thermal power and, hence, the features used to train the SL-LSTM algorithm. Examples are the reactor core exit temperature and the in-vessel Na flow rate through the core by natural circulation. Despite the high testing accuracy of the SL-LSTM algorithms during training, a small difference in the predicted displacement of the control rods during real-time testing causes the reactor power to significantly deviate from the target values. The SL-LSTM algorithms do not incorporate control feedback during the training process but are trained using pre-generated data sets covering a wide range of target curves of the simulated startup transients with different reactor power setpoints. Consequently, the algorithms did not learn during training how the VSLLIM digital twin model responds to different predictions of the control rods positions from those in the training data sets. This contrasts with the trained SAC-FNN algorithms that learn to moderate the predicted displacements of the control rods and self-correct so that the increases in the reactor power agree with the target values (Figure 18, Figure 19 and Figure 20).

7. Summary and Conclusions

This work trained and investigated the performance of two different ML algorithms for remote operation and control of the VSLLIM microreactor during simulated startup transients. The trained SL-LSTM and SAC-FNN algorithms are incorporated into a PLC program for real-time control of a digital twin model of the VSLLIM reactor. The results compare the performance of the trained algorithms during training and real-time remote control of the VSLLIM microreactor. The developed physics-based MATLAB Simulink dynamic model serves as the digital twin of this microreactor. The model generated 797 data sets for training the ML algorithms. These data sets are of the startup transients from an initial subcritical condition to different steady state power levels of the VSLLIM microreactor up to 10 MW_th.

The trained SL-LSTM algorithms predicted the position of the Group A and C control rods in the core of the VSLLIM reactor during the simulated startup transients with an accuracy of >99.90%. However, this high accuracy did not translate to good real-time control of the reactor by the PLC with the trained SL-LSTM algorithms. The PLC withdraws the control rods either too rapidly or too slowly, compared to the targets using the PLC with PD controller. These results may be caused by the absence of feedback to adjust the predictions during the SL-LSTM algorithm training. Consequently, the PLC with the SL-LSTM algorithms does not self-correct when the state variables differ from the target values. Increasing the number of training data sets did not increase the predictive accuracy of the SL-LSTM algorithm (Appendix A.3). Owing to the absence of feedback, it is unlikely that the real-time performance of the PLC with SL-LSTM algorithms would improve with further training.

In contrast, the thirteen different trained SAC-FNN algorithms incorporated into the Reactor Control PLC successfully completed the simulated startup transients of the VSLLIM microreactor. The PLC with the trained SAC-FNN algorithms displaces the control rods for a steady rise in the reactor power that matches that of the PLC with PD controller. Four of the trained SAC-FNN algorithms that comprise three layers with 256 neurons per layer, and nine of three layers with 64 neurons per layer, performed well. Unlike the PLC with the trained SL-LSTM algorithms, those with these SAC-FNN algorithms demonstrated good real-time control of the VSLLIM microreactor. In the SAC-FNN algorithms the reward feedback for actions during training helps them to take corrective actions to adjust the reactor thermal powers to match target values during the simulated startup transients.

The predictions of the thirteen trained SAC-FNN algorithms agree with the target displacement curves of the Group A and C control rods in the microreactor core to within ±1.6%. The PLC with the nine trained SAC-FNN algorithms of 64 neurons per layer reaches a final reactor power 9.5% lower than the setpoint of 10.0 MW_th. The PLC with the four trained SAC-FNN algorithms of 256 neurons per layer displayed superior performance, reaching final reactor power levels that are ~0.5% higher than the setpoint of 10.0 MW_th. These trained algorithms with larger numbers of neurons learn better during the simulated startup transients, and their predictions closely match the target data. The predicted final reactor powers for the PLC with the trained SAC-FNN algorithms differ slightly from the setpoint of 10.0 MW_th used during training. Nonetheless, the algorithms smoothly and accurately increase the reactor power to the target values.

In conclusion, the present research demonstrated that the trained SAC-FNN algorithms are a viable choice for remote control of the VSLLIM microreactor during simulated startup transients. The implemented PLC controller with the trained algorithms in this work can monitor the operation of the reactor and send commands to the control rod actuators, as well as communicate with a remote operator station. The remote-control function with securely encrypted and transmitted command signals and monitoring data is demonstrated in our computational laboratory at UNM-ISNPS using an isolated Ethernet network.

Author Contributions

Conceptualization, and methodology, M.S.E.-G. and T.M.S.; methodology, parametric analyses and results validation and testing, T.M.S., A.N.S. and M.S.E.-G.; resources and data curation, M.S.E.-G., T.M.S. and A.N.S.; writing—original draft preparation, M.S.E.-G. and T.M.S.; writing—review and editing, M.S.E.-G. and T.M.S.; visualization, supervision, project administration, and funding acquisition, M.S.E.-G. and T.M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a DOE Office of Nuclear Energy’s Nuclear Energy University Programs award through a subaward from Purdue University to the University of New Mexico’s Institute for Space and Nuclear Power Studies under contract DE-NE0009268.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

The abbreviations and symbols in this manuscript:

A2C	Advantage Actor Critic
A3C	Asynchronous Advantage Actor Critic
b	Neurons biases
BWR	Boiling Water Reactor
c	LSTM cell state vector
c¹	First layer cell state vector
c²	Second layer cell state vector
c³	Third layer cell state vector
CNN	Convolutional Neural Network
CRP	Control Rod Position
CRP*	Normalized Control Rod Position
DDPG	Deep Deterministic Policy Gradient
E	Percent relative difference
ESD	Emergency Shut Down
F	Feature of State Variables Array
F*	Normalized Feature of State Variables Array
FNN	Feed-forward Neural Network
FPY	Full Power Years
GRU	Gated Recurrent Unit
h	LSTM hidden state vector
h¹	First layer hidden state vector
h²	Second layer hidden state vector
h³	Third layer hidden state vector
HEX	Heat Exchanger
I&C	Instrumentation and Control
LMHP-TE	Liquid Metal Heat Pipe-Thermoelectric auxiliary power generation modules
LOBO NCS	LOBO Nuclear CyberSecurity
LOCA	Loss of Coolant Accident
LR	Learning Rate
LSTM	Long Short-Term Memory
LWR	Light Water Reactor
ṁ	Mass flow rate (kg/s)
ML	Machine Learning
MMR	Modular Microreactor
MPC	Model Predictive Control
n	Lookback Window size
N	Number of data sets
P_Rx	Reactor Thermal Power (kW, MW)
P_SP	Power setpoint (MW_th)
P_SP,1	Initial power setpoint (MW_th)
P_SP,2	Final power setpoint (MW_th)
PB-HTGR	Pebble Bed-High Temperature Gas-cooled Reactor
PCM	Percent-Mille
PD	Proportional-Differential
PI	Proportional-Integral
PID	Proportional-Integral-Differential
PLC	Programmable Logic Controller
PPO	Proximal Policy Optimization
PWR	Pressurized Water Reactor
R	Reward
RC	Reactor Control
RL	Reinforcement Learning
RMSE	Root Mean Square Error
SAC	Soft Actor Critic
SAC-FNN	Soft Actor Critic algorithm with Feed-forward Neural Network
SL	Supervised Learning
SL-LSTM	Supervised Learning algorithm with Long Short-Term Memory network
T_in	Sodium coolant inlet temperature (K)
T_ex	Sodium outlet temperature (K)
T_in^Rx	Reactor core inlet temperature (K)
TD3	Twin Delayed Deep Deterministic Policy Gradient
UNM-ISNPS	University of New Mexico’s Institute for Space and Nuclear Power Studies
VSLLIM	Very-Small, Long-Life, Modular
Y	Output value
w	Neurons weights
X	Feature value
X*	Normalized feature value
X_max	Feature highest value
X_min	Feature lowest value
Greeks
α	Controller scaling coefficient, Activation function
Δρ	Reactivity insertion/feedback ($)
λ_e	Effective constant for a delayed neutron group (s⁻¹)
μ	Actor network mean
ρ	Total reactivity ($) or PCM
σ	Actor network standard deviation
τ	Reactor period, timestep number
Subscripts and Superscripts
0	Initial state
b	Bulk
BeO	Beryllium Oxide
CR	Control Rod
Dop	Doppler effect
ex	External
fb	Feedback
fuel	nuclear fuel
g	Generated state
HEX	Na/Na heat exchanger
Na	Liquid sodium
Rx	Reactor
SP	Setpoint

Appendix A. Hyperparameters in the SL-LSTM Algorithms

The parametric analyses performed investigate different model hyperparameters of the trained SL-LSTM algorithms and evaluate their effects on the learning curves and the accuracy of the predictions. Table A1 lists the hyperparameter values used in the parametric analyses. The parameters investigated are the Learning Rate (LR) of the SL-LSTM algorithms, the size of the lookback window and the length of the sequence, and the sizes of the training, validation, and testing data sets. Also investigated are the number of training sets for each final reactor power setpoint P_SP,2, the order of the training sets, the hidden size and the number of LSTM layers, and the inclusion of additional parameters as features for training the algorithms. The following subsections detail the results of the performed parametric analyses.

Table A1. Summary of hyperparameters used for training the SL-LSTM algorithms.

Hyperparameter	Value
Constant Hyperparameters
Normalization Range	[0, 1]
No. of targets	One
$h_{0}$, $c_{0}$	Zero vectors
Loss Metric	RMSE
Optimizer	Adam
Weight decay coefficient	0.01
Number of Training Epochs	25
Mini-batch Size	64
Varied Hyperparameters
Learning Rate	Constant LR = 0.001 and 0.1 LR scheduler with initial LR = 0.001
Lookback Window	1, 5, 10, 15, 20, 64, 250, 500
No. of Training Sets	Varied between 10 and 626, with random selection and ordering
No. of Training Sets for Different Power Levels	One, two, three, four, five sets for each of the forty-seven final setpoint powers $P_{SP, 2}$. All models used the same randomly selected data sets
Order of Training Sets by $P_{SP, 2}$	Low-to-high, High-to-low, Random Shuffle
Order of Training Sets by $P_{SP, 1}$	Low-to-high, High-to-low, Random Shuffle
Hidden Size	5, 10, 15, 20, 25, 30
No. of Layers	One, two, three
Additional Features	Time derivatives as additional features [$d P_{Rx} / d t$, $d T_{in} / d t$, $d T_{out} / d t$, and $d \dot{m} / d t$]; $\rho_{total}$; $\Delta \rho_{ex}$; $\Delta \rho_{ex}$ and $\Delta \rho_{fb}$; replacing $P_{SP}$ with ($P_{Rx} - P_{SP}$)
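
Table A1 lists a normalization range of [0, 1], and the Nomenclature defines X*, X_min, and X_max. A minimal Python sketch of min-max scaling consistent with those symbols is given below. How X_min and X_max are determined in this work (per feature, and over which data sets) is not stated here, so the function names and the sample power values are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values into [0, 1] via X* = (X - X_min) / (X_max - X_min)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0.0:
        # A constant feature has no spread; map it to 0 by convention.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


def denormalize(normalized: Sequence[float], x_min: float, x_max: float) -> List[float]:
    """Invert the scaling: X = X_min + X* (X_max - X_min)."""
    return [x_min + x_star * (x_max - x_min) for x_star in normalized]


# Hypothetical reactor power samples in MW_th (not taken from the paper's data sets).
power_mw = [0.5, 2.0, 5.0, 7.5, 10.0]
power_norm = min_max_normalize(power_mw)
power_back = denormalize(power_norm, min(power_mw), max(power_mw))
print(power_norm)
```

The inverse mapping matters for control: a network trained on normalized targets such as CRP* returns values in [0, 1] that must be converted back to physical rod positions before they are sent to an actuator.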

Appendix A.1. Effect of the Learning Rate

Results show that using an LR of 0.1 is inadequate for convergence of the SL-LSTM algorithm during training. The implemented LR scheduler, which adjusted the learning rate during the training process, decreased the LR to a very low value of 1 × 10⁻⁶. This low LR limited learning in subsequent epochs, decreasing the validation loss but increasing the training loss as the number of epochs increased. The best training results are those for a constant learning rate LR = 0.001. This rate decreases the training and validation losses to the order of 10⁻³, which is indicative of good convergence of the algorithm.
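
The drop from the initial LR of 0.001 to about 1 × 10⁻⁶ corresponds to three successive reductions if the scheduler multiplies the LR by 0.1 each time. That multiplicative reading of the "0.1 LR scheduler" entry in Table A1 is an assumption, as is the helper name below. The sketch only reproduces the arithmetic and does not model when the reductions are triggered.

```python
def lr_after_reductions(initial_lr: float, factor: float, n_reductions: int) -> float:
    """Learning rate after n multiplicative reductions by `factor`."""
    return initial_lr * factor ** n_reductions


initial_lr = 0.001  # initial LR listed in Table A1
factor = 0.1        # assumed multiplicative decay factor of the scheduler

# LR after 0, 1, 2, and 3 reductions.
schedule = [lr_after_reductions(initial_lr, factor, k) for k in range(4)]
for k, lr in enumerate(schedule):
    print(f"after {k} reductions: LR = {lr:.1e}")
```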

Appendix A.2. Effect of the Size of the Lookback Window

With a lookback window size of 1, 5, or 10, the trained SL-LSTM algorithms fail to capture the temporal dependence of the time-series in the training data sets. Also, the training loss did not converge, and the testing accuracy was low. With large sizes of the lookback window of 250 or 500, the training algorithm fails. This is because the backpropagation step could not calculate the gradient of the loss function. With intermediate values of 15, 20, and 64 for the lookback window, the loss function converges at a high training accuracy of 99.5%. However, the lookback window size of 64 significantly increases the training time to 10 h compared to only 4 h with a window size of 20.
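
The lookback window sets how many consecutive timesteps of the features the LSTM network sees for each prediction. The sketch below shows one common way to slice a time series into such windows in Python. Pairing each window with the target at its last timestep is an assumption, as are the function name and the synthetic data; the alignment used in this work may differ.

```python
from typing import List, Sequence, Tuple

Window = List[List[float]]  # lookback timesteps x number of features


def make_lookback_windows(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    lookback: int,
) -> Tuple[List[Window], List[float]]:
    """Slice a time series into overlapping windows of `lookback` timesteps."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    if not 1 <= lookback <= len(features):
        raise ValueError("lookback must be between 1 and the series length")
    windows: List[Window] = []
    labels: List[float] = []
    for end in range(lookback, len(features) + 1):
        windows.append([list(row) for row in features[end - lookback:end]])
        labels.append(targets[end - 1])  # target at the window's last timestep
    return windows, labels


# Synthetic transient: 100 timesteps of five features (placeholder values).
n_steps, n_features = 100, 5
features = [[step / n_steps] * n_features for step in range(n_steps)]
targets = [step / n_steps for step in range(n_steps)]

windows, labels = make_lookback_windows(features, targets, lookback=20)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 81 20 5
```

A series of T timesteps yields T − n + 1 windows for a lookback of n, and each window is unrolled over n steps during backpropagation. The cost per sample therefore grows with n, which is in line with the longer training time reported for a window size of 64.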

Appendix A.3. Effect of Increasing the Number of Training Data Sets

The parametric analyses performed varied the number of training sets from 10 to 626 to determine the effect on the training performance. With 10 to 40 randomly selected training data sets, accuracy during testing is low, with the AWA ranging from 97% to 98%. Figure A1 compares the effect on the accuracy of varying the number of training sets for SL-LSTM algorithms with hidden sizes of 15 and 20 and one layer of neurons. Each trained algorithm is tested using the same randomly selected 100 testing data sets. Increasing the number of training sets beyond 45 did not significantly increase the accuracy of the trained algorithm for the testing data sets, and in some cases decreased the AWA. The spread between the max and min accuracy values generally increased as the number of training data sets increased.

Figure A1. Effects of increasing the number of training files and the hidden size on the accuracy of the trained SL-LSTM algorithms.

Appendix A.4. Effect of Hidden Size and Number of Neurons Layers

Employing more layers of neurons but a smaller hidden size slightly increases the average weighted accuracy of the trained SL-LSTM algorithms. With a hidden size of 20, one layer of LSTM cells, and two training data sets per P_SP,2, the estimate of the average weighted accuracy of the algorithms is 99.84–99.90% with 88 training data sets, 99.75% with 129 training data sets, 99.56–99.71% with 165 training data sets, and 99.82–99.83% with 198 training data sets.

Decreasing the hidden size to 10 and using two layers of LSTM cells with two training data sets per P_SP,2 slightly increases the average weighted accuracy to 99.85–99.90% using 88 training sets, to 99.78–99.79% using 129 training sets, to 99.63–99.76% using 165 training sets, and to 99.80% using 198 training sets. A hidden size of 10 gives the highest accuracy; the accuracy decreases for larger hidden sizes up to 30 with the same number of layers.

Appendix A.5. Effect of Ordering the Training Data Sets

Changing the order of the training data sets strongly affects the prediction accuracy of the SL-LSTM algorithms. Ordering the training data sets by the final power setpoint, P_SP,2, from low to high increases the average weighted accuracy and decreases the spread in the accuracy values. For a hidden size of 15, one layer of LSTM cells, 129 training data sets, 20 validation data sets, 100 testing data sets, three training sets per P_SP,2, and five different randomly shuffled training data sets, the calculated average weighted accuracies of the trained SL-LSTM algorithms are 99.54–99.8%. Ordering the training data sets by P_SP,2 from low to high increases the average weighted accuracy to 99.82%. Conversely, ordering the data sets by P_SP,2 from high to low lowers the average weighted accuracy to 97.76%.

With a hidden size of 10, two layers of LSTM cells, thirty-four training data sets, twenty validation data sets, and one hundred testing data sets, random shuffling of the training data sets for P_SP,1 gives an average weighted accuracy of 99.75–99.81%. Ordering the training data sets by P_SP,1 from low to high results in a calculated average weighted accuracy of the trained SL-LSTM algorithms of 99.21–99.34%, and of 99.17–99.42% with the training data sets ordered from high to low.

Appendix A.6. Increasing Training Features

Additional investigations examined the effect of adding more features to the base ones of P_SP, P_Rx, T_ex, T_in, and ṁ. The added features are the time derivatives of the base features, and of the total reactivity, the external insertion reactivity, and the temperature feedback reactivity (Figure 3). With a hidden size of twenty, one layer, eighty-eight training data sets, ten validation data sets, and one hundred testing data sets, the calculated average weighted accuracy of the trained SL-LSTM algorithms is 99.68%. This is close to, but lower than, the 99.90% calculated using only the five primary features. The added features also increase the spread in the accuracy values of the trained SL-LSTM algorithms.

In summary, results show that the selection of the features is essential for training the SL-LSTM algorithms to recognize the patterns in the training data sets. Adding features with weaker connections to the targets can hamper the algorithm’s ability to learn. Therefore, employing the five primary features in the training of the SL-LSTM algorithms is determined as the best option.

Appendix B. Hyperparameters in SAC-FNN Algorithms

Table A2 lists the constant and varied hyperparameters for training the SAC-FNN algorithms. The performed parametric analyses for the SAC-FNN algorithms investigate the effects of the number of neurons in the hidden layers of the Actor Network and the Critic Networks. The Actor Networks each have three hidden layers of neurons, with a varied number of neurons per layer from 16 to 256 (Table A2). The number of neurons in the Critic Networks varied commensurate with those of the Actor Network, up to 256, 512, 256, 128, and 64 for the state-action layer and the four hidden layers. The SAC-FNN algorithms with 256 neurons per layer and with 64 neurons per layer trained successfully to perform the simulated startup transients of the VSLLIM microreactor. Using Actor Networks of 16 and 32 neurons per layer did not result in any successful training cases using the SAC-FNN algorithm.
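
To make the Actor Network dimensions concrete, the sketch below builds a feedforward actor with a state of size 3, three hidden layers of 64 neurons with ReLU activation, and an action of size 1, using the values in Table A2. The use of two separate output heads (a linear mean μ and a Softplus standard deviation σ), the placeholder weight initialization, and all function names are assumptions for illustration. The networks in this work may be structured differently.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(fan_in: int, fan_out: int, rng: random.Random) -> Layer:
    """Placeholder initialization: small uniform weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(fan_in)] for _ in range(fan_out)]
    return weights, [0.0] * fan_out


def linear(x: Sequence[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softplus(v: float) -> float:
    """Numerically stable log(1 + exp(v)); always positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def build_actor(state_dim: int, hidden: Sequence[int], action_dim: int, rng: random.Random):
    sizes = [state_dim, *hidden]
    hidden_layers = [init_layer(i, o, rng) for i, o in zip(sizes[:-1], sizes[1:])]
    mean_head = init_layer(hidden[-1], action_dim, rng)
    std_head = init_layer(hidden[-1], action_dim, rng)
    return hidden_layers, mean_head, std_head


def actor_forward(actor, state: Sequence[float]) -> Tuple[List[float], List[float]]:
    hidden_layers, mean_head, std_head = actor
    x = list(state)
    for layer in hidden_layers:
        x = relu(linear(x, layer))                          # ReLU hidden layers
    mu = linear(x, mean_head)                               # action mean
    sigma = [softplus(v) for v in linear(x, std_head)]      # positive standard deviation
    return mu, sigma


rng = random.Random(0)
actor = build_actor(state_dim=3, hidden=(64, 64, 64), action_dim=1, rng=rng)
mu, sigma = actor_forward(actor, [0.1, 0.5, 0.9])  # hypothetical normalized state
action = mu[0] + sigma[0] * rng.gauss(0.0, 1.0)    # stochastic action sample
print(mu, sigma, action)
```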

The performed analyses also investigate the effect of employing initial weights for the Actor and Critic Networks that are either randomly determined using the Xavier uniform distribution or taken from the values for a previously successful episode. The trained SAC-FNN algorithms successfully complete the transient startup of the VSLLIM microreactor using both randomly determined weights and selected weights from previously successful episodes.
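
For reference, Xavier (Glorot) uniform initialization draws each weight from U(−a, a) with a = sqrt(6 / (fan_in + fan_out)). The sketch below applies it to a layer mapping the 3 state variables to 64 neurons. The gain of 1, the random seed, and the function name are assumptions.

```python
import math
import random
from typing import List


def xavier_uniform(fan_in: int, fan_out: int, rng: random.Random) -> List[List[float]]:
    """Draw a fan_out x fan_in weight matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
weights = xavier_uniform(fan_in=3, fan_out=64, rng=rng)
bound = math.sqrt(6.0 / (3 + 64))
print(f"bound = {bound:.3f}, rows = {len(weights)}, cols = {len(weights[0])}")
```

The bound shrinks as the layer widens, which keeps the variance of the activations roughly constant from layer to layer at the start of training.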

Table A2. Listing of the selected hyperparameters for training the SAC-FNN algorithms.

Hyperparameter	Value
Constant Hyperparameters
Number of VSLLIM environments	$1$
Tolerance time	60 s
VSLLIM state space, $|s_{t}|$	$3$
VSLLIM action space, $|a_{t}|$	$1$
Maximum number of episodes per case	$5000$
Maximum number of epochs per policy update	$10,000$
Hidden layers activation function	ReLU
Standard deviation layer activation function	Softplus
L2 weight regularization factor	$0.01$
Replay buffer storing frequency	1 s
Batch size, $|B|$	$256$
Optimizer	Adam
Learning rate	$0.00001$
Tradeoff coefficient learning rate	$0.000005$
Tradeoff coefficient initial value, $α_{0}$	$0.5$
Discount rate, $γ$	$0.99$
Target smoothing coefficient, $τ$	$0.001$
Varied Hyperparameters
Initialization of Actor and Critic weights	Random using the Xavier uniform distribution; weights from previously successful Actors and Critics
Number of neurons in the Actor’s three hidden layers	(256, 256, 256), (64, 64, 64), (32, 32, 32), (16, 16, 16)
Numbers of neurons in the Critic’s state-action layers and four hidden layers	(256, 512, 256, 128, 64), (64, 64, 32, 16, 8), (32, 32, 16, 8, 4), (16, 16, 8, 4, 2)
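
The target smoothing coefficient τ = 0.001 in Table A2 sets how slowly the target critic weights track the trained critic weights. The sketch below uses the soft (Polyak) update that is standard for SAC, θ_target ← τ θ + (1 − τ) θ_target. It is assumed here that this work applies the update in that standard form; the weight values and function name are hypothetical.

```python
from typing import List


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Move each target weight a fraction tau of the way toward the online weight."""
    return [tau * w + (1.0 - tau) * w_target for w_target, w in zip(target, online)]


tau = 0.001                   # target smoothing coefficient from Table A2
online_weights = [1.0, -2.0]  # hypothetical critic weights, held fixed for illustration
target_weights = [0.0, 0.0]   # hypothetical initial target weights

n_updates = 1000
for _ in range(n_updates):
    target_weights = soft_update(target_weights, online_weights, tau)

# With fixed online weights, the fraction of the initial gap closed after k
# updates is 1 - (1 - tau)^k.
fraction_closed = 1.0 - (1.0 - tau) ** n_updates
print(target_weights, fraction_closed)
```

With τ = 0.001, about 63% of the gap is closed after 1000 updates, so the target networks change slowly relative to the trained critics. This is the usual motivation for a small τ.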

References

Agarwal, V.; Ballout, Y.A.; Gehin, J.C. Fission Battery Initiative: Research and Development Plan; INL/EXT-21-61275; Idaho National Laboratory: Idaho Falls, ID, USA, 2021.
Cetiner, S.M.; Muhlheim, M.D.; Guler Yigitoglu, A.; Belles, R.J.; Greenwood, M.S.; Harrison, T.J.; Denning, R.S.; Bonebrake, C.A.; Dib, G.; Brunett, A.A. Supervisory Control System for Multi-Modular Advanced Reactors; ORNL/TM-2016/693; Oak Ridge National Laboratory: Oak Ridge, TN, USA, 2016.
Tang, C.; Yu, C.; Gao, Y.; Chen, J.; Yang, J.; Lang, J.; Liu, C.; Zhong, L.; He, Z.; Lv, J. Deep Learning in Nuclear Industry: A Survey. Big Data Min. Anal. 2022, 5, 140–160.
Simon, J.B.; Knutins, M.; Ziyin, L.; Geisz, D.; Fetterman, A.J.; Albrecht, J. On the Stepwise Nature of Self-Supervised Learning. arXiv 2023, arXiv:2303.15438.
Sutton, R.S.; Barto, A.G. Reinforcement Learning, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
Wang, P.; Yan, X.; Zhao, F. Multi-objective optimization of control parameters for a pressurized water reactor pressurizer using a genetic algorithm. Ann. Nucl. Energy 2019, 124, 9–20.
Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning. J. Artif. Intell. Res. 1996, 4, 237–285.
Radaideh, M.I.; Pigg, C.; Kozlowski, T.; Deng, Y.; Qu, A. Neural-based time series forecasting of loss of coolant accidents in nuclear power plants. Expert Syst. Appl. 2020, 160, 113699.
Xiao, K.; Wu, Q.; Chen, J.; Pu, X.; Zhang, Y.; Yang, P. A neural network predictive control method for power control of small, pressurized water reactors. Ann. Nucl. Energy 2022, 169, 108946.
Carelli, M.D.; Conway, L.E.; Oriani, L.; Petrović, B.; Lombardi, C.V.; Ricotti, M.E.; Barroso, A.C.O.; Collado, J.M.; Cinotti, L.; Todreas, N.E.; et al. The design and safety features of the IRIS reactor. Nucl. Eng. Des. 2004, 230, 151–167.
Qian, G.; Liu, J. Development of deep reinforcement learning-based fault diagnosis method for rotating machinery in nuclear power plants. Prog. Nucl. Energy 2022, 152, 104401.
Wei, L.; Jie, C.; Tong, L.; Yongchao, L.; Sichao, T.; Bo, W.; Zhengxi, H.; Ruifeng, T.; Jihong, S. Neural network model predictive control of core power of Qinshan nuclear power plant based on reinforcement learning. Ann. Nucl. Energy 2024, 207, 110702.
Park, S.; Lee, J.; Kwack, Y.; Kim, Y.; Hien, H.N.; Sim, S. Machine Learning Accident Diagnosis Methodology of Nuclear Power Plant Using Mars-Ks Best-Estimate Performance and Safety Analysis Database. SSRN 2022, 4290950.
Lee, D.; Kim, H.; Choi, Y.; Kim, J. Development of Autonomous Operation Agent for Normal and Emergency Situations in Nuclear Power Plants. In Proceedings of the IEEE 5th International Conference on System Reliability and Safety, Palermo, Italy, 24–26 November 2021; pp. 240–247.
Nguyen, H.K.; Rivas, A.; Delipei, G.K.; Hou, J. Reinforcement Learning-Based Control Sequence Optimization for Advanced Reactors. J. Nucl. Eng. 2024, 5, 209–225.
Hu, R.; Zou, L.; Hu, G.; Nunez, D.; Mui, T.; Fei, T. SAM Theory Manual; Argonne National Laboratory Report ANL/NSE-17/4; Argonne National Laboratory: Argonne, IL, USA, 2021.
Chen, X.; Ray, A. Deep Reinforcement Learning Control of a Boiling Water Reactor. IEEE Trans. Nucl. Sci. 2022, 69, 1820–1832.
Radaideh, M.I.; Tunkle, L.; Price, D.; Abdulraheem, K.; Lin, L.; Elias, M. Multistep Criticality Search and Power Shaping in Microreactors with Reinforcement Learning. arXiv 2024, arXiv:2406.15931.
Price, D.; Roskoff, N.; Radaideh, M.I.; Kochunas, B. Thermal modeling of an eVinci™-like heat pipe microreactor using OpenFoam. Nucl. Eng. Des. 2023, 415, 112709.
Tunkle, L.; Abdulraheem, K.; Lin, L.; Radaideh, M.I. Nuclear Microreactor Control with Deep Reinforcement Learning. arXiv 2025, arXiv:2504.00156v1.
Bae, J.; Kim, J.M.; Lee, S.J. Deep reinforcement learning for a multi-objective operation in a nuclear power plant. Nucl. Eng. Technol. 2023, 55, 3277–3290.
El-Genk, M.S.; Palomino, L.M. A Walk-Away Safe, Very Small, Long-Life, Modular (VSLLIM) Reactor for Portable and Stationary Power. Ann. Nucl. Energy 2019, 129, 181–198.
El-Genk, M.S.; Schriener, T.M.; Palomino, L.M. Passive and Walk-Away Safe Small and Microreactors for Electricity Generation and Production of Process Heat for Industrial Uses. Nucl. Eng. Radiat. Sci. 2021, 7, 031302.
MATLAB, version 2020b; The Mathworks: Natick, MA, USA, 2022. Available online: www.matlab.com (accessed on 1 December 2024).
Palomino, L.M.; El-Genk, M.S. Neutronic and CFD-Thermal Hydraulic Analyses of Very-Small, Long-Life Modular (VSLLIM) Reactor; Technical Report UNM_ISNPS-1-2019; Institute for Space and Nuclear Power Studies (ISNPS), University of New Mexico: Albuquerque, NM, USA, 2019.
El-Genk, M.S.; Tournier, J.-M. A Point Kinetics Model and Dynamic Simulation of Next Generation Nuclear Reactor. J. Prog. Nucl. Energy 2016, 92, 91–103.
De Caro, M.S. Irradiation Embrittlement in Alloy HT-9; LA-UR-12-24334; Los Alamos National Laboratory: Los Alamos, NM, USA, 2012.
Goorley, T. MCNP6.1.1-Beta Release Notes; LA-UR-14-24680; Los Alamos National Laboratory: Los Alamos, NM, USA, 2014.
Bernard, J.A.; Lanning, D.D.; Ray, A. Digital Control of Power Transients in a Nuclear Reactor. IEEE Trans. Nucl. Sci. 1984, 31, 701–705.
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
Amari, S.-I. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. Comput. C 1972, 21, 1197–1206.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
Abadi, M. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16), Savannah, GA, USA, 2–4 November 2016.
Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 1 October 2024).
El-Genk, M.S.; Schriener, T.M. A Cybersecurity Platform for Simulating Transient Responses of Emulated Programmable Logic Controllers in Instrumentation and Control Systems for a PWR Plant. J. Cyber Secur. Technol. 2022, 6, 65–90.
Schriener, T.M.; El-Genk, M.S. Simulated False Data Injection Attacks on Emulated and Hardware Programmable Logic Controllers of the Pressurizer in a Representative Pressurized Water Reactor Plant. J. Cyber Secur. Technol. 2022, 6, 216–241.
El-Genk, M.S.; Altamimi, R.; Schriener, T.M. Pressurizer Dynamic Model and Emulated Programmable Logic Controllers for Nuclear Power Plants Cybersecurity Investigations. Ann. Nucl. Energy 2021, 154, 108121.
El-Genk, M.S.; Schriener, T.M. Modeling and Simulation Capabilities for Nuclear Cybersecurity Investigations of a Representative PWR Plant and Space Reactor Power Systems. In Nuclear Power Plants: Recent Progress and Future Directions; Campton, J.K., Ed.; Nova Science Publishers, Inc.: Hauppauge, NY, USA, 2002; Chapter 1.

Figure 1. Longitudinal and radial cross section views of the VSLLIM microreactor for generating 1–10 MW_th; (a) natural circulation of in-vessel liquid sodium and (b) primary control rod groups and the ESD assembly. Adapted from [25].

Figure 2. Longitudinal cross-section of the VSLLIM microreactor installed below ground and mounted on seismic isolation bearings. Adapted from [25].

Figure 3. Cross section and elevation views of a UN fuel assembly with central B₄C control rod (a,c) and the central ESD assembly of 19 B₄C rods (b,d). Adapted from [25].

Figure 4. A block diagram of the coupled sub-models in the MATLAB Simulink VSLLIM microreactor digital twin model.

Figure 5. Reactivity worth of the VSLLIM control rod groups and the ESD central assembly versus axial displacement in the reactor core at mean temperatures of 400 and 800 K.

Figure 6. Operation parameters of the VSLLIM microreactor during a simulated startup transient to a steady thermal power of 10 MW. (a) Positions of the ESD assembly and the Group A, B, and C control rods, (b) the reactor thermal power, (c) the sodium temperatures at the inlet and exit of the reactor core, and (d) the mass flow rates of the in-vessel Na and the secondary Na in the helical coiled tubes of the Na/Na HEX.

Figure 7. The numbers of training data sets generated by the dynamic model of the VSLLIM microreactor during the simulated startup transients for different final reactor power setpoints, P_SP,2.

Figure 8. A block diagram of the performed operations to update the LSTM network during a training epoch of the SL-LSTM algorithm.

Figure 9. Block Diagram of information flow through LSTM networks used in the SL-LSTM algorithms, for two sequential timesteps, τ and τ + 1.

Figure 10. Sample results of the loss curves for a trained SL-LSTM algorithm using hidden size of fifteen, one layer of neurons, learning rate of 0.001, and randomly shuffled data sets.

Figure 11. Sample results of predicted control rod positions for a trained SL-LSTM algorithm using hidden size of fifteen, one layer of neurons, learning rate of 0.001, and randomly shuffled data sets.

Figure 12. Sample results of a trained SL-LSTM algorithm with a hidden size of fifteen, one layer of neurons, learning rate of 0.001, and randomly shuffled data sets showing (a) the testing accuracy and (b) the number of training data files per final setpoint power.

Figure 13. Feedforward Neural Network structures of the (a) Actor and (b) Critic networks used in training of the SAC-FNN algorithm.

Figure 14. A block diagram of the training for the SAC-FNN algorithm in the present work.

Figure 15. Example of progression of total episodic reward for the SAC-FNN Training Case Q.

Figure 16. Comparison of predicted positions of the Group A and C control rods of the VSLLIM microreactor by the trained SAC-FNN algorithms with (a–c) three layers, and 64 neurons per layer and (d) with three layers, and 256 neurons per layer versus targets in simulated startup transients.

Figure 17. Developed setup for remote operation of the Reactor Control PLC linked to the trained SL-LSTM and SAC-FNN algorithms and digital twin dynamic model of the VSLLIM microreactor.

Figure 18. Comparison of the positions of the Group A and C control rods and the reactor power predicted by the trained SAC-FNN algorithms incorporated into the Reactor Control PLC with those obtained using the PLC with a PD controller during a simulated startup transient of the VSLLIM microreactor for (a,c) Case R with three layers and 64 neurons per layer, and (b,d) Case K with three layers and 256 neurons per layer.

Figure 19. Comparison of the reactor thermal power predicted during a simulated startup of the VSLLIM microreactor by the Reactor Control PLC with the trained SAC-FNN algorithm in Case Q (three layers and 256 neurons per layer) with that calculated in the simulated startup using the PLC with a PD controller for target reactor power setpoints, P_SP,2, of (a) 10.0 MW_th, (b) 7.5 MW_th, (c) 5.0 MW_th, and (d) 2.0 MW_th.

Figure 20. Comparison of the reactor thermal power predicted during simulated startup transients using the Reactor Control PLC with the trained SL-LSTM and SAC-FNN algorithms with that calculated by the PLC with a PD controller for the same power setpoint of P_SP,2 = 10 MW_th.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

El-Genk, M.S.; Schriener, T.M.; Shaheen, A.N. Machine-Learning Algorithms for Remote-Control and Autonomous Operation of the Very-Small, Long-Life, Modular (VSLLIM) Microreactor. J. Nucl. Eng. 2025, 6, 54. https://doi.org/10.3390/jne6040054

