Improving Thermochemical Energy Storage Dynamics Forecast with Physics-Inspired Neural Network Architecture

Thermochemical Energy Storage (TCES), specifically the calcium oxide (CaO)/calcium hydroxide (Ca(OH)2) system is a promising energy storage technology with relatively high energy density and low cost. However, the existing models available to predict the system’s internal states are computationally expensive. An accurate and real-time capable model is therefore still required to improve its operational control. In this work, we implement a Physics-Informed Neural Network (PINN) to predict the dynamics of the TCES internal state. Our proposed framework addresses three physical aspects to build the PINN: (1) we choose a Nonlinear Autoregressive Network with Exogeneous Inputs (NARX) with deeper recurrence to address the nonlinear latency; (2) we train the network in closed-loop to capture the long-term dynamics; and (3) we incorporate physical regularisation during its training, calculated based on discretized mole and energy balance equations. To train the network, we perform numerical simulations on an ensemble of system parameters to obtain synthetic data. Even though the suggested approach provides results with the error of 3.96 × 10−4 which is in the same range as the result without physical regularisation, it is superior compared to conventional Artificial Neural Network (ANN) strategies because it ensures physical plausibility of the predictions, even in a highly dynamic and nonlinear problem. Consequently, the suggested PINN can be further developed for more complicated analysis of the TCES system.


Thermochemical Energy Storage
Energy storage systems have become increasingly important in the shift towards renewable energy because of the fluctuation inherent to renewable energy generation [1,2]. Thermochemical Energy Storage (TCES) stores and releases energy in the form of heat as chemical potential of a storage material through a reversible endothermic/exothermic chemical reaction. TCES is favourable compared to sensible and latent heat storage [3][4][5] because it features a high energy density, low heat losses and the possibility to discharge the system at a relatively high and constant output temperature [6].
In general, the chemical processes occurring on the storage material can be classified into: redox of metal oxides, carbonation/decarbonation of carbonates and hydration/dehydration of hydroxides [14]. The choice of materials depends on many criteria, one of which is the application of the energy storage. For example, in integration with Concentrated Solar Power (CSP) plants, manganese oxide is not suitable because of its high reaction temperature [6,15]. Another important aspect to consider is the practicability of the process; for example, in a calcium carbonate system, CO 2 as the side effect of the reaction has to be liquefied and results in a high parasitic loss [6,14]. Additionally, there are many more criteria to consider, such as cyclability, reaction kinetics, energy density and, most importantly, safety issues. For comprehensive reviews of varying storage materials, we refer to [14][15][16].
Recently, experimental investigations have been conducted specifically for the calcium oxide (CaO)/calcium hydroxide (Ca(OH) 2 ) system. One experiment investigated the material parameters (such as heat capacity and density) and the reaction kinetics [17], another experiment focused on studying the operating range, efficiency and the cycling stability of the system [18], and there was also an experiment on the feasibility of integration with concentrated solar power plants [19]. All these experiments show that CaO/Ca(OH) 2 is a very promising candidate as TCES storage material. Furthermore, it is more attractive compared to other storage materials because it is nontoxic, relatively cheap and widely available [20,21]. The system stores the heat (is charged) during the dehydration of Ca(OH) 2 by injecting dry air with higher temperature. Charging results in an endothermic reaction along with the formation of H 2 O vapour and lower temperature at the outlet. It releases the heat (is discharged) during the hydration of CaO. This is achieved by injecting air with higher humidity (H 2 O content) and relatively lower temperature, resulting in an exothermic reaction (see Figure 1). Note that in this case, the hydration process occurs at lower temperature relative to the dehydration process, but both processes occur at high operating temperature [22]. The reversible reaction is written as:  A robust operational control of this system needs an accurate and real-time capable model to predict its state of charge and health. Similar models are operationally used, namely for batteries in mobile devices [23,24]. Accordingly, numerical TCES modelling studies were conducted to predict the system's behaviour [6,18,20,21,25]. However, the PDEs that describe the system are dynamic, highly nonlinear and strongly coupled, making the numerical simulation computationally expensive. This poses a significant hindrance on a more thorough and complex analysis of the TCES system. Estimation of the system's state of health, for example, requires a 2D or 3D model to study the effect of the structural change due to agglomeration [26]. With increasing spatial dimension, the computational time also increases strongly. Consequently, the system is not ready yet for commercial and industrial use until a faster and accurate model is developed. In this work, we consider using Artificial Neural Network (ANN) as a cheaper alternative to the expensive existing models.

Physics-Inspired Artificial Neural Networks
Artificial Neural Networks (ANNs) have been studied and applied intensively in the past few decades. They have become very popular alongside linear regression and other techniques such as Gaussian Process Regression (GPR) and Support Vector Machine (SVM) [27]. ANNs have advantages in terms of their flexibility and better applicability to model nonlinear problems compared to linear regression and GPR [28]. Additionally, it has better scalability to larger data compared to SVM [29]. However, a detailed performance comparison of ANN with other machine learning techniques is out of the scope of this paper.
ANNs have a wide range of applications, such as image and pattern recognition, language processing, regression problems and data-driven modelling [30]. In this paper, we focus on data-driven modelling, where an ANN is trained to predict the physical behaviour of a TCES system based on available data. ANNs have been used for data-driven modelling in different fields. In hydrology [31], ANNs have been successfully applied, for example to predict rainfall-runoff [32,33], groundwater levels [34] and groundwater contamination [35] . Moreover, ANNs have been used in energy system applications [36], for example to predict the performance of [37], reliability of [38] and design [39] renewable energy systems . All these examples show that ANNs have a potential to be a quick decision making tool which is useful for many engineering and management purposes.
In previous applications of ANNs in data-driven modelling, the ANN was treated as a black box [40,41] that learns only the mathematical relationship between the input and output. In such a process, the physical relationships and scientific findings that were previously used to build governing equations of the modelled systems are completely neglected. This issue is very troublesome and needs to be addressed because real data are noisy with measurement errors, and fitting the ANN to the noisy data without any physical constraint might lead to overfitting problems [30]. Additionally, in many cases, observation data is difficult and expensive to obtain, providing users with only a limited amount of data to train the ANN. Without any physical knowledge, ANNs perform poorly when trained with a low amount of data [42,43]. Furthermore, ANNs have a very poor interpretability [44,45], meaning that there might be different combinations of ANN elements (width, depth, activation functions and parameters) that fit the training data with similar likelihood, but not all of them are physically meaningful and robust. As a result, ANN predictions might be misleading.
Implementing physical knowledge to build the ANN structure and regularisation is a potential solution to solve this issue. By combining a black box model with a white box (fully physical) model, we obtain a grey box model. In such an approach, physical theories are used in combination with observed data to improve the model prediction and plausibility [46]. Moreover, the data will help to include complex physical processes that may not be captured in currently existing white box models. There are at least two motivations to do so: to obtain a reliable surrogate model for the physics-based model for the sake of speed in real-time environments and to address situations where the underlying physics of the system are incompletely understood, so that ANNs can build on, and later exceed, the current state of physical understanding.
Several works have been conducted to develop the so-called Physics Inspired Neural Networks (PINN). In general, PINNs can be grouped into two distinguishable motivations as mentioned above. The first one applies ANNs to infer the parameter values in the governing Ordinary Differential Equations (ODEs) or Partial Differential Equations (PDEs) as well as the constitutive relationships and the differential operators [43,47], assuming that the ODEs or PDEs perfectly describe the modelled system. The second one treats the system as a complex unit that is not sufficiently represented only with simplified equations. It trains the ANN based on observation data while constraining the ANN using physically-based regularisation [42,48,49]. We aim mostly at the second motivation and ask ourselves how much of the useful knowledge contained in the PDEs can be used to inform ANNs before proceeding to train them on observation data.
Despite the success of PINN implementations in this direction, there are still some open issues that need to be addressed: (I) There is no well-defined alignment between the structure of the governing equations with the structure of the ANN. For example, most, if not all of the applications are for dynamic systems. Nevertheless, the structures of the ANNs applied do not resemble the dynamic behaviours of the systems and do not consider recurrency. (II) The focus in PINN development is more on getting high accuracy with limited amount of training data rather than improving the physical plausibility of the predictions. (III) The implementations of PINNs in previous works are mainly for relatively simple problems, and implementation to more complex problems (featuring multiple nonlinear coupled equations) has not been evident yet. In our current study, we address these three open issues.

Approach and Contributions
For dynamic and complex systems with coupled nonlinear processes such as the TCES system, we need an advanced approach to solve it using ANNs. Our approach implements physical knowledge of the system into building the ANN such as: (I) we use a Nonlinear Autoregressive Network with Exogeneous Inputs (NARX) structure. This is a form of Recurrent Neural Network (RNN), and we use deeper recurrence to account for the system's long-term time scales and nonlinear dynamics; (II) we train the network with recurrence structure to improve the long-term predictions in the dynamic system; and (III) we add physical regularisation terms in the objective function of training to enforce physical plausibility of the predictions.
NARX is suitable to model time series of sequential (time-dependent) observations y(t) [50,51], which are equispaced time series. There are several reasons why NARX is preferable to alternative ANN structures: the included feedback loop in NARX enables it to capture long-term dependencies [34] and the possibility to provide exogenous inputs improves the results compared to networks without them [52]. Thinking in terms of PDEs, the exogeneous inputs resemble time-dependent boundary conditions, and the feedback provides access to preceding time steps of the PDE solution. With deeper recurrence, even integro-differential equations can be resembled, which is important for hysteretic systems or for system descriptions on larger scales.
There are two different methods to train NARX, namely Series-Parallel (SP, also known as open-loop) and Parallel (P, also known as closed-loop). In SP training, each time step in the time series is used as an independent training example. This means the recurrency in the ANN structure is ignored, and the preceding data values from the time series are provided as feedback inputs instead of the predicted values. The feedback loop is closed only after completing the training to perform multistep ahead predictions [52,53]. The independency of the training examples makes the training much easier; however, the trained network performs much worse after closing the feedback loop [52].
Most, if not all studies conducted with NARX have used SP structure to train the network. In this paper, we argue that P training resembles the dynamic system better. The reason is that P training optimizes the ANN exactly for the later prediction purpose over longer time horizons: it accounts for error propagation over time and for the time-dependency of the predictions between time steps. As a downside, it requires more time to train the network in P mode. We are readily willing to accept this trade-off, because once trained, the network can still calculate its outputs in high speed [54]. We also propose to use a deeper recurrency to train the network by feeding back predictions of multiple preceding time steps. This accounts for the nonlinearity of the system and for possible higher-order memory effects in the system. In terms of PDE-governed systems, this corresponds to the time delay between system excitations at one system boundary and the system's reaction at a remote boundary.
Regularisation in training ANNs is useful to prevent overfitting. Here, as an addition to the commonly used L2 regularisation, simple regularisation terms are added that align with the physics. Several examples include, but are not limited to, monotonicity and non-negative values (examples in this case are volume and mole fractions) of the internal states. This follows the works presented in [46,55]. We suggest to use Bayesian Regularisation (BR) to optimally calculate the hyperparameters (normalising constants) of all terms in the loss function, unlike in previous works, where the hyperparameters were calibrated manually. Furthermore, regularisation terms with discretized balance equations are also used. This regularisation is a way to feed the training with fundamental human knowledge previously used to build PDEs. It helps the network realise the extensive relationships between inputs (previous states and boundary conditions) and outputs (future states of the system) in complex problems and to prevent physically implausible predictions.
To test our ANN framework, a Monte Carlo ensemble of numerically simulated time series of system states is used, which we generate from random samples of uncertain system parameters. White noise is then added to the simulation results to emulate the actual noisy measurement data. Then, this ensemble (both parameters and time series) is used to train the network. We use synthetic data instead of experimental data because the former allow more exhaustive and controlled testing; this does not imply that our main purpose is only surrogate modelling. As optimization algorithm for training, the Levenberg-Marquardt (LM) algorithm [56][57][58] is implemented to obtain an optimum set of NARX parameters (which consist of the so-called weights and biases).
This paper is organised as follows: in Section 2 we introduce the governing equations used for numerical simulation of the TCES internal states, the alignment between the dynamic of CaO/Ca(OH) 2 and the NARX structure, as well as how we implement the physical knowledge into the regularisation. In Section 3, we discuss the results of our test, and Section 4 concludes the findings in the work.

Governing Equations
This study serves as an initial step towards enabling a more complex analysis of the TCES system, focusing on predictions of the system's dynamic internal states that change during the endothermic/exothermic reaction process. The analysis of the system's integration with the energy source is out of scope of this paper.
To set up the prediction model, we consider the CaO/Ca(OH) 2 TCES lab-scale reactor of 80 mm length along the flow direction as described in [20]. Assuming the system properties and parameters to be homogeneous, the simulation was conducted in 1D. The system was modelled as a nonisothermal single-phase multicomponent gas flow in porous media with chemical reaction acting as the source/sink terms and can be described using mole and energy balance equations. The inlet temperature and outlet pressure were fixed and defined with Dirichlet boundary conditions, and Neumann conditions were used to define the gas injection rates. The solid components forming the porous material are CaO and Ca(OH) 2 , and the gases are H 2 O and N 2 . The latter serves as an inert component to regulate the amount of H 2 O mole fraction in the injected gas. Full explanation in detail can be found in [20], and we offer a brief overview only in this section.
The mole balance equation was formulated for the solid component (subscript s) as: where ρ n denotes the molar density, ν the volume fraction, q the source/sink term, t the time and the subscript n refers to molar properties. Note that ν s is the volume fraction of each solid component with regard to the full control volume, and therefore ∑ s ν s = 1 − φ. In Equation (2), there is no effect of advection or diffusion (no fluxes) because the solid is assumed to be immobile. The change of solid component is solely caused by the chemical reaction through the reaction source/sink term q s , assuming the solid is immobile. The change in the gas component (subscript g), however, is affected both by advective and diffusive mass transfer and by a source/sink term for the reactive component H 2 O, as is defined in the mole balance equation: Here, x denotes the molar fraction, φ the porosity, K the absolute permeability of the porous medium, µ the gas viscosity, p the pressure and D the effective diffusion coefficient.
The energy balance equation was formulated assuming local thermal equilibrium. It accounts for internal energy change of both solid and gas phase, convective and conductive heat flux as well as source/sink term from the reaction. It was defined as: where ρ m is the mass density, u g is the gas specific internal energy, c p,s is the specific heat capacity of the solid material (CaO and Ca(OH) 2 ), T is temperature, h g is gas specific enthalpy and λ e f f is the average thermal conductivity of both solid materials and gas components. Reaction rates must be specified to determine the source/sink term for each equation. Based on [20,21], simple reaction kinetics were used, described as: whereρ m,SR is the mass reaction rate, k H R and k D R are hydration and dehydration reaction constant, respectively and T eq is the equilibrium temperature. Hydration process occurs when T < T eq , which is also called the discharge process and is the exothermic part of the reaction; and dehydration process occurs when T > T eq , also known as charge process and the endothermic part of the reaction. At the beginning of each reaction, the storage device is assumed to be in chemical equilibrium, corresponding to ν Ca(OH) 2 = 0 and ν CaO = 0 for hydration and dehydration, respectively.
The relation between the reaction rate and the source/sink terms for the mole balance equations were defined as: withρ n,SR the molar reaction rate (obtained fromρ m,SR using the molar mass of each respective component). The energy balance source/sink term q e was calculated accounting for the reaction enthalpy ∆H and the volume expansion work [59] according to: Note that a negative sign is necessary to calculate q e , so that its value is in proportion to q Ca(OH) 2 , and in reverse to q H 2 O and q CaO . This negative sign can be explained by the fact that to form Ca(OH) 2 from CaO and H 2 O in the hydration process, energy is released into the system. Correspondingly, a decrease in the molar amount of CaO and H 2 O (and an increase in the molar amount of Ca(OH) 2 ) results in a positive source term. The opposite holds for the dehydration process.

Input and Output Variables
The numerical model used in this work was developed using DuMu x (Distributed and Unified Numerics Environment for Multi-{Phase, Component, Scale, Physics, ...} [60]). As input to the simulator, we need the material parameters such as CaO density (ρ CaO ), Ca(OH) 2 density (ρ Ca(OH) 2 ), CaO specific heat capacity (c p,CaO ), Ca(OH) 2 specific heat capacity (c p,Ca(OH) 2 ), CaO thermal conductivity (λ CaO ) and Ca(OH) 2 thermal conductivity (λ Ca(OH) 2 ); porous medium parameters such as absolute permeability (K) and porosity (φ); reaction kinetics parameters such as reaction rate constant (k r ) and specific reaction enthalpy (∆H); and initial and boundary conditions such as N 2 molar inflow rate (ṅ N 2 ,in ), H 2 O molar inflow rate (ṅ H 2 O,in ), initial pressure (p init ), outlet pressure (p out ), initial temperature (T init ), inlet temperature (T in ) and initial H 2 O mole fraction (x H 2 O,init ). In the TCES system application, one of the main goals is to estimate the state of charge of the device that is implied in the CaO volume fraction ν CaO . The device in fully charged condition corresponds to ν CaO = 1 and vice versa. We are also interested in the output variables p, T and x g,H 2 O (H 2 O mole fraction). The behaviour of these variables, especially p, is very nonlinear. Therefore, it is interesting to see the prediction of the ANN for these nonlinear variables. Additionally, these variables are also important to assist in the system understanding. Therefore, our main output variables of interest were defined as in the following vector y as a function of time t: All input-output data samples are available as supplementary materials on https://doi.org/10.18419/darus-633.

Aligning the ANN Structure with Physical Knowledge of the System
ANN representation via NARX has two different training architectures, namely Series-Parallel (SP) and Parallel (P) structure. The network outputŷ SP (t + 1) of the SP structure is a function of the observed target values of previous time steps y (t) up to a feedback delay d y and of the so-called exogenous inputs u:ŷ In this work, u was assumed to be constant over time, meaning there is no disturbance signal throughout the whole simulation period.
In P structure, the difference lies in the fed-back values. Here, the network outputs of the P structureŷ P (t) are fed-back instead of the original given data y(t): Note that, in terms of notation, the difference only lies in the hats above the fed-back values. Apparently, the P-structure in NARX resembles an explicit time-discrete differential equation (ODE or PDE) in a simplistic case, for example using the Adams-Bashforth discretization scheme [61,62] which can be described as:ŷ (t + 1) ≈ŷ(t) + ∆t · g(u,ŷ(t),ŷ(t − 1), . . . ,ŷ(t − d y )). (11) whereŷ (t + 1) is an explicit function ofŷ (t) . . .ŷ t − d y . In Equation (10), the NARX function , u can be seen as an approximation ofŷ(t) (11). Based on this reason, we propose to train using P-structure for solving dynamic problems whenever possible. Additionally, training in P-structure helps the network to learn that there is dependency between predicted values at different time steps. While both architectures were considered for NARX training, only P architecture was used for testing, as for longer-term forecasting, real data of previous time steps are not available [52]. For better understanding of the difference between P and SP, Figure 2 illustrates both architectures. Feedback delay is also an important property, because Equations (2)-(4) are not elliptic, and hence there will be a time delay for the effect of input change to change the output. Because of this memory effect and its nonlinearity, we propose to use a deeper recurrence in NARX to enable the network to learn the system's latency. In this work, feedback delay values d y ranging from 1 to 5 were tested to get the optimum value. Additional to an appropriate ODE-like structure, the hyperbolic tangent (tanh) function was chosen as activation function within the neurons of NARX. Tanh is a nonlinear activation function (named tansig function in MATLAB [63]). Aside from the nonlinearity, this choice was driven by the assumption that all the input parameters and the targets depend on each other via smooth functions (differentiable). This knowledge results, among others, from the presence of a diffusion term in all relevant transport equations, and from the absence of shock waves in the solutions to Equations (2)-(4). Hence, for each hidden layer l as shown in Figure 2, the layer output a [l] is computed in the feedforward procedure described in: For the output layer L, a linear function is assigned for scaling, leading to:

Physical Constraints in the Training Objective Function
As loss function for training, the most commonly used performance measure is a Mean Squared Error (MSE). Equation (14) shows the MSE for n training datasets (here, we use the subscript D for "data" to label the error term E D in the loss function as the data-related term): where i = 1 . . . n indicates a specific sample in the training dataset and t = 1 . . . n t indicates a specific time step. However, using MSE alone in the loss function is not enough most of the time. The optimization problem to be solved in training is typically an ill-posed problem in many instances [30]. Thus, regularisation is required to prevent overfitting. In this work, the L2 regularisation method was used to increase the generalisation capability of the ANN [64]. L2 regularisation is also known as weight decay or ridge regression [65]. The goal of L2 regularisation is to force the network to have small parameter values (choosing the simpler network over the more complex one). This effectively adds a soft constraint to the loss function to prevent the network from blindly fitting the possible noise in training datasets: where N is the total number of network parameters (weights and biases), and θ ∈ R N are the network parameters.
To improve the network prediction and the physical plausibility even more, known physical laws were inserted as part of the network regularisation: where the subscript k identifies a specific physical law, for example a mole balance equation, and e phy,i,t,k is the physical error listed in Table 1. For example, the term e phy,i,t,1 corresponds to the mole balance equation for dehydration/hydration. The mole balance equation used for this regularisation is the H 2 O mole balance, because it has the most complete storage, flux and source/sink term (the solid components are assumed to be immobile, and N 2 is inert). The mole balance error can be written as: where n H 2 O is the molar amount of H 2 O, the subscript out, in, sto and q denote outflow, inflow, storage and source/sink term, respectively. The mole balance error was used as a contraint e phy,i,t,1 and is equal to 0 if the mole balance is fulfilled. Putting this equation as a regularisation term penalises the network if the mole balance is not satisfied. Similarly, the corresponding energy balance equation also has to be fulfilled: where Q is the energy in the system. It was used as a regularisation in e phy,i,t,2 . A more detailed derivation of the mole balance error e phy,i,t,1 and the energy balance error e phy,i,t,2 can be found in Appendix B.

Dehydration Hydration
Further relations of the form F (ŷ) ≤ 0 (monotonicity and non-negative values) were implemented using the Rectified Linear Units (ReLU) function, so that the physical error was then calculated with e phy,i,t,k = ReLU F (ŷ) [46]. The ReLU function returns 0 as output value for negative arguments and linearly increases for positive arguments. Hence, it punishes positive values in proportion to their magnitudes. Examples of these ReLU constraints are e phy,i,t,3 through e phy,i,t,13 . They define non-negativity and monotonicity of the predicted target variables. For both dehydration and hydration process, negative fractional valuesν CaO andx H 2 O are physically and mathematically impossible. Therefore, in e phy,i,t,3 and e phy,i,t,4 , the network is punished for predicting negative values for these targets. Additionally, for both processes, e phy,i,t,5 provides an additional constraint forν CaO , to limit the amount of CaO volume in relation with the porosity (ν CaO ≤ 1 − φ). All these monotonicity assumptions originated from the fact that the system's material parameters are considered to be constant throughout operation of the system. Therefore, the system's behaviour should be monotonic and bounded in the specified aspects. Specific for the dehydration process,p,T and ν CaO are expected to not decrease throughout the simulation. This results in the corresponding monotonicity constraints e phy,i,t,6 to e phy,i,t,8 . The system temperature must also be lower or equal to the injected temperature as constrained in e phy,i,t,9 , because the injected temperature is higher than the initial temperature. Specifically for the hydration process, the monotonicity constraints e phy,i,t,6 to e phy,i,t,9 for the dehydration process are reversed, because the hydration process is the reverse of the dehydration process.

Obtaining Optimum Network Parameters
The complete loss function defined in Section 2.4 including MSE and all the regularisation term is written as: Here, α and β are normalising constants of E θ and E D , respectively; and λ k is a normalising constant for each physical regularisation k. All error and regularisation terms, therefore, are evaluated in a normalised metric. These normalising constants, also known as the hyperparameters, determine the importance given to each term. For example, a high β means that it is more important for the network to fit the training datasets than to generalise better. In many cases, the hyperparameters are determined manually from trial-and-error. In this work, Bayesian Regularisation was adopted to calculate them overall using a maximum likelihood approach to minimise L(θ) [66,67]. Bayesian Regularisation reduces the subjectivity arising from manual choice of hyperparameters. First, all the hyperparameters α, β and λ k along with the network parameters θ were initialised. The hyperparameters were initialised by setting β = 1, α = 0 [68] and also λ k = 0, while the network parameters were initialised using the Nguyen-Widrow method [69,70]. The Nguyen-Widrow method initialises the network parameters so that each neuron contributes to a certain interval of the whole output range (added with some random values).
The complete derivation for updating α and β can be found in [66,67]. Here, we only give the calculation of λ k . After each iteration, they are updated according to the following relation: where J phy,k is the Jacobian of physical error E phy,k with respect to the network parameters θ. The approximate Hessian matrix H of the overall loss function L(θ) was defined as follows: where J D is the Jacobian of the MSE (E D ) with respect to the network parameters and I is the N × N identity matrix (N is the number of network parameters θ).
The network parameters are also updated after each iteration according to the Levenberg-Marquardt algorithm: where µ > 0 is the algorithm's damping parameter. Its value is increased when an iteration step is not successful, and is decreased otherwise. The Levenberg-Marquardt algorithm was chosen because of its faster convergence rate compared to Steepest Decent and higher stability relative to the Gauss-Newton algorithm [58]. In absence of our physical regularisation terms and with fixed α, β, the procedure would simplify to a plain (nonlinear) least squares training, which would be the standard approach for training ANNs. Values of the trained network parameters and the normalising constants at the end of the training are given as supplementary materials on https://doi.org/10.18419/darus-634.

Results and Discussion
Our hypothesis is that applying physical knowledge of the modelled system into the construction of ANNs would lead to an improved physical plausibility of the prediction results. In this section, the prediction of the TCES system using ANNs is assessed and three relevant aspects that support our hypothesis are discussed: (I) the effect of feedback delay on the prediction result to account for the system's nonlinearity and long-term memory effect (Section 3.1), (II) the comparison between training in SP and P architecture (Section 3.2) and (III) the improved physical plausibility from using physical regularisation (Section 3.3). The results are illustrated only for the dehydration process, because the hydration provides very similar results.
The complete workflow of the ANN application is shown in Figure 3. In general, the workflow can be divided into: training, validation and testing of the ANN. To train the ANN, first an ensemble of exogeneous input u was generated based on selected probability distributions. These distributions are based on different values used in literature [6,[17][18][19][20][21]. The complete list of exogeneous inputs u with their corresponding distributions is listed in Appendix A. This ensemble was then plugged into the numerical model in DuMu x and was simulated until t = 5000 s to obtain an ensemble of target data y(t). The governing equations are provided in Section 2.1. White noise was then added to these targets by generating normally distributed random numbers with zero mean and a standard deviation of 0.05 times the target values. Lastly, both exogenous inputs and targets were normalised to the range [−1, 1] to help the stability of the training [71]. Then, we set up the NARX ANN as described in Section 2. The training was then conducted using the built-in functions for NARX in the MATLAB Neural Network Toolbox [63], in which the loss function calculation was modified based on the equations provided in Section 2.5. It was conducted in batch mode both for dehydration and hydration process with a total of 100 training datasets.
Without physical regularisation, we obtain the lowest MSE value when the NARX is trained using 1000 training datasets, as shown in Figure 4. However, it is interesting to see how the physical knowledge can further improve the performance of NARX with limited training data. Therefore, we conducted the training in batch mode both for dehydration and hydration process with a total of only 100 training datasets.
For conciseness, the choice of the number of hidden layers, number of nodes per hidden layer and choice of activation function is not discussed because there is no existing uniform and systematic method to calculate the appropriate or the best combination [72]. Based on trial and error, we found that for this specific problem, 2 hidden layers with 15 and 8 nodes at each layer gives reasonable results. An example of ANN prediction using this configuration is provided in Appendix C. The stopping criteria are the dampening factor µ > 10 10 (see Equation (22)) or the loss function gradient g = ∂L(θ) ∂θ < 10 −7 , both of which are the default values proposed in the toolbox. Additionally, a maximum epochs is set for the training. Since training error converges mostly before 500 epochs, this limit is sufficient.  Different initialization values often lead to different network response as the algorithm might fail to always locate the global minimum [65]. Therefore, the network was retrained with 50 different initialisations. This number was set to find a compromise between a reliable result and a reasonable computational time. After each training, the network was validated on 20 validation datasets, and the trained network with the lowest MSE (E D evaluated against validation data) was selected.
Finally, the network was tested with data contained neither in the training nor in the validation set on a test set with 800 time series. Figure 5 shows the influence of feedback delay on the MSE evaluated on both the training (dashed lines) and test datasets (solid lines) for networks trained using P structure. As shown in Figure 5, for d y > 1 the test MSE is lower than the training MSE. This is generally because the network was trained using additional white noise-producing more errors in the training, while the test datasets were smooth for reference. In Figure 5, while the training error seems to remain constant at d y > 2, the test error keeps decreasing with increasing d y . This clearly illustrates that including more depth in the recurrence improves the generalisation capability and therefore improves the ANN prediction. As the best MSE test was obtained with d y = 5, this value will be used from here onwards. From Figure 5 we can also analyse that a time delay of at least 3 previous time steps is useful to train the network. Moreover, we do not see the value of using d y > 5, as judging by the MSE trend in Figure 5, there is no significant improvement expected. Figure 6 compares the training time of P compared to SP structure. Moreover, plotting the gradient and performance as function of training epochs allows us to analyse the difference between both training characteristics. As expected and shown in Figure 6 (dashed lines), both SP and P structure training time increases nearly linearly with the number of epochs (iterations). However, the slope is steeper for P training which is caused by higher computational cost using Backpropagation Through Time (BPTT).

SP Versus P Training Structure
In Recurrent ANNs, BPTT is used to calculate the derivative of the loss function instead of the normal backpropagation method. BPTT is technically the same as the normal backpropagation method but with the RNN unfolded through time being the main difference [30]. The gradient is then backpropagated through this unfolded network. Unfolding the recurrent network increases its size, and therefore the optimization problem becomes computationally more expensive and more difficult to solve.
After every epoch in P training, the output values change, and consequently the feedback values also change. The constantly changing feedback values cause additional changes of the gradient values along the iteration (dotted lines in Figure 6). This makes the training a more nonlinear problem. Correspondingly, the training performance (smaller MSE) increases much slower. In SP training, the gradient strongly decreases during the first 20 epochs, showing that the SP training is more computationally stable. However, the MSE (solid lines in Figure 6) was evaluated during training for the structure the network was trained with, meaning that the training performance does not consider the closed-loop conversion error for SP training. For that reason, the MSE values shown in Figure 6 seem to be better for SP training. Regardless of this difference, both training procedures converge with strictly monotonic decay of their MSE. Next, the prediction performances of both training architectures are compared. In Figure 7, the results of the SP-trained NARX (dashed lines) are shown compared to the target values obtained from the simulation (solid lines) as reference solutions. Here, all target variables are calculated with differing inlet temperature. After a few time steps with relatively precise forecasts, the NARX predictions for T, p and ν CaO diverge from real values and are highly fluctuating over time, which is nonphysical. Note that, in Figure 7, the results are shown only up to time step 100 instead of 1000. This is because the NARX prediction results after time step 100 have even higher fluctuations as the error propagates, hence making the comparison unclear visually. The forecast for x g,H 2 O is reasonably accurate for t < 100 but more erroneous for longer forecast periods. The forecast error is caused by the closed-loop-instability, meaning the inaccuracy caused by converting the network structure from SP to P. In other words, training using SP structure gives increasingly erroneous results with increasing time horizon. On the contrary, training with P structure provides clearly more accurate forecasts compared with SP, as shown in Figure 8. The NARX predictions (dotted lines) correspond really well to the reference targets (blue solid lines) throughout the whole simulation time for 1000 time steps, with inlet temperatures covering the whole range of its input distribution.
The comparison of P and SP structure shows that, while training in P structure seems to be more unstable, it provides significantly better long-term predictions because it trains the network to realise the time-dependency of output variables in a dynamic system. As shown by Equation (11), NARX resembles an explicitly discretized ODE, which is known to be conditionally stable. In cases where the discrete ODE is unstable, the error grows exponentially through time [73]. By training the network using P structure, the same structure is used for both training and testing, hence minimising the error propagation. The higher computing time needed to train in P structure (in comparison with training in SP structure) should not be a problem because the training only needs to be conducted once. Once the optimum network parameters are obtained, NARX can give reasonably accurate predictions with fast computational time. In fact, almost all studies in the area of surrogate models are willing to accept high computational costs during training (called offline costs) if the accuracy and speed for prediction (online) are good [54,74,75].
To make a fair comparison, all networks were trained with the same set of initial network parameters. In this test, we used only P training and feedback delay d y = 5, because they clearly showed preferable performance in the previous sections. The comparison is summarised in Table 2. While the MSE on training data is in a comparable range for all loss functions, differences in MSE test are observed. L2 regularisation helps to reduce overfitting of training data, resulting in a lower test error in "MSE + L2" compared to "MSE". At first glance, the additional physical regularisation does not seem to further improve the results. MSE test of "MSE + PHY" is slightly worse compared to "MSE", and MSE test of "MSE + L2 + PHY" is in the same order with "MSE + L2" because another constraint is added in the objective function, while the performance is measured only based on MSE. Moreover, using only the physical (MSE + PHY) instead of only L2 regularisation (MSE + L2) leads to a test performance decrease.
Even though the performance for both "MSE" and "MSE + L2" are better than "MSE + PHY", they both fail to produce physically plausible predictions in several test datasets (outliers) as shown in Figures 9 and 10 (the label "Reference" for the blue line refers to the synthetic test data obtained from the physical model), the clearest one being negative fraction values of CaO and H 2 O. One important aspect that needs to be considered is that the ANN was trained using only 100 training datasets, compared to almost 500 parameters that exist inside the network. This made the optimisation problem an ill-posed one, leading to clear overfitting in the network with "MSE" and "MSE + L2". Physical regularisation tackles this problem even for relatively sparse training data, which is valuable once experiments are costly, and therefore, not much data are often available to train the network.
Even though it produces the worst overall test performance, physical regularisation alone (MSE + PHY) is able to produce physically plausible results despite no application of L2 regularisation, see Figure 11. The figure illustrates the worst prediction result of all test sets obtained using "MSE + PHY". Even in its worst prediction with high error, the network is much more stable. With the addition of L2 regularisation in the "MSE + L2 + PHY" scheme, the prediction error (MSE) is further reduced so that it lies within the same range as "MSE + L2". The major difference here is illustrated in Figure 12, where the worst prediction result produced by "MSE + L2 + PHY" is far more physically plausible, shown by the absence of unstable fluctuations as well as the relatively higher accuracy.
We trained the ANN using numerical simulation results which indirectly imbues the physics from the formulated governing equations into the ANN. When the ANN was not trained using numerical simulation results but with real observation data (which could follow more complex, scientifically unexplored equations), physical regularisation helps to constrain the ANN training at least to fundamental, confirmed laws and prevent unnecessary overfitting to the irregular and noisy observation data. As such, implementation of the method we present will be even more beneficial for applications with real observation data.

Conclusions
We adopted a PINN framework as an example of grey-box modelling to predict the dynamic internal states of the TCES system. Our approach aligns with the motivation of PINN that sees the modelled system as a complex unit that is underrepresented by the governing equations used in the physical model. We do not construct the ANN only as a surrogate model for the expensive numerical model, unlike in other PINN approaches that use ANN to infer the governing equations of the modelled system or the parameter values, assuming that the physical model describes the system perfectly. Our method was designed for application with real observation data that imply more complex processes than the simplified physical model. In this paper, however, we used synthetic data to train the ANN as a proof of concept.
As a key contribution, we propose to implement physical knowledge about a system into building the structure, choosing the training mode and designing the regularisation of ANNs to assure the physical plausibility and to increase the performance of the TCES dynamic internal state predictions. The alignment between the system's behaviour (dynamic and nonlinear) and the ANN structure is described. The ANN predictions using different regularisation strategies are also compared to show the improvement provided by our method.
We show that, while training in P structure is computationally more expensive and unstable, the result is superior to training using SP structure, because P training resembles the dynamic of the governing differential equations better. Additionally, we found that, due to the nonlinearity and long-term memory effects implied by the system equations, deeper recurrency is necessary. A moderate depth of feedback delay produces better prediction performance, resulting from the network ability to capture the latency of the system. However, using too much feedback delay is also counter-productive, as it does not give significant improvement anymore, only increasing computational cost.
We also show that including physical regularisation to train the network improves the physical plausibility of the network predictions, even for worst-case scenarios. Physical regularisation helps the network to learn about relationships between different input and target variables, as well as the time-dependency between them. This includes mole and energy balance equations that serve as the building blocks of the system's behaviour, along with simple monotonicity and non-negative constraints. However, physical regularisation alone is not enough to improve the generalisation capability of the network, and therefore, L2 regularisation is also necessary.
A very common issue with using ANNs in data-driven modelling is that obtaining experimental or operational data can be very costly, and therefore, there is often no sufficient data available to train the ANN. Our work shows that even with only a relatively small amount of training data (compared to the number of network parameters), using P training with a moderate amount of feedback delay d y , combined with physical regularisation helps to prevent overfitting in optimising ill-posed problems and it produces relatively accurate and physically plausible predictions of the CaO/Ca(OH) 2 TCES system internal states. Further work is required for more sophisticated analysis of the system, for example with spatial distribution of the internal system, dynamic exogeneous input and uncertainty quantification of the predictions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:  Table A1 lists all of the exogeneous input with their corresponding distributions. The exogeneous input distributions are centred around the values taken from [20]. Table A1. Input distributions for exogenous inputs, with µ and σ being the mean and standard deviation used to generate the data, respectively; while the superscript D and H refer to the dehydration and hydration process, respectively.

Appendix B. Mole and Energy Balance Error
Physical constraints in e phy,i,t,1 and e phy,i,t,2 use mole and energy balance equation, respectively. Both balances are calculated in a simplified way discretized for time steps of 5 s with spatially averaged values and a local thermal equilibrium of the gas and the solid. For clarity, all the specific sample indicators in the training dataset i are omitted in this section.
The mole balance is formulated for H 2 O (assuming that the density can be calculated with ideal gas law) with the in-and outflowing moles n H20,in and n H20,out , the storage term in the gaseous phase ∆n H20,sto and the source/sink term ∆n H20,q . The in-and outflowing moles of H 2 O both are known values from the simulation or input data. The storage term ∆n H20,sto can be calculated from the change in H 2 O molar fraction, x g,H 2 O (t) −x g,H 2 O (t − 1) multiplied with the H 2 O molar density and the pore volume. The complete definition is written as: ∆n H 2 O,sto (t) = φV(x g,H 2 O (t) −x g,H 2 O (t − 1))ρ n,H 2 O (t).
The source/sink term ∆n H20,q is calculated with the molar amount of CaO formed. Based on the stoichiometry ratio, with every 1 mole of CaO forme, 1 mole of H 2 O is also formed. The molar amount of CaO is determined by the change in CaO volume fraction,ν CaO (t) −ν CaO (t − 1), multiplied with the molar density and the volume. The calculation for ∆n H20,q is written as: ∆n H 2 O,q (t) = V(ν CaO (t) −ν CaO (t − 1))ρ n,CaO . (A2) Finally, Equations (A1) and (A2) are substituted into Equation (17). For the energy balance formulation, the energy of the inflowing and outflowing gas Q in and Q out are also known from simulation or from input data. The energy storage in the gaseous phase is neglected as its contribution is negligible. Only the solid contribution is used in the calculation of ∆Q sto . The solid energy change is calculated as the change in both CaO and Ca(OH) 2 mass multiplied by the temperature and specific heat capacity. The definition is written as: where Q sto (t) is defined as: The source/sink term for the energy balance equation, ∆Q q , is calculated based on the change in molar amount of CaO multiplied by the specific reaction enthalpy and subtracted with the volume expansion work. The negative sign corresponds to the definition in Equation (7). The calculation is written as: Equations (A3) and (A5) are then substituted into Equation (18). Figure A1 shows the best prediction of the ANN using 2 hidden layers with 15 and 8 nodes at each layer using only MSE and L2 regularisation to define the loss function. Figure A1. An example of the best prediction sample (red) obtained using 2 hidden layers with 15 and 8 nodes at each layer and reference solution obtained from the physical model (blue).