LSTM and GRU Neural Networks as Models of Dynamical Processes Used in Predictive Control: A Comparison of Models Developed for Two Chemical Reactors

This work thoroughly compares the efficiency of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) neural networks as models of the dynamical processes used in Model Predictive Control (MPC). Two simulated industrial processes were considered: a polymerisation reactor and a neutralisation (pH) process. First, MPC prediction equations for both types of models were derived. Next, the efficiency of the LSTM and GRU models was compared for a number of model configurations. The influence of the order of dynamics and the number of neurons on the model accuracy was analysed. Finally, the efficiency of the considered models when used in MPC was assessed. The influence of the model structure on different control quality indicators and the calculation time was discussed. It was found that the GRU network, although it has fewer parameters than the LSTM one, may be successfully used in MPC without any significant deterioration of control quality.


Introduction
In Model Predictive Control (MPC) [1,2], a dynamical model of the controlled process is used to predict its behaviour over a certain time horizon and to optimise the control policy. This problem formulation leads to very good control quality, much better than that in classical control methods. As a result, MPC methods have been used for a great variety of processes, e.g., chemical reactors [3], heating, ventilation and air conditioning systems [4], robotic manipulators [5], electromagnetic mills [6], servomotors [7], electromechanical systems [8] and stochastic systems [9]. It must be pointed out that satisfactory control is only possible if the model used is precise enough. Although there are numerous types of dynamical models, e.g., fuzzy systems, polynomials, and piecewise linear structures [10], neural networks of different kinds [11] are very popular due to their excellent accuracy and simple structure [12]. In particular, Recurrent Neural Networks (RNNs) [13][14][15][16] can serve as a model as they are able to give predictions over the required horizon.
In theory, RNNs can be extremely useful in various machine learning tasks in which the data are time-dependent, such as the modelling of time series, speech synthesis or video analysis. In contrast to classical feedforward neural networks, RNNs can be used to create models and predictions from sequential data. In practice, however, their use is limited by one major drawback: the lack of long-term memory. RNNs have short-term memory capabilities, but they tend to forget long-term input-output time dependencies during backpropagation training. This problem is caused by the vanishing gradient phenomenon, which was described in great detail in [17][18][19]. Many ways of limiting the influence of the vanishing gradient on the training process have been proposed, such as using different activation functions (such as ReLU) or batch normalisation. Another approach is to modify the network architecture itself so that the gradient can propagate through time with less attenuation; the LSTM and GRU networks considered in this work follow this idea. Two questions naturally arise: whether the GRU network can match the modelling accuracy of the LSTM network and whether it performs equally well when used for prediction in MPC. Both of these issues are worth considering since the GRU networks have a simpler architecture and a lower number of parameters than the LSTM ones.
This work has three objectives: (a) a thorough comparison of LSTM and GRU neural networks as models of two dynamical processes, polymerisation and neutralisation (pH) reactors; an important question is whether or not the GRU network, although it has a simpler structure than the LSTM one, offers satisfactory modelling accuracy; (b) the derivation of MPC prediction equations for the LSTM and GRU models; (c) the development of MPC algorithms for the two aforementioned processes with different LSTM and GRU models used for prediction; an important question is whether or not the GRU network offers control quality comparable to that possible when the more complex LSTM structure is applied.
Unfortunately, to the best of the authors' knowledge, the efficiency of LSTM and GRU networks as dynamical models and their performance in MPC have not been thoroughly compared in the literature; typically, the LSTM structures are used [40][41][42].
The article is organised in the following way. Section 2 describes the structures of the LSTM and GRU neural networks. Section 3 defines the MPC optimisation task algorithm and details how the two discussed types of neural models are used for prediction in MPC. Section 4 thoroughly compares the efficiency of LSTM and GRU neural networks used as models of the two dynamical systems. Moreover, the efficiency of both considered model classes is validated in MPC. Finally, Section 5 summarises the whole article.

The LSTM Neural Network
The LSTM approach aims to create a model that has a long-term memory and, at the same time, is able to forget unimportant information in the training data. To achieve this, three main differences in comparison with classical RNNs are introduced:
• Two types of activation functions are used;
• A cell state serves as the long-term memory of the neuron;
• The neuron, called a cell, has a complex structure consisting of four gates that regulate the information flow.

Activation Functions
In the classical RNNs, the most commonly used activation function is the hyperbolic tangent:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)). (1)

Its output values lie in the range (−1, 1). This helps to regulate the data flow through the network and avoid the exploding gradient phenomenon [17,44]. In the LSTM networks, the usage of tanh is kept; however, the sigmoid activation function σ is additionally implemented. This function is defined as:

σ(x) = 1 / (1 + e^(−x)). (2)

Its output values lie in the range (0, 1), which allows the neural network to discard irrelevant information: if the output values are close to zero, the corresponding information is not important and should be forgotten; if the values are close to one, the information should be kept.
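The two activation functions can be sketched in Python/NumPy as follows; this is a minimal illustration, not code from the article:

```python
import numpy as np

def sigmoid(x):
    # Output in (0, 1): values near 0 mark information to forget,
    # values near 1 mark information to keep.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Output in (-1, 1): bounds the signal, which helps to avoid
    # the exploding gradient phenomenon.
    return np.tanh(x)

# Sample evaluations on a small grid:
x = np.array([-5.0, 0.0, 5.0])
s, t = sigmoid(x), tanh(x)
```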

Hidden State and Cell State
In the classical RNN architecture, the hidden state is used as a memory of the network and an output of the hidden layer of the network. The LSTM networks additionally implement a cell state. In their case, the hidden state serves as a short-term working memory. On the other hand, the cell state is used as a long-term memory to keep information about important data from the past. As depicted in Figure 1, only a few linear operations are performed on the cell state. Therefore, the gradient flow during the backpropagation training is relatively undisturbed. This helps to limit the occurrence of the vanishing gradient problem.

Gates
The LSTM network has the ability to modify the value of the cell state through a mechanism called gates. The LSTM cell shown in Figure 1 consists of four gates:
1. The forget gate f decides which values of the previous cell state should be discarded and which should be kept;
2. The input gate i selects which values from the previous hidden state and the current input should be used for the update by passing them through the sigmoid function; its output is later multiplied by the cell state candidate;
3. The cell state candidate gate g first regulates the information flow by applying the tanh function to the previous hidden state and the current input. The result is multiplied by the input gate output to calculate the candidate update of the cell state, which is then added to the (forget-gated) previous cell state;
4. The output gate o first passes the previous hidden state and the current input through the sigmoid function to select which new information should be taken into account. Then, the current cell state value is passed through the tanh function, and the two results are finally multiplied to give the current hidden state.

LSTM Layer Architecture
The LSTM layer of a neural network is composed of n_N neurons. The layer has n_f input signals. For a network used as a dynamical model of a process represented by the general equation

y(k) = f(u(k−1), ..., u(k−n_B), y(k−1), ..., y(k−n_A)), (3)

this parameter can be written as n_f = n_A + n_B. The vector of the network's input signals at the time instant k is then

x(k) = [u(k−1), ..., u(k−n_B), y(k−1), ..., y(k−n_A)]^T. (4)

When the entire LSTM layer consisting of n_N cells is considered, the gates can be represented as vectors f, g, i, o, each of dimensionality n_N × 1. The LSTM layer also contains a number of weights. The symbol W denotes the weights associated with the input signals x; the symbol R denotes the so-called recurrent weights, associated with the hidden state of the cells from the previous instant, h(k−1); the symbol b denotes the constant (bias) components. The subscripts f, g, i or o appear next to all the weights; they indicate to which gate the weights belong. The matrices W_f, W_g, W_i and W_o have dimensionality n_N × n_f; the matrices R_f, R_g, R_i and R_o have dimensionality n_N × n_N; the vectors b_f, b_g, b_i and b_o have dimensionality n_N × 1.
At the time instant k, the following calculations are performed sequentially in the LSTM layer of the network:

f(k) = σ(W_f x(k) + R_f h(k−1) + b_f), (6)
g(k) = tanh(W_g x(k) + R_g h(k−1) + b_g), (7)
i(k) = σ(W_i x(k) + R_i h(k−1) + b_i), (8)
o(k) = σ(W_o x(k) + R_o h(k−1) + b_o). (9)

The new cell state at the time instant k is then determined:

c(k) = f(k) ∘ c(k−1) + i(k) ∘ g(k). (10)

Finally, the hidden state at the time instant k can be calculated:

h(k) = o(k) ∘ tanh(c(k)). (11)

The symbol ∘ denotes the Hadamard product of the vectors; in other words, the vectors are multiplied elementwise. In Equation (10), this operation is used twice. The cell state from the previous time instant is multiplied by the values output by the forget gate: if those values are close to zero, the Hadamard product is close to zero as well, and the past information stored in the cell state is discarded; if the forget gate values are close to one, the past information remains mostly unchanged. Then, the outputs of the input gate and of the cell state candidate gate are multiplied elementwise. The purpose of this operation is similar: if the input gate values are close to zero, no new information is added to the cell state; otherwise, the cell state is updated with the values from the cell state candidate gate. In Equation (11), the Hadamard product is close to zero when the output gate values are close to zero; in this situation, little information from the cell state is passed to the new hidden state. Otherwise, the new hidden state reflects the current cell state.
The LSTM layer of the neural network is then connected to a fully connected layer, as shown in Figure 2. It has its weight vector W_y of dimensionality 1 × n_N and bias b_y. The output of the network at the time instant k is calculated as follows:

y(k) = W_y h(k) + b_y. (12)
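The calculations of Equations (6)-(12) can be sketched as a single NumPy forward step; the weights are randomly initialised and the sizes n_N, n_f are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
nN, nf = 4, 2  # number of LSTM cells and of network inputs (illustrative)

# Randomly initialised weights; the dictionary keys follow the gates i, f, g, o.
W = {k: rng.standard_normal((nN, nf)) * 0.1 for k in "ifgo"}
R = {k: rng.standard_normal((nN, nN)) * 0.1 for k in "ifgo"}
b = {k: np.zeros((nN, 1)) for k in "ifgo"}
Wy, by = rng.standard_normal((1, nN)) * 0.1, 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev):
    """One time step of the LSTM layer followed by the linear output layer."""
    i = sigmoid(W["i"] @ x + R["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x + R["f"] @ h_prev + b["f"])  # forget gate
    g = np.tanh(W["g"] @ x + R["g"] @ h_prev + b["g"])  # cell state candidate
    o = sigmoid(W["o"] @ x + R["o"] @ h_prev + b["o"])  # output gate
    c = f * c_prev + i * g        # Hadamard products update the cell state
    h = o * np.tanh(c)            # new hidden (short-term) state
    y = (Wy @ h).item() + by      # scalar network output
    return h, c, y

h = c = np.zeros((nN, 1))
x = np.ones((nf, 1))
h, c, y = lstm_step(x, h, c)
```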

The GRU Neural Network
The GRU network is a modification of the LSTM concept which aims to reduce the network's computational cost. The main differences between the architectures are:
1. The GRU cell lacks the output gate; therefore, it has fewer parameters;
2. The cell state is discarded; the hidden state serves both as the working and the long-term memory of the network.
The single GRU cell layout is presented in Figure 3. It consists of three gates:
1. The reset gate r is used to select which information to discard from the previous hidden state and the input values;
2. The role of the update gate z is to select which information from the previous hidden state should be kept and passed along to the next steps;
3. The candidate state gate g calculates the candidate for the future hidden state. This is done by first multiplying the previous hidden state by the reset gate's output; this step can be interpreted as forgetting unimportant information from the past. Next, new data from the input are added to the remaining information. Finally, the tanh function is applied to the data to regulate the information flow.

The current hidden state h(k) is calculated as follows. First, the previous state h(k−1) is multiplied by the output of the update gate z. Then, the state candidate g is multiplied by one minus the output of the update gate z. The results of both of those operations are finally added. This means that if the values output by the update gate z are close to zero, more new information is added to the current state h; alternatively, if the values output by the update gate z are close to one, the current state is mostly kept as it was at the previous time instants.
When considering the whole GRU layer of n_N cells, the weight matrices W_r, W_z, W_g have dimensions n_N × n_f, the matrices R_r, R_z, R_g have dimensions n_N × n_N, and the vectors b_r, b_z, b_g have dimensions n_N × 1. The following calculations are performed at the sampling instant k:

r(k) = σ(W_r x(k) + R_r h(k−1) + b_r), (14)
z(k) = σ(W_z x(k) + R_z h(k−1) + b_z), (15)
g(k) = tanh(W_g x(k) + R_g (r(k) ∘ h(k−1)) + b_g), (16)
h(k) = z(k) ∘ h(k−1) + (1_{n_N×1} − z(k)) ∘ g(k). (17)

Similar to the LSTM layer, the GRU layer of the neural network is then connected to a fully connected layer with the weight vector W_y of dimensionality 1 × n_N and the constant component b_y. The output of the network at the time instant k is determined by the hidden states of all cells of the GRU layer multiplied by the weights of the fully connected layer, i.e., according to the same relation as in Equation (12): y(k) = W_y h(k) + b_y.
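The GRU calculations of Equations (14)-(17) can be sketched in NumPy as below. Note that the pairing of z and 1 − z in the state update differs between references; the convention used here (z close to one keeps the old state, as in MATLAB's GRU layer) matches the interpretation given above. Weights and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
nN, nf = 4, 2  # illustrative sizes

W = {k: rng.standard_normal((nN, nf)) * 0.1 for k in "rzg"}
R = {k: rng.standard_normal((nN, nN)) * 0.1 for k in "rzg"}
b = {k: np.zeros((nN, 1)) for k in "rzg"}
Wy, by = rng.standard_normal((1, nN)) * 0.1, 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev):
    """One time step of the GRU layer followed by the linear output layer."""
    r = sigmoid(W["r"] @ x + R["r"] @ h_prev + b["r"])        # reset gate
    z = sigmoid(W["z"] @ x + R["z"] @ h_prev + b["z"])        # update gate
    g = np.tanh(W["g"] @ x + R["g"] @ (r * h_prev) + b["g"])  # state candidate
    h = z * h_prev + (1.0 - z) * g  # z ~ 1 keeps the old state, z ~ 0 takes new info
    y = (Wy @ h).item() + by        # scalar network output
    return h, y

h = np.zeros((nN, 1))
h, y = gru_step(np.ones((nf, 1)), h)
```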

LSTM and GRU Neural Networks in Model Predictive Control
The manipulated variable, i.e., the input of the controlled process, is denoted by u, while the controlled variable, i.e., the process output, is denoted by y. A good control algorithm is expected to calculate values of the manipulated variable that lead to fast control, i.e., the process output should follow the changes of the set-point. Moreover, since fast control usually requires abrupt changes of the manipulated variable, which may be dangerous for the actuator, such situations should be penalised. Finally, it is necessary to take some constraints into account; they are usually imposed on the magnitude and the rate of change of the manipulated variable. In some cases, constraints can also be imposed on the process output variable.

The MPC Problem
The vector of decision variables calculated online at each sampling instant of MPC is defined as the increments of the manipulated variable

Δu(k) = [Δu(k|k), Δu(k+1|k), ..., Δu(k+N_u−1|k)]^T, (19)

where the control horizon is denoted by N_u. The general MPC optimisation problem is

min_{Δu(k)} { J(k) = Σ_{p=1}^{N} (y_sp(k+p|k) − ŷ(k+p|k))^2 + λ Σ_{p=0}^{N_u−1} (Δu(k+p|k))^2 }. (20)

The cost function can be divided into two parts. The first part describes the control error, defined as the sum of the squared differences between the set-point value y_sp(k+p|k) and the output prediction ŷ(k+p|k) over the prediction horizon N. The notation (k+p|k) should be interpreted as follows: the prediction for the future instant k+p is calculated at the current instant k. The second part of the cost function consists of the squared increments of the manipulated variable multiplied by the weighting coefficient λ. Hence, the cost function penalises both the control errors and the changes of the control signal, and the weighting coefficient λ is used to fine-tune the trade-off between the two.
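A hedged sketch of the cost function of Equation (20); the predictor and all names are illustrative, not the article's code:

```python
import numpy as np

def mpc_cost(du, predict, y_sp, lam):
    # First part: squared control errors over the prediction horizon N;
    # second part: squared control increments weighted by lambda.
    y_hat = predict(du)
    return float(np.sum((y_sp - y_hat) ** 2) + lam * np.sum(du ** 2))

# Toy example: an "integrating" predictor over N = Nu = 2.
du = np.array([1.0, 0.0])
J = mpc_cost(du, lambda d: np.cumsum(d), np.array([1.0, 1.0]), lam=0.5)
```

With this toy predictor, the set-point is reached immediately, so only the increment penalty contributes to J.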
The constraints of the MPC optimisation problem are as follows:
• The magnitude constraints u_min and u_max are enforced on the manipulated variable over the control horizon N_u;
• The constraints Δu_min and Δu_max are imposed on the increments of the same variable over the control horizon N_u;
• The constraints y_min and y_max are put on the predicted output variable over the prediction horizon N.
Once the optimisation procedure has calculated the decision vector (Equation (19)) by solving the problem of Equation (20), only its first element is applied to the process:

u(k) = Δu(k|k) + u(k−1).

The whole computational scheme is then repeated at the next sampling instant.
In MPC [2], the general prediction equation for the sampling instant k+p is

ŷ(k+p|k) = y(k+p|k) + d(k), (21)

where p = 1, ..., N. The output of the model for the sampling instant k+p calculated at the current instant k is y(k+p|k), and d(k) is the current estimate of the unmeasured disturbance acting on the process output. Typically, it is assumed that the disturbance is constant over the whole prediction horizon, and its value is determined as the difference between the real (measured) value of the process output and the model output calculated using the process input and output signals up to the sampling instant k−1:

d(k) = y(k) − y(k|k−1). (22)
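The disturbance correction described above amounts to adding a constant output offset over the whole horizon; a minimal sketch with illustrative names:

```python
import numpy as np

def corrected_predictions(y_model_future, y_measured, y_model_current):
    # Disturbance estimate d(k) = y(k) - y(k|k-1): the model-plant mismatch.
    d = y_measured - y_model_current
    # The same offset is assumed constant and added over the whole horizon.
    return np.asarray(y_model_future) + d

y_hat = corrected_predictions([1.0, 2.0, 3.0], y_measured=0.5, y_model_current=0.3)
```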

The LSTM Neural Network in MPC
In the case of the LSTM model, to determine the predicted output, it is necessary to first calculate the predicted values of the cell state given by Equations (6)-(10). For p = 1, ..., N, the layer calculations are repeated with the predicted signals:

f(k+p|k) = σ(W_f x(k+p|k) + R_f h(k+p−1|k) + b_f),
g(k+p|k) = tanh(W_g x(k+p|k) + R_g h(k+p−1|k) + b_g),
i(k+p|k) = σ(W_i x(k+p|k) + R_i h(k+p−1|k) + b_i),
o(k+p|k) = σ(W_o x(k+p|k) + R_o h(k+p−1|k) + b_o),
c(k+p|k) = f(k+p|k) ∘ c(k+p−1|k) + i(k+p|k) ∘ g(k+p|k),
h(k+p|k) = o(k+p|k) ∘ tanh(c(k+p|k)).

Finally, based on the network output equation and the disturbance estimate d(k) (Equation (22)), the prediction of the output signal can be calculated as

ŷ(k+p|k) = W_y h(k+p|k) + b_y + d(k). (32)

Taking into account the input vector of the network (Equation (4)), for prediction over the prediction horizon, the vector of arguments of the network is

x(k+p|k) = [u(k+p−1|k), ..., u(k+p−n_B|k), ŷ(k+p−1|k), ..., ŷ(k+p−n_A|k)]^T, (35)

where, for arguments referring to sampling instants not later than k, the measured values of the process input and output are used.
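The recursive prediction over the horizon can be sketched as follows; the trained LSTM layer is replaced by a toy step function, and the feedback of predicted outputs into the input vector is omitted for brevity (all names are illustrative):

```python
import numpy as np

def predict_horizon(step_fn, h, c, future_inputs, d):
    # Roll the cell and hidden states forward over the prediction horizon
    # and add the constant disturbance estimate d to every output.
    y_hat = []
    for x in future_inputs:
        h, c, y = step_fn(x, h, c)
        y_hat.append(y + d)
    return np.array(y_hat), h, c

def toy_step(x, h, c):
    # Caricature of one LSTM step standing in for the trained layer.
    c = 0.5 * c + x        # "cell state" update
    h = np.tanh(c)         # "hidden state" update
    return h, c, float(h)  # scalar "network output"

y_hat, h_N, c_N = predict_horizon(toy_step, 0.0, 0.0, [1.0, 0.0], d=0.1)
```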

The GRU Neural Network in MPC
There is no cell state in the GRU neural networks; therefore, to calculate the predicted output signal values ŷ, only the prediction of the hidden state h has to be evaluated first. This is performed based on Equations (14)-(17) in the following way, for p = 1, ..., N:

r(k+p|k) = σ(W_r x(k+p|k) + R_r h(k+p−1|k) + b_r),
z(k+p|k) = σ(W_z x(k+p|k) + R_z h(k+p−1|k) + b_z),
g(k+p|k) = tanh(W_g x(k+p|k) + R_g (r(k+p|k) ∘ h(k+p−1|k)) + b_g),
h(k+p|k) = z(k+p|k) ∘ h(k+p−1|k) + (1_{n_N×1} − z(k+p|k)) ∘ g(k+p|k),

where 1_{n_N×1} is a vector of ones of dimensionality n_N × 1. The prediction of the output signal (Equation (32)) as well as the input vector (Equation (35)) are the same as in the LSTM neural network model. The proposed MPC control procedure may be summarised as follows:
1. The estimated disturbance d(k) is calculated from Equation (22):
   a. In the case of the LSTM network, the model output y(k|k−1) is calculated from Equations (6)-(12);
   b. In the case of the GRU network, the model output is calculated from Equations (14)-(17) and (12);
2. The MPC optimisation task is then solved. To calculate the output prediction, the cell and hidden state predictions must be calculated first:
   a. For the LSTM model, the predictions are calculated from Equations (24)-(32);
   b. For the GRU model, the state predictions are calculated from Equations (36)-(38), and the output prediction is generated as shown in Equation (32). The cost function is the same for both models and is given by Equation (20);
3. The first element of the computed vector of control increments is applied to the process, and the whole procedure is repeated at the next sampling instant.

Results of the Simulations
In order to compare the accuracy of the LSTM and GRU networks and their efficiency in MPC, we considered two dynamical systems: a polymerisation reactor and a neutralisation (pH) reactor.

Description of the Dynamical Systems
First, two considered processes are briefly described. Moreover, a short description of the data preparation procedure is given.

Benchmark 1: The Polymerisation Reactor
The first considered benchmark was a polymerisation reaction taking place in a jacketed continuous stirred-tank reactor. The reaction was the free-radical polymerisation of methyl methacrylate with azo-bis-isobutyronitrile as the initiator and toluene as the solvent. The process input was the inlet initiator flow rate F_I (m^3/h); the output was the Number Average Molecular Weight (NAMW) of the product (kg/kmol). The detailed fundamental model of the process was given in [45]. The process is nonlinear: in particular, its static gain depends on the operating point. The polymerisation reactor is frequently used to evaluate model identification algorithms and advanced nonlinear control methods, e.g., [12,45,46].
The fundamental model of the polymerisation process, comprising four nonlinear differential equations, was solved using the Runge-Kutta 45 method to obtain the training, validation and test datasets, each of them having 2000 samples. Every 50 samples, there was a step change of the control signal, the magnitude of which was chosen randomly. Next, since the process input and output signals had different magnitudes, they were scaled with respect to the nominal operating point, at which F_I = 0.016783 m^3/h and NAMW = 20,000 kg/kmol. The sampling time was 1.8 s.
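The excitation-signal generation described above (a random step change every 50 samples) can be sketched as follows; the amplitude bounds and the seed are illustrative assumptions, not the benchmark's values:

```python
import numpy as np

def make_excitation(n_samples=2000, step_len=50, lo=0.0, hi=1.0, seed=0):
    # A piecewise-constant signal: one random level held for step_len samples.
    rng = np.random.default_rng(seed)
    n_steps = n_samples // step_len
    levels = rng.uniform(lo, hi, size=n_steps)
    return np.repeat(levels, step_len)

u = make_excitation()
```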

Benchmark 2: The Neutralisation Reactor
The second considered benchmark was a neutralisation reactor. The process input was the base (NaOH) stream flow rate q_1 (mL/s); the output was the pH of the product. The detailed fundamental model of the process was given in [47]. The process is nonlinear since its static and dynamic properties depend on the operating point. Hence, it is frequently used as a benchmark to evaluate model identification algorithms and advanced nonlinear control methods, e.g., [46][47][48].
The fundamental model of the neutralisation process, comprising two nonlinear differential equations and a nonlinear algebraic equation, was solved using the Runge-Kutta 45 method to obtain the training, validation and test datasets, each of them having 2000 samples. Every 50 samples, there was a step change of the control signal, the magnitude of which was chosen randomly. The process signals were scaled with respect to the nominal operating point, at which q_1 = 15.5 mL/s and pH = 7. The sampling time was 10 s.

LSTM and GRU Neural Networks for Modelling of Polymerisation and Neutralisation Reactors
A number of LSTM and GRU models were trained for the two considered dynamic processes. All models were trained using the Adam optimisation algorithm. The maximum number of training epochs (iterations) was:
• 500 for the models with n_N ≤ 3;
• 750 for the models with 3 < n_N ≤ 7;
• 1000 for the models with n_N > 7.
The training procedure was performed as follows:
1. The order of the dynamics of the model was set to n_A = n_B = 1 and the number of neurons in the hidden layer to n_N = 1. For this configuration, ten models were trained, and the best one was chosen;
2. The number of neurons was increased to two; again, ten models were trained, and the best was chosen. This procedure was repeated until the number of neurons reached n_N = 30;
3. The first two steps were repeated with the increased orders of the dynamics n_A = n_B = 2 and n_A = n_B = 3.
It is important to stress that setting the order of the dynamics to higher than n A = n B = 3 did not result in any significant increase of the modelling quality. Therefore, further experiments with n A = n B > 3 are not presented.
It is an interesting question whether LSTM and GRU models without the recurrent input signals y(k−1), ..., y(k−n_A) can perform well in modelling tasks. In theory, the recurrent nature of the hidden state h should be sufficient to ensure good model quality. To verify this expectation, an additional series of models was trained. The training procedure was similar to the one described above, the only difference being that the order of the model dynamics was first set to n_A = 0, n_B = 1, then increased to n_A = 0, n_B = 2 and, finally, to n_A = 0, n_B = 3.
The quality of all trained models was then validated with the mean squared error chosen as the quality index. The models were validated in the nonrecurrent Autoregressive with eXogenous input (ARX) mode and in the recurrent Output Error (OE) mode. In the ARX mode, the model input vector contains the measured past process outputs,

x(k) = [u(k−1), ..., u(k−n_B), y(k−1), ..., y(k−n_A)]^T,

whereas in the OE mode, the measured outputs are replaced by the model's own past outputs,

x(k) = [u(k−1), ..., u(k−n_B), ŷ(k−1), ..., ŷ(k−n_A)]^T.
It is important to stress that in the case of the models with n A = 0, the ARX and OE modes were the same.
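The difference between the ARX and OE validation modes can be sketched as follows; the first-order toy model and the data are illustrative:

```python
import numpy as np

def validate(model, u, y_meas, mode="ARX"):
    # ARX mode feeds back *measured* outputs; OE mode feeds back the
    # model's *own* past outputs (free-run simulation).
    y_hat = [y_meas[0]]  # initialise from data
    for k in range(1, len(u)):
        y_prev = y_meas[k - 1] if mode == "ARX" else y_hat[k - 1]
        y_hat.append(model(u[k - 1], y_prev))
    return float(np.mean((y_meas - np.array(y_hat)) ** 2))  # mean squared error

def toy_model(uu, yy):
    # Perfect first-order toy model used to generate consistent data.
    return 0.5 * yy + uu

u = np.full(20, 0.1)
y_meas = np.zeros(20)
for k in range(1, 20):
    y_meas[k] = toy_model(u[k - 1], y_meas[k - 1])

e_arx = validate(toy_model, u, y_meas, "ARX")
e_oe = validate(toy_model, u, y_meas, "OE")
```

With a perfect model both errors vanish; for an imperfect model the OE error is typically larger, since prediction errors accumulate in the free run.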
Taking into account the objective of this work, it is interesting to compare the accuracy of the LSTM and GRU models with different structures, defined by the number of neurons n_N and the order of the model dynamics, determined by n_A and n_B. For the polymerisation reactor, the results for the chosen networks are given in Tables 1 and 2, and Figure 4 depicts the model validation errors for all considered numbers of neurons; for the neutralisation reactor, the corresponding results are given in Tables 3 and 4 and in Figure 5. The following notation is used:
• E_t is the mean squared error for the training dataset in the ARX mode;
• E_v is the mean squared error for the validation dataset in the ARX mode;
• E_t^rec is the mean squared error for the training dataset in the recurrent mode;
• E_v^rec is the mean squared error for the validation dataset in the recurrent mode.
The presented results can be summarised in the following way:

• In the case of the polymerisation reactor, the results achieved with the LSTM and GRU networks were comparable. As seen in Figure 4, the mean squared errors were similar for every combination of n_A, n_B and n_N;
• In the case of the neutralisation reactor, the LSTM models ensured a better quality of modelling, especially for models with a low number of parameters. However, as seen in Figure 5, as the number of neurons increased, this difference became more and more negligible. This is not surprising: GRU networks have fewer parameters than LSTM networks, so GRU models with a low number of neurons and a low order of the dynamics performed worse than their LSTM counterparts; as the models became bigger and more complex, the difference between their quality decreased;
• Models with a high number of neurons (15-30) ensured the best and most consistent modelling quality. This is not surprising, as the number of model parameters is directly proportional to the capacity to reproduce the behaviour of more complex processes. However, this can also be the main drawback of complex models, because the large number of parameters, as shown in Figures 6 and 7, increases their computational cost significantly;
• For the models with a medium number of neurons (3-10), the modelling quality was not consistent. In some cases, it was quite poor; in others, it even outperformed models with a large number of neurons (an example can be found in Table 4: the GRU network with n_A = n_B = 1, n_N = 5). One can conclude that this group of models has a structure complex enough to represent the behaviour of the systems under investigation; the training procedure must, however, be performed many times, as training may sometimes not be successful. In other words, if the goal is to find a model with the minimum number of parameters and good quality, the medium-sized models are the best option;
• Models with a low number of neurons (1-2) did not ensure a good modelling quality regardless of the neural network type and the model order of the dynamics, as shown in Figures 8 and 9; these models had too few parameters to accurately represent the behaviour of the processes under study;
• Interestingly, the order of the dynamics of the model seemed not to greatly impact the modelling quality. Models of a higher order were most commonly only slightly better than those with n_A = n_B = 1. Only in the case of the neutralisation reactor with n_A = 0 in Table 3 could a noticeable improvement be observed when n_B was set to two. The unique long-term memory of the networks under study may be the cause of this phenomenon: information about the important past input and output signals can be kept inside the hidden and cell states, and therefore, the networks can perform very well with only the most recent input values (i.e., n_A = 0, n_B = 1).

Table 3. The neutralisation reactor: comparison of selected LSTM and GRU networks without the recurrent inputs (n_A = 0) in terms of the training (E_t) and validation (E_v) errors.

Table 4. The neutralisation reactor: comparison of selected LSTM and GRU networks with the recurrent inputs in terms of the training (E_t) and validation (E_v) errors.

Based on the observations summarised above, it can be concluded that it is a good practice to train a model with a medium number of neurons and a low order of the dynamics. This approach may require many training trials, but the resulting model has a relatively low number of parameters and, therefore, a lower computational cost. A direct comparison of the polymerisation reactor models can be seen in Figures 10 and 11: both models performed very well, and the modelling errors were minimal. A similar comparison for the pH reactor can be seen in Figures 12 and 13.
The modelling quality was again very satisfactory. Here, it is important to stress that in the case of the GRU model with n A = 0, it was necessary to choose one with a higher order of the dynamics to achieve results similar to those ensured by the simpler LSTM models.

LSTM and GRU Neural Network for the MPC of Polymerisation and Neutralisation Reactors
A few of the best-performing models were chosen to be used for prediction in the MPC control scheme. First, let us describe the tuning procedure of the MPC controller. It starts with the selection of the prediction horizon, which should be long enough to cover the dynamic behaviour of the process; however, if the horizons are too long, the computational cost of the optimisation task increases. The control horizon cannot be too short, since that gives insufficient control quality, while lengthening it also increases the computational burden. The tuning process was therefore as follows:
• The constant weighting coefficient λ = 1 was assumed;
• The prediction horizon N and the control horizon N_u were set to the same, arbitrarily chosen length. If the controller was not working properly, both horizons were lengthened;
• The prediction horizon was gradually shortened, and its minimal possible length was chosen (with the condition N_u = N);
• The effect of the length of the control horizon on the resulting control quality was then assessed experimentally (e.g., assuming successively N_u = 1, 2, 3, 4, 5, 10, ..., N). The shortest possible control horizon was chosen;
• Finally, after determining the horizons' lengths, the weighting coefficient λ was adjusted.
After applying the tuning procedure to both processes under study, the following settings were determined:
• N = 10, N_u = 5, λ = 0.5 for the polymerisation process;
• N = 10, N_u = 3, λ = 0.5 for the neutralisation process.
Simulations of the MPC algorithms were performed in MATLAB. For optimisation, the fmincon() function was used with the following settings:
• Optimisation algorithm: Sequential Quadratic Programming (SQP);
• Finite differences type: centred.
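A Python analogue of this set-up can be sketched with SciPy's SLSQP solver (an SQP-family method) in place of fmincon(); the toy linear predictor and all numerical values are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

N, Nu, lam = 10, 5, 0.5          # horizons and weighting coefficient
y_sp = np.ones(N)                # set-point trajectory

def predict(du):
    # Toy linear "model": accumulate increments, hold the last value.
    u = np.cumsum(np.concatenate([du, np.zeros(N - Nu)]))
    return 0.8 * u

def cost(du):
    # MPC cost: squared control errors plus weighted squared increments.
    return np.sum((y_sp - predict(du)) ** 2) + lam * np.sum(du ** 2)

# Magnitude bounds on the control increments over the control horizon.
res = minimize(cost, x0=np.zeros(Nu), method="SLSQP",
               bounds=[(-0.5, 0.5)] * Nu)
du_opt = res.x
```

Only the first element of du_opt would be applied to the process before the optimisation is repeated at the next sampling instant.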
MPC performance using the models without the recurrent input signals (n_A = 0) proved to be very satisfactory. In the case of the polymerisation reactor, minimal overshoot and a short settling time can be observed in Figure 14. Similar control quality was achieved for the neutralisation reactor, as depicted in Figure 15. Interestingly, for MPC with the more complex models (n_A = n_B), the results were comparable, as demonstrated in Figure 16.
In the case of the polymerisation system and the LSTM model, small oscillations around the set-point could be observed, as shown in Figure 17, and the overall control quality was slightly worse. Tables 5 and 6 compare the simulation results of the MPC algorithms based on the LSTM and GRU models for the polymerisation and neutralisation processes, respectively. The following indicators used in control performance assessment were considered [49]:
• The sum of squared errors (E);
• The Huber standard deviation (σ_H) of the control error;
• The rational entropy (S_r) of the control error.
Additionally, the average calculation time (t) during the whole simulation horizon (in seconds) was recorded. From the performed experiments, we were able to draw the following conclusions:
1. Both types of neural networks allowed for a successful application of the MPC control scheme. All control performance indicators, i.e., E, σ_H and S_r, showed that the GRU models, when applied for prediction in MPC, lead to control quality very similar to that obtained when the rudimentary LSTM networks are used. What is more, as the GRU models have fewer internal parameters, their computational cost and, therefore, the calculation times are lower, as shown in Tables 5 and 6;
2. It is advisable to choose models with a relatively simple structure and a low number of parameters for implementation in the MPC scheme. More complex models often provide comparable or even worse control quality, and the computational cost rises with the number of model parameters;
3. Minor model imperfections are reduced with great success by the feedback mechanism of MPC. An example of this phenomenon can be observed in the bottom plots in Figure 12, where the model outputs differ slightly from the validation data in some areas; however, when the models are used in the MPC scheme, as shown in Figure 16, the control quality is very satisfactory. Nevertheless, the negative feedback is not sufficient to ensure satisfactory control if the model itself has poor quality. Example simulation results for the polymerisation process are presented in Figure 18: as a result of a very bad model, the MPC algorithm leads to unacceptable control quality, i.e., the set-point is never achieved, and strong oscillations are observed. Example simulation results when an inaccurate model is used in MPC for the neutralisation process are presented in Figure 19; in this case, the overshoot is larger and the settling time is longer when compared with the MPC algorithm based on a good model, e.g., as shown in Figure 15.
It is important to stress that the above observations hold for the two considered processes.

Figure 19. The neutralisation reactor: MPC results with the LSTM and GRU models for n_N = 1, n_A = n_B = 1.

Conclusions
Having performed numerous experiments with different structures of LSTM and GRU neural networks as models of dynamical systems used in the MPC of two chemical reactors, we found that the GRU network gives very good results. Firstly, it approximates the properties of the dynamical systems with good accuracy, comparable with that possible when the rudimentary LSTM model is used. Secondly, it gives very good results when used for prediction in MPC, very similar to those observed in the case of the LSTM models. It is necessary to point out that the number of model parameters is lower in the case of the GRU network. Hence, the use of the GRU network is recommended for modelling of dynamical processes and MPC.
Future work is planned to develop more computationally efficient MPC control schemes based on the GRU structure and for Multiple-Input Multiple-Output (MIMO) processes. Moreover, it is planned to develop GRU models and use them in MPC applied to the ball-on-plate laboratory process [8].