Modeling a Practical Dual-Fuel Gas Turbine Power Generation System Using Dynamic Neural Network and Deep Learning

Accurate simulation of gas turbines' dynamic performance is essential for improvements in their practical performance and for advancements in sustainable energy production. This paper presents highly accurate simulation models for a real dual-fuel gas turbine using two state-of-the-art neural network techniques: the dynamic neural network and the deep neural network. The dynamic neural network has been realized via a nonlinear autoregressive network with exogenous inputs (NARX) artificial neural network (ANN), and the deep neural network has been based on a convolutional neural network (CNN). The outputs selected for simulation are the output power, the exhausted temperature and the turbine speed or system frequency, whereas the inputs are the natural gas (NG) control valve, the pilot gas control valve and the compressor variables. The datasets have been prepared in three essential formats for the training and validation of the networks: normalized data, standardized data and SI-units data. Rigorous, wide-range trials have been carried out in tweaking the network structures and hyper-parameters, leading to highly satisfactory results for both models (overall, the minimum recorded MSE in the training of the MISO NARX was 6.2626 × 10⁻⁹ and the maximum MSE recorded for the MISO CNN was 2.9210 × 10⁻⁴, for more than 15 h of GT operation). The results have shown a comparably satisfactory performance for both the dynamic NARX ANN and the CNN, with a slight superiority of NARX. It can be newly argued that the dynamic ANN is better than the deep-learning ANN for the time-based performance simulation of gas turbines (GTs).


Aims and Motivations
Gas turbines' power share has increased progressively in the global power generation mix in recent decades due to the progress in their design specifications, efficiency and reliability [1,2]. The field of system modeling and identification has facilitated the path towards many notable improvements, including higher cycle efficiencies and reduced levels of emissions; therefore, GT power generation technology has become an unavoidable choice for many developed and developing countries [3][4][5]. It can be more informative to provide adequate motivation and background for this research before reviewing the literature. The operating principle of a dual-fuel GT can be as simple as shown below in Figure 1.
The air is discharged by the compressor (1-2) for more efficient combustion, while in the combustor, the air/fuel blend is fired and burned (2-3). Operation 1-2 is established as an isentropic process, whereas operation 2-3 is a constant-pressure, or isobaric, process. The combusted gases are then expanded through the gas turbine in another isentropic operation (3-4). The scientific merit of this article will be discussed in the next subsection, with a discussion of the related literature.

Related Work and the Paper Contribution
The aspects of a multidisciplinary/interdisciplinary nature can also be deduced from the literature review presented here; for a more detailed literature survey, the reader may refer to the recent critical review written by the corresponding author [5]. The recently published dynamic models of GTs, whether combined with the steam cycle to become CCGT or as a single unit, can be established by physical laws, system identification, artificial intelligence, machine learning or deep learning techniques. The literature will be informative, with an emphasis on modeling via neural networks, machine learning and deep learning methodologies. Asgari et al. (2014, 2016, 2021) [11][12][13] have published NARX-type ANN models to simulate some significant variables in the startup process, which have been used to simulate the behavior of an actual General Electric (GE) GT (PG 9351FA GT). The compression ratio has given the maximum error in the simulation, with an RMSE of 2.8% (0.028), and the minimum RMSE of 0.0004 was found in the speed response [11]. The same primary author has extended the work on GT modeling with a recurrent neural network with a single hidden layer, which achieved a comparable RMSE of approximately 0.22% (0.0022) for training and 2.6% (0.026) for testing [14]. Ibrahem (2020) [15] has offered a NARX ANN model for a Siemens SGT-A65 ADGTE gas turbine in order to pave the way toward the design of a predictive control strategy. Different neural network structures of ensemble and single MISO NARX were trained and tested. It has been found that the minimum RMSE achieved for the turbine speed during the training phase is 0, but is 0.0022 in the testing phase for one of the spool speeds. Mohamed et al. (2019) [4] have presented the performance of a feed-forward (FF) back-propagation ANN (BPNN) in simulation, for the purpose of a comparison with a physics-based model and a subspace system identification model.
The minimum error has been given by the FF ANN, at 0.05048 in the frequency or speed response. Rashid et al. (2015) [16] have presented a new model for CCGT by training an FF ANN via particle swarm optimization (PSO), where the MSE for training is 1.019 × 10⁻⁴ and for testing is 0.0055. Rahmoune et al. (2020) [17] have developed a NARX model to identify the dynamic behavior of the gas turbine components under the influence of vibration phenomena. The results of the proposed NARX model validated the capability of the NARX NN in determining the dynamic behavior of the gas turbine system, with a simulation MSE of 3.8414 × 10⁻³ for the high pressure (HP) turbine, and 1.29152 × 10⁻¹ and 2.12090 × 10⁻⁴ for the gas and air control valves, respectively. In terms of deep learning, Cao et al. (2021) [18] have presented different deep learning techniques that have been used to predict the changes in the efficiency and flow capacity of turbomachinery. The degradation predictions have been established via the LSTM approach, with a high accuracy ranging from 81.65% to 93.65%.
From this review and previous critical reviews [5], it can be readily found that there are no constraints on the achievable accuracy, and therefore more accurate results are probably still attainable. On the other hand, it is unfair to claim numerical superiority in accuracy for the proposed models with regard to the published literature, because that depends on factors other than the NN structure design, such as differences in the on-site data from one GT to another; such differences in datasets from one research study to another prevent any numerical claim of preference in accuracy. To the best of the authors' knowledge, deep learning techniques for modeling GTs have not yet been studied in detail for GT time-based dynamic simulation, and it is very interesting to know whether they are comparable, superior or less effective than the dynamic neural network with a shallow structure, especially the NARX ANN.
The convolutional neural network is a well-recognized example of deep learning tools, and NARX ANN is a typical and extensively used example of a dynamic neural network; therefore, they are both selected for this study. The scientific contributions of this manuscript are then: (1) Two accurate methods for simulating a Siemens dual-fuel GT have been shown, with an emphasis on the essential variables of the GT. One simulator has been established using a dynamic NARX ANN and the other is based on a deep-learning convolutional neural network; (2) The models' performances are depicted in MIMO and parallel MISO structures with highly accurate results; as overall indicators, the minimum recorded MSE in the training of the MISO NARX was 6.2626 × 10⁻⁹, and the corresponding testing MSE was 3.4983 × 10⁻⁷. On the other hand, the maximum average MSE recorded for the MISO CNN was 2.9210 × 10⁻⁴, and both networks worked successfully for more than 15 operating hours of the GT; (3) It is newly shown that the NARX dynamic ANN was slightly superior in accuracy to the deep neural network, which indicates that deep learning can be regarded as an alternative, but not a substitute, tool for the simulation of heavy-duty power GTs; in other words, it shall not replace the dynamic ANN, even with shallow architectures. One of the features that makes NARX ANNs superior is the adoption of past outputs as additional direct inputs, which increases their overall accuracy. This major advantage has no equivalent in deep convolutional networks, in spite of the variety of their hyper-parameters.
The rest of the paper is organized as follows: Section 2 presents the data preparation of the adopted GT, inputs/outputs selection, normalization, standardization and actual quantities. Section 3 presents the NARX ANN model development, Section 4 presents the CNN model establishment, Section 5 shows the simulations results of both methodologies with a comparison against the real measurements and quantified analysis of the results and, finally, Section 6 concludes the research study and findings with some feasible future trends.

Data Curation and Analysis
The datasets utilized for this study have been collected from a real gas turbine generation unit and are provided by the corresponding author. The dataset comprises long-term data that represent 16 h of GT operation. According to Tables 1 and 2, the collected datasets have been classified as GT input and output variables, with the operational range for each variable. As can be seen from the tables, four variables have been identified as the GT's inputs (the NG valve, the pilot valve, the compressor outlet temperature and the compressor outlet pressure), whereas the remaining three variables, which are the output power, the exhausted temperature and the frequency, i.e., the speed of the rotor, have been appointed as the outputs of the system. After defining the input and output parameters from the obtained datasets, the corresponding data have been divided into two groups, namely training and validation datasets; this eases the evaluation of model generalization and prevents over-fitting during the training phase.
The first group of data has been used to train the model, whereas the other group, which comprises unseen data, i.e., samples that have not been utilized during the training process, has been applied to evaluate the models' accuracy. The system formation, including the input and output variables, is shown in Figure 3. It is worth mentioning that the usual way of considering inputs is to include the compression ratio (CR) as an input instead of both the COT and COP; however, these can be equivalent, and the improved accuracy has been notable during the testing phase of the GT.
Standardization and normalization are the most popular rescaling techniques. Both approaches confine the features of the system data to a restricted range rather than a wider one; leaving features in widely differing ranges makes it very complex for the model to map inputs to outputs properly. However, the two techniques differ in the way they work, and each of them has special use cases. Based on this, the collected datasets from the GT unit have been pre-processed and rescaled into two main formats besides the SI-units data (normalized data and standardized data) in order to train and validate the built networks. It will be valuable to provide a brief description of these two processes in order to understand how and why the given data are normalized or standardized.

Data Normalization
This technique specifies the data within the 0 to 1 range or the −1 to +1 range. Normalization is required when there is a large difference in the ranges of the system's features; furthermore, this scaling approach can be beneficial when the collected data do not follow any particular distribution, such as a Gaussian distribution. Therefore, this technique can be very useful in neural network algorithms, since it does not assume any data distribution. This technique is also known as min-max scaling. Equation (1) presents the mathematical formula for the normalization approach [19][20][21]:

x_norm = (x − x_min)/(x_max − x_min) (1)
where x_max and x_min are the maximum and minimum values of the input or output feature, respectively. From the above equation, it can be clearly noticed that the range of each normalized feature falls between 0 and 1 according to the following three scenarios:
1. When x equals the minimum, x_norm is 0;
2. When x is the maximum point in the array, x_norm is 1;
3. If x is between the minimum and maximum, x_norm will be between 0 and 1.
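The three scenarios above can be sketched in a few lines of code. The paper's models were implemented in MATLAB; the following is an illustrative Python equivalent of Equation (1), and the sample values are hypothetical rather than the actual GT measurements:

```python
import numpy as np

def min_max_normalize(x):
    """Rescale a 1-D feature array to the [0, 1] range, as in Equation (1)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

# Illustrative values only (not the actual GT measurements): a power-like series in MW.
power_mw = [124.89, 180.00, 241.57]
power_norm = min_max_normalize(power_mw)
# The minimum maps to 0, the maximum to 1 and intermediate values fall in between.
```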

Data Standardization
This is another common rescaling approach that typically rescales the data about the mean with unit deviation, or unit variance. This indicates that the mean of the rescaled data is zero and that the resulting distribution has a standard deviation of one. Standardization can be useful when the data have a Gaussian distribution; however, this does not have to be the case. Furthermore, in contrast to normalization, standardization has no bounded range; as such, if the data contain outliers, standardization will not clip them. Equation (2) shows the formula associated with the standardization technique [19][20][21]:

x_std = (x − µ)/σ (2)
where µ is the mean of the feature values and σ is their standard deviation. It can be noticed from the above equation that the input and output values are not restricted to a particular range.
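In the same spirit, Equation (2) can be sketched as an illustrative Python snippet (the study itself was carried out in MATLAB, and the sample temperatures below are hypothetical):

```python
import numpy as np

def standardize(x):
    """Rescale a feature to zero mean and unit standard deviation, as in Equation (2)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical exhaust-temperature-like values (deg C), for illustration only.
temps = [520.0, 535.0, 510.0, 545.0]
z = standardize(temps)
# z has mean 0 and standard deviation 1, but its values are not bounded to [0, 1].
```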
In conclusion, the choice between normalization and standardization ultimately relies upon the type of data and the machine-learning technique to be employed. There is no hard and fast rule that states when the data should be normalized or standardized. Fitting the model with the actual, normalized and standardized data, and then comparing the performance among these three types of data formatting, can be a powerful criterion in the deployment of the final model of a GT power plant; see Figure 4, which is dedicated to data curation in this study.

The NARX Model Setup
The mathematical expression of the NARX model can be given as [22]:

ŷ(t) = f(y(t − 1), . . . , y(t − n_y), u(t − 1), . . . , u(t − n_u)) + e(t) (3)

where y(t) and ŷ(t) are the target and predicted output variables, respectively; u(t) is the input variable of the network; n_u and n_y are the time delays of the input and output variables; and e(t) is the model error between the target and the prediction. In other words, y and u are the output and externally determined variables in this equation, respectively. The next value of the dependent output signal, y(t), is regressed on previous values of the output signal and of an independent (exogenous) input signal.
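As a minimal sketch of how such a network maps Equation (3) onto a single sigmoid hidden layer with a linear output, the following Python snippet builds the regression vector from tapped delays of u and y and performs one prediction step. The weights here are random and untrained, and the dimensions are toy values; the actual models described below were configured and trained in MATLAB:

```python
import numpy as np

def narx_one_step(u_hist, y_hist, W_in, W_fb, a, W_out, b, n_u, n_y):
    """One-step-ahead NARX prediction: tapped delays of u and y feed a single
    sigmoid hidden layer, followed by a linear output layer (cf. Equation (3))."""
    x = np.concatenate([u_hist[-n_u:], y_hist[-n_y:]])  # regression vector
    W = np.hstack([W_in, W_fb])                         # input + feedback weights
    h = 1.0 / (1.0 + np.exp(-(W @ x + a)))              # sigmoid hidden layer
    return W_out @ h + b                                # linear output layer

# Toy dimensions: 2 input delays, 2 output delays, 5 hidden neurons, 1 output.
rng = np.random.default_rng(0)
n_u, n_y, n_h = 2, 2, 5
W_in, W_fb = rng.normal(size=(n_h, n_u)), rng.normal(size=(n_h, n_y))
a, W_out, b = rng.normal(size=n_h), rng.normal(size=(1, n_h)), rng.normal(size=1)
y_next = narx_one_step(np.array([0.2, 0.4]), np.array([0.1, 0.3]),
                       W_in, W_fb, a, W_out, b, n_u, n_y)
```

In closed-loop operation, the predicted ŷ(t) would simply be appended to y_hist before the next step, which is exactly the multi-step mode discussed later.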
To set up an accurate and reliable NARX model for the GT power plant with an acceptable predictive performance, much like other dynamic neural networks, various architectures may be considered over a wide range of trials [13]. These architectures are based on several factors, such as the number of inputs and outputs, i.e., the MIMO or MISO structure; the training algorithms; the number of hidden layers; the number of neurons in the hidden layer(s); the type of activation functions; the maximum number of epochs, i.e., iterations; the number of recurrent connections; and the time delays in the recurrent connections. In addition, another vital factor has been included in this study, which is the data type, i.e., the data format. Figure 5 shows the NARX structure constructed for this study, in which the tapped delay line (TDL) is employed to feed the network with the past values of the inputs and outputs. As can be seen from this figure, the proposed NARX model is composed of four inputs, one hidden layer and three outputs. The variables x_0 to x_4 represent the computer representation of the inputs, and w_0 to w_4 are the connection weights, which will be generalized later in the equations describing the NARX ANN; σ is the sigmoid activation function symbol, S is the linear activation function symbol and Ŷ(t) is the predicted output value. A thorough computer code in the MATLAB programming environment has been developed to set up and configure the NARX models with sophisticated generalization properties. MATLAB is a versatile programming environment developed by MathWorks for numerical computation in engineering and scientific applications. The generated code includes several hyper-parameters for training and configuring the NARX models of the gas turbine generation unit.
More precisely, the maximum number of iterations, the learning rate, the number of hidden layer neurons, the time delays in the recurrent connections and the model structure, i.e., the MIMO and MISO configurations, as well as the data type (normalized, standardized and actual data), have all been considered in the developed code as a combination of a variety of settings. In addition, this study employs a feed-forward multilayer dynamic neural network architecture with an input layer, one hidden layer with a sigmoid-type transfer function and an output layer with a linear activation function. Furthermore, the developed program has been used to train a wide range of NARX topologies, employing three training algorithms in the training step: the Levenberg-Marquardt (LM) algorithm, the Bayesian regularization algorithm and the scaled conjugate gradient algorithm. Eventually, the tweaking of all hyper-parameters, in addition to the training algorithm, gives an indication of the best performance and its relevant NARX model. The mean squared error (MSE), which expresses the average squared error between the network outputs and the targets, is the default performance function for feed-forward networks and can be expressed as [23]:

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² (4)

The backpropagation technique, which involves executing computations backwards through the network, is used to determine the gradient and the Jacobian. However, it is tough to estimate which training method will be the most efficient for a given situation [23]. It is determined by a variety of parameters, including the problem's complexity, the quantity of data points in the training set, the number of weights and biases in the network, the error target and whether the network is used for pattern recognition (discriminant analysis) or function approximation (regression) [23].
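The MSE performance function is compact enough to state in code; this is an illustrative Python version (the study itself relied on MATLAB's built-in performance function):

```python
import numpy as np

def mse(y_target, y_pred):
    """Mean squared error between the targets and the network outputs."""
    y_target, y_pred = np.asarray(y_target, float), np.asarray(y_pred, float)
    return float(np.mean((y_target - y_pred) ** 2))

# A perfect prediction gives zero error; a constant offset d gives an MSE of d**2.
```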
Therefore, the proposed NARX model of the GT power plant has been trained over a wide range of trials, including the three different optimization algorithms, in order to obtain the best performance and the most applicable NARX network. For more details about the Levenberg-Marquardt (LM), Bayesian regularization (BR) and scaled conjugate gradient (SCG) training algorithms, refer to [23]. According to the input variable u(t) in Equation (3), the output of the i-th hidden neuron at time t is computed as [22]:

h_i(t) = f_1(Σ_j w_ij u(t − j) + Σ_j w′_ij ŷ(t − j) + a_i) (5)

where w_ij is the connection weight between the input neuron u(t − j) and the i-th hidden neuron; w′_ij is the connection weight between the i-th hidden neuron and the output feedback delayed loop; a_i is the bias of the hidden layer neurons; and f_1(·) is the hidden layer transfer function, i.e., the activation function [22]. As mentioned before, the sigmoid function has been used in the proposed code as the hidden layer activation function. Equation (6) shows the mathematical expression of the sigmoid function [22]:

f_1(x) = 1/(1 + e^(−x)) (6)

The final NARX network prediction can eventually be obtained by integrating the hidden layer outputs, as given in Equation (7) [22]:

ŷ_l(t) = f_2(Σ_{i=1}^{n_h} w_li h_i(t) + b_l) (7)
where w_li is the connection weight between the i-th hidden neuron and the l-th estimated output; b_l is the bias of the l-th predicted output; n_h is the number of hidden neurons; and f_2(·) is the output layer activation function. The mathematical representation of the linear activation function f_2(·) is presented in Equation (8) [22]:

f_2(x) = x (8)

According to the written code, the early stopping condition for the number of iterations, i.e., epochs, has been set to 1000. The datasets, in their three formats, have been divided into three subsets: the training set (70%) for training the model; the validation set (15%) to confirm that the network generalizes properly and to stop the training step before overfitting; and the test set (15%), which is utilized as a totally independent test of network generalization. The divided datasets have been applied to train the open-loop NARX model to guarantee an efficient learning procedure, since the true outputs are available during the training process, as discussed before. After determining the optimal open-loop NARX model over a wide range of trials, the optimal open-loop network can then be transformed into a closed-loop mode for multi-step prediction. In this study, there are eighteen NARX architectures based on an open-loop mode with MIMO and parallel MISO structures and with different parameters: the number of hidden layer neurons, the training algorithms, the time delay in the recurrent connection and the data format.
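The 70/15/15 division can be sketched as follows. Note that the text does not state whether the subsets were drawn contiguously or at random (MATLAB's default `dividerand` samples indices at random), so a simple contiguous split is shown purely for illustration:

```python
import numpy as np

def split_70_15_15(n_samples):
    """Contiguous 70/15/15 train/validation/test index split of a time series."""
    i_train = int(0.70 * n_samples)   # end of the training block
    i_val = int(0.85 * n_samples)     # end of the validation block
    idx = np.arange(n_samples)
    return idx[:i_train], idx[i_train:i_val], idx[i_val:]

# Example: 1000 samples -> 700 training, 150 validation, 150 test indices.
train_idx, val_idx, test_idx = split_70_15_15(1000)
```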
The next subsection explains the MIMO and MISO NARX models.

The MIMO Model
The model has been evaluated with one hidden layer, various numbers of neurons in the hidden layer and various time delays, as well as different data types. The network has a three-neuron output layer, which means that the output power, frequency and exhausted temperature are all predicted one step ahead simultaneously. Furthermore, the three learning approaches have been tested, i.e., Levenberg-Marquardt, Bayesian regularization and the scaled conjugate gradient. Due to the very high number of trials, it is infeasible to mention all of them here, but some samples that show the performance MSE and regression parameter R of the resultant MIMO NARX models are tabulated in Table 3, with the best design in bold. According to the findings shown in Table 3, it can be noticed that the MIMO NARX structure with fifteen hidden layer nodes, a recurrent-connection time delay of 30 time samples, the normalized data format and the Bayesian regularization training algorithm produced the best results in the test subset. Furthermore, the best regression coefficient was also found for the same network. The optimum performance and regression of the developed MIMO NARX model, with four inputs and three outputs at a time, are shown in Figures 6 and 7, respectively. These graphs depict both the mean squared error (MSE) trend for the training and test sets and their regression training coefficient R during the learning procedure. The decrease in both the training and, especially, the test set trends demonstrates that there is no over-fitting in the model. As the performance figure shows, the best training performance was obtained after 503 iterations (epochs), since the minimum gradient was reached, with the MSE averaging 1.0732 × 10⁻⁶. Figure 8 represents the optimal open-loop MIMO NARX model based on fifteen neurons in the hidden layer.
It can be noticed from Figure 8 that the three outputs are fed into the input layer and output layer at the same time. Despite the relatively high performance and regression coefficients of the MIMO NARX network created, dealing with one output at a time is more efficient in the NARX network and will result in a high performance for the time prediction of each output parameter of the GT unit. Therefore, further developments have been carried out on the MATLAB code to create an open-loop MISO NARX model to predict the GT parameters individually. The constructed MISO models and their performance are elaborated in the next section.

The Parallel MISO Model
The model has been evaluated with one hidden layer, various numbers of neurons in the hidden layer and various time delays, as well as different data types. The network had a one-neuron output layer, which means that the output power, frequency or exhausted temperature was predicted one step ahead, one output at a time in each trial. Furthermore, the three learning approaches have been tested, i.e., Levenberg-Marquardt, Bayesian regularization and the scaled conjugate gradient. Some samples of the trials for establishing the MISO model, with the MSE performance and regression coefficients R of the resultant MISO NARX models for each parameter (output power, frequency and exhausted temperature), are tabulated in Tables 4-6, respectively. The computational reason for the superiority of the BR training algorithm can be argued to be the fact that BR has no early stopping criteria such as those in the LM and SCG algorithms. In addition, the normalized data are handled much better by the NARX ANN than the actual and standardized data, because of the harmony in the upper and lower limits of all outputs of the GT in normalized values, and because the given dataset is a time-based measurement record, which does not belong to the class of data that embeds a Gaussian distribution. From Table 6, the optimal average MSE and regression coefficient of the three GT parameters have been found for the structure with twenty hidden layer neurons and a 30-sample time delay, employing the normalized data type and the Bayesian regularization training algorithm. The optimal training performance, with an average MSE of 8.46 × 10⁻⁷, was obtained after 1000 iterations (epochs), since the maximum number of epochs was reached. Furthermore, the best regression coefficient was also found for the same NARX network. Figures 9-12 show the performance and the regression plot of each developed MISO NARX model, based on four inputs and one output at a time, for the three output variables.
These figures illustrate both the mean squared error (MSE) trend for the training and test sets and their regression training coefficient R during the learning procedure. The decrease in the MSE trend demonstrates that there is no overfitting in the proposed MISO NARX model. The regression plots demonstrate that the model achieved optimum fits, since the data points lie along the line at which all of the outputs are on par with the targets. Figure 13 represents the optimal open-loop MISO NARX model with twenty neurons in the hidden layer. It can be noticed from this figure that one output is fed into the input layer and the output layer at the same time.

The Deep Learning Convolutional Neural Network (CNN) Model Setup
In this section, it will be valuable to elaborate on the gradient descent algorithm used for training the CNN and on how this technique works, in order to justify the GT data curation. Gradient descent is an optimization technique utilized when training a neural network model based on a convex cost function [19]. It tweaks the parameters of the CNN model to attain the minimum of the given model's cost function. This function quantifies the performance of the model by computing the error between the predictions and the actual data values, then represents it as a single real number. In other words, gradient descent is a paramount technique in machine learning models that determines the function coefficients that minimize the cost function as much as feasible; more details regarding the gradient descent algorithm can be found in the powerful Coursera course [20]. In machine learning and deep learning terms, the gradient descent can be regarded as a derivative of a function with more than one input [19]. The mathematical translation of the gradient descent technique is as follows [21]:

ω_{j+1} = ω_j − α (1/m) Σ_{i=1}^{m} (h_ω(x^(i)) − y^(i)) x_j^(i) (9)

This equation adjusts the weight values until convergence is reached (i.e., the minimum value of the cost function), where: ω_{j+1} is the iterated weight value; ω_j is the previous weight value; α is the learning rate; m is the number of training samples; h_ω is the hypothesis; x^(i) is the i-th training example; y^(i) is the corresponding target of the i-th example; and x_j^(i) is the j-th feature in a given training example.
As can be seen in Equation (9), the cost function is firstly based on the initial value of the weight vector. These weights are adjusted iteratively using the gradient descent method over the given datasets in order to minimize the cost function of the generated model. From the aforementioned basics, it is clear that the variable x, which represents the input variables fed into the model, will influence the gradient descent step size. Moreover, as mentioned before, the datasets used for the proposed models have been drawn from a practical GT generation unit, which in turn means that the system's variables have a highly dynamic distribution. Therefore, the input and output datasets fed to the NARX- and CNN-based GT models may differ greatly in scale, range and distribution for each variable; for example, the deviation among the output power values and the exhausted temperature values is somewhat larger than the change in the frequency instances. Differences in scale among the model's parameters may exacerbate the difficulty of the modeled problem [19]. Some of the large input and output values may result in a model that learns large weight values. A model with large weight values is frequently unstable, which means that it may perform poorly during the learning phase and may be sensitive to input values, resulting in an increased mean squared error, i.e., generalization error. Therefore, there is a need to apply a feature-rescaling technique to the GT's variables in the data pre-processing step. Data pre-processing guarantees that the gradient descent of the model heads smoothly towards the minimum error and that the gradient descent steps are updated at the same rate for all parameters. Having the features of the data on a similar scale makes all input and output variables of the GT power plant equally important and easier to handle by the NARX and CNN models [21].
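As an illustration of the weight-update rule of Equation (9), the following Python sketch runs batch gradient descent on a tiny synthetic linear-regression problem; the data, learning rate and iteration count are arbitrary demonstration choices, not values from the study:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iter=5000):
    """Batch gradient descent for a linear hypothesis h_w(x) = X @ w,
    following the weight-update rule of Equation (9)."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        grad = (X.T @ (X @ w - y)) / m   # mean gradient of the squared-error cost
        w -= alpha * grad                # step downhill, scaled by the learning rate
    return w

# Fit y = 3 + 2x on a tiny synthetic set (first column of X is the intercept term).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0, 11.0])
w = gradient_descent(X, y)   # converges towards w ≈ [3, 2]
```

The example also makes the rescaling argument of this section concrete: if one column of X were orders of magnitude larger than the others, a single learning rate α could not suit all weights at once.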
The convolutional neural network (CNN) is one of the most popular deep neural networks [24]. A CNN usually comprises various layers, such as convolutional layers, pooling layers and fully connected, i.e., dense, layers. Figure 14 represents a typical example of a CNN architecture. According to this figure, the first type of layer, called a convolutional layer, consists of filters and feature maps. The input to each filter is known as the receptive field [25], and it has a defined size. Each filter is slid across the previous layer, producing an output that is collected in the feature map. In other words, the CNN's convolutional layer applies local perception and weight sharing, which consequently improves its ability to extract the significant features [25,26]. It is informative to mention that the GT datasets used here are one-dimensional; thus, the corresponding convolutional layer that deals with the given datasets will be a 1D convolutional layer. A 1D-CNN performs convolution across a local area of the input parameters to generate the appropriate feature. Each kernel, i.e., filter, produces unique characteristics on the feature map in all locations. Since the 1D-CNN utilizes the weight-sharing approach, as mentioned before, fewer parameters need to be converged with the 1D-CNN than with conventional neural networks [26]. This ensures that the 1D-CNN converges earlier and faster. An example of a 1D convolutional operation is illustrated in Figure 15. The kernel size is set to 2, which means that the weights (w_1, w_2) will be shared at every step between the input layer (x_1, x_2, · · ·, x_n) and the output (y_1, y_2, · · ·, y_n). In the kernel window (w_1, w_2), which represents the filter size, the input values are multiplied by the weights and the products are then summed up in order to compute the value of the feature map.
In the shown example, the value of y_2 is obtained from y_2 = w_1·x_1 + w_2·x_2 [26]. The output of the convolution layer is provided as both the output and the input of the following layer. It also represents the features derived from the training samples using the convolution kernel. In order to obtain one-dimensional features, the 1D-CNN performs input signal convolution operations in the local area, and various kernels extract certain features from the inputs. As illustrated in Figure 15, each kernel recognizes certain characteristics at any location on the input feature map, and weight-sharing is performed on the same input feature map. This mechanism minimizes the number of parameters during training. If L_i is a 1D convolutional layer, its mathematical formula can be generally expressed as in Equation (10) [26]:

x_k^l = f( Σ_{m∈M} x_m^{l−1} * w_{k,m}^l + b_k^l ) (10)

where k denotes the number of convolution kernels, j is the kernel size and M refers to the number of input channels of x_i^{l−1}. The kernel bias is indicated by b, and the symbol (*) is the convolution operator. f(·) represents the non-linear activation function. CNNs usually utilize the rectified linear unit (ReLU), i.e., f(x) = max(0, x), as an activation function [24]. Pooling layers are paramount for CNNs. Pooling methods can be thought of as downsamplers used to minimize the number of parameters while maintaining the major features in order to speed up the next stage; since there are more feature maps in the downstream sampling phase, the data dimensionality increases, which makes the calculations too complex [27]. Figure 16 illustrates the max-pooling operation used in this study.
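The convolution, ReLU and max-pooling operations described above can be reproduced in a few lines of Python. The kernel weights w_1 = w_2 = 0.5 are illustrative assumptions (the figures in the paper do not fix numerical values), and, as in most deep learning frameworks, the "convolution" is implemented as cross-correlation:

```python
import numpy as np

def conv1d_valid(x, kernel, bias=0.0):
    """'Valid' 1-D convolution (cross-correlation): the kernel weights are
    shared at every position along the input, as in Figure 15."""
    k = len(kernel)
    return np.array([float(np.dot(x[i:i + k], kernel)) + bias
                     for i in range(len(x) - k + 1)])

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def max_pool1d(x, pool=2):
    """Non-overlapping 1-D max pooling: keeps the dominant feature per window."""
    return np.array([x[i:i + pool].max() for i in range(0, len(x) - pool + 1, pool)])

# A size-2 kernel slid along a short input, as in the worked example of Figure 15.
x = np.array([1.0, 2.0, 3.0, 4.0])
feat = relu(conv1d_valid(x, np.array([0.5, 0.5])))  # -> [1.5, 2.5, 3.5]
pooled = max_pool1d(feat)                           # -> [2.5]
```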
The learning rate determines how fast or slow the training moves towards the optimal weights. If the learning rate is very large, the optimizer will skip over the optimal solution; if it is too small, too many iterations are needed to converge to the best values. Therefore, choosing a good learning rate is crucial. The selected training hyper-parameters are listed in Table 7.
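The effect of the learning rate can be demonstrated with a minimal gradient-descent sketch on a toy loss f(w) = w²; the three learning-rate values are illustrative, not the ones in Table 7:

```python
def gradient_descent(lr, steps=50, w=1.0):
    # Minimize f(w) = w**2, whose gradient is 2w, with a fixed step size
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

too_small = abs(gradient_descent(lr=0.01))  # slow: still far from the minimum
good      = abs(gradient_descent(lr=0.1))   # converges quickly towards 0
too_large = abs(gradient_descent(lr=1.1))   # overshoots and diverges
```

With lr = 0.01 the weight has barely decayed after 50 steps, with lr = 0.1 it is essentially at the optimum, and with lr = 1.1 each step overshoots the minimum and the iterate grows without bound.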

Time-Based Simulation Results and Discussion
This section depicts the simulation results of the two approaches and their architectures (Figures 17-22). From the results and the previously tabulated MSEs, it is evident that the deep CNN and the dynamic NARX ANN have shown a satisfactory performance in their application to heavy-duty dual-fuel GTs. They can be used for short-term or long-term predictions, controller upgrading, performance monitoring during measurement device malfunctioning, estimating the fuel requirements for a given demand, characterizing the GT with different fuels and so on.
The power trends (ranges: 0-1 normalized, corresponding to an actual range of 124.89-241.57 MW over a load-down followed by a load-up) are followed successfully by both techniques, with negligible errors (minimum MSE of 6.2626 × 10−9 and maximum MSE of 2.9210 × 10−4) over the adopted long operation time of the GT (more than 15 h of continuous operation), which indicates the robustness of both the deep learning and the shallow dynamic ANNs. Such accuracies in the responses of GTs are difficult to attain with physics-informed or other system identification techniques, because the power plant noise and uncertainties are high and vary increasingly with the changes in the operating conditions. In addition, the differing natures of the responses make the simulation far more challenging; for instance, the power variations appear to be slower than the changes in the temperature and frequency, whereas the latter responses change more sharply, which makes it computationally over-complicated for the models to track all these variation trends simultaneously. Nevertheless, the techniques proposed in this paper have handled such computational burdens easily, with prediction capabilities covering a longer time than previously published: more than 15 h (more than 54,000 s) of operation.
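The MSE criterion used to compare the two models throughout is computed as the mean of the squared residuals between the measured and simulated outputs; a minimal sketch (the sample values below are illustrative, not the paper's data):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: the accuracy criterion reported for both models
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical normalized power samples and their simulated counterparts
measured  = [0.20, 0.55, 0.90, 0.40]
simulated = [0.21, 0.54, 0.91, 0.39]
err = mse(measured, simulated)
```

Because the residuals are squared, the MSE is scale-dependent, which is one reason the normalized (0-1) data format gives such small numeric error values compared with the SI-unit format.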
It can also be seen that the NARX ANN has shown a slight superiority in the error values, and also when zooming into the results, for both structures (parallel MISO and MIMO); this could be due to the following reasons:
1. Its simplified structure implicates the direct effect of the inputs on the outputs; therefore, the inputs are reflected more realistically in the outputs;
2. The use of fed-back delayed outputs as additional inputs increases the number of inputs utilized to depict the output more accurately. This important feature has no equivalence in the CNN, despite the CNN's sophistication in the variety and number of its layers.
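The second point, the feedback of delayed outputs, amounts to augmenting the regressor vector at each time step with past output samples. A minimal sketch of how such NARX training pairs could be assembled (the sequences, delay orders nu and ny, and helper name are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def narx_regressor(u, y, nu=2, ny=2):
    """Build NARX training pairs: predict y[k] from the nu most recent
    inputs and the ny most recent (fed-back) outputs.

    Row k: [u[k-1], ..., u[k-nu], y[k-1], ..., y[k-ny]] -> target y[k].
    """
    d = max(nu, ny)
    X, t = [], []
    for k in range(d, len(y)):
        X.append(np.concatenate([u[k - nu:k][::-1], y[k - ny:k][::-1]]))
        t.append(y[k])
    return np.array(X), np.array(t)

u = np.arange(6, dtype=float)   # hypothetical input sequence (e.g., valve position)
y = 0.5 * u                     # hypothetical output sequence (e.g., power)
X, t = narx_regressor(u, y)     # each row carries nu + ny = 4 features
```

Each regressor row carries ny extra features that a feed-forward CNN operating only on the exogenous inputs does not see, which is the structural reason suggested above for the NARX model's slight edge.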
It can be generally deduced that the dynamic ANN, even if recognized as a shallow ANN with a single hidden layer, is still a leading choice for the modeling and simulation of GTs, yielding negligible simulation errors and a high-fidelity reproduction of the variation trends of GT power plants. For other successful applications of CNN and NARX ANNs beyond time-based simulations, the reader may refer to references [24][25][26][27][28].

Conclusions
Based on the most recent proposed future trends, simulated models of a deep CNN and a dynamic NARX ANN have been presented with extremely accurate results, which confirm the scientific merits of deep learning and shallow dynamic ANNs for the emulation of GT power station performance. Some of the paper's findings are below:
• It is generally highly recommended to normalize the data of GTs, rather than dealing with actual quantities, when using ANNs for modeling;
• The BR training algorithm outperforms the other training algorithms because of its later termination criteria, unlike the aforementioned earlier-terminating ones (LM and SCG);
• The prediction capabilities of the NARX ANN and the CNN for the time-based dynamic performance of GTs are satisfactory, with negligible errors for both techniques.
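The first recommendation, preferring normalized data over actual SI quantities, corresponds to the standard min-max and z-score transformations; a minimal sketch (the power samples are illustrative values within the paper's reported load range, not the actual dataset):

```python
import numpy as np

def min_max_normalize(x):
    # Scale to [0, 1]: the data format recommended above for GT modeling
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    # Zero mean, unit standard deviation: the "standardized" data format
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

power_mw = [124.89, 180.00, 241.57]   # hypothetical samples within the load range
norm = min_max_normalize(power_mw)    # endpoints map to 0.0 and 1.0
std = standardize(power_mw)           # centered around zero
```

Keeping all inputs and outputs on comparable scales prevents any single large-magnitude SI quantity from dominating the weight updates during training.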
Based on the aforementioned points, the paper's goals have been generally achieved. Further goals are important to mention based on the observation and investigation of the results:

• There was a slight superiority of the dynamic NARX type in terms of its accuracy. A new conclusion can be suggested: the main computational reason, namely the feedback delay element in NARX, is capable, despite the shallow structure, of providing additional information alongside the other direct inputs and thereby improving the accuracy over the deep CNN, in which there is no feedback delay element;
• Based on the aforementioned results, deep learning can act as an alternative choice for modeling GTs in real applications, but cannot be a substitute for the shallow dynamic ANN, since both have shown successful performances and can be used reliably in real applications;
• Despite the achieved targets of the paper, there are still some deep learning techniques that have not been investigated in the literature; these techniques might have a comparable performance, which motivates mentioning some future research opportunities;
• One of the clearer future trends is to use other deep learning techniques and to compare them appropriately with the developed/published ones. This may include the advanced deep recurrent neural network and locally connected neural networks;
• Another possible future outcome is to include the fuel preparation system, especially for biogas firing of such turbines and the process of gasification/digestion, in order to quantify the amount of material to be converted to biogas and to link this with an enhanced control strategy with new objectives;
• Another feasible future point is designing a supervisory controller for the developed ANN models and applying it to regulate the diffusion and premix modes, together with the objectives of higher efficiency and lower emissions. A comparative study with other modeling philosophies may be useful, such as physics-based models and other black-box and grey-box models, with emphasis on multiple performance criteria rather than the mere numeric value of the accuracies.
Funding: This research received no external funding.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available and are instead provided upon request.

Conflicts of Interest:
The authors declare no conflict of interest.