Machine Learning-Based Dynamic Modeling for Process Engineering Applications: A Guideline for Simulation and Prediction from Perceptron to Deep Learning

: A misusage of machine learning (ML) strategies is usually observed in the process systems engineering literature. This issue is even more evident when dynamic identiﬁcation is performed. The root of this problem is the gradient explode and vanishing issue related to the recurrent neural networks training. However, after the advent of deep learning, these issues were mitigated. Fur-thermore, the problem of data structuration is often overlooked during the machine learning model identiﬁcation in this ﬁeld. In this scenario, this work proposes a guideline for identifying ML models for the different applications in process systems engineering, which are usually for simulation or prediction purposes. While using the proposed guideline, the work also identiﬁes a virtual analyzer for a pressure swing adsorption unit. In these types of adsorption separations, it is usual that the measurement of the main properties is not done online. Therefore, the virtual analyzer is another contribution of this manuscript. The overall results demonstrate that even though the test provides good performance during the ML model identiﬁcation, its quality might degenerate over the application domain if the model application is overlooked.


Introduction
While, from a first perspective, the concepts of prediction and simulation appear to be the same, they are different, and this difference must be emphasized. A simulation is performed to verify the response of a model using input data and initial conditions. The values calculated as a response of the model have a sampling time equal to the sampling time of the input variables. A prediction computes the response of a model at some specified future horizon of time through the projection of current and past values of measured input and output values, as well as initial conditions. Thus, a simulation does not require measurements (the actual states of the system) beyond the initial condition. In contrast, a prediction depends on it, leading simulations to be adequate to situations in which measurements are unavailable, such as control system design, fault detection, and process optimization [1]. On the other hand, in scenarios where there is a measurable quantity, but its measurement has a significant deadtime, prediction is more suitable.
The difference between these concepts reflects the instruments and techniques adequate for one or the other. The data structure is the first and essential step in identifying the models when dealing with data-driven modeling. The predictors are structures used to organize the data following its application. There are many predictor structures [2]. (2020) [15] applied ML tools to model a basic oxygen steelmaking from experimental data. Even in other engineering fields, ML strategies are usually employed in this sense. Coccia et al. (2021) [16] employed ANN models to simulate the cooling demand of a singlefamily house. Pervez et al. (2021) proposes an ANN model to predict wind speed to be applied as a simulation model in a control scheme. Still, these works do not evaluate the suitable predictor, embedding dimensions, and ML model for their applications. As it is possible to see from this literature review, there is overall misleading employment of machine learning models in the field of chemical engineering, and more specifically, in the area of modeling pressure swing adsorption unit. In fact, it is generally challenging to find works in chemical engineering that apply machine learning models and evaluate the issues related to the predictors. As pointed by Dobbelaere et al. (2021) [17]: "The greatest threat in artificial intelligence research today is inappropriate use because most chemical engineers have had limited training in computer science and data analysis.". Therefore, this work addresses part of this issue, providing a comprehensive guideline for ML application in this field.
The dynamics of a PSA unit are very complex, where no steady state is observed. Furthermore, it presents difficulties in obtaining a measurement of its main properties, such as concentrations. These measurements are usually obtained using offline instruments as gas chromatography (GC) techniques. These instruments require a significant amount of time to perform the measurement, during which the desired information is unknown.
In this scenario, this work proposes identifying a soft sensor to perform real-time predictions of the purities and recoveries of a PSA unit. On the other hand, it presents the identification of machine learning (ML) models to simulate the dynamic behavior PSA unit. The machine learning strategies employed here were the traditional feedforward neural network (FNN), a recurrent neural network (RNN), and a deep neural network (DNN) [18]. Therefore, three main contributions of this work are highlighted. A general contribution is the discussion of guidelines for the dynamic identification of ML models in chemical and process engineering. Overall, two specific contributions are the development of a soft sensor for a PSA unit and a simulator of the PSA process.

Methods
This section presents the main methods applied in developing the ideas addressed here.

Study Case
Cyclic adsorption processes promote the separation of complex mixtures while being an energetically efficient and environmentally friendly route. Among the cyclic adsorption processes, the pressure swing adsorption (PSA) unit can be highlighted for its wide range of applications to promote gas-phase separations. These processes leverage the variation of the adsorption capacity of a given adsorbent with the pressure, hence promoting the gas separation by taking advantage of the adsorption phenomena. The unit operates in a series of discrete steps, each with a specific function, as described below. This creates a dynamic behavior where no steady estate is reached.
The PSA unit of this work will be used as a pre-treatment of syngas to purify it. As shown in Figure 1, the PSA unit has five steps: steps I and IV are conducted under varying pressure, whereas the others are performed at constant pressure. It will be fed a gas mixture of CO 2 , CO, and H 2 to capture the CO 2 and separate it from the CO and H 2 , considering the components of the purified syngas. This operation aims to achieve a CO 2 -enriched stream with an H 2 /CO stoichiometric ratio between 2.2 and 2.3, the suitable feed range for the Fischer-Tropsch step. A virtual plant of this unit was used based on its phenomenological model. The virtual plant will generate the synthetic data with which the networks will be trained. The adequate tools are discussed in the next section. The phenomenological model used in this work was proposed by Silva et al. (1999) [19] and validated by Regufe et al. [20], in which the following assumptions were made: 1. Gas-phase follows an ideal behavior; 2. The flow is axially dispersed; 3. There are external mass and heat transfer resistances, represented by the film model; 4. There is an internal mass transfer resistance, represented by the linear driving force (LDF) model; 5. The heat transfer in the solid phase is faster than in the gas phase to the extent that there are no temperature gradients inside the particles; 6. The porosity along the bed is constant; 7. The Ergun equation is valid locally.

Simulation Case: Short-Term Simulation for Online Applications
Simulations are usually necessary for optimization and control ends. For instance, the literature on PSA optimization and control presents works addressing the optimization and control of these units through simulations performed using ML models [21,22]. Furthermore, generally in the chemical engineering literature, it is challenging to find works that properly use ML models for simulation ends. Usually, NARX structures are employed for this end. However, no works were found that address the proper employment of ML models to perform this task. Therefore, it is a pertinent issue to be evaluated. It provides essential guidelines to the literature on simulation using ML models.
The virtual plant will serve as a benchmark simulation to compare the simulations performed in this case. As the phenomenological model requires great computational effort to simulate the unit, a surrogate model might be more appropriate when simulations are necessary to be performed in the short term. A virtual plant of this unit was used based on its phenomenological model. The virtual plant will generate the synthetic data with which the networks will be trained. The adequate tools are discussed in the next section. The phenomenological model used in this work was proposed by Silva et al. (1999) [19] and validated by Regufe et al. [20], in which the following assumptions were made:
The flow is axially dispersed; 3.
There are external mass and heat transfer resistances, represented by the film model; 4.
There is an internal mass transfer resistance, represented by the linear driving force (LDF) model; 5.
The heat transfer in the solid phase is faster than in the gas phase to the extent that there are no temperature gradients inside the particles; 6.
The porosity along the bed is constant; 7.
The Ergun equation is valid locally.

Simulation Case: Short-Term Simulation for Online Applications
Simulations are usually necessary for optimization and control ends. For instance, the literature on PSA optimization and control presents works addressing the optimization and control of these units through simulations performed using ML models [21,22]. Furthermore, generally in the chemical engineering literature, it is challenging to find works that properly use ML models for simulation ends. Usually, NARX structures are employed for this end. However, no works were found that address the proper employment of ML models to perform this task. Therefore, it is a pertinent issue to be evaluated. It provides essential guidelines to the literature on simulation using ML models.
The virtual plant will serve as a benchmark simulation to compare the simulations performed in this case. As the phenomenological model requires great computational effort to simulate the unit, a surrogate model might be more appropriate when simulations are necessary to be performed in the short term.

Prediction Case: Online Sensor
The online sensor proposed in this work will provide information about the system in real-time to help with the problem of measurement dead time associated with a PSA unit. The virtual plant will simulate the offline measurements provided to the network in the prediction case. It will be used to predict the following values of the properties until the subsequent measurement is performed again to provide information during the dead time. Figure 1 illustrates the online sensor: samples of the output will be regularly obtained to measure the concentrations and get the purity and recovery of the PSA unit. Additionally, the input information of the PSA unit, such as cycle times, column pressures, and flow values, will also be made available to the online sensor.
It is expected that the error associated with each prediction of the online sensor rises as time passes. The machine learning models used to represent the nonlinear dynamic system of the PSA will be based on deep neural networks, recurrent neural networks, and feedforward neural networks. Therefore, these three main ML strategies will be evaluated in this prediction scenario, hence, a comprehensive comparison of these models will be conducted.

Predictors
A dynamic system can be represented as a time series, a series of data points ordered in time. This series can be described in a general form by Equation (1): in which u is an input, and v is the white noise, and y is an output. There are many tools to the development of models, one of which is the predictors. Though many predictors exist [23], only a few of them are the ones to which a detailed explanation will be given hereinafter. The relationship between past inputs u and the output y can be given by Equation (2): in which g is a nonlinear function, for example, a neural network; φ(t) is the regression vector, and θ is the function's parameters. The choice of different components of the regression vector determines the predictor structure.

Nonlinear Autoregressive with Exogenous Inputs (NARX)
The components of the regression vector associated with the NARX structure are the past measured inputs and the past measured outputs. As such, through this structure, a predictionŷ of the actual output, y is given by Equation (3).
in which d is the input delay; and n a and n b are the number of past values for the output and input, respectively. Note that the noise is added inside the structure in itself: for the NARX, it is assumed that the error information is filtered through the system's dynamic [1]. Consequently, the error structure of NARX is additive. Equation (3) is a one-step-ahead prediction. However, it often is extended to multiple step-ahead predictions or simulations by using the predictionsŷ instead of measurements y as input to make future predictions recursively. Thus, the error associated with a prediction adds to future predictions, as it is carried over through the additive nature of the error in this structure: it is here that the problem lies. Figure 2 shows the NARX structure used recursively. The prediction is used instead of measurement as input of the function. Note that the number of measured information, output predictions, and the input time delay are parameters to be considered when developing the structure: in Figure 2, there is only one measured input, one output prediction, and an input time delay of one, to simplify the example. In other words, if a simulation is performed with NARX, the structure will be used recursively. Thus, the predictions calculated at one time are used as measurements at the next instant. This means that there is an assumption of knowing the system's state at future times. In the absence of the measurement, the past prediction is assumed to be a good representation of the predictions made. This assumption might not be reasonable in simulations, as there is no previous knowledge of future states.

Nonlinear Output Error (NOE)
The NOE structure does not require an exogenous input as a component of the regression vector: it uses the measured input and the predicted output. With the objective of more easily organizing the equations to be shown, for the NOE, the predicted output will be represented by ( ). As such, Equations (4) and (5) represent a one-step-ahead prediction through NOE.
As can be seen, the prediction structure is noiseless, and the noise is added only to the prediction after the system. Consequently, if this structure is used recursively to perform long-term predictions or simulations, the error is not propagated through each prediction. The easiest way to visualize an NOE structure is by analogy with a system of partial differential equations that represents a rigorous model. As such, the NOE is more adequate than the NARX to perform simulations. Figure 3 exhibits the NOE structure used recursively: it can be seen that the noise is only added after the prediction steps to compose the actual output value. Similar to the NARX structure, the number of measured inputs, output predictions, and the input delay are parameters to be determined. In order to simplify the description shown in the figure, only output prediction and one recursive delay were chosen. In other words, if a simulation is performed with NARX, the structure will be used recursively. Thus, the predictions calculated at one time are used as measurements at the next instant. This means that there is an assumption of knowing the system's state at future times. In the absence of the measurement, the past prediction is assumed to be a good representation of the predictions made. This assumption might not be reasonable in simulations, as there is no previous knowledge of future states.

Nonlinear Output Error (NOE)
The NOE structure does not require an exogenous input as a component of the regression vector: it uses the measured input and the predicted output. With the objective of more easily organizing the equations to be shown, for the NOE, the predicted output will be represented by p(t). As such, Equations (4) and (5) represent a one-step-ahead prediction through NOE.
As can be seen, the prediction structure is noiseless, and the noise is added only to the prediction after the system. Consequently, if this structure is used recursively to perform long-term predictions or simulations, the error is not propagated through each prediction. The easiest way to visualize an NOE structure is by analogy with a system of partial differential equations that represents a rigorous model. As such, the NOE is more adequate than the NARX to perform simulations. Figure 3 exhibits the NOE structure used recursively: it can be seen that the noise is only added after the prediction steps to compose the actual output value. Similar to the NARX structure, the number of measured inputs, output predictions, and the input delay are parameters to be determined. In order to simplify the description shown in the figure, only output prediction and one recursive delay were chosen.
Thus, the NARX structure has the additive noise issue and, as such, must be treated carefully when used for long-term predictions or simulations. However, this care is not usual in the literature. Hence, the necessity to deep discuss this issue.  Thus, the NARX structure has the additive noise issue and, as such, must be treate carefully when used for long-term predictions or simulations. However, this care is n usual in the literature. Hence, the necessity to deep discuss this issue.
Although inadequate for such means, the NARX is often used to simulate as consequence of a more accessible training process, as previously described. Furthermor the NOE approach for dynamic ML modeling was impossible until the last decade due software/hardware limitations. This strategy became available to model time-series on after the advent of deep learning. Therefore, it is still usual the systematic use of NAR strategies coped with ML modeling. While this approach can sometimes lead to goo results, it can also create a significant difference between the predicted and correct valu due to the addictive nature of the error [4].

Analysis, Results, and Discussion
This section provides the analysis, the correspondent result, and the two applicatio cases that illustrate the topics discussed in this work.

Data Acquisition
The quality of the predictions made by the network depends heavily on the quali of the data used to train it, as it is the basis over which the model learns the phenomen in analysis. As such, to produce the data set, the Latin hypercube sampling (LHS), method developed by McKay et al. [24], was used. In LHS, a sample range of each inp variable was divided into a given number of intervals with equal probability, and random value inside each interval was chosen. The input variables were random organized to form input groups to calculate the output values. This method, which is a extension of stratified sampling, ensures that the input variables have all their portio represented and all portions of their distributions. In this article, the LHS designed 10 experiments. Each experiment was applied to the virtual plant. The results for ea experiment were recorded until the PSA reached the cyclic steady state, after 25 cycle Therefore, 25.000 points were generated.
All input variables were considered in the design of experiments, namely: inl temperature, purge flowrate, rinse flow rate, high pressure, low pressure, feed ste duration, rinse step duration, and purge step duration. The training, validation, and test data set was created using the PSA virtual plan The input data generated by the LHS is presented in Figure 4. In the diagonal of this figur it is possible to observe that the samples are uniformly distributed within the samplin Although inadequate for such means, the NARX is often used to simulate as a consequence of a more accessible training process, as previously described. Furthermore, the NOE approach for dynamic ML modeling was impossible until the last decade due to software/hardware limitations. This strategy became available to model time-series only after the advent of deep learning. Therefore, it is still usual the systematic use of NARX strategies coped with ML modeling. While this approach can sometimes lead to good results, it can also create a significant difference between the predicted and correct values due to the addictive nature of the error [4].

Analysis, Results, and Discussion
This section provides the analysis, the correspondent result, and the two applications cases that illustrate the topics discussed in this work.

Data Acquisition
The quality of the predictions made by the network depends heavily on the quality of the data used to train it, as it is the basis over which the model learns the phenomena in analysis. As such, to produce the data set, the Latin hypercube sampling (LHS), a method developed by McKay et al. [24], was used. In LHS, a sample range of each input variable was divided into a given number of intervals with equal probability, and a random value inside each interval was chosen. The input variables were randomly organized to form input groups to calculate the output values. This method, which is an extension of stratified sampling, ensures that the input variables have all their portions represented and all portions of their distributions. In this article, the LHS designed 1000 experiments. Each experiment was applied to the virtual plant. The results for each experiment were recorded until the PSA reached the cyclic steady state, after 25 cycles. Therefore, 25,000 points were generated.
All input variables were considered in the design of experiments, namely: inlet temperature, purge flowrate, rinse flow rate, high pressure, low pressure, feed step duration, rinse step duration, and purge step duration. The limits used was based on Nogueira et al. The training, validation, and test data set was created using the PSA virtual plant. The input data generated by the LHS is presented in Figure 4. In the diagonal of this figure, it is possible to observe that the samples are uniformly distributed within the sampling space, as expected.

Embedding Dimensions Optimal Selection
He and Asada proposed a methodology [26] to identify the optimal number of past values for the output and input variables in nonlinear dynamic predictors. It consists of calculating the Lipschitz coefficient q ( ) through Equation (6), using observed inputoutput data of the system in analysis, to which it is assumed a sensitivity analysis can be done. Through this strategy, it is possible to characterize the relationship of a nonlinear set of input-output relationships for any chaotic/complex dynamic system.
in which is the number of observed points in the input-output data, is the number of input variables to be considered, and j = 1,2, … , N. For each measurement pair, a Lipschitz coefficient is computed, after which each coefficient is then used to calculate the Lipschitz Index ( ) through Equation (7): in which is the number of delays considered in the variables, p is a parameter usually between 0.01 N and 0.02 N, and q(k) ( ) is the k-th most significant Lipschitz coefficient from all q ( ) calculated by Equation (6). The method consists of varying and calculating the correspondent Lipschitz index until its respective value does not differ significantly. Then, the first index to define this region of insensible indexes is the one associated with the desired optimal number of delays.
In this work, this procedure will be used with the NOE and NARX structures to determine the best number of input and output delays to be used in each predictor. The value = 0.01 was chosen following the literature recommendation [18]. Figure 5 presents the results obtained after applying the Lipschitz analysis on the synthetic data obtained from the virtual plant. It is possible to see that for both cases, one delay for the

Embedding Dimensions Optimal Selection
He and Asada proposed a methodology [26] to identify the optimal number of past values for the output and input variables in nonlinear dynamic predictors. It consists of calculating the Lipschitz coefficient q (m) j through Equation (6), using observed input-output data of the system in analysis, to which it is assumed a sensitivity analysis can be done. Through this strategy, it is possible to characterize the relationship of a nonlinear set of input-output relationships for any chaotic/complex dynamic system. q (m) j = y j−1 − y j (δu 1 (t − j)) 2 + . . . . . . +(δu m (t − j)) 2 (6) in which m is the number of observed points in the input-output data, m is the number of input variables to be considered, and j = 1, 2, . . . , N. For each measurement pair, a Lipschitz coefficient is computed, after which each coefficient is then used to calculate the Lipschitz Index q (n) through Equation (7): in which n is the number of delays considered in the variables, p is a parameter usually between 0.01 N and 0.02 N, and q(k) (n) is the k-th most significant Lipschitz coefficient from all q (m) j calculated by Equation (6). The method consists of varying n and calculating the correspondent Lipschitz index until its respective value does not differ significantly. Then, the first index to define this region of insensible indexes is the one associated with the desired optimal number of delays.
In this work, this procedure will be used with the NOE and NARX structures to determine the best number of input and output delays to be used in each predictor. The value p = 0.01 N was chosen following the literature recommendation [18]. Figure 5 Processes 2022, 10, 250 9 of 18 presents the results obtained after applying the Lipschitz analysis on the synthetic data obtained from the virtual plant. It is possible to see that for both cases, one delay for the outputs is enough. On the other hand, four output delays are necessary for the recovery predictor, while five are necessary for the purity predictor.

Hyperparameter Tuning
A first step in identifying an artificial neural network model is the definition of the variables that govern the training process and artificial neural network topology. These variables are known as hyperparameters and may have a significant impact on fitting performance. A set of hyperparameters is not directly related to the final model structure, but to how the model will be identified, such as learning rate, momentum, learning rate decay, the number of epochs, and mini-batch size. On the other hand, there is a set of parameters related to the ANN structure, such as the number of layers and neurons, activation function, and layer type. The hyperparameter space is comprised of a set of both discrete and continuous variables, which makes their selection a complex task.
Therefore, an initial optimization step is necessary to select the hyperparameters efficiently. This is performed over the original learning problem, in a way to select the hyperparameters by monitoring the learning cost function [27]. This is a very timeconsuming procedure that has been studied in the recent literature. A most recent advance in this field is the HYPERBAND method [28]. This strategy formulates hyperparameter  (3) and (4).

Hyperparameter Tuning
A first step in identifying an artificial neural network model is the definition of the variables that govern the training process and artificial neural network topology. These variables are known as hyperparameters and may have a significant impact on fitting performance. A set of hyperparameters is not directly related to the final model structure, but to how the model will be identified, such as learning rate, momentum, learning rate decay, the number of epochs, and mini-batch size. On the other hand, there is a set of parameters related to the ANN structure, such as the number of layers and neurons, activation function, and layer type. The hyperparameter space is comprised of a set of both discrete and continuous variables, which makes their selection a complex task.
Therefore, an initial optimization step is necessary to select the hyperparameters efficiently. This is performed over the original learning problem, in a way to select the hyperparameters by monitoring the learning cost function [27]. This is a very time-consuming procedure that has been studied in the recent literature. A most recent advance in this field is the HYPERBAND method [28]. This strategy formulates hyperparameter optimization as an exploration problem. A predefined resource is allocated to randomly sampled configu-rations within a chosen hyperspace. Hence, it requires as input parameters the hyperspace limits and the maximum amount of resources to be used (epochs).
In the present work, the neural networks, training, and hyperparameters tuning were implemented using TensorFlow 2.0 on the Google Colaboratory environment with Python 3.6. Google Colaboratory is a free serverless Jupyter notebook environment for prototyping machine learning models [29]. This allowed performing the identification using hardware accelerators such as tensor process units (TPUs) which can increase neural network training speed in the order of 10 compared to modern CPUs. For the present case, the hyperspace was defined as depicted in Table 1. For the DNN case, two types of deep learning structures are evaluated the gated recurrent unit (GRU) and the long short-term memory (LSTM).

Hyperparameters
Search Space

RNN FNN DNN
Initial learning rate The results in Table 2 were obtained after applying the HYPERBAND algorithm on the hyperspace presented in Table 1. Interestingly, the algorithm found a mix of GRU and LSTM layers as an optimal structure of DNN for the purity model. On the other hand, only GRU layers are identified as an optimal structure for the recovery model.

Neural Network Training
Once the hyperparameters are defined, the next step is to train the defined structures of ML models. This is the learning step, where the model's parameters are estimated. The cross-validation method proposed by Schenker and Agarwal [30] was used here to avoid issues related to overtraining and overfitting. This strategy consists in dividing the training data into two separate sets; two groups of networks are trained with each data set. After training is achieved, the data set utilized for training one group of networks is used to validate the other by calculating the mean squared errors. In this way, it is possible to maximize the usage of available information without causing overfitting and overtraining issues.
Further precautions are taken to avoid overtraining. The early stop criterion was used during the training process, that is, the training is stopped after an arbitrary number of iterations or when the mean squared error associated with the training started to increase instead of decrease. The ADAM algorithm was used. Table 3 presents the model's validation performance, where it is possible to observe that the DNN presents the best validation performance. The FNN performance presents a reasonable performance, while the RNN presents a poor performance index compared with the others. Thus, it is possible to visualize the models' behavior during the test step. The behavior observed in these graphics reflects the performance index obtained during the ML models identification. Hence, it is possible to see a good adherence of the DNN to the test data, a fair adherence by the FNN, and a poor prediction by the RNN.
Finally, a better analysis of the validation results can be performed based on the parity graphics shown in Figure 9. The parity plots display the ANNs' predictions versus the observed values in the synthetic data ( Figure 9). For the cases of the DNN and the FNN models, it is possible to observe that their parity is randomly distributed around the diagonal line across the whole range of values, thus indicating that the residuals are random. This demonstrates that the model was satisfactorily estimated within that domain. On the other hand, for the RNN models, no randomness is observed in the parity graphics.
This, therefore, evidences the problems of the solely recurrent approach. These are well-known issues in this field, as mentioned in the Introduction. These issues are the reason for the widespread use of FNNs-based on NARX predictors. Furthermore, the good performance of the FNN-NARX approach led to the reluctant employment of DNN strategies in the chemical processes literature [18,21]. On the other hand, these models are usually evaluated only by their test behavior. After that, the model is considered a good tool to be employed in the practical case. However, if they are employed to address a situation where they are not adequately used, their quality will quickly degenerate. This issue will be further discussed in the results of this work. Overall, this is a practical observation of the topics discussed here.
It is important to note that there are significant differences between the computation time to train an NOE model and a NARX model. In the present work, it took approximately 1.    Finally, a better analysis of the validation results can be performed based on the parity graphics shown in Figure 9. The parity plots display the ANNs' predictions versus the observed values in the synthetic data ( Figure 9). For the cases of the DNN and the FNN models, it is possible to observe that their parity is randomly distributed around the  Finally, a better analysis of the validation results can be performed based on the parity graphics shown in Figure 9. The parity plots display the ANNs' predictions versus the observed values in the synthetic data ( Figure 9). For the cases of the DNN and the FNN models, it is possible to observe that their parity is randomly distributed around the diagonal line across the whole range of values, thus indicating that the residuals are random. This demonstrates that the model was satisfactorily estimated within that domain. On the other hand, for the RNN models, no randomness is observed in the parity graphics.  This, therefore, evidences the problems of the solely recurrent approach. These are well-known issues in this field, as mentioned in the Introduction. These issues are the

Prediction Case
In this case, a virtual analyzer to provide real-time PSA purity and recovery predictions is proposed. Following the problem discussed here, three analyzers are developed: a NARX based FNN soft-sensor, and two other NOE-based (RNN and DNN) soft sensors.
The concentration measurement in PSA usually presents a significant deadtime, which can reach several minutes. Hence, the purity and recovery are unknown while no mea-surement is available. Therefore, the virtual analyzer for these units is a suitable solution, providing information regarding the process variables while no measurement is available. As the NARX structure presents the exogenous input, this becomes a gate to introduce the purity and recovery measured values when the measurement is available. Hence, it is expected that the FNN-NARX sensor will produce the best performance in this case. Therefore, the NARX is used to perform a short-term prediction while new information about the system is collected. On the other hand, the RNN and DNN are misemployed here, as we are dealing with a short-prediction scenario. The results compare the measured values provided by the phenomenology model with the values predicted by the structures. Figure 10 presents the results obtained for each structure to perform the predictions of purity and recovery. The simulated measurement deadtime was 10 cycles. As can be seen, the NOE, whose structures are more adequate to simulations, provides a more significant difference between the predicted values and the reference values when compared with the NARX predictor. This shows that the NARX, in this short-term prediction case, is indeed adequate to provide information of the system: the error cumulation is not detrimental in the prediction horizon chosen.

Prediction Case
In this case, a virtual analyzer to provide real-time PSA purity and recovery predictions is proposed. Following the problem discussed here, three analyzers are developed: a NARX based FNN soft-sensor, and two other NOE-based (RNN and DNN) soft sensors.
The concentration measurement in PSA usually presents a significant deadtime, which can reach several minutes. Hence, the purity and recovery are unknown while no measurement is available. Therefore, the virtual analyzer for these units is a suitable solution, providing information regarding the process variables while no measurement is available. As the NARX structure presents the exogenous input, this becomes a gate to introduce the purity and recovery measured values when the measurement is available. Hence, it is expected that the FNN-NARX sensor will produce the best performance in this case. Therefore, the NARX is used to perform a short-term prediction while new information about the system is collected. On the other hand, the RNN and DNN are misemployed here, as we are dealing with a short-prediction scenario. The results compare the measured values provided by the phenomenology model with the values predicted by the structures. Figure 10 presents the results obtained for each structure to perform the predictions of purity and recovery. The simulated measurement deadtime was 10 cycles. As can be seen, the NOE, whose structures are more adequate to simulations, provides a more significant difference between the predicted values and the reference values when compared with the NARX predictor. This shows that the NARX, in this short-term prediction case, is indeed adequate to provide information of the system: the error cumulation is not detrimental in the prediction horizon chosen. Further comparison is provided by Table 4, which presents the performance indexes of each soft sensor developed. The results presented in this table corroborate the discussions made above. Finally, these results provide the PSA soft sensor, which is one contribution of this Further comparison is provided by Table 4, which presents the performance indexes of each soft sensor developed. The results presented in this table corroborate the discussions made above.
Finally, these results provide the PSA soft sensor, which is one contribution of this work. As it is possible to see from the results, the soft sensor can efficiently provide information regarding the system behavior while there are no measurements. Furthermore, as it is possible to see from the virtual plant, the soft sensor can actually provide precise real-time information.

Simulation Case
Process simulation within a long horizon is usually necessary for process control and optimization. For example, in dynamic optimization where it is necessary to evaluate the evolution of the process from a given steady-state to another or in the infinity prediction horizon in model predictive controllers. As discussed here, the NARX structures are not recommended in these situations. However, it is usual to see them being employed for these cases in the literature. Therefore, as another case study, the models identified here were used to perform simulations of the PSA dynamic evolution. Figure 11 portrays the result of the simulations. A total of five different operating conditions were given to the PSA plant, and sufficient time was given until the system reached the corresponding cyclic steady-state.
Processes 2022, 10, x FOR PEER REVIEW 17 of 19 Figure 11. Simulation of the PSA operating from several different operating conditions.
As it is possible to see from the graphics in Figure 11, the FNN-NARX model presented an offset at some conditions. On the other hand, the DNN model efficiently predicted the cyclic steady-state and the system dynamics. A series of 100 operating conditions were given to the model to better evaluate the differences between the ML strategies under a simulation scenario. The graphics are not provided for this last case, as the visualization is not appropriated due to the data density. Table 5 provides the simulation performance indexes. It is possible to see that the DNN is significantly superior Figure 11. Simulation of the PSA operating from several different operating conditions.
As it is possible to see from the graphics in Figure 11, the FNN-NARX model presented an offset at some conditions. On the other hand, the DNN model efficiently predicted the cyclic steady-state and the system dynamics. A series of 100 operating conditions were given to the model to better evaluate the differences between the ML strategies under a simulation scenario. The graphics are not provided for this last case, as the visualization is not appropriated due to the data density. Table 5 provides the simulation performance indexes. It is possible to see that the DNN is significantly superior to the FNN in the simulations scenario. The performance indexes differences are a clear demonstration of how the FNN-NARX performance can degenerate along a simulation.

Conclusions
This work addresses the problem of identifying machine learning models for simulations and prediction purposes. It presents a comprehensive guideline for the ML models identification. On the other hand, it also deals with the online measurement issue found in a pressure swing adsorption unit. Thus, a series of ML models are identified to model the PSA process.
It is important to note that the models employed in the simulation and the prediction scenario are the same; their application is different. It is worth mentioning that the DNN presented the best performance for the test of the ML models during the model identification. Therefore, it was demonstrated how misleading it is to use the test performance solely without considering their application. The model with the best test performance (DNN) had inferior behavior during the prediction. On the other hand, the model with acceptable performance in the test (FNN-NARX) had poor behavior when applied for simulation purposes. Therefore, it was demonstrated the proper evaluation of the kind of application a machine learning model is sought for is critical.
Regarding the online measurement, the results demonstrated that the soft sensor based on FNN-NARX can efficiently predict the states while there is no measured data. On the other hand, a model for simulation purposes based on DNN was provided, showing to be able to efficiently simulate the model dynamic behavior.