Tabular Machine Learning Methods for Predicting Gas Turbine Emissions

Predicting emissions for gas turbines is critical for monitoring the harmful pollutants released into the atmosphere. In this study, we evaluate the performance of machine learning models for predicting gas turbine emissions. We compare an existing predictive emissions model, a first-principles-based Chemical Kinetics model, against two machine learning models we developed based on SAINT and XGBoost, and demonstrate improved predictive performance for nitrogen oxides (NOx) and carbon monoxide (CO) using machine learning techniques. Our analysis utilises a Siemens Energy gas turbine test bed tabular dataset to train and validate the machine learning models. Additionally, we explore the trade-off between incorporating more features to enhance model complexity and the resulting increase in missing values in the dataset.


Introduction
Gas turbines are widely employed in power generation and mechanical drive applications, but their use is associated with the production of harmful emissions, including nitrogen oxides (NOx) and carbon monoxide (CO), which pose environmental and health risks. Regulations have been implemented to limit these emissions and require their monitoring.
To monitor emissions from gas turbines, a Continuous Emissions Monitoring System (CEMS) is commonly employed, which involves sampling gases and analysing their composition to quantify emissions. While CEMS can accurately measure emissions in real-time, it imposes a high cost on the process owner, including requiring daily maintenance to avoid drift. As a result, CEMS may not always be properly maintained, leading to inaccurate or unreliable measurements.
Predictive emissions monitoring system (PEMS) models provide an alternative method of monitoring emissions that is cost-effective and requires minimal maintenance compared to CEMS, while not requiring the large physical space needed for CEMS gas analysis.

[Figure 1: the iterative training cycle of a gradient-boosted model — Evaluation: the model's performance is evaluated on the training data and errors are calculated; Correction: a new decision tree is fit to the errors made by the previous trees, aiming to reduce them; Update Model: the new tree is added to the model, with each tree's weight based on its performance on the training data, repeating until performance stops improving or a predetermined number of trees is reached; Predict: the final model is used to predict on test data.]

A PEMS is trained on historical data using process parameters such as temperatures and pressures, and uses real-time data to generate estimations for emissions.
To develop a PEMS model, it is necessary to validate the model's predictive accuracy using data with associated emissions values [4]. In our experiments, we used test bed tabular data consisting of tests conducted over a wide range of operating conditions to train our models. Gradient-boosted decision trees (GBDTs) such as XGBoost [3] and LightGBM [5] have demonstrated excellent performance in the tabular domain, and are widely regarded as the standard solution for structured data problems.
Previous studies comparing neural networks (NNs) and GBDTs for tabular regression have generally found that GBDTs match or outperform NN-based models, particularly when evaluated on datasets not documented in their original papers [6], although some NN-based methods, such as SAINT [2], are beginning to outperform GBDTs.
We compare the predictive performance of an industry-used Chemical Kinetics PEMS model [1], serving as the baseline, against two machine learning approaches, SAINT and XGBoost, to determine how improvements can be made in emissions prediction for gas turbines.
We observe that, on average, XGBoost outperforms both the original Chemical Kinetics model and the deep learning-based SAINT model for predicting both NOx and CO emissions on test bed data for gas turbines.

Gradient-Boosted Decision Trees
Gradient-boosted decision trees (GBDTs) are popular machine learning algorithms that combine the power of decision trees with the boosting technique, where multiple weak learners are combined in an ensemble to create highly accurate and robust models. GBDTs build decision trees iteratively, correcting errors of the previous trees in each iteration. Gradient boosting is used to combine the predictions of all the decision trees, with each tree's contribution weighted according to its accuracy. The final prediction is made by aggregating the predictions of all the decision trees.
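The iterative residual-correction loop described above can be sketched in plain Python using depth-1 trees (stumps) as the weak learners; the toy dataset and learning rate below are illustrative, not from this study.

```python
import random

random.seed(0)

def fit_stump(X, residuals):
    """Fit a depth-1 regression tree (a stump) to the residuals:
    choose the feature/threshold split minimising squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lmean, rmean)
    _, j, t, lmean, rmean = best
    return lambda row: lmean if row[j] <= t else rmean

def gradient_boost(X, y, n_trees=50, lr=0.5):
    """Repeatedly fit a stump to the current residuals and add it to
    the ensemble, shrunk by the learning rate."""
    base = sum(y) / len(y)
    trees, preds = [], [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        trees.append(stump)
        preds = [p + lr * stump(row) for p, row in zip(preds, X)]
    return lambda row: base + sum(lr * t(row) for t in trees)

# Toy data: the target depends only on the first feature.
X = [[float(i), random.random()] for i in range(10)]
y = [2.0 * row[0] for row in X]
model = gradient_boost(X, y)
```

Production GBDTs differ in the details (second-order gradients, regularisation, histogram-based splits), but this captures the evaluate-correct-update cycle.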
XGBoost, or eXtreme Gradient Boosting [3], is a widely-used implementation of GBDTs for both classification and regression tasks. XGBoost is designed to be fast, scalable, and highly performant, making it well-suited for large-scale machine learning applications.

Figure 2: Multi-head attention from [7], where h is the number of heads and Q, K, and V are the query, key, and value vectors.
One of the key features of XGBoost is its use of regularisation functions to prevent overfitting and improve the generalisation of the model. XGBoost also uses a tree pruning algorithm to remove nodes with low feature importance, reducing the complexity of the model and improving accuracy. XGBoost has been highly successful for tabular data analysis, and deep learning researchers have been striving to surpass its performance.
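Following the XGBoost paper [3], the regularised objective gives a closed-form optimal leaf weight w* = -G/(H + λ), where G and H are the sums of the first- and second-order loss gradients in the leaf, and a split gain penalised by γ, which is what the pruning step uses. A minimal sketch:

```python
def leaf_weight(grads, hess, reg_lambda):
    """Optimal leaf weight from the regularised XGBoost objective:
    w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(grads) / (sum(hess) + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Gain of a candidate split. gamma penalises adding a leaf, so
    splits with gain below zero would be pruned away."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + reg_lambda)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma
```

For squared-error loss, each gradient is simply (prediction - target) and each hessian is 1, so larger λ shrinks leaf weights toward zero — the regularisation effect noted above.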

Attention and Transformers
Transformers, originating from Vaswani et al. [7], are a type of deep learning architecture originally developed for natural language processing tasks that has since been adapted for use in the tabular domain. These models use self-attention to compute the importance of each feature within the context of the entire dataset, enabling them to learn complex, non-linear relationships between features. This contrasts with GBDTs, where features are treated independently and relationships between them are not modelled explicitly. Attention mechanisms are capable of highlighting the features and patterns in the dataset that are most informative for making accurate predictions.
Multi-head self-attention is a type of attention mechanism used in Transformers. A weight is assigned to each input token based on its relevance to the output, allowing selective focus on different parts of the input data.
The attention mechanism is applied multiple times in parallel, with each attention head attending to a different subspace of the input representation, allowing the model to capture different aspects of the input data and learn more complex, non-linear relationships between the inputs. The outputs of the multiple attention heads are then concatenated and passed through a linear layer to produce the final output. This is depicted in Figure 2, where the scaled dot-product attention is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

In Figure 2 and Equation 1, Q, K, and V are the query, key, and value vectors used to compute attention weights between each element of the input sequence, and d_k is the dimension of the key vectors.
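The scaled dot-product attention of Equation 1 can be implemented directly; a minimal plain-Python sketch operating on lists of row vectors rather than a tensor library:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Multi-head attention runs this in parallel on h learned projections of Q, K, and V and concatenates the results.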
SAINT [2], the Self-Attention and Intersample Attention Transformer, is a deep learning model designed to make predictions based on tabular data. SAINT utilises attention to highlight specific features or patterns within the dataset that are most relevant for making accurate predictions, helping models better understand complex relationships within the data and make more accurate predictions.
In their experiments, the authors find that SAINT, on average, outperforms all other methods, including GBDT-based methods, on supervised and semi-supervised regression tasks across a variety of datasets.

Chemical Kinetics
Siemens Energy developed a Chemical Kinetics PEMS model [1] by mapping emissions with 'GENE-AC', a 1D reactor-element computational fluid dynamics code, over their SGT-400 combustor, and converting this to a parametric PEMS model. This is a first principles method that uses factors such as pilot/main fuel split, inlet air temperature, and inlet air pressure to calculate the predicted emissions.

Gas Turbine Emissions Prediction

First Principles
Predictive emissions monitoring systems (PEMS) for gas turbines have been developed since 1973 [8], when an analytical model using thermodynamics was developed to predict NOx emissions. Rudolf et al. [9] developed a mathematical model which takes into account performance deterioration due to engine ageing. They combined different datasets, such as validation measurements and long-term operational data, to provide more meaningful emission trends. Lipperheide et al. [10] also incorporate ageing of the gas turbines into their analytical model, which is capable of accurately predicting NOx emissions for power in the range 60-100%. Siemens Energy developed a Chemical Kinetics model [1] to accurately predict CO and NOx emissions for their SGT-400 gas turbine. They used a 1D reactor model to find the sensitivity of the emissions to the different input parameters as a basis for the PEMS algorithm. Bainier et al. [11] monitored their analytical PEMS over two years and found a continuously good level of accuracy, noting that training is required to fully upkeep the system.

Machine Learning
A number of machine learning (ML) methods have been used to predict emissions for gas turbines and have been found to be more flexible for prediction than first principles methods. Cuccu et al. [12] compared twelve machine learning methods, including linear regression, kernel-based methods, and feed-forward artificial neural networks with different backpropagation methods. They used k-fold cross-validation to select the optimal method-specific parameters and found that improved resilient backpropagation (iRPROP) achieved the best performance, noting that thorough pre-processing is required to produce such results. Kaya et al. [13] compared three decision fusion schemes on a novel gas turbine dataset, highlighting the importance of certain features within the dataset for prediction. Si et al. [14] also used k-fold validation to determine the optimal hyperparameters for their neural-network-based models. Rezazadeh et al. [15] proposed a k-nearest-neighbour algorithm to predict NOx emissions.
Azzam et al. [16] utilised evolutionary artificial neural networks and support vector machines to model NOx emissions from gas turbines, finding that the use of their genetic algorithm results in a high enough accuracy to offset the computational cost compared to the cheaper support vector machines. Kochueva et al. [17] developed a model based on symbolic regression and a genetic algorithm, with a fuzzy classification model to determine "standard" or "extreme" emissions levels to further improve their prediction model. Botros et al. [18,19,20] developed a predictive emissions model based on neural networks with an accuracy of ±10 parts per million.
Guo et al. [21] developed a NOx prediction model based on attention mechanisms, LSTM, and LightGBM. The attention mechanisms were introduced into the LSTM model to deal with the sequence length limitation LSTM faces. They eliminate noise through singular spectrum analysis and then use LightGBM to select the dependent features. This processed data is then used as input to the LSTM, with attention enhancing its ability to learn from historical information: feature attention and temporal attention are added to the LSTM model, allocating different weights to allow different emphases and improve prediction.

Tabular Prediction

Tree-Based
Gradient-boosted decision trees (GBDTs) have emerged as the dominant approach for tabular prediction, with deep learning methods only beginning to outperform them in some cases. Notably, XGBoost [3] often achieves state-of-the-art performance in regression problems. Other GBDTs, such as LightGBM [5] and CatBoost [22], have also shown success in tabular prediction.
Deep learning faces challenges when dealing with tabular data, such as low-quality training data, the lack of spatial correlation between variables, dependency on pre-processing, and the impact of single features [23]. Shwartz-Ziv et al. [6] conclude that deep models were weaker than XGBoost, and that deep models only outperformed XGBoost when used in an ensemble with it. They also highlight the challenges in optimising deep models compared to XGBoost. Grinsztajn et al. [24] find that tree-based models are state-of-the-art on medium-sized data (10,000 samples), especially when taking into account computational cost, due to specific features of tabular data, such as uninformative features, non-rotationally-invariant data, and irregular patterns in the target function. Kadra et al. [25] argue that well-regularised plain MLPs significantly outperform more specialised neural network architectures, even outperforming XGBoost.

Attention and Transformers
Attention- and transformer-based methods have shown promise in recent years for tabular prediction. Ye et al. [26] provide an overview of attention-based approaches for tabular data, highlighting the benefits of attention in tabular models. SAINT [2] introduced intersample attention, which allows rows to attend to each other, alongside the standard self-attention mechanism, leading to improved performance over GBDTs on a number of benchmark tasks. TabNet [27] is an interpretable model that uses sequential attention to select the features to reason from at each step. FT-Transformer [28] is a simple adaptation of the Transformer architecture that has outperformed other deep learning solutions on most tasks; however, GBDTs still outperform it on some tasks. TabTransformer [29] transforms categorical features into robust contextual embeddings using transformer layers, but does not affect continuous variables. Kossen et al. [30] take the entire dataset as input and use self-attention to reason about relationships between datapoints. ExcelFormer [31] alternates between two attention modules to manipulate feature interactions and feature representation updates, and manages to convincingly outperform GBDTs.
Despite the promising results of these attention- and transformer-based methods, deep learning models have generally been weaker than GBDTs on datasets that were not originally used in their respective papers [6]. Proper pre-processing, pre-training [32], and embedding [33] can enable deep learning tabular models to perform significantly better, reducing the gap between deep learning and GBDT models.

Data
The data is test bed data from Siemens SGT-400 gas turbines. This is tabular data consisting of a number of different gas turbines tested over a wide range of operating conditions. In total, there are 37,204 rows of data with 183 features, including process parameters such as temperatures and pressures, and the target emission variables NOx and CO. All data consists of numerical values.

Pre-Processing
From the test bed dataset, two comparison sub-datasets were used: "Full" and "Cropped". The Cropped dataset had a significant number of filters pre-applied to the data by Siemens Energy for the Chemical Kinetics model, while the Full dataset had no filters applied. Standard pre-processing was applied to both sets of data, including removing rows with missing data, removing negative values from the emissions data, and removing liquid fuel data. Features with a significant number of missing rows were also removed: for the Full dataset, any features with more than 18,100 missing values were removed; similarly, for the Cropped dataset, features with more than 3,000 missing values were removed. These thresholds were chosen to exceed the maximum number of missing values found in the emission columns.
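The feature-then-row filtering described above might be sketched as follows; the column data here is hypothetical, and the thresholds of 18,100 and 3,000 would stand in for the Full and Cropped settings respectively.

```python
def filter_features_then_rows(columns, max_missing):
    """Drop any feature (column) with more than `max_missing` missing
    values, then drop rows that still contain a missing value.
    `columns` maps feature name -> list of values, None for missing."""
    kept = {name: vals for name, vals in columns.items()
            if sum(v is None for v in vals) <= max_missing}
    n_rows = len(next(iter(kept.values())))
    complete = [i for i in range(n_rows)
                if all(vals[i] is not None for vals in kept.values())]
    return {name: [vals[i] for i in complete] for name, vals in kept.items()}

# Hypothetical toy columns: "b" is too sparse and is dropped first,
# which saves the rows that were only missing a value in "b".
columns = {"a": [1, 2, None, 4], "b": [1, None, None, None], "c": [1, 2, 3, 4]}
cleaned = filter_features_then_rows(columns, max_missing=1)
```

Dropping sparse features before dropping incomplete rows is what creates the trade-off discussed in this paper: a lower threshold removes more features but retains more rows.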
Table 1 provides an overview of both sub-datasets and the number of rows and features in each. Because the original filters removed proportionally more rows with missing values, the Cropped dataset ends with more rows of data than the Full dataset, at the cost of fewer features. The dataset is collected from 0% to 126% load, and pre-processing reduces this to 24% to 126%; we utilise this full range for our comparisons.
Figure 3 depicts the spread of the data for the target emissions, NOx and CO, for both sub-datasets.CO has many more outliers compared to NOx, with some particularly far from the median.

Models
We compare a transformer-based model, SAINT [2], and a GBDT, XGBoost [3], against an existing PEMS model used by Siemens Energy, a first principles-based Chemical Kinetics model [1].

SAINT
SAINT accepts a sequence of feature embeddings as input and produces contextual representations of the same dimensionality [2]. The features, [f_1, ..., f_n], are the process parameters from sensors within the gas turbine tests, where n is the number of features. Each x_i is one row of data containing a value for each feature, and rows are processed in batches of size b. A [CLS] token with a learned embedding is appended to each data sample. This batch of inputs is passed through an embedding layer, consisting of a linear layer, a ReLU non-linearity, and a second linear layer, before being processed by the SAINT model L times, where L is 3. Only the representations corresponding to the [CLS] token are selected for an MLP to be applied to, and MSE loss is computed on the predictions during training. For our experiments, b is the batch size (32) and n is the number of features (7). L_1 is the first linear layer, with 1 input feature and 100 output features; L_2 is the second linear layer, with 100 input features and 1 output feature. The embedding layer is applied to each feature separately.
Features are projected into a combined dense vector space and passed as tokens into a transformer encoder.A single fully-connected layer with a ReLU activation is used for each continuous feature's embedding.
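As a sketch of this per-feature embedding, assuming the Linear(1 → 100), ReLU, Linear(100 → 1) shape stated above, with made-up random weights standing in for learned ones:

```python
import random

random.seed(0)

def make_feature_embedder(hidden=100):
    """Embedding for one continuous feature:
    Linear(1 -> hidden), ReLU, Linear(hidden -> 1).
    Weights here are random placeholders for learned parameters."""
    w1 = [random.gauss(0, 0.1) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [random.gauss(0, 0.1) for _ in range(hidden)]
    b2 = 0.0
    def embed(x):
        h = [max(0.0, w * x + b) for w, b in zip(w1, b1)]  # Linear + ReLU
        return sum(w * hi for w, hi in zip(w2, h)) + b2     # Linear
    return embed

# One independent embedder per feature; a row of n scalar features
# becomes n embedded tokens for the transformer encoder.
n_features = 7
embedders = [make_feature_embedder() for _ in range(n_features)]
row = [0.5] * n_features
tokens = [emb(x) for emb, x in zip(embedders, row)]
```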
SAINT alternates self-attention and intersample attention mechanisms to enable the model to attend to information over both rows and columns.The self-attention attends to individual features within each data sample, and intersample attention relates each row to other rows in the input, allowing all features from different samples to communicate with each other.
Similar to the original transformer [7], there are L identical layers, each containing one self-attention and one intersample attention transformer block. The self-attention block is identical to the encoder from [7], consisting of a multi-head self-attention layer with 8 heads and two fully-connected feed-forward layers with a GELU non-linearity. A skip connection and layer normalisation are applied to each layer. In the intersample attention block, the self-attention layer is replaced by an intersample attention layer: the embeddings of each feature are concatenated for each row, and attention is computed over samples rather than features, allowing communication between samples. We use SAINT in a fully supervised multivariate regression setting, which was not originally reported on in the paper. The code we based our experiments on can be found at¹. We used the AdamW optimiser with a learning rate of 0.0001.
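The reshape that turns self-attention into intersample attention can be sketched as follows: concatenating each row's feature embeddings makes whole samples the "tokens", so the same attention mechanism then mixes information across rows. The dimensions below are illustrative.

```python
def to_intersample_tokens(batch):
    """batch: b samples x n features x d embedding dims.
    Concatenate each sample's feature embeddings into one vector of
    length n*d, so attention over tokens becomes attention over samples."""
    return [[x for feature in sample for x in feature] for sample in batch]

# Illustrative batch: b=4 samples, n=3 features, d=2 dims per embedding.
b, n, d = 4, 3, 2
batch = [[[float(i + j + k) for k in range(d)] for j in range(n)]
         for i in range(b)]
tokens = to_intersample_tokens(batch)  # 4 vectors of length 6
```

Running attention over these b concatenated vectors (instead of over the n feature tokens within one row) is what lets all features of different samples communicate.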

XGBoost
XGBoost reduces overfitting through regularisation and pruning, uses a distributed gradient boosting algorithm to optimise the model's objective function for scalability and efficiency, and automatically handles missing values.
Decision trees are constructed in a greedy manner as weak learners. At each iteration, XGBoost evaluates the performance of the current ensemble and adds a new tree to the ensemble that minimises the loss function through gradient descent. Each successive tree compensates for the residual errors of the previous trees.

Chemical Kinetics
We compare our work to an updated Chemical Kinetics model, based on [1], using the same sets of test data for comparisons. The predictions of the Chemical Kinetics model are essentially part of the original dataset: the number of features and rows in each sub-dataset, described in Section 4.2, does not affect the raw predictions, but rows are eliminated where feature values are missing.

Metrics and Evaluation
The metrics used to evaluate the models in this work are the mean absolute error (MAE) and root mean squared error (RMSE).
MAE is expressed as follows:

MAE = (1/n) Σ |y_i − ŷ_i|

RMSE is expressed as follows:

RMSE = √((1/n) Σ (y_i − ŷ_i)²)

We used randomised cross-validation to evaluate the performance of the machine learning models, SAINT and XGBoost: the data was randomly sub-sampled 10 times to obtain unbiased estimates of the models' performance on new, unseen data, on which they were re-trained and tested. We report the average and standard deviation of the MAE and RMSE for each sub-dataset, providing insight into the models' consistency and variation in performance. The Chemical Kinetics model is also compared on these test sets to provide a relative benchmark for the performance of the models. Separate models are trained for the CO and NOx emissions targets to achieve specialised models for each target.
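The two metrics and the repeated random re-sampling protocol can be sketched as below; the `fit` callable stands in for training SAINT or XGBoost, and all names are illustrative.

```python
import math
import random

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
                     / len(y_true))

def random_subsample_eval(X, y, fit, n_repeats=10, test_frac=0.2, seed=0):
    """Repeatedly re-split the data at random, re-train via `fit`,
    and collect test-set MAE/RMSE; returns (mean, std) for each metric."""
    rng = random.Random(seed)
    maes, rmses = [], []
    for _ in range(n_repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - test_frac))
        train, test = idx[:cut], idx[cut:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [model(X[i]) for i in test]
        truth = [y[i] for i in test]
        maes.append(mae(truth, preds))
        rmses.append(rmse(truth, preds))
    mean = lambda xs: sum(xs) / len(xs)
    std = lambda xs: math.sqrt(sum((x - mean(xs)) ** 2 for x in xs) / len(xs))
    return (mean(maes), std(maes)), (mean(rmses), std(rmses))
```

Because RMSE squares the errors, it penalises the large CO outliers discussed later far more heavily than MAE does, which is why both metrics are reported.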

Impact of Number of Features
To assess the influence of the number of features, relative to the number of rows of data, on prediction performance, we further split each dataset into subsets containing a decreasing number of features; fewer features leads to fewer rows with missing data, allowing an examination of the effect of removing less important features on the availability of data points for training. Feature removal followed the order of decreasing feature importance according to XGBoost, where importance is calculated based on how often each feature is used to make key decisions across all trees in the ensemble. The order of importance for each feature can be found in Table A.1.
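Building the importance-ordered feature subsets might look like the following; the importance counts and feature names here are hypothetical stand-ins, not values from Table A.1.

```python
def feature_subsets(importance, sizes):
    """Given XGBoost-style importance scores (how often a feature is
    used to split, across all trees), return, for each requested size,
    the subset keeping the most important features."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return {size: ranked[:size] for size in sizes}

# Hypothetical split counts per feature.
importance = {"T_inlet": 120, "P_inlet": 90, "fuel_split": 60, "humidity": 10}
subsets = feature_subsets(importance, [4, 3, 2])
```

Each smaller subset discards the least-used features, and in the real dataset each discarded feature also removes its missing values, freeing additional complete rows for training.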

Results and Discussion
Table 2 reports the average MAE and RMSE obtained from the 10 sub-samples of the dataset with the varying number of features. XGBoost has, on average, the lowest MAE for each emission and number of features, while SAINT has a lower RMSE on average.
All models, especially the Chemical Kinetics model, have significant errors when predicting CO. Further analysis of these results indicated that these large errors were primarily driven by a small number of data points with extremely anomalous MAE values. Figure 6 illustrates these outliers, with the logarithmic scale emphasising the limited number of data points responsible for the higher mean MAE. Despite the presence of outliers, the median MAE values for each model were not excessively high, with the majority of data points exhibiting more accurate predictions for CO. Figure 8 demonstrates that the majority of predictions generated by all models fall within a reasonable range for accurate CO emission prediction for gas turbines. While overall performance may be affected by the presence of outliers, the models do exhibit good predictive capabilities for CO and NOx emissions.
Figures 7 and 8 show the normalised predictions compared to the real values for NOx and CO. For Figure 8, predictions above 1000 ppm were removed from view, as these were extremely anomalous and prevented the main results from being seen clearly. For both emissions, the Chemical Kinetics model has more spread than SAINT and XGBoost. For CO especially, XGBoost predictions are closer to the identity line than SAINT's.
Figure 9 displays the relationship between the MAE values and the number of features in the analysis, highlighting the potential impact of feature removal on prediction performance. Despite having 2,415 more rows of training data, and with the exception of SAINT's CO prediction, the MAE is not significantly affected by the change in the number of features and rows.
In our evaluation, XGBoost provided the best prediction accuracy for both NOx and CO, with both machine learning methods outperforming the original Chemical Kinetics model. Prediction for NOx is significantly more accurate than for CO for all models. This can be attributed to the wider spread of data points and the greater presence of influential outliers in the real CO values, as evident in Figure 3. The abundance of outliers in the CO data made it inherently more challenging to predict accurately. The filters used for the Cropped dataset particularly improved the RMSE of the machine learning models, as they removed some outlier inputs in the dataset.

Conclusion and Future Work
XGBoost remains the best model for tabular prediction on this gas turbine dataset for both NOx and CO, but the attention-based model, SAINT, is catching up in terms of performance. Both machine learning models outperformed the first-principles-based Chemical Kinetics model, indicating that machine learning continues to show a promising future for gas turbine emissions prediction.
Furthermore, to fully utilise the years of operational gas turbine data that are available but unlabelled, a future step to improve gas turbine emissions prediction will be to incorporate self-supervised learning into the training process. Despite XGBoost displaying the best performance here, attention-based methods such as SAINT will be easier to combine with self-supervised learning: a pretext task such as masking can be used to predict masked sections of the operational data and learn representations of the data, which can then be used in a downstream prediction task with SAINT.

Acknowledgements
The work presented here received funding from EPSRC (EP/W522089/1) and Siemens Energy Industrial Turbomachinery Ltd. as part of the iCASE EPSRC PhD studentship "Predictive Emission Monitoring Systems for Gas Turbines".

Figure 3: NOx and CO data spread for Full and Cropped datasets on a logarithmic scale.

Figure 5: Box plots for MAE results for NOx for each model on a logarithmic scale.

Figure 6: Box plots for MAE results for CO for each model on a logarithmic scale.

Figure 7: Normalised real vs. predicted values for NOx for each model within one standard deviation.

Figure 8: Normalised real vs. predicted values for CO for each model within one standard deviation for the Full dataset with all features. Extremely anomalous real and predicted values above 1000 were also removed, removing 14 data points.

Figure 9: MAE compared to number of features for the Full dataset. For training, on average across the 10 sub-datasets, 174 features had 3,808 rows, 130 and 87 features had 5,084 rows, and 45 features had 6,223 rows.

Table 1: Pre-processing steps for the Full and Cropped datasets, showing the number of rows in each dataset. When removing the same features from the Cropped dataset as from the Full dataset, only 2,044 rows remain, so this was not chosen to be used for modelling. Further feature details can be found in Table A.1.

Table 2: Tabular prediction results for each model on the two sets of data and the four numbers of features used. The mean over 10 dataset subsamples is given, with the standard deviation in brackets.