Gas Turbine Exhaust Gas Temperature Prediction Under Variable Operating Loads and IGV Positions Using Tree-Based Ensemble Learning

Aslan, Asiye

doi:10.3390/machines14060630

by

Asiye Aslan

Department of Electricity and Energy, Gönen Vocational School, Bandırma Onyedi Eylül University, 10200 Balıkesir, Türkiye

Machines 2026, 14(6), 630; https://doi.org/10.3390/machines14060630

Submission received: 10 April 2026 / Revised: 28 May 2026 / Accepted: 28 May 2026 / Published: 1 June 2026

(This article belongs to the Section Turbomachinery)

Abstract

Exhaust Gas Temperature (EGT) is a critical parameter in Gas Turbines (GTs) in terms of performance monitoring, fault detection, and operational optimization. In this study, a comprehensive and data-driven modeling approach was developed to predict EGT under variable load conditions and different Inlet Guide Vane (IGV) positions in a 401 MW GT unit located in a Combined Cycle Power Plant (CCPP) with a single-shaft design. A large-scale dataset obtained from a total of 18,334 h of real operating conditions was used in the study. Operational parameters such as Gas Turbine Power Output (GTPO), IGV, Compressor Inlet Temperature (CIT), Fuel Gas Flow (FGF), and Lower Heating Value (LHV), together with environmental parameters such as Atmospheric Pressure (AP) and Relative Humidity (RH), were evaluated simultaneously, and the combined effect of these variables on EGT was investigated. In order to model the nonlinear relationships between EGT and the input variables, six different tree-based ensemble learning methods, namely Bagged Trees, Random Forest, Gradient Boosting, eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost), were applied and compared. The results showed that all models were able to predict EGT with high accuracy. The most successful model was LightGBM, which achieved the best overall prediction performance with a Coefficient of Determination (R²) of 0.9703 and a Root Mean Square Error (RMSE) of 1.5280. The analyses revealed that the most influential parameters affecting EGT were GTPO, CIT, FGF, and IGV, whereas the environmental variables had secondary but still significant effects. The proposed approach provides a reliable and computationally efficient tool for sensor validation, fault detection, and predictive maintenance applications.

Keywords: exhaust gas temperature; gas turbine; inlet guide vanes; machine learning; tree-based ensemble learning

1. Introduction

In line with the goals of combating global climate change and establishing a carbon-neutral society, energy systems worldwide are undergoing a profound transformation process. In this process, increasing the efficiency of energy generation technologies while minimizing their environmental impacts has become a critical priority. Gas Turbines (GTs) are among the systems widely preferred in power generation due to their advantageous characteristics. These systems stand out because of their high power generation capacity, short start-up time, ability to adapt flexibly to variable load conditions, and ease of integration with other power generation technologies such as the Rankine cycle. These advantages make GTs a strategic technology in terms of both flexible and sustainable power generation [1].

One of the critical parameters for the sustainability of GT performance is Exhaust Gas Temperature (EGT). Widely used in condition monitoring, fault diagnosis, and maintenance planning processes, EGT reliably reflects combustion efficiency, turbine performance, and the effects of degradations occurring within the system. Even small variations in EGT can lead to measurable deviations in system performance [2].

In GT, EGT is a holistic performance indicator reflecting the combined influence of numerous parameters [3]. The primary factor governing EGT is the thermochemical processes occurring in the combustion chamber. In this context, parameters such as the fuel/air ratio, combustion efficiency, and flame temperature directly shape the turbine outlet conditions. In particular, an increase in the fuel/air ratio (richer mixture) raises the flame temperature, thereby increasing the Turbine Inlet Temperature (TIT) and consequently the EGT. In contrast, leaner mixtures reduce the combustion temperature, leading to a decrease in EGT [4,5,6]. The pressure level and pressure losses in the combustion chamber also play an important role; while more effective combustion occurs at higher pressure levels, increased pressure losses may reduce cycle efficiency and alter the turbine outlet conditions [5]. In addition, the cooling air used to protect the turbine blades can mix with the main flow and cause a decrease in EGT.

GTs used in power generation are not systems operating only at base load; on the contrary, they are operated at different load levels depending on changes in grid demand. Such turbines can respond rapidly and flexibly to fluctuating energy demands by providing greater reserve capacity to the grid [3]. In GTs, variable-geometry Inlet Guide Vanes (IGVs) located at the compressor inlet are commonly used to control EGT under part-load conditions. These rotatable vanes regulate the mass flow rate of air entering the compressor, thereby enabling EGT control during part-load operation [7]. When the IGV angle changes, the characteristic parameters of the compressor, namely flow capacity (i.e., air flow rate), pressure ratio, and compressor efficiency, also change. Proper IGV adjustment ensures both turbine safety and optimization of performance and emissions [7,8].

Another factor leading to variations in EGT is the environmental and flow conditions [4]. Environmental parameters such as temperature, pressure, and humidity play a decisive role in air density and mass flow rate. Through these variables, the combustion process and turbine performance are affected; therefore, an indirect but significant effect occurs on EGT. An increase in ambient temperature reduces the density of the inlet air, causing a decrease in the mass flow rate of air entering the compressor. Under a constant fuel flow rate, this condition increases the fuel/air ratio and enriches the mixture, thereby potentially increasing the flame temperature and consequently the EGT [9,10]. However, advanced control strategies applied in modern GTs dynamically adjust the parameters affecting the combustion process, particularly the fuel flow rate, in order to maintain EGT within specified limits and prevent thermal damage to turbine components [10,11].
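The density argument above can be made concrete with the ideal gas law. The following is a minimal sketch, assuming dry air (humidity is ignored), a constant volumetric air flow into the compressor, and a constant fuel flow; the function names and the 15 °C reference condition are illustrative choices, not values taken from the plant.

```python
R_DRY_AIR = 287.05  # J/(kg*K), specific gas constant of dry air


def air_density(pressure_mbar: float, temp_c: float) -> float:
    """Dry-air density in kg/m^3 from the ideal gas law, rho = p / (R * T)."""
    pressure_pa = pressure_mbar * 100.0
    temp_k = temp_c + 273.15
    return pressure_pa / (R_DRY_AIR * temp_k)


def relative_fuel_air_ratio(temp_c: float,
                            ref_temp_c: float = 15.0,
                            pressure_mbar: float = 1013.25) -> float:
    """Fuel/air ratio relative to the reference condition.

    Assumes constant fuel flow and constant volumetric air flow, so the air
    mass flow scales with density and the fuel/air ratio scales inversely.
    """
    return air_density(pressure_mbar, ref_temp_c) / air_density(pressure_mbar, temp_c)


if __name__ == "__main__":
    for t in (0.0, 15.0, 35.0):
        print(t, round(air_density(1013.25, t), 4), round(relative_fuel_air_ratio(t), 4))
```

Under these simplifying assumptions, warming the inlet air from 15 °C to 35 °C lowers the density by roughly 6.5% and raises the fuel/air ratio by roughly 7%. In a real unit the control system adjusts the fuel flow, as noted above, so this isolates only the uncontrolled tendency.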

In GTs, the performance of system components and the operating parameters also have a significant effect on EGT. Compressor and turbine efficiencies determine the effectiveness of energy conversion in the GT cycle and thereby influence EGT. A decrease in turbine efficiency reduces the effectiveness of the expansion process, leading to lower work production and higher temperatures at the turbine outlet. Therefore, efficiency losses may cause an increase in EGT through different mechanisms [12,13,14].

GT modeling methods are generally divided into two main approaches: physics-based models and data-driven models. Physics-based models describe the energy and mass conversion processes occurring in GTs through equations based on fundamental principles such as energy and mass conservation and thermodynamic properties. On the other hand, with the acceleration of industrial digitalization, the amount of operational data stored in the Distributed Control Systems (DCSs) of power plants has been steadily increasing. This situation has significantly increased the importance of data-driven models; these models make it possible to understand the fundamental principles of power generation systems and improve equipment reliability and economic performance by extracting patterns directly from the available data. With the successful application of Machine Learning (ML) in various fields, studies on the data-driven modeling of GTs have also accelerated, and these studies have shown that data-driven models offer high accuracy, fast computation, and broad application potential [15].

A review of the literature indicates that physics-based studies have mainly focused on understanding the effects of EGT on cycle performance. Parapa [6], based on simulations performed in the GateCycle environment using the actual design data of the M701 GT, revealed that a 10 °C change in EGT caused a 0.273% change in power output and a 0.047% change in thermal efficiency. Purba and Zhultriza [16] investigated the effects of IGV tracking implementation on the performance of Combined Cycle Power Plants (CCPPs) operating under part-load conditions and showed that the change in IGV position increased turbine EGT, thereby improving the performance of the Heat Recovery Steam Generator (HRSG). This improvement provided higher load generation in the steam turbine (ST), while reducing the total net plant heat rate by 24 kcal/kWh and increasing part-load efficiency by 1.5%.

However, the use of data-driven methods for the prediction and monitoring of EGT has increased significantly in recent years. Hong and Kim [3] used 13 sensor measurements, including gas flow, inlet pressure, and air temperature, to predict EGT in a GT and developed a hybrid model based on Convolutional Neural Network and Recurrent Neural Network for time-series forecasting. Within the Recurrent Neural Network framework, Long Short-Term Memory and Gated Recurrent Unit methods were employed to achieve higher accuracy. The lowest error values (Root Mean Square Error (RMSE): 0.0120, Mean Absolute Error (MAE): 0.0093) were obtained with the Long Short-Term Memory model. Wang et al. [2] proposed a combined method based on fuzzy C-means and support vector machines for EGT monitoring and fault diagnosis in GT. The EGT data were first pre-classified through clustering, and then condition recognition and fault diagnosis were performed using a multiclass support vector machine model. The results showed that the proposed method is an effective approach for the online monitoring and fault diagnosis of EGT. Apostolidis et al. [17] used the generalized additive model approach for EGT prediction in aero engines. Analyses conducted on NASA’s N-CMAPSS datasets showed that the model was able to make highly accurate predictions despite increasing system complexity. The highest performance was achieved on the DS01 dataset (Coefficient of Determination (R²) > 0.998). Liu and Karimi [18] developed surrogate models based on High-Dimensional Model Representation and Artificial Neural Network to predict the part-load and full-load performance of GT. The models captured compressor and turbine characteristics with an average error below 1% and reliably predicted Gas Turbine Power Output (GTPO), pressure ratio, Fuel Gas Flow (FGF), and EGT. Park et al. [19] predicted combustion chamber characteristics in GTs using Artificial Neural Network. 
In the model, turbine EGT and basic design parameters were used as inputs, while operational characteristics such as fuel mass flow rate, TIT, fuel distribution at each nozzle, NOx emissions, combustion chamber pressure, and inlet air temperature were predicted. An average RMSE of 0.02296 was obtained, and the maximum error in NOx emissions was observed during the first 20 s of start-up and shutdown. Liu et al. [20] stated that the Enhanced Scale-Aware Efficient Transformer model effectively captured temporal patterns for EGT prediction, with the model achieving an MAE value of 3.47 °R and showing high agreement with actual operating conditions. Zhou [21] performed EGT prediction using different models, including Backpropagation Neural Network, Support Vector Regression, Partial Least Squares, Grey Model, and Multiple Linear Regression. The four best-performing models were combined using the Particle Swarm Optimization method to construct the final prediction model. It was demonstrated that the combined model provided higher accuracy than the individual models. Stephnie and Osoka [22] used an ML-based Nonlinear AutoRegressive with Exogenous Inputs neural network model to predict the EGT of the GT13E2 turbine. The model was trained using historical EGT and other inputs collected from the turbine, while the network structure was optimized according to the Akaike Information Criterion and Bayesian Information Criterion. The model was able to perform predictions up to 100 steps ahead and, although the accuracy decreased somewhat with increasing prediction horizon, it still produced successful results with MAE = 2.97 and RMSE = 3.97.

Recently, hybrid modeling methods that combine the advantages of physics-based and data-driven approaches have also attracted attention. Brusa et al. [23] predicted EGT by combining a semi-physical turbine model with a neural network-based model and achieved high accuracy with an average RMSE of 14% under both steady-state and transient operating conditions. In addition, Ma et al. [24] combined a Long Short-Term Memory-based Nonlinear AutoRegressive with Exogenous Inputs model with a moving average model for EGT prediction in GT engines and demonstrated that the RMSE and MAE values were reduced by at least 13.23% and 18.47%, respectively, while exhibiting effective performance on dynamic data.

The accurate prediction of EGT in GTs is critically important not only for the thermodynamic efficiency of the system but also for preserving its structural integrity. Recent studies have shown that EGT is not merely an indicator of waste energy; rather, it is a direct output of the efficiency within the combustion chamber. In this regard, prediction studies performed using ML-based approaches make significant contributions to the early detection of sensor faults and turbine health monitoring processes, thereby preventing unplanned shutdowns and reducing operating costs. From the perspective of CCPPs, EGT is a critical parameter that directly affects not only GT performance but also the efficiency of the HRSG in the bottoming cycle. Even small changes in EGT can lead to significant differences in steam generation capacity and total power output.

In GT systems, although parameters such as IGV, Variable Guide Vane (VGV) and FGF are fundamental components of EGT control, mechanical and operational anomalies that cannot be anticipated by the control logic are encountered under real operating conditions. Particularly in CCPPs, the accumulation of pipeline-related contaminants in the filters at the combustor inlet alters combustion pressure and airflow dynamics, leading to deviations in blade path temperatures. This situation disrupts the uniform structure of the EGT distribution and creates heterogeneous thermal loads for which the existing control systems alone are insufficient. In addition, feedback errors occurring in fuel control valve servos cause deviations in valve opening ratios, driving the fuel–air mixture toward unintended proportions. Although such faults transmit normal data to the control system, they actually directly affect combustion efficiency and the EGT profile. The proposed ML-based prediction model functions as a virtual supervisory system that detects these complex and nonlinear deviations in advance and prevents faults from escalating. This additional supervisory layer enhances operational safety while providing an early warning mechanism in cases exceeding the limits of the control logic, thereby making a critical contribution to predictive maintenance processes.
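The supervisory idea described above can be sketched as a residual check between measured and predicted EGT. This is only an illustration under stated assumptions: the alarm rule (residual larger than k times the model RMSE for several consecutive samples), the values of `k` and `min_consecutive`, and the reading of the reported RMSE of 1.5280 as degrees Celsius are all assumptions, not part of the study.

```python
from dataclasses import dataclass


@dataclass
class ResidualAlarm:
    index: int          # sample index at which the alarm is raised
    measured_c: float   # measured EGT
    predicted_c: float  # model-predicted EGT
    residual_c: float   # measured minus predicted


def flag_egt_deviations(measured, predicted, model_rmse=1.5280, k=3.0, min_consecutive=3):
    """Flag samples where |measured - predicted| > k * RMSE persists.

    An alarm is emitted for every sample that belongs to a run of at least
    `min_consecutive` consecutive exceedances (hypothetical rule).
    """
    threshold = k * model_rmse
    alarms = []
    run = 0
    for i, (m, p) in enumerate(zip(measured, predicted)):
        residual = m - p
        if abs(residual) > threshold:
            run += 1
            if run >= min_consecutive:
                alarms.append(ResidualAlarm(i, m, p, residual))
        else:
            run = 0  # deviation did not persist; reset the counter
    return alarms


# Hypothetical readings: a persistent positive drift starts at index 2
measured = [600.0, 601.0, 606.0, 607.0, 608.0, 600.0]
predicted = [600.0] * 6
alarms = flag_egt_deviations(measured, predicted)
```

Requiring persistence before alarming is one simple way to avoid reacting to single noisy samples; a production system would tune both parameters against historical fault records.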

Although the GT literature contains important studies on the prediction of EGT, data-driven modeling approaches that use real operating data to consider the combined effects of variable load operation, IGV position, and environmental parameters remain limited in number and have not been investigated comprehensively. To address this gap, a comprehensive and data-driven modeling approach was developed in this study for the prediction of EGT in GTs. Within the proposed framework, a large-scale dataset consisting of 18,334 operating hours was used, and variable load conditions, IGV position, and environmental parameters were simultaneously included in the model. Operational parameters affecting GT performance, namely GTPO, IGV, Compressor Inlet Temperature (CIT), FGF, and Lower Heating Value (LHV), together with environmental variables (Atmospheric Pressure (AP) and Relative Humidity (RH)), were used as input variables, while EGT was defined as the target variable. Thus, a holistic prediction model representing the dynamic operating characteristics of the GT in a more realistic manner was established.

In addition, different tree-based ensemble learning algorithms for EGT prediction were systematically implemented and compared. In this context, Bagged Trees, Random Forest, Gradient Boosting, eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost) models were tested under the same dataset and the same evaluation criteria. In the literature, studies examining such a broad family of tree-based ensemble learning methods for EGT prediction within the same framework are quite limited. The obtained results showed that LightGBM and XGBoost, in particular, were more successful than the other methods in terms of high accuracy, low error, and stable residual distribution. Thus, it was demonstrated that the proposed approach is not only theoretically robust but also reliable and computationally efficient enough to be used in real-time industrial applications. In addition, the transient regime prediction capability of the best-performing model was examined under Secondary Frequency Control (SFC)-based load ramp-up operation in order to evaluate its robustness against dynamic operational changes.

The main contributions and key highlights of this study can be summarized as follows:

A modeling approach representing real operating conditions was developed. Unlike conventional approaches based only on steady-state or full-load data, the present study used actual operational data obtained under variable load regimes and different IGV positions. Thus, the dynamic behavior of the GT was modeled in a more realistic manner.
A large-scale dataset consisting of 18,334 operating hours was used in the study. This comprehensive data structure enabled the reliable representation of GT behavior under different operating conditions.
The combined effects of the operational and environmental parameters affecting EGT were comprehensively evaluated. GTPO, IGV, CIT, FGF, LHV, AP, and RH were considered together within the same model, thereby revealing the mutual and nonlinear effects of these variables on EGT.
Tree-based ensemble learning algorithms were implemented within a systematic and comparative framework for EGT prediction. Bagged Trees, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost models were evaluated using the same dataset; thus, the model that performed best in EGT prediction was identified in detail.
LightGBM and XGBoost exhibited higher accuracy, lower errors, and more stable residual distributions compared to the other methods. The LightGBM model was identified as the most successful method for EGT prediction, with an R² of 0.9703 and an RMSE of 1.5280.
The dynamic prediction capability of the LightGBM model was additionally validated during SFC-based load ramp-up operation, demonstrating its ability to accurately follow the actual EGT trend under rapidly varying GTPO and IGV conditions.
The proposed approach provides a reliable decision-support tool for performance monitoring, sensor validation, fault detection, and predictive maintenance applications in GT.

The remainder of this paper is organized as follows. Section 2 introduces the GT system investigated. Section 3 describes the ML methods used for EGT prediction, while Section 4 explains the criteria used to evaluate model performance. The analysis results are presented in Section 5. Section 6 discusses the findings, and finally, Section 7 summarizes the main conclusions of the study. The flowchart illustrating the overall methodological framework of the study is presented in Figure 1.

2. System Description

2.1. GT System

A GT is an energy conversion machine that fundamentally operates according to the Brayton cycle principle and mainly consists of an air intake system, compressor, combustion chamber, turbine, and exhaust system [25]. The atmospheric air entering the system is first compressed to a high pressure by the compressor. The compressed air is then mixed with fuel and burned in the combustion chamber, producing hot, pressurized gases. These gases expand through the turbine (isentropically in the ideal cycle); this expansion process transfers mechanical energy to the turbine shaft, thereby enabling electricity generation through the synchronous generator. The exhaust gases are then discharged from the system at a certain pressure and temperature [26]. Figure 2 presents the main components of a GT and the exhaust gas flow.

There are two main configurations of GT power plants. In a GT power plant operating in a Simple Cycle (SC) configuration, the exhaust gases are released into the atmosphere through a stack. These exhaust gases contain a considerable amount of thermal energy. In a CCPP, however, the exhaust gases are directed to an HRSG system, where water is converted into superheated steam. The resulting superheated steam is then directed to an ST to generate electricity, thereby increasing the overall efficiency of the thermodynamic cycle [26]. The main technical specifications of the investigated SGT5-8000H GT are summarized in Table 1. Figure 3 presents the flow diagram of the plant under study.

The variables included in the dataset used for the EGT prediction analyses are defined below:

GTPO: An operating parameter representing the amount of power produced by the GT at a given operating condition and measured in MW.
FGF: An operational variable representing the flow rate of the fuel supplied to the GT and measured in Sm³/h.
CIT: An inlet condition and operational parameter representing the temperature of the air entering the compressor. Its unit is °C.
LHV: The amount of usable energy obtained from fuel combustion without considering the latent heat of condensation of water vapor. It is expressed in kcal/Sm³.
IGV: An operating parameter representing the opening ratio of the inlet guide vanes. The IGV position is defined as a percentage (%).
AP: An environmental variable representing atmospheric pressure. Its unit is mbar.
RH: An environmental parameter representing the ratio of the amount of water vapor in the atmosphere to the maximum amount of moisture that air at the same temperature can contain. It is defined as a percentage (%).
EGT: A performance indicator representing the temperature of the exhaust gases leaving the turbine. Its unit is °C.
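The variable list above maps naturally onto a typed record with seven inputs and one target. The sketch below is one possible representation; the field names and the sample values are hypothetical and only the variables and units come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    gtpo_mw: float       # GTPO, gas turbine power output [MW]
    igv_pct: float       # IGV position [%]
    cit_c: float         # CIT, compressor inlet temperature [°C]
    fgf_sm3h: float      # FGF, fuel gas flow [Sm³/h]
    lhv_kcal_sm3: float  # LHV, lower heating value [kcal/Sm³]
    ap_mbar: float       # AP, atmospheric pressure [mbar]
    rh_pct: float        # RH, relative humidity [%]
    egt_c: float         # EGT, exhaust gas temperature [°C] (target)


# Order of the model inputs; EGT is deliberately excluded
FEATURE_NAMES = ("gtpo_mw", "igv_pct", "cit_c", "fgf_sm3h",
                 "lhv_kcal_sm3", "ap_mbar", "rh_pct")


def split_features_target(point: OperatingPoint):
    """Return (feature_vector, target) for one hourly record."""
    features = [getattr(point, name) for name in FEATURE_NAMES]
    return features, point.egt_c


# Hypothetical record, for illustration only
sample = OperatingPoint(300.0, 80.0, 15.0, 70000.0, 8500.0, 1010.0, 60.0, 600.0)
x, y = split_features_target(sample)
```

Keeping the feature order in a single tuple helps ensure that training and inference use the same column layout.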

2.2. GT Exhaust System

The GT exhaust system represents the final stage of the turbine, where the high-temperature combustion gases are discharged to the atmosphere, and it plays a critical role in terms of energy efficiency, emission control, and system reliability. In SC operation, the exhaust gas typically leaves the turbine at a temperature of 550–680 °C, resulting in a considerable amount of energy loss. In CCPP configurations, however, these gases are directed to an HRSG; the superheated steam generated there is used for additional electricity generation in the ST [27].

The main components of the exhaust system directly affect system performance and operational safety. The diffuser slows the high-velocity gas flow at the turbine outlet and converts kinetic energy into static pressure. In addition, it provides a uniform distribution of the exhaust gas flow and reduces pressure losses through controlled deceleration rather than sudden expansion. The exhaust ducts guide the gas flow and are designed to withstand high temperatures and thermal stresses. Silencers play an important role in reducing acoustic emissions, particularly in industrial applications, while the exhaust stack is the final point at which the gases are safely discharged into the atmosphere and is a critical component for controlling environmental impacts [26,28].

Thermocouples, which are the main elements of the exhaust temperature measurement system, are sensors capable of withstanding high temperatures and providing a rapid response. They are installed at different points along the exhaust duct to determine the temperature distribution. In order to protect the sensors from the effects of high temperature, pressure, and flow, they are generally placed inside protective sleeves (thermowells) [29]. The measured data are collected through the data acquisition system, where the electrical signals are processed and converted into meaningful temperature values. These data are then transmitted to the control unit; the control unit continuously monitors the system and, if the temperature values exceed the specified limits, it issues an alarm or automatically shuts down the turbine (trip) to ensure system safety.

However, existing sensors can only approximately measure the actual temperatures experienced by turbine blades and disks because of the high temperatures and harsh operating conditions. Sensor failures and the fact that measurements are taken downstream of the turbine prevent the temperature distribution from being accurately determined. This situation poses serious risks, particularly in high-performance turbines, due to the reduced safety margin between the limits of hot-section materials and the operating conditions [30]. Therefore, uncertainties and accuracy problems that may occasionally arise in EGT measurement make it necessary to develop and implement EGT prediction models.

Accurate monitoring of EGT also plays a critical role in keeping NOx and CO emissions within the limits specified by regulations. Therefore, measurement systems are required to provide high accuracy and reliability over long periods under harsh environmental conditions. Today, for this purpose, thermocouple sensors, which generally provide accuracy at tolerance class 1 level, are widely preferred for temperature measurements in the gas flow path [31].

Figure 4 shows the temperature values measured at six different measurement points placed around the turbine outlet in the GT exhaust section of the plant investigated. At each measurement point, dual-channel thermocouple sensors are installed in order to improve measurement reliability. This configuration is critically important for monitoring the circumferential symmetry of the exhaust flow and the stability of the combustion process. The fact that the measurement points produce similar results indicates that the GT has a homogeneous combustion characteristic, that no local temperature concentrations occur in the exhaust, and that the thermal load distribution is balanced. Figure 5 presents images of the GT exhaust system.
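The circumferential uniformity check described above can be sketched as follows: average the two channels at each of the six measurement points, then compare each point with the overall mean. The readings are hypothetical, and the spread metric (maximum minus minimum point temperature) is an assumed, commonly used indicator rather than the plant's documented logic.

```python
from statistics import mean


def exhaust_spread(channel_pairs):
    """Summarize dual-channel thermocouple readings around the exhaust.

    channel_pairs: list of (channel_a, channel_b) readings in °C,
    one pair per measurement point.
    Returns (overall_mean, spread, deviations_from_mean).
    """
    point_means = [mean(pair) for pair in channel_pairs]  # redundancy: average both channels
    overall = mean(point_means)
    spread = max(point_means) - min(point_means)
    deviations = [t - overall for t in point_means]
    return overall, spread, deviations


# Hypothetical readings for six measurement points
readings = [(600.0, 602.0), (601.0, 601.0), (599.0, 601.0),
            (602.0, 602.0), (600.0, 600.0), (601.0, 603.0)]
overall, spread, deviations = exhaust_spread(readings)
```

A small spread relative to the mean is consistent with the homogeneous combustion described above, whereas a single point drifting away from the others would point to a local combustion or sensor problem.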

3. Tree-Based Ensemble Models

Ensemble learning, particularly tree-based ensemble methods, is a powerful ML approach widely preferred for solving complex and nonlinear problems due to the advantages of high accuracy, robustness, and generalizability that it provides [32,33]. In this context, tree-based ensemble methods are fundamentally divided into two main categories: Bagging and Boosting. These methods use different strategies to improve model performance.

3.1. Bagging Methods

Bagging is a leading ensemble learning method that trains the same learner (generally decision trees) on random subsets created from the dataset, referred to as “bags,” and produces the final result by combining the predictions of these models [34].

3.1.1. Bagged Trees

Bagged Trees is an effective ensemble learning method used to reduce high variance and increase generalizability in decision tree-based regression models. In this method, $B$ subsets are created from the training dataset by bootstrap sampling with replacement, and an independent regression tree is trained on each subset. By combining the predictions of all resulting trees, a more balanced and reliable model is obtained [35].

In regression problems, the prediction of the Bagged Trees model is calculated by averaging the outputs of each regression tree. In this way, the variance of the individual trees is balanced and the generalization capability of the model is improved [36]. The mathematical formulation of the bagged ensemble regression tree algorithm can be expressed as follows:

$$\hat{y}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_{T_b}(x) \tag{1}$$

Here, $\hat{y}_{\mathrm{bag}}$ represents the predicted output value of the bagged tree ensemble, whereas $\hat{y}_{T_b}$ denotes the final prediction value of each individual regression tree.
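Equation (1) can be illustrated with a small from-scratch sketch. To keep it short and dependency-free, depth-1 regression trees (stumps) stand in for full regression trees; this simplification, the function names, and the default of 25 trees are assumptions for illustration and not the configuration used in the study.

```python
import random
from statistics import mean


def fit_stump(X, y):
    """Fit a depth-1 regression tree by least squares.

    Returns (feature_index, threshold, left_value, right_value).
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            lv, rv = mean(left), mean(right)
            sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lv, rv)
    if best is None:  # no feature has two distinct values: predict the mean
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    j, thr, lv, rv = stump
    return lv if x[j] <= thr else rv


def fit_bagged(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict_bagged(trees, x):
    """Equation (1): average the predictions of the B individual trees."""
    return mean(predict_stump(t, x) for t in trees)


# Toy step function: low target for small x, high target for large x
X = [[float(i)] for i in range(10)]
y = [0.0] * 5 + [10.0] * 5
trees = fit_bagged(X, y)
```

Each bootstrap sample places the step at a slightly different threshold, so the averaged prediction is smoother than any single stump near the step, which is the variance-reduction effect described above.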

3.1.2. Random Forest

In Random Forest, only a randomly selected subset of features, referred to as “mtry,” is used at each decision tree node instead of all features. This additional randomness strengthens the variance reduction effect by reducing the correlation among trees when there is a high correlation between features [37,38]. The performance of Random Forest depends on parameters such as the number of trees (N) and the size of the feature subset (mtry). Appropriate values can be determined using the Out-of-Bag error. In this method, increasing the number of trees generally does not lead to overfitting, although it increases the computational cost [37].

Random Forest can effectively model both linear and nonlinear relationships and can successfully capture complex interactions among variables. In addition, it can operate compatibly with both numerical and categorical data and provides consistent and reliable predictions even in the presence of outliers or missing observations [39,40].
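The extra randomness that Random Forest adds at each node can be sketched as a split search restricted to a random subset of `mtry` features. The helper names and the toy data are hypothetical; a full implementation would apply this at every node of every bootstrap-trained tree.

```python
import random
from statistics import mean


def candidate_features(n_features, mtry, rng):
    """Draw the random feature subset examined at one tree node."""
    return rng.sample(range(n_features), mtry)


def best_split(X, y, features):
    """Least-squares split restricted to the given feature indices.

    Returns (sse, feature_index, threshold), or None if no split is possible.
    """
    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            lv, rv = mean(left), mean(right)
            sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr)
    return best


rng = random.Random(42)
# Toy data: feature 0 determines the target, features 1 and 2 do not
X = [[0.0, 5.0, 1.0], [1.0, 3.0, 9.0], [2.0, 8.0, 4.0], [3.0, 1.0, 7.0]]
y = [10.0, 10.0, 20.0, 20.0]
features = candidate_features(n_features=3, mtry=2, rng=rng)
split = best_split(X, y, features)
```

When the informative feature is not in the drawn subset, the node has to split on a weaker feature. This is what decorrelates the trees, at the cost of slightly weaker individual splits.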

3.2. Boosting Methods

Boosting is an ensemble learning method that improves prediction accuracy by training weak learners sequentially, where each new model focuses on reducing the errors of the previous model. In this approach, decision trees are generally used as weak learners, and model performance is gradually improved at each step [41].

Gradient Boosting, also known as Gradient Boosting Decision Tree, is a powerful ensemble learning method that constructs sequential weak learners based on the negative gradient of the loss function [41]. The main objective of this approach is to obtain the model $\hat{F}(x)$ which approximately represents the function $F^{*}(x)$ mapping the input variables to the target variables on the training dataset $S = \{(x_i, y_i)\}_{i=1}^{N}$ [42].

The model is initially defined by a constant value:

$$F_0(x) = \arg\min_{\beta} \sum_{i=1}^{N} l(y_i, \beta) \tag{2}$$

At each iteration, the pseudo-residual values representing the errors of the current model are calculated by taking the derivative of the loss function with respect to the model.

$$r_{im} = -\left[\frac{\partial l(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \tag{3}$$

Then, a weak learner $h(x; \alpha_m)$ is trained to fit these pseudo-residual values, and its parameters are determined using the least squares method. The model update coefficient $\beta_m$ is calculated so as to minimize the loss function. Finally, the model is updated as follows:

$$F_m(x) = F_{m-1}(x) + \beta_m h(x; \alpha_m) \tag{4}$$

This process is repeated until the specified number of iterations is reached or the model converges.
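The loop in Equations (2) to (4) can be sketched for the squared loss $l = \frac{1}{2}(y - F)^2$, for which the initial constant is the mean of the targets and the pseudo-residuals reduce to $y_i - F_{m-1}(x_i)$. Two simplifications are assumed here: the weak learner is a single-feature stump, and a fixed learning rate replaces the per-iteration coefficient $\beta_m$. This is a common shrinkage variant, not necessarily the setting used in the study.

```python
from statistics import mean


def fit_stump_1d(x, r):
    """Least-squares stump on one feature: (threshold, left_value, right_value)."""
    values = sorted(set(x))
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        lv, rv = mean(left), mean(right)
        sse = sum((ri - lv) ** 2 for ri in left) + sum((ri - rv) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    if best is None:  # fewer than two distinct x values: constant prediction
        m = mean(r)
        return (float("inf"), m, m)
    return best[1:]


def fit_gradient_boosting(x, y, n_rounds=50, learning_rate=0.1):
    f0 = mean(y)  # Eq. (2): the constant minimizing the squared loss is the mean
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Eq. (3): negative gradient of the squared loss is the plain residual
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        thr, lv, rv = fit_stump_1d(x, residuals)  # weak learner fitted by least squares
        stumps.append((thr, lv, rv))
        # Eq. (4) with a fixed step size in place of beta_m (assumption)
        preds = [pi + learning_rate * (lv if xi <= thr else rv)
                 for xi, pi in zip(x, preds)]
    return f0, learning_rate, stumps


def predict(model, xi):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(lv if xi <= thr else rv for thr, lv, rv in stumps)


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_gradient_boosting(x, y)
```

With a learning rate of 0.1, each round removes about 10% of the remaining residual on this toy step function, so the fitted values approach the targets geometrically.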

3.2.1. XGBoost

The XGBoost algorithm is an ensemble learning method based on decision trees and uses the gradient boosting approach to improve its performance. It controls model complexity and reduces overfitting by adding a regularization term to the objective function. In addition, it provides more accurate and faster optimization by using second-order derivatives [42,43,44].

L = \sum_{i = 1}^{n} l (y_{i}, F (x_{i})) + \sum_{m = 1}^{M} Ω (h_{m})

(5)

Here, the term controlling model complexity is

Ω (h) = γ T + \frac{1}{2} λ ∥ w ∥^{2}

(6)

An important feature of XGBoost is that it uses the second-order Taylor expansion in the optimization process:

L \approx \sum_{i = 1}^{n} [g_{i} f (x_{i}) + \frac{1}{2} h_{i} f^{2} (x_{i})] + γ T + \frac{1}{2} λ ∥ w ∥^{2}

(7)

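Here, f denotes the tree added at the current iteration, T is the number of leaves of that tree, w is the vector of leaf weights, and γ and λ are the regularization coefficients. The terms g_{i} and h_{i} are the first and second derivatives of the loss function with respect to the current prediction, respectively; h_{i} should not be confused with the weak learner h_{m} in Equation (5). Owing to this structure, the model can be optimized more accurately and rapidly. However, the large number of hyperparameters may make the tuning of the model more difficult.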
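Minimizing Equation (7) leaf by leaf gives the standard closed-form results of the XGBoost formulation: with G and H denoting the sums of g_{i} and h_{i} over the samples in a leaf, the optimal leaf weight is −G/(H + λ), and a split is scored by the resulting reduction in the objective. The sketch below illustrates these two expressions; the function names and default values are illustrative assumptions and are not the library API.

```python
def leaf_weight(g, h, lam=1.0):
    """Optimal weight of a leaf holding samples with gradients g and Hessians h."""
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Objective reduction obtained by splitting one leaf into two."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma

# Example with squared-error loss 1/2 (y - F)^2: g_i = F_i - y_i and h_i = 1.
y = [2.0, 4.0]
F = [0.0, 0.0]
g = [Fi - yi for Fi, yi in zip(F, y)]
h = [1.0] * len(y)
w_unregularized = leaf_weight(g, h, lam=0.0)  # equals the mean residual
w_regularized = leaf_weight(g, h, lam=1.0)    # shrunk toward zero by lambda
```

The example shows the role of λ directly: without regularization the leaf weight equals the mean residual, and a positive λ shrinks it toward zero.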

3.2.2. LightGBM

LightGBM was developed as a computationally more efficient implementation of the gradient boosting approach. It differs from other implementations mainly in its data processing and tree construction strategies. The Gradient-based One-Side Sampling (GOSS) method accelerates the learning process by retaining the samples with large gradient values and randomly subsampling the remaining ones, while the Exclusive Feature Bundling (EFB) method reduces dimensionality by bundling mutually exclusive sparse features. However, due to its leaf-wise tree growth strategy, it may increase the risk of overfitting in small datasets [42,45,46].
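The sampling idea behind Gradient-based One-Side Sampling can be sketched as follows. This is a simplified illustration of the published scheme, not LightGBM's internal code: the fraction `top_rate` of samples with the largest absolute gradients is kept, a fraction `other_rate` of all samples is drawn at random from the remainder, and the drawn samples are re-weighted by (1 − top_rate)/other_rate so that the gradient statistics remain approximately unbiased. The function name and default rates are assumptions.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample_index: weight} for the samples kept by GOSS."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    amplify = (1.0 - top_rate) / other_rate  # compensates for the subsampling
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights
```

Only the retained samples, with their weights, would then be used when evaluating candidate splits for the next tree.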

3.2.3. CatBoost

CatBoost is a gradient boosting algorithm based on decision trees and was developed particularly to work effectively with categorical data. This method makes gradient calculation more reliable through the ordered boosting approach. The most important feature distinguishing CatBoost from XGBoost and LightGBM is that it constructs balanced and symmetric trees: at each tree level, it selects the feature-split pair that most reduces the loss and applies it to all nodes of that level, which reduces computational cost. In addition, for categorical features, it randomly permutes the training set, calculates the average label value of the preceding samples belonging to the same category, and smooths this average with a weighted prior value. This encoding reduces the noise caused by low-frequency categories and improves the overall performance of the model [42,47].
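The permutation-based encoding of categorical values described above can be sketched as follows. This is a simplified illustration, assuming a single random permutation and the smoothing form (sum of preceding labels + a · prior) / (count + a); the function and parameter names are illustrative and do not reproduce the CatBoost API.

```python
import random

def ordered_target_stats(categories, labels, prior, prior_weight=1.0, seed=0):
    """Encode each categorical value using only the samples that precede it
    in a random permutation, smoothed toward a prior value."""
    n = len(categories)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in perm:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        # The sample's own label is added only after it has been encoded,
        # so it never contributes to its own encoding.
        sums[c] = s + labels[i]
        counts[c] = k + 1
    return encoded
```

A category seen for the first time receives exactly the prior value, which is what dampens the noise from low-frequency categories.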

The main reason for selecting these models is that they enable the high-accuracy prediction of a nonlinear and multivariable output such as EGT in GT. Tree-based ensemble learning methods stand out because of their ability to model complex variable interactions without requiring any prior assumptions, their robustness against noisy and real operational data, and their high generalization performance.

In this context, Random Forest and Bagged Trees, representing the bagging approach, provide more stable and reliable predictions by reducing variance, whereas the boosting-based Gradient Boosting, XGBoost, LightGBM, and CatBoost models achieve higher prediction accuracy by sequentially minimizing errors. In addition, the inclusion of both bagging and boosting approaches within this model set enables the comparison of different learning strategies and allows for a comprehensive methodological evaluation. Therefore, the selected models provide an appropriate framework for the EGT prediction problem, as they are both widely used in the literature and have high performance potential.

3.3. Model Optimization and Validation

3.3.1. k-Fold Cross-Validation

In the k-fold cross-validation method, the dataset is randomly divided into k subsets of approximately equal size. At each step, one of these subsets is selected as the validation/test set, while the remaining k − 1 subsets are used for training the model. The process is repeated a total of k times so that each subset serves once as the test set. Thus, out-of-fold predictions and the corresponding performance metrics are obtained for all observations in the dataset. The average of these metrics represents the cross-validation performance of the model [48,49].
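The splitting scheme described above can be sketched in a few lines. The function name, the seed, and the interleaved assignment of shuffled indices to folds are illustrative choices, not the implementation used in the study.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

Each observation appears in exactly one test fold, so concatenating the test-fold predictions gives one out-of-fold prediction per observation.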

3.3.2. Grid Search

Grid Search is a systematic hyperparameter optimization method widely used in ML to determine the optimum hyperparameter combination for a given model. This approach is based on the systematic evaluation of all possible hyperparameter combinations within predefined parameter ranges. In the first stage, a search grid consisting of different hyperparameter combinations is created; then, the model is trained for each combination, and performance analysis is performed on the validation or cross-validation dataset. The hyperparameter combination that provides the highest performance during the validation process is selected as the optimum hyperparameter set of the model [50].

All ensemble learning models used in the study were optimized using a systematic grid search and 5-fold cross-validation protocol rather than manual selection. For example, for the XGBoost model, critical parameters such as learning_rate, max_depth, and subsample were determined by testing many different combinations within a predefined broad search space. This approach ensured that each compared model was evaluated based on its own best performance capacity, thereby placing the analysis on a fair and unbiased basis. The optimized hyperparameter settings of the ensemble learning models are presented in Table 2.
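The exhaustive search described above can be sketched as follows. The parameter names learning_rate, max_depth, and subsample come from the text; the candidate values and the scoring function `toy_cv_score` are hypothetical placeholders. In practice the scoring function would train the model and return its mean 5-fold cross-validation score for the given parameters.

```python
from itertools import product

def grid_search(param_grid, cv_score):
    """Evaluate every combination in param_grid and return the best one.
    cv_score(params) is expected to return a score to maximize,
    for example the mean R² over the cross-validation folds."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space; the grids actually used are not given in the text.
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [4, 6, 8],
    "subsample": [0.8, 1.0],
}

def toy_cv_score(params):
    # Placeholder standing in for model training plus 5-fold cross-validation.
    return (-abs(params["learning_rate"] - 0.05)
            - abs(params["max_depth"] - 6)
            - abs(params["subsample"] - 0.8))

best_params, best_score = grid_search(param_grid, toy_cv_score)
```

The cost grows multiplicatively with the number of candidate values per parameter (18 combinations here), which is why the search ranges are defined in advance.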

4. Model Evaluation Metrics

In EGT prediction, a detailed examination of system performance is of great importance in order to evaluate the accuracy and reliability of the models. In this regard, various statistical performance criteria were employed to measure the prediction success of the considered models. The criteria used were MAE, Mean Square Error (MSE), RMSE, and R², which enable a quantitative evaluation of both the prediction error of the models and their agreement with the actual data.

The mathematical expressions of these metrics are given in Equations (8)–(11), providing a fundamental framework for the comparative analysis of the performance of the relevant ML methods [51,52,53]. The performance criteria used in this context are expressed as follows:

4.1. MAE

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

(8)

4.2. MSE

MSE = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

(9)

4.3. RMSE

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

(10)

4.4. R²

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(11)

Here, n represents the number of data points, y_{i} represents the actual values, {\hat{y}}_{i} represents the predicted values, and \bar{y} represents the mean of the actual values. When these metrics are evaluated together, a comprehensive assessment can be made regarding both the magnitude of the model error and its explanatory power.
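Equations (8)–(11) translate directly into code. The sketch below is a plain-Python illustration; the function name and the small example values are hypothetical and are not taken from the dataset.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R² as defined in Equations (8)-(11)."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical EGT values in °C, for illustration only.
metrics = regression_metrics([620.0, 630.0, 640.0, 650.0],
                             [621.0, 629.0, 642.0, 648.0])
```

Because MSE and RMSE square the errors, the two 2 °C errors in this example contribute four times as much as the two 1 °C errors, whereas MAE weights them only twice as much.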

5. Results

5.1. Dataset Description

In this study, using real operating data obtained from an industrial-scale GT, the effects of input variables such as GTPO, IGV, CIT, FGF, LHV, AP, and RH on EGT were comprehensively analyzed through tree-based ensemble learning algorithms. In the analyses, Bagged Trees, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost algorithms were implemented, and the prediction performance of each model was compared using statistical indicators. The entire dataset was randomly divided into 80% training and 20% testing subsets. In addition, 5-fold cross-validation was used together with grid search to optimize the hyperparameters and validate model robustness. The analyses were carried out using the Python 3.12.4 programming language. Table 3 presents the descriptive statistics of the variables used in the GT performance analysis. Large-scale real operating data covering a period of 18,334 h were used for each variable.

The investigated GT is operated not only under full-load conditions but also under variable load conditions depending on grid requirements, in accordance with primary frequency control, SFC, and load increase–load shedding instructions. In order to evaluate the performance of EGT prediction under different load levels, a dataset covering both part-load and full-load ranges was used in the analyses. Thus, the analyses for EGT prediction were performed in a manner representing the behavior of the turbine over a wide operational range.

Modern GTs are equipped with multiple sensor architectures and advanced redundancy protocols in order to ensure operational continuity and prevent unplanned shutdowns. In this design approach, critical process variables are monitored simultaneously by two or three independent sensors, and the DCS generates a dynamic final signal, for example the 901 tag, by averaging the data received from these sensors. In a scenario where two independent sensors, 001 and 002, are used, if a hardware failure or signal deviation occurs in one of the sensors, the system immediately isolates the fault and continues to operate uninterruptedly through the remaining healthy sensor in order to maintain measurement quality. In three-sensor configurations, in the case of a first fault, the average of the remaining two healthy sensors is taken as the reference. However, in a two-sensor configuration, if both sensors lose their signals or if the reliable data source is completely lost due to an out-of-limit deviation between them, the turbine protection logic is activated to maintain process safety at the highest level, and the system is automatically transferred to a controlled safe shutdown or directly to a trip. This redundant control algorithm minimizes the negative effects of instrumentation faults on system performance, operational decisions, and equipment health, thereby making a positive contribution to plant availability.
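The redundancy logic described above can be summarized in a simplified sketch. It is an illustration only, not the actual DCS logic: a failed sensor is assumed to be reported as `None` by upstream diagnostics, `max_deviation` is a hypothetical limit, and the status labels are invented for readability. Cases that the text does not describe, such as further faults in a three-sensor configuration, simply fall back to averaging the sensors that remain healthy.

```python
from statistics import mean

def select_signal(readings, max_deviation):
    """Return (final_value, status) from two or three redundant readings.
    A reading of None marks a sensor that was already isolated as faulty."""
    healthy = [r for r in readings if r is not None]
    if not healthy:
        return None, "TRIP"  # no reliable data source is left
    if (len(readings) == 2 and len(healthy) == 2
            and abs(healthy[0] - healthy[1]) > max_deviation):
        return None, "TRIP"  # two sensors disagree beyond the limit
    status = "OK" if len(healthy) == len(readings) else "DEGRADED"
    return mean(healthy), status
```

For example, with a hypothetical 5 °C limit, readings of 630 °C and 632 °C yield an averaged signal of 631 °C, whereas 630 °C and 650 °C lead to the protective action.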

In addition, during the data preprocessing stage, Cook’s Distance and boxplot methods were applied together to identify outliers. In field applications, not only multiple sensors but also the control command signals sent to the equipment and the feedback signals received from the equipment are continuously monitored independently in order to ensure the reliable monitoring of critical process variables. Accordingly, the command and feedback data obtained from the DCS were compared, and observations showing inconsistencies between these two signals were not included in the dataset. Thus, measurement errors, communication-related deviations, and noise were eliminated from the dataset.

As a result, thanks to both the sensor validation mechanisms applied in field operations and the rigorous filtering processes carried out during the data preprocessing stage, the effect of measurement errors on model predictions was reduced. Figure 6 shows the log-scale boxplot analysis used to identify outliers in the dataset. The logarithmic transformation reduces the scale differences among variables, thereby enabling a clearer visualization of the outliers. After eliminating the outliers, the analyses were performed using the dataset consisting of 18,334 operating hours.
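The boxplot criterion used for outlier identification can be sketched as follows. The sketch assumes the conventional whisker rule of 1.5 times the interquartile range applied to untransformed values; the thresholds, the logarithmic scaling, and the Cook’s Distance step used in the study are not reproduced here.

```python
from statistics import quantiles

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whiskers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Hypothetical readings in °C, with one implausible value at the end.
mask = iqr_outlier_mask([630, 631, 632, 633, 634, 635, 636, 700])
```

In this example only the last value is flagged, since it lies far above the upper whisker of 640.5 °C.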

5.2. Analysis Results

The correlation matrix presented in Figure 7 shows the direction and strength of the linear relationships between EGT and the operational and environmental parameters based on Pearson correlation analysis. It is observed that the strongest relationships of EGT are with the turbine operating parameters. In particular, there is a very strong negative correlation of −0.87 between EGT and both GTPO and FGF. This result indicates that EGT tends to decrease as GTPO increases. The negative relationship between FGF and EGT arises because FGF increases together with higher turbine loads and larger IGV openings. Similarly, there is also a strong negative relationship between IGV and EGT at a level of −0.71. The negative relationship between IGV and EGT can be explained by the IGV control strategy applied in the GT. Under part-load conditions, the IGVs are operated in a more closed position in order to minimize the loss in cycle efficiency, thereby maintaining EGT at relatively high levels. As the load increases, the IGV opening is increased and the turbine is operated closer to its design point, with the aim of maintaining EGT around its design values. In contrast, there is a moderate positive correlation of 0.57 between EGT and CIT. This indicates that EGT increases as CIT increases. The relationships between EGT and AP (−0.36) and RH (−0.24) are weakly negative, indicating that AP and RH have a limited effect on EGT. The fact that the correlation between LHV and EGT is only 0.10 reveals that variations in LHV do not significantly affect EGT in the investigated dataset.

In addition to Pearson correlation analysis, Spearman correlation analysis was also performed to evaluate potential monotonic and nonlinear relationships in the dataset. The comparative results of the Pearson and Spearman correlation coefficients are presented in Table 4. Considering the potential of GT operational data to exhibit nonlinear behavior, Spearman correlation was used as an additional validation analysis. The fact that both correlation methods revealed similar trends supports the reliability and consistency of the obtained relationships.
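The difference between the two coefficients can be illustrated with a short sketch: the Spearman coefficient is the Pearson coefficient computed on ranks, so it equals ±1 for any strictly monotonic relationship, even a curved one. The implementation below is a plain-Python illustration with average ranks for ties, and the example data are synthetic.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Synthetic monotonic but nonlinear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [-(v ** 3) for v in x]
r_pearson, r_spearman = pearson(x, y), spearman(x, y)
```

Here the Spearman coefficient is −1 while the Pearson coefficient is about −0.94, which is why similar values from both methods suggest that the monotonic relationships are close to linear.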

The scatter matrix presented in Figure 8 shows that the relationships between EGT and the other input variables involve not only linear but also nonlinear behavior. In particular, for the EGT–GTPO, EGT–FGF, and EGT–IGV pairs, the data points form a clearly downward-sloping and slightly curved pattern. This indicates that the strong negative relationship identified in the correlation analysis is not purely linear but changes across different operating regions. Such a curved structure supports the preference for tree-based ensemble methods rather than linear models in EGT prediction.

There is a strong and pronounced negative relationship between EGT and GTPO. At low GTPO levels (approximately 240–300 MW), EGT values are higher and are mostly concentrated within the range of 645–660 °C. As GTPO increases, EGT continuously decreases and falls to the range of 620–635 °C under high-load conditions (approximately 380–420 MW). This trend is consistent with the part-load operating strategy associated with the IGV control mode used in CCPPs. The narrow and smooth distribution in the graph indicates that GTPO is one of the most dominant variables affecting EGT and that the relationship is very strong.

A pronounced negative relationship is observed between EGT and FGF. When FGF is at low levels, EGT values are higher, whereas EGT gradually decreases as FGF increases. At first glance, it might be expected that EGT would increase with increasing fuel flow; however, in this case, FGF varies together with turbine load, and high FGF values also represent higher GTPO and larger IGV openings. Therefore, the negative relationship observed in the graph is not solely the effect of FGF itself, but rather the result of the operating strategy acting together with load and air flow rate.

There is a pronounced and nonlinear negative relationship between EGT and IGV. When the IGV opening is at low levels, EGT values are generally higher. As the IGV opening increases, EGT gradually decreases. This trend is consistent with the IGV control strategy applied in the GT. Under part-load conditions, the IGVs are kept more closed in order to limit the air flow entering the compressor; thus, a higher EGT is maintained to support cycle efficiency and HRSG performance. As the load increases, the IGVs are opened, the air flow rate increases, and EGT is maintained at a lower and more stable level as the turbine approaches its design operating point.

The distribution between EGT and CIT exhibits a clearly upward-sloping and approximately linear pattern. It is observed that EGT increases as CIT increases; however, the spread becomes slightly wider at high CIT values. This indicates that CIT has a significant effect on EGT, but at higher temperatures, other operating parameters also come into play and increase the variability. An increase in the initial temperature of the air entering the combustion chamber leads to higher turbine inlet and exhaust temperatures at the same fuel/air ratio. Under high CIT conditions, the effects of parameters such as IGV position, FGF, and load level become more pronounced, resulting in a wider distribution of EGT values.

The distributions between EGT and AP and RH exhibit a weaker and more dispersed pattern. A slight negative trend is observed between AP and EGT. This may be related to the fact that an increase in AP improves the compressor inlet conditions and increases the air mass flow rate; however, the magnitude of this effect is limited compared with the other operational variables. In contrast, the relationship between RH and EGT is very weak and scattered. Although changes in RH have an effect on combustion and air properties, the current distribution indicates that this effect remains secondary compared with the other variables. The more scattered and horizontally clustered pattern of the LHV variable indicates that the LHV varied within a narrow range during operation and therefore had a limited effect on EGT.

When the performance metrics presented in Table 5 are examined, it is observed that all regression models considered exhibit high explanatory power, with R² values remaining above 0.96. In terms of prediction consistency and low error, the LightGBM and XGBoost models clearly stand out compared with the other methods. LightGBM represented the dataset with the highest accuracy, achieving the highest R² (0.9703) and the lowest RMSE (1.5280). XGBoost showed performance values very close to those of LightGBM and achieved the lowest MAE, with a value of 1.0318.

The fact that LightGBM ranks first in terms of R² and RMSE but second in terms of MAE reveals an important aspect of the model’s error characteristics: since R² and RMSE are based on squared errors, they are highly sensitive to large deviations. Therefore, the success of LightGBM in these metrics indicates that the model produces stable predictions even at extreme values and avoids large individual errors. In contrast, XGBoost’s slight superiority in terms of MAE indicates that the model performs very well in terms of average absolute error across the dataset, but that it may produce larger deviations than LightGBM in rare cases. Although the Bagged Trees and Random Forest models exhibited slightly higher errors than the boosting-based algorithms, their overall performance remained high, with R² values above 0.96. As a result, when both the control of large errors and prediction consistency are evaluated together, LightGBM stands out as the best-performing model within the scope of this study.

When the performance metrics presented in Table 5 are evaluated together with the computational cost analyses, the results indicate that the LightGBM model not only achieves high predictive accuracy, with an R² value greater than 0.97, but also provides computational speed that is more than sufficient for industrial operations. The total dataset used in the study, consisting of 18,334 h, was divided into 80% training and 20% testing subsets in order to evaluate the model’s performance on previously unseen data. The model training process was completed in 3.8724 s, and the independent test set consisted of approximately 3667 samples.

The most critical operational parameter, prediction speed (inference latency), was measured by predicting all 3667 samples in the test set, which took a total of 0.0794 s. Dividing this total duration by the number of test samples gives an average prediction time of 0.0216 ms per sample.
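The per-sample figure follows from a simple division, which the sketch below reproduces together with a generic way of timing a batch prediction. The helper names are illustrative, and `predict` stands for any callable that accepts a batch of samples; neither is taken from the study’s code.

```python
import time

def latency_ms(total_seconds, n_samples):
    """Average prediction time per sample in milliseconds."""
    return 1000.0 * total_seconds / n_samples

def time_batch(predict, samples):
    """Wall-clock time in seconds needed to predict a whole batch."""
    start = time.perf_counter()
    predict(samples)
    return time.perf_counter() - start

# Reported totals: 0.0794 s for 3667 test samples, i.e. roughly 0.02 ms each.
reported = latency_ms(0.0794, 3667)
```

Timing the whole batch and dividing by the sample count measures throughput-style latency; predicting samples one at a time would typically be slower per sample because of call overhead.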

Considering that the typical sampling and cycle times of industrial control systems (PLC/SCADA) range between 10 ms and 100 ms, this response time validated using the test data is considerably shorter than these time scales. This indicates that the model can be integrated into complex control loops without creating a significant computational bottleneck and offers high scalability for real-time monitoring processes.

The low computational cost enhances commercial applicability by enabling the system to operate on existing industrial infrastructures without requiring costly server investments. In the context of implementation constraints, potential model deviations caused by wear-related changes in turbine components, referred to as model drift, can be effectively managed through periodic updates, namely scheduled re-training, without causing operational interruption, owing to the model’s rapid re-training capability of less than 4 s. Consequently, the proposed architecture offers an optimized solution for industrial digital twin applications by providing a balance of accuracy, speed, and low cost validated using test data.

The fold-based R² scores obtained during the grid search-based 5-fold cross-validation process for the LightGBM model are presented in Figure 9. The R² values obtained across the five folds ranged from 0.9703 to 0.9732, with a mean R² value of 0.9716 and a standard deviation of 0.0012. The narrow variation range among the folds indicates that the selected hyperparameter configuration provides stable and consistent prediction performance across different validation subsets. These results support the reliability of the hyperparameter optimization process and confirm the robustness of the optimized LightGBM model.

Figure 10 compares the actual and predicted EGT values. In the plots, the horizontal axis represents the actual EGT values, while the vertical axis represents the EGT values predicted by the model. The diagonal line indicates the ideal case in which the predicted values are exactly equal to the actual values. The extent to which the data points are closely and homogeneously distributed around this line reflects the prediction performance of the model. In general, for all models, most of the data points are concentrated around the reference line. This indicates that all models provide high accuracy in EGT prediction and that systematic errors are limited.

In the LightGBM model, it is observed that the majority of the data points are concentrated around the ideal prediction line. The fact that the points generally follow the reference line across different EGT ranges indicates that the LightGBM model successfully captures the actual EGT trend. Similarly, the XGBoost model exhibits a dense distribution around the reference line, demonstrating a strong agreement between the actual and predicted values. The close distribution of the observations to the ideal prediction line in these two models reveals that boosting-based methods provide effective performance in EGT prediction.

The CatBoost model also demonstrates a generally successful prediction distribution, with most data points positioned close to the reference line. However, in some EGT ranges, the scatter is observed to be relatively more pronounced compared with the LightGBM and XGBoost models. The Bagged Trees and Random Forest models also generally follow the reference line, and the predicted values are observed to be distributed close to the actual values. Nevertheless, in these models, the data points form a wider band around the reference line in certain regions. The Gradient Boosting model follows the overall trend; however, compared with the other boosting-based models, it exhibits more pronounced scatter in some regions.

The SHAP analysis presented in Figure 11, performed for the best-performing LightGBM model, illustrates the direction and magnitude of the effects of the input variables on the predicted EGT values. GTPO is the most dominant variable, exerting an effect on the predicted EGT ranging from −16.65 °C to +10.70 °C. The fact that the SHAP values are largely concentrated in the negative region indicates that, as load increases, the model tends to decrease the predicted EGT. Conversely, low GTPO values produce positive SHAP values and increase the EGT prediction. CIT is the second most important variable, with its SHAP effect ranging approximately from −9.13 °C to +5.84 °C. High CIT values are mostly concentrated in the positive SHAP region, whereas low CIT values are located in the negative region. This indicates that, as CIT increases, the model tends to increase the predicted EGT value.

The effect of FGF ranges from −3.64 °C to +3.65 °C. The SHAP distribution of FGF indicates that the influence of this variable on EGT prediction can vary in both positive and negative directions depending on operating conditions. The effect of the IGV ranges from −4.50 °C to +3.73 °C, and high IGV values are observed to mostly produce negative SHAP values. In other words, as the IGV opening increases, the model reduces the predicted EGT. Low IGV values, on the other hand, generate a positive SHAP effect. The SHAP distributions of the LHV, AP, and RH variables are concentrated within a narrow band around zero. This indicates that these variables have a limited influence on the model. The LHV effect ranges from −2.33 °C to +1.88 °C. The effect of AP ranges from −3.06 °C to +1.89 °C, while RH has an effect ranging from −0.96 °C to +1.64 °C.
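The SHAP values discussed above are additive: for a single prediction, the contributions of all inputs sum to the difference between that prediction and a reference prediction. The sketch below illustrates this property with an exact Shapley computation that enumerates all feature orderings, which is feasible only for a handful of inputs and is not the tree-specific algorithm used by SHAP libraries. The two-input linear model and its coefficients are hypothetical and are not the fitted LightGBM model.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a single baseline point."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        z = list(baseline)
        prev = f(z)
        for i in order:
            z[i] = x[i]           # switch feature i from baseline to actual value
            cur = f(z)
            phi[i] += cur - prev  # marginal contribution in this ordering
            prev = cur
    return [p / len(orderings) for p in phi]

def toy_egt(z):
    # Hypothetical model: z[0] plays the role of GTPO (MW), z[1] of CIT (°C).
    return 650.0 - 0.2 * (z[0] - 300.0) + 0.5 * (z[1] - 15.0)

x, baseline = [400.0, 25.0], [300.0, 15.0]
phi = shapley_values(toy_egt, x, baseline)
```

In this toy case the higher load contributes −20 °C and the higher inlet temperature +5 °C, and the two contributions sum to the −15 °C difference between the prediction and the baseline.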

Figure 12 shows the residual distributions with respect to the predicted EGT values. In general, the residual values are distributed around the zero-error line for all models, indicating that there is no obvious systematic bias in the predictions. In particular, the LightGBM and XGBoost models exhibit almost equivalent and highly accurate prediction performance, as their residual values are densely concentrated around zero and remain within a narrow band across the predicted EGT range. Both models present the most consistent behavior, with compact residual distributions and data densities positioned close to the zero-error line. Although very small deviation differences are observed at some extreme points, the balanced structure of the overall residual distributions supports that both algorithms achieve similarly high and competitive performance in EGT prediction.

The CatBoost model also demonstrates satisfactory residual behavior; however, in some predicted EGT regions, the residual scatter appears slightly wider compared with LightGBM and XGBoost. Gradient Boosting generally shows a balanced residual distribution; nevertheless, larger positive and negative deviations are observed particularly in the mid-to-high predicted EGT range. The Bagged Trees and Random Forest models also follow the zero-error line at a reasonable level; however, their residual distributions are relatively wider, especially in the mid-to-high predicted EGT range.

Figure 13 compares the actual and predicted EGT values. It is observed that all models largely follow the actual EGT distribution. LightGBM shows the closest agreement with the actual values, while XGBoost also demonstrates a highly competitive prediction performance.

Figure 14 comparatively presents the actual EGT values for 100 samples together with the predictions of the LightGBM model, which achieved the highest prediction performance, and the Random Forest model, which exhibited the lowest performance. The graph shows that the actual EGT values exhibit continuous and dynamic fluctuations along the sample index. Both the LightGBM and Random Forest models follow these sudden increases and decreases in the data with a high degree of stability and successfully capture the time-dependent overall trend of the system. When the two models are compared, it is observed that LightGBM adheres more closely to the actual values throughout the graph and tracks them with greater accuracy. In particular, when the dynamic transition points, sharp decreases and increases, and peak–trough values observed in the graph are examined, LightGBM appears to capture these extreme amplitudes more sensitively. In contrast, Random Forest tends to slightly smooth and dampen the extreme values toward the central trend due to its model structure. Consequently, both models demonstrate highly successful performance in EGT prediction; however, the LightGBM model provides the closest agreement with the actual data and exhibits the most consistent and accurate prediction performance.

The nominal design parameters of the turbine are defined based on an installed power capacity of 401 MW under ambient conditions of 15 °C ambient temperature, 70% RH, and 1009 mbar AP. Therefore, all operating conditions below the nominal capacity are considered off-design operating regimes. Figure 15 shows the actual EGT values and the EGT values predicted by the LightGBM model during the transient load increase process occurring between 16:45 and 17:07, together with the variations in GTPO and IGV position. During this process, in which the load was automatically and continuously increased from 253 MW to 400 MW depending on grid requirements, the instantaneous variations in EGT, GTPO, and IGV values can be simultaneously monitored. These results demonstrate that the performance of the LightGBM prediction model is also validated under transient operating conditions in SFC mode and that the model responds to dynamic operating characteristics with extremely high accuracy. In addition, the presented graph clearly reveals the model’s real-time response capability against operational changes.

6. Discussion

In this study, different tree-based ensemble learning algorithms were compared for EGT prediction in a GT using real operating data. All models optimized through grid search and 5-fold cross-validation demonstrated high prediction accuracy. According to the obtained results, the LightGBM and XGBoost models particularly stood out with their low error values and high R² performances. The LightGBM model achieved the best overall performance by providing the lowest RMSE and MSE values together with the highest R² value. These results indicate that boosting-based ensemble learning methods are highly effective for EGT prediction under variable operating conditions.

The obtained results clearly show that EGT cannot be accurately represented by only a single operating parameter. EGT is determined by the combined and nonlinear interaction of GTPO, FGF, CIT, environmental conditions, and especially IGV position. This finding confirms the fundamental thermodynamic behavior of GT and demonstrates that a multidimensional modeling approach is required for EGT prediction under variable operating conditions.

SHAP analysis indicated that GTPO and CIT are the most influential variables affecting EGT, followed by FGF and IGV position. According to the correlation matrix, both GTPO and FGF exhibit a strong negative relationship with EGT (−0.87). This behavior can be explained by the IGV control strategy applied in CCPPs to prevent excessive deterioration in cycle efficiency under part-load conditions. In IGV control mode, the air flow is restricted at part-loads, thereby maintaining higher combustion temperatures and consequently higher EGT levels. In this way, HRSG performance is preserved and cycle efficiency is supported. As the load increases, the IGVs open further, increasing the compressor inlet air flow, and the system approaches a more efficient operating point while EGT tends to decrease. Therefore, the strong negative correlation between EGT and both GTPO and FGF should be interpreted not as a direct effect of increasing load or fuel flow alone, but rather as a consequence of the coordinated IGV control strategy under part-load operation.

In addition, the very high correlation between FGF and GTPO (0.98) indicates that these two variables vary almost simultaneously and represent a common operating dynamic affecting EGT. However, this relationship is also influenced by the LHV of the fuel. An increase in LHV raises the amount of energy supplied per unit of fuel, thereby reducing the amount of FGF required to maintain the same GTPO level. Conversely, under low-LHV conditions, a higher fuel flow rate is required to achieve the same power output. Therefore, LHV can be considered an important fuel quality parameter that modulates the FGF–GTPO relationship.
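The modulating role of LHV can be illustrated with a simple energy balance, power = efficiency × fuel flow × LHV. The sketch below assumes a constant thermal efficiency and uses hypothetical values for efficiency and LHV, with power in MW, LHV in MJ/kg, and fuel flow in kg/s; none of these numbers are taken from the dataset.

```python
def fuel_flow_kg_s(power_mw, lhv_mj_per_kg, efficiency):
    """Fuel mass flow needed for a given power output (MW = MJ/s)."""
    return power_mw / (efficiency * lhv_mj_per_kg)

# Hypothetical values: 400 MW output at 40% efficiency.
flow_high_lhv = fuel_flow_kg_s(400.0, 50.0, 0.40)
flow_low_lhv = fuel_flow_kg_s(400.0, 48.0, 0.40)
```

Under these assumptions, lowering the LHV from 50 to 48 MJ/kg raises the required fuel flow from 20 kg/s to about 20.8 kg/s for the same power output.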

On the other hand, a moderate positive relationship exists between CIT and EGT (Pearson r = 0.57, Table 4). An increase in CIT reduces inlet air density and increases the compressor discharge temperature, which raises the combustor inlet temperature and, in turn, EGT. As a result, at the same operating load, higher CIT values lead to higher EGT at the turbine outlet. CIT is therefore one of the main environmental variables with a direct and pronounced effect on EGT.
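The density effect of CIT can be estimated with the ideal gas law. The sketch below is a rough illustration only: it treats the inlet air as dry (RH is ignored), uses a specific gas constant of 287.05 J/(kg·K) for dry air, and evaluates the density at the reference AP from Table 1 and at the minimum and maximum CIT from Table 3. The function name is illustrative.

```python
R_DRY_AIR = 287.05  # J/(kg K), specific gas constant of dry air (humidity ignored)


def dry_air_density(pressure_mbar: float, temp_c: float) -> float:
    """Ideal-gas density of dry air in kg/m3."""
    pressure_pa = pressure_mbar * 100.0   # 1 mbar = 100 Pa
    temp_k = temp_c + 273.15
    return pressure_pa / (R_DRY_AIR * temp_k)


ap_ref = 1009.0                    # reference AP from Table 1, mbar
cit_min, cit_max = -2.77, 33.30    # CIT range from Table 3, deg C

rho_cold = dry_air_density(ap_ref, cit_min)
rho_hot = dry_air_density(ap_ref, cit_max)

print(f"Density at CIT = {cit_min} C: {rho_cold:.3f} kg/m3")
print(f"Density at CIT = {cit_max} C: {rho_hot:.3f} kg/m3")
print(f"Relative density drop: {100 * (1 - rho_hot / rho_cold):.1f} %")
```

At a fixed pressure the density scales inversely with absolute temperature, so the warmest inlet condition in the dataset corresponds to noticeably less dense air than the coldest one.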

Another important aspect of this study is that the model can predict EGT with high accuracy not only under full-load conditions, but also under part-load and load-varying operating conditions. Many studies in the literature are based on steady-state conditions, limited datasets, or laboratory-scale data. In contrast, this study uses 18,334 h of real operating data. This enables the developed model to represent industrial applications more realistically. Therefore, the proposed approach has high practical value for real power plant applications.

One of the important findings of this study is that the LightGBM model demonstrates successful prediction performance not only under general operating conditions, but also under transient operating conditions involving load variation. In SFC mode, during the process in which the GT load was automatically and continuously increased from approximately 253 MW to 400 MW, the EGT values predicted by the model were observed to closely follow the actual EGT trend. In the same process, the decrease in EGT despite the increase in GTPO and IGV values indicates that the model can accurately represent the physical operating behavior of the GT. This result demonstrates that the developed model also has reliable prediction capability under dynamic operating conditions.

In previous ML-based studies, EGT prediction and monitoring in GTs have been addressed using different datasets and modeling approaches. Wang et al. [2] focused on EGT-based fault diagnosis using an FCM–SVM approach rather than direct EGT prediction. Apostolidis et al. [17] investigated EGT prediction and interpretability using a synthetic aero-engine dataset, while Zhou [21] used a limited set of aero-engine flight data. Stephnie and Osoka [22] performed multi-step EGT prediction under part-load transient operating conditions using a dynamic NARX ANN model. Compared with these studies, the present study contributes to the literature by using 18,334 h of long-term real industrial CCPP operating data, explicitly incorporating IGV position into the model, testing the model under transient operating regimes, and comparing different tree-based ensemble learning models within the same evaluation framework. LightGBM achieved the best overall performance, with an R² value of 0.9703 and an RMSE value of 1.5280, while SHAP analysis enabled the interpretation of the effects of the input variables on EGT.

However, this study has some limitations. A wide operating range of the GT, from minimum load to maximum load, was evaluated, and the prediction performance of the model was also tested under transient load-varying conditions in SFC mode. The model was therefore shown to provide successful results not only under steady operating conditions, but also under a specific transient operating condition. Other transient operating regimes, such as the start-up process from initial ignition until the minimum stable load is reached, were not included in the scope of this study. Excluding start-up operating conditions from the dataset was a deliberate choice, intended to ensure that the model operates with high accuracy under stable operating regimes. Nevertheless, the effect of this choice on the generalizability of the model should be taken into consideration. In particular, transient conditions in GT systems, such as start-up and load variation, involve a higher degree of nonlinearity arising not only from general sensor dynamics, but also from complex physical processes specific to combustion. During start-up, pressure fluctuations in the combustion chamber, referred to as combustion dynamics, occur due to rapid changes in the fuel–air ratio, incomplete flame stabilization, and turbulence–combustion interactions. These fluctuations are characterized by high-frequency phenomena such as combustion chamber pressure oscillations, acoustic modes, and flame oscillations, and they appear in sensor signals as increased noise and sudden deviations.

In this context, although the direct generalization of the proposed model to start-up conditions is limited, the model demonstrates high accuracy and strong performance over a wide operating range under stable operating regimes, including load increase from minimum to maximum load, load decrease from maximum to minimum load, and load acceptance and load rejection processes occurring in response to grid demand within the scope of SFC.

Therefore, future studies may separately investigate the effects of GT start-up conditions on EGT prediction. In addition, only tree-based ensemble learning algorithms were used in this study. Testing different ML methods, such as deep learning or hybrid models, in future studies may further improve the performance and generalizability of the model. Consequently, this study provides an explainable, high-accuracy, and industrially applicable framework for EGT prediction in GTs under steady-load and variable-load operation, varying IGV positions, and transient load variation in SFC mode.

7. Conclusions

In this study, a comprehensive ML framework was developed and validated for EGT prediction in a GT under variable load and different IGV operating conditions. The developed approach not only contributes to the high-accuracy prediction of EGT, but also enables the monitoring of mechanical and operational anomalies that cannot be directly anticipated by conventional control logic. Based on the obtained findings, the following conclusions were drawn:

Tree-based ensemble learning methods are highly effective in modeling the nonlinear relationship between GT operating parameters and EGT. Among the tested models, LightGBM demonstrated the best overall prediction performance, achieving the highest R² value of 0.9703, together with RMSE and MSE values of 1.5280 and 2.3347, respectively.
After hyperparameter optimization, all models were compared using the same dataset, the same training–testing split, and the same evaluation metrics. This approach enabled a fairer and more reliable performance assessment among the models.
GTPO, CIT, FGF, and IGV position were identified as the dominant parameters determining EGT. The inclusion of IGV position in the model contributes to a more accurate representation of actual GT behavior, particularly under variable-load and part-load operating conditions.
Although the effects of environmental and fuel-related variables such as AP, RH, and LHV are relatively lower, these variables still play a meaningful role in EGT prediction and should not be neglected.
The SHAP-based interpretability analysis confirmed the dominant influence of GTPO, CIT, FGF, and IGV variables on model predictions and showed that the directions of their effects on EGT are consistent with the physical operating behavior of the GT.
Using 18,334 h of real operating data, the proposed model successfully predicted EGT under both full-load and part-load conditions.
The LightGBM model was tested under transient load-varying conditions in SFC mode and closely followed the actual EGT trend during the load increase from approximately 253 MW to 400 MW.
Using the model for sensor validation and fault detection can improve the reliability of turbine operation. In addition, the model can serve as an effective decision-support tool for online monitoring and predictive maintenance.
The applicability of the approach to online and real-time monitoring in modern CCPPs is aligned with the digitalization trend in the energy sector. In this respect, the approach is suitable not only for academic research but also for integration into industrial decision-support systems.

Consequently, the proposed ML-based approach enabled high-accuracy prediction of EGT over a wide operating range from minimum load to maximum load and under different IGV openings. In addition, the successful tracking of the actual EGT trend by the LightGBM model during the SFC-based transient load increase indicates that the model is applicable not only under steady operating conditions, but also under specific dynamic operating conditions. In this respect, the study makes an original contribution to the literature by using long-term real CCPP operating data, incorporating the IGV effect into the model, comparing six different tree-based ensemble learning algorithms, performing SHAP-based interpretability analysis, and demonstrating suitability for industrial decision-support applications.
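As an illustration of the sensor-validation use mentioned in the conclusions, one simple option is to flag samples whose residual (measured EGT minus predicted EGT) exceeds a multiple of the model RMSE. The sketch below is hypothetical and is not the procedure used in the study: the multiplier k = 3, the function name, and the sample values are assumptions; only the RMSE of 1.5280 is taken from Table 5.

```python
RMSE_LIGHTGBM = 1.5280  # LightGBM test RMSE reported in Table 5


def flag_residuals(measured, predicted, k=3.0, rmse=RMSE_LIGHTGBM):
    """Return True for each sample whose absolute residual exceeds k * rmse.

    k is a hypothetical tuning parameter, not a value from the study.
    """
    threshold = k * rmse
    return [abs(m - p) > threshold for m, p in zip(measured, predicted)]


# Hypothetical EGT readings and model predictions (same unit as the RMSE)
measured = [620.0, 621.0, 630.0]
predicted = [619.5, 622.0, 621.0]

flags = flag_residuals(measured, predicted)
print(flags)  # the third sample deviates by 9.0, above the 3 * RMSE threshold
```

In practice the threshold would need to be tuned against known sensor faults and normal transient behavior, which this sketch does not attempt.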

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to commercial confidentiality.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

AP	Atmospheric Pressure
CatBoost	Categorical Boosting
CCPP	Combined Cycle Power Plant
CIT	Compressor Inlet Temperature
DCS	Distributed Control System
EGT	Exhaust Gas Temperature
FGF	Fuel Gas Flow
GT	Gas Turbine
GTPO	Gas Turbine Power Output
HRSG	Heat Recovery Steam Generator
IGV	Inlet Guide Vanes
LHV	Lower Heating Value
LightGBM	Light Gradient Boosting Machine
MAE	Mean Absolute Error
ML	Machine Learning
MSE	Mean Square Error
R²	Coefficient of Determination
RH	Relative Humidity
RMSE	Root Mean Square Error
SC	Simple Cycle
SFC	Secondary Frequency Control
ST	Steam Turbine
TIT	Turbine Inlet Temperature
VGV	Variable Guide Vane
XGBoost	eXtreme Gradient Boosting

References

1. Ali, A.; Houda, M.; Waqar, A.; Khan, M.B.; Deifalla, A.; Benjeddou, O. A review on application of hydrogen in gas turbines with intercooler adjustments. Results Eng. 2024, 22, 101979.
2. Wang, Z.T.; Zhao, N.B.; Wang, W.Y.; Tang, R.; Li, S.Y. A Fault Diagnosis Approach for Gas Turbine Exhaust Gas Temperature Based on Fuzzy C-Means Clustering and Support Vector Machine. Math. Probl. Eng. 2015, 2015, 240267.
3. Hong, C.W.; Kim, J. Exhaust temperature prediction for gas turbine performance estimation by using deep learning. J. Electr. Eng. Technol. 2023, 18, 3117–3125.
4. Liu, J.; Long, Z.; Bai, M.; Zhu, L.; Yu, D. A comparative study on fault detection methods for gas turbine combustion systems. Energies 2021, 14, 389.
5. Tenango-Pirín, O.; Espinosa, S.; García, J.C.; Rodríguez, J.A. Analysis of gas turbine combustor exhaust emissions: Effects of transient inlet air pressure. Energy Technol. 2024, 12, 2301093.
6. Parapa, H.B.P. The Impact of Changes in Exhaust Temperature on the Power Output and Heat Rate of a Gas Turbine with a Capacity of 238 MW. INTEK J. Penelit. 2021, 8, 96.
7. Motamed, M.A.; Genrup, M.; Nord, L.O. Part-load thermal efficiency enhancement in gas turbine combined cycles by exhaust gas recirculation. Appl. Therm. Eng. 2024, 244, 122716.
8. Lee, J.H.; Kang, D.W.; Jeong, J.H.; Kim, T.S. Quantification of variations in the compressor characteristics of power generation gas turbines at partial loads using actual operation data. J. Mech. Sci. Technol. 2023, 37, 1509–1521.
9. Talah, D.; Bentarzi, H. Ambient temperature effect on the performance of gas turbine in the combined cycle power plant. Alger. J. Environ. Sci. Technol. 2023, 9, 3079–3085.
10. González-Díaz, A.; Alcaráz-Calderón, A.M.; González-Díaz, M.O.; Méndez-Aranda, Á.; Lucquiaud, M.; González-Santaló, J.M. Effect of the ambient conditions on gas turbine combined cycle power plants with post-combustion CO₂ capture. Energy 2017, 134, 221–233.
11. Farahani, A.S.; Kohandel, H.; Moradtabrizi, H.; Khosravi, S.; Mohammadi, E.; Ramesh, A. Power generation gas turbine performance enhancement in hot ambient temperature conditions through axial compressor design optimization. Appl. Therm. Eng. 2024, 236, 121733.
12. Liu, J.; Zhu, L.; Ma, Y.; Liu, J.; Zhou, W.; Yu, D. Anomaly detection of hot components in gas turbine based on frequent pattern extraction. Sci. China Technol. Sci. 2018, 61, 567–586.
13. Talebi, S.S.; Tousi, A.M. The effects of compressor blade roughness on the steady state performance of micro-turbines. Appl. Therm. Eng. 2017, 115, 517–527.
14. Long, Z.; Zhou, Z.; Suo, P.; Yao, P.; Bai, M.; Liu, J.; Yu, D. Gas turbine circumferential temperature distribution model for the combustion system fault detection. Eng. Fail. Anal. 2024, 158, 108032.
15. Kong, J.; Yu, W.; Chen, J.; Zhang, H. A Novel Power Prediction Model Based on the Clustering Modification Method for a Heavy-Duty Gas Turbine. Appl. Sci. 2025, 15, 432.
16. Purba, O.; Zhultriza, F. Inlet Guide Vane Tracking Effectiveness at Various Compressor Efficiency of Gas Turbine. In IOP Conference Series: Materials Science and Engineering; IOP Publishing: Bristol, UK, 2021; Volume 1096, p. 012088.
17. Apostolidis, A.; Bouriquet, N.; Stamoulis, K.P. AI-based exhaust gas temperature prediction for trustworthy safety-critical applications. Aerospace 2022, 9, 722.
18. Liu, Z.; Karimi, I.A. Gas turbine performance prediction via machine learning. Energy 2020, 192, 116627.
19. Park, Y.; Choi, M.; Kim, K.; Li, X.; Jung, C.; Na, S.; Choi, G. Prediction of operating characteristics for industrial gas turbine combustor using an optimized artificial neural network. Energy 2020, 213, 118769.
20. Liu, S.; Zhou, N.; Song, C.; Chen, G.; Wu, Y. Exhaust gas temperature prediction of aero-engine via enhanced scale-aware efficient transformer. Aerospace 2024, 11, 138.
21. Zhou, W. Aero-engine exhaust gas temperature prediction based on adaptive disturbance quantum-behaved particle swarm optimization. Adv. Mech. Eng. 2022, 14, 16878132221119044.
22. Stephnie, A.C.; Osoka, D.E.E. Development of a Dynamic Neural Network Model for Multistep ahead Prediction of Exhaust Gas Temperature in Heavy Duty Gas Turbines. Saudi J. Eng. Technol. 2022, 7, 53–61.
23. Brusa, A.; Grossi, A.; Lenzi, M.; Shethia, F.P.; Cavina, N.; Kitsopanidis, I. Modeling of Exhaust Gas Temperature at the Turbine Outlet Using Neural Networks and a Physical Expansion Model. Energies 2025, 18, 1721.
24. Ma, S.; Wu, Y.; Zheng, H.; Gou, L. A hybrid of NARX and moving average structures for exhaust gas temperature prediction of gas turbine engines. Aerospace 2023, 10, 496.
25. Du, J.; Zhang, Y.; García, M.M.; Spencer, A. A Novel Dynamic Surge Modeling Framework for Gas Turbines: Integration of Compressor Variable Geometry. Machines 2026, 14, 327.
26. Faqihi, B.; Ghaith, F. A comprehensive review and evaluation of heat recovery methods from gas turbine exhaust systems. Int. J. Thermofluids 2023, 18, 100347.
27. Büyükköse, A.O.; Aslan, A.; Çoban, K. Thermal Efficiency Impacts of Structural and Environmental Variables in Combined Cycle Plants: A Machine Learning Approach to Relocation Scenario. Case Stud. Therm. Eng. 2025, 75, 107123.
28. Zhu, R.; Ren, J.X.; Li, F.Q.; Zhang, H.D.; Tang, Y. Thermal and stress field analysis in heavy-duty gas turbine exhaust system. Adv. Mater. Res. 2012, 516, 688–691.
29. Pakmehr, M.; Costa, J.; Lu, G.; Behbahani, A. Optical exhaust gas temperature (EGT) sensor and instrumentation for gas turbine engines. In Proceedings of the NATO STO Meeting: Transitioning Gas Turbine Instrumentation from Test Cells to On-Vehicle Applications STO-M P-AVT-306At, Athens, Greece, 3–7 December 2018.
30. Von Moll, A.; Behbahani, A.R.; Fralick, G.C.; Wrbanek, J.D.; Hunter, G.W. A review of exhaust gas temperature sensing techniques for modern turbine engine controls. In Proceedings of the 50th AIAA/ASME/SAE/ASEE Joint Propulsion Conference, Cleveland, OH, USA, 28–30 July 2014; p. 3977.
31. Dutz, F.J.; Boje, S.; Orth, U.; Koch, A.W.; Roths, J. High-temperature profile monitoring in gas turbine exhaust-gas diffusors with six-point fiber-optic sensor array. Int. J. Turbomach. Propuls. Power 2020, 5, 25.
32. Sepiolo, D.; Ligęza, A. Towards explainability of tree-based ensemble models. A critical overview. In International Conference on Dependability and Complex Systems; Springer International Publishing: Cham, Switzerland, 2022; pp. 287–296.
33. Zhu, M.; Li, Z.; Zhao, J.; Liu, X.; Liu, Y. TERM: Tree Ensemble Models for Interpretable Rule Mining. In International Conference on Web Information Systems Engineering; Springer: Singapore, 2024; pp. 367–382.
34. Ngo, G.; Beard, R.; Chandra, R. Evolutionary bagging for ensemble learning. Neurocomputing 2022, 510, 1–14.
35. Li, Y.; Li, L.; Fang, Y.; Peng, H.; Ling, N. Bagged tree and ResNet-based joint end-to-end fast CTU partition decision algorithm for video intra coding. Electronics 2022, 11, 1264.
36. Saeed, M.S.; Mustafa, M.W.; Sheikh, U.U.; Jumani, T.A.; Mirjat, N.H. Ensemble bagged tree based classification for reducing non-technical losses in multan electric power company of Pakistan. Electronics 2019, 8, 860.
37. Ruiz-Abellon, M.D.C.; Gabaldón, A.; Guillamón, A. Load forecasting for a campus university using ensemble methods based on regression trees. Energies 2018, 11, 2038.
38. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: New York, NY, USA, 2013; Volume 103.
39. Gao, Y.; Zhao, J.; Han, L. Quantifying the nonlinear relationship between block morphology and the surrounding thermal environment using random forest method. Sustain. Cities Soc. 2023, 91, 104443.
40. Zhou, Z.; Qiu, C.; Zhang, Y. A comparative analysis of linear regression, neural networks and random forest regression for predicting air ozone employing soft sensor models. Sci. Rep. 2023, 13, 22420.
41. Hassan, M.A.; Khalil, A.; Kaseb, S.; Kassem, M.A. Exploring the potential of tree-based ensemble methods in solar radiation modeling. Appl. Energy 2017, 203, 897–916.
42. Asadi, B.; Hajj, R. Prediction of asphalt binder elastic recovery using tree-based ensemble bagging and boosting models. Constr. Build. Mater. 2024, 410, 134154.
43. Niazkar, M.; Menapace, A.; Brentan, B.; Piraei, R.; Jimenez, D.; Dhawan, P.; Righetti, M. Applications of XGBoost in water resources engineering: A systematic literature review (Dec 2018–May 2023). Environ. Model. Softw. 2024, 174, 105971.
44. Zhang, P.; Jia, Y.; Shang, Y. Research and application of XGBoost in imbalanced data. Int. J. Distrib. Sens. Netw. 2022, 18, 15501329221106935.
45. Chen, C.; Zhang, Q.; Ma, Q.; Yu, B. LightGBM-PPI: Predicting protein-protein interactions through LightGBM with multi-information fusion. Chemom. Intell. Lab. Syst. 2019, 191, 54–64.
46. Sinha, B.B.; Ahsan, M.; Dhanalakshmi, R. LightGBM empowered by whale optimization for thyroid disease detection. Int. J. Inf. Technol. 2023, 15, 2053–2062.
47. Huang, G.; Wu, L.; Ma, X.; Zhang, W.; Fan, J.; Yu, X.; Zhou, H. Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions. J. Hydrol. 2019, 574, 1029–1041.
48. Dasilas, A.; Rigani, A. Machine learning techniques in bankruptcy prediction: A systematic literature review. Expert Syst. Appl. 2024, 255, 124761.
49. Teodorescu, V.; Obreja Brașoveanu, L. Assessing the validity of k-fold cross-validation for model selection: Evidence from bankruptcy prediction using random forest and XGBoost. Computation 2025, 13, 127.
50. Shams, M.Y.; Elshewey, A.M.; El-Kenawy, E.S.M.; Ibrahim, A.; Talaat, F.M.; Tarek, Z. Water quality prediction using machine learning models based on grid search method. Multimed. Tools Appl. 2024, 83, 35307–35334.
51. Aygun, H.; Dursun, O.O.; Dönmez, K.; Sahin, O.; Toraman, S. Prediction of performance characteristics of an experimental micro turbojet engine using machine learning approaches. Energy 2024, 313, 133997.
52. Xu, M.; Qiu, Y.; Khandelwal, M.; Kadkhodaei, M.H.; Zhou, J. Optimizing Random Forest with Hybrid Swarm Intelligence Algorithms for Predicting Shear Bond Strength of Cable Bolts. Machines 2025, 13, 758.
53. Ellahi, M.; Usman, M.R.; Arif, W.; Usman, H.F.; Khan, W.A.; Satrya, G.B.; Daniel, K.; Shabbir, N. Forecasting of wind speed and power through FFNN and CFNN using HPSOBA and MHPSO-BAACs techniques. Electronics 2022, 11, 4193.

Figure 1. GT data flow diagram.

Figure 2. Schematic of the main components of a GT and the air-fuel flow.

Figure 3. Diagram of the CCPP with a single-shaft (1 + 1) configuration.

Figure 4. Schematic layout of exhaust measurement points.

Figure 5. Images of the GT exhaust section and the internal structure of the turbine outlet region.

Figure 6. Log-scale boxplot analysis for detecting outliers in the dataset parameters.

Figure 7. Pearson correlation matrix of the input and output parameters used for EGT prediction.

Figure 8. Scatter matrix showing the pairwise distributions and relationships between EGT and the input variables.

Figure 9. Five-fold cross-validation results of the LightGBM model.

Figure 10. Scatter plots of predicted values.

Figure 11. SHAP feature importance for EGT prediction using the LightGBM model.

Figure 12. Residual distributions of the EGT prediction models.

Figure 13. Comparison of actual and predicted EGT values for the prediction models.

Figure 14. Comparison of actual EGT values and the predictions of LightGBM and Random Forest for 100 samples.

Figure 15. Transient behavior of actual and predicted EGT during off-design load ramp operation under varying GTPO and IGV conditions.

Table 1. Technical specifications of the SGT5-8000H GT.

GT Model/Series	Siemens SGT5-8000H Series
Capacity	401 MW
Frequency	50 Hz
Speed	3000 rpm
Application	Industrial-scale power generation
Fuel type	Natural gas
Reference ambient temperature	15 °C
Reference RH	70%
Reference AP	1009 mbar
Compressor configuration	13 stages; 3 VGVs + 1 IGV
Turbine configuration	4 stages

Table 2. Optimized hyperparameter configurations obtained using grid search and 5-fold cross-validation.

Model	Optimized Hyperparameters
Bagged Trees	n_estimators = 500, max_samples = 1.0, max_features = 1.0, estimator max_depth = None, estimator min_samples_leaf = 1
Random Forest	n_estimators = 500, max_depth = None, max_features = None, min_samples_split = 2
Gradient Boosting	n_estimators = 500, learning_rate = 0.1, max_depth = 6, subsample = 0.8
XGBoost	n_estimators = 500, learning_rate = 0.1, max_depth = 8, subsample = 0.9, colsample_bytree = 0.9
LightGBM	n_estimators = 1000, learning_rate = 0.1, num_leaves = 50, subsample = 0.8, colsample_bytree = 0.8, importance_type = “split”
CatBoost	iterations = 1000, learning_rate = 0.1, depth = 8, l2_leaf_reg = 1, border_count = 64
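
The LightGBM row of Table 2 maps directly onto keyword arguments of the scikit-learn style wrapper `lightgbm.LGBMRegressor`. The sketch below only restates those reported values as a configuration and builds an unfitted model when the `lightgbm` package is available; the helper name `build_lightgbm_regressor` is illustrative, and settings that Table 2 does not report (for example the random seed) are left at library defaults, which is an assumption rather than a statement about the study's setup.

```python
# Table 2 values for the LightGBM model, keyed by LGBMRegressor keyword names.
LIGHTGBM_PARAMS = {
    "n_estimators": 1000,
    "learning_rate": 0.1,
    "num_leaves": 50,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "importance_type": "split",
}


def build_lightgbm_regressor(params=LIGHTGBM_PARAMS):
    """Return an unfitted LGBMRegressor, or None if lightgbm is not installed."""
    try:
        from lightgbm import LGBMRegressor
    except ImportError:
        return None
    return LGBMRegressor(**params)


model = build_lightgbm_regressor()
print("lightgbm available:", model is not None)
```

According to the LightGBM documentation, row subsampling takes effect only when `subsample_freq` is greater than zero. Table 2 does not report that setting, so it is not set here.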

Table 3. Basic statistics of the dataset.

Parameter	Min	Max
GTPO (MW)	239.34	420.11
IGV (%)	39.58	100.00
CIT (°C)	−2.77	33.30
FGF (Sm³/h)	67,197.51	106,933.75
LHV (kcal/Sm³)	8155.91	9090.35
AP (mbar)	980.88	1019.04
RH (%)	10.59	100.00

Table 4. Comparison of Pearson and Spearman correlation coefficients between EGT and operational/environmental parameters.

Variable	Pearson (r)	Spearman (ρ)	Difference (Δ)	Relationship Strength and Type
GTPO	−0.87	−0.94	0.07	Very strong negative
FGF	−0.87	−0.91	0.04	Very strong negative
IGV	−0.71	−0.62	0.09	Strong negative
CIT	0.57	0.56	0.01	Moderate positive
AP	−0.36	−0.34	0.02	Weak negative
RH	−0.24	−0.25	0.01	Weak negative
LHV	0.10	0.10	0.00	Weak positive
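
The two coefficients in Table 4 measure different things: Pearson r captures linear association, while Spearman ρ is the Pearson correlation of the ranks and therefore captures any monotonic association. The sketch below computes both in plain Python on synthetic data. The data are hypothetical (a smooth, strictly decreasing curve over a load-like range) and are not the plant data. A Spearman magnitude larger than the Pearson magnitude, as reported for GTPO in Table 4, is the pattern one would expect from a monotonic but curved relationship.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic, hypothetical data: strictly decreasing and nonlinear
load = [240 + 2 * i for i in range(91)]                       # 240 ... 420
temp = [650 - 40 * ((v - 240) / 180) ** 3 for v in load]

r = pearson(load, temp)
rho = spearman(load, temp)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, delta = {abs(abs(rho) - abs(r)):.3f}")
```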

Table 5. Comparison of performance metrics.

Model Type	RMSE	MSE	R²	MAE
Bagged Trees	1.6410	2.6927	0.9657	1.0854
Random Forest	1.6421	2.6966	0.9657	1.0857
Gradient Boosting	1.6097	2.5911	0.9670	1.1161
XGBoost	1.5335	2.3516	0.9701	1.0318
LightGBM	1.5280	2.3347	0.9703	1.0319
CatBoost	1.5818	2.5021	0.9683	1.0647
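
The four metrics in Table 5 can be reproduced from actual and predicted values with a few lines of Python. The sketch below uses the standard definitions of MAE, MSE, RMSE, and R², applies them to a small hypothetical sample (not data from the study), and then checks that the RMSE and MSE columns of Table 5 are mutually consistent, since RMSE² should equal MSE up to rounding. The function and variable names are illustrative.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R2 using their standard definitions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": 1 - ss_res / ss_tot}


# Hypothetical sample values, for illustration only
actual = [600.0, 610.0, 620.0, 630.0]
predicted = [601.0, 609.0, 622.0, 628.0]
metrics = regression_metrics(actual, predicted)
print(metrics)

# (RMSE, MSE) pairs copied from Table 5
TABLE5 = {
    "Bagged Trees": (1.6410, 2.6927),
    "Random Forest": (1.6421, 2.6966),
    "Gradient Boosting": (1.6097, 2.5911),
    "XGBoost": (1.5335, 2.3516),
    "LightGBM": (1.5280, 2.3347),
    "CatBoost": (1.5818, 2.5021),
}
max_gap = max(abs(rmse ** 2 - mse) for rmse, mse in TABLE5.values())
print(f"Largest |RMSE^2 - MSE| in Table 5: {max_gap:.5f}")
```

The largest gap is of the size expected from rounding to four decimals, so the two columns agree.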

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Aslan, A. Gas Turbine Exhaust Gas Temperature Prediction Under Variable Operating Loads and IGV Positions Using Tree-Based Ensemble Learning. Machines 2026, 14, 630. https://doi.org/10.3390/machines14060630

