Prediction of Dry-Low Emission Gas Turbine Operating Range from Emission Concentration Using Semi-Supervised Learning

Dry-Low Emission (DLE) technology significantly reduces the emissions from the gas turbine process by implementing the principle of lean pre-mixed combustion. The pre-mix ensures low nitrogen oxides (NOx) and carbon monoxide (CO) production by operating within a particular range under a tight control strategy. However, sudden disturbances and improper load planning may lead to frequent tripping due to frequency deviation and combustion instability. Therefore, this paper proposes a semi-supervised technique to predict the suitable operating range as a tripping prevention strategy and a guide for efficient load planning. The prediction technique is developed by hybridizing the Extreme Gradient Boosting and K-Means algorithms using actual plant data. Based on the results, the proposed model can predict the combustion temperature, nitrogen oxides, and carbon monoxide concentration with R-squared values of 0.9999, 0.9309, and 0.7109, respectively, which outperforms other algorithms such as decision tree, linear regression, support vector machine, and multilayer perceptron. Further, the model can identify DLE gas turbine operation regions and determine the optimum range in which the turbine can safely operate while maintaining low emission production. The range in which the typical DLE gas turbine can operate safely is found to be 744.68 °C to 829.64 °C. The proposed technique can be used as a preventive maintenance strategy in many applications involving tight operating range control to mitigate tripping issues. Furthermore, the findings contribute to the power generation field by supporting better control strategies to ensure the reliable operation of DLE gas turbines.


Introduction
Gas turbines are one of the most versatile and efficient power generation sources, used in various applications, including aviation, power plants, and oil and gas production. The primary operation in the gas turbine system is the combustion process, where energy conversion takes place by mixing compressed air and fuel, which is subsequently ignited to produce a high-temperature gas flow that rotates the turbine, producing shaft work to drive the electrical generator. However, due to an incomplete reaction, combustion releases emissions such as nitrogen oxides (NOx) and carbon monoxide (CO). The emissions produced by combustion engines have become an important topic in achieving the net-zero emission target, which has led to stringent pollution policies and, in turn, to the introduction of Dry-Low Emission (DLE) gas turbines [1].
DLE gas turbines reduce emissions by implementing the lean-premixed (LPM) method, which lowers the combustion temperature by mixing more air with the fuel before delivering it to the combustor, as lowering the combustion temperature lowers the emissions [2,3]. According to [4], the DLE gas turbine can achieve an emission reduction of up to 97%, which can effectively contribute to minimizing greenhouse gases. Even though this technology is environmentally friendly, the lean-burn operation may cause combustion instability due to various factors, including acoustic resonance and reduced flame speed, leading to flame-out, commonly known as lean blow-out (LBO) [5]. An LBO fault occurs when the turbine operates below the LBO limit, the lowest equivalence ratio that can sustain the flame [6]. In addition, combustion at too low a temperature leads to high formation of CO emissions. On the other hand, operating the turbine above the desired range leads to high production of NOx [7,8]. Therefore, the DLE gas turbine operation should be maintained within a specific range, as illustrated in Figure 1. Various causes of LBO have been identified, such as frequency fluctuation and an unbalanced air-fuel ratio due to a sudden change of load, as a rapid change in load demand affects the combustion stability of the DLE gas turbine [9,10]. When the load decreases, the fuel flow rate decreases, leading to a leaner air-to-fuel ratio. This leaner mixture can cause the flame to become unstable and eventually lead to a lean blowout [11]. Similarly, when the load on the gas turbine increases, the fuel flow rate increases, leading to a richer mixture. This richer mixture can also lead to flame instability due to incomplete combustion. Therefore, proper load management is essential in maintaining the healthy operation of the DLE gas turbine, which can be achieved by carefully determining the operating range.
According to Figure 1, the operating range is a function of NOx and CO emissions against the combustion temperature. Hence, the suitable operating range can be estimated by predicting the NOx and CO emissions produced during the combustion process. Numerous emission prediction methods are available in the literature. A numerical model was proposed by Emami in [12] to predict the NOx and CO concentration using Computational Fluid Dynamics (CFD). The numerical simulation was used to identify the mechanism of NOx formation and the CO characteristics with respect to the change in inlet air temperature. Another method uses semi-empirical analysis, combining CFD and a Chemical Reactor Network (CRN) to predict the emission composition and LBO events [13]. Another study [14] implemented a statistical method, employing a response surface methodology-based Box-Behnken design to model and optimize the prediction of NOx and CO emissions from a diesel engine. These approaches demonstrated good results. However, there is a lack of physical insights since the prediction was made using a numerical simulation underlying the physical model.
The data-driven method is subsequently adopted because it can predict the result by learning from data alone. Hence, it simplifies the calculation and reduces the possibility of missing physical laws during model development. Three approaches are commonly used in data-driven methods: supervised, unsupervised, and semi-supervised. In emission prediction, supervised learning is mainly implemented to predict the emission concentration. Masoud [15] and Saiful [16] implemented a Support Vector Machine (SVM) regressor to predict NOx for diesel engines and gas turbines, respectively. Tuttle further combined SVM with Neural Networks (NNs) to classify and predict the emissions from different fuels [17]. Bo Liu [18] also employed SVM, combining the model with Principal Component Analysis (PCA) and a Genetic Algorithm (GA) to predict NOx concentration, which outperformed other algorithms such as the original SVM, neural networks, and Partial Least Squares (PLS). On the other hand, a non-parametric supervised method, namely k-nearest neighbors (k-NN), was used by Rezazadeh to predict NOx [19]. Meanwhile, other scholars prefer NN algorithms, such as Botros [20] and Minxing [21], who employed NNs to predict NOx from conventional and DLE gas turbines, respectively. NN-based prediction has also shown promising results in many other applications, such as forced convection and thermal predictions [22].
A challenge encountered in dealing with gas turbine data is the existence of noisy and missing data due to the heavy operation of the turbine. Hence, selecting a proper prediction technique is necessary to develop an adaptive model that can handle the corrupted data correctly and efficiently. An ensemble algorithm, namely extreme gradient boosting (XGBoost), is capable of managing these issues, as established by Minxing Si in [23], where it outperformed the neural network model in predicting the NOx emission from a coal-fired boiler. Therefore, XGBoost is employed in this study for emission prediction because it requires little data preprocessing, less training time, and fewer hyperparameters to adjust. In addition, this algorithm can handle large datasets and has achieved state-of-the-art performance on many prediction benchmarks.
Despite the advantages of XGBoost for predictive modeling tasks, the supervised approach may not be suitable for determining the operating range since it relies on labeled data. The operating range of the gas turbine is affected by various factors, such as the fluctuation of the ambient conditions and the operation demand, creating ambiguity in defining the exact range. Therefore, the clustering approach is adopted in this study to predict the operating range by discovering similarities in the data and grouping them into distinct regions. K-Means, categorized as a partitioning method, is one of the most widely used clustering algorithms [24,25]. It is a well-known unsupervised learning algorithm introduced 50 years ago [26] and favored for its fast computation, simplicity, and ability to handle huge data [27]. In addition, it is suitable for dealing with unevenly distributed data and produces consistent results with different initializations [28].
Therefore, this paper aims to propose a prediction technique to determine the DLE gas turbine's operating range based on the emission concentration by hybridizing XGBoost and K-Means algorithms. The main contributions of the paper are highlighted as follows:
• Develop a model to predict the emission of NOx and CO from DLE gas turbine using XGBoost. Additionally, the combustion temperature will be predicted.
• Develop a technique to predict the operating range of a DLE gas turbine based on gas emission concentration using K-means algorithm.
Furthermore, several data-driven techniques based on machine learning methods are also employed to compare different algorithms and provide an overview of their prediction capability in the studied case. The proposed hybrid model contributes to emission reduction in power generation while providing healthy operation during the DLE mode. Additionally, the proposed technique is adaptable to other implementations involving engine operation that require an operating range control strategy.

Semi-Supervised Learning for Operating Range Prediction of Dry-Low Emission Gas Turbine
This section is divided into two subsections describing the XGBoost and K-Means algorithms. The overall flow of the model development is depicted in Figure 2. Firstly, the data of a DLE gas turbine collected from the actual plant are divided into training and test data with a ratio of 70:30. The training data are then pre-processed using the Pearson correlation test to determine the important features for model input. In addition, the technical role of each parameter impacting the turbine operation is evaluated to support the feature selection. Subsequently, the training data are used to develop the XGBoost regression model predicting the combustion temperature, NOx, and CO emissions. After that, the test data are fed to the model to predict new results and validate the model performance. In this part, the developed model is compared with other regression algorithms, such as decision tree, linear regression, multilayer perceptron, and support vector machine. The predicted output is then used for operating range prediction. The next step is to determine the number of clusters using the elbow method. The optimum number of clusters is then used for K-Means model development. Lastly, the predicted regions are labeled according to the operating condition of the DLE gas turbine to find the optimum range in which the turbine can operate.
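As a rough illustration of the first step in this workflow, the 70:30 division can be sketched in plain Python. The shuffling, the seed, and the function name `split_train_test` are assumptions made for illustration; the text does not state whether the plant data were split randomly or chronologically.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists.

    Random shuffling is an assumption; a chronological split would
    simply skip the shuffle step.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the plant records: only the row count matters here.
samples = list(range(100_000))
train_rows, test_rows = split_train_test(samples)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, which matters when several regression algorithms are later compared on the same test data.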

XGBoost
The extreme gradient boosting (XGBoost) algorithm is a tree-based ensemble learning method that was first released in 2014. The idea of XGBoost comes from the boosting method, which is expressed as:
$$\hat{y}_i^{(k)} = \hat{y}_i^{(k-1)} + f_k(x_i)$$
where $\hat{y}_i^{(k)}$ is the predicted output for the $i$th data point and $k$ is the number of iterations. $f_k(x_i)$ is the estimator that improves the previous prediction $\hat{y}_i^{(k-1)}$. The architecture of tree-based learning is illustrated in Figure 3, which includes root nodes represented by blue circles, internal nodes by faded orange circles, and leaf nodes by brown and yellow circles.
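The additive update described above, in which each new estimator corrects the previous prediction, can be illustrated with a toy booster. This is a minimal sketch and not XGBoost itself: it uses one-split trees (stumps) on a single feature, plain squared-error residuals, and a shrinkage factor `learning_rate`, all of which are assumptions chosen to keep the example short.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to 1-D inputs by least squares.

    Needs at least two distinct x values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Return a predictor built by repeatedly fitting stumps to residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    stumps = []
    preds = [base] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Additive update: new prediction = previous prediction + shrunken f_k(x)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = list(range(10))
ys = [x * x for x in xs]
mse_before = mse(boost(xs, ys, rounds=0), xs, ys)   # mean-only predictor
mse_after = mse(boost(xs, ys, rounds=50), xs, ys)   # after 50 boosting rounds
print(mse_before, mse_after)
```

The training error shrinks as rounds are added because every stump is fitted to what the previous ensemble still gets wrong.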
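In XGBoost, a regularization function is introduced to avoid overfitting while optimizing the loss function. The objective function for the regression problem is expressed as:
$$\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(f_k)$$
where $n$ is the number of training samples, $l$ is the loss between the actual value $y_i$ and the prediction $\hat{y}_i$, and $\Omega(f_k)$ is a regularization function.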
The regularization function is written as:
$$\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$
where $T$ is the number of leaf nodes and $w$ is the leaf weight. $\gamma$ and $\lambda$ are hyperparameters that can be tuned to improve the prediction performance. The training process is repeated iteratively: each new tree is fitted to the residuals (errors) of the previous trees and is then combined with them to provide the final prediction. To improve the performance of the proposed model, hyperparameter tuning of XGBoost is employed. The optimization is performed based on cross-validation with a cv value of 5. Some parameters that potentially contribute further are also manually tuned. The finalized hyperparameters used for model development are tabulated in Table 1. The predicted values from the model are evaluated against the actual values using four performance metrics: R², mean absolute error (MAE), root mean squared error (RMSE), and relative percentage error (%error). The first three take their standard forms:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $\hat{y}_i$ is the predicted value of $y_i$, $\bar{y}$ is the mean of the actual values, and $n$ is the number of samples.
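As a minimal sketch, the standard definitions of R², MAE, and RMSE can be computed with plain functions. The function names are illustrative. The relative percentage error (%error) is left out because its exact formula is not recoverable from the text, and any choice here would be a guess.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

# Tiny made-up example: one of four predictions is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(r2_score(y_true, y_pred))                 # 0.8
print(mean_absolute_error(y_true, y_pred))      # 0.25
print(root_mean_squared_error(y_true, y_pred))  # 0.5
```

The example also shows why RMSE exceeds MAE when errors are concentrated in a few points: squaring weights the single large error more heavily.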

K-Means
The convergence of K-means clustering was proved many years ago, opening the way for its widespread application in current research and industry [29]. The approach involves randomly selecting k objects as the initial clustering centers, calculating the distance between each object and each clustering center, and assigning the object to the nearest clustering center [30]. Each clustering center, also known as a centroid, together with the items assigned to it represents a class, as represented by the different colours of data groups in Figure 4. The cluster centers are then recalculated from the items currently assigned to each cluster. The loop continues until the cluster centers no longer change. The K-means method is divided into two steps. The first step is determining the initial k. In this research, the elbow method is selected to find the proper value of k. The k values used in this study range from 2 to 10 and are plotted against the WCSS (within-cluster sum of squares), also known as inertia, which is calculated by summing the squared distance between each point and the centroid of its cluster. The value of inertia decreases as the number of clusters increases. The point at which the inertia curve starts to run almost parallel to the X-axis is the elbow point, where the optimum value of k is found.
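The loop and the elbow curve described above can be sketched in plain Python. This is an illustrative implementation of Lloyd's algorithm, not the authors' code: the synthetic one-dimensional data, the seeds, and the use of five restarts per k are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (centroids, inertia)."""
    centroids = random.Random(seed).sample(points, k)  # random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [  # update step: centroid = mean of its members
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:  # centers no longer change
            break
        centroids = updated
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia

# Synthetic stand-in data: four groups of hypothetical temperatures.
rng = random.Random(42)
points = [(rng.gauss(center, 5.0),) for center in (300, 600, 800, 900) for _ in range(30)]

# Inertia for k = 2..10, keeping the best of five random initializations.
inertias = {k: min(kmeans(points, k, seed=s)[1] for s in range(5)) for k in range(2, 11)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k gives the elbow curve; the restarts reduce the chance that a poor random initialization distorts one point on that curve.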
The second step is determining the cluster to which each object belongs. In this stage, the Euclidean distance is calculated for the $i$th object $o_i$. The Euclidean value represents the distance between $o_i$ and each of the cluster centers $k_j$. Subsequently, the corresponding cluster $S_j$ whose center has the smallest distance is identified. The calculation is formulated by Equation (8):
$$d(o_i, k_j) = \sqrt{\sum_{m=1}^{M}\left(o_{i,m} - k_{j,m}\right)^2} \qquad (8)$$
where $M$ is the total number of features, $o_{i,m}$ is the value of the $m$th feature of the $i$th object, and $k_{j,m}$ is the value of the $m$th feature of the $j$th cluster center.
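The assignment step can be sketched as a small function that evaluates the Euclidean distance to every cluster center and returns the nearest one. The object and center values below are hypothetical (temperature, emission) pairs chosen only to make the arithmetic easy to follow.

```python
import math

def euclidean(o, k):
    """Euclidean distance between object o and cluster center k over all features."""
    return math.sqrt(sum((om - km) ** 2 for om, km in zip(o, k)))

def assign(o, centers):
    """Return (index of the nearest center, distance to it)."""
    distances = [euclidean(o, c) for c in centers]
    j = min(range(len(centers)), key=distances.__getitem__)
    return j, distances[j]

# Hypothetical values for illustration only.
centers = [(400.0, 20.0), (780.0, 85.0), (900.0, 130.0)]
obj = (800.0, 90.0)
index, distance = assign(obj, centers)
print(index, distance)  # nearest is center 1, at sqrt(20**2 + 5**2)
```

Because the distance sums squared differences over all features, features with large numeric ranges dominate unless the data are scaled first; whether scaling was applied in the study is not stated.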

Data Collection
The data were collected from 4 months of operation of a DLE gas turbine and consist of 100,000 data points covering healthy and unhealthy conditions. The healthy data were collected during normal operation. The unhealthy data contain three trip incidents, capturing the undesirable operation that happened during the data-collection period. The gas turbine is a two-stage, single-shaft type with a rated power of 17.9 MW. The turbine has a 16-stage axial flow compressor and is fueled by natural gas.
The data consist of 13 operating parameters and 2 emission parameters of the DLE gas turbine, as tabulated in Table 2. The operational parameters consist of load, speed, ambient air temperature (AAT), inlet guide vane (IGV) opening, compressor discharge pressure (CDP), stop ratio valve (SRV) opening, gas control valve (GCV) opening, splitter opening (SO), fuel gas flow (FGF), fuel gas pressure (FGP), T5 combustion temperature (CT), T7 exhaust temperature (ET), and exhaust gas pressure (EGP). The gas emissions measured by the gas analyzer are NOx and CO concentrations. Figure 5 illustrates the system flow diagram of the typical DLE gas turbine with the measurement sensors. Three main components of the gas turbine arrangement are monitored by sensors: the compressor, the combustion chamber, and the mechanical turbine. The load demand, labeled 1, maneuvers the gas turbine operation. The driven load determines the power output by maintaining the rotation of the mechanical turbine at a certain speed, labeled 2. The power output is sensitive to the ambient air temperature, labeled 3, as an increase in ambient air temperature lowers the air density, reducing the mass flow through the turbine and decreasing the power output. Hence, monitoring the ambient air temperature is essential in maintaining the reliable operation and performance of the gas turbine. The air is directed to the compressor by the IGV at 4 and then compressed, with the discharge pressure monitored as CDP at 5, before mixing with the fuel in the combustion chamber. The fuel coming to the combustion chamber passes through the SRV at 6, which keeps the gas pressure stable and regulates the pressure drop. The GCV at 7 then regulates the fuel flow as required for the combustion process. Since the DLE combustor requires two fuel streams, the splitter valve at 8 controls the split between the main fuel and the pilot fuel before they enter the chamber. The flow and pressure of the fuel are monitored at 9 and 10, respectively.
The combustion temperature is difficult to monitor directly due to the extreme conditions and thermal gradient inside the chamber, which exceed physical sensor limitations. Since the firing temperature is proportional to the temperature of the gas leaving the chamber, the temperature is instead measured at the exhaust of the chamber, labeled T5 at 11. Therefore, T5 is considered the combustion temperature in this study and is used for operating range prediction. In the exhaust part of the turbine, the temperature and pressure are monitored at 12 and 13. The NOx and CO emissions produced during the process are measured at 14 and 15, respectively.

Data Analysis
A sample of the data collected from the typical DLE gas turbine is shown in Figure 6, where the input and output parameters are represented by blue and red lines, respectively. The data contain a trip incident after 280 min, as indicated by the load suddenly dropping to 0 MW. Before the trip occurred, there was a sudden increase in load from 10 MW to 18 MW at 238 min. The system maintained the desired load for several minutes before it failed, and then the load dropped sharply. Further analysis exhibits a similar trend in CDP and FGF, where both parameters rise quickly due to the sudden load increase. Since the gas turbine is a single-shaft type with relatively constant speed, the increase in load demand is followed by an increase in the fuel flow, FGF, which raises the combustion temperature, CT. Accordingly, the splitter opening, SO, widened from 25% to 88% during the load change and then increased to 100%, i.e., fully open, during the trip. The increase in CT affects the rise of CDP, which is used to estimate the firing temperature reference. Other parameters, such as exhaust temperature (ET) and exhaust gas pressure (EGP), show an identical pattern in which the value rises to a peak before the trip occurs. On closer observation, the ambient air temperature, AAT, increased gradually up to 40°C during the trip, revealing a fluctuation in ambient conditions. Similarly, the concentration of NOx emission fluctuated before the trip happened. On the other hand, CO emissions increased significantly from 2 ppm to 60 ppm before the trip occurred.
By carefully observing the sudden load increase in the DLE gas turbine, it can be seen that the transient condition may cause dynamic instability, leading the turbine to trip. In addition, due to demanding ambient and operational settings, the root cause of the tripping incident might be difficult to recognize. Unsupervised learning can identify the patterns and structure in the data independently and even uncover hidden relationships by grouping the data based on similarity. Therefore, implementing unsupervised learning such as K-means helps discover the operating regions of a DLE gas turbine whose data contain healthy and unhealthy operations. It allows engineers to identify the different operating regimes or conditions in which the gas turbine may be operating. This information can then be used to optimize the gas turbine's performance for each of these operating conditions, leading to improved efficiency and reduced maintenance costs. Similarly, this approach can be applied to other engines with noisy or incomplete operational data to identify anomalies and reveal hidden relationships and insights that may be difficult to detect through manual analysis.

Data Pre-Processing
To examine the relationship between the parameters of the dataset, the correlation test result is mapped in Figure 7. The relationship is then analyzed based on the correlation among the input parameters and the pairwise correlation between input and target parameters. The correlation test is performed by calculating Pearson's correlation, as described by Equation (9):
$$r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}} \qquad (9)$$
where $N$ is the number of pairs of scores, $\sum xy$ is the sum of the products of paired scores, $\sum x$ and $\sum y$ are the sums of the x and y scores, and $\sum x^2$ and $\sum y^2$ are the sums of the squared x and y scores. Firstly, by carefully observing the correlation among input parameters, the most highly correlated parameter is CDP, followed by FGP, Speed, SO, and FGF. CDP has a strong correlation value of 1 with speed and FGP. However, speed and FGP are not employed since the turbine is a single-shaft type in which the engine has to operate at a relatively constant speed, and the gas pressure is maintained at a particular value. Meanwhile, CDP contributes greatly to estimating the turbine inlet temperature reference. Thus, CDP is preferred to speed and FGP. Load is also highly correlated with other parameters, with the highest correlation value of 0.99 against FGF. The role of the load in maneuvering the turbine operation makes this parameter an essential feature for model development. Further, various operating conditions can be affected by load fluctuation, making it all the more necessary to include this parameter. The other highly correlated parameters, SO and FGF, have a high correlation value of 0.9 against other parameters. These two parameters significantly impact the DLE gas turbine system since the output power is adjusted by regulating the FGF, which can be controlled through the SO. Therefore, SO and FGF are essential for model development, representing the gas fuel system. In the exhaust component, ET and EGP have the highest correlation with other input parameters, with a correlation value above 0.95, except with AAT. Nevertheless, AAT significantly impacts gas turbine performance since its fluctuation affects the output power.
Secondly, the correlation between input and target variables exhibits a high dependency, as summarized in Table 3. CDP and ET have the highest correlation with the combustion temperature, CT, with a correlation value of 1. The other parameters also correlate highly with CT, with an average correlation of 0.9. Among the emission targets, NOx has a higher correlation value than CO for all input parameters, with the highest correlation from load (0.98) and the lowest from AAT (−0.61). CO emission correlates with FGP and speed at a value of 0.26. Other parameters also show a relatively high correlation with CO, except the load, at 0.091. Nevertheless, the load is considered an essential parameter since the load demand drives the gas turbine operation. Therefore, based on the correlation analysis, the finalized parameters used for model development are CDP, SO, FGF, EGP, ET, Load, and AAT.
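The Pearson coefficient used for this analysis can be computed directly from the raw sums named in its definition (N, the sum of products, the sums of scores, and the sums of squared scores). The sketch below uses made-up numbers, not plant data, and the function name is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from raw sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Perfectly linear pairs give +1 or -1.
x = [1, 2, 3, 4]
print(pearson_r(x, [2, 4, 6, 8]))  # 1.0
print(pearson_r(x, [8, 6, 4, 2]))  # -1.0
```

A value of exactly 1 between two inputs, as reported for CDP against speed and FGP, means they carry the same linear information, which is the rationale for keeping only one of them as a model feature.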

Results and Discussion
This section discusses the results of the predicted combustion temperature, NOx, and CO emissions from the proposed XGBoost model. Furthermore, the comparison of the proposed model with other algorithms is presented. Subsequently, the prediction of the DLE gas turbine operating range from the K-means model is discussed.

Prediction of Combustion Temperature, NOx, and CO Using XGBoost
The regression model has been developed using XGBoost to predict three output parameters: combustion temperature (CT), NOx, and CO. The results are analyzed based on graphical plots and numerical evaluation. A benchmark of the proposed model against other algorithms is also discussed, as summarized in Table 4. Figure 8 presents the plot of the CT prediction for the training and test datasets. The figure shows that the model successfully predicts the test data, capturing the trend when the trip happens, as the temperature drops to 0°C, and during start-up until it reaches the desired temperature at normal operation. Based on the numerical evaluation of performance metrics, the proposed model performs excellently, obtaining an R² of 0.9999, MAE of 1.1285, RMSE of 6.9549, and %error of 0.0356. The MAE of the XGBoost model is the third lowest after linear regression and support vector machine. On the other hand, its RMSE and relative percentage error (%error) are the second largest after the decision tree. The CT prediction model is acceptable since the relative percentage error meets the decision criterion of less than 1%. Even though the error metrics of the proposed model are not the lowest among the algorithms, it still exhibits a promising result as the errors are small.
Based on the graphical evaluation, the bottom right of Figure 8 shows that the actual and predicted CT values coincide closely, indicating that the predictions follow the actual values precisely. Furthermore, the predicted training data also converge to the actual data, indicating a well-fitted model, as shown in the bottom left of the figure.
The proposed model is further compared graphically with other algorithms, as depicted by the zoomed plot of the predicted test data in Figure 9. The figure shows that the proposed XGBoost and linear regression lie closest to the actual line. The support vector machine and multilayer perceptron are slightly distant from the actual one, while the decision tree model has fluctuating predicted values. The fluctuation of the decision tree model shows a higher deviation than the others, as confirmed by its RMSE of 7.3604, the highest among the algorithms. The goodness of fit between actual and predicted values is visualized in Figure 10, where the blue points represent the scattered data and the red line depicts the expected results. Based on the visualization, all algorithms exhibit a well-fitted result, as the data points are nearly evenly distributed along the expected result, with only some data points lying far from it. Therefore, it can be concluded that all the algorithms produce promising results for CT prediction. Nevertheless, the proposed XGBoost model attains the highest R² value of 0.9999, revealing the best fit, followed by linear regression (0.9996) and multilayer perceptron (0.9993).

NOx Prediction
The proposed model yields an excellent NOx prediction, as depicted in Figure 11. According to the figure, the proposed model effectively predicts the actual trend during the trip incident and normal operation for both the training and test data sets.
The numerical analysis shows that the proposed XGBoost model outperforms the other algorithms by achieving the highest R² and the lowest RMSE, which are 0.9309 and 4.9765, respectively. However, the proposed model has the second lowest MAE after the multilayer perceptron, by a small margin: 3.5968 for XGBoost versus 3.2349 for the multilayer perceptron. Nevertheless, the proposed model meets the decision criterion with a relative percentage error of 0.1168, which is the lowest among the algorithms.
On closer graphical observation, the predicted NOx emission trend follows the actual one, as represented by the red and blue lines for both the training and test data sets. This result indicates that the proposed model is capable of capturing the NOx emission, which fluctuates due to complex chemical processes during combustion. The predictions from all algorithms are depicted in Figure 12. Graphically, the proposed model lies close to the actual line. On the other hand, the predicted values of linear regression and support vector machine fall slightly away from the actual trend. Similarly, the decision tree model also deviates from the actual trend, with fluctuating values at some points. The poor linear regression prediction is reflected numerically in its RMSE of 8.7526, the highest among the algorithms. The decision tree and support vector machine have the second and third highest RMSE, respectively. Furthermore, the deviation of the decision tree model is reflected in its relative percentage error of 7.2016, which is the highest among the algorithms.
Further analysis is carried out by plotting the actual against the predicted values, as visualized in Figure 13. Based on the plot, the data are mainly distributed between 60 and 137.69 ppm, showing the NOx concentration emitted during operation. On careful observation, the proposed XGBoost yields a proportional plot with fewer data points falling away from the expected result than the others. This result indicates a well-fitted prediction, reflected in the highest coefficient of determination, R² = 0.9309. In contrast, the linear regression model has more data points distant from the expected result, representing poor prediction. This result is reflected numerically in the lowest R² value, 0.8008.

CO Prediction
The prediction of CO emission from the proposed model exhibits a promising result, as depicted in Figure 14. The trend during normal operation and trip occurrence is successfully predicted for training and test data sets.
Based on the numerical evaluation of performance metrics, the proposed XGBoost model outperforms the other algorithms by obtaining the highest R² of 0.7109 and the lowest RMSE of 23.7489. However, the MAE of the proposed model is the second lowest after the multilayer perceptron; XGBoost = 3.5968 and multilayer perceptron = 3.2349. Nevertheless, the proposed XGBoost model has the lowest relative percentage error, 0.9200, showing that the model meets the decision criterion. Even though XGBoost capably predicts CO emission, the performance is lower than for the other predicted outputs, combustion temperature and NOx. This result can be explained by the correlation values of CO against the features, which are lower than those of NOx and combustion temperature. The comparison of the CO prediction from the proposed model and the other algorithms is visualized in Figure 15. The plot shows that the XGBoost model follows the actual trend more closely than the others, revealing an excellent prediction result. This is consistent with the relative percentage error, for which the XGBoost model has the lowest value. In contrast, the decision tree has fluctuating predicted values, as represented by a significant deviation on the plot. The fit between actual and predicted values is depicted in Figure 16. The plot shows that the XGBoost model outperforms the other algorithms, as its data points are spread closer to the expected result. In contrast, linear regression, SVM, and MLP have several data points far from the expected result, indicating a large deviation. These results are reflected numerically in the ranking of R² values, in which the proposed model has the highest coefficient of 0.7109, followed by MLP with a value of 0.6918.

Clustering Model Development in K-Means
The predicted values of combustion temperature, NOx, and CO emissions from the XGBoost model are subsequently used to predict the DLE gas turbine's operating range. The prediction starts by determining the number of clusters using the Elbow method, as shown in Figure 17.
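The elbow procedure can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the basic Lloyd's k-means below stands in for a library K-Means, the evenly spaced initialization is chosen only for reproducibility, the function names and the `min_drop_fraction` threshold are hypothetical, and the data are synthetic (temperature, emission) pairs assumed to be already scaled. In the paper the elbow is read visually from the plot; the threshold rule here is only one possible way to automate that reading.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans_inertia(points, k, n_iter=100):
    """Run a basic Lloyd's k-means and return the final inertia."""
    # Assumption: deterministic start, k points evenly spaced in sorted order.
    ordered = sorted(points)
    if k == 1:
        centroids = [ordered[len(ordered) // 2]]
    else:
        step = (len(ordered) - 1) / (k - 1)
        centroids = [ordered[round(i * step)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Inertia: sum of squared distances to the nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow_k(inertias, min_drop_fraction=0.01):
    """Smallest k after which one more cluster lowers the inertia by less
    than min_drop_fraction of the inertia at the smallest k (hypothetical rule)."""
    ks = sorted(inertias)
    baseline = inertias[ks[0]]
    for k, k_next in zip(ks, ks[1:]):
        if inertias[k] - inertias[k_next] < min_drop_fraction * baseline:
            return k
    return ks[-1]

# Synthetic, pre-scaled (temperature, emission) points in four groups.
centers = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
best_k = elbow_k(inertias)
print(inertias, best_k)
```

On real data the temperature and emission columns have very different magnitudes, so some form of scaling would likely be needed before computing distances; whether and how the authors scaled their data is not stated in this section.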
The inertia is a function of the number of clusters, and the point at which its rate of decrease starts to level off is taken as the optimal number of clusters. The inertia is calculated separately for NOx and CO emissions against the combustion temperature for more accurate results, which also makes the elbow point easier to identify visually. Based on the elbow plot, the inertia of both NOx and CO starts to level off at cluster number 4. Hence, the predicted data are clustered into four distinct operating regions. The operating range is predicted separately for NOx and CO emissions, as depicted in Figures 18 and 19, where the NOx and CO data are plotted in red and blue, respectively. The dashed black vertical lines represent the margins between adjacent clusters, calculated by averaging the distance between the nearest data points of the two clusters. These margins cut the x-axis, the combustion temperature, to identify the range of each region. From the NOx perspective, Region 1 is found below 444.56°C, Region 2 ranges from 444.57°C to 822.68°C, Region 3 ranges from 822.69°C to 870.10°C, and Region 4 lies above 870.10°C. From the CO perspective, Region 1 is located below 480.67°C, Region 2 ranges from 480.68°C to 744.67°C, Region 3 is found at 744.68°C to 829.64°C, and Region 4 is located above 829.64°C.
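One way to read the margin calculation is as the midpoint between the highest temperature of one cluster and the lowest temperature of the next. The sketch below implements that interpretation, which is an assumption rather than the authors' stated formula. The function name `cluster_margins`, the temperatures, and the labels are hypothetical, and the clusters are assumed not to overlap along the temperature axis.

```python
def cluster_margins(temperatures, labels):
    """Return the boundary temperatures between adjacent clusters,
    ordered along the temperature axis."""
    groups = {}
    for temp, label in zip(temperatures, labels):
        groups.setdefault(label, []).append(temp)
    # Order clusters by their lowest temperature.
    ordered = sorted(groups.values(), key=min)
    margins = []
    for lower, upper in zip(ordered, ordered[1:]):
        # Midpoint between the two nearest points of neighbouring clusters.
        margins.append((max(lower) + min(upper)) / 2)
    return margins

# Synthetic combustion temperatures (°C) and cluster labels, not plant data.
temps = [300.0, 420.0, 470.0, 600.0, 740.0, 750.0, 800.0, 828.0, 832.0, 900.0]
labels = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
margins = cluster_margins(temps, labels)
print(margins)
```

With four clusters this yields three margins, which is what divides the temperature axis into the four regions described above.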
The regions identified from the NOx and CO clustering are subsequently used to determine the final range by averaging the margins from both sides to find the optimum range, as tabulated in Table 5 and depicted in Figure 20. Each region represents a different operating condition of the DLE gas turbine: trip, near to trip, safe operation, and unhealthy. As tight control of DLE gas turbine operation requires a specific range to operate safely, the proposed model predicts the optimum range, from 744.68°C to 829.64°C (Region 3), as shown in Figure 20. Operation at 480.68°C to 744.67°C (Region 2) is considered near to trip, indicating a high tripping probability; this region can serve as a preventive alarm to avoid tripping. The operator can then act accordingly, controlling the system to restore the turbine to normal operation. This action also helps prevent the LBO fault and high formation of CO emissions. The turbine may trip when operated below 480.67°C (Region 1). In contrast, unhealthy operation may occur when the turbine is operated above 829.64°C (Region 4), leading to high formation of NOx emissions.
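The final ranges can be turned into a simple lookup for alarm purposes, as in the following sketch. The function name `classify_region` is hypothetical, and the thresholds are the final values reported above. Treating the 0.01°C gaps between reported ranges (for example, between 480.67°C and 480.68°C) as belonging to the lower region is an assumption.

```python
def classify_region(temp_c):
    """Map a combustion temperature (°C) to the operating region
    reported for the final clustering result."""
    if temp_c < 480.68:
        return "trip"            # Region 1
    if temp_c < 744.68:
        return "near to trip"    # Region 2
    if temp_c <= 829.64:
        return "safe operation"  # Region 3
    return "unhealthy"           # Region 4

# Hypothetical temperature readings, for illustration only.
for reading in [400.0, 600.0, 744.68, 829.64, 850.0]:
    print(reading, classify_region(reading))
```

A monitoring system could raise a warning whenever a reading falls in the near-to-trip region, giving the operator time to act before the turbine enters the trip region.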

Conclusions
This paper presents a technique to predict the DLE gas turbine's operating range using a semi-supervised approach. The prediction model is developed by hybridizing the XGBoost and K-Means algorithms using actual data from a DLE gas turbine with a rated power of 17.9 MW. Fifteen parameters, including operational and emission concentration parameters, are examined. Based on the correlation analysis, the important features used for model development are compressor discharge pressure, splitter opening, fuel gas flow, exhaust gas pressure, exhaust temperature, load, and ambient air temperature.
The XGBoost model predicts the turbine's combustion temperature, NOx, and CO emissions, and the predicted outputs are then fed to the K-Means model for operating region prediction. Based on the results, the XGBoost model successfully predicts the combustion temperature, NOx, and CO with accuracy represented by R² of 0.9999, 0.9309, and 0.7109, respectively. Furthermore, the relative error percentage of these predicted outputs is less than 1%, which meets the decision criteria set by industry needs. Additionally, the proposed model outperforms other regression algorithms such as decision tree, linear regression, multilayer perceptron, and support vector machine. Among the compared algorithms, the decision tree model produced predictions with high deviation, as observed in the plots of predicted and actual values for each output parameter. On the other hand, the proposed model exhibits an excellent prediction result in both numerical and graphical evaluation.
For operating region prediction, the optimal number of clusters is 4, representing the safe operation, unhealthy, near-to-trip, and trip regions. Based on the clustering result, the optimum operating range is found at 744.68°C to 829.64°C. Operation above that range leads the turbine to the unhealthy condition, indicated by high production of NOx emissions. On the other hand, operation below that range moves the turbine into the near-to-trip zone and eventually leads to tripping; it can also cause high formation of CO emissions. The advantages and drawbacks found from the analyzed algorithms are tabulated in Table 6.
The proposed model is expected to help industry stakeholders and operators make proper decisions for more reliable operation of the DLE gas turbine while mitigating tripping issues and maintaining low emissions production. Hence, the technique can be used as guidance for better load management and as a tripping prevention strategy, applicable to DLE gas turbines and to other applications involving operating range prediction.