Article

CCPP Power Prediction Using CatBoost with Domain Knowledge and Recursive Feature Elimination

School of Electrical Engineering, Xi’an Jiaotong University, Xi’an 710049, China
*
Author to whom correspondence should be addressed.
Energies 2025, 18(16), 4272; https://doi.org/10.3390/en18164272
Submission received: 10 July 2025 / Revised: 23 July 2025 / Accepted: 28 July 2025 / Published: 11 August 2025

Abstract

Combined cycle power plants are modern power generation systems that provide an efficient and environmentally friendly way of generating electricity. With the development of smart grids, higher requirements have been put forward for their power prediction. Using a dataset comprising 9568 observations from a combined cycle power plant operating at full load for 6 years, a high-precision power prediction model integrating CatBoost and domain knowledge is proposed. Twenty new features were designed based on domain expertise, and Recursive Feature Elimination was applied to select the most informative features, optimizing model performance. Experimental results demonstrate that CatBoost outperformed six commonly used machine learning algorithms, both with and without domain knowledge integration. The incorporation of domain knowledge also improved the predictive performance of all evaluated models, underscoring the effectiveness and general applicability of the proposed features. With the 11 features retained by Recursive Feature Elimination, the optimized CatBoost model achieved the best predictive accuracy, with a root mean square error of 2.8545, a mean absolute error of 1.9645, and an R-squared of 0.9702. A comparative analysis with existing literature methods further validated the superior performance of the proposed approach. These findings highlight the effectiveness of integrating domain knowledge with machine learning and its potential for improving power output prediction in combined cycle power plants.

1. Introduction

As an important resource, electricity has made great contributions to human activities and social development. To improve power generation efficiency, CCPPs have emerged as a prominent and efficient solution. The basic working principle of a CCPP was described by Pourbeik et al. [1]; its core components include gas turbines, a steam turbine, HRSGs, and generators. The gas turbine generates electricity by combusting fuel with high-pressure air, which drives the turbine blades to rotate, while the waste heat from the exhaust gases is converted into steam through the HRSG to drive the steam turbine, further contributing to electricity generation. CCPPs integrate both the Brayton and Rankine cycles, achieving over 60% efficiency [2], reducing emissions [3], and also lowering operational and maintenance costs [4]. Given the high cost of storing excess energy, accurately predicting the output of power plants is important for maximizing profits and minimizing pollution in the power grid system [5].
Traditional physics-based approaches utilize NWP models to simulate atmospheric processes based on physical principles and boundary conditions [6]. However, these methods require extensive input parameters, environmental variables, and thermodynamic assumptions to accurately represent real-world systems [7,8]. When meteorological conditions change rapidly or unexpected errors occur, their performance may also degrade [9]. In contrast, statistical methods such as ARMA models [10], Bayesian approaches [11], Kalman filters [12], Markov chain models [13], and Grey theory [14] are more widely used than physics-based prediction models. Nevertheless, most existing statistical models are inherently linear, making them less effective for long-term power supply forecasting [6].
In recent years, machine learning algorithms have been widely applied across various fields, demonstrating strong capabilities and broad applicability [15,16,17,18]. For the power prediction problem of CCPPs, many researchers have proposed a variety of machine learning and optimization methods, achieving fruitful research outcomes.
In 2014, Tufekci [19] evaluated the performance of various machine learning regression methods by exploring the best feature subset of the dataset and found that the most successful method performed well in predicting the full load power output, reaching an MAE of 2.818 and an RMSE of 3.787. Two years later, Ahn and Hur [20] proposed a continuous conditional random field model and obtained an MAE of 2.97 and an RMSE of 3.978 in similar tasks.
With further research, Chatterjee et al. [21] designed a Cuckoo search-enabled neural network model in 2018 and compared it with a PSO-enabled neural network model, which showed slightly better performance. In the same year, Yeom and Kwak [22] proposed an ELM based on the TSK fuzzy system, which used random partitions to generate the initial matrix and combined it with LSE to optimize the model parameters. In addition, Elfaki and Ahmed [23] explored a regression ANN combining two backpropagation algorithms to estimate the electrical power output of a CCPP.
In 2019, Lorencin et al. [24] adopted GA to optimize the structure of MLP and improved the model performance by adjusting the number of hidden layer nodes. The optimal RMSE was 4.305. Meanwhile, Bandic et al. [25] used Random Forest, Random Tree, and ANFIS for regression analysis and compared the performance of full feature sets and reduced feature sets. In the same year, Han [26] proposed a fuzzy neural network algorithm based on a logic tree structure to achieve efficient power prediction by selecting key neuronal nodes and simplifying rules.
In 2020, Hundi and Shahsavari [27] compared a variety of machine learning models and achieved the best result, with RMSE = 3.5, MAE = 2.4, and R2 = 0.959. Wood [28] used the TOB algorithm and the firefly optimization algorithm to improve prediction accuracy. Subsequently, Qu et al. [5] used the stacking method combined with hyperparameter optimization to achieve higher precision power load prediction by training multiple heterogeneous models in parallel.
In 2021, Afzal et al. [29] modeled Ridge regression, Linear Regression, and SVR and compared their performance through a number of evaluation metrics. Santarisi and Faouri [30] used PCA to reduce the dimensionality of the data; although the computational cost was significantly reduced, the predictive performance also decreased slightly.
Subsequently, in 2022, Zhao et al. [31] proposed a model combining ESDA and ANN, which showed superior prediction effects. In 2023, Yi et al. [32] developed a new method combining a Transformer encoder and DNN, which achieved excellent results, with RMSE = 3.5370, MAE = 2.4033, MAPE = 0.5307%, and R2 = 0.9555.
In 2024, Ntantis and Xezonakis [33] innovatively used Levenberg–Marquardt, Bayesian regularization, and SCG to configure multiple ANN models for CCPP power prediction. Later, Xezonakis [34] separately proposed an ANFIS model combining the least squares method and gradient descent to further optimize the performance of the Sugeno fuzzy model. Anđelić, et al. [35] used genetic programming to generate symbolic expressions and searched the optimization model by random hyperparameter values. Song et al. [36] integrated six machine learning models and improved the generalization performance of prediction. Finally, Zhang et al. [37] used a variety of machine learning methods combined with the HGS algorithm to optimize short-term power prediction, which significantly improved the stability and accuracy of prediction.
In addition, there are related papers that use other datasets. For example, Karacor et al. [38] used fuzzy logic (FL) and ANN to predict the electrical power output of a 243 MW CCPP in Izmir, Turkey, and found that ANN was able to estimate the power output with high accuracy: the RPE was between 0.59% and 3.54% for the FL model and between 0.001% and 0.84% for the ANN. Shuvo et al. [39] used LR, LAR, DTR, and RFR methods to predict the power output of a 210 MW CCPP located in India, and the LR method achieved the best prediction performance.
Overall, many methods have achieved good results in the power prediction of CCPPs, primarily by relying solely on machine learning models. A summary of these representative methods, along with their respective advantages and limitations, is provided in Table S1. In contrast to previous studies that primarily relied on pure data-driven machine learning models for CCPP power prediction, this work introduces a domain-informed hybrid framework combining physical insights, feature selection, and an advanced ensemble model.
The main contributions of this article are summarized as follows:
(1)
A hybrid approach integrating CatBoost, domain knowledge, and RFE is proposed to enhance power prediction accuracy for CCPPs.
(2)
Twenty new features are designed based on thermodynamic and operational principles, many of which have not been explored in prior works. These features yield consistent performance gains across seven machine learning algorithms.
(3)
Comparative analysis shows that CatBoost consistently outperforms six commonly used machine learning models, both with and without domain knowledge integration.
(4)
RFE is applied to optimize feature selection, with the best predictive performance achieved using 11 selected features, resulting in an RMSE of 2.8545, an MAE of 1.9645, and an R2 of 0.9702.
(5)
The proposed method outperforms existing literature methods, demonstrating its effectiveness in power prediction for CCPPs.

2. System and Dataset Description

This section presents the method for predicting the power output of CCPPs by integrating CatBoost with domain knowledge. As illustrated in Figure 1, the research framework begins with the original dataset, where domain knowledge is leveraged to construct new feature variables. This step enhances the dataset by extracting deeper insights and capturing complex hidden relationships within the data. Subsequently, outlier detection and removal are performed using the Z-Score method to ensure data reliability and consistency.
In the dataset splitting stage, 10-fold cross-validation is employed. During each fold, the training set is normalized, and RFE is applied to identify the most significant features for model training. Finally, the optimized CatBoost model is trained and evaluated using RMSE, MAE, and R2 to assess predictive performance.

2.1. CCPP System Description

The CCPP considered here is an advanced power generation system consisting of two gas turbines, two HRSGs, and one steam turbine. Each gas turbine and the steam turbine has a rated power of 160 MW. Each gas turbine consists of three main components: compressor, combustor, and turbine. The compressor draws in air from the environment, and this air mixes with fuel in the form of small particles in the combustion chamber. The hot gases generated by combustion expand in the turbine, driving it to rotate and thus driving the associated generator to generate electricity. Subsequently, the hot exhaust gas enters the HRSG, where it is cooled while producing the steam that powers the steam turbine. The system of the CCPP is shown in Figure 2.

2.2. Dataset Description

The CCPP dataset [40] was provided by the Çorlu Engineering Faculty of Namik Kemal University. It contains 9568 data points from the full load operation of a CCPP from 2006 to 2011, including hourly average values of environmental variables such as temperature, exhaust vacuum, ambient pressure, and relative humidity, as well as corresponding power outputs. The data has been randomized to eliminate the correlation caused by chronological order. Table 1 shows the symbols and descriptions corresponding to this dataset.

3. Methodology

3.1. Feature Engineering

Feature engineering is a step used to enhance the performance of machine learning by integrating domain knowledge [41]. In this study, 20 new features were proposed based on the original features from Table 1, with detailed formulas provided in Table S2. These new features, as summarized in Table 2, encompass a wide range of thermodynamics and fluid mechanics, as well as specific characteristics of the exhaust system and interaction terms between the original features.
The thermodynamic properties, such as gas density and saturation vapor pressure, provide critical insights into the air’s state and behavior under varying conditions. Gas density is a fundamental parameter in fluid dynamics and energy calculations, as it directly influences the mass flow rate and pressure drop in the system. Saturation vapor pressure is essential for understanding the air’s capacity to hold moisture, which is particularly relevant in humidity control and condensation analysis. The absolute humidity and dew point temperature further quantify the air’s moisture content and saturation point. Enthalpy, which combines sensible and latent heat, is a key parameter in energy balance calculations, enabling the assessment of heating and cooling loads in thermodynamic systems. The wet-bulb temperature and specific humidity further describe the air’s moisture content and cooling potential. Thermal conductivity determines the rate of heat transfer between the air and surrounding surfaces. Heat load quantifies the thermal energy required to maintain a desired temperature.
In the realm of fluid mechanics, the dynamic viscosity and kinematic viscosity describe the air’s resistance to flow, which is crucial for analyzing pressure losses and flow distribution in ducts and pipes. The Prandtl number is a dimensionless parameter that represents the ratio of momentum diffusivity to thermal diffusivity, characterizing the relative thickness of thermal and velocity boundary layers. The speed of sound and diffusion coefficient of water vapor are also included, as they are critical in acoustics and mass transfer studies, respectively. These features collectively provide a detailed understanding of the air’s transport properties and their impact on system performance.
For the exhaust system, the exhaust pressure and exhaust power are derived to characterize the system’s pressure conditions and energy consumption. Exhaust pressure, calculated as the difference between ambient pressure and exhaust vacuum, reflects the pressure drop across the system. Exhaust power quantifies the energy associated with the exhaust flow, providing a measure of the system’s efficiency and energy requirements.
To capture the complex interactions between the original features, interaction terms such as pressure–vacuum, temperature–pressure, and temperature–vacuum were introduced. These terms model the combined effects of temperature, pressure, and vacuum on the system’s performance, offering a more nuanced representation of the underlying physical processes.
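A minimal sketch of how a few such derived features might be computed from one observation is given below. It is illustrative only: the exact formulas used in this study are those of Table S2, which are not reproduced here. The argument names (at, v, ap, rh), their units (°C, cm Hg, mbar, %), the Magnus approximation for saturation vapor pressure and dew point, the unit conversion applied before subtracting vacuum from pressure, and the use of simple products as interaction terms are all assumptions made for this example.

```python
from math import exp, log

# Assumed conversion: 1 cm Hg expressed in mbar (needed to subtract V from AP).
CMHG_TO_MBAR = 13.3322

# Magnus coefficients (a common approximation; not necessarily Table S2's).
MAGNUS_A, MAGNUS_B, MAGNUS_C = 6.1094, 17.625, 243.04


def saturation_vapor_pressure_hpa(temp_c):
    """Saturation vapor pressure over water in hPa (Magnus approximation)."""
    return MAGNUS_A * exp(MAGNUS_B * temp_c / (temp_c + MAGNUS_C))


def dew_point_c(temp_c, rel_humidity_pct):
    """Dew point in deg C obtained by inverting the Magnus approximation."""
    gamma = log(rel_humidity_pct / 100.0) + MAGNUS_B * temp_c / (MAGNUS_C + temp_c)
    return MAGNUS_C * gamma / (MAGNUS_B - gamma)


def derived_features(at, v, ap, rh):
    """Return a dict of illustrative derived features for one observation.

    at: ambient temperature (deg C), v: exhaust vacuum (cm Hg),
    ap: ambient pressure (mbar), rh: relative humidity (%).
    """
    return {
        "saturation_vapor_pressure": saturation_vapor_pressure_hpa(at),
        "dew_point": dew_point_c(at, rh),
        # Ambient pressure minus exhaust vacuum, after unit conversion.
        "exhaust_pressure": ap - v * CMHG_TO_MBAR,
        # Interaction terms, assumed here to be plain products.
        "pressure_vacuum": ap * v,
        "temperature_pressure": at * ap,
        "temperature_vacuum": at * v,
    }


features = derived_features(at=20.0, v=50.0, ap=1010.0, rh=60.0)
```

The remaining features (viscosities, Prandtl number, enthalpy, and so on) would be added to the same dictionary in the same way, one function per physical quantity.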

3.2. Handling Outliers

The Z-score measures how far a data point deviates from the mean of its feature, expressed in units of standard deviation. Standardizing in this way puts all features on a common, dimensionless scale. For approximately normally distributed data, about 99.7% of Z-scores lie between −3 and 3, so data points outside this range are treated as outliers. The Z-score is calculated as

$$Z = \frac{X - \mu}{\sigma}$$

where $X$ is the value of the data point, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.
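The rule above can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming that a row is discarded as soon as any of its features has |Z| > 3 and that the population standard deviation is used; the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / std for every value of one feature column."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column has no spread, so no point can be an outlier.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def drop_outlier_rows(rows, threshold=3.0):
    """Keep only the rows in which every feature satisfies |Z| <= threshold."""
    columns = list(zip(*rows))
    scores = [z_scores(col) for col in columns]
    kept = []
    for i, row in enumerate(rows):
        if all(abs(scores[j][i]) <= threshold for j in range(len(columns))):
            kept.append(row)
    return kept


# Thirty ordinary rows plus one row with an extreme first feature.
data = [[float(i % 5), 1.0] for i in range(30)] + [[1000.0, 1.0]]
clean = drop_outlier_rows(data)  # the extreme row is removed
```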

3.3. Dataset Division

To accurately evaluate the performance of the proposed model, 10-fold cross-validation was employed, as shown in Figure 3. In each iteration, the model was trained on nine subsets and tested on the remaining one, with the process repeated ten times to ensure that each subset served as the test set once. The final performance metrics, including RMSE, MAE, and R2, were obtained by averaging the results across all folds. This procedure mitigates potential biases arising from uneven data partitioning and provides a robust assessment of the model’s predictive capability.
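The splitting scheme can be sketched as follows. This is a minimal standard-library illustration, assuming a shuffled split with a fixed seed; the function name and the seed are hypothetical and not taken from the study.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(95, k=10):
    # Fit scaling, feature selection, and the model on train_idx only,
    # then evaluate on test_idx and average the metrics over the folds.
    pass
```

Keeping every fitted step inside the loop means the held-out fold never influences the statistics used to preprocess or select features.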

3.4. Data Normalization

The data from the training set is normalized. Due to the varying scales of features, the impact of different features on the model may be skewed. To mitigate this effect and ensure that all features contribute equally to the model at a consistent scale, the data is scaled to the interval [0, 1]. This transformation is achieved using the following formula:

$$X_{\mathrm{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

where $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the feature, respectively. After normalization, the values of all features lie in the same interval, which avoids the negative impact of scale differences on model training.
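A minimal sketch of this scaling is shown below, assuming that the minimum and maximum are estimated on the training fold only and then reused for the test fold; the function names are hypothetical.

```python
def fit_min_max(train_rows):
    """Return one (min, max) pair per feature, estimated on the training rows."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, bounds):
    """Scale rows with previously fitted bounds; constant features map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, (lo, hi) in zip(row, bounds)
        ])
    return scaled


train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
bounds = fit_min_max(train)
train_scaled = apply_min_max(train, bounds)        # values in [0, 1]
test_scaled = apply_min_max([[4.0, 0.0]], bounds)  # may fall outside [0, 1]
```

Test values outside the training range map outside [0, 1]; that is expected when the bounds come from the training data alone.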

3.5. Feature Selection

The introduction of new features significantly increases the dimensionality of the dataset, enriching the model’s input information. However, as the feature dimension grows, redundant features and noise may negatively impact model performance. To address this, RFE was employed to refine the feature space, enhancing the model’s generalization ability by selecting the most representative features.
RFE is an iterative feature selection technique that identifies the most relevant features by training a base model and assessing feature importance [42,43,44]. As illustrated in Figure 4, RFE progressively removes features with minimal influence on the target variable, ultimately retaining the most critical features for the regression task. This process improves model stability and predictive accuracy by eliminating irrelevant or redundant information.
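The elimination loop can be sketched as follows. This is a simplified illustration rather than the implementation used in the study: the importance callback here is a stand-in based on absolute Pearson correlation with the target, whereas the study derives importances from the fitted CatBoost model. All function names are hypothetical.

```python
from statistics import mean


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def toy_importance(rows, target, active):
    """Stand-in scorer: one score per active column (not a fitted model)."""
    return [abs_correlation([r[j] for r in rows], target) for j in active]


def recursive_feature_elimination(rows, target, n_keep, importance_fn=toy_importance):
    """Repeatedly score the remaining features and drop the weakest one."""
    active = list(range(len(rows[0])))
    while len(active) > n_keep:
        scores = importance_fn(rows, target, active)
        weakest = min(range(len(active)), key=lambda i: scores[i])
        del active[weakest]
    return active


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X = [[yi, s, z] for yi, s, z in zip(y, [1, -1, 1, -1, 1, -1], [1, 2, 3, 4, 5, 0])]
selected = recursive_feature_elimination(X, y, n_keep=2)
```

To mirror the procedure described above, importance_fn would refit the base model on the active columns at every step and return that model's feature importances.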

3.6. CatBoost

CatBoost [45] is a kind of gradient boosting technology developed by Yandex, which constructs multiple decision trees to improve model performance. It, like other gradient boosting methods, aims to minimize a loss function L(y, F(x)) by iteratively adding a weak learner to an ensemble. At each iteration t, it updates the model as

$$F_t(x) = F_{t-1}(x) + \eta \, h_t(x)$$

where $\eta$ is the learning rate, and $h_t(x)$ is the newly fitted weak learner trained to approximate the negative gradient of the loss function:

$$h_t(x) \approx -\left.\frac{\partial L(y, F(x))}{\partial F(x)}\right|_{F(x) = F_{t-1}(x)}$$

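The update rule can be illustrated with a generic gradient boosting loop. The sketch below is not CatBoost itself: it assumes a squared-error loss, for which the negative gradient is simply the residual y − F(x), a single input feature, and one-split regression stumps as weak learners. All names are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lv, rv)
    if best is None:  # constant feature: fall back to a constant learner
        c = mean(residuals)
        return lambda x: c
    _, threshold, lv, rv = best
    return lambda x: lv if x <= threshold else rv


def boost(xs, ys, n_rounds=50, eta=0.1):
    """Return a predictor built as F_t = F_{t-1} + eta * h_t."""
    base = mean(ys)  # F_0: constant initial model
    learners = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For L = 0.5 * (y - F)^2 the negative gradient is y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        learners.append(h)
        preds = [p + eta * h(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + eta * sum(h(x) for h in learners)

    return predict


xs = [float(i) for i in range(1, 9)]
ys = [x * x for x in xs]
model = boost(xs, ys)
```

A smaller learning rate shrinks each correction, so more rounds are needed to reach the same training error.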
Unlike XGBoost and LightGBM, CatBoost uses completely symmetric (oblivious) binary trees [46], which are built level by level until the specified depth is reached. At each level, the split that yields the lowest loss is selected and applied to all nodes of that level. In addition, CatBoost stands out for its ability to perform effectively with limited training data, handle various data formats, and internally manage missing values, ensuring stability and robustness [47]. It generates random permutations of the dataset and applies ordered boosting to them, which helps avoid target leakage and gradient bias.
Another advantage of CatBoost is that it can handle categorical features directly without traditional preprocessing steps such as one-hot encoding or label encoding. CatBoost adopts an improved Greedy TBS method, which adds a prior term and a weight coefficient, as shown in the following formula:

$$\hat{x}_{k,i} = \frac{\sum_{j=1}^{n} [x_{j,i} = x_{k,i}] \, Y_j + a P}{\sum_{j=1}^{n} [x_{j,i} = x_{k,i}] + a}$$

where $x_{k,i}$ is the value of categorical feature $i$ for the $k$-th training sample; $\hat{x}_{k,i}$ is its numerical encoding, i.e., the smoothed average label of the samples that share its category; $Y_j$ is the label of the $j$-th training sample; $x_{j,i}$ is the value of categorical feature $i$ for the $j$-th training sample; the indicator $[x_{j,i} = x_{k,i}]$ equals 1 when the $j$-th and $k$-th training samples have the same category in feature $i$ and 0 otherwise; $a$ is the weight coefficient; and $P$ is the added prior term.
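The encoding formula can be sketched directly. The example below implements the expression as written, summing over all n training samples, and assumes the prior P defaults to the global mean of the labels when it is not supplied; both the function name and that default are assumptions. Since the four CCPP input variables are all numeric, the example is purely illustrative.

```python
def target_statistic(categories, targets, a=1.0, prior=None):
    """Encode each category as (sum of labels + a * P) / (count + a)."""
    if prior is None:
        prior = sum(targets) / len(targets)  # assumed default for P
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return [(sums[c] + a * prior) / (counts[c] + a) for c in categories]


encoded = target_statistic(["a", "a", "b"], [1.0, 3.0, 5.0], a=1.0)
```

With a = 1 and P = 3, category "a" is encoded as (1 + 3 + 3) / (2 + 1) and category "b" as (5 + 3) / (1 + 1); rare categories are pulled more strongly toward the prior.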

4. Results

4.1. Regression Performance Evaluation Indicators

4.1.1. Root Mean Squared Error

RMSE is often used to evaluate the predictive performance of a model; it measures the deviation between predicted values and real values. A smaller RMSE means the predicted values are closer to the real values. The formula is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where $n$ denotes the number of samples, $y_i$ represents the true value for the $i$-th sample, and $\hat{y}_i$ refers to the predicted value for the $i$-th sample.

4.1.2. Mean Absolute Error

MAE is the average absolute difference between the predicted and real values of a model; it takes the mean of the absolute errors without squaring the differences. It reflects the typical magnitude of the prediction error and is less sensitive to large errors than RMSE. The formula for calculating MAE is as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where $n$ represents the number of samples, $y_i$ denotes the true value for the $i$-th sample, $\hat{y}_i$ refers to the predicted value for the $i$-th sample, and $\left| y_i - \hat{y}_i \right|$ is the absolute error between the predicted and true values.

4.1.3. R-Squared

R2 measures the degree to which the model’s predictions are consistent with the true values, reflecting the proportion of the total variation of the target variable that the model explains. The value typically ranges from 0 to 1 (it can be negative for a model that fits worse than predicting the mean), and the closer the value is to 1, the stronger the model’s explanatory power. The formula is as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $n$ represents the number of samples, $y_i$ denotes the true value for the $i$-th sample, $\hat{y}_i$ refers to the predicted value for the $i$-th sample, and $\bar{y}$ represents the average of the true values.
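The three indicators can be computed with a few lines of standard-library Python. The sketch below follows the definitions above; the function names are hypothetical, and the toy values are not results from this study.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

In this toy case a single error of 2 gives an RMSE of 1.0 but an MAE of only 0.5, which illustrates why RMSE penalizes large individual errors more heavily.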

4.2. Analysis of Loss Function Convergence

Figure 5 illustrates the variation of the loss function for both the training and test sets during the 10-fold cross-validation process. The X-axis represents the number of training cycles, while the Y-axis represents the RMSE. It can be seen that as the number of training cycles increases, the RMSE gradually decreases, indicating that the model progressively fits the training data and converges to a smaller error value. Eventually, the loss function stabilizes, suggesting that the model has reached convergence, where further training no longer yields a significant reduction in error.

4.3. Residual Distribution and Normality Evaluation

The residual plot is used to assess the fit of the model by plotting the residuals, i.e., the differences between the predicted values of the regression model and the actual observed values. Specifically, the horizontal axis of the residual plot is the predicted value of the model, and the vertical axis is the residual value. The red dashed line indicates the baseline with a residual of 0. As shown in Figure 6, the data points are roughly randomly distributed around 0, and there is no systematic pattern such as an obvious curve or grouping structure. Meanwhile, the residuals range roughly between −20 and 10, and most of the points are concentrated in a narrow range, indicating that the model has a small prediction error for most samples.
A quantile–quantile plot can be used to assess whether the distribution of data conforms to the normal distribution. As shown in Figure 7, the X-axis of the Q–Q plot represents the quantiles of the theoretical distribution, and the Y-axis represents the quantiles of the empirical (target) values. The red reference line indicates where the points would lie if the two distributions were exactly the same. Almost all of the points fall close to the red line, indicating that the predicted and true values follow a similar distribution, with only a few large deviations.

4.4. Analysis of Model Interpretability and Uncertainty

Regarding model interpretability, as discussed in previous studies such as [48], the combination of CatBoost and SHAP has been widely used to enhance model explainability, primarily by evaluating feature importance. In this study, instead of relying on a single interpretability tool, we assessed feature importance directly through the CatBoost model, which also served as the basis for RFE and regression. Feature importance scores were aggregated across all cross-validation folds, and the average rankings were reported to ensure robustness, as shown in Figure 8. Notably, features such as specific humidity, dew point, specific heat capacity, and dynamic viscosity consistently ranked among the most important, highlighting their substantial contributions to the model’s predictive performance.
To enhance the reliability and practical applicability of the model, we conducted uncertainty analysis by leveraging CatBoost’s support for quantile regression. Specifically, three models were trained using quantile loss functions at α = 0.1, 0.5, and 0.9, corresponding to the lower bound, median, and upper bound predictions, respectively. This allows us to construct a 90% prediction interval for each test sample, thereby quantifying the confidence level of the model’s outputs. Figure 9 illustrates the prediction intervals for the first 50 test samples in one of the cross-validation folds. The blue curve represents the median prediction, while the shaded area indicates the 90% prediction interval. Red crosses denote the ground truth values. As shown, most actual values lie within the predicted intervals, indicating the model’s ability to provide well-calibrated uncertainty estimates. Additionally, the interval width increases significantly for a few outlier samples, highlighting regions where the model expresses less confidence.
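For reference, the quantile (pinball) loss minimized by such models and the empirical coverage of the resulting interval can be sketched as follows. This is a generic illustration with hypothetical function names and toy numbers, not the training code or results of this study.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile loss: under-prediction is weighted by alpha,
    over-prediction by (1 - alpha)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


def interval_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


y = [450.0, 460.0, 470.0, 480.0]
lower = [445.0, 455.0, 472.0, 470.0]   # toy lower-quantile predictions
upper = [455.0, 465.0, 478.0, 485.0]   # toy upper-quantile predictions
coverage = interval_coverage(y, lower, upper)
```

Comparing the empirical coverage on held-out data with the nominal level is a simple calibration check. Note that the band between the 0.1 and 0.9 quantiles has a nominal coverage of 80%; a nominal 90% band would correspond to the 0.05 and 0.95 quantiles.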

4.5. Model Comparison

This study presents the outcomes of seven distinct machine learning algorithms, including SVM, Random Forest, Decision Tree, Neural Network, XGBoost, LightGBM, and CatBoost, for regression prediction both before and after the incorporation of 20 new features. The corresponding model parameters are summarized in Table 3. Initially, 10-fold cross-validation was conducted on all algorithms without the inclusion of new features, and the outcomes of the regression predictors for each fold are depicted in Figure 10, Figure 11 and Figure 12, including RMSE, MAE, and R2. The average results are presented in Table 4.
After adding the 20 new features, the regression prediction performance of each algorithm improved to some extent. The results of each fold are shown in Figure 13, Figure 14 and Figure 15, and the averages over all folds are summarized in Table 5. The comparison showed that CatBoost performed best on all measures, both before and after the addition of the 20 new features. Specifically, CatBoost consistently has the lowest RMSE and MAE and the highest R2, indicating that, of the seven machine learning algorithms, CatBoost has the strongest ability to predict the power output of CCPPs.
Furthermore, the data in Table 4 and Table 5 are compared in Figure 16, Figure 17 and Figure 18. It can be observed that after incorporating domain knowledge, the RMSE and MAE of all seven machine learning algorithms decrease to varying degrees, while R2 increases slightly. This suggests that the 20 new features based on domain knowledge help machine learning models predict the power output of the CCPP.

4.6. Comparison of Results with Different Feature Numbers

In this study, a comparative analysis was conducted to assess the impact of feature selection on CatBoost’s performance, considering feature subsets ranging from 4 to 24. When using only four features, the original variables were retained without incorporating any additional derived features. For feature subsets containing 5 to 24 features, the original four variables were supplemented with 20 newly derived physical quantities, and RFE was applied to identify the most informative subset.
As shown in Figure 19, Figure 20 and Figure 21, the optimal performance was achieved when selecting 11 features, yielding an RMSE of 2.8545, an MAE of 1.9645, and an R2 of 0.9702. Notably, when selecting five to seven features, the model’s predictive accuracy was lower than that obtained using only the original four variables. However, when the number of selected features ranged from 12 to 24, the model consistently outperformed the baseline scenario with only the original features, demonstrating the benefits of incorporating domain-specific knowledge into the feature set.

4.7. Characteristic Group Ablation Experiment

Compared to the baseline model that uses only the original features, the addition of thermodynamic features alone has a negligible impact on performance. The inclusion of fluid mechanics features brings a slight improvement. In contrast, adding exhaust system features yields a more noticeable performance gain. However, the model that incorporates only the exhaust-related features still performs slightly worse than the one that integrates all domain-specific features, suggesting that there may be interactions or complementary effects among features from different physical domains. This highlights the importance of considering combined domain knowledge rather than isolated feature sets. These findings are shown in Table 6, which presents the results of the characteristic group ablation experiment.

4.8. Comparison with Results from Other Literature

Table 7 presents a comparison of the results between the proposed method and other methods previously applied to the same dataset. The proposed method, which integrates CatBoost with domain knowledge and employs RFE to select the most informative features, achieves an average RMSE and MAE of 2.855 and 1.965, respectively, demonstrating superior performance compared to other approaches. One possible reason is that CatBoost is a relatively new algorithm with strong predictive performance. Another is that the study combined domain knowledge with RFE for feature selection, thereby further improving the prediction accuracy. This combination not only makes full use of CatBoost’s efficient feature processing capability but also enhances the model’s ability to recognize key features, making it perform better in the power prediction task.

5. Conclusions

In this study, a novel method integrating domain knowledge with CatBoost was proposed for predicting the power output of CCPPs. To enhance predictive accuracy, 20 new features were designed based on domain expertise, and RFE was employed to identify the most informative features. Comparative experiments involving CatBoost and six other commonly used machine learning algorithms demonstrated that CatBoost consistently outperformed its counterparts, regardless of whether domain knowledge was incorporated. The incorporation of domain knowledge led to a consistent improvement in prediction accuracy across all seven machine learning algorithms, highlighting the general applicability and effectiveness of the newly proposed features in enhancing model performance.
To further optimize the model, RFE was applied to determine the optimal number of features for prediction. Experimental results revealed that varying the number of selected features between 4 and 24 influenced predictive accuracy, with the best performance achieved when selecting 11 features, yielding an RMSE of 2.8545, an MAE of 1.9645, and an R2 of 0.9702. Finally, a comparison with existing methods in the literature confirms the superior predictive accuracy of the proposed approach. These results underscore the effectiveness of integrating domain knowledge with machine learning and its potential for enhancing power output prediction in CCPPs. The proposed framework, with its strong generalization ability and computational efficiency, is well suited for integration into real-time plant operation systems. In smart grid scenarios, where dynamic energy management and short-term load forecasting are critical, the model can contribute to improved scheduling, grid stability, and energy optimization.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/en18164272/s1, Table S1: The advantages and disadvantages of the methods mentioned in the introduction; Table S2: The corresponding formula of features.

Author Contributions

B.G.: Visualization, Validation, Software, Methodology, Formal analysis, Conceptualization. B.Y.: Writing—review & editing, Writing—original draft. S.W.: Resources. W.S.: Investigation, Data curation. F.Y.: Supervision, Methodology. D.W.: Project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this study are available from the UC Irvine Machine Learning Repository (DOI: 10.24432/C5002N).

Acknowledgments

The authors would like to thank all those who contributed indirectly to this work. No specific support outside the author contribution or funding sections is declared.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

ANFIS: Adaptive Neural Fuzzy Inference System
ANN: Artificial Neural Network
ARMA: Autoregressive Moving Average
CatBoost: Categorical Boosting
CCPP: Combined Cycle Power Plant
DNN: Deep Neural Network
DTR: Decision Tree Regression
ELM: Extreme Learning Machine
ESDA: Electrostatic Discharge Algorithm
GA: Genetic Algorithm
GBDT: Gradient Boosting Decision Tree
Greedy TBS: Greedy Tree Boosting Strategy
HRSG: Heat Recovery Steam Generator
LAR: Least Angle Regression
LR: Logistic Regression
LSE: Least Squares Estimation
MAE: Mean Absolute Error
MLP: Multi-layer Perceptron
NWP: Numerical Weather Prediction
PCA: Principal Component Analysis
PSO: Particle Swarm Optimization
R²: R-squared
RFE: Recursive Feature Elimination
RFR: Random Forest Regression
RMSE: Root Mean Square Error
RPE: Relative Percentage Error
SCG: Scaled Conjugate Gradient
SVR: Support Vector Regression
TOB: Transparent Open Box
TSK Fuzzy System: Takagi-Sugeno-Kang Fuzzy System
Z-Score: Standard Score

References

  1. Pourbeik, P. Modeling of combined-cycle power plants for power system studies. In Proceedings of the 2003 IEEE Power Engineering Society General Meeting (IEEE Cat. No. 03CH37491), Toronto, ON, Canada, 13–17 July 2003.
  2. Hoang, T.; Pawluskiewicz, D.K. The Efficiency Analysis of Different Combined Cycle Power Plants Based on the Impact of Selected Parameters. Int. J. Smart Grid Clean Energy 2016, 5, 77–85.
  3. Ersayin, E.; Ozgener, L. Performance Analysis of Combined Cycle Power Plants: A Case Study. Renew. Sustain. Energy Rev. 2015, 43, 832–842.
  4. Wittenburg, R.; Hübel, M.; Prause, H.; Gierow, C.; Reißig, M.; Hassel, E. Effects of rising dynamic requirements on the lifetime consumption of a combined cycle gas turbine power plant. Energy Procedia 2019, 158, 5717–5723.
  5. Qu, Z.; Xu, J.; Wang, Z.; Chi, R.; Liu, H. Prediction of Electricity Generation from a Combined Cycle Power Plant Based on a Stacking Ensemble and Its Hyperparameter Optimization with a Grid-Search Method. Energy 2021, 227, 120309.
  6. Wang, H.; Lei, Z.; Zhang, X.; Zhou, B.; Peng, J. A Review of Deep Learning for Renewable Energy Forecasting. Energy Convers. Manag. 2019, 198, 111799.
  7. Kesgin, U.; Heperkan, H. Simulation of Thermodynamic Systems Using Soft Computing Techniques. Int. J. Energy Res. 2005, 29, 581–611.
  8. Samani, A. Combined Cycle Power Plant with Indirect Dry Cooling Tower Forecasting Using Artificial Neural Network. Decis. Sci. Lett. 2018, 7, 131–142.
  9. Shaker, H.; Manfre, D.; Zareipour, H. Forecasting the Aggregated Output of a Large Fleet of Small Behind-the-Meter Solar Photovoltaic Sites. Renew. Energy 2020, 147, 1861–1869.
  10. Aasim; Singh, S.N.; Mohapatra, A. Repeated Wavelet Transform Based ARIMA Model for Very Short-Term Wind Speed Forecasting. Renew. Energy 2019, 136, 758–768.
  11. Wang, Y.; Wang, H.; Srinivasan, D.; Hu, Q. Robust Functional Regression for Wind Speed Forecasting Based on Sparse Bayesian Learning. Renew. Energy 2019, 132, 43–60.
  12. Yang, D. On Post-Processing Day-Ahead NWP Forecasts Using Kalman Filtering. Sol. Energy 2019, 182, 179–181.
  13. Wang, Y.; Wang, J.; Wei, X. A Hybrid Wind Speed Forecasting Model Based on Phase Space Reconstruction Theory and Markov Model: A Case Study of Wind Farms in Northwest China. Energy 2015, 91, 556–572.
  14. Wu, L.; Gao, X.; Xiao, Y.; Yang, Y.; Chen, X. Using a Novel Multi-Variable Grey Model to Forecast the Electricity Consumption of Shandong Province in China. Energy 2018, 157, 327–335.
  15. Madhloom, J.K.; Abd Ghani, M.K.; Baharon, M.R. Enhancement to the patient’s health care image encryption system, using several layers of DNA computing and AES (MLAESDNA). Period. Eng. Nat. Sci. 2021, 9, 928–947.
  16. Education, W.E.D.M.O.; Oday, O.; Majeed, H.L.; Hussein, M.A.; Darwish, S.M.; Al Al-Boridi, O.; Hassen, O.A. Quantum Machine Learning for Video Compression: An Optimal Video Frames Compression Model using Qutrits Quantum Genetic Algorithm for Video multicast over the Internet. J. Cybersecur. Inf. Manag. 2025, 15, 43–64.
  17. Xiao, L.; Wang, G.; Long, W.; Liaw, P.K.; Ren, J. Fatigue life prediction of the FCC-based multi-principal element alloys via domain knowledge-based machine learning. Eng. Fract. Mech. 2024, 296, 109860.
  18. Ghadi, A.Z.; Syauqi, A.; Gu, B.; Lim, H. Highly accurate heat release rate marker detection in NH3–CH4 cofiring through machine learning and domain knowledge-based selection integration. Int. J. Hydrogen Energy 2024, 80, 1223–1233.
  19. Tüfekci, P. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 2014, 60, 126–140.
  20. Ahn, G.; Hur, S. Continuous conditional random field model for predicting the electrical load of a combined cycle power plant. Ind. Eng. Manag. Syst. 2016, 15, 148–155.
  21. Chatterjee, S.; Dey, N.; Ashour, A.S.; Drugarin, C.V.A. Electrical energy output prediction using cuckoo search based artificial neural network. In Smart Trends in Systems, Security and Sustainability; Proceedings of WS4 2017; Springer: Singapore, 2018; pp. 277–285.
  22. Yeom, C.-U.; Kwak, K.-C. A design of TSK-based elm for prediction of electrical power in combined cycle power plant. In Proceedings of the 2018 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Bangkok, Thailand, 21–24 October 2018.
  23. Elfaki, E.A.; Ahmed, A.H. Prediction of electrical output power of combined cycle power plant using regression ANN model. Engineering 2018, 6, 17–38.
  24. Lorencin, I.; Anđelić, N.; Mrzljak, V.; Car, Z. Genetic algorithm approach to design of multi-layer perceptron for combined cycle power plant electrical power output estimation. Energies 2019, 12, 4352.
  25. Bandić, L.; Hasičić, M.; Kevrić, J. Prediction of power output for combined cycle power plant using random decision tree algorithms and ANFIS. In International Symposium on Innovative and Interdisciplinary Applications of Advanced Technologies; Springer International Publishing: Cham, Switzerland, 2019; pp. 406–416.
  26. Han, C.-W. Output Power Prediction of Combined Cycle Power Plant Using Logic-Based Tree Structured Fuzzy Neural Networks; AIP Publishing LLC: Melville, NY, USA, 2019.
  27. Hundi, P.; Shahsavari, R. Comparative studies among machine learning models for performance estimation and health monitoring of thermal power plants. Appl. Energy 2020, 265, 114775.
  28. Wood, D.A. Combined cycle gas turbine power output prediction and data mining with optimized data matching algorithm. Appl. Sci. 2020, 2, 441.
  29. Afzal, A.; Alshahrani, S.; Alrobaian, A.; Buradi, A.; Khan, S.A. Power Plant Energy Predictions Based on Thermal Factors Using Ridge and Support Vector Regressor Algorithms. Energies 2021, 14, 7254.
  30. Santarisi, N.S.; Faouri, S.S. Prediction of combined cycle power plant electrical output power using machine learning regression algorithms. East.-Eur. J. Enterp. Technol. 2021, 6, 114.
  31. Zhao, Y.; Foong, L.K. Predicting electrical power output of combined cycle power plants using a novel artificial neural network optimized by electrostatic discharge algorithm. Measurement 2022, 198, 111405.
  32. Yi, Q.; Xiong, H.; Wang, D. Predicting Power Generation from a Combined Cycle Power Plant Using Transformer Encoders with DNN. Electronics 2023, 12, 2431.
  33. Xezonakis, V.; Samuel, O.D.; Enweremadu, C.C. Modelling and Output Power Estimation of a Combined Gas Plant and a Combined Cycle Plant Using an Artificial Neural Network Approach. J. Eng. 2024, 2024, 5540010.
  34. Ntantis, E.L.; Xezonakis, V. Optimization of electric power prediction of a combined cycle power plant using innovative machine learning technique. Optim. Control Appl. Methods 2024, 45, 2218–2230.
  35. Anđelić, N.; Lorencin, I.; Mrzljak, V.; Car, Z. On the application of symbolic regression in the energy sector: Estimation of combined cycle power plant electrical power output using genetic programming algorithm. Eng. Appl. Artif. Intell. 2024, 133, 108213.
  36. Song, Y.; Park, J.; Suh, M.-S.; Kim, C. Prediction of Full-Load Electrical Power Output of Combined Cycle Power Plant Using a Super Learner Ensemble. Appl. Sci. 2024, 14, 11638.
  37. Zhang, J.; Zhang, M.; Yang, J.; Zheng, X. Prediction of electricity load generated by Combined Cycle Power Plants using integration of machine learning methods and HGS algorithm. Comput. Electr. Eng. 2024, 120, 109644.
  38. Karaçor, M.; Uysal, A.; Mamur, H.; Şen, G.; Nil, M.; Bilgin, M.Z.; Doğan, H.; Şahin, C. Life performance prediction of natural gas combined cycle power plant with intelligent algorithms. Sustain. Energy Technol. Assess. 2021, 47, 101398.
  39. Shuvo, M.G.R.; Sultana, N.; Motin, L.; Islam, M.R. Prediction of hourly total energy in combined cycle power plant using machine learning techniques. In Proceedings of the 2021 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA), Riyadh, Saudi Arabia, 6–7 April 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 170–175.
  40. Tüfekci, P.; Kaya, H. Combined Cycle Power Plant. UCI Machine Learning Repository, 2014. Available online: https://doi.org/10.24432/C5002N (accessed on 1 March 2025).
  41. Sadeq, A.M. Machine Learning Mastery for Engineers; Amazon: Seattle, WA, USA, 2024.
  42. Zhou, X.; Wen, H.; Zhang, Y.; Xu, J.; Zhang, W. Landslide susceptibility mapping using hybrid random forest with GeoDetector and RFE for factor optimization. Geosci. Front. 2021, 12, 101211.
  43. Lin, X.; Li, C.; Zhang, Y.; Su, B.; Fan, M.; Wei, H. Selecting feature subsets based on SVM-RFE and the overlapping ratio with applications in bioinformatics. Molecules 2017, 23, 52.
  44. Kornyo, O.; Asante, M.; Opoku, R.; Owusu-Agyemang, K.; Partey, B.T.; Baah, E.K.; Boadu, N. Botnet attacks classification in AMI networks with recursive feature elimination (RFE) and machine learning algorithms. Comput. Secur. 2023, 135, 103456.
  45. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the Advances in Neural Information Processing Systems 2018, Montréal, QC, Canada, 3–8 December 2018; p. 31.
  46. Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for big data: An interdisciplinary review. J. Big Data 2020, 7, 94.
  47. Luo, M.; Wang, Y.; Xie, Y.; Zhou, L.; Qiao, J.; Qiu, S.; Sun, Y. Combination of feature selection and catboost for prediction: The first application to the estimation of aboveground biomass. Forests 2021, 12, 216.
  48. Joy, R.A. An interpretable catboost model to predict the power of combined cycle power plants. In Proceedings of the 2021 International Conference on Information Technology (ICIT), Amman, Jordan, 14–15 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 435–439.
Figure 1. The study’s model framework.
Figure 2. The CCPP system.
Figure 3. Ten-fold cross-validation for model performance evaluation.
Figure 4. The process of RFE.
Figure 5. The loss function changes for each fold.
Figure 6. Residual analysis plot.
Figure 7. The quantile–quantile plot.
Figure 8. The importance of features.
Figure 9. The 90% confidence interval of the prediction results of some test samples.
Figure 10. The RMSE of 10-fold cross-validation before feature engineering.
Figure 11. The MAE of 10-fold cross-validation before feature engineering.
Figure 12. The R² of 10-fold cross-validation before feature engineering.
Figure 13. RMSE performance of 10-fold cross-validation after feature engineering.
Figure 14. MAE performance of 10-fold cross-validation after feature engineering.
Figure 15. R² performance of 10-fold cross-validation after feature engineering.
Figure 16. Comparison of RMSE values before and after feature engineering.
Figure 17. Comparison of MAE values before and after feature engineering.
Figure 18. Comparison of R² values before and after feature engineering.
Figure 19. RMSE of different numbers of selected features by using RFE.
Figure 20. MAE of different numbers of selected features by using RFE.
Figure 21. R² of different numbers of selected features by using RFE.
Table 1. Summary statistics of the dataset.
| Symbol | Range | Units | Missing Values | Description |
|---|---|---|---|---|
| AT | 1.81–37.11 | °C | no | Ambient temperature |
| EV | 25.36–81.56 | cm Hg | no | Exhaust vacuum |
| AP | 992.89–1033.30 | millibar | no | Ambient pressure |
| RH | 25.56–100.16 | % | no | Relative humidity |
| EP | 420.26–495.76 | MW | no | Net hourly electrical energy output |
Table 2. New features and their descriptions based on domain knowledge.
| Symbol | Symbol Meaning | Specific Meaning |
|---|---|---|
| ρ | Gas Density | Density of the gas, calculated using the ideal gas law. |
| e_s | Saturation Vapor Pressure | The pressure exerted by water vapor in equilibrium with liquid water. |
| AH | Absolute Humidity | The mass of water vapor per unit volume of air. |
| T_dew | Dew Point Temperature | The temperature at which air becomes saturated with water vapor. |
| h | Enthalpy | The total energy of the air, including sensible and latent heat. |
| T_wb | Wet-Bulb Temperature | The temperature of air cooled by evaporation to saturation at constant pressure. |
| q | Specific Humidity | The mass of water vapor per unit mass of moist air. |
| Q_heat | Heat Load | A measure of the thermal load based on temperature and relative humidity. |
| c_p | Specific Heat Capacity of Air | The heat capacity of air at constant pressure. |
| k | Thermal Conductivity of Air | The ability of air to conduct heat. |
| μ | Dynamic Viscosity of Air | The resistance of air to flow under an applied force. |
| v | Kinematic Viscosity of Air | The ratio of dynamic viscosity to gas density. |
| Pr | Prandtl Number | The ratio of momentum diffusivity to thermal diffusivity. |
| c | Speed of Sound in Air | The speed at which sound waves propagate through air. |
| D | Diffusion Coefficient of Water Vapor in Air | The rate of diffusion of water vapor in air. |
| P_ex | Exhaust Pressure | The pressure in the exhaust system, calculated as ambient pressure minus vacuum. |
| P_power | Exhaust Power | The power associated with the exhaust flow. |
| PV | Pressure–Vacuum Interaction | An interaction feature between pressure and vacuum. |
| TP | Temperature–Pressure Interaction | An interaction feature between ambient temperature and ambient pressure. |
| TV | Temperature–Vacuum Interaction | An interaction feature between ambient temperature and exhaust vacuum. |
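A few of the features in Table 2 can be derived from the four raw inputs as in the sketch below. The exact formulas used in the study are given in Table S2, which is not reproduced here, so everything in this block is an assumption: the dry-air ideal gas law for density, the Magnus approximation for saturation vapor pressure, plain products for the interaction terms, and the unit conversions. The function name `domain_features` is hypothetical.

```python
import math

R_DRY_AIR = 287.05      # J/(kg*K), specific gas constant of dry air
R_WATER_VAPOR = 461.5   # J/(kg*K), specific gas constant of water vapor
PA_PER_MBAR = 100.0     # 1 mbar = 100 Pa
PA_PER_CMHG = 1333.22   # 1 cm Hg is approximately 1333.22 Pa

def domain_features(at_c, ev_cmhg, ap_mbar, rh_pct):
    """Derive a few Table 2-style features from one raw observation."""
    t_k = at_c + 273.15                 # ambient temperature in kelvin
    p_pa = ap_mbar * PA_PER_MBAR        # ambient pressure in Pa
    v_pa = ev_cmhg * PA_PER_CMHG        # exhaust vacuum in Pa
    # Saturation vapor pressure in Pa (Magnus approximation over water)
    e_s = 610.94 * math.exp(17.625 * at_c / (at_c + 243.04))
    e = rh_pct / 100.0 * e_s            # actual vapor pressure in Pa
    return {
        "rho": p_pa / (R_DRY_AIR * t_k),    # ideal gas law, dry-air assumption
        "e_s": e_s,
        "AH": e / (R_WATER_VAPOR * t_k),    # kg of water vapor per m^3 of air
        "P_ex": p_pa - v_pa,                # ambient pressure minus vacuum
        "PV": ap_mbar * ev_cmhg,            # interaction terms as plain products
        "TP": at_c * ap_mbar,
        "TV": at_c * ev_cmhg,
    }

# Hypothetical observation within the ranges of Table 1
features = domain_features(at_c=20.0, ev_cmhg=50.0, ap_mbar=1013.25, rh_pct=60.0)
print(features)
```

Because every derived feature is a deterministic function of the four measured inputs, the features add no new measurements. They re-express the inputs in forms that a tree-based model may split on more easily.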
Table 3. Parameters of machine learning methods.
| Method | Iterations | Learning_Rate | Depth | Regularization |
|---|---|---|---|---|
| CatBoost | 5000 | 0.1 | 7 | l2_leaf_reg = 3 |
| XGBoost | 5000 | 0.1 | 7 | reg_lambda = 1, reg_alpha = 0 |
| LightGBM | 5000 | 0.1 | 7 | reg_lambda = 0, reg_alpha = 0 |
| Random Forest | 5000 | - | None | - |
| Decision Tree | - | - | None | - |
| SVM | - | - | - | C = 1.0, gamma = ‘scale’ |
| Neural Network | 500 | 0.001 | 3 (128-64-1) | L2 = 0.0001 (alpha) |
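The settings in Table 3 could be written down as keyword arguments for commonly used Python regressors, as sketched below. The source does not name the implementations, so the mapping onto these constructor names (CatBoost, XGBoost, LightGBM, and scikit-learn classes) is an assumption, as is the reading of the table rows as 5000 iterations, learning rate 0.1, depth 7 for the boosting models and 500 iterations, learning rate 0.001, layers 128-64-1 for the neural network.

```python
# Hypothetical mapping of Table 3 onto regressor constructor keyword arguments.
# Keys are class names; the classes themselves are not imported here.
MODEL_PARAMS = {
    "CatBoostRegressor": {
        "iterations": 5000, "learning_rate": 0.1, "depth": 7, "l2_leaf_reg": 3,
    },
    "XGBRegressor": {
        "n_estimators": 5000, "learning_rate": 0.1, "max_depth": 7,
        "reg_lambda": 1, "reg_alpha": 0,
    },
    "LGBMRegressor": {
        "n_estimators": 5000, "learning_rate": 0.1, "max_depth": 7,
        "reg_lambda": 0, "reg_alpha": 0,
    },
    "RandomForestRegressor": {"n_estimators": 5000, "max_depth": None},
    "DecisionTreeRegressor": {"max_depth": None},
    "SVR": {"C": 1.0, "gamma": "scale"},
    "MLPRegressor": {
        # Two hidden layers; the final "1" in 128-64-1 is the output unit.
        "hidden_layer_sizes": (128, 64), "learning_rate_init": 0.001,
        "alpha": 0.0001, "max_iter": 500,
    },
}

for name, params in MODEL_PARAMS.items():
    print(name, params)
```

With the libraries installed, each entry could be passed as `Model(**params)`. Keeping the three boosting models at the same number of trees, learning rate, and depth makes their comparison in Tables 4 and 5 easier to interpret.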
Table 4. Average results of machine learning methods before feature engineering.
| Methods | RMSE | MAE | R² |
|---|---|---|---|
| CatBoost | 2.9310 | 2.0103 | 0.9690 |
| XGBoost | 2.9904 | 2.0641 | 0.9681 |
| LightGBM | 3.0766 | 2.1546 | 0.9668 |
| Random Forest | 3.5074 | 2.5695 | 0.9560 |
| SVM | 4.1716 | 3.1628 | 0.9371 |
| Decision Tree | 4.1081 | 2.9183 | 0.9392 |
| Neural Network | 4.5059 | 3.5697 | 0.9288 |
Table 5. Average results of machine learning methods after feature engineering.
| Methods | RMSE | MAE | R² |
|---|---|---|---|
| CatBoost | 2.8924 | 1.9844 | 0.9693 |
| XGBoost | 2.9461 | 2.0524 | 0.9690 |
| LightGBM | 3.0087 | 2.1216 | 0.9672 |
| Random Forest | 3.4647 | 2.5539 | 0.9574 |
| SVM | 4.1502 | 3.1636 | 0.9399 |
| Decision Tree | 4.0718 | 2.8745 | 0.9415 |
| Neural Network | 4.4116 | 3.4583 | 0.9298 |
Table 6. Results of the feature-group ablation experiment.
| Feature set | RMSE | MAE | R² |
|---|---|---|---|
| Only original features | 2.9310 | 2.0103 | 0.9690 |
| Add thermodynamics features | 2.9315 | 2.0244 | 0.9690 |
| Add fluid mechanics features | 2.9279 | 2.0104 | 0.9691 |
| Add exhaust system features | 2.9170 | 1.9858 | 0.9692 |
| Add all features | 2.8924 | 1.9844 | 0.9693 |
| RFE selected 11 features | 2.8545 | 1.9645 | 0.9702 |
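To put the ablation results in proportion, the relative error reduction against the original-feature baseline can be worked out directly from the values in Table 6. The sketch below is plain arithmetic on those reported numbers; the variable and function names are illustrative.

```python
# CatBoost scores reported in Table 6
BASELINE = {"RMSE": 2.9310, "MAE": 2.0103}          # only original features
STAGES = {
    "all added features": {"RMSE": 2.8924, "MAE": 1.9844},
    "RFE, 11 features": {"RMSE": 2.8545, "MAE": 1.9645},
}

def reduction_pct(before, after):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (before - after) / before

for stage, scores in STAGES.items():
    for metric, value in scores.items():
        pct = reduction_pct(BASELINE[metric], value)
        print(f"{stage}: {metric} reduced by {pct:.2f}%")
```

Adding all features lowers RMSE by about 1.32% and MAE by about 1.29%; selecting 11 features with RFE roughly doubles the gain, to about 2.61% and 2.28%, respectively.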
Table 7. Comparison with results from the literature.
| Algorithm | Ref. | RMSE | MAE |
|---|---|---|---|
| Bagging with REP Tree | [19] | 4.239 | 3.220 |
| C-CRF | [20] | 3.978 | 2.970 |
| TSK-ELM | [22] | 3.97 | N/A |
| Regression ANN | [23] | 4.32 | N/A |
| GA-MLP | [24] | 4.305 | N/A |
| RF | [25] | 3.027 | 2.255 |
| RFR | [27] | 3.5 | 2.4 |
| TOB matching | [28] | 2.89 | N/A |
| Stacking ensemble | [5] | 3.501 | 2.615 |
| ESDA-MLP | [31] | 4.111 | 3.217 |
| DNN-Transformer | [32] | 3.537 | 2.403 |
| LM + BR + SCG + MLP | [33] | 3.631 | N/A |
| ANFIS | [34] | 3.785 | N/A |
| GP-RHVS | [35] | 4.299 | 3.338 |
| Super Learner Ensemble | [36] | 3.177 | 2.254 |
| Proposed method | - | 2.855 | 1.965 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Guo, B.; Yang, B.; Shi, W.; Yang, F.; Wang, D.; Wang, S. CCPP Power Prediction Using CatBoost with Domain Knowledge and Recursive Feature Elimination. Energies 2025, 18, 4272. https://doi.org/10.3390/en18164272
