1. Introduction
As an essential resource, electricity has contributed greatly to human activities and social development. To improve power generation efficiency, CCPPs have emerged as a prominent and efficient solution. The basic working principle of the CCPP was described by Pourbeik et al. [1]; its core components include gas turbines, a steam turbine, HRSGs, and generators. The gas turbine generates electricity by combusting fuel with high-pressure air, which drives the turbine blades to rotate, while the waste heat in the exhaust gases is converted into steam through the HRSG to drive the steam turbine, further contributing to electricity generation. CCPPs integrate both the Brayton and Rankine cycles, achieving over 60% efficiency [2], reducing emissions [3], and lowering operational and maintenance costs [4]. Given the high cost of storing excess energy, accurately predicting the output of power plants is important for maximizing profits and minimizing pollution in the power grid system [5].
Traditional physics-based approaches utilize NWP models to simulate atmospheric processes based on physical principles and boundary conditions [6]. However, these methods require extensive input parameters, environmental variables, and thermodynamic assumptions to accurately represent real-world systems [7,8]. When meteorological conditions change rapidly or unexpected errors occur, their performance may also degrade [9]. In contrast, statistical methods such as ARMA models [10], Bayesian approaches [11], Kalman filters [12], Markov chain models [13], and Grey theory [14] are more widely used than physics-based prediction models. Nevertheless, most existing statistical models are inherently linear, making them less effective for long-term power supply forecasting [6].
In recent years, machine learning algorithms have been widely applied across various fields, demonstrating strong capabilities and broad applicability [15,16,17,18]. For the power prediction problem of CCPPs, researchers have proposed a variety of machine learning and optimization methods, achieving fruitful results.
In 2014, Tufekci [19] evaluated the performance of various machine learning regression methods by exploring the best feature subset of the dataset and found that the most successful method predicted the full-load power output well, achieving an MAE of 2.818 and an RMSE of 3.787. Two years later, Ahn and Hur [20] proposed a continuous conditional random field model and obtained an MAE of 2.97 and an RMSE of 3.978 on similar tasks.
With further research, Chatterjee et al. [21] designed a Cuckoo search-enabled neural network model in 2018 and compared it with a PSO-enabled neural network model, which showed slightly better performance. In the same year, Yeom and Kwak [22] proposed an ELM based on the TSK fuzzy system, which used random partitions to generate the initial matrix and combined them with LSE to optimize the model parameters. In addition, Elfaki and Ahmed [23] explored a regression ANN combining two backpropagation algorithms to estimate the electrical power output of a CCPP.
In 2019, Lorencin et al. [24] adopted a GA to optimize the structure of an MLP and improved model performance by adjusting the number of hidden layer nodes; the optimal RMSE was 4.305. Meanwhile, Bandic et al. [25] used Random Forest, Random Tree, and ANFIS for regression analysis and compared the performance of full and reduced feature sets. In the same year, Han [26] proposed a fuzzy neural network algorithm based on a logic tree structure to achieve efficient power prediction by selecting key neuron nodes and simplifying rules.
In 2020, Hundi and Shahsavari [27] compared a variety of machine learning models and achieved the best result, with RMSE = 3.5, MAE = 2.4, and R2 = 0.959. Wood [28] used the TOB algorithm and the firefly optimization algorithm to improve prediction accuracy. Subsequently, Qu et al. [5] used the stacking method combined with hyperparameter optimization to achieve higher-precision power load prediction by training multiple heterogeneous models in parallel.
In 2021, Afzal et al. [29] modeled Ridge regression, Linear Regression, and SVR and compared their performance through a number of evaluation metrics. Santarisi and Faouri [30] used PCA to reduce the data dimensionality; although the computational cost was significantly reduced, predictive performance also declined slightly.
Subsequently, in 2022, Zhao et al. [31] proposed a model combining ESDA and ANN, which showed superior prediction performance. In 2023, Yi et al. [32] developed a new method combining a Transformer encoder and DNN, which achieved excellent results, with RMSE = 3.5370, MAE = 2.4033, MAPE = 0.5307%, and R2 = 0.9555.
In 2024, Ntantis and Xezonakis [33] innovatively used Levenberg–Marquardt, Bayesian regularization, and SCG to configure multiple ANN models for CCPP power prediction. Later, Xezonakis [34] separately proposed an ANFIS model combining the least squares method and gradient descent to further optimize the performance of the Sugeno fuzzy model. Anđelić et al. [35] used genetic programming to generate symbolic expressions and optimized the model through random hyperparameter search. Song et al. [36] integrated six machine learning models and improved the generalization performance of prediction. Finally, Zhang et al. [37] used a variety of machine learning methods combined with the HGS algorithm to optimize short-term power prediction, significantly improving the stability and accuracy of prediction.
In addition, related studies have used other datasets. For example, Karacor et al. [38] used fuzzy logic and ANN to predict the electrical power output of a 243 MW CCPP in Izmir, Turkey, and found that the ANN estimated the power output with high accuracy: the lowest RPE was between 0.59% and 3.54% for the FL model and between 0.001% and 0.84% for the ANN. Shuvo et al. [39] used LR, LAR, DTR, and RFR methods to predict the power output of a 210 MW CCPP located in India, with the LR method achieving the best prediction performance.
Overall, many methods have achieved good results in the power prediction of CCPPs, relying primarily on machine learning models alone. A summary of these representative methods, along with their respective advantages and limitations, is provided in Table S1. In contrast to previous studies that relied on purely data-driven machine learning models for CCPP power prediction, this work introduces a domain-informed hybrid framework combining physical insights, feature selection, and an advanced ensemble model.
The main contributions of this article are summarized as follows:
- (1)
A hybrid approach integrating CatBoost, domain knowledge, and RFE is proposed to enhance power prediction accuracy for CCPPs.
- (2)
Twenty new features are designed based on thermodynamic and operational principles, many of which have not been explored in prior works. These features yield consistent performance gains across seven machine learning algorithms.
- (3)
Comparative analysis shows that CatBoost consistently outperforms six commonly used machine learning models, both with and without domain knowledge integration.
- (4)
RFE is applied to optimize feature selection, with the best predictive performance achieved using 11 selected features, resulting in an RMSE of 2.8545, an MAE of 1.9645, and an R2 of 0.9702.
- (5)
The proposed method outperforms existing literature methods, demonstrating its effectiveness in power prediction for CCPPs.
3. Methodology
3.1. Feature Engineering
Feature engineering is a step used to enhance the performance of machine learning by integrating domain knowledge [41]. In this study, 20 new features were proposed based on the original features from Table 1, with detailed formulas provided in Table S2. These new features, summarized in Table 2, span thermodynamic and fluid-mechanical quantities, as well as specific characteristics of the exhaust system and interaction terms between the original features.
The thermodynamic properties, such as gas density and saturation vapor pressure, provide critical insights into the air’s state and behavior under varying conditions. Gas density is a fundamental parameter in fluid dynamics and energy calculations, as it directly influences the mass flow rate and pressure drop in the system. Saturation vapor pressure is essential for understanding the air’s capacity to hold moisture, which is particularly relevant in humidity control and condensation analysis. The absolute humidity and dew point temperature further quantify the air’s moisture content and saturation point. Enthalpy, which combines sensible and latent heat, is a key parameter in energy balance calculations, enabling the assessment of heating and cooling loads in thermodynamic systems. The wet-bulb temperature and specific humidity further describe the air’s moisture content and cooling potential. Thermal conductivity determines the rate of heat transfer between the air and surrounding surfaces. Heat load quantifies the thermal energy required to maintain a desired temperature.
In the realm of fluid mechanics, the dynamic viscosity and kinematic viscosity describe the air’s resistance to flow, which is crucial for analyzing pressure losses and flow distribution in ducts and pipes. The Prandtl number is a dimensionless parameter that represents the ratio of momentum diffusivity to thermal diffusivity, characterizing the relative thickness of thermal and velocity boundary layers. The speed of sound and diffusion coefficient of water vapor are also included, as they are critical in acoustics and mass transfer studies, respectively. These features collectively provide a detailed understanding of the air’s transport properties and their impact on system performance.
For the exhaust system, the exhaust pressure and exhaust power are derived to characterize the system’s pressure conditions and energy consumption. Exhaust pressure, calculated as the difference between ambient pressure and exhaust vacuum, reflects the pressure drop across the system. Exhaust power quantifies the energy associated with the exhaust flow, providing a measure of the system’s efficiency and energy requirements.
To capture the complex interactions between the original features, interaction terms such as pressure–vacuum, temperature–pressure, and temperature–vacuum were introduced. These terms model the combined effects of temperature, pressure, and vacuum on the system’s performance, offering a more nuanced representation of the underlying physical processes.
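As an illustration of how such features can be derived, the sketch below computes a handful of thermodynamic, exhaust, and interaction features from the four raw CCPP inputs. The Magnus approximation for saturation vapor pressure, the cm Hg-to-mbar conversion, and all function and feature names are standard stand-ins for illustration, not the exact expressions of Table S2:

```python
import math

def derive_features(at, v, ap, rh):
    """Derive a few domain-informed features from the raw CCPP inputs:
    ambient temperature AT (deg C), exhaust vacuum V (cm Hg),
    ambient pressure AP (mbar), relative humidity RH (%)."""
    # Saturation vapor pressure (hPa), Magnus approximation
    e_s = 6.112 * math.exp(17.62 * at / (243.12 + at))
    # Actual vapor pressure from relative humidity
    e = e_s * rh / 100.0
    # Absolute humidity (g/m^3)
    ah = 216.7 * e / (at + 273.15)
    # Dew point (deg C), inverse Magnus formula
    gamma = math.log(rh / 100.0) + 17.62 * at / (243.12 + at)
    dew_point = 243.12 * gamma / (17.62 - gamma)
    # Exhaust pressure: ambient pressure minus exhaust vacuum
    # (1 cm Hg ~ 13.33 mbar; this conversion is an assumption)
    exhaust_pressure = ap - 13.33 * v
    # Interaction terms between the original features
    return {
        "sat_vapor_pressure": e_s,
        "abs_humidity": ah,
        "dew_point": dew_point,
        "exhaust_pressure": exhaust_pressure,
        "temp_pressure": at * ap,
        "temp_vacuum": at * v,
    }
```

Such derived columns are simply appended to the original four features before model training.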
3.2. Handling Outliers
The Z-score measures how far a data point deviates from the mean of its dataset. It first converts all data to a unified scale and then computes each point's deviation from the mean; 99.7% of the data have Z-scores between −3 and 3, and data outside this range are treated as outliers. The Z-score is calculated as

z = (x − μ) / σ

where x is the value of the data point, μ is the mean of the feature, and σ is the standard deviation of the feature.
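A minimal sketch of this outlier rule, assuming a plain list-based feature column with non-zero spread (the function name is illustrative):

```python
def zscore_outliers(values, threshold=3.0):
    """Return the points whose Z-score falls outside [-threshold, threshold]."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation of the feature (assumed non-zero)
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    # z = (x - mean) / std for each point; keep only the extreme ones
    return [x for x in values if abs((x - mean) / std) > threshold]
```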
3.3. Dataset Division
To accurately evaluate the performance of the proposed model, 10-fold cross-validation was employed, as shown in Figure 3. In each iteration, the model was trained on nine subsets and tested on the remaining one, with the process repeated ten times to ensure that each subset served as the test set once. The final performance metrics, including RMSE, MAE, and R2, were obtained by averaging the results across all folds. This procedure mitigates potential biases arising from uneven data partitioning and provides a robust assessment of the model's predictive capability.
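The fold construction can be sketched in plain Python (in practice a library routine such as scikit-learn's KFold would typically be used; this stand-in only shows how each subset serves as the test set exactly once):

```python
def kfold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.
    Each of the k disjoint folds is used as the test set exactly once,
    with the remaining k-1 folds forming the training set."""
    indices = list(range(n_samples))
    # Distribute any remainder across the first folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test
```

The per-fold RMSE, MAE, and R2 values are then averaged to produce the final reported metrics.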
3.4. Data Normalization
The data from the training set are normalized. Because features vary in scale, their impact on the model may be skewed. To mitigate this effect and ensure that all features contribute to the model at a consistent scale, the data are scaled to the interval [0, 1] using the following formula:

x′ = (x − x_min) / (x_max − x_min)

where x_min and x_max are the minimum and maximum values of the feature, respectively. Normalization places the values of different features in the same interval, avoiding the negative impact of scale differences on model training.
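A sketch of this min–max scaling, with the parameters fitted on the training set only so the test set never leaks into the normalization (helper names are illustrative; a constant feature with x_max = x_min is assumed not to occur):

```python
def minmax_fit(train_column):
    """Learn the scaling range from the training data only."""
    return min(train_column), max(train_column)

def minmax_transform(column, lo, hi):
    """Apply x' = (x - x_min) / (x_max - x_min) using the fitted range."""
    return [(x - lo) / (hi - lo) for x in column]
```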
3.5. Feature Selection
The introduction of new features significantly increases the dimensionality of the dataset, enriching the model’s input information. However, as the feature dimension grows, redundant features and noise may negatively impact model performance. To address this, RFE was employed to refine the feature space, enhancing the model’s generalization ability by selecting the most representative features.
RFE is an iterative feature selection technique that identifies the most relevant features by training a base model and assessing feature importance [42,43,44]. As illustrated in Figure 4, RFE progressively removes the features with the least influence on the target variable, ultimately retaining the most critical features for the prediction task. This process improves model stability and predictive accuracy by eliminating irrelevant or redundant information.
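The elimination loop can be sketched as follows, with `importance_fn` standing in for the base model's feature-importance scores (the actual base model and scoring used in this study are not reproduced here):

```python
def rfe(features, importance_fn, n_keep):
    """Recursive feature elimination sketch: repeatedly score the current
    feature set and drop the least important feature until n_keep remain.
    importance_fn takes the current feature list and returns one score
    per feature, standing in for a fitted model's importances."""
    selected = list(features)
    while len(selected) > n_keep:
        scores = importance_fn(selected)
        # Remove the feature with the lowest importance score
        worst = min(range(len(selected)), key=lambda i: scores[i])
        selected.pop(worst)
    return selected
```

In practice the scores would be refreshed by refitting the base model on each reduced feature set, exactly as the loop above does via `importance_fn`.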
3.6. CatBoost
CatBoost [45] is a gradient boosting technique developed by Yandex that constructs multiple decision trees to improve model performance. Like other gradient boosting methods, it aims to minimize a loss function L(y, F(x)) by iteratively adding a weak learner to an ensemble. At each iteration t, it updates the model as

F_t(x) = F_{t−1}(x) + η·h_t(x)

where η is the learning rate, and h_t(x) is the newly fitted weak learner trained to approximate the negative gradient of the loss function:

h_t(x) ≈ −[∂L(y, F(x)) / ∂F(x)]_{F(x) = F_{t−1}(x)}

Unlike XGBoost and LightGBM, CatBoost uses completely symmetric (oblivious) binary trees [46], which are built layer by layer until the specified depth is reached. In each iteration, the split that incurs the least loss is selected and applied to all leaf nodes of the current layer. In addition, CatBoost stands out for its ability to perform effectively with limited training data, handle various data formats, and internally manage missing values, ensuring stability and robustness [47]. It divides a given dataset into random permutations and applies ordered boosting on these permutations, avoiding target leakage and gradient bias.
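The boosting update F_t(x) = F_{t−1}(x) + η·h_t(x) can be illustrated with a minimal squared-loss example, in which one-dimensional decision stumps stand in for CatBoost's symmetric trees (a didactic sketch, not CatBoost itself; for squared loss the negative gradient is simply the residual):

```python
def fit_gbm(x, y, n_rounds=50, lr=0.1):
    """Minimal gradient boosting for squared loss on 1-D data.
    Each round fits a decision stump h_t to the residuals (the negative
    gradient of the squared loss) and adds lr * h_t to the ensemble."""
    f0 = sum(y) / len(y)                 # initial constant model F_0
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        best = None
        for t in sorted(set(x)):         # candidate split thresholds
            left = [r for xi, r in zip(x, resid) if xi <= t]
            right = [r for xi, r in zip(x, resid) if xi > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - (lm if xi <= t else rm)) ** 2
                      for xi, r in zip(x, resid))
            if best is None or err < best[0]:
                best = (err, t, lm, rm)
        _, t, lm, rm = best
        stumps.append((t, lm, rm))
        # F_t(x) = F_{t-1}(x) + lr * h_t(x)
        pred = [p + lr * (lm if xi <= t else rm) for p, xi in zip(pred, x)]
    return f0, stumps

def predict_gbm(f0, stumps, xi, lr=0.1):
    return f0 + lr * sum(lm if xi <= t else rm for t, lm, rm in stumps)
```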
Another advantage of CatBoost is that it can handle categorical features directly, without traditional preprocessing steps such as one-hot encoding or label encoding. CatBoost adopts an improved greedy TBS method, which adds a prior term and a weight coefficient, as shown in the following formula:

x̂_k^i = ( Σ_{j=1}^{n} [x_j^i = x_k^i]·y_j + a·p ) / ( Σ_{j=1}^{n} [x_j^i = x_k^i] + a )

where x_k^i represents the k-th training sample of feature i; x̂_k^i represents the smoothed class-average value that replaces it; y_j represents the label of the j-th training sample; x_j^i represents the j-th training sample of feature i; the indicator [x_j^i = x_k^i] judges whether the category feature of the j-th training sample is consistent with that of the k-th training sample; a represents the weight coefficient; and p represents the added prior term.
5. Conclusions
In this study, a novel method integrating domain knowledge with CatBoost was proposed for predicting the power output of CCPPs. To enhance predictive accuracy, 20 new features were designed based on domain expertise, and RFE was employed to identify the most informative features. Comparative experiments involving CatBoost and six other commonly used machine learning algorithms demonstrated that CatBoost consistently outperformed its counterparts, regardless of whether domain knowledge was incorporated. The incorporation of domain knowledge led to a significant improvement in prediction accuracy across all machine learning algorithms, highlighting the universal applicability and effectiveness of the newly proposed features in enhancing model performance.
To further optimize the model, RFE was applied to determine the optimal number of features for prediction. Experimental results revealed that varying the number of selected features between 4 and 24 influenced predictive accuracy, with the best performance achieved when selecting 11 features, yielding an RMSE of 2.8545, an MAE of 1.9645, and an R2 of 0.9702. Finally, a comparison with existing methods in the literature confirmed the superior predictive accuracy of the proposed approach. These results underscore the effectiveness of integrating domain knowledge with machine learning and its potential for enhancing power output prediction in CCPPs. The proposed framework, with its strong generalization ability and computational efficiency, is well suited for integration into real-time plant operation systems. In smart grid scenarios, where dynamic energy management and short-term load forecasting are critical, the model can contribute to improved scheduling, grid stability, and energy optimization.