

Introduction
Gas turbines, owing to their advantages of lower emissions, fast start-up and acceleration, high power density [1], and high fuel flexibility [2], have been widely applied in aircraft, shipping [3], and power generation [4]. Performance estimation [5], operating optimization [6], and diagnostics [7] of gas turbines are important and depend heavily on an accurate gas turbine performance model. The accuracy of the performance model relies considerably on its component characteristic maps, which are obtained in rig tests under different conditions that can be costly and time-consuming [8]. Thus, only component characteristic maps from a similar gas turbine can be provided by the original equipment manufacturer (OEM) [9]. This is insufficient: because discrepancies caused by manufacturing, assembly deviation, and overhaul of gas turbines will always exist, variations in component characteristic maps between gas turbines are unavoidable [10].
To overcome this problem and obtain an accurate performance model, considerable attention has been paid to performance adaption based on operating data. Stamatis et al. [11] introduced modification factors to modify component characteristic maps and established an optimization procedure to optimize the modification factors. Lambiris et al. [12] further developed this method by introducing determining corrections and sensitivity analysis. Kong et al. [13] proposed a new scaling method based on system identification, which used component characteristic maps and scaling factors at both design and off-design points. Then, a genetic algorithm (GA) was introduced to this method to create better shapes of the speed lines [14,15]. Li et al. [16] proposed a ...
(1) The proposed cluster sampling method divides all operating points into categories and selects the points closest to the cluster centers. Compared with random sampling, it enhances the coverage and dispersion (representing the degree of sample difference) of the sampling set and therefore its representativeness.
(2) Compared with the performance model before adaption, the accuracy of the performance model tuned by the particle swarm optimization (PSO) algorithm is increased.
(3) The proposed method was applied to a real E-class gas turbine power station and compared with performance adaption based on the random sampling method.
(4) Compared with the model based on the random sampling method, the proposed method has a higher accuracy over the entire field dataset, especially under the operating conditions overlooked by random sampling.
This paper is organized as follows: Section 1 reviews prior research on performance adaption and proposes a performance adaption method based on cluster sampling and the iteration-eliminating model. Section 2 introduces the simulation model used in this study, performance adaption, the cluster sampling method, and the flowchart of performance adaption based on cluster sampling. Section 3 uses an example of a power plant to demonstrate the effectiveness of the proposed method. Section 4 summarizes this paper.

Methodology
The proposed performance adaption process depicted in Figure 1 can tune the component characteristic maps to meet the performance of sampling points extracted from the field data. This method is mainly composed of a gas turbine performance model, a cluster sampling method, and an optimization algorithm. The performance model is described in Section 2.1, the performance adaption method that minimizes the deviation between the field measurement data and the simulated results is described in Section 2.2, and the cluster sampling method is described after that.

Figure 1. Flowchart of the performance adaption method.

Performance Model of the Gas Turbine
As performance adaption is conducted on the component-level model, this section introduces the performance simulation model of the gas turbine used in this study. A model consisting of a compressor, combustor, turbine, rotor, and generator was used to predict the measurements, as shown in Figure 2, in which "1" denotes the compressor inlet, "2" the compressor outlet, "3" the burner outlet, and "4" the turbine outlet.
The performance model of the gas turbine was used to simulate the performance parameters from the input parameters. The performance model is a steady-state mechanism model established from thermodynamic principles, energy conservation, mass conservation equations, and the component characteristic maps, which are used to characterize the relationship between the component equivalent flow rate, pressure ratio, isentropic efficiency, and equivalent speed. The simulation model can be expressed as Equation (1):

$$y = f(x) \tag{1}$$

where $f(\cdot)$ denotes the mechanism model of the gas turbine; $x = \{T_1, P_1, m_f, T_f, P_f, P_4, n\}$ is the vector of the input parameters, which are shown by the variables without asterisks in Table 1; and $y = \{T_2, P_2, T_4, P_w\}$ is the vector of the output parameters, which are shown by the variables marked with asterisks in Table 1.

Table 1. Measurable gas-path parameters.

| No. | Parameter | Symbol |
|-----|-----------|--------|
| 3 | Compressor outlet temperature * | $T_2$ |
| 4 | Compressor outlet pressure * | $P_2$ |
| 5 | Fuel mass flow | $m_f$ |
| 6 | Fuel temperature | $T_f$ |
| 9 | Turbine outlet pressure | $P_4$ |
| 10 | Power output * | $P_w$ |
| 11 | Rotor speed | $n$ |

* Performance parameters used to test the predicted result.

Performance Adaption
The gas turbine model was introduced in the previous section, and the component characteristic maps play an important role in the model computation. Therefore, a performance adaption method is proposed to modify the component characteristic maps and thereby enhance the model precision. The basic method of performance adaption used in this paper contains two procedures: (1) Map scaling for the design point, which introduces a set of fixed scaling factors to modify the original component characteristic maps. This procedure is not the focus of this paper and is not described here; more details can be obtained from Li et al. [17]. (2) Performance adaption for off-design points, in which a set of scaling factors is introduced to improve the prediction accuracy of the simulation model under off-design conditions. In this paper, the component characteristic maps were adapted according to multiple off-design operation data. As shown in Figure 3, the off-design speed lines in the initial component characteristic map (solid line) are different from the actual map (dotted line). To modify the maps, three characteristic parameters (corrected mass flow rate, pressure ratio, and isentropic efficiency) were calibrated at different speed lines. Taking the compressor as an example, the off-design scaling factors of the corrected mass flow rate, pressure ratio, and isentropic efficiency were introduced and defined as shown in Equations (2)-(4).
Appl. Sci. 2023, 13, x FOR PEER REVIEW 5 of 18
It can be observed that the scaling factors at each speed line are different, and the variation of the scaling factors with the speed lines is nonlinear. In this paper, a quadratic form (Equation (5)) was applied to describe the nonlinear dependence of the scaling factors on the corrected relative non-dimensional rotational speed (CN) [17,18].
In Equations (2)-(4), terms with the superscript "*" are the modified characteristic parameters and terms without "*" are the original characteristic parameters. The subscript "Comp" denotes the compressor. WAC, PR, and ETA represent the corrected mass flow rate, pressure ratio, and isentropic efficiency, respectively. OD denotes the off-design condition. Equation (5) defines the scaling factors as a quadratic function of CN, where x represents one of the characteristic parameters WAC, PR, or ETA; DP denotes the design point; $b_x$ and $c_x$ represent the first-order and second-order coefficients in the correlation function, respectively; $n$ denotes CN; and $1 - n_{OD}/n_{DP}$ represents the difference in CN between any OD point and the design point.
Figure 3 shows the errors between the original component map and the actual map. The error is larger at lower speeds. Thus, the performance adaption for the off-design points needs to be conducted on the speed lines. During map adaption, the original map (solid lines) was modified to the new map (dotted lines).
Suppose A is a point on the speed line with CN = $n_{OD}$, with characteristic parameters $PR_A$ and $WAC_A$, and A* with characteristic parameters $PR_A^*$ and $WAC_A^*$ is the corresponding target point; after being scaled by Equations (2)-(4), the point A moves to A* and the speed line with CN = $n_{OD}$ reaches the dotted line.
In this paper, the scaling factors are determined by CN and the coefficients $b_x$ and $c_x$. CN is an independent input, and $b_x$ and $c_x$ are adjustable parameters. By adjusting the coefficients, the characteristic maps can be modified, and the predicted performance measurements change accordingly. Thus, the errors between the actual performance measurements and those predicted by the gas turbine performance model shown in Figure 2 can be minimized by tuning the coefficients $b_x$ and $c_x$.
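The scaling step described above can be illustrated with a short Python sketch. Equations (2)-(5) are not reproduced here, so the form used below is an assumption: each modified parameter is taken as the original parameter multiplied by a scaling factor, and the scaling factor is taken as $1 + b_x d + c_x d^2$ with $d = 1 - n_{OD}/n_{DP}$, so that it equals 1 at the design point. The function names, the `sf_dp` argument, and all numerical values are hypothetical.

```python
def scaling_factor(n_od, n_dp, b, c, sf_dp=1.0):
    """Assumed quadratic scaling-factor law; sf_dp is a hypothetical design-point factor."""
    d = 1.0 - n_od / n_dp  # CN difference between the OD point and the design point
    return sf_dp * (1.0 + b * d + c * d * d)


def scale_point(wac, pr, eta, n_od, n_dp, coeffs):
    """Scale one map point; coeffs maps 'WAC', 'PR', 'ETA' to (b, c) pairs."""
    return tuple(
        value * scaling_factor(n_od, n_dp, *coeffs[name])
        for name, value in (("WAC", wac), ("PR", pr), ("ETA", eta))
    )


# Hypothetical coefficients and map point, for illustration only.
coeffs = {"WAC": (0.10, 0.0), "PR": (0.20, 0.5), "ETA": (0.0, 0.0)}
wac_s, pr_s, eta_s = scale_point(100.0, 10.0, 0.85, n_od=0.9, n_dp=1.0, coeffs=coeffs)
print(wac_s, pr_s, eta_s)
```

With this assumed form, a point on the design speed line is left unchanged, and the shift grows with the distance from the design speed, which matches the observation that the map error is larger at lower speeds.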
To obtain the coefficients $b_x$ and $c_x$, the objective function (Equation (6)) is formed by minimizing the prediction error, where $\hat{z}$ denotes the predicted performance measurements, calculated by the simulation model in Figure 2; $z$ denotes the actual performance measurements; $M$ represents the number of performance measurements; and $N$ represents the number of targeted off-design points.
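As a minimal sketch of such an objective, the Python function below averages the absolute relative error over the $N$ off-design points and $M$ measurements. This particular form (a mean absolute relative error) is an assumption made for illustration and may differ from the exact expression in Equation (6); the function name and the sample values are hypothetical.

```python
def adaption_objective(predicted, measured):
    """Mean absolute relative error over N points and M measurements (assumed form)."""
    n_points = len(measured)
    n_meas = len(measured[0])
    total = 0.0
    for z_hat_row, z_row in zip(predicted, measured):
        for z_hat, z in zip(z_hat_row, z_row):
            total += abs((z_hat - z) / z)
    return total / (n_points * n_meas)


# Two hypothetical off-design points with four measurements each (T2, P2, T4, Pw).
measured = [[600.0, 12.0, 800.0, 100.0], [610.0, 12.5, 810.0, 110.0]]
predicted = [[606.0, 12.0, 800.0, 100.0], [610.0, 12.5, 810.0, 121.0]]
error = adaption_objective(predicted, measured)
print(f"{100 * error:.3f}%")
```

Using relative rather than absolute errors keeps measurements with different units and magnitudes on a comparable scale.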

Data Clustering for Sampling
When performing performance adaption, including all data points inevitably incurs considerable computational cost, so it is necessary to select a small number of points from the entire available dataset. The quality of this selection directly affects the accuracy of the prediction model. The random sampling method can reflect the distribution characteristics of the data to some extent, but it may have difficulties with operating condition data. For example, some operating conditions may occur repeatedly in a certain sampling period, increasing their distribution density. Similar data are then easily sampled multiple times, while some operating conditions are not collected at all. Thus, a data-clustering technique was employed to learn the intrinsic relationship between the data and divide them into different clusters. The basic concept is that data from similar operating conditions are close to each other and have similar outputs. The clustering technique groups points that are in close proximity and finds the centroids that have the smallest sum of distances to all other points.
In this paper, k-means clustering [26] was used to group the data and find the centroids; the data points closest to the centroids were then chosen to form the cluster sampling set used for the characteristic adaption. K-means is a representative unsupervised learning method, which divides data into a predetermined number k of classes according to the distance between the samples. The k-means algorithm has the advantages of fast convergence and strong interpretability; its drawback is that the number of clusters k needs to be predetermined, but in this study this becomes an advantage, as the number of sampling points can be chosen according to the computational needs. Cluster sampling can thus make full use of the collected data while also collecting data with feature differences that are as large as possible.
Suppose there are m data points to be divided into k clusters. The calculation steps are as follows:
1. Select k initial centroids.
2. In the t-th iteration step, calculate the distance from each point to the k centroids according to Equation (7), and assign each point to the nearest cluster.
3. Calculate the mean of the points in each cluster and update the centroids by Equation (8).
4. Repeat steps 2 and 3 until the difference between two consecutive sets of centroids is smaller than a predefined threshold or the maximum number of iteration steps is reached.
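The steps above, followed by the selection of the point nearest to each centroid, can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in the study: the Euclidean distance, the random choice of initial centroids, the optional `init` argument, and the function name `kmeans_sample` are all assumptions.

```python
import math
import random


def kmeans_sample(points, k, init=None, max_iter=100, tol=1e-9, seed=0):
    """Run k-means and return (labels, indices of the points nearest each centroid)."""
    rng = random.Random(seed)
    # Step 1: initial centroids (given explicitly, or drawn at random from the data).
    centroids = [list(p) for p in (init if init is not None else rng.sample(points, k))]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Step 3: move every centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # an empty cluster keeps its centroid
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # Step 4: stop once the centroids no longer move
            break
    # Cluster sampling: pick the data point closest to each centroid.
    sample = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            sample.append(min(members, key=lambda i: math.dist(points[i], centroids[j])))
    return labels, sample


# Hypothetical two-dimensional data forming three well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0), (10.1, 0.0), (10.0, 0.1)]
labels, sample = kmeans_sample(points, k=3, init=[points[0], points[3], points[6]])
print(labels, sample)
```

In practice the variables would usually be normalized before clustering so that parameters with large numerical ranges do not dominate the distance; whether and how this was done in the study is not stated here.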

Particle Swarm Optimization (PSO)
Particle swarm optimization (PSO) was used in this study to minimize the function in Equation (6). PSO is a heuristic algorithm proposed by Kennedy and Eberhart in 1995 [27]. For a detailed description, please refer to Kennedy and Eberhart [27] and Poli et al. [28]. Each candidate solution of the optimization problem is called a particle. Particles move through an n-dimensional search space with a certain velocity, and a fitness function is used to evaluate the merit of each particle. The particles remember and track the personal best position Pbest and the global best position Gbest; the velocity is calculated according to the particle's own flight experience and the best particle position. The position and velocity of the ith particle at the kth generation are each represented as an n-dimensional vector, and in the (k + 1)th generation they are updated using $Pbest_{i,k}$, the previous personal best position of the particle, and $Gbest_k$, the global best position of all particles. Here, rand is a random value from 0 to 1, $c_1$ is the acceleration constant of Pbest, $c_2$ is the acceleration constant of Gbest, and $w$ is the inertia weight. The values of $w$, $c_1$, and $c_2$ can be adjusted depending on the specific problem. In this study, $w = 0.5$, $c_1 = 1$, $c_2 = 2$, the number of generations was set to 200, and the population size was set to 20.
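For illustration, the sketch below implements the commonly used form of the PSO update, $v \leftarrow w v + c_1\,\mathrm{rand}\,(Pbest - x) + c_2\,\mathrm{rand}\,(Gbest - x)$ followed by $x \leftarrow x + v$, with the parameter values quoted above. It is assumed, not confirmed, that this matches the update equations of the paper; the clamping of positions to the search bounds, the zero initial velocities, the function name `pso`, and the test function are additions made only for this example.

```python
import random


def pso(fitness, bounds, n_particles=20, n_generations=200, w=0.5, c1=1.0, c2=2.0, seed=0):
    """Minimize `fitness` over the box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_generations):
        for i in range(n_particles):
            for d in range(dim):
                # Inertia term plus attraction toward the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Clamp to the search box (an assumption of this sketch).
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:  # the global best is updated as soon as it improves
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, best_val = pso(sphere, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
print(best, best_val)
```

In the adaption problem, `fitness` would wrap a call to the performance model with the candidate coefficients $b_x$ and $c_x$ and return the prediction error over the sampling points.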

Application
In this section, the performance adaption method with cluster sampling is applied to a real E-class gas turbine power plant. The gas-path schematic of the gas turbine and the measurable gas-path parameters are shown in Figure 2 and Table 1, respectively. The main parameters under ISO conditions are summarized in Table 2. The unit is a 127.6 MW gas turbine with a system efficiency of 33.6%. The data came from steady-state operating conditions of the E-class gas turbine from June to November, 146 data points in total. Some measurements of the collected field data are shown in Figure 4.

Comparison of Random Sampling and Cluster Sampling
In this section, two sampling methods, cluster sampling and random sampling, are used to select the sampling points for the subsequent performance adaption. The number of sampling data points was set to 10.
Random sampling randomly draws 10 points from the dataset. Cluster sampling uses the k-means clustering algorithm to group the 146 data points into 10 clusters based on the variables in Table 1. It produces 10 cluster centers and selects the points closest to the cluster centers as the sampling points. To compare the two sampling methods, the coverage of the total dataset and the coefficient of variation (CV, representing the degree of dispersion) of each sampling set were calculated by Equations (13) and (14), respectively:

$$\mathrm{Coverage} = \frac{x_{\max,\text{sample set}} - x_{\min,\text{sample set}}}{x_{\max,\text{total data}} - x_{\min,\text{total data}}} \tag{13}$$

$$\mathrm{CV} = \frac{\sigma}{\mu} \tag{14}$$

where $x_{\max,\text{sample set}}$ and $x_{\min,\text{sample set}}$ are the maximum and minimum values of the variables in Table 1 among the selected sample set, respectively; $x_{\max,\text{total data}}$ and $x_{\min,\text{total data}}$ are the maximum and minimum values of the variables among the total dataset, respectively; $\sigma$ is the standard deviation of the selected sample set; and $\mu$ is its mean value. The analysis results of the two sampling methods are shown in Figure 5.
It can be seen that the cluster sampling method collects data points with differences that are as large as possible, which increases the data diversity and extends the coverage of the total dataset. In contrast, random sampling may lead to repeated sampling and missed sampling.
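The two indicators can be computed per variable as in the following Python sketch. The use of the population standard deviation is an assumption, since the text does not state which estimator was used, and the temperature values are hypothetical.

```python
import statistics


def coverage(sample, total):
    """Range of the sample set divided by the range of the total dataset."""
    return (max(sample) - min(sample)) / (max(total) - min(total))


def coefficient_of_variation(sample):
    """Standard deviation over mean (population standard deviation assumed)."""
    return statistics.pstdev(sample) / statistics.mean(sample)


# Hypothetical compressor inlet temperatures in K.
total = [280.0, 290.0, 300.0, 310.0, 320.0]
cluster_set = [280.0, 300.0, 320.0]   # spread-out points, as cluster sampling favors
random_set = [300.0, 300.0, 310.0]    # repeated, similar points

print(coverage(cluster_set, total), coefficient_of_variation(cluster_set))
print(coverage(random_set, total), coefficient_of_variation(random_set))
```

A sampling set that repeats similar operating conditions scores low on both indicators, which is the behavior the comparison in Figure 5 is meant to expose.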

Optimization Results
In Section 3.2, two 10-point sampling datasets were formed using random sampling and cluster sampling. Based on these two sampling sets, the PSO algorithm was used to minimize the performance measurement error by tuning the coefficients of the scaling factor functions. Taking the PSO run based on cluster sampling as an example, its convergence process is shown in Figure 6.
In Figure 6, the red dots represent the iteration steps. It can be seen that the optimization process converges at the 114th generation, where the fitness function is 0.610%. During the optimization process, the fitness function decreases continuously, indicating that the PSO algorithm significantly reduces the prediction error of the four measurement variables by continuously adjusting the correction coefficients. The average prediction errors of each measurement variable before and after performance adaption are shown in Table 3. According to Table 3, after performance adaption, the accuracy of the performance model was improved: the errors of the power, compressor outlet temperature, and compressor outlet pressure reached 0.5%, and the error of the turbine outlet temperature reached around 1.1%.

Comparison of the Prediction Using Random Sampling and Cluster Sampling
In the previous section, the performance adaption processes of two ten-point sampling datasets were completed using the PSO algorithm, and the performance models after adaption based on data points of the random sampling set and the cluster sampling set were obtained, which were denoted as Model 1 and Model 2, respectively. This section compares and discusses the accuracy of these two models using the remaining data points (excluding the sampling points). The prediction errors of the two models based on different sampling methods at the remaining data points are shown in Figure 7.
As can be seen from Figure 7, cluster sampling gives better results than random sampling for the power and compressor outlet pressure predictions. The prediction accuracy of the compressor outlet temperature is similar for both methods and relatively high, with errors of around 0.3%; the prediction accuracy of the turbine outlet temperature is also similar for both methods. Table 4 lists the mean error and maximum error of each measured variable in the dataset.
According to Table 4, the models obtained using random sampling and cluster sampling both have average errors of less than 1.2% for each predicted variable, and the overall average error of cluster sampling is lower than that of random sampling. For the power and compressor outlet pressure, random sampling performs slightly worse than cluster sampling in both the average and maximum errors: the average error of power is higher by 0.559% and that of the compressor outlet pressure by 0.302%, and their maximum errors are higher by 1.193% and 1.041%, respectively. For the compressor outlet temperature, the prediction accuracy of both methods is similar, with random sampling results slightly higher than cluster sampling results, but they are generally at the same level of accuracy. For the turbine outlet temperature, the average prediction error is 0.586% for Model 1 and 0.650% for Model 2, which is essentially the same accuracy level. The maximum error of Model 1 is 3.027%, which is relatively high, while that of Model 2 using cluster sampling is 2.037%, improving by 0.685% in comparison with random sampling. In both cases, most prediction errors are below 1.5%: 93.6% for random sampling and 96.0% for cluster sampling. Cluster sampling therefore appears to reduce both the maximum error and the average error. This may be because cluster sampling takes multiple clustered operating conditions into account and selects the center point of each, which balances accuracy over the full range while remaining representative of each cluster.
Figure 7. Comparison of the prediction errors of four different measurements between the two models.
To analyze the accuracy improvement created by cluster sampling, the dataset was divided into two parts. As mentioned in Section 3.2, the k-means algorithm was used to divide the dataset into 10 categories. Cluster sampling selects the point closest to the cluster center from each cluster to form a sample set, while random sampling picks points directly from the data. Thus, the random sample set may include some clusters and exclude others, so the data can be classified into two groups: dataset 1, which contains the clusters that were sampled by random sampling, and dataset 2, which contains the clusters that were not sampled by random sampling. Then, the prediction accuracies of Model 1 and Model 2 were evaluated for both groups of data, and the results are shown in Table 5.
Comparing columns 1 and 3 of Table 5, it can be seen that the model built using random sampling shows accuracy differences between the two datasets. Its prediction accuracy in the data categories not covered by random sampling is lower than that in the covered categories. This is reflected in the average and maximum errors of almost all measurement parameters, except the compressor outlet temperature, which is at almost the same accuracy level.
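The split into dataset 1 and dataset 2 can be expressed compactly, as in this Python sketch. It assumes that a cluster label is available for every data point and that the indices of the randomly sampled points are known; the function name and the example labels are hypothetical.

```python
def split_by_random_coverage(labels, random_sample_indices):
    """Split point indices by whether their cluster was hit by the random sample."""
    covered = {labels[i] for i in random_sample_indices}
    dataset_1 = [i for i, lab in enumerate(labels) if lab in covered]      # sampled clusters
    dataset_2 = [i for i, lab in enumerate(labels) if lab not in covered]  # missed clusters
    return dataset_1, dataset_2


# Seven hypothetical points in three clusters; the random sample hits clusters 0 and 2 only.
labels = [0, 0, 1, 1, 2, 2, 2]
dataset_1, dataset_2 = split_by_random_coverage(labels, [0, 1, 4])
print(dataset_1, dataset_2)
```

When evaluating the models, the sampled points themselves would additionally be excluded, since the comparison uses only the remaining data points.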
This result indicates that whether the sampling dataset contains similar operating conditions has a certain impact on the model prediction accuracy, and all operating conditions should be included as far as possible. For cluster sampling (columns 2 and 4), there is a slight difference between the two categories of data, but it is much smaller than that of the random sampling results, and the accuracy is at almost the same level. This may be because performance adaption is a process of balancing the accuracy of each point, which results in a decrease in accuracy on these categories, or it may be due to some randomness in the results. Comparing columns 1 and 2 in Table 5, although both sampling methods cover this part of the data categories, and random sampling contains even more data points from them than cluster sampling, the prediction accuracy does not improve considerably. This may be because cluster sampling not only covers more categories but also uses the points closest to the cluster centers, which helps to extract the internal features of the data and increases the representativeness of the data points. Therefore, more accurate prediction results can be obtained using the cluster sampling method.
Based on Model 2 obtained by the cluster sampling method, the isentropic efficiency of the turbine and the system efficiency were also calculated, and the results are shown in Figure 8. It can be seen that the isentropic efficiency of the turbine was around 88.5% and the system efficiency was around 31.4%.
Figure 8. Prediction of the isentropic efficiency of the turbine and system efficiency.

Discussion of Random Sampling and Cluster Sampling for a Long Period
To compare and demonstrate the performance of two sampling methods using longterm data, the field data were extended to a one-year period. Figure 9 shows the compressor inlet temperature data in a one-year period. The data were collected every minute and the points with loads below 75% were removed.

Overall, summer is the peak of electricity consumption, and the gas turbine units run for longer periods, resulting in more data points. In contrast, there were fewer data in winter. The mean value of the compressor inlet temperature was 301.8 K, with a range of [275.9 K, 316.3 K] over the one-year period. The distribution of the dataset is shown in Figure 10. As can be seen in Figure 10, the data have three distribution centers, corresponding to winter, spring/autumn, and summer. The operation conditions of the field data are naturally over-dispersed due to the over-dispersion of environmental conditions. Field data at high (about 315 K) or low (about 275 K) compressor inlet temperatures are much scarcer than data at about 307 K.
If the random sampling method is used, theoretically, the distribution of the collected data will be consistent with the original data distribution: mainly concentrated around 307 K in summer, with the rest of the data distributed around 294 K and 284 K. As a result, the rare operation conditions are likely to be masked by the common operation conditions. Furthermore, considering the computation cost of performance adaption, few sampling points are selected from the original data, making it more likely to miss some operating conditions.
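The risk described above can be made concrete with a short calculation. If a rare operating condition accounts for a fraction p of the data, the chance that n independent uniform draws all miss it is (1 - p)^n. The sketch below assumes independent draws (a good approximation when the dataset is much larger than the sample), and the rare fractions are hypothetical, not taken from the field data.

```python
def miss_probability(rare_fraction, n_samples):
    """Probability that n independent uniform draws all miss a rare category."""
    return (1.0 - rare_fraction) ** n_samples


# Hypothetical rare fractions, evaluated for a 10-point random sample.
for p in (0.01, 0.05, 0.10):
    print(f"rare fraction {p:.2f}: "
          f"miss probability {miss_probability(p, 10):.3f}")
```

Even a condition that makes up 10% of the data is missed by a 10-point random sample roughly one time in three under these assumptions.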
Cluster sampling clusters the entire dataset based on the minimum distance to the cluster centers, ensuring the diversity and representativeness of the sampling points. Correspondingly, the clustering method was proposed to detect these rare operation conditions. Taking 10 sampling points as an example, the clustering results and the sampling points obtained by the cluster sampling method are shown in Figure 11, together with random sampling points marked by green triangles. Compared with the random sampling points, the cluster sampling points marked by red circles cover a more diverse range of operating conditions and detect the rare operation conditions at both low and high temperatures. As can be seen, cluster sampling maintains its advantages of diversity and representativeness even with a longer data collection time and a larger data volume.
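The selection rule (cluster the data, then keep the point nearest each cluster center) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it clusters a single variable (compressor inlet temperature) with plain k-means, uses a deterministic farthest-point initialization that is an assumption of this sketch, and runs on synthetic data with three artificial groups.

```python
import random


def farthest_point_init(values, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    minimum, then repeatedly add the value farthest from all chosen centers."""
    centers = [min(values)]
    while len(centers) < k:
        centers.append(max(values,
                           key=lambda v: min(abs(v - c) for c in centers)))
    return centers


def kmeans_1d(values, k, n_iter=50):
    """Lloyd's algorithm on scalar values; returns the final cluster centers."""
    centers = farthest_point_init(values, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def cluster_sample(values, k):
    """Return, for each cluster, the data point closest to its center."""
    return [min(values, key=lambda v: abs(v - c)) for c in kmeans_1d(values, k)]


# Synthetic temperatures in K: two rare groups and one dominant group.
low = [275.5 + 0.05 * i for i in range(20)]
mid = [305.0 + 0.004 * i for i in range(1000)]
high = [314.5 + 0.05 * i for i in range(20)]
values = low + mid + high

picked = cluster_sample(values, 3)
random_pick = random.Random(0).sample(values, 3)
print("cluster sampling:", sorted(round(v, 2) for v in picked))
print("random sampling: ", sorted(round(v, 2) for v in random_pick))
```

On this synthetic set, cluster sampling returns one point from each group, whereas a uniform random draw of the same size is dominated by the 1000-point group.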

Conclusions
This paper proposed a performance adaption method based on cluster sampling to adjust the component characteristic maps and minimize the prediction errors of the performance parameters. The tuning factors were the coefficients of the scaling factors, which are defined by the ratio of the original and target characteristic parameters. The optimal coefficients were determined by PSO. Through this process, the prediction errors of the performance model can be reduced. Unlike other adaption methods, the adaption based on the cluster sampling method selects more representative sampling points, which improves the model accuracy on the entire dataset.
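As a rough illustration of how PSO can tune scaling-factor coefficients, the sketch below fits a toy problem. Everything specific in it is an assumption made for illustration: the linear form `SF = a + b * x`, the normalized operating variable `x`, the map values, the bounds, and the PSO hyperparameters are hypothetical and are not the paper's model or settings.

```python
import random


def pso_minimize(objective, bounds, n_particles=30, n_iter=200,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal global-best particle swarm optimizer (illustrative only)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Clamp each coordinate to its bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


# Toy data (hypothetical): original map values at five sampled points and
# "measured" targets generated with a known scaling factor 1.05 - 0.02 * x.
x_points = [-1.0, -0.5, 0.0, 0.5, 1.0]
original = [0.85, 0.87, 0.88, 0.87, 0.86]
target = [(1.05 - 0.02 * x) * y for x, y in zip(x_points, original)]


def mean_squared_relative_error(coeffs):
    """Prediction error of the scaled map against the targets."""
    a, b = coeffs
    errors = [((a + b * x) * y - t) / t
              for x, y, t in zip(x_points, original, target)]
    return sum(e * e for e in errors) / len(errors)


coeffs, err = pso_minimize(mean_squared_relative_error,
                           bounds=[(0.8, 1.2), (-0.2, 0.2)])
print("fitted coefficients:", [round(c, 4) for c in coeffs])
```

The sampling points passed to the objective play the role of the sample set: the optimizer only reduces error on the points it is given, which is why the choice of sampling points matters.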
The proposed method was applied to a real E-class gas turbine. The simulated performance based on the cluster sampling method was compared with the simulated performance based on the random sampling method. On the entire dataset, the average and maximum errors of the simulated performance based on the random sampling method were 0.661% and 1.552%, respectively, whereas those based on the cluster sampling method were 0.466% and 1.088%, respectively, both lower than with random sampling. In the data categories that were not included in the random sample set, the average and maximum errors of the simulated performance based on the random sampling method were 1.067% and 1.552%, respectively, whereas those based on the cluster sampling method were 0.479% and 0.887%, respectively. The performance adaption based on cluster sampling enhances the prediction accuracy of the model on the entire dataset and the prediction stability under some operating conditions, and helps to improve the application effect in performance estimation and gas path diagnosis.
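The size of these improvements can be expressed as relative error reductions. The short calculation below uses only the error values reported above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# (random sampling error %, cluster sampling error %) as reported above.
reported = {
    "entire dataset, average": (0.661, 0.466),
    "entire dataset, maximum": (1.552, 1.088),
    "categories missed by random sampling, average": (1.067, 0.479),
    "categories missed by random sampling, maximum": (1.552, 0.887),
}

for label, (random_err, cluster_err) in reported.items():
    reduction = 100.0 * relative_reduction(random_err, cluster_err)
    print(f"{label}: {reduction:.1f}% lower")
```

This gives reductions of roughly 30% on the entire dataset and roughly 43% to 55% on the categories that random sampling missed.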
Some aspects that can be enhanced in future research are: (1) Clustering is sensitive to outliers in the data. When the data are large, clustering can be used to automatically remove outliers before performing cluster sampling. (2) The current data collection does not include the complete operating data in winter and spring. The model stability on unknown operating conditions can be verified and adjusted in subsequent research.
Author Contributions: Conceptualization, J.K., W.Y. and H.Z.; methodology, J.K. and J.C.; software, J.K. and J.C.; writing-original draft preparation, J.K.; writing-review and editing, J.C., W.Y. and H.Z. All authors have read and agreed to the published version of the manuscript.