Combine Clustering and Machine Learning for Enhancing the Efficiency of Energy Baseline of Chiller System

An energy baseline is an important method for measuring the energy-saving benefits of a chiller system, and the benefits can be calculated by comparing the predictions of a model with actual results. Currently, machine learning is often adopted as the prediction model for energy baselines. Common models include regression, ensemble learning, and deep learning models. In this study, we first reviewed several machine learning algorithms that have been used to establish prediction models. Then, the concept of clustering was adopted to preprocess the chiller data. Data mining, K-means clustering, and the gap statistic were used to identify the critical variables for clustering chiller modes. Applying these key variables effectively enhanced the quality of the chiller data, and combining the clustering results with the machine learning model effectively improved the prediction accuracy of the model and the reliability of the energy baselines.


Introduction
With the popularity of sustainable development concepts, an increasing number of enterprises are adopting energy conservation and carbon reduction as a significant aspect of corporate development. In most enterprises today, air-conditioning systems are the most energy-intensive equipment. In turn, chiller systems are the most energy-intensive subsystems of air-conditioning systems. Therefore, improving the energy efficiency of chiller systems can significantly reduce the energy consumption of the entire system.
Once the energy efficiency of chiller system is improved, our next focus is the effectiveness and benefits of the improvement methods. In this stage, accurately assessing the energy efficiency of improvement methods becomes a critical topic. Currently, the most widely used method is the establishment of energy baselines. An energy baseline refers to the collection of data within a time period before equipment improvement. The collected data can then be used to establish the mathematical equations that can describe the operation modes of equipment. This process is known as baseline modeling. Then, data are collected within a time period after equipment improvement to determine the prediction values of the post-improvement data in the baseline model. Finally, energy efficiency can be calculated by comparing the prediction values and post-improvement data.
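The baseline procedure described above can be illustrated with a minimal sketch. It assumes a one-variable linear baseline fitted by least squares, which is far simpler than the models studied in this paper; the function names and the data values are hypothetical and serve only to show the order of the steps.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (one predictor)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def baseline_savings(pre_x, pre_y, post_x, post_y):
    """Fit the baseline on pre-improvement data, then compare its
    predictions with the measured post-improvement consumption."""
    a, b = fit_line(pre_x, pre_y)
    # Consumption the unimproved equipment would have shown under post conditions
    predicted = [a + b * xi for xi in post_x]
    return sum(predicted) - sum(post_y)  # positive value means energy was saved


# Hypothetical data: cooling load (RT) and power (kW)
pre_load = [100.0, 120.0, 140.0, 160.0]
pre_power = [80.0, 94.0, 108.0, 122.0]
post_load = [110.0, 150.0]
post_power = [78.0, 104.0]

savings_kw = baseline_savings(pre_load, pre_power, post_load, post_power)
```

The accuracy of the estimated saving depends entirely on how well the baseline model reproduces the pre-improvement behavior, which is why the prediction error of the baseline model is the focus of the remaining sections.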
Because energy baselines are an essential approach for assessing the improvement performance of chiller systems, many studies have focused on developing chiller prediction models. The models can be predominantly classified into semi-empirical models and empirical models. Semi-empirical models use equations derived from relevant laws of physics to describe the performance of chiller systems. For example, Lee and Reddy developed regression models to predict the coefficient of performance (COP) of screw chillers and centrifugal chillers [1,2]. Empirical models are data-oriented models. Equations that describe chiller performance can be established without having to collect chiller-related system data. For example, Adnan et al. combined artificial neural network (ANN) models of different structures and used three variables, specifically refrigeration ton, inlet temperature, and outlet temperature, to create a chiller prediction model [3]. Kim et al. used different combinations of input variables to identify the ANN model with the highest prediction accuracy [4]. Yu et al. used a random forest model to predict the operating parameters that maximize chiller COP under different working conditions [5,6].
The development of prediction models can effectively enhance the accuracy of energy baseline predictions. Nonetheless, chiller systems are intricate pieces of equipment. Many operating parameters must be collected, and operating modes may vary depending on the setting. Appropriately preprocessing the data can improve overall analysis efficiency. Clustering is an excellent data preprocessing approach. It functions by calculating the relationships between data points and identifying hidden data structures. Malinao et al. applied the X-means clustering method to cluster chiller system data and identify different operating modes [7]. Habib et al. used a two-layer K-means algorithm to cluster chiller system data and to identify and remove outliers, thereby enhancing energy analysis efficiency [8]. Habib et al. also combined K-means, BoWR, and hierarchical clustering to preprocess chiller data. The researchers proposed a model to automatically detect the energy systems of different constructs. The model can be used for fault detection and diagnosis [9].
The operating modes in different conditions can be identified by clustering chiller data. This process enhances data quality and usability, thereby improving analysis efficiency. However, existing studies mostly used clustering for fault detection and diagnosis and rarely used preprocessed data in the development of prediction models. Therefore, using the COP of chiller system as the target of research, we applied a clustering method to preprocess chiller data and identify the operating modes of chiller system in different settings. In addition, a machine learning method was used to create prediction models for various operating modes.
The contribution of this paper is a methodology for improving the prediction accuracy of chiller system models. The chiller system examined in this study was a 230RT air-conditioning chiller equipped with a variable-frequency centrifugal compressor. The methodology first selected K-means as the clustering method based on the characteristics of the data. Then, we used data mining and statistical techniques to identify the critical variables for clustering. After identifying the critical variables, we applied K-means clustering and the gap statistic to cluster the chiller modes. To obtain the best prediction accuracy, the number of clusters was calibrated when needed. Finally, we combined the clustering results and machine learning models to establish a prediction model of the chiller system. The simulation showed that the error rate of the prediction model was reduced and that the prediction accuracy of the chiller energy baselines was enhanced without excessively increasing computational cost.
The structure of this paper is as follows. In Section 2, we introduce commonly used chiller-related prediction models, such as regression models, ANN models, and random forest models. The extreme gradient boosting model, which has gained considerable popularity in recent data analysis competitions, is also compared. In Section 3, the data, modeling, and model assessment criteria are discussed. In Section 4, a prediction simulation is performed on the data and the results are discussed. In Section 5, the conclusions of this study are provided.

Review of Machine Learning Algorithm
In this section, we review several machine learning algorithms that have been used to establish prediction models of chiller systems or in related work. Here, we briefly review the final mathematical form of each model; a detailed formulation is given in Appendix A. Lee combined the laws of thermodynamics and heat exchangers to develop a prediction model of screw chillers [1]. Equation (1) describes the prediction model of the coefficient of performance (COP), where $A_0$, $A_1$, and $A_2$ are coefficients of the model and can be derived by regression analysis (see Appendix A).

Multivariate Polynomial Regression Model
Reddy and Andersen used three variables, specifically cooling capacity, cooling water inlet temperature, chilled water outlet temperature, and their interaction, to create a multivariate regression model of centrifugal chillers [2]. Equation (2) describes the prediction model:

Artificial Neural Networks
A basic ANN framework is illustrated in Figure 1. Blue circles mark the neurons. They are responsible for recording values. The arrows illustrate the neural connections and the direction of data transfer. The framework can be broadly categorized into an input layer, hidden layer, and output layer. The hidden layer is responsible for receiving and converting data from the input layer and transferring the converted data to the output layer to derive a solution. The structure of the hidden layer and the data conversion method influence the quality of the overall ANN.
Equation (3) describes the relationship between the inputs $x_i$ and the jth node in the hidden layer, where $w_{ij}$ is the connection weight and $b_j$ is the bias; i indexes the input nodes, and j indexes the hidden nodes. $\sigma$ is an activation function that transfers the inputs to the hidden layer by way of a nonlinear transformation. Widely used activation functions include the sigmoid, ReLU, and softmax functions. Equation (4) describes the relationship between the jth node in the hidden layer and the output $\hat{y}_k$. ANN models can derive the optimal solution for the parameters (w, b) by differentiating the loss function, which is expressed as $L = \mathrm{loss}(y, \hat{y})$. If the hidden layer comprises more than one sublayer, it may be challenging to derive the optimal solutions for the parameters of the various layers using common differentiation methods. In this instance, the chain rule of calculus can be applied to derive the solutions.
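The mapping from inputs to hidden nodes and from hidden nodes to outputs can be sketched as a forward pass. This is a minimal illustration under stated assumptions: a single hidden layer, a sigmoid activation, and a linear output layer. The output layer form, the function names, and the weights are assumptions made for the example and are not taken from the original equations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_in, b_hidden, w_out, b_out):
    """One-hidden-layer forward pass.

    w_in[i][j]  : weight from input i to hidden node j
    w_out[j][k] : weight from hidden node j to output k (linear output assumed)
    """
    n_hidden = len(b_hidden)
    # Hidden node j: activation of the weighted input sum plus bias
    hidden = [
        sigmoid(sum(x[i] * w_in[i][j] for i in range(len(x))) + b_hidden[j])
        for j in range(n_hidden)
    ]
    # Output k: weighted sum of the hidden node values plus bias
    outputs = [
        sum(hidden[j] * w_out[j][k] for j in range(n_hidden)) + b_out[k]
        for k in range(len(b_out))
    ]
    return hidden, outputs


# Hypothetical weights: 3 inputs, 2 hidden nodes, 1 output
x = [0.5, -1.0, 2.0]
w_in = [[0.0, 1.0], [0.0, 0.5], [0.0, 0.0]]
b_hidden = [0.0, 0.0]
w_out = [[2.0], [0.0]]
b_out = [1.0]

hidden, y_hat = forward(x, w_in, b_hidden, w_out, b_out)
```

Training would then adjust the weights and biases by differentiating the loss with respect to each parameter, applying the chain rule layer by layer.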

Random Forest
Random forest is a classic ensemble learning algorithm. Predictions are made by combining the results of multiple classification and regression tree (CART) models. When developing each CART in a random forest, the data and the variables are repeatedly sampled to increase the differences between models and prevent the overfitting problem common to CART models. The random forest prediction $\hat{y}$ is expressed in terms of the following quantities: $R_k$ is the kth output space, $C_k$ is the average value of $R_k$, and m is the number of CARTs in the random forest model.

Extreme Gradient Boosting
Extreme gradient boosting (XGBoost) has recently become a popular method in data analysis competitions. It is a strong ensemble learning algorithm improved from the gradient boosting decision tree (GBDT) algorithm [10]. In recent years, XGBoost has been actively applied to energy-related issues [11-14].
XGBoost combines the results of CART models one by one to establish the prediction model and uses the residual as the prediction target. For a given data set with n examples and d features, $D = \{(x_i, y_i)\}$ ($x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $i = 1, \ldots, n$), Equation (6) describes a tree ensemble model using K additive functions to predict the output $\hat{y}_i$, where $w_{q(x_i)}$ is the CART model. To learn the optimal parameters used in the prediction model, Equation (7) describes the regularized objective function $Obj$, where l is a differentiable convex loss function and $\Omega$ is the complexity of the model. For a fixed structure $q(x_i)$, the optimal parameter $w_j^*$ and the corresponding value $Obj^*$ of output space j can be calculated from $G_j$ and $H_j$, which represent the sums of the first- and second-order gradient statistics in output space j.
Energies 2020, 13, 4368 5 of 20
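The role of $G_j$ and $H_j$ can be made concrete with a small sketch. It assumes the standard XGBoost closed forms, $w_j^* = -G_j / (H_j + \lambda)$ and a per-leaf objective contribution of $-\tfrac{1}{2} G_j^2 / (H_j + \lambda) + \gamma$, together with the squared-error loss $(y_i - \hat{y}_i)^2$ named in Appendix A. The function name and sample values are hypothetical.

```python
def leaf_weight_and_score(y_true, y_prev, lam=1.0, gamma=0.0):
    """Optimal weight and objective contribution of one leaf (output space j)
    under the squared-error loss l = (y - y_hat)^2."""
    # First-order gradients of the loss at the previous predictions
    g = [2.0 * (p - y) for y, p in zip(y_true, y_prev)]
    # Second-order gradients are constant for the squared-error loss
    h = [2.0 for _ in y_true]
    G, H = sum(g), sum(h)
    w_star = -G / (H + lam)
    score = -0.5 * G * G / (H + lam) + gamma  # a single leaf, so T = 1
    return w_star, score


# Hypothetical leaf: three samples whose previous prediction is 0
w_star, score = leaf_weight_and_score([3.0, 4.0, 5.0], [0.0, 0.0, 0.0], lam=0.0)
```

With the regularization term switched off (lam=0.0), the leaf weight equals the mean residual of the samples in the leaf, which matches the description of XGBoost as fitting residuals one tree at a time.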

Clustering
Clustering is an unsupervised machine learning method. The purpose of clustering is to analyze the distance relationships between data points and identify underlying data structures, thereby helping users to carry out advanced data analysis. Depending on the nature of the data, clustering approaches can be based on data prototype, class, density, or graph. Table 1 summarizes common clustering algorithms and their applicability, as documented by the popular machine learning library scikit-learn (https://scikit-learn.org/stable/modules/clustering.html). This subsection introduces the K-means clustering and gap statistic used in this research.

K-Means
K-means is a clustering method with relatively simple computational procedures [15]. Although K-means fails to obtain good results in some cases, such as data with nonspherical clusters, unequal variances, or unequal densities, it remains popular for its simplicity of implementation, known limitations, and excellent fine-tuning capabilities [16]. Several researchers have proposed different methods to solve different problems [17,18].
K-means can be performed in three steps. First, k cluster centers are randomly initialized. Then, the Euclidean distance between each sample and each of the k cluster centers is determined, and the sample is assigned to its nearest cluster. Finally, the center of each cluster is updated from the samples assigned to it, and the assignment and update steps are repeated until every sample is assigned to the cluster whose center is nearest to it. A detailed formulation is described in Appendix A.
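The three steps above can be sketched for one-dimensional data, which is the relevant case when a single variable is used for clustering. This is a plain implementation of Lloyd's algorithm written for illustration; the function name, the stopping rule based on unchanged assignments, and the sample values are assumptions, and the study itself relied on a library implementation.

```python
import random


def kmeans_1d(values, k, n_iter=100, seed=0):
    """Lloyd's algorithm for one-dimensional data."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # step 1: random initial centers
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest center
        # (in one dimension the Euclidean distance is the absolute difference)
        new_labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        # Step 3: move each center to the mean of its assigned samples
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = sum(members) / len(members)
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
    return centers, labels


# Hypothetical flow readings forming two well-separated groups
flows = [10.0, 11.0, 12.0, 100.0, 101.0, 102.0]
centers, labels = kmeans_1d(flows, k=2)
```

Because the initial centers are random, different seeds can give different results on less clearly separated data, which is one reason the number of clusters and the clustering outcome are checked visually later in the procedure.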

Gap Statistic
The idea of the gap statistic is to compare the total intracluster variation $W_c$ with its expectation under an appropriate null reference distribution of the data [19]. The estimate of the optimal k is the value for which the total intracluster variation falls the farthest below this reference curve. Hence, the optimal k is the smallest k that satisfies Expression (10), in which Gap(k) and $s_{k+1}$ are described by Equations (11) and (12) and B is the number of samples drawn from the reference distribution.
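The selection rule can be sketched in code. The sketch assumes the standard formulation of the gap statistic: Gap(k) is the mean of the log dispersion over the B reference data sets minus the log dispersion of the real data, $s_k$ is the standard deviation of the reference log dispersions multiplied by $\sqrt{1 + 1/B}$, and the chosen k is the smallest one with Gap(k) ≥ Gap(k+1) − $s_{k+1}$. Whether Expressions (10) to (12) match this form exactly is an assumption, and the function names and numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def gap_and_s(log_w, ref_log_w):
    """Gap value and its tolerance s for one value of k.

    log_w     : log of the within-cluster dispersion of the real data
    ref_log_w : the same quantity for each of the B reference data sets
    """
    b = len(ref_log_w)
    gap = mean(ref_log_w) - log_w
    s = pstdev(ref_log_w) * math.sqrt(1.0 + 1.0 / b)
    return gap, s


def choose_k(gaps, s_values, k_min=1):
    """Smallest k with Gap(k) >= Gap(k+1) - s_(k+1); gaps[0] belongs to k_min."""
    for idx in range(len(gaps) - 1):
        if gaps[idx] >= gaps[idx + 1] - s_values[idx + 1]:
            return k_min + idx
    # No k satisfied the rule: fall back to the largest k that was tested
    return k_min + len(gaps) - 1


# Hypothetical gap curve for k = 1..5
gaps = [0.10, 0.35, 0.80, 0.82, 0.81]
s_values = [0.02, 0.02, 0.03, 0.03, 0.03]
best_k = choose_k(gaps, s_values)
```

In this example the gap curve flattens after k = 3, so the further gain at k = 4 is within the tolerance and the smaller k is preferred.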

Data Description and Statistic
The data examined in this study were from a chiller monitoring system in an undisclosed research center. The system was a 230RT air-conditioning chiller equipped with a variable-frequency centrifugal compressor. The operating data between April 2018 and May 2019 were collected. Each data point represents one minute. After excluding the idle and maintenance times, a total of 316,749 data points and 28 variables were retained. Using a ratio of 8:2, 253,399 data points were used for training, and 63,350 data points were used for testing. The target of research was the COP of the chiller system. The descriptive statistics of the training data and the key variables are illustrated in Figure 2 and tabulated in Table 2.
The top left panel of Figure 2 is a trend chart for COP. The chart shows that the COP values were predominantly distributed between 2 and 6. The top right panel is a trend chart for power consumption. The chart shows that power consumption was significantly higher in specific periods. The range of the data was approximately 100. The lower left panel is a trend chart for load rate. The distribution was similar to that of COP. The lower right panel is a trend chart for chilled water flow.
Next, the maximal information coefficient (MIC) with respect to COP was calculated for each variable. MIC provides a measure of the strength of the linear or nonlinear association between two variables [20]. To ensure a fair comparison, MIC values are normalized to lie between zero and one. Table 3 lists some of the variables with higher coefficients.
Subsequently, scatter diagrams were plotted to observe the relationships between variables. The scatter relationships of interest are plotted in Figures 3-5. Figure 3 is a scatter diagram of COP and kW/RT. COP and kW/RT presented a reciprocal relationship. The anticipated result was a single curve convex to the origin. Instead, five curves and numerous sporadic scatter points appeared in Figure 3. Therefore, we speculated that other variables influenced the scattering of COP and kW/RT.
Figure 4 shows that the data were clearly distributed into six distinct clusters. Most of the condenser flow trend values ranged between 175 and 200, and the degree of COP dispersion increased with the condenser flow trend. Figure 5 is a scatter diagram of the chilled water flow and COP. The degree of COP dispersion increased with the chilled water flow. A block distribution of data points could be vaguely observed.
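The reciprocal relationship between COP and kW/RT noted for Figure 3 follows from the definitions of the two quantities. COP is cooling output divided by electrical input, while kW/RT is electrical input divided by cooling output expressed in refrigeration tons. The sketch below assumes the US refrigeration ton (12,000 BTU/h, about 3.517 kW); the paper does not state which definition of RT the monitoring system used, so the constant is an assumption.

```python
KW_PER_RT = 3.517  # assumed: 1 US refrigeration ton is about 3.517 kW of cooling


def cop_from_kw_per_rt(kw_per_rt):
    """COP = cooling output / electrical input, both expressed in kW."""
    return KW_PER_RT / kw_per_rt


# Lower kW/RT means higher COP, tracing a single curve convex to the origin
cop_values = [cop_from_kw_per_rt(v) for v in (0.6, 0.8, 1.0)]
```

Under this identity all points should fall on one curve, which is why the five separate curves observed in Figure 3 suggest that other variables affect the recorded values.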

Model
This subsection describes the integration of clustering and machine learning. First, a suitable clustering approach was selected based on the data characteristics. To obtain a robust clustering effect, we also recommend using other clustering methods as validation. Second, the necessity of estimating the optimal clustering value k was determined based on the selected approach. Estimation methods primarily included the elbow method, silhouette coefficient, and gap statistic. Third, the clustering method was employed to cluster the training data, and the necessity of adjusting the clustering value k was determined by observing the clustering trends. Fourth, the clusters were incorporated into a chiller prediction model to optimize the parameters and derive the final prediction model. Finally, the results of the different prediction models were compared based on the test data and the model assessment standards. Figure 6 summarizes the flowchart for establishing the prediction model.

Figure 6. Flowchart of establishing the prediction model. The procedure first selected a suitable clustering method according to the training data. Then, we determined the best number of clusters k, if the clustering method required it. After obtaining the clustering result, we drew and observed the scatterplot of the clusters to determine whether to adjust k. Finally, we used the machine learning algorithm to train on each cluster to obtain the prediction models, and optimized these models to obtain the final model.
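The integration step, in which one model is trained per cluster and each new sample is routed to the model of its cluster, can be sketched as follows. The sketch assumes that clusters are defined by the nearest center on a single clustering variable and uses a mean predictor as a stand-in for the machine learning model; all names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def nearest_center(value, centers):
    """Index of the cluster center closest to a clustering-variable value."""
    return min(range(len(centers)), key=lambda c: abs(value - centers[c]))


def fit_mean_model(rows):
    """Stand-in learner: predicts the mean target of its training rows.
    Any regressor playing the same fit/predict role could replace it."""
    avg = mean(target for _, target in rows)
    return lambda features: avg


def fit_per_cluster(cluster_values, features, targets, centers, fit=fit_mean_model):
    """Train one model per cluster of the training data."""
    groups = defaultdict(list)
    for cv, x, y in zip(cluster_values, features, targets):
        groups[nearest_center(cv, centers)].append((x, y))
    return {c: fit(rows) for c, rows in groups.items()}


def predict(cluster_value, x, centers, models):
    """Route a sample to the model of its nearest cluster.
    Raises KeyError if that cluster had no training samples."""
    return models[nearest_center(cluster_value, centers)](x)


# Hypothetical data: condenser flow as clustering variable, COP as target
centers = [50.0, 180.0]
flow = [48.0, 52.0, 178.0, 183.0]
load = [[0.4], [0.5], [0.7], [0.8]]
cop = [3.0, 3.2, 5.0, 5.4]

models = fit_per_cluster(flow, load, cop, centers)
cop_hat = predict(181.0, [0.75], centers, models)
```

Keeping the learner as a parameter makes it straightforward to compare several model types on the same clusters.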

Evaluation Metrics
To evaluate the performance of the prediction models, three different metrics were used: The MSE (mean square error; Equation (13)), the CVRMSE (coefficient of variation of root-mean squared error; Equation (14)) and the MAPE (mean absolute percentage error; Equation (15)).
In these equations, $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and N is the total number of data points. MSE intuitively represents the error between the predicted and actual values. CVRMSE gives an indication of the model's ability to predict the overall load shape that is reflected in the data. MAPE provides an overall assessment of the general percent error [21]. In addition to these three metrics, we also took computation speed into account.
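The three metrics can be computed as in the sketch below. It assumes common definitions: MSE as the mean of squared errors, CVRMSE as the root of the MSE divided by the mean of the actual values, and MAPE as the mean of absolute relative errors, with the last two expressed in percent. Some definitions of CVRMSE divide by N minus the number of model parameters instead of N, so the exact form used in Equation (14) is an assumption. The sample values are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def cvrmse(y_true, y_pred):
    """RMSE divided by the mean of the actual values, in percent."""
    y_bar = sum(y_true) / len(y_true)
    return 100.0 * math.sqrt(mse(y_true, y_pred)) / y_bar


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical COP values
actual = [4.0, 5.0, 5.0, 6.0]
predicted = [4.4, 5.0, 4.5, 6.0]

scores = (mse(actual, predicted), cvrmse(actual, predicted), mape(actual, predicted))
```

MSE depends on the scale of the target, whereas CVRMSE and MAPE are relative measures, which makes them easier to compare across data sets.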

Discussion
In this section, we examine whether integrating clustering and machine learning improved the predictive accuracy of the energy baseline models. The aforementioned machine learning models and chiller data were used to train and validate the prediction models. The target of validation was chiller COP, and the variables used in this research were those with high MIC values. The simulation environment was Anaconda, a popular data science platform, and the machine learning models were taken from the scikit-learn package (https://scikit-learn.org/stable/preface.html). The assessment results on the test data are tabulated in Table 4. In Table 4, four evaluation metrics, MSE, CVRMSE, MAPE, and Time(s), are reported. The results indicate that the ensemble learning models, random forest and XGBoost, had the lowest prediction errors. The three error metrics of the XGBoost model and the random forest model were relatively similar, and the XGBoost model was faster than the random forest model. Although the evaluation metrics of the three regression models were acceptable, they were less favorable than those of the ensemble learning models, outperforming them only in computation time. The performance of the ANN model was between that of the regression models and that of the ensemble learning models. Then, we assessed whether integrating clustering and machine learning improved the accuracy of the prediction models.
According to Figure 4, the data were clearly distributed into six distinct clusters. Although the clusters appeared somewhat uneven, they were well separated from each other. Therefore, we used K-means as the clustering method and adopted the gap statistic to calculate the clustering value k. To validate this choice, we also ran and compared different clustering methods. The outcome is presented in Appendix B. The results indicate that K-means was a suitable choice for this research.
K-means clustering and gap statistic were performed on the 28 variables of the chiller data. The clustering results were then consolidated onto a graph. Based on the calculation results, the condenser flow trend was the most suitable variable of the 28 variables for clustering. Figure 7 is a scatter diagram of the condenser flow trend and COP after clustering. The diagram shows that K-means distributed the data into ten clusters.
Figure 8 is a scatter diagram of kW/RT and COP after clustering using the condenser flow trend. The diagram shows that, apart from a small number of scattered points, the data points of each cluster formed a curve convex to the origin. The data distribution was more distinct than that plotted in Figure 3. Based on the aforementioned two points, we validated that the condenser flow trend was a suitable variable for clustering chiller COP data.
Figure 9 shows the outcome of the gap statistic. The x-coordinate is the number of clusters k, and the y-coordinate is the gap value Gap(k). The optimal value for clustering is the smallest k that satisfies Expression (10). Here, the optimal value for clustering was k = 10. Subsequently, the clustered data were incorporated into the prediction models, and the individual test error and overall test error of the 10 clusters were calculated. The results are presented as the sum of squared errors (SSE) and MSE, where SSE is the MSE before averaging. The ideal results and post-integration performance of the different prediction models are tabulated in Tables 5 and 6.
Solely examining the overall error of the models, the performance of the models was similar for the clustered data and the unclustered data. A closer observation of the performance of individual clusters revealed that the models performed better in 7 of the 10 clusters compared to the unclustered data, suggesting that the poor overall performance was a direct result of a few individual clusters. We performed an in-depth review of the clustering results to explain this phenomenon and found that Clusters 1, 4, 8, 9, and 10 were the aforementioned larger data clusters with values ranging between 175 and 200. The boundaries of these clusters were less prominent than those of the other clusters. We speculate that the clustering approach adopted in this study was less capable of handling this volume of data, so that the clustering results did not fully reflect the data modes. In response, we attempted to calibrate the clusters to resolve this issue.
Two calibration methods were adopted. The first method involved independently clustering the five sets of data to eliminate the effects of the other data. The second method grouped the data of the five clusters without clear boundaries into one cluster for analysis.
The assessment results of the two calibration methods are tabulated in Table 7. The table shows that the calibrated results produced using the first method were similar to the initial clustering results. In contrast, the calibrated results produced using the second method were better than the original clustering results, suggesting that integrating clustering and machine learning can improve model predictions after appropriate calibration.
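The second calibration method amounts to relabeling the clusters without clear boundaries as a single cluster before the per-cluster models are trained. A minimal sketch is given below; the cluster numbers follow the text, while the function name and the per-sample labels are hypothetical.

```python
def merge_clusters(labels, to_merge, merged_label):
    """Relabel every cluster listed in `to_merge` as one combined cluster."""
    to_merge = set(to_merge)
    return [merged_label if lab in to_merge else lab for lab in labels]


# Clusters 1, 4, 8, 9, and 10 are the ones reported as lacking clear boundaries
labels = [1, 2, 4, 3, 8, 9, 10, 5, 6, 7]  # hypothetical per-sample cluster labels
calibrated = merge_clusters(labels, to_merge=[1, 4, 8, 9, 10], merged_label=1)
n_clusters = len(set(calibrated))
```

Merging the five clusters leaves six clusters in total, so fewer models are trained and each of them sees more data.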
The percentages of improvement between the results of this study and those of the original prediction models are tabulated in Table 8. The target of comparison was the XGBoost model, which had the best performance among the original prediction models. The results show that although computation time increased by 80% after clustering and calibration, the MSE, CVRMSE, and MAPE of the proposed method were reduced by 21.35%, 11.96%, and 19%, respectively, suggesting a significant improvement in prediction accuracy. The results confirm that clustering can effectively enhance the quality of chiller data and increase the efficiency of incorporating machine learning in the prediction of chiller data, provided that one of the following conditions is satisfied: (1) the data can be clustered well, or (2) if the clustering method fails to produce good results, the calibration approach succeeds.

Conclusions
In this study, we first simulated the common prediction models for chiller systems. The best results were produced by the random forest and XGBoost models. Then, we employed statistical analysis methods, K-means clustering, and the gap statistic to identify the ideal clustering variables and clustering value k. We successfully identified the key variables suitable for clustering and enhanced data quality and usability for prediction. We adopted MSE, CVRMSE, MAPE, and computation time as the assessment standards. After simulation and suitable calibration, MSE, CVRMSE, and MAPE improved by 21.35%, 11.96%, and 19%, respectively, without drastically increasing computation time. Therefore, we successfully improved the prediction accuracy of the model.
The findings of this study may serve as a reference for third parties responsible for assessing energy efficiency in the future. Applying the procedures outlined in this study for establishing a prediction model can effectively improve the accuracy of energy efficiency verification, reduce prediction error, and enhance the reliability of the improvement method.
In this research, the situations in which clustering methods may fail to produce good results were not exhaustively listed. In future work, the flowchart for establishing the prediction model can be expanded for application in general contexts.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
Combining Expressions (A4) and (A5), the random forest can be written as Equation (A6), where m is the number of CARTs in the random forest model.
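For a regression forest, the combined form is commonly written as the average of the individual tree predictions. The following is a sketch under the assumption that $f_i(x)$ denotes the prediction of the $i$th CART for input $x$; this notation is introduced here for illustration and may differ from the symbols used in Expressions (A4) and (A5).

```latex
\hat{f}(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x)
```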

Appendix A.3. Extreme Gradient Boosting
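For a given data set with $n$ examples and $d$ features, $D = \{(x_i, y_i)\}$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $i = 1, \ldots, n$, Equation (A7) describes a tree ensemble model that uses $K$ additive functions to predict the output:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \tag{A7}$$

where $f_k$ is the $k$th CART model. The CART model can be expressed as Equation (A8):

$$f(x_i) = w_{q(x_i)} \tag{A8}$$

where $q$ is the structure of the CART that maps the input $x_i$ to the corresponding output space, and $w$ is the vector of weights of the output spaces. To learn the optimal parameters used in the prediction model, Equation (A9) describes the regularized objective function $\mathrm{Obj}$:

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{A9}$$

where $l$ is a differentiable convex loss function and $\Omega$ is the complexity of the model. Here, the loss function is the least squares loss $(y_i - \hat{y}_i)^2$, and $\Omega$ is defined as Equation (A10):

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{A10}$$

where $T$ is the number of output spaces, and $\gamma$ and $\lambda$ are hyperparameters. Because the tree ensemble model is an additive function, the objective function should satisfy $\mathrm{Obj}^{(t)} < \mathrm{Obj}^{(t-1)}$. Let $\hat{y}_i^{(t)}$ be the prediction of the $i$th instance at the $t$th iteration; it is given by Equation (A11):

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{A11}$$

and Equation (A9) becomes Equation (A12):

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \sum_{k=1}^{t} \Omega(f_k) \tag{A12}$$

Here, the term $\sum_{k=1}^{t} \Omega(f_k)$ can be expanded to $\Omega(f_t) + \sum_{k=1}^{t-1} \Omega(f_k)$, and $\sum_{k=1}^{t-1} \Omega(f_k)$ can be regarded as a constant.
To minimize the objective function, Equation (A12) can be expanded by a second-order Taylor approximation and rewritten as follows:

$$\mathrm{Obj}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant} \tag{A13}$$

where $g_i = \partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right) / \partial\, \hat{y}_i^{(t-1)}$ and $h_i = \partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right) / \partial \left(\hat{y}_i^{(t-1)}\right)^2$ are the first- and second-order gradient statistics of the loss function.
Then, the constant terms are removed, and the objective function becomes Equation (A14):

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{A14}$$

After $f_t$ and $\Omega(f_t)$ are substituted by Equations (A8) and (A10), the objective function becomes:

$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left(H_j + \lambda\right) w_j^2 \right] + \gamma T$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, with $I_j = \{ i \mid q(x_i) = j \}$, represent the sums of the first- and second-order gradient statistics in output space $j$.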
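For a fixed structure $q(x_i)$, the optimal parameter $w_j^*$ of output space $j$ and the corresponding optimal value $\mathrm{Obj}^*$ can be calculated by

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

where $G_j$ and $H_j$ are the sums of the first- and second-order gradient statistics in output space $j$.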
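The closed-form solution for a fixed tree structure can be checked numerically with a small sketch. It assumes the least squares loss $(y_i - \hat{y}_i)^2$ stated above, for which the first-order statistic is $2(\hat{y}_i - y_i)$ and the second-order statistic is 2. The function names, the leaf assignment, and the data are hypothetical and serve only as an illustration; this is not the implementation used in this study or in the XGBoost library.

```python
def gradient_stats(y, y_prev):
    """First- and second-order gradient statistics of (y - y_hat)^2,
    evaluated at the previous-iteration predictions y_prev."""
    g = [2.0 * (p - t) for t, p in zip(y, y_prev)]
    h = [2.0 for _ in y]
    return g, h

def optimal_leaves(g, h, leaf_of, num_leaves, lam, gamma):
    """Optimal leaf weights and objective value for a fixed tree structure.

    leaf_of[i] is the index j of the output space (leaf) that instance i
    falls into, i.e., the role played by q(x_i).
    """
    G = [0.0] * num_leaves
    H = [0.0] * num_leaves
    for gi, hi, j in zip(g, h, leaf_of):
        G[j] += gi
        H[j] += hi
    weights = [-G[j] / (H[j] + lam) for j in range(num_leaves)]
    obj = -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in range(num_leaves))
    obj += gamma * num_leaves
    return weights, obj

# Hypothetical targets, previous predictions, and leaf assignment.
y = [1.0, 2.0, 4.0, 6.0]
y_prev = [0.0, 0.0, 0.0, 0.0]
leaf_of = [0, 0, 1, 1]

g, h = gradient_stats(y, y_prev)
weights, obj = optimal_leaves(g, h, leaf_of, num_leaves=2, lam=0.0, gamma=0.0)
shrunk_weights, _ = optimal_leaves(g, h, leaf_of, num_leaves=2, lam=1.0, gamma=0.0)
print(weights, obj, shrunk_weights)
```

With the regularization parameter set to zero, each leaf weight reduces to the mean residual of the instances in that leaf; a positive value shrinks the weights toward zero.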

Appendix B. Comparison of Different Clustering Methods
In this appendix, we ran and compared different clustering methods to validate whether K-means is a good choice. In total, four clustering methods were compared with K-means: Mean-shift, OPTICS, Birch, and HDBSCAN. Table A1 summarizes the detailed information of each clustering method. Table A1 shows that Birch, HDBSCAN, and K-means compute faster than Mean-shift and OPTICS. We then plotted the scatter diagram of each clustering method in Figure A1. Figure A1 shows that none of the five methods could perfectly separate the data, so a calibration method was necessary in the subsequent steps. Judging from the scatter diagrams, K-means appears to be the better method: it separated the data well and without noise, except for data with values ranging between 175 and 200. Calibrating the K-means results also appeared easier than calibrating the results of the other methods. Hence, we selected K-means as the clustering method used in this research.
Table A1. Detailed information of each clustering method.

Method      Required parameters      Computation time
Mean-shift  bandwidth            12  617
OPTICS      epsilon, MinPts      40  2432
Birch       Not necessary        3   3
HDBSCAN     Not necessary        18  40
K-means     number of clusters   10  2
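To make the selected method concrete, the following is a minimal one-dimensional sketch of the K-means iteration (Lloyd's algorithm). It assumes a single clustering variable and a simple deterministic initialization chosen for reproducibility; library implementations typically use other initializations and handle multiple variables. The function name, the variable `loads`, and the data values are hypothetical and are not taken from the chiller data of this study.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on scalar data.

    Initial centers are taken at evenly spaced positions of the sorted
    data (an assumption made here to keep the example deterministic).
    """
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # converged: centers no longer change
        centers = new_centers
    return labels, centers

# Hypothetical values of one clustering variable (illustration only).
loads = [50.0, 52.0, 55.0, 120.0, 118.0, 125.0, 180.0, 185.0, 190.0]
labels, centers = kmeans_1d(loads, k=3)
print(labels, centers)
```

In this toy example the three groups are well separated, so the iteration converges after one update; on real chiller data the groups overlap, which is why a calibration step was needed.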