Comparison and Determination of Optimal Machine Learning Model for Predicting Generation of Coal Fly Ash

Abstract: The rapid development of industry keeps increasing the demand for energy. Coal, as the main energy source, is consumed in huge quantities, resulting in the continuous generation of its combustion byproduct, coal fly ash (CFA). Accumulated CFA not only occupies a large amount of land but also causes serious environmental pollution and personal injury, so the resource utilization of CFA is receiving growing attention. However, because the amount of CFA generated is variable, predicting it in advance is the basis for effective disposal and rational utilization. In this study, CFA generation was taken as the target variable, three machine learning (ML) algorithms were used to construct models, and four evaluation indices were used to assess their performance. The results showed that the DNN model, with R = 0.89 and R² = 0.77 on the testing set, performed better than the traditional multiple linear regression equation and the other ML algorithms, demonstrating the feasibility of DNN as the optimal model framework. Applying this model framework in the engineering field enables managers to identify the next disposal step in advance and thus to allocate recycling and utilization routes rationally, maximizing the use and sales benefits of CFA while minimizing its disposal costs. In addition, sensitivity analysis further explains the internal decisions of the ML models and verifies that coal consumption is more important than installed capacity, which provides a reference for ensuring the rational utilization of CFA.


Introduction
The acceleration of industrial processes has led to the rapid development of the energy industry as a large player. As the main energy supplier, power stations based on coal and lignite provide massive amounts of energy, and coal consumption has soared [1]. Although electricity demand and coal emissions experienced a small decline in 2020, as the COVID-19 outbreak depressed energy demand, the economic stimulus package and the rollout of vaccines promoted the economic rebound, leading to a 9% increase in coal-fired power generation in 2021, its highest level ever [2]. In the first half of 2021, coal market consumption showed an 11% year-on-year growth. Coal consumption in the European Union is expected to increase by 4% by the end of the year [3], and coal may remain a mainstay of international energy in the short term.
The huge consumption of coal means that the generation of coal fly ash (CFA), a byproduct of coal combustion [4], continues to increase [5], particularly in India. Over a 10-year span (2009-2010 to 2018-2019), CFA generation in the power sector increased by nearly 76% and now stands at around 217 million tonnes [6]. Large deposits of CFA take up land resources and directly pollute the soil [7], in addition to causing serious impacts on water and air when treated inappropriately. On the one hand, heavy rainfall makes CFA landfills potentially dangerous, producing toxic leachate that seeps into groundwater and pollutes water resources [8]; on the other hand, toxic elements may be discharged into the air with the flue gas produced, endangering air quality. Li et al. have pointed out that solid Hg waste produced by fly ash from coal burning is the main source of Hg in the environment [9]. Moreover, human health is also at risk from long-term exposure to CFA diffusing into the air or from drinking contaminated groundwater [10]. As a result, academia and industry are paying increasing attention to the resource utilization of CFA.
In recent years, CFA has gradually been utilized effectively in various fields, most widely in the construction industry. CFA is used as a supplementary cementitious material to partially replace cement in concrete or to prepare geopolymers [11] and is also used as coarse and fine aggregate in asphalt pavement [12]. Owing to its potential for soil improvement and heavy metal adsorption [13], CFA has good development prospects in agriculture. Moreover, CFA is also widely used in the manufacturing of ceramic glass [14], metal matrix composites, and metal coatings.
However, the amount of generation of CFA is variable, so predicting the generation of CFA in advance is the basis for ensuring its effective disposal and rational utilization. Some scholars used neural networks in MATLAB and linear regression statistical analysis in IBM SPSS to predict the generation of CFA in power plants in five or ten years [15]. However, the description of this method is too simple and general, and the accuracy of the prediction has not been verified. Others predicted the average annual output of hazardous wastes by multiplying the amount of industrial hazardous wastes generated in the base year by the average annual growth rate index [16], which is too complicated and time-consuming, and the accuracy cannot be guaranteed either.
In view of the limitations of the above prediction methods, this paper used advanced machine learning (ML) algorithms [17] to predict the generation of CFA by constructing three different regression models, in which installed capacity and coal consumption were the input variables and the generation of CFA was the output variable. After thorough evaluation and comparison, the established model framework can be applied on engineering sites to predict the amount of CFA generation quickly and accurately, thus saving time for further planning of CFA disposal and providing the basis for reasonable recycling of CFA.

Dataset
Data are the basis for the development of machine learning algorithms, and any ML algorithms need data to evaluate their effects. How to collect a comprehensive and appropriate dataset and analyze it was key to this research.

Data Collection
In this study, domestic and foreign databases and related academic websites were searched, a large number of studies in the literature and academic reports related to CFA were consulted, the relevant data were sorted out and recorded, and finally, the dataset used in this paper was obtained through screening. This dataset was extracted from a report documenting CFA generation and utilization in coal-fired power plants across India in 2019-2020 and contained data from 183 power plants across 17 states (outliers with a coal consumption value of 0 were removed) [18], as shown in Figure 1. Chhattisgarh was the most sampled state, with 27 power plants, accounting for 14.8%, but only one power plant was sampled in Assam. From a holistic perspective, the distribution of sampling sites was relatively uniform.

Data Analysis
The ultimate purpose of the algorithm is to fit the distribution of the data and predict the trend of change. Different datasets have different feature distributions, so the statistical distribution and correlation analysis of data serve as sources of reference for establishing the optimal algorithm model.
In this paper, the dataset with a distribution characteristic presented in the form of a bubble chart included two features: installed capacity and coal consumption, and the target variable was the generation of CFA. As shown in Figure 2, the size of bubbles represents the CFA generation. Data points were mainly distributed in the lower-left corner, meaning that when the installed capacity was between 0 and 2000 MW, and the coal consumption varied from 0 to 5 MT, the amount of generation of CFA was small, less than 2.95 MT. With the increase in installed capacity and coal consumption, the CFA generation also increased. The maximum generation of 8.85 MT was realized when the installed capacity was 4760 MW, and the coal consumption was about 25 MT.

Correlations between features or between features and target variables were measured by Pearson correlation coefficients (R), which were between −1 and 1, as shown in Figure 3. The correlation degree between coal consumption and generation of CFA was the highest, and R was 0.9. Meanwhile, R between installed capacity and target variable was 0.73, indicating a strong correlation between installed capacity and CFA generation.
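As a minimal sketch of how such a correlation matrix can be computed, the following example uses NumPy on synthetic stand-in data. The variable names and the generated values are assumptions for illustration only; they are not the plant data used in this paper, so the printed coefficients will not match Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the power plant data from the report).
coal_consumption = rng.uniform(0.1, 25.0, size=183)                            # MT
installed_capacity = 190.0 * coal_consumption + rng.normal(0, 600, size=183)   # MW
cfa_generation = 0.35 * coal_consumption + rng.normal(0, 0.5, size=183)        # MT

# Each row is one variable; np.corrcoef returns the symmetric matrix
# of pairwise Pearson correlation coefficients R.
data = np.vstack([installed_capacity, coal_consumption, cfa_generation])
corr = np.corrcoef(data)

labels = ["installed_capacity", "coal_consumption", "cfa_generation"]
for name, row in zip(labels, corr):
    print(f"{name:>20s}", np.round(row, 2))
```

The diagonal entries are always 1, and every off-diagonal entry lies between −1 and 1.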

Methodology
To achieve rapid and accurate prediction of CFA generation, Python 3.8 programming language was used in this paper, and three machine learning algorithms were selected to construct the model framework using the scikit-learn library [19]. Four evaluation indices were used to measure the performance of the model [20]. Finally, the optimal model was determined according to the evaluation results and compared with traditional methods to verify the feasibility and superiority of the prediction framework. The specific methodology is shown in Figure 4.

Modeling Methods
Machine learning algorithms used in this study shared a general modeling process. First, the original data were preprocessed, including the removal of outliers and normalization (explained in detail in Section 3.2). Then, the treated coal consumption and installed capacity were taken as input variables, and CFA generation was taken as the output variable. Finally, training, evaluation, and prediction were carried out with random forest, support vector machine, and neural network models. The specific principles and steps of the three algorithms are as follows.

Random Forest
Random forest (RF) is an ensemble algorithm that combines the outputs of multiple decision trees into one result to deal with classification and regression problems [21]. It is easy to use and flexible [22]. The construction of RF includes the following four main steps:

1. Random sampling and training of a decision tree: the original data population with sample size N is randomly sampled N times with replacement [23], and the N samples obtained are used to train a decision tree;
2. Random selection of node-splitting attributes: when a node of the decision tree is split, m attributes (m << M) are randomly selected from the M attributes of each sample, and a strategy (such as information gain) is then used to select one of them as the final split attribute of the node;
3. Step 2 is repeated until the tree cannot be split further; note that no pruning occurs during the entire decision tree formation process;
4. A large number of decision trees are established according to steps 1-3 to form an RF.
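The four steps can be illustrated with a hand-rolled sketch built on scikit-learn's `DecisionTreeRegressor`. This is not the implementation used in this study (which relied on the library's ready-made random forest); the helper names `fit_forest` and `predict_forest`, the number of trees, and the synthetic data are all assumptions for illustration. For regression, the tree splits on a squared-error criterion rather than information gain.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest(X, y, n_trees=25, seed=0):
    """Bagging of unpruned regression trees with random feature subsets."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: draw N samples with replacement (bootstrap sample).
        idx = rng.integers(0, n_samples, size=n_samples)
        # Steps 2-3: max_features=1 evaluates one randomly chosen feature
        # per split (m = 1 of M = 2 here); no depth limit, so no pruning.
        tree = DecisionTreeRegressor(
            max_features=1, random_state=int(rng.integers(1_000_000))
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees  # Step 4: the collection of trees forms the forest.


def predict_forest(trees, X):
    # A regression forest averages the predictions of its trees.
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Synthetic data with two features, for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, size=200)

forest = fit_forest(X, y)
pred = predict_forest(forest, X)
print("in-sample MSE:", float(np.mean((pred - y) ** 2)))
```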

Support Vector Regression
Support vector regression (SVR) is an important branch of support vector machine (SVM) [24]. In SVR, all sample points are ultimately treated as a single class, and the optimal hyperplane it seeks is the one that minimizes the total deviation of all sample points from the hyperplane.
Unlike traditional regression methods, SVR considers a prediction correct, and calculates no loss, as long as the deviation between $f(x)$ and $y$ is not too large. SVR obtains a regression model of the form $f(x) = \omega^{T}\Phi(x) + b$, where $\Phi(x)$ is the vector to which $x$ is mapped, $\omega = (\omega_1, \omega_2, \ldots, \omega_n)$ is the normal vector, and $b$ is the intercept. Then, an interval band with a distance of $\varepsilon$ is created on both sides of the linear function (the tolerance deviation) [26]. No loss is calculated for samples that fall inside the interval band; a loss is calculated only when the absolute value of the gap between $f(x)$ and $y$ is greater than $\varepsilon$. Finally, the optimized model is obtained by minimizing the total loss and maximizing the interval.
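The tolerance band can be made concrete with a small sketch of the ε-insensitive loss. The function name and the sample values are hypothetical; ε = 0.1 is chosen because it is the default `epsilon` of scikit-learn's `SVR`.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the band |y - f(x)| <= epsilon, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)


# Illustrative values only.
y_true = np.array([1.00, 1.00, 1.00, 1.00])
y_pred = np.array([1.05, 0.90, 1.30, 0.50])
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(np.round(losses, 3))
```

The first two predictions lie inside the band and incur no loss; the last two are penalized only by the amount by which they exceed ε.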

Deep Neural Network
A deep neural network (DNN) is an extension of the perceptron. A DNN has one input layer and one output layer, with multiple hidden layers in between [27]. Each layer has several neurons. Neurons in adjacent layers are connected to each other, but neurons within the same layer are not, and each neuron in a layer is connected to all the neurons in the previous layer [28,29].
Generally speaking, constructing a DNN involves three steps: (1) network construction, (2) parameter assignment, and (3) iterative calculation. The main principles of iterative calculation are the forward-propagation (FP) and back-propagation (BP) algorithms [30].
The FP algorithm uses several weight coefficient matrices W and bias vectors B to carry out a series of linear operations and activation operations on the input vector X. Starting from the input layer, the output of each layer is used to calculate the output of the next layer, layer by layer, until the output layer is reached and the predicted value Y is obtained. In comparison, the BP algorithm uses the gradient descent method to iteratively minimize the loss function [31]. It thereby finds appropriate weight coefficient matrices W and bias vectors B for the hidden layers and the output layer, so that the output calculated from the training samples is equal or as close as possible to the sample labels [32].
In short, FP is the process of computing the predicted value Y, while BP is the reverse adjustment of parameters W and B according to the difference between the target y and the predicted Y. After repeated forward- and back-propagation training, a neural network model with high accuracy is finally formed.
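The interplay of FP and BP can be sketched in a few lines of NumPy. This is a toy network with a single hidden layer trained on synthetic data; the layer size, the tanh activation, the learning rate, and the number of steps are all illustrative assumptions and do not reproduce the network used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(64, 2))                  # synthetic inputs, 2 features
y = (0.8 * X[:, 0] + 0.2 * X[:, 1]).reshape(-1, 1)    # synthetic target

# One hidden layer with 8 neurons (sizes are illustrative).
W1, B1 = rng.normal(0, 0.5, size=(2, 8)), np.zeros((1, 8))
W2, B2 = rng.normal(0, 0.5, size=(8, 1)), np.zeros((1, 1))
lr = 0.1
losses = []

for step in range(200):
    # Forward propagation: linear operation, activation, linear output.
    Z1 = X @ W1 + B1
    A1 = np.tanh(Z1)
    Y_hat = A1 @ W2 + B2
    losses.append(float(np.mean((Y_hat - y) ** 2)))   # MSE loss

    # Back-propagation: gradients of the MSE loss via the chain rule.
    dY = 2.0 * (Y_hat - y) / len(X)
    dW2 = A1.T @ dY
    dB2 = dY.sum(axis=0, keepdims=True)
    dZ1 = (dY @ W2.T) * (1.0 - A1 ** 2)               # derivative of tanh
    dW1 = X.T @ dZ1
    dB1 = dZ1.sum(axis=0, keepdims=True)

    # Gradient descent update of W and B.
    W1 -= lr * dW1
    B1 -= lr * dB1
    W2 -= lr * dW2
    B2 -= lr * dB2

print(f"loss at first step: {losses[0]:.4f}, at last step: {losses[-1]:.4f}")
```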

Dataset Preprocessing and Splitting
The dimension and unit of each evaluation index (feature) affect the results of data analysis. To eliminate the influence of differing dimensions between indicators, data are generally standardized or normalized so that the indicators become comparable [33].
The variance ratio between the features and the target variable of the dataset in this paper was 200:4:2. The variances differ by orders of magnitude, which leads to features with large variances dominating the algorithm and results in poor modeling performance [34]. Therefore, the "preprocessing" module in sklearn was used to standardize the data (sklearn.preprocessing.scale) after outliers had been removed.
The preprocessed dataset was divided into training and testing sets. Among them, the training set was used to train the model, whereas the testing set was used to verify the final effect of the model. In view of the impact of the division ratio of the dataset on model performance, the size of the testing set in this paper varied from 10% to 45% with an interval of 5%, and R was used as the evaluation index to determine the optimal division ratio [22].
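A minimal sketch of standardization followed by splitting is given below. The data are synthetic stand-ins, and scaling the target together with the features is an assumption made here for illustration. Note also that scaling before splitting lets testing-set statistics influence the transform; fitting a scaler on the training set only is a common alternative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)

# Synthetic columns: installed capacity (MW), coal consumption (MT),
# CFA generation (MT). Not the data used in this paper.
capacity = rng.uniform(100, 4760, size=183)
coal = capacity / 190.0 + rng.normal(0, 1.0, size=183)
cfa = 0.35 * coal + rng.normal(0, 0.3, size=183)
data = np.column_stack([capacity, coal, cfa])

print("variances before scaling:", np.round(data.var(axis=0), 2))
scaled = scale(data)  # each column now has zero mean and unit variance
print("variances after scaling: ", np.round(scaled.var(axis=0), 2))

X, y = scaled[:, :2], scaled[:, 2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)
print(len(X_train), "training samples,", len(X_test), "testing samples")
```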

Model Evaluation
After the model was constructed, it was necessary to evaluate its performance and then select the optimal model by comparison. In this paper, four common indicators, namely R, R squared (R²), mean-squared error (MSE), and mean absolute error (MAE), were used to evaluate the model. The calculation formulas are as follows:

$$R = \frac{\sum_{i=1}^{n}\left(f(x_i) - \overline{f(x)}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(f(x_i) - \overline{f(x)}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)^{2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|f(x_i) - y_i\right|$$

where $n$ is the number of samples, $y_i$ is the real observed value, $\bar{y}$ is the mean of the real values, and $f(x_i)$ is the predicted value, with mean $\overline{f(x)}$.
As introduced in Section 2.2, R reflects the degree of linear correlation between two variables, whereas R² is used to judge the degree of fit between the prediction model and the real data [35]. The best value of R² is 1, and R² can be negative when the model fits worse than simply predicting the mean. MSE is the mean of the squared errors between the fitted data and the original data; the smaller the value, the better the fit [36]. MAE measures how close the predicted results are to the real data; the smaller the MAE, the better the model [37].
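The four indices can be computed as in the following sketch, which uses NumPy and scikit-learn's metric functions on a handful of illustrative values (the numbers are made up and are not results from this study). The last line shows a case in which R² is negative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only.
y_true = np.array([0.5, 1.2, 2.0, 3.1, 4.4])
y_pred = np.array([0.7, 1.0, 2.3, 2.9, 4.0])

r = np.corrcoef(y_true, y_pred)[0, 1]        # Pearson correlation coefficient R
r2 = r2_score(y_true, y_pred)                # coefficient of determination
mse = mean_squared_error(y_true, y_pred)     # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)    # mean of absolute errors
print(f"R = {r:.3f}, R2 = {r2:.3f}, MSE = {mse:.3f}, MAE = {mae:.3f}")

# A prediction that is worse than always predicting the mean gives R2 < 0.
r2_bad = r2_score(y_true, y_true[::-1])
print(f"R2 of a reversed prediction = {r2_bad:.2f}")
```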

Determination of Dataset Division Ratio
To avoid the randomness of the evaluation results, in this paper, we evaluated each division ratio of the dataset 50 times repeatedly and took the mean value of the correlation coefficient R as the final performance of the model under a specific partition. As shown in Figure 5, the RF model was taken as an example. For the training set, the influence of the division ratio on the modeling performance was small, and the R fluctuated, by a small margin, around 0.98. Focusing on the testing set, when it accounted for 10% of the dataset, R was 0.84; when the size of the testing set was 15%, the performance of the model reached the highest, satisfying R = 0.87. After that, R generally decreased, with a further increase in division ratio up to 45%. In summary, the RF model performed best when the size of the testing set was 15%, and the analysis results of SVR and DNN were consistent with it. Therefore, the training set:testing set = 0.85:0.15 ratio was determined as the optimal division ratio.
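The repeated-split procedure can be sketched as follows. The helper `mean_test_r` and the synthetic data are assumptions for illustration; the demonstration call uses fewer repetitions and fewer trees than the 50 repetitions and default parameters described above, purely to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def mean_test_r(X, y, test_size, n_repeats=50, n_estimators=100):
    """Average testing-set Pearson R over repeated random splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(np.corrcoef(y_te, model.predict(X_te))[0, 1])
    return float(np.mean(scores))


# Synthetic stand-in data, not the CFA dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(183, 2))
y = 2.0 * X[:, 1] + 0.5 * X[:, 0] + rng.normal(0, 0.2, size=183)

# Testing-set sizes from 10% to 45% in steps of 5%.
test_sizes = np.arange(0.10, 0.46, 0.05).round(2).tolist()

# Reduced settings for a quick demonstration.
results = {ts: mean_test_r(X, y, ts, n_repeats=3, n_estimators=20) for ts in test_sizes}
best = max(results, key=results.get)
for ts, score in results.items():
    print(f"test size {ts:.2f}: mean R = {score:.3f}")
print("best test size on this synthetic data:", best)
```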

Parameters of the Model
In this study, RF and SVR, as traditional machine learning regression algorithms, were trained with the corresponding default parameters in sklearn (the ensemble and svm modules, respectively), as shown in Table 1. DNN is a deep learning model whose performance is greatly affected by network structure and parameters [38]. Based on the trial-and-error method and suggestions in references [39,40], the number of network layers, learning rate, activation function, and number of epochs were varied repeatedly during the model training process, and 10% of the data were separated for performance verification. As shown in Figure 6, once the loss on the validation set fluctuated stably with the increase in steps, the DNN model that included one input layer, five hidden layers, and one output layer was finally determined. The number of neurons in each layer was 2→8→32→64→16→8→1.
To speed up convergence based on the gradient descent method and prevent overfitting, two "Batch normalization" layers and one "dropout" layer were also included. The specific network structure and parameters are shown in Figure 7 and Table 2.
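To illustrate the reported layer sizes, the sketch below builds a network with the same 2→8→32→64→16→8→1 shape using scikit-learn's `MLPRegressor`. This is a swapped-in stand-in, not the framework used in this study: `MLPRegressor` offers no batch normalization or dropout layers, and the activation, learning rate, and iteration limit shown are assumptions rather than the values in Table 2. The data are synthetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

hidden_layers = (8, 32, 64, 16, 8)  # five hidden layers, sizes as reported

# Synthetic, already standardized stand-in data with two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(183, 2))
y = 0.9 * X[:, 1] + 0.2 * X[:, 0] + rng.normal(0, 0.2, size=183)

model = MLPRegressor(
    hidden_layer_sizes=hidden_layers,
    activation="relu",          # assumed
    learning_rate_init=0.001,   # assumed
    max_iter=500,               # assumed
    early_stopping=True,        # monitor a held-out validation subset
    validation_fraction=0.1,    # 10% of the data held out for validation
    random_state=0,
)
model.fit(X, y)

# Weight matrix shapes confirm the 2 -> 8 -> 32 -> 64 -> 16 -> 8 -> 1 structure.
layer_shapes = [w.shape for w in model.coefs_]
print(layer_shapes)
```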

Comparative Analysis of Model Performance
To obtain a reliable model, fivefold-cross-validation was adopted for RF and SVR models (cross_val_predict), while DNN used the parameter "validation_split" to perform simple cross-validation. Moreover, as the result of a simple random partition is accidental, it cannot represent the actual performance of the model. Therefore, modeling and evaluation for three ML models were repeated 50 times on the training and testing sets, respectively, and the evaluation indexes were averaged as the final performance of the model.
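A minimal sketch of fivefold cross-validation with `cross_val_predict` for the RF and SVR models is shown below. The data are synthetic, and the use of Pearson R as the summary score is chosen here for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

# Synthetic stand-in for a standardized training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(155, 2))
y = 0.9 * X[:, 1] + 0.2 * X[:, 0] + rng.normal(0, 0.2, size=155)

models = {"RF": RandomForestRegressor(random_state=0), "SVR": SVR()}
cv_r = {}
for name, model in models.items():
    # Every sample is predicted by a model that was fitted on the
    # other four folds, so no sample is scored on its own fit.
    y_cv = cross_val_predict(model, X, y, cv=5)
    cv_r[name] = float(np.corrcoef(y, y_cv)[0, 1])

print(cv_r)
```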
As shown in Figure 8, the linear fits between the predictions of the RF model and the real values were y = 0.878x + 0.134 on the training set and y = 0.861x + 0.094 on the testing set, while those of the DNN model were y = 0.790x + 0.375 and y = 0.837x + 0.316. The data points were relatively concentrated around the fitted lines, and the p values were 4.68E-30, 1.30E-27, 8.35E-73, and 1.09E-12, respectively, all of which were less than the significance level of 0.05. In addition, R and R² were relatively high, indicating the good performance of the models. Moreover, the linear regression between the actual and SVR-estimated generation of CFA was y = 0.451x + 0.614 and y = 0.861x + 0.094 on the training and testing sets, respectively. Compared with the RF and DNN models, the SVR model had a relatively scattered data distribution on the training set, and its performance was slightly worse.
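A fitted line of this kind, together with its p value, can be obtained with `scipy.stats.linregress`, as sketched below. The actual and predicted values are synthetic, and treating the actual value as x and the prediction as y is an assumption about the plotting convention. The reported p value comes from a two-sided test of the null hypothesis that the slope is zero.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic actual and predicted values for a testing set of 28 samples.
rng = np.random.default_rng(0)
actual = rng.uniform(0.0, 9.0, size=28)
predicted = 0.85 * actual + 0.3 + rng.normal(0, 0.4, size=28)

fit = linregress(actual, predicted)
print(
    f"y = {fit.slope:.3f}x + {fit.intercept:.3f}, "
    f"R = {fit.rvalue:.2f}, p = {fit.pvalue:.2e}"
)
```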

As can be seen from Figure 9, the difference between the actual and estimated generation of CFA was small in the three models, and the data were mostly concentrated around 0, indicating the good prediction performance of the ML models. The probability of data points in the RF and DNN models appearing in the small interval [−0.1, 0.1] was close to 0.9, while that of SVR was only 0.45. In addition, for the testing set, the data points of the DNN model were more concentrated in the areas with smaller differences, which implied that the DNN model had higher prediction accuracy.
Crystals 2022, 12, x FOR PEER REVIEW 10 of 16
Figure 10 shows a more intuitive comparison of the performance of the three models using the four evaluation indices. For the training set in Figure 10a, the R and R² values of the RF and DNN models were the same, 0.98 and 0.95, respectively, and slightly higher than those of the SVR model, which were 0.92 and 0.83. Meanwhile, the MSE and MAE values of the RF model were the smallest of the three models. On the contrary, the RF model had the lowest R and R² values on the testing set in Figure 10b, which were 0.87 and 0.7. The R and R² values of the DNN model, however, were the highest, at 0.89 and 0.77, and its MSE and MAE were relatively low. In general, the DNN model was the optimal model framework for the CFA dataset in this study.

Comparison with Multiple Linear Regression
Multiple linear regression is a conventional data analysis method that uses multiple independent variables to predict or estimate dependent variables [41]. In this method, the dataset was repeatedly divided 50 times according to the same ratio of 0.85:0.15, and the multiple linear regression equation Y = Ax1 + Bx2 + C was established. The average results of statistical analysis are shown in Table 3. After 50 evaluations, the mean R 2 and R of the multiple regression training set were 0.82 and 0.90, which were lower than the results of the three ML models using fivefold cross-validation. Then, the data of the testing set were put into the equation for verification, and the mean values of R and p-value were 0.86 and 0.0000643805, respectively, indicating a significant correlation between the results. However, the mean value of R 2 was 0.76, which was higher than the RF and SVR models but  Figure 10 shows a comparison of the performance of the three models more intuitively with four evaluation indices. For the training set in Figure 10a, R and R 2 values of RF and DNN models were the same, which were 0.98 and 0.95, respectively, and slightly higher than those of SVR models, which were 0.92 and 0.83. Meanwhile, the MSE and MAE values of the RF model were the smallest of the three models. On the contrary, the RF model had the lowest R and R 2 values on the testing set of Figure 10b, which were 0.87 and 0.7. However, R and R 2 values of the DNN model were the highest, which were 0.89 and 0.77, and MSE and MAE were relatively low. In general, The DNN model was the optimal model framework suitable for the CFA dataset in this study.  Figure 10 shows a comparison of the performance of the three models more intuitively with four evaluation indices. For the training set in Figure 10a, R and R 2 values of RF and DNN models were the same, which were 0.98 and 0.95, respectively, and slightly higher than those of SVR models, which were 0.92 and 0.83. 
Meanwhile, the MSE and MAE values of the RF model were the smallest of the three models. On the contrary, the RF model had the lowest R and R 2 values on the testing set of Figure 10b, which were 0.87 and 0.7. However, R and R 2 values of the DNN model were the highest, which were 0.89 and 0.77, and MSE and MAE were relatively low. In general, The DNN model was the optimal model framework suitable for the CFA dataset in this study.

Comparison with Multiple Linear Regression
Multiple linear regression is a conventional data analysis method that uses multiple independent variables to predict or estimate a dependent variable [41]. In this method, the dataset was repeatedly divided 50 times at the same ratio of 0.85:0.15, and the multiple linear regression equation Y = Ax₁ + Bx₂ + C was established. The average results of the statistical analysis are shown in Table 3. Over the 50 evaluations, the mean R² and R of the multiple regression on the training set were 0.82 and 0.90, which were lower than the results of the three ML models using fivefold cross-validation. The testing set data were then put into the equation for verification; the mean values of R and the p-value were 0.86 and 0.0000643805, respectively, indicating a significant correlation between the predicted and observed values. The mean value of R² on the testing set was 0.76, which was higher than those of the RF and SVR models but lower than that of the DNN model, which again indicated that the DNN model was more suitable for the dataset. The detailed results are provided in the attachment.
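The repeated-split procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's dataset or code: the coefficients used to generate the data, the sample size, and the helper `r2_score` are assumptions made only for the example, and the fit uses ordinary least squares from NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study's dataset): x1 plays the role of
# coal consumption, x2 of installed capacity, y of CFA generation.
n = 100
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 1.0 + rng.normal(scale=0.5, size=n)

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

n_test = int(round(0.15 * n))  # 0.85:0.15 train/test ratio
test_r2, coefs = [], []
for _ in range(50):  # 50 random divisions of the dataset
    idx = rng.permutation(n)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Design matrix [x1, x2, 1] so that least squares returns (A, B, C).
    D_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
    abc, *_ = np.linalg.lstsq(D_train, y[train_idx], rcond=None)
    D_test = np.column_stack([X[test_idx], np.ones(n_test)])
    test_r2.append(r2_score(y[test_idx], D_test @ abc))
    coefs.append(abc)

mean_test_r2 = float(np.mean(test_r2))  # averaged over the 50 splits
mean_abc = np.mean(coefs, axis=0)       # averaged A, B, C
print(mean_test_r2, mean_abc)
```

Averaging over many random splits reduces the dependence of the reported scores on any single partition, which matters for a small dataset.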

Feature Analysis
This section analyzes the sensitivity of the two features that affect CFA generation, using the permutation importance provided by sklearn and eli5, and the "TreeExplainer" and "KernelExplainer" in the SHapley Additive exPlanations (SHAP) library.

Permutation Importance
The sensitivity of a feature is evaluated by the degree to which the model performance score degrades after the values of that feature are randomly shuffled [42]. As shown in Figure 11, after the values of coal consumption were randomly shuffled, the model performance measured by MSE degraded by 1.73, 1.07, and 1.89 for the RF, SVR, and DNN algorithms, respectively, which was generally more than when installed capacity was randomly shuffled. This showed that the three models agreed that coal consumption had a greater impact on the generation of CFA.
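The mechanism behind permutation importance can be sketched in a few lines. The example below is a hand-written illustration under stated assumptions rather than the sklearn or eli5 implementation used in the study: the data are synthetic, the fitted model is an ordinary least-squares stand-in for RF, SVR, or DNN, and the function `permutation_importance_mse` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: feature 0 plays the role of coal consumption,
# feature 1 the role of installed capacity (NOT the study's dataset).
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Stand-in model: ordinary least squares with an intercept column.
D = np.column_stack([X, np.ones(n)])
w, *_ = np.linalg.lstsq(D, y, rcond=None)

def predict(features):
    return np.column_stack([features, np.ones(len(features))]) @ w

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def permutation_importance_mse(X, y, n_repeats=20):
    """Mean increase in MSE after shuffling each feature column in turn."""
    baseline = mse(y, predict(X))
    importances = []
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(mse(y, predict(X_perm)) - baseline)
        importances.append(float(np.mean(increases)))
    return importances

importance = permutation_importance_mse(X, y)
print(importance)
```

In this synthetic setting the first feature has the larger generating coefficient, so shuffling it degrades the MSE more, which is the same kind of comparison as the one made in Figure 11.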

SHAP
SHAP is a model-agnostic interpretation method that can be used for both global and individual (single-sample) explanations. SHAP can judge which feature is more important and also reflect the positive and negative influences of features on the target variable [43]. The model generates a predicted value for each sample, and the SHAP value is the contribution assigned to each feature in that sample [44].
To better understand the overall pattern, Figure 12a shows the SHAP values calculated for each feature of each sample. The features are arranged from top to bottom on the y axis in order of importance [45], which indicated that coal consumption had a greater influence on the model, consistent with the permutation importance results.
In addition, the color represents the feature value (red is high, blue is low [46]). Under all three algorithm models, higher coal consumption increased the predicted generation of CFA. For installed capacity, however, the results differed: in the RF and SVR models, larger installed capacity increased the predicted generation of CFA, whereas in the DNN model the effect was the opposite.
As shown in Figure 12b, the first sample, whose preprocessed feature values were 0.8059 and 1.363, was used as an example to explain how a single prediction is generated. In the figure, the red bar represents the range in which a feature played a positive role in the prediction of the model [47]; the base value was the mean value of the target variables of all samples, and f(x) was the final predicted value for this sample, which satisfies f(x) = base value + ∑SHAP value. The analysis showed that the predictions of the three algorithm models differed slightly for the same sample, which may be due to the different algorithm principles and the random shuffling of the data. Nevertheless, coal consumption and installed capacity both played positive, driving roles in the prediction, with coal consumption having the greater impact.
In addition to explaining the model globally and locally, Figure 12c reveals hidden relationships between the features through their interaction effects. The analysis showed that the interaction between coal consumption and installed capacity had a positively correlated influence on the CFA generation prediction. Specifically, when both coal consumption and installed capacity were high, installed capacity had a large influence on the generation of CFA, except for some outliers. In contrast, when coal consumption and installed capacity were relatively small, installed capacity contributed little to the variation in the model output and even reduced the predicted value.
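The additive relation f(x) = base value + ∑SHAP value can be checked directly when there are only two features, because the Shapley values can then be computed exactly from four evaluations of a value function. The sketch below is an illustration under stated assumptions and does not use the SHAP library: the predictor `model` is a hypothetical stand-in, the background samples are synthetic, the base value is taken as the mean model prediction over those background samples, and the assignment of the two quoted feature values to coal consumption and installed capacity is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 2))  # synthetic preprocessed features

def model(X):
    """Hypothetical nonlinear predictor standing in for RF, SVR, or DNN."""
    X = np.atleast_2d(X)
    return 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 0] * X[:, 1]

def value(x, subset):
    """Mean prediction when the features in `subset` are fixed to x's values
    and the remaining features are drawn from the background samples."""
    X = background.copy()
    for j in subset:
        X[:, j] = x[j]
    return float(model(X).mean())

def shapley_two_features(x):
    v0 = value(x, ())       # no feature known: the base value
    v1 = value(x, (0,))     # only feature 0 known
    v2 = value(x, (1,))     # only feature 1 known
    v12 = value(x, (0, 1))  # both known: equals f(x)
    # Average the marginal contribution over both feature orderings.
    phi_0 = 0.5 * ((v1 - v0) + (v12 - v2))
    phi_1 = 0.5 * ((v2 - v0) + (v12 - v1))
    return v0, phi_0, phi_1

# Preprocessed values quoted for the first sample; feature order is assumed.
x = np.array([0.8059, 1.363])
base_value, phi_coal, phi_capacity = shapley_two_features(x)
f_x = float(model(x)[0])
print(base_value + phi_coal + phi_capacity, f_x)
```

The two printed numbers agree up to floating-point error, since the two contributions sum to the difference between the prediction for the sample and the base value by construction.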
As indicated above, the effect of installed capacity on CFA generation was not always positive, in contrast to that of coal consumption. In practice, the installed capacity is the designed capacity of a specific power station, whereas the actual output is influenced by many external factors, such as coal production, the market, and policies. Therefore, the correlation between installed capacity and CFA generation was not as close as the correlation between coal consumption and CFA generation. Moreover, it is possible that a power station with a large installed capacity produced a relatively small amount of power, and thus of CFA, owing to the above-mentioned factors. ML models trained on datasets containing such special cases might indicate a negative influence of installed capacity for some data samples.

Significance and Outlook
High energy consumption leads to increased generation of solid wastes such as CFA, posing a potential threat to the environment and human health. Meanwhile, more CFA byproducts are gradually being recycled and utilized to achieve sustainability [48]. However, the uncertainty of CFA generation makes it difficult to rationally plan and design its disposal and utilization. The optimal model framework constructed in this study can quickly and accurately predict the generation of CFA from only two inputs, coal consumption and installed capacity, which makes it feasible and efficient. Applying this model framework to the engineering field enables managers to identify the next step of the disposal method in advance, so as to rationally allocate ways of recycling and utilization to maximize the use and sales benefits of CFA while minimizing its disposal costs. However, owing to the small size of the dataset and the few input variables, the results of this model framework have not been further validated, and its general applicability needs to be improved. Subsequent studies can expand the search scope and consider additional factors affecting CFA generation.

Conclusions
Through comparative evaluation, DNN was determined to be the optimal ML model, which can accurately predict the generation of CFA. In addition, the sensitivity analysis of the features provided a point of reference for ensuring the rational utilization of CFA. The specific conclusions are as follows: (1) Among the three model algorithms, the DNN model had the best performance. R and R² on the training set were 0.98 and 0.95, whereas those on the testing set were 0.89 and 0.77, respectively; (2) The R² of the traditional multiple linear regression equation on the testing set was 0.76, higher than those of the RF and SVR models, but lower than that of the DNN model; (3) Permutation importance and SHAP both indicated that coal consumption had a greater positive effect on the generation of CFA. Owing to the influence of other factors, the influence of installed capacity on CFA generation was not as significant as that of coal consumption and could be negative for some special data samples.