An Interpretable Extreme Gradient Boosting Model to Predict Ash Fusion Temperatures

Abstract: The hemispherical temperature (HT) is the most important indicator among the ash fusion temperatures (AFTs) used in the Polish industry to assess the suitability of coal for combustion and gasification. For safe operation and energy saving, it is important to know, or to be able to predict, the value of this parameter. In this study, a non-linear model predicting the HT value from the ash oxide contents of 360 coal samples from the Upper Silesian Coal Basin was developed. The proposed model was established using a machine learning method, the extreme gradient boosting (XGBoost) regressor. An important feature of models based on the XGBoost algorithm is the ability to determine the impact of individual input parameters on the predicted value using the feature importance (FI) technique. This method allowed the determination of the ash oxides with the greatest impact on the predicted HT. The partial dependence plots (PDP) technique was then used to visualize the effect of individual oxides on the predicted value. The results indicate that the proposed model can estimate the HT value with high accuracy; the coefficient of determination (R²) of the prediction reached a satisfactory value of 0.88.


Introduction
Ash fusion temperatures (AFTs) are widely used to characterize ash fusibility, which is a basic characteristic for assessing the propensity of ash to slagging and fouling during combustion at high temperatures [1,2]. The formation of slag in combustion processes is responsible for significant operational and maintenance problems [3], which may result in increased costs and reduced process efficiency. AFTs are important parameters because they also determine the behavior of coal ash in other coal conversion processes, e.g., coal gasification, liquefaction and ash utilization [4][5][6].
Determination of AFTs is based on measurements of the dimensional changes of an ash cone as a function of temperature, during which four characteristic temperature points are identified under both oxidizing and reducing conditions [7][8][9]. These points define the temperature range in which the ash melting process takes place. Details of the ash fusion test, such as the shape and size of the samples and the way they are prepared, are defined by the standard used in Poland, PN-ISO 540:2001. The specific temperature points are described in Table 1.

Table 1. Characteristic ash fusion temperatures.
Deformation temperature (DT): The temperature at which the first change of form of the sample occurs (rounding of the corners or apex).
Spherical temperature (ST): The temperature at which the cone fuses down to a spherical lump in which the height is equal to the width of the base.
Hemispherical temperature (HT): The temperature at which the cone fuses down to a hemispherical lump, at which point the height is one half the width of the base.
Fluid temperature (FT): The temperature at which the fused mass of sample spreads out in a nearly flat layer.
An AFT measurement is generally expensive, time consuming and hard to repeat, because errors can arise during the laboratory analysis [10,11]. Moreover, the ash fusion laboratory test has been questioned, as it is more of a qualitative observation [12]. Despite these shortcomings, AFTs are still widely used to assess the deposition characteristics of coal. Therefore, it is important to know, or to be able to predict, the ash fusibility characteristics of a coal before it is fed to the boiler or reactor [13].
A number of studies have been carried out to predict the AFTs of coal ash using various methods. In the literature, there are three main directions of research aimed at forecasting AFTs. Many studies have focused on empirical or statistical correlations between AFTs and ash composition [6,8,[14][15][16][17][18][19]. In most cases, various regression analyses were used to find correlations between AFTs and coal ash composition. For instance, Özbayoglu et al. [17] showed that non-linear correlations are better than linear correlations for estimating the AFTs of Turkish coals. Non-linear correlations were developed using the chemical composition of the ash (eight oxides) and coal parameters such as ash content, specific gravity, Hardgrove index and mineral matter content. Lolja et al. [16] analyzed the ash fusion temperatures of Albanian coals using oxide analyses from various perspectives. It was reconfirmed that AFTs decreased as the basic oxide content increased.
As an alternative for empirical and statistical methods Huggins et al. [14] proposed equilibrium phase diagrams to determine liquid temperatures that parallel the fusion temperatures of the formed ash. Further research was carried out by Jak [8] using thermodynamic modeling of phase equilibria to characterize the behavior of coal mineral matter and predict AFTs.
Another direction of research is the use of machine learning methods to predict AFTs [11,[20][21][22][23][24][25][26][27][28][29]. The most frequently used techniques were artificial neural networks (ANNs) and support vector machines (SVMs). Yin et al. [27] and Miao et al. [22] used back-propagation neural network models on Chinese coals to predict AFTs. In the first case, seven inputs were used; in the second, different numbers of inputs and hidden layers were tested and the model giving the best results was chosen. Karimi et al. [21] applied the adaptive neuro-fuzzy inference system (ANFIS) method to samples of US coals, with different combinations of metal oxides used as model inputs. The SVM method was used to create an AFT predictive model by Gao et al. [20], with the SVM parameters optimized by an improved ant colony algorithm. Recently, Xiao et al. [26] used an SVM model optimized by the grey wolf algorithm; the purpose of that study was to predict the IDT and investigate how SO3 as an input parameter affects prediction performance.
Despite many studies devoted to predicting AFTs and finding the relationship between ash composition and AFTs, this issue is still not fully understood and requires further research and improvement. For example, methods such as ANN and SVM are classified as black-box models. The main drawback of these types of models is that they cannot lead to any new knowledge about the process. The results cannot be checked in any simple way, as no additional information is available from the model. In other words, such models are unable to reveal the complicated relationship between ash oxides and AFTs.
Vassilev et al. [4] specified the hemispherical temperature (HT) as the major ash fusion temperature measurement, which corresponds to significant melting of most minerals (resulting from the intensive dissolution of the more refractory components) and to changes in the flow properties and viscosity of the plastic phases. In Poland, the HT value is the most desirable of the AFT characteristics when determining coal supply specifications for combustion processes.
In particular, stoker-fired boilers for private users are designed to burn coal with a specific HT. This value also forms the basis of the Polish coal classification according to ash fusibility (easily, medium and difficult fusible) [30].
Therefore, the aim of this study was to create a predictive model of HT using a machine learning method, the extreme gradient boosting (XGBoost) regressor [31]. It is a specific implementation (using more accurate approximations) of the gradient boosting method, which is an ensemble learner. The effectiveness of the XGBoost method has been proven by its better performance than other machine learning methods on a variety of benchmark datasets [32].
An important feature of models based on the XGBoost algorithm is the ability to determine the impact of individual input parameters on the predicted value using the feature importance (FI) technique. This allows the determination of the ash oxides with the greatest impact on the predicted HT. The main advantage of FI is its ability to capture non-linear relationships between parameters. Other methods of determining the significance of individual inputs, such as linear regression models, can only represent linear relationships; in that case, non-linearities must be modeled manually, which is difficult. XGBoost automatically captures both linear and non-linear dependencies between variables. In addition, the partial dependence plots (PDP) technique [33] was used to obtain a graphical representation of the marginal effect that a particular variable has on the ensemble predictions, ignoring the rest of the variables. Importantly, a PDP is not a representation of the dataset values but of the ensemble results; it therefore allowed us to investigate whether the relationship between a particular oxide and the predicted HT is linear, monotonic or more complex. Together, these two techniques enable an in-depth interpretation of machine learning black-box models.
The predictive model was established for 360 samples of Polish coals. Ash composition varies widely from country to country, and even between regions of a particular country [16,34]. Therefore, coals from different geological areas can reveal distinct ash fusibility due to their different coal-forming environments and different mineral matter compositions [35]. This study focuses on Polish coals from the Upper Silesian Coal Basin. To our knowledge, it is the first comprehensive analysis of Polish coal ashes and their influence on AFTs. This investigation can therefore be useful for comparative analysis with the results of studies of other basins.

Data Set
A set of 360 samples of Polish hard coals, acquired from various mines of the Upper Silesian Coal Basin (USCB), was the subject of the analyses presented in this work. The USCB is one of the most intensively mined areas in Europe, located in Poland and in the Czech Republic. Within the Polish territory (Figure 1), the area of the Upper Silesian Coal Basin is estimated at about 5600 km². There are 144 documented hard coal deposits, including 43 deposits under exploitation, 54 undeveloped deposits and 47 abandoned deposits [36].
For each sample, the ash oxide contents and the ash fusion temperatures were determined. The analyses of the samples were carried out in the accredited laboratory of the Department of Solid Fuel Quality Assessment at the Central Mining Institute, Poland. Descriptive statistics of the above-mentioned parameters in the analyzed set of samples are presented in Table 2. A histogram of the hemispherical temperatures in the analyzed set of samples is shown in Figure 2.
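As a sketch of the kind of descriptive summary reported in Table 2, pandas can compute the statistics directly; the values and column names below are illustrative, not the actual data set:

```python
import pandas as pd

# Hypothetical data frame with ash oxide contents (wt %) and the measured
# hemispherical temperature (HT, degrees C); the numbers are illustrative.
data = pd.DataFrame({
    "SiO2":  [45.1, 50.3, 38.7, 42.0],
    "Al2O3": [22.4, 27.1, 19.8, 24.5],
    "Fe2O3": [9.7, 6.2, 14.3, 8.1],
    "HT":    [1310, 1420, 1240, 1360],
})

# Descriptive statistics analogous to Table 2: count, mean, standard
# deviation, minimum, quartiles and maximum for every parameter.
stats = data.describe()
print(stats.loc[["mean", "std", "min", "max"]])
```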

Extreme Gradient Boosting (XGBoost)
XGBoost, short for extreme gradient boosting [31], is an algorithm extending the technique of gradient-boosted decision trees. The basic idea of boosting is to combine a series of weak prediction models into a single, strong learner. Models are added to the ensemble sequentially, in order to correct the errors made by the existing learners [37]. The gradient boosting approach enhances the flexibility of the boosting technique by employing the gradient descent algorithm to minimize the errors of the sequential models [38].
A simple gradient boosting method is characterized by highly accurate estimations, but, on the other hand, it is prone to over-fitting. The XGBoost algorithm overcomes this limitation by adding a regularization term to the objective function [31]:

$$Obj^{(r)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(r)}\right) + \sum_{k=1}^{r} \Omega(g_k) \quad (1)$$

where: $y_i$ is the real value, $\hat{y}_i^{(r)}$ is the prediction at the r-th round, $g_r$ is a term denoting the structure of the decision tree, $l(y_i, \hat{y}_i^{(r)})$ is the loss function, $n$ is the number of training examples, and $\Omega(g)$ is the regularization term, given by the formula:

$$\Omega(g) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 \quad (2)$$

where: $T$ is the number of leaves, $\omega_j$ is the weight of leaf $j$, and $\lambda$ and $\gamma$ are coefficients, with default values $\lambda = 1$, $\gamma = 0$.
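The regularized objective above can be sketched numerically. This is a minimal illustration, not XGBoost's internal implementation: the leaf weights and predictions are hypothetical, and squared error stands in for the loss l:

```python
import numpy as np

# Regularization term Omega(g) = gamma*T + 0.5*lambda*sum(w_j^2), with the
# default coefficients lambda = 1, gamma = 0 mentioned in the text.
def regularization(leaf_weights, lam=1.0, gamma=0.0):
    T = len(leaf_weights)                # T = number of leaves of the tree
    return gamma * T + 0.5 * lam * np.sum(np.square(leaf_weights))

# Regularized objective for a single newly added tree: the loss over the
# n training examples plus the penalty for the tree's structure.
def objective(y, y_pred, leaf_weights, lam=1.0, gamma=0.0):
    loss = np.sum((y - y_pred) ** 2)     # squared error stands in for l
    return loss + regularization(leaf_weights, lam, gamma)

y = np.array([1330.0, 1410.0])           # hypothetical HT values
y_pred = np.array([1325.0, 1415.0])      # hypothetical round-r predictions
print(objective(y, y_pred, leaf_weights=np.array([2.0, -1.0])))
```

With two leaves of weight 2 and -1 the penalty is 0.5 * (4 + 1) = 2.5, so a larger tree or heavier leaves directly raises the objective, which is how the term discourages over-fitting.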

Feature Importance (FI)
Individual input predictors are rarely equally important in practical data mining problems. Usually only a few features have a significant impact on the output. The vast majority of predictors are irrelevant and might as well not be included in the model [39].
Feature importance (FI) is the statistical significance of a feature with respect to its impact on the generated model. Feature importance is calculated explicitly for each feature in the dataset, allowing the features to be ordered and compared. The higher the relative importance of a given attribute, the more significant its impact on the output of the model. The feature importance score is determined after the boosted trees have been constructed. For a single decision tree $D$, the importance of predictor $x_j$ is calculated as [40]:

$$I_j^2(D) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}(v_t = j) \quad (3)$$

where the summation is over the $J-1$ non-terminal nodes $t$ (non-leaves) of the $J$-terminal-node tree $D$, $v_t$ is the splitting variable associated with node $t$, and $\hat{i}_t^2$ is the corresponding empirical improvement in squared error resulting from using predictor $x_j$ as the splitting variable in the non-terminal node $t$. The feature importance is then averaged across all $M$ decision trees of the model [33]:

$$I_j^2 = \frac{1}{M} \sum_{m=1}^{M} I_j^2(D_m) \quad (4)$$
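The importance computation above can be sketched as follows; the trees here are hypothetical (feature index, improvement) pairs standing in for real fitted trees:

```python
import numpy as np

# Sum the empirical squared-error improvements over the internal nodes
# that split on each feature, then average across the M trees.
def feature_importance(trees, n_features):
    M = len(trees)
    imp = np.zeros(n_features)
    for tree in trees:
        for feature_idx, improvement in tree:   # non-terminal nodes only
            imp[feature_idx] += improvement
    return imp / M                              # average over the M trees

trees = [
    [(0, 4.0), (1, 1.0)],   # tree 1 splits on features 0 and 1
    [(0, 2.0), (2, 3.0)],   # tree 2 splits on features 0 and 2
]
print(feature_importance(trees, n_features=3))
```

Feature 0 accumulates improvements in both trees, so it ends up with the highest averaged importance, mirroring how a dominant oxide would rank first in Figure 3.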

Partial Dependence Plots (PDP)
The partial dependence plot (PDP) shows the marginal effect of a given feature on the predicted outcome of a machine learning model by plotting the average model outcome for different values of that feature [33]. Hence, while feature importance specifies which input features most affect the prediction, the PDP visualizes how a given variable affects the prediction.
Given the output $f(x)$ of the machine learning model, the partial dependence function for the chosen predictor is defined by the following formula [33,41]:

$$f_S(x_S) = \mathbb{E}_{x_C}\left[f(x_S, x_C)\right] = \int f(x_S, x_C)\, \mathbb{P}(x_C)\, dx_C \quad (5)$$

where: $S$ is the chosen predictor, $C$ is the complement set of $S$ (containing all other predictors), $x_S$ and $x_C$ are feature vectors, and $\mathbb{P}(x_C)$ is the marginal probability density of $x_C$. Equation (5) can be estimated from a set of training data:

$$\hat{f}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} f(x_S, x_{C,i}) \quad (6)$$

where $x_{C,i}$ $(i = 1, 2, \ldots, n)$ are the values of $x_C$ that occur in the training sample. The partial dependence plot is a method with a global character: it considers all records and allows one to draw conclusions about the global relationship between the analyzed input variable and the predicted output [41].
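The training-data estimate described above can be sketched directly: fix the chosen predictor at each grid value and average the model output over the training values of all other features. The model and data below are toy stand-ins, not the paper's fitted ensemble:

```python
import numpy as np

# Estimate the partial dependence of model f on one feature by fixing it
# at each grid value and averaging over the rest of the training rows.
def partial_dependence(model, X, feature, grid):
    pd_values = []
    for value in grid:
        Xs = X.copy()
        Xs[:, feature] = value               # fix the chosen predictor S
        pd_values.append(model(Xs).mean())   # average over complement set C
    return np.array(pd_values)

# Toy model standing in for the trained ensemble (hypothetical).
model = lambda X: 1300.0 + 5.0 * X[:, 0] - 2.0 * X[:, 1]
X = np.array([[20.0, 8.0], [25.0, 10.0], [30.0, 12.0]])
print(partial_dependence(model, X, feature=0, grid=[20.0, 30.0]))
```

Plotting the returned averages against the grid values gives exactly the kind of curve shown in Figures 6-11.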

Model Evaluation
The performance of the predictive model was evaluated using the following metrics:

• Mean absolute error (MAE):
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - d_i| \quad (7)$$

• Root mean squared error (RMSE):
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - d_i)^2} \quad (8)$$

• Coefficient of determination $R^2$:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - d_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (9)$$

where: $y_i$ is the actual value of the dependent variable, $d_i$ is the value of the dependent variable determined from the model, and $\bar{y}$ is the arithmetic mean of the actual values of the dependent variable.
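The three metrics can be sketched directly from their definitions; the HT values and predictions below are illustrative numbers, not results from the study:

```python
import numpy as np

# Mean absolute error: average magnitude of the prediction errors.
def mae(y, d):
    return np.mean(np.abs(y - d))

# Root mean squared error: penalizes large errors more strongly than MAE.
def rmse(y, d):
    return np.sqrt(np.mean((y - d) ** 2))

# Coefficient of determination: fraction of variance explained by the model.
def r2(y, d):
    ss_res = np.sum((y - d) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1300.0, 1350.0, 1400.0, 1450.0])  # hypothetical measured HT
d = np.array([1310.0, 1340.0, 1410.0, 1445.0])  # hypothetical predictions
print(mae(y, d), rmse(y, d), r2(y, d))
```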

Software
Data preparation, variable importance, development of predictive model and its evaluation were carried out using the Python programming language and its libraries dedicated to data analysis, machine learning and visualization (Pandas, Scikit-Learn, Matplotlib, Seaborn, XGBoost and pdpbox).

Feature Importance
A benefit of using ensembles of decision trees such as XGBoost is that they automatically provide estimates of feature importance from the trained predictive model. The effect of 11 different ash components (oxides) on the output of the model was investigated using the FI measure (Figure 3). The results showed that Al2O3 had the most significant influence on the HT prediction, followed by Fe2O3, SO3, Na2O and CaO. These results are in good agreement with previous studies concerning the relationship between ash composition and AFTs. Vassilev et al. [4] examined the impact of the chemical and mineral composition of coal ashes on fusibility; according to them, the oxides of Si, Al, Fe, Ca and S contribute the most to the value of the hemispherical temperature. It is well known that acidic oxides (SiO2 and Al2O3) form a polymer structure that increases AFTs, while basic oxides (Fe2O3, CaO, MgO, Na2O and K2O) tend to break the network structure, which decreases AFTs [18,42]. However, the fusibility of the coal may first decrease to a minimum and then increase as the flux content increases [43].

Determining the Optimal Values of Model Hyperparameters
Every hyperparameter has a significant impact on the model performance; therefore, setting them correctly is crucial for the accuracy of the prediction. In this study, the model hyperparameters were estimated using the grid search procedure. Grid search is a hyperparameter tuning process that methodically builds and evaluates a model for each combination of the algorithm hyperparameters specified in a grid. The grid search was additionally combined with k-fold cross-validation, a technique widely used to evaluate model performance by handling the variance of the result set. The number of folds was set to 5. The model performed best with the following parameters:
• n_estimators = 200: the number of trees in the ensemble,
• learning_rate = 0.08: step size shrinkage used in the update to prevent overfitting,
• gamma = 0.3: minimum loss reduction required to make a further partition on a leaf node of the tree,
• subsample = 0.95: controls the number of samples (observations) supplied to a tree,
• min_child_weight = 1.5: the minimum number of instances required in a child node,
• colsample_bytree = 0.8: controls the number of features (variables) supplied to a tree,
• max_depth = 8: controls the depth of the tree.
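The grid search with 5-fold cross-validation can be sketched with scikit-learn's GridSearchCV. Here GradientBoostingRegressor stands in for the XGBoost regressor so the sketch needs only scikit-learn; the synthetic data and the reduced grid are illustrative, not those of the study:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 60 rows of hypothetical oxide contents and an
# HT-like target with noise (not the paper's 360-sample data set).
rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(60, 4))
y = 1300 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 5, 60)

# A reduced, illustrative grid over a few of the hyperparameters listed
# above; every combination is built and scored.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.08],
    "max_depth": [4, 8],
}
search = GridSearchCV(
    GradientBoostingRegressor(subsample=0.95, random_state=0),
    param_grid,
    cv=5,             # 5-fold cross-validation, as in the study
    scoring="r2",     # the metric used to compare models in this work
)
search.fit(X, y)
print(search.best_params_)
```

Swapping in `xgboost.XGBRegressor` as the estimator would allow the remaining XGBoost-specific parameters (gamma, min_child_weight, colsample_bytree) to be added to the grid.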

Evaluation of the Model
The performance of the XGBoost model for the prediction of HT was as follows:
• mean absolute error (MAE): 21.71,
• root mean squared error (RMSE): 29.16,
• coefficient of determination R²: 0.88.
It is assumed that the smaller the MAE and RMSE values, the more accurate the model. In the case of the R² measure, one of the most popular indicators, the closer the value is to 1, the more accurate the model. The results show that the model can predict the HT with satisfactory efficiency. This is confirmed by Figure 4, where the individual points are arranged along the regression line. To assess the performance of the proposed XGBoost model, two other typical predictive techniques were examined for the estimation of HT:
• support vector regression (SVR) with an RBF (radial basis function) kernel; the hyperparameters of this model were determined with the grid search procedure: C = 1, ε = 0.01, γ = 10,
• multiple linear regression (MLR); the coefficients of the model were determined by the least mean squares algorithm.
To compare these two techniques with the XGBoost model, the R² metric was applied. The results (Table 3) indicated that the proposed XGBoost model outperformed the other methods, as it had the highest R² value. While the effectiveness of the other machine learning technique, SVM, was also satisfactory (R² = 0.83), the performance of the linear regression model was low (R² = 0.34). This means that the linear relationships between the analyzed input variables and the predicted HT were very weak. The learning curve of the developed XGBoost model is presented in Figure 5. The shape and dynamics of a learning curve allow one to assess whether a machine learning model behaves correctly. Three typical dynamics are observed in learning curves: underfit, overfit and good fit. A good fit is indicated when the training and testing loss both decrease to a point of stability with a minimal gap between the two final loss values.
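The baseline comparison can be sketched by fitting the two reference models on the same held-out split and scoring them with R². The data here are synthetic and deliberately non-linear; the SVR hyperparameters mirror those quoted above, and none of the scores correspond to the study's results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic, deliberately non-linear data standing in for the ash set.
rng = np.random.default_rng(1)
X = rng.uniform(0, 50, size=(200, 3))
y = 1300 + 0.1 * X[:, 0] ** 2 - 4 * X[:, 1] + rng.normal(0, 5, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVR with RBF kernel, hyperparameters as quoted in the text above.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.01, gamma=10).fit(X_tr, y_tr)
# Multiple linear regression fitted by ordinary least squares.
mlr = LinearRegression().fit(X_tr, y_tr)

r2_svr = r2_score(y_te, svr.predict(X_te))
r2_mlr = r2_score(y_te, mlr.predict(X_te))
print("SVR R2:", r2_svr, "MLR R2:", r2_mlr)
```

A low MLR score on such data illustrates the point made above: when the input-output relationship is non-linear, a purely linear model cannot capture it, whatever its coefficients.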

Model Interpretation Using PDPs (Partial Dependence Plots)
Figures 6-11 show partial dependence plots for the oxides having the greatest impact on the HT prediction (according to Figure 3): Al2O3, Fe2O3, SO3, Na2O, CaO and SiO2 (which, in percentage terms, is the main component of ash). The blue areas in the plots indicate the extent of uncertainty.
The HT of the coals increased as the content of Al2O3 in the ash became higher; a significant increase in fusibility was observed when the Al2O3 content in the ash was higher than about 25% (Figure 6). Generally, as the Fe2O3 content increased, the HT of the coal samples decreased; however, when the concentration of Fe2O3 exceeded about 13%, the changes in HT were not significant (Figure 7). The decrease in fusibility associated with an increase in Fe2O3 content has also been described in other works [1,43]. The HT of the coals decreased as the content of SO3 became higher. A significant reduction of HT was observed when the SO3 content was in the range 0-5%; once the concentration of SO3 exceeded about 10%, the changes in HT were not significant (Figure 8).
However, it should be noted that SO3 does not exist in isolation in coal ash minerals, but together with other elements (Ca, Mg) in sulfate form. Therefore, it is the cations in these chemical compounds that affect the slag chemistry, not SO3 itself. For the analyzed sample set, the ash fusion temperature decreased as the concentration of Na2O in the ash increased. However, this trend was reversed when the Na2O content exceeded 3%, and an amount of Na2O in the ash higher than about 5% no longer contributed to changes in HT (Figure 9). The HT of the coal decreased with increasing CaO content until reaching a minimum at around 12% CaO; fusibility then increased gradually with further increases in CaO content (Figure 10). Similar results have been presented in other works, among others [1,43]. SiO2 is the main component of coal ash; however, the SiO2 fraction did not significantly affect the predicted HT (Figure 11).

Conclusions
• The aim of this study was to create an HT prediction model. The machine learning method used for this purpose, the XGBoost regressor, is well known to provide better solutions than many other machine learning algorithms.

• The effect of 11 different ash components (oxides) on the HT prediction was investigated using the feature importance technique. The results showed that Al2O3 had the most significant influence on the HT prediction, followed by Fe2O3, SO3, Na2O and CaO.

• The partial dependence plots technique was used to examine whether the relationship between a particular oxide and the predicted HT was linear, monotonic or more complex. It was revealed that:
o The HT of the coals increased as the content of Al2O3 in the ash became higher; a significant increase in fusibility was observed when the Al2O3 content in the ash was higher than about 25%.
o The ash fusion temperature decreased as the concentration of Na2O in the ash increased. However, this trend was reversed when the Na2O content exceeded 3%, and an amount of Na2O in the ash higher than about 5% no longer contributed to changes in HT.
o As the Fe2O3 content increased, the HT of the coal samples decreased. However, when the concentration of Fe2O3 exceeded about 13%, the changes in HT were not significant.
o The HT of the coal decreased with increasing CaO content until reaching a minimum at around 12% CaO; fusibility then increased gradually with further increases in CaO content.
o The HT of the coals decreased as the content of SO3 became higher. A significant reduction of HT was observed when the SO3 content was in the range 0-5%; once the concentration of SO3 exceeded about 10%, the changes in HT were not significant. However, it should be noted that SO3 does not exist in isolation in coal ash minerals, but together with other elements (Ca, Mg) in sulfate form; therefore, it is the cations in these chemical compounds that affect the slag chemistry, not SO3 itself.
o The SiO2 fraction did not significantly affect the predicted HT.

• The results showed that the model created in this study could predict the HT with satisfactory efficiency, with R² equal to 0.88. Finally, the results proved that XGBoost can be used as a reliable method for predicting HT.