1. Introduction
Coal is a carbonaceous sedimentary rock formed from the gradual transformation of plant matter under geological pressure and temperature conditions over millions of years. During this process, organic residues accumulate in oxygen-deficient environments such as swamps and wetlands, where decomposition is limited. Progressive burial and compaction lead to physical and chemical transformations that increase the carbon concentration and energy density of the material, eventually producing coal as a combustible fossil fuel [1,2].
As coalification progresses, plant material evolves through several stages (see Figure 1), including peat, lignite, bituminous coal, and anthracite, each characterized by increasing carbon content and calorific value. These stages reflect progressive loss of moisture and volatile components while the fixed carbon fraction increases. Consequently, the physicochemical properties of coal, including its heating value and suitability for energy applications, depend strongly on its elemental composition and coalification degree [3,4].
Coal classification and characterization commonly rely on parameters such as moisture content, ash content, volatile matter, and calorific value. In particular, ultimate analysis determines the weight percentages of the main elemental constituents of coal—carbon (C), hydrogen (H), oxygen (O), nitrogen (N), and sulfur (S)—which together describe the primary chemical structure of the fuel and strongly influence its energy performance.
Energy security, environmental preservation, and sustainable development are pressing global challenges. Long-term reliance on fossil fuels, especially coal, has exacerbated climate change, resource depletion, and the greenhouse effect. As a result, researchers worldwide are actively investigating efficient and environmentally friendly ways of using alternative fuels in the energy supply chain [3,4].
Coal plays a major role in the global energy supply. The widespread use of this solid fossil fuel has been driven by the growing demand for power generation and thermal energy; by 2030, coal consumption is expected to almost double [5]. Coal remains one of the most abundant and widely used fossil fuels worldwide and can be regarded as a long-term reservoir of stored solar energy. It contains intrinsic moisture and more than 50% organic content, mainly carbon [6].
Specifically, ultimate analysis quantifies the weight percentage of carbon in a coal sample, as well as the weight percentages of sulfur, nitrogen, and oxygen. It can also account for trace elements that may be present in coal. Data obtained from the elemental analysis of coal are essential for understanding its combustion behavior, including its heating value and suitability for diverse applications.
Ultimate coal analysis is a valuable method for succinctly characterizing the primary organic elemental composition of coal. In this procedure, the combustion of a representative coal sample is used to determine the weight percentages of hydrogen, sulfur, carbon, and nitrogen. An analyzer system measures the total nitrogen, carbon, and hydrogen from the same sample, while the remaining values are used to calculate the total oxygen content.
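The calculation of oxygen from the remaining values can be sketched as follows. This is a minimal illustration that assumes all fractions are reported in wt% on the same dry basis and that ash is included in the balance; the function name and the sample values are hypothetical and are not taken from the dataset used in this work.

```python
def oxygen_by_difference(c, h, n, s, ash=0.0):
    """Return oxygen (wt%) as the remainder after the measured fractions.

    Assumes every input is in wt% on the same (dry) basis.
    """
    measured = c + h + n + s + ash
    if measured > 100.0:
        raise ValueError("measured fractions exceed 100 wt%")
    return 100.0 - measured


# Hypothetical sample, for illustration only.
sample = {"c": 75.0, "h": 5.0, "n": 1.5, "s": 1.0, "ash": 8.0}
oxygen = oxygen_by_difference(**sample)
print(f"O by difference: {oxygen:.1f} wt%")
```

Because oxygen is obtained as a remainder, any measurement error in the other fractions accumulates in the oxygen value.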
Figure 2 presents the standardized procedure for conducting ultimate analysis in accordance with established protocols. It depicts the main stages involved in the ultimate analysis of coal:
Coal sampling: The coal is dried, ground, and sieved to produce a sample of uniformly small particles.
Laboratory test: The coal sample is burned to convert its constituent elements into the corresponding oxides.
Detection: The combustion products are measured to determine the composition of the sample.
Data analysis: The analysis results are used to establish the elemental composition and to assess possible applications.
The energy yield of a solid fuel is determined by its calorific value, commonly expressed as its higher heating value (HHV). Consequently, a variety of applications, such as classification, evaluation of energy potential, assessment of productive use, and pricing in commodities markets, depend on the accurate determination of coal’s HHV [7]. Moreover, knowledge of the HHV is crucial for the proper design and operation of coal-reliant systems [8]. It is therefore desirable to develop techniques that enable the quick and precise estimation of coal HHV, offering significant cost savings over conventional laboratory measurements. With this goal, earlier studies have proposed empirically derived mathematical correlations for predicting coal HHV from the elemental components determined by ultimate analysis [9,10,11,12,13,14].
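As an illustration of what such an empirical correlation looks like, the sketch below evaluates a Dulong-type formula. The coefficients are those commonly quoted for this formula (HHV in MJ/kg, elements in wt%) and are an assumption here; they are not necessarily the expressions evaluated in this work, and the sample composition is hypothetical.

```python
def hhv_dulong(c, h, o, s):
    """Dulong-type estimate of HHV in MJ/kg from C, H, O, S in wt%.

    Coefficients are the commonly quoted ones (an assumption, not taken
    from this paper). The term (h - o / 8) discounts hydrogen that is
    treated as already bound to oxygen.
    """
    return 0.3383 * c + 1.443 * (h - o / 8.0) + 0.0942 * s


# Hypothetical composition, for illustration only.
hhv = hhv_dulong(c=75.0, h=5.0, o=9.5, s=1.0)
print(f"Estimated HHV: {hhv:.2f} MJ/kg")
```

Correlations of this kind are linear in the elemental fractions, which is one reason nonlinear regression methods are explored as alternatives.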
In recent years, machine learning (ML) approaches have increasingly been applied to estimate the higher heating value (HHV) of coal from proximate or ultimate analysis data. Methods such as artificial neural networks, adaptive neuro-fuzzy inference systems, Gaussian process regression, and decision tree-based algorithms have shown promising predictive capability in comparison with traditional empirical correlations in the context of AI and energy-related regression modeling [
15,
16,
17,
18]. However, several limitations remain in the existing literature. Many studies focus primarily on predictive accuracy, while paying less attention to model interpretability, robustness to correlated predictors, and systematic hyperparameter optimization. In addition, the performance of some ML approaches may depend strongly on the specific dataset used, which may limit the generalizability of the models when applied to coal samples from different geological basins or coal ranks. Another common limitation is that several models operate as “black-box” predictors, providing limited insight into the relative contribution of elemental variables to the calorific value. Consequently, despite the progress achieved, there remains a need for predictive frameworks that combine high predictive performance with optimization strategies and interpretable analysis of feature importance.
Despite the increasing use of machine learning techniques for estimating the higher heating value (HHV) of coal, several limitations remain in the current literature. Many previous studies have focused on traditional models such as artificial neural networks, decision trees, Gaussian process regression, or hybrid metaheuristic approaches, often emphasizing predictive accuracy without providing a systematic framework for model optimization and interpretability. In particular, the application of extreme gradient boosting (XGBoost) to HHV prediction from ultimate analysis has received very limited attention, and the integration of advanced evolutionary optimization techniques for tuning its hyperparameters has not been sufficiently explored. Moreover, only a few studies have analyzed the relative importance of the elemental composition variables using explainable artificial intelligence tools, which are necessary to better understand the physical relevance of the predictors involved in energy conversion processes.
In this context, ensemble learning methods based on gradient boosting have recently attracted significant attention due to their strong predictive performance and ability to capture nonlinear interactions between variables. Among them, extreme gradient boosting (XGBoost) has proven particularly effective in various regression problems involving complex tabular datasets. Nevertheless, its potential for predicting coal HHV from elemental composition remains insufficiently explored, especially when combined with evolutionary optimization strategies capable of systematically tuning its hyperparameters. Furthermore, integrating explainable artificial intelligence techniques with such models can provide valuable insight into the physical relevance of the elemental predictors involved in fuel energy characterization. In particular, the ability of XGBoost to predict the HHV of coals of different ranks, deposits, and geographical regions has not yet been investigated. Such predictions rely on ultimate analysis, the comprehensive characterization of coal through the precise determination of its compositional components.
This work addresses an application that has not been previously explored. Here, an XGBoost model [19,20,21,22,23,24,25] is employed, with its hyperparameters tuned by means of differential evolution (DE) [26,27,28,29,30,31,32,33], and is subsequently used for HHV estimation in coal samples from different deposits and geographical settings.
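The idea of tuning parameters with differential evolution can be sketched with a plain DE/rand/1/bin loop. This is a minimal, standard-library illustration and not the implementation used in this work: the objective `toy_cv_error` is a hypothetical stand-in for a cross-validated prediction error, and the bounds, population size, and control parameters `f` and `cr` are assumed values.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=200, seed=0):
    """Minimize `objective` inside box `bounds` using DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the target i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(dim)  # at least one gene from the mutant
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < cr or j == forced:
                    gene = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    gene = min(max(gene, lo), hi)  # clip to the search box
                else:
                    gene = pop[i][j]
                trial.append(gene)
            trial_cost = objective(trial)
            if trial_cost <= costs[i]:  # greedy one-to-one selection
                pop[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best]


def toy_cv_error(params):
    """Hypothetical error surface over two hyperparameters (illustration only)."""
    learning_rate, depth = params
    return (learning_rate - 0.1) ** 2 + (depth - 6.0) ** 2


best_params, best_error = differential_evolution(
    toy_cv_error, bounds=[(0.01, 0.5), (1.0, 12.0)])
print(best_params, best_error)
```

In a real tuning run, the objective would train the regressor with the candidate hyperparameters and return a validation error, which makes each evaluation far more expensive than in this toy case.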
For comparison, the coal HHV output variable was also modeled on the observed dataset using random forest regression [34,35,36], M5 model trees [37,38,39], and multivariate linear regression [40,41]. The XGBoost methodology [19,20,21,22,23,24,25], a supervised learning method known for its robustness and ability to handle nonlinear relationships, is especially well suited to regression problems.
In a number of domains, such as fault location in non-homogeneous multi-terminal direct current (MTDC) systems [
42], building energy performance prediction [
43], and predictive modeling of blood pressure during hemodialysis [
44], extreme gradient boosting (XGBoost) has proven to be effective. Many factors highlight the advantages of the suggested XGBoost method [
19,
20,
21,
22,
23,
24,
25]. (1) High predictive performance: XGBoost typically delivers highly accurate regression results by sequentially combining multiple decision trees, thereby correcting errors from previous iterations (boosting); (2) Effective handling of nonlinear relationships: It can capture complex and nonlinear interactions between explanatory variables and the target variable, which many linear models cannot achieve without additional transformations; (3) Integrated regularization to prevent overfitting: XGBoost incorporates L1 and L2 regularization on the trees, controlling model complexity and reducing overfitting, particularly in regressions involving numerous predictors; (4) Robust handling of missing values: The algorithm can automatically determine the optimal direction in a tree for null values without requiring prior imputation, simplifying data preprocessing; (5) High computational efficiency: XGBoost is optimized for speed and memory usage through parallelization, efficient tree pruning, and optimized data structures, making it suitable for large datasets; (6) Feature importance assessment: It provides metrics for feature importance, facilitating model interpretation and identifying which variables most strongly influence the prediction of the continuous variable; and (7) Flexibility in the loss function: It allows the definition of various objective functions for regression (for instance, the Huber loss function, mean squared error, and mean absolute error), enabling adaptation to different problem types and error distributions.
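The boosting principle described in point (1), where each new tree corrects the errors left by the previous ones, can be sketched with one-split trees (stumps) and a squared-error loss. This is plain gradient boosting on a single feature, not XGBoost itself: it omits regularization, second-order gradient information, and missing-value handling. All names and the synthetic data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) of the best single split."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    return best[1:]


def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Fit stumps sequentially to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(rounds):
        # For squared loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, lmean, rmean = fit_stump(xs, residuals)
        stumps.append((threshold, lmean, rmean))
        preds = [p + learning_rate * (lmean if x <= threshold else rmean)
                 for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


# Synthetic, mildly nonlinear relation between a carbon-like input and HHV.
carbon = [60.0 + i for i in range(30)]
hhv = [0.3 * c + 0.002 * (c - 75.0) ** 2 for c in carbon]
base, stumps, history = boost(carbon, hhv)
print(f"training MSE: first round {history[0]:.4f}, last round {history[-1]:.4f}")
```

The training error recorded in `history` cannot increase from one round to the next for a learning rate between 0 and 2, which is the sense in which each stump corrects what the earlier ones missed.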
This investigation compares several machine learning models for estimating coal HHV from elemental analysis data collected from coal samples of various origins and locations. These approaches include the optimized DE/XGBoost-based model, the optimized DE/RFR-based model, the M5 model tree, and multivariate linear regression (MLR). Additionally, the study examines how five input variables, namely sulfur (S), hydrogen (H), oxygen (O), nitrogen (N), and carbon (C), affect the prediction of coal HHV as the target variable.
The current work attempts to close this research gap by proposing an efficient and interpretable machine learning framework for HHV prediction based on the ultimate analysis of coal samples. The novelty of this study lies in three main aspects. First, the hyperparameters of the XGBoost regression model are automatically optimized using the differential evolution (DE) algorithm, enabling a systematic search of the parameter space and improving predictive performance. Second, the proposed DE/XGBoost hybrid model is compared with several widely used approaches in HHV prediction, including random forest regression (RFR), M5 model trees, multivariate linear regression (MLR), and classical empirical correlations. Third, the interpretability of the predictive model is enhanced through the use of SHAP (SHapley Additive exPlanations) values, which allow the quantification and ranking of the influence of the elemental variables on the predicted HHV. These contributions offer a methodology for estimating the calorific value of coal from its elemental composition that is more precise, efficient, and interpretable.
The rest of the paper is structured as follows. First, the materials and methods used in this investigation are described. Next, the results are presented and discussed. Finally, the main conclusions are drawn.
3. Results and Discussion
Figure 8 displays the correlation matrix for all variables considered in the ultimate analysis.
A strong negative Pearson correlation (coefficient of −0.87) was observed between carbon (C) and oxygen (O) content, indicating significant multicollinearity between these variables. This relationship, inherent to the stoichiometric structure of carbonaceous matter, poses challenges for model interpretation and stability, even for robust algorithms like XGBoost. While XGBoost can accommodate correlated predictors, the inclusion of oxygen may distort the assessed relative importance of variables, artificially inflating its perceived contribution due to its inverse relationship with carbon, the primary determinant of the higher heating value (HHV).
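For reference, the Pearson coefficient behind a correlation matrix of this kind can be computed as in the sketch below. The carbon and oxygen values are hypothetical and chosen only to show a strongly negative relationship; they are not the data analyzed here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical wt% values, for illustration only.
carbon = [62.0, 68.5, 74.0, 80.5, 88.0]
oxygen = [28.0, 22.5, 17.0, 12.5, 4.0]
r = pearson_r(carbon, oxygen)
print(f"r(C, O) = {r:.3f}")
```

A coefficient close to −1 means that one variable carries almost the same information as the other with the opposite sign, which is the redundancy argument made for oxygen.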
From a thermodynamic standpoint, the oxygen in coal does not directly contribute to energy release during combustion. Instead, it is associated with functional groups such as hydroxyl (–OH) or carbonyl (C=O), which reduce the availability of carbon and hydrogen for oxidation but do not generate significant heat. Oxygen thus acts as an inverse indicator of coal rank, reflecting lower energy quality. For instance, empirical formulas like Dulong’s often disregard or correct for oxygen to simplify calculations without compromising accuracy [
45]. Thus, the decision to omit oxygen is justified by its redundancy, lack of direct energetic contribution, and adherence to the principle of parsimony, resulting in a more robust and theoretically consistent model. Consequently, oxygen is treated as a redundant variable, whose exclusion optimizes the performance of the DE/XGBoost model in both statistical and physical terms.
Table 4 and Table 5 list the optimal hyperparameters found by the differential evolution (DE) optimizer for the XGBoost-based and RFR-based models of the coal’s HHV, respectively.
For comparison purposes, this investigation also employed the M5 model tree and multivariate linear regression (MLR) models.
Figure 9 displays the first-order terms of the DE/XGBoost method, which make it easier to see how each input variable relates to the output. For example, the first graph in Figure 9 plots the coal’s HHV on the Y-axis against the carbon concentration (C) on the X-axis, with the other input variables held constant. The second and third graphs in Figure 9 show the coal’s higher heating value on the Y-axis in relation to the hydrogen and nitrogen concentrations on the X-axis, respectively, again with all other input variables held constant.
In a similar manner, the second-order terms of the DE/XGBoost technique are shown in
Figure 10. Furthermore, when all other factors are held constant, the first graph in
Figure 10a shows the coal’s HHV on the
Z-axis as a result of the hydrogen concentration on the
Y-axis and the carbon composition on the
X-axis. Similar patterns can be seen in the other graphs in
Figure 10b,c, which plot the coal HHV on the
Z-axis against the contents of carbon and nitrogen on the
X-axis, and hydrogen and nitrogen on the
Y-axis, respectively, while keeping the other variables constant.
Table 6 compiles representative empirical correlations documented in the literature for estimating coal HHV. These expressions are formulated from the elemental composition obtained by ultimate analysis, using the mass fractions of the principal coal constituents as predictor variables. The formulas capture synergistic effects such as the oxidation of sulfur or the energetic contribution of hydrogen, which improves predictive accuracy over simple linear models by reflecting the complexity of the carbonaceous matrix.
The DE-XGBoost, DE-RFR, M5 model tree, and multivariate linear regression models’ coefficients of determination and correlation are presented in
Table 7, together with the results for F6 [
12], the best-performing empirical correlation, using the test dataset.
The statistical results indicate that the XGBoost technique is the best model for predicting coal HHV as the dependent variable across different types of coal. For the coal HHV, this approach yielded a coefficient of determination of 0.9691 and a correlation coefficient of 0.9858. These values indicate a consistently good fit, that is, close agreement between the XGBoost predictions and the experimentally measured values of the samples.
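The two goodness-of-fit statistics can be sketched as follows, assuming the coefficient of determination is computed as 1 − SS_res/SS_tot (the source does not state the exact formula used). Under that definition it need not equal the square of the correlation coefficient, which would be consistent with the reported 0.9691 being slightly below 0.9858² ≈ 0.9718. The observed and predicted values below are hypothetical.

```python
from math import sqrt


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def correlation(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = sqrt(sum((o - mo) ** 2 for o in observed))
    sp = sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (so * sp)


# Hypothetical HHV values (MJ/kg): predictions are offset by a constant bias.
observed = [20.0, 24.0, 28.0, 32.0]
predicted = [21.0, 25.0, 29.0, 33.0]
r2 = r_squared(observed, predicted)
r = correlation(observed, predicted)
print(f"R2 = {r2:.3f}, r = {r:.3f}")
```

The constant bias leaves the correlation untouched but lowers the coefficient of determination, so the two statistics together are more informative than either alone.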
Importance of the Variables
Each feature’s contribution to a machine learning model’s prediction for a particular instance is represented by its SHAP (SHapley Additive exPlanations) value [67]. SHAP is founded on Shapley values from cooperative game theory, adapted to measure feature importance in predictive models. Because SHAP is model-agnostic, it can be used with any machine learning model, including intricate models like XGBoost. This flexibility allows the importance of variables to be interpreted consistently across many models. SHAP offers explanations that are both local (individual prediction) and global (overall feature importance), which helps show how the variables affect the model’s predictions at different levels of granularity. Feature importance is thus assigned in a fair and accurate manner, reflecting each feature’s real contribution to the model’s output [67,68]. A SHAP value measures the degree to which a feature shifts a model’s prediction relative to its baseline (or expected) output. Negative SHAP values indicate that a feature decreases the predicted response, whereas positive SHAP values indicate that it increases it. The spread of points along the x-axis reflects how the contribution of that feature varies across individual observations.
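The game-theoretic idea behind these values can be sketched by computing exact Shapley values for a tiny model. This illustration makes a simplifying assumption: a feature counts as "absent" when it is set to a single baseline value, whereas SHAP implementations typically average over a background dataset. The model `toy_hhv`, its coefficients, and the inputs are hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for j in order:
            current[j] = x[j]          # switch feature j from baseline to x
            value = model(current)
            phi[j] += value - previous  # marginal contribution of feature j
            previous = value
    return [p / len(orders) for p in phi]


def toy_hhv(features):
    """Hypothetical model with a carbon-hydrogen interaction term."""
    c, h, s = features
    return 0.3 * c + 1.2 * h + 0.1 * s + 0.01 * c * h


x = [80.0, 5.0, 1.0]         # instance to explain (hypothetical)
baseline = [70.0, 4.0, 2.0]  # reference composition (hypothetical)
phi = shapley_values(toy_hhv, x, baseline)
print(phi, sum(phi), toy_hhv(x) - toy_hhv(baseline))
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that the SHAP name refers to. Enumerating all orderings is only feasible for a handful of features.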
A common way to gauge a variable’s importance is to compute its mean absolute SHAP value across all samples. This index measures the average magnitude of the variable’s contribution to the model’s predictions, irrespective of direction [68]:
Here, n is the number of samples. Larger mean absolute SHAP values indicate a greater impact of the variable on the model’s estimates.
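In an assumed notation (the symbols I_j and φ are introduced here for illustration and may differ from those of the original expression), with φ_j^{(i)} denoting the SHAP value of variable j for sample i, the mean absolute SHAP value can be written as:

```latex
I_j = \frac{1}{n} \sum_{i=1}^{n} \left| \phi_j^{(i)} \right|
```

Taking the absolute value before averaging prevents positive and negative contributions from cancelling out.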
The summary plot of the SHAP technique displays the importance and effects of the input variables. Each point is a Shapley value for a particular observation and input variable. The x-axis displays the Shapley value, and the y-axis represents the input variable. Color indicates the value of the input variable. Points are jittered along the y-axis to better show the distribution of the Shapley values for a specific variable. The variables are ordered by importance: the higher a variable appears on the y-axis, the more significant it is. A consistent color gradient, such as red dots lying to the right, indicates a positive relationship between the value of the feature and its SHAP value. If both sides contain blue and red points, the relationship is complex or nonlinear. The SHAP values for predicting the coal HHV are shown in Figure 11.
A further result of these analyses, displayed in Table 8 and Figure 12, is the ranking of the process variables (input factors) by their relevance in predicting the coal HHV (output variable). According to the XGBoost framework, carbon content (C) emerges as the major predictor of coal HHV. Hydrogen content (H), nitrogen content (N), and sulfur content (S) come next, in decreasing order of significance.
Understanding the combustion of solid fuels such as coal requires knowledge of their elemental composition, obtained through ultimate analysis, because it determines both the thermal energy produced (HHV) and aspects of energy management such as the potential for pollutant emissions like NOx or SO2. The ultimate analysis gives the S, N, O, H, and C percentages.
In energy research, carbon (C) and hydrogen (H) are the key parameters used to estimate HHV, whereas nitrogen (N) and sulfur (S) are relevant mainly for sustainable energy management during energy conversion. In this context, carbon (C), as the main component of all carbonaceous fuels, has a decisive role in the material’s energy performance [69].
According to the DE/XGBoost importance ranking, carbon (C) is the key component of the proposed model, since it contributes directly to the energy released during combustion and is the most significant indicator of the degree of coalification. Carbon is one of the main energetic constituents of coal, as its oxidation produces a substantial amount of heat per unit mass. The higher the carbon content of a coal (for example, anthracite compared with lignite), the greater its calorific value, making carbon the dominant variable in energy conversion calculations and in the evaluation of energy management in technologies such as co-combustion of fuels, oxy-fuel combustion, emissions from furnaces, and carbon capture and sequestration [70].
Hydrogen also contributes significantly to the HHV, as its combustion produces water and releases a greater amount of heat per unit mass than carbon, although its content in coal is usually lower. In the definition of HHV, it is assumed that the water formed from hydrogen combustion condenses, thereby recovering the latent heat of vaporization, which substantially increases its energetic contribution in the energy conversion processes [
71]. In this sense, the oxygen in coal is bonded partly to hydrogen (for example, in hydroxyl groups) and mainly to carbon (for example, in carbonyl groups) [
69].
Nitrogen does not significantly contribute to the fuel energy; instead, part of the available energy is dissipated in the formation of nitrogen oxides, which are pollutant compounds generated through chemical reactions. Sulfur does release heat upon oxidation, but its content in coal is relatively low and its specific energy is much smaller than that of carbon or hydrogen, so its contribution to energy production is limited; however, it is relevant in the control of energy production due to the sulfur–nitrogen interactions with air [
72].
For all these reasons, understanding the fuel resource is essential because its elemental composition directly determines the HHV and thus the amount of energy that can be obtained during combustion. Accurate knowledge of the resource allows for better design and optimization of resources for energy conversion, improving energy efficiency and reducing environmental impacts [
8,
73,
74,
75].
In summary, this study effectively illustrates how to use the DE/XGBoost-based method to estimate the coal HHV as an output factor in accordance with the actual observed values. The DE/XGBoost model captures nonlinear interactions reflecting key combustion processes. For example, the synergistic effect of carbon (C) and hydrogen (H) on HHV aligns with their combined oxidation efficiency, while sulfur’s (S) nonlinear contribution—positive at low concentrations but inhibitory at high levels—stems from SO2 formation competing with fuel oxidation. These dependencies, poorly addressed by linear empirical formulas (e.g., Dulong), underscore ML’s ability to integrate complex physicochemical mechanisms.
Figure 13 compares the experimental and predicted values of the coal HHV for the following models: the most accurate empirical correlation, F6 (Figure 13a); the MLR method (Figure 13b); the M5 model tree (Figure 13c); the DE/RFR-based model (Figure 13d); and the DE/XGBoost-based model (Figure 13e). These findings show that, among the models compared, the DE/XGBoost-based method best satisfies the statistical goodness-of-fit criterion (R²) and offers the best fit.