3.1. Model Performance Evaluation
To evaluate the predictive performance of the machine learning models, four complementary statistical metrics were applied: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and the coefficient of determination (R²). Together, these metrics quantify the magnitude of prediction errors and the proportion of variance explained, providing a comprehensive assessment of the models’ accuracy.
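As a minimal sketch of how these four metrics can be computed for one target, the snippet below uses only the Python standard library. The function name `regression_metrics` and the sample values are illustrative assumptions, not the authors' code or data. MAPE is returned as a fraction, which appears to match the scale of the values reported in this section.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (as a fraction) and R^2 for paired values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # MAPE is undefined when any true value equals zero.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Illustrative values only, not data from the study.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 380.0]
metrics = regression_metrics(y_true, y_pred)
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, which is why a few large deviations affect the RMSE ranking more than the MAE ranking.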
Table 6 presents the MAE values. Among the tested algorithms, ExtraTrees achieved the lowest MAE in most cases, indicating the highest predictive accuracy. For heat evolved (12–168 h), its errors ranged from 12.090 to 24.109, outperforming both CatBoost and XGBoost, which exhibited larger deviations, particularly after 168 h. The only exception was heat evolved (72 h), where XGBoost performed slightly better. Similarly, for the rate of heat evolution, ExtraTrees (MAE = 2.328) outperformed CatBoost (MAE = 2.595) and XGBoost (MAE = 2.801).
The RMSE results (Table 7) confirm the trends observed for the MAE. The ExtraTrees model achieved the lowest errors for most target variables (particularly at 12 h, at 168 h, and for the rate of heat evolution), demonstrating its strong overall predictive accuracy. The only exception occurred at 72 h, where XGBoost slightly outperformed ExtraTrees (RMSE = 20.739 vs. 20.969). Nevertheless, the consistently lower RMSE values of ExtraTrees across the remaining targets highlight its robustness in predicting both cumulative and rate-based heat evolution.
A comparison of the models based on MAPE values (Table 8) indicates that XGBoost performed best at the earlier ages, whereas ExtraTrees led for the remaining targets. For heat evolved after 12 and 72 h, XGBoost achieved the lowest MAPE values (0.178 and 0.076, respectively), demonstrating superior capability in modeling early-stage hydration behavior. After 168 h, however, the ExtraTrees model slightly outperformed the others (0.103 versus 0.110 for XGBoost and 0.120 for CatBoost), suggesting that its ensemble averaging approach may better capture long-term cumulative heat trends. For the rate of heat evolution, ExtraTrees again achieved the lowest MAPE (0.209), followed by CatBoost (0.239) and XGBoost (0.245).
The coefficient of determination (R²) values (Table 9) confirm that all three models achieved a high degree of fit, particularly during the early stages of hydration. The ExtraTrees model attained the highest R² values at 12 h (0.942) and 168 h (0.765), as well as for the rate of heat evolution (0.907). It was only slightly outperformed by XGBoost (0.913) for heat evolved (72 h), where it achieved an R² of 0.911.
Overall, the comparative analysis of MAE, RMSE, MAPE, and R² demonstrates that the ExtraTrees model provided the most consistent and accurate predictions across the evaluated targets. The XGBoost algorithm also showed strong performance in several cases, while CatBoost generally exhibited lower accuracy, though it occasionally surpassed XGBoost in specific instances. These findings indicate that ExtraTrees offers the best overall balance of precision, robustness, and generalizability among the tested models.
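The per-target comparison above amounts to choosing the minimum for error metrics and the maximum for R². The sketch below illustrates this rule using only values quoted in the text of this section. The dictionary layout and the helper `best_model` are hypothetical, and the 72 h R² entry omits CatBoost because its value is not quoted here.

```python
# Values quoted in the text of Section 3.1; the layout is an assumption.
reported = {
    ("MAE", "rate of heat evolution"): {
        "ExtraTrees": 2.328, "CatBoost": 2.595, "XGBoost": 2.801},
    ("MAPE", "heat evolved (168 h)"): {
        "ExtraTrees": 0.103, "XGBoost": 0.110, "CatBoost": 0.120},
    ("MAPE", "rate of heat evolution"): {
        "ExtraTrees": 0.209, "CatBoost": 0.239, "XGBoost": 0.245},
    # CatBoost is omitted because its value is not quoted in the text.
    ("R2", "heat evolved (72 h)"): {
        "ExtraTrees": 0.911, "XGBoost": 0.913},
}

HIGHER_IS_BETTER = {"R2"}  # all other metrics here are errors

def best_model(metric, scores):
    """Return the model name with the best score for the given metric."""
    pick = max if metric in HIGHER_IS_BETTER else min
    return pick(scores, key=scores.get)

winners = {key: best_model(key[0], scores) for key, scores in reported.items()}
```

A simple win count of this kind treats every margin equally, so it is best read together with the size of the differences, such as the 0.002 gap in R² at 72 h.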
3.2. Residual Analysis of the Machine Learning Models
To complement the model evaluation based on error metrics, a residual analysis was conducted for all tested models. This approach enables a more detailed examination of model behavior by assessing the distribution, magnitude, and potential structure of residuals. The analysis includes graphical diagnostics such as True vs. Predicted, Residuals vs. Predicted, histogram of Residuals, and Q-Q plots. These visualizations provide additional insights into model stability, error patterns, and potential dependencies that may influence prediction quality.
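The data behind the four diagnostic panels can be derived from the true and predicted values alone. The following is a minimal sketch under stated assumptions: residuals are taken as true minus predicted (consistent with negative residuals being read as overestimation in this section), the Q-Q plotting positions are (i − 0.5)/n, and all function names and sample values are illustrative rather than taken from the study.

```python
import math
from statistics import NormalDist, mean, stdev

def residuals(y_true, y_pred):
    # Sign convention assumed here: residual = true - predicted,
    # so a negative residual means the model overestimated.
    return [t - p for t, p in zip(y_true, y_pred)]

def histogram_counts(res, bin_width):
    """Count residuals per bin [k*w, (k+1)*w), keyed by the bin's left edge."""
    counts = {}
    for r in res:
        left_edge = math.floor(r / bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return counts

def qq_points(res):
    """Pair theoretical normal quantiles with sorted standardized residuals."""
    n = len(res)
    m, s = mean(res), stdev(res)
    observed = sorted((r - m) / s for r in res)
    # Plotting positions (i - 0.5) / n are one common choice among several.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# Illustrative values only, not data from the study.
y_true = [50.0, 120.0, 180.0, 240.0, 310.0]
y_pred = [55.0, 118.0, 190.0, 236.0, 300.0]

res = residuals(y_true, y_pred)
true_vs_pred = list(zip(y_true, y_pred))   # panel (a): compare with y = x
hist = histogram_counts(res, bin_width=5)  # panel (b)
res_vs_pred = list(zip(y_pred, res))       # panel (c): compare with zero line
qq = qq_points(res)                        # panel (d): compare with y = x
```

Points in `qq` that lie far from the line y = x at either end correspond to the tail deviations described for the Q-Q panels.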
In the True vs. Predicted plot (Figure 4a), a clear linear relationship is observed between the actual and predicted values, with most points distributed close to the reference line. Some deviations occur, particularly at higher value ranges, suggesting increased dispersion for larger true values. The histogram of Residuals (Figure 4b) shows an asymmetric distribution, with negative errors prevailing over positive ones. Although the histogram peaks near zero, several positive and negative outliers are present. The Residuals vs. Predicted plot (Figure 4c) indicates that residuals remain generally clustered around zero across the prediction range. They are distributed both above and below the zero line without a distinct directional trend, though a slight increase in dispersion is noticeable for higher predicted values. Finally, the Q-Q plot of residuals (Figure 4d) shows that most points align well with the reference line, suggesting partial conformity to a normal distribution. However, deviations at the extremes indicate the presence of outliers.
The True vs. Predicted plot (Figure 5a) illustrates the relationship between the actual and predicted values of the CatBoost model for the heat evolved parameter after 72 h. Most points lie along the reference line, indicating good agreement between predictions and observations. However, some dispersion is evident at lower actual values, suggesting greater model deviation in this range. True–Predicted plots are used to assess how closely model estimates follow the ideal 1:1 relationship between observed and predicted values, providing a visual indication of overall model agreement [42]. The histogram of Residuals (Figure 5b) shows a distribution close to normal, with a slight predominance of negative residuals and a few extreme values on both sides. Histograms of residuals are widely applied to evaluate the symmetry and approximate normality of prediction errors, helping to identify skewness or extreme deviations [43]. The Residuals vs. Predicted plot (Figure 5c) indicates that the residuals fluctuate around zero across the prediction range, with a minor increase in dispersion at lower predicted values. Residuals–Predicted plots allow visual inspection of potential bias or heteroscedasticity, as systematic patterns in the residuals may indicate model misspecification [44]. In the Q-Q plot (Figure 5d), the residuals largely follow the reference line, particularly in the central region. Deviations at the distribution tails suggest slight asymmetry or the presence of outliers. Q-Q plots are used to compare the empirical distribution of residuals with the theoretical normal distribution, where alignment with the reference line suggests approximate normality [45].
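A change in residual dispersion across the prediction range, as read from the Residuals vs. Predicted plots, can also be given a rough numerical counterpart. The sketch below compares the residual standard deviation below and above the median prediction. It is a crude illustration and not a formal heteroscedasticity test; the function name `spread_by_half` and the sample values are assumptions.

```python
from statistics import median, pstdev

def spread_by_half(y_pred, res):
    """Return residual SD below and at/above the median predicted value."""
    cut = median(y_pred)
    low = [r for p, r in zip(y_pred, res) if p < cut]
    high = [r for p, r in zip(y_pred, res) if p >= cut]
    return pstdev(low), pstdev(high)

# Illustrative values only: scatter grows with the predicted value.
y_pred = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
res = [1.0, -1.0, 1.0, -4.0, 4.0, -4.0]

low_sd, high_sd = spread_by_half(y_pred, res)
ratio = high_sd / low_sd  # values well above 1 hint at wider scatter at high predictions
```

With small samples, such a ratio is itself highly variable, so it complements rather than replaces the visual inspection.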
The True vs. Predicted plot (Figure 6a) illustrates the distribution of predicted versus actual values for the CatBoost model and the heat evolved (168 h) variable. Most points lie close to the y = x line, indicating good overall agreement, although moderate scatter is observed both above and below the line across the full value range. In the lower range, predictions tend to be slightly underestimated, while at higher values, both underestimations and overestimations occur. The histogram of Residuals (Figure 6b) shows that most predictions cluster around values slightly above zero, with a distinct peak in the 0–25 range. The distribution is slightly asymmetric, with a minor skew toward negative residuals. The Residuals vs. Predicted plot (Figure 6c) does not reveal any clear systematic trend, though a moderate increase in scatter is noticeable for higher predicted values (above 250), where residuals exhibit larger positive and negative amplitudes. The Q-Q plot (Figure 6d) indicates a general conformity of the residual distribution to normality in the central region, where points align closely with the theoretical reference line. Minor deviations appear at the distribution tails, particularly in the lower part, where several residuals are more negative than expected under a normal distribution.
The True vs. Predicted plot (Figure 7a) shows a clear linear relationship between the actual and predicted values for the rate of heat evolution. Most points lie close to the y = x line, indicating strong agreement between predictions and observed data. However, for higher values (above 30), several points fall slightly below the line, suggesting a minor tendency toward underestimation in this range. The histogram of residuals (Figure 7b) peaks between −2.5 and 0, although a noticeable number of cases also fall within the range of 0 to 2.5. This distribution indicates that the model produces both small underpredictions and small overpredictions, with most errors remaining close to zero on either side. This pattern suggests the absence of a strong directional bias, as the residuals are primarily shaped by natural variability in the data. The Residuals vs. Predicted plot (Figure 7c) shows that residuals are generally distributed around zero across the full prediction range. In the lower prediction range (below approximately 20 units), a predominance of negative residuals is observed, indicating a slight overestimation tendency in this region. At higher values, the scatter increases but remains largely random, without a clear directional pattern. The Q-Q plot (Figure 7d) indicates that the residual distribution aligns well with the normal distribution in the central region. Minor deviations are visible at both tails, particularly in the upper range, where several observations exceed the theoretical values predicted by the normal distribution.
The True vs. Predicted plot (Figure 8a) presents the results of the ExtraTrees model for heat evolved after 12 h. The points are broadly distributed around the y = x line, showing particularly good agreement in the mid-range values, while slightly greater dispersion is observed at higher values. The histogram of Residuals (Figure 8b) reveals an asymmetric error distribution, with a noticeably higher number of negative residuals, suggesting that the model tends to overestimate slightly in some cases. Despite this asymmetry, most residuals fall within a relatively narrow range. In the Residuals vs. Predicted plot (Figure 8c), residuals are scattered on both sides of the zero baseline, indicating that the model does not display any significant systematic bias across the prediction range. In the lower prediction range (below 80), a slight predominance of negative residuals can be observed, again suggesting a minor tendency to overestimate in this area. The Q-Q plot (Figure 8d) shows that most points align closely with the reference line in the central part of the distribution, indicating partial conformity to normality. Deviations occur primarily at the tails, especially in the upper part, where several observations exhibit higher-than-expected values.
In the True vs. Predicted plot (Figure 9a) for the ExtraTrees model, most points lie close to the perfect-fit (y = x) line, indicating strong agreement between predictions and observations. The scatter around the line appears relatively uniform across the data range, though a few larger deviations are visible at lower values, where some observations are slightly overestimated. High true values are well captured, showing no clear pattern of systematic error. The histogram of Residuals (Figure 9b) displays a relatively symmetrical distribution centered near zero, with most residuals falling within the range of approximately −30 to +20, and only a few distant values. In the Residuals vs. Predicted plot (Figure 9c), points are distributed on both sides of the zero axis, indicating no apparent relationship between error magnitude and predicted value. In the lower prediction range, greater scatter and several more negative residuals appear, suggesting occasional overestimation and one case of significant underestimation. For higher predicted values, errors are smaller and more symmetrically distributed around zero. The Q-Q plot (Figure 9d) shows that points in the central region align closely with the reference line, while larger deviations occur at the distribution tails. These deviations reflect individual outliers but do not significantly affect the overall conformity of residuals to a normal distribution.
The True vs. Predicted plot (Figure 10a) shows that the ExtraTrees model captures the overall increasing trend between the true and predicted values; however, many samples display noticeable deviations from the y = x reference line. This dispersion indicates that, although the model learns the general direction and relative ordering of the data, its point-wise predictive accuracy is limited, with several predictions showing either overestimation or underestimation across the target range. The histogram of Residuals (Figure 10b) displays a largely symmetrical distribution centered around zero, with a slight predominance of negative residuals. Most residuals fall within the range of approximately −25 to +25, while only a few outliers reach values near −75 and +75. The Residuals vs. Predicted plot (Figure 10c) shows that errors remain evenly distributed around the zero axis across the entire prediction range. The distribution of residuals is relatively uniform, though a few positive and negative outliers appear at higher predicted values. The Q-Q plot (Figure 10d) indicates partial conformity of the residuals to a normal distribution in the central range, where most points align closely with the theoretical line. Deviations are visible at both distribution tails, suggesting the presence of a few extreme cases.
The True vs. Predicted plot (Figure 11a) for the model shows a clear linear relationship between predicted and actual values. Most points lie close to the y = x line, indicating strong agreement between predictions and observed data. Larger discrepancies appear at higher values, suggesting slightly reduced accuracy in this range. The histogram of Residuals (Figure 11b) reveals an asymmetry toward negative values, with most residuals concentrated between −2.5 and 0, forming a dominant left tail of the distribution. A few higher positive residuals (above 5) occur sporadically and may represent outliers. The Residuals vs. Predicted plot (Figure 11c) shows that residuals generally cluster around zero, confirming a good model fit. A slight increase in residual dispersion is visible for higher predicted values. The Q-Q plot (Figure 11d) indicates that the central portion of the residual distribution aligns well with the normal distribution, while deviations from the reference line appear at the extremes, particularly for the highest positive residuals.
In the True vs. Predicted plot (Figure 12a) for the XGBoost model, most points lie close to the perfect-fit (y = x) line, indicating strong agreement between experimental results and model predictions. Moderate dispersion is observed, particularly at higher values, where a few points deviate further from the line. The histogram of Residuals (Figure 12b) is relatively symmetrical around zero, suggesting that the model does not exhibit a clear tendency to systematically over- or underpredict. A slight predominance of residuals near zero and slightly negative values is visible. A few isolated extreme residuals, ranging from approximately −60 to +60, indicate the presence of outliers. The Residuals vs. Predicted plot (Figure 12c) shows points evenly dispersed around the zero reference line, without a discernible systematic trend. A few larger positive and negative residuals are present, likely reflecting the inherent variability of the experimental data. The Q-Q plot (Figure 12d) demonstrates partial conformity of the residuals to a normal distribution, particularly in the central region where most points align closely with the reference line. Minor deviations at the distribution tails suggest the presence of outliers but do not significantly affect the overall goodness of fit.
The True vs. Predicted plot (Figure 13a) for the XGBoost model shows a clear linear relationship between predicted and actual values. Most points lie close to the y = x reference line, demonstrating strong agreement between model predictions and observed data. One point deviates notably from the line, indicating a single case of greater underestimation. The histogram of Residuals (Figure 13b) displays a broadly symmetric distribution with a slight asymmetry: the left tail extends to approximately −40, while the right tail reaches about +80. A single large positive residual deviates from the general trend, whereas most residuals fall within the −30 to +30 range, indicating good prediction stability. The Residuals vs. Predicted plot (Figure 13c) shows residuals dispersed around the zero axis across the full prediction range. Most residuals fall between −20 and +20, reflecting small errors and stable predictions. A few outliers are visible at both ends of the predicted range, including one significantly higher positive residual, likely attributable to natural experimental variability. The Q-Q plot (Figure 13d) shows that most points align closely with the reference line, indicating a near-normal residual distribution. Deviations at the extremes suggest the presence of outliers and a slight asymmetry in the distribution.
The True vs. Predicted plot (Figure 14a) for the XGBoost model shows that most points cluster near the y = x line, indicating good agreement between model predictions and observed values. The scatter of points around the line remains relatively uniform, though deviations appear at both the lower and upper ends of the range, without a clear systematic pattern of over- or underestimation. The histogram of Residuals (Figure 14b) reveals a distribution centered close to zero, with the highest concentration of residuals occurring between approximately −25 and +10. The distribution is slightly asymmetric, exhibiting a marginally longer left tail. The Residuals vs. Predicted plot (Figure 14c) shows residuals dispersed around the zero axis across the entire prediction range. Most residuals lie between −50 and +50, with a few isolated outliers at both extremes; however, no systematic spatial pattern is evident. The Q-Q plot (Figure 14d) indicates that most points align closely with the theoretical reference line, confirming near-normal residual behavior. A single point deviates markedly below the line, representing a large negative residual that may correspond to an outlier or a localized instance of model underperformance.
The True vs. Predicted plot (Figure 15a) shows a clear linear relationship between predicted and actual values, with points generally following the y = x reference line. This pattern indicates a good overall model fit, though a few outliers are visible, particularly at higher values. The histogram of Residuals (Figure 15b) reveals that most residuals cluster near zero, suggesting small prediction errors. A few extreme values on either side indicate occasional outliers but do not significantly affect the overall distribution. In the Residuals vs. Predicted plot (Figure 15c), residuals are distributed relatively evenly around the zero axis, with isolated deviations observed above 10 and below −10. The overall pattern suggests model stability and no clear systematic bias. The Q-Q plot (Figure 15d) shows that the central portion of the residual distribution aligns closely with the theoretical reference line, indicating approximate normality. Larger deviations appear mainly at the distribution tails for both positive and negative values, suggesting the presence of a few outliers.
3.4. Test Set Performance Metrics for Each Model and Target
This section presents the performance metrics obtained by the models on an independent test set. The evaluated metrics include the mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (R²), calculated for the CatBoost, ExtraTrees, and XGBoost models across all four target variables. These results enable a direct comparison of the predictive performance of the individual algorithms and identify which model achieves the lowest prediction error for each target variable.
The lowest MAE values (Table 11) for the test set were achieved by XGBoost for the short-term heat evolved forecasts at 12 h and 72 h. ExtraTrees performed best for the long-term heat evolved predictions at 168 h and for the rate of heat evolution. The CatBoost results for most variables are comparable to those of ExtraTrees and XGBoost, indicating similar predictive performance. Overall, ExtraTrees achieved the lowest absolute errors for the rate of heat evolution and heat evolved (168 h), though these were only marginally lower than those of the other models, while XGBoost provided the most accurate predictions for the short-term heat evolved horizons (12 h and 72 h).
Table 12 presents the RMSE values obtained by the models. Based on the test data, XGBoost achieved the lowest errors for heat evolved predictions at 12 h and 72 h, while CatBoost performed best for the 168 h horizon, yielding a lower RMSE than both ExtraTrees and XGBoost. In contrast, ExtraTrees clearly outperformed the other models for the rate of heat evolution, achieving the lowest error in this category. Each model demonstrates specific strengths, with overall performance remaining comparable across short- and medium-term horizons. However, ExtraTrees shows superior precision when predicting the heat release rate.
A comparison of the MAPE values (Table 13) for the test data shows that ExtraTrees achieved the lowest percentage errors for heat evolved after 12 h and for the rate of heat evolution, indicating superior relative accuracy in these tasks. CatBoost slightly outperformed the other models in predicting heat evolved after 72 h. XGBoost obtained the best result only for heat evolved at 72 h, while in other cases its MAPE values were higher, particularly for the rate of heat evolution, suggesting greater sensitivity to individual outliers in this task. Overall, ExtraTrees demonstrates the most consistent relative accuracy across targets, whereas XGBoost shows more variable performance depending on the predicted feature.
A comparison of the R² values on the test set (Table 14) shows that all models achieved a very high degree of fit for the short-term heat evolved predictions at 12 h and 72 h, as well as for the rate of heat evolution, with R² values exceeding 0.94 in each case. This indicates excellent agreement between the model predictions and the observed data. XGBoost achieved the highest R² values for heat evolved at 12 h and 72 h, CatBoost performed best for the long-term heat evolved at 168 h, and ExtraTrees showed the best performance for the rate of heat evolution (R² = 0.989). The largest differences among models are observed for the long-term prediction horizon. The moderate reduction in R² observed for the 168 h horizon can be attributed to the cumulative nature of long-term heat evolution, which naturally increases variability and makes later-age values more challenging to model accurately compared with short-term measurements. Moreover, the limited dataset size contributes to higher variance in the estimated model fit, which is reflected in the lower R² relative to earlier prediction horizons [46]. Nevertheless, the obtained R² values of approximately 0.79–0.82 indicate that the models still maintain a reasonable level of explanatory power for long-term heat evolution. Overall, all algorithms exhibit very strong fits to the test data for short-term horizons and for heat release rate prediction, while CatBoost demonstrates the highest goodness of fit for long-term forecasts.
3.5. Sample-Wise Diagnostic Evaluation on the Test Set
This section presents an evaluation of prediction errors for individual samples in the independent test set, using residual plots and predicted-versus-actual value comparisons.
The True vs. Predicted plots (Figure 16) for the CatBoost model on the test set illustrate the accuracy of individual predictions for the four output parameters. For rate of heat evolution (Figure 16a), heat evolved 12 h (Figure 16b), and heat evolved 72 h (Figure 16c), the predictions align closely with the line of perfect fit (y = x), indicating high model precision relative to the actual values. For heat evolved 168 h (Figure 16d), a noticeable increase in dispersion is observed; points within the ~240–300 range systematically fall below the y = x line, indicating a tendency to underestimate values in the mid-to-upper range. Despite the extended prediction horizon, no sample shows a substantial deviation or a clearly erroneous prediction. Overall, the plots demonstrate a strong fit of the CatBoost model to the test data, reflecting stable predictive performance characterized by smaller errors for short-term forecasts and consistent accuracy even in the more challenging long-term cases.
The residual plots for the CatBoost model (Figure 17) illustrate the distribution of prediction errors for each test sample across the four output parameters. For the rate of heat evolution (Figure 17a), the residuals are distributed on both sides of zero, with several samples showing deviations of approximately 1.5–3 units. These values indicate noticeable variation in point-wise prediction accuracy, which is also reflected in the mean (−0.83) and standard deviation (1.6). For heat evolved (12 h) (Figure 17b), errors are evenly distributed (mean −0.54; SD 12.5), although two residuals deviate noticeably from zero. At the longer prediction horizons, heat evolved (72 h) (Figure 17c) and heat evolved (168 h) (Figure 17d), greater error variability is evident, with the mean and standard deviation reaching 2.63 and 11.9 at 72 h and 11.9 and 17.2 at 168 h, respectively. This pattern indicates wider residual scatter and the emergence of a few extreme error values. Overall, CatBoost demonstrates stable and accurate predictions for short-term forecasts, while error magnitudes increase moderately for longer horizons.
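The mean and standard deviation quoted for the sample-wise residuals can be read as a bias and a spread estimate. A minimal sketch of this summary follows, assuming residuals are defined as true minus predicted (so a positive mean points to underestimation) and using the sample standard deviation. Whether the study used the sample or population form is not stated, and the function name and values are illustrative.

```python
from statistics import mean, stdev

def residual_summary(y_true, y_pred):
    """Summarize sample-wise residuals as mean (bias) and SD (spread)."""
    # Assumed convention: residual = true - predicted.
    res = [t - p for t, p in zip(y_true, y_pred)]
    m = mean(res)
    if m > 0:
        tendency = "underestimation"
    elif m < 0:
        tendency = "overestimation"
    else:
        tendency = "none"
    return {"mean": m, "sd": stdev(res), "tendency": tendency}

# Illustrative values only, not test-set data from the study.
y_true = [250.0, 260.0, 270.0, 280.0]
y_pred = [240.0, 245.0, 262.0, 259.0]
summary = residual_summary(y_true, y_pred)
```

A mean that is large relative to the standard deviation suggests a systematic offset rather than random scatter, which is the distinction drawn for the long-term horizons in this section.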
Figure 18 compares the values predicted by the ExtraTrees model with the actual observations from the test set for the parameters heat evolved (12 h, 72 h, and 168 h) and rate of heat evolution. In each plot, the points represent individual test samples and are plotted against the line of perfect fit (y = x). Analysis of the point distributions relative to this line shows that for rate of heat evolution (Figure 18a), heat evolved 12 h (Figure 18b), and heat evolved 72 h (Figure 18c), most predictions closely match the actual values. In contrast, for heat evolved 168 h (Figure 18d), a broader scatter and a tendency toward increased forecast dispersion are observed. The distance of the points from the line of perfect fit indicates a gradual increase in individual prediction errors with longer time horizons. No dataset exhibits significant outliers, and the model maintains predictions within reasonable limits. However, the extended forecast horizon (168 h) is associated with larger error amplitudes and slight systematic deviations at higher reference values. Overall, the analysis confirms that the ExtraTrees model maintains high predictive accuracy for short-term horizons, while error dispersion becomes more pronounced in long-term forecasts.
The residual plots for the ExtraTrees model (Figure 19) illustrate the individual prediction errors for each sample in the test set, shown separately for heat evolved after 12, 72, and 168 h, and for the rate of heat evolution. For the rate of heat evolution (Figure 19a), most residuals are concentrated near zero, exhibiting moderate variability and only a few deviations exceeding one unit. The mean (−0.09) and standard deviation (1.15) confirm the excellent stability of the model in this task. For heat evolved after 12 h (Figure 19b), errors remain relatively small (mean 3.43, SD 11.0), although two more pronounced deviations are observed. As the time horizon increases to 72 h (Figure 19c) and 168 h (Figure 19d), the residual scatter widens, with mean shifts of 8.6 (SD 9.97) and 16.4 (SD 14.5), respectively, and more noticeable outliers appearing. The progressive increase in residual dispersion with longer prediction horizons reflects the typical decline in model stability over time. Overall, the analysis indicates that the ExtraTrees model exhibits high stability and accuracy for predicting the rate of heat evolution and short-term heat evolved values. However, for longer horizons, the systematic increase in errors suggests a need for further model optimization and investigation of potential outliers or atypical input cases.
The True vs. Predicted plots (Figure 20) for the XGBoost model show that the predictions generally follow the increasing trend of the observed values across all four output variables. While several points align reasonably well with the y = x reference line, other samples exhibit noticeable deviations, indicating variability in point-wise predictive accuracy. The model thus captures the global relationship between true and predicted values, while individual test samples display moderate dispersion around the line of perfect agreement. Predictions for rate of heat evolution (Figure 20a), heat evolved 12 h (Figure 20b), and heat evolved 72 h (Figure 20c) align particularly well with the reference line, while for heat evolved 168 h (Figure 20d), a slightly larger scatter is observed, though the overall distribution remains consistent. Minor underestimations and overestimations appear mainly at the extreme values. Overall, the analysis confirms that XGBoost achieves strong predictive accuracy and stability across all targets.
The residual plots for the XGBoost model (Figure 21) illustrate the distribution of prediction errors for each test sample across the four target variables. For the rate of heat evolution (Figure 21a), the residuals are concentrated near zero (mean −0.96, variance 1.72), indicating strong model stability. The results for heat evolved 12 h (Figure 21b) also show relatively low residual scatter (mean −0.78, SD 11.4), with only a single markedly positive outlier. At the 72 h horizon (Figure 21c; mean 1.72, SD 11.1), a slight underestimation trend is observed, though the overall error distribution remains moderate. For heat evolved 168 h (Figure 21d), a distinct positive bias appears (mean 16.3, SD 15.1), along with several large positive residuals, suggesting systematic underestimation at longer time horizons. Overall, the XGBoost model demonstrates high stability and low error levels for short-term predictions but exhibits an increasing susceptibility to higher errors and systematic positive bias in long-term forecasts, particularly for the most demanding test cases.