4.1. Experiment I: Baseline
In Experiment I, all features in the dataset except life expectancy were used to assess the multi-model HI prediction performance using ML regressors. After the pre-processing step, individual linear, tree-based, kernel-based, and classical ensemble regressors were evaluated; all models were trained and tested on an 80:20 split and assessed with the same performance metrics shown in
Table 4. The models tested in this experiment were linear, Ridge, Lasso, SVR, DT, RF, GB, AdaBoost, XGBoost, LightGBM, and CatBoost regressors. In addition to individual model results, hybrid ensemble predictions were generated by averaging outputs from the highest-performing individual models (e.g., RF + GB, LightGBM + CatBoost).
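As a minimal sketch of this baseline protocol, the snippet below trains two of the listed regressors on a synthetic stand-in dataset (the transformer data itself is not reproduced here), averages their predictions into a hybrid, and reports the same four metrics. The dataset, model choices, and printed values are illustrative assumptions, not the paper's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transformer dataset (14 features -> HI).
X, y = make_regression(n_samples=400, n_features=14, noise=10.0, random_state=42)

# 80:20 train/test split, as used in Experiment I.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "RF": RandomForestRegressor(random_state=42),
    "GB": GradientBoostingRegressor(random_state=42),
}
preds = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds[name] = model.predict(X_te)

# Hybrid ensemble: simple average of the two models' test predictions.
preds["RF+GB"] = 0.5 * (preds["RF"] + preds["GB"])

for name, y_hat in preds.items():
    mae = mean_absolute_error(y_te, y_hat)
    mse = mean_squared_error(y_te, y_hat)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_te, y_hat)
    print(f"{name}: MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```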
As demonstrated by
Table 4, foundational regressors offered the lowest predictive ability. Lasso regression scored the highest error metrics among the foundational models, with an MAE of 10.82, an RMSE of 13.77, and an R2 value of 0.552. Linear and Ridge regression followed as the next worst, with MAE scores of 9.63 and 9.62, MSE values of 172.76 and 172.17, RMSE values of 13.14 and 13.13, and R2 scores of 0.492 and 0.509, respectively.
Nonlinear kernel and tree-based regressors formed the second category of ML algorithms in Experiment I. The kernel-based SVR exhibited the lowest predictive accuracy in this group, with an MAE of 9.82, an RMSE of 14.07, an MSE of 197.86, and a low R2 of 0.43, indicating poor generalization performance. The DT performed slightly better in terms of MAE (7.64) but obtained an MSE of 197.86, an RMSE of 14.07, and a low R2 of 0.421. In contrast, RF improved substantially, achieving an MAE of 6.42, an MSE of 82.2, an RMSE of 9.06, and a strong R2 of 0.749, validating that the bagging ensemble underlying RF can effectively capture nonlinearity in the data.
Next, the performance of boosting ensembles was explored using GB, AdaBoost, XGBoost, LightGBM, and CatBoost, where GB recorded the best performance among all individual models. Low error scores of 6.29 for MAE and 53.29 for MSE, an RMSE of 7.29, and an R2 of 0.759 evidenced GB's strength in error correction and handling nonlinear interactions. Conversely, AdaBoost performed poorly, recording the highest MAE of 11.70 and a low R2 of 0.450. XGBoost, LightGBM, and CatBoost exhibited strong performances and low error scores, with XGBoost achieving the highest R2 of 0.766 and a strong MAE of 6.917; however, its MSE and RMSE of 180.86 and 10.42, respectively, indicate possible sensitivity to outliers. LightGBM exhibited a balanced performance, with an MAE of 6.25, an MSE of 80.85, an RMSE of 8.99, and an R2 of 0.749. Finally, CatBoost, which inherently supports categorical feature handling, recorded an MSE of 94.03, an MAE of 6.39, an RMSE of 9.74, and an R2 of 0.728.
Following the evaluation of the individual models' performance, the hybridization phase took place, in which pairwise combinations were averaged to form hybrid ensembles and the two best-performing hybrids were selected for further analysis. Accordingly, the (LightGBM + CatBoost) and (RF + GB) ensembles achieved the highest performance: the (LightGBM + CatBoost) ensemble scored an MAE of 6.09, an MSE of 83.01, an RMSE of 9.11, and an R2 of 0.757, whereas the (RF + GB) ensemble achieved an MAE of 6.2, an MSE of 85.81, an RMSE of 9.26, and an R2 of 0.745. These findings confirm that the (LightGBM + CatBoost) ensemble is the best-performing model among the tested combinations.
The parity plots presented in
Figure 3 provide a visual representation of actual and predicted HI, confirming the findings of
Table 4 and offering further insights into each model’s performance, especially regarding overfitting. For foundational ML regressors (Linear, Ridge, and Lasso), as shown in
Figure 3a–c, the predicted values scatter widely around the 1:1 reference line, consistent with the high error scores reported in
Table 4.
The nonlinear kernel regressor (SVR) likewise exhibits poor generalization, with test predictions widely dispersed around the parity line. For the tree-based regressors, the DT demonstrates strong overfitting, with perfect alignment on the training data but scattered test predictions, while the RF shows improved generalization, with test points more closely aligned along the 1:1 line.
Regarding the boosting-based techniques,
Figure 3g–k, the parity plots illustrate a better alignment with the 1:1 line. GB exhibits the closest clustering of predictions around the parity line among all individual models, in line with its superior numerical performance. However, AdaBoost demonstrates a broader spread, confirming its poor generalization. XGBoost, LightGBM, and CatBoost display competitive predictive performances, with their points nearer to the diagonal line compared to other models, with CatBoost showing less alignment.
Finally, the hybrid ensembled approach’s parity plot shown in
Figure 3l,m, based on a combination of individual learners, demonstrates the strongest visual agreement with the parity trend line. In the (LightGBM + CatBoost) ensemble, both training and test predictions are concentrated close to the diagonal, leading to the lowest overall errors. Likewise, the RF + GB ensemble attained improved alignment relative to its constituent models.
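A parity plot of the kind discussed above can be produced with a few lines of matplotlib: actual values on one axis, predictions on the other, and a dashed 1:1 reference line. The predictions below are synthetic placeholders, so the scatter pattern is illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical actual/predicted HI values for illustration.
rng = np.random.default_rng(42)
y_true = rng.uniform(0, 100, 80)           # assumed HI scale 0-100
y_pred = y_true + rng.normal(0, 8, 80)     # predictions with some scatter

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(y_true, y_pred, s=15, alpha=0.6, label="test predictions")
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, "k--", label="1:1 line")  # parity reference line
ax.set_xlabel("Actual HI")
ax.set_ylabel("Predicted HI")
ax.legend()
fig.savefig("parity_plot.png", dpi=150)
```

Points hugging the dashed line on both train and test sets indicate good generalization; tight training alignment with scattered test points signals overfitting, as seen for the DT.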
These findings further confirm not only that the (LightGBM + CatBoost) ensemble delivers the best performance among all tested models in terms of accuracy and error, but also that the averaged ensemble exhibits no signs of overfitting, indicating stronger generalization capabilities. From an architectural point of view, the two models are also complementary. LightGBM combines two features that increase its accuracy and speed: gradient-based one-sided sampling (GOSS) and exclusive feature bundling (EFB). GOSS prioritizes samples with larger gradients, while EFB reduces feature dimensionality by merging mutually exclusive features, thereby accelerating the search for optimal split points and improving computational efficiency [
57]. CatBoost, on the other hand, is designed to mitigate overfitting by applying regularization and early stopping through ordered target statistics and ordered boosting [58,59]. In contrast to LightGBM, CatBoost's ordered boosting and symmetric tree architecture provide unbiased learning, making it well suited to categorical and noisy datasets [60]. By incorporating these structurally dissimilar learners via model averaging, the ensemble combination achieves improved robustness and generalization across the HI spectrum.
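Model averaging of structurally dissimilar learners can also be expressed with scikit-learn's VotingRegressor, which averages the base learners' predictions. The sketch below uses RF and GB as stand-ins for LightGBM and CatBoost (an assumption made so the example needs only scikit-learn) on synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=14, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# VotingRegressor fits each base learner and averages their predictions
# (unweighted), mirroring the paper's model-averaging hybridization step.
hybrid = VotingRegressor([
    ("rf", RandomForestRegressor(random_state=42)),
    ("gb", GradientBoostingRegressor(random_state=42)),
])
hybrid.fit(X_tr, y_tr)

score = r2_score(y_te, hybrid.predict(X_te))
print(f"hybrid R2: {score:.3f}")
```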
4.2. Experiment II: Tuned Models
Experiment II is based on optimizing the performance of the top-performing algorithms of Experiment I. For this reason, hyperparameter tuning is crucial, since the predictive capability of each base model depends strongly on the choice of its internal configuration. In addition, tuning reduces the chances of over- and under-fitting, thereby ensuring that each model is systematically optimized for fair comparison. At this stage, hyperparameter tuning is carried out for the top four performers, namely, RF, GB, LightGBM, and CatBoost.
To ensure fairness and reproducibility, all constituent models were optimized using an explicit hyperparameter tuning strategy prior to hybridization. Each model was tuned independently using a structured grid search, in which a predetermined set of hyperparameter ranges was systematically evaluated under five-fold cross-validation: the model was trained on four folds and validated on the remaining fold, with the process repeated over all folds. The mean validation score was used to select the best-performing configuration. This procedure was applied separately to the RF, GB, LightGBM, and CatBoost regressors, ensuring that each base learner operated under its optimal settings before its predictions were combined in the ensemble. The key hyperparameters tuned for each model, such as tree depth, learning rate, and number of estimators, can be found in
Table 5. In addition, a fixed random seed (random_state = 42) was used for all models throughout tuning and training to ensure reproducibility of the optimization path and final results. The tuned models were then retained and integrated into hybrid ensemble combinations.
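A hedged sketch of this tuning protocol follows, using scikit-learn's GridSearchCV with five-fold cross-validation and the fixed seed. The grid covers the key aspects named in the text (tree depth, learning rate, number of estimators), but the ranges are illustrative, not the actual values of Table 5, and GB stands in for the four tuned models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=14, noise=10.0, random_state=42)

# Illustrative grid; the paper's actual ranges (Table 5) may differ.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),  # fixed seed, as in the paper
    param_grid,
    cv=5,                                # five-fold cross-validation
    scoring="neg_mean_absolute_error",   # select on mean validation MAE
)
search.fit(X, y)
print(search.best_params_)  # best configuration by mean validation score
```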
As in Experiment I, further ensembles were formed from the tuned models in Experiment II, whose performance is demonstrated in the parity plots of Figure 4 and the error metrics in
Table 6. Two combinations were formed: the first combined the tuned LightGBM and CatBoost models, while the second was built by combining the average outputs of the tuned RF and GB models.
As illustrated in
Table 6 and shown in
Figure 4, the tuned (LightGBM + CatBoost) ensemble achieved the best overall performance, with a cross-validated MAE of 5.70, an MSE of 77.31, an RMSE of 8.793, and an R2 score of 0.774. In comparison, the tuned (RF + GB) ensemble demonstrated slightly lower performance, with an MAE of 6.17, an MSE of 80.2, an RMSE of 8.95, and an R2 of 0.766. In Experiment I, XGBoost matched this R2 of 0.766, but only at the cost of a high RMSE. These results confirm that combining tuned models yields more accurate predictions, better generalizability, and improved fitting.
4.3. Experiment III: Tuned and Interpretable Models
The aim of Experiment III was to test whether feature reduction could preserve model performance while improving interpretability. For this reason, Experiment III introduced SHAP-based explainability and feature selection, in contrast to Experiments I and II. Instead of using the full set of features, SHAP values were computed to identify the most influential features for HI prediction, thus offering a more interpretable predictive tool. Based on the SHAP analysis and its insights, low-contributing features were omitted from the model, resulting in a reduced feature space. The best-tuned ensemble model from Experiment II, namely the (LightGBM + CatBoost) combination, was reused in this experiment but trained on the refined feature set.
Figure 5 presents the SHAP analysis, and
Figure 6 shows the feature ranking summary plot. These two plots were used to rank the input variables by their average absolute impact on the model output across all samples, and to visualize their directional contribution. In this work, a combination of two feature selection tools was employed: SHAP, acting as an embedded approach that evaluates feature relevance based on contribution to the trained model, and the Pearson correlation, used as a complementary filter technique to assess the statistical relationship between each variable and the target. This combination ensures that feature pruning was not based solely on model-based importance but also on statistical correlation, providing a more rigorous selection mechanism.
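The two-tool selection logic can be sketched as follows. Since computing true SHAP values requires the shap package, this example uses impurity-based feature importances as a lightweight stand-in for mean |SHAP| scores, combined with an absolute Pearson correlation filter; the feature names and thresholds are illustrative assumptions, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=5.0, random_state=42)
names = [f"f{i}" for i in range(X.shape[1])]  # placeholder feature names

# Model-based relevance: impurity importances stand in here for the
# mean |SHAP| values the paper computes with a TreeExplainer.
model = GradientBoostingRegressor(random_state=42).fit(X, y)
model_score = model.feature_importances_

# Filter relevance: absolute Pearson correlation of each feature with HI.
pearson = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                    for j in range(X.shape[1])])

# Keep a feature if either source of evidence supports it
# (thresholds are illustrative).
keep = (model_score > 0.01) | (pearson > 0.1)
print([n for n, k in zip(names, keep) if k])
```

Combining an embedded score with a filter score in this way prevents a feature that merely looks unimportant to one tool from being pruned on that evidence alone.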
As shown in
Figure 6, DBDS was identified as the dominant variable, with a mean SHAP value of 4.17, which was also confirmed by the correlational heat map of Figure 2, where DBDS exhibited a positive relationship with HI (r = 0.47). A group of moderately important features followed, including hydrogen (SHAP 2.08, r = 0.39), acetylene (1.07, r = 0.42), methane (0.85, r = 0.36), ethylene (0.60, r = 0.27), CO2 (0.53, r = 0.24), power factor (0.49, r = 0.23), interfacial value (0.41, r = −0.28), and water content (0.36, r = −0.10).
On the contrary, the lowest SHAP scores, as shown in
Figure 6, were associated with nitrogen (0.25), dielectric rigidity (0.24), oxygen (0.19), and ethane (0.19). Correspondingly, these features showed weak or inconsistent statistical relationships with the target variable, which is shown in their corresponding Pearson heatmap in
Figure 2. Oxygen had a near-zero correlation with HI (r = 0.01), dielectric rigidity showed a weak negative association (r = −0.12), and nitrogen a very minor one (r = 0.16). In addition to this model-based and statistical evidence, domain knowledge from DGA supports the limited diagnostic value of ethane, oxygen, and nitrogen for HI under real operational conditions, further reinforcing their exclusion from the final feature set [61,62].
For a valid comparative assessment of the model's performance before and after feature pruning, Experiments I and II were conducted using the full set of 14 input features, whereas Experiment III applied a reduced feature set obtained by removing the four least-contributing variables. The pruning was performed in a manual, iterative manner: after each removal, the model was retrained and the corresponding error metrics were recorded, and the process was halted once additional feature elimination produced a noticeable decline in performance. This elimination approach was guided by both SHAP and statistical correlation evidence. The results of Experiment III are summarized in Table 7, confirming that removing the least-contributing features did not compromise overall predictive accuracy; the reduced model achieved comparable performance with a less complex configuration.
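The manual, iterative pruning procedure can be sketched as a backward-elimination loop: features are dropped from least to most important, the model is re-evaluated after each removal, and pruning halts once the cross-validated MAE degrades noticeably. The data, stand-in model, and 5% tolerance below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, n_informative=5,
                       noise=5.0, random_state=42)

def cv_mae(X_sub):
    # Mean cross-validated MAE for the current feature subset.
    scores = cross_val_score(GradientBoostingRegressor(random_state=42),
                             X_sub, y, cv=5,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()

active = list(range(X.shape[1]))
baseline = cv_mae(X[:, active])
tolerance = 0.05  # halt once MAE worsens by more than 5% (illustrative)

# Importance ranking stands in for the SHAP/correlation evidence here.
model = GradientBoostingRegressor(random_state=42).fit(X, y)
for j in np.argsort(model.feature_importances_):  # least important first
    trial = [k for k in active if k != j]
    if cv_mae(X[:, trial]) > baseline * (1 + tolerance):
        break  # noticeable decline in performance -> stop pruning
    active = trial  # removal accepted; retrain-and-record continues

print(f"kept {len(active)} of {X.shape[1]} features")
```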
Notably, the tuned (LightGBM + CatBoost) ensemble, when retrained on the SHAP-guided reduced feature set, still maintained strong predictive performance, as indicated by
Table 7. With a cross-validated MAE of 5.79, an MSE of 79.36, an RMSE of 8.9, and an R2 score of 0.767, it remains highly competitive with its counterpart from Experiment II. The results of Experiment III thus indicate that eliminating low-impact features did not significantly compromise accuracy, while improving model interpretability and computational efficiency.
4.4. Cross-Validation Analysis of Experiment III
In this section, the effectiveness and robustness of the selected tuned and reduced hybrid ensemble (LightGBM + CatBoost) are investigated via cross-validation and different numbers of folds are examined. In this way, it is possible to identify whether the ensemble’s predictive behavior is consistent and less subject to changes in data partitioning, thus confirming the model’s robustness and generalizability.
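A minimal sketch of this robustness check follows, assuming a synthetic dataset and a single GB regressor in place of the tuned hybrid: the same model is scored under 3-, 5-, and 10-fold cross-validation, and the mean and spread of the fold-level MAE are compared across configurations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
model = GradientBoostingRegressor(random_state=42)

results = {}
for k in (3, 5, 10):  # the fold counts examined in the text
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_mean_absolute_error")
    # Per-configuration mean MAE and fold-to-fold standard deviation.
    results[k] = (scores.mean(), scores.std())
    print(f"{k}-fold: MAE mean={scores.mean():.3f} std={scores.std():.3f}")
```

Stable means and small standard deviations across the three configurations are the signal of robustness this section looks for.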
Figure 7 shows the model's performance with 3, 5, and 10 folds, respectively, which are commonly used values that balance the risks of high bias and high variance [
63]. At a k value of 3, the tuned hybrid ensemble (LightGBM + CatBoost) showed consistent performance, with an average MAE of 6.229, an RMSE of 9.313, and an R2 score of 0.717, as observed in Figure 7a. With five folds, the hybrid ensemble performed slightly better on average, as depicted in Figure 7b: the average MAE was 5.976, with an RMSE of 9.113 and an R2 of 0.732. Finally, at 10 folds, Figure 7c shows that the tuned and reduced hybrid ensemble achieves its lowest mean MAE of 5.795, with an RMSE of 9.034 and an average R2 of 0.716.
Figure 8 illustrates box-and-whisker plots for the MAE, RMSE, and R2 of multiple seed examinations across the 3-, 5-, and 10-fold configurations. Each jittered dot represents the performance of an individual fold, whereas the box captures the median and interquartile range of each CV configuration. Across all metrics, the fold-level distributions remain compact, showing no irregular spikes or unstable behavior in model performance, even with few folds. For MAE, the standard deviations were 0.356, 0.922, and 1.200 for the 3-, 5-, and 10-fold CV, respectively; RMSE showed moderate variability, with standard deviations of 0.355, 1.201, and 1.650; and R2 exhibited tighter dispersion, with standard deviations of 0.033, 0.067, and 0.115. These values demonstrate that, even as the number of folds increases and each validation fold becomes smaller, the variability across folds remains controlled and well within acceptable limits. In addition, the central tendencies of the 10-fold configuration remain aligned with those of the 3- and 5-fold schemes. These findings confirm that the tuned and reduced (LightGBM + CatBoost) ensemble is not sensitive to the choice of cross-validation scheme, even with a small dataset, providing strong statistical evidence of the robustness and repeatability of the proposed hybrid framework.