3.3.1. Analysis for Dataset #1
Figure 23 presents the predicted Factor of Safety (FOS) obtained from four machine learning models—GA-MLP, RF, SVR, and GPR—plotted against the measured FOS values calculated using the Spencer Limit Equilibrium Method (LEM). The x-axis (“Measured FOS”) represents the reference stability values computed through the deterministic Spencer method, which serves as the benchmark for model evaluation. The y-axis (“Predicted FOS”) indicates the outputs from each trained machine learning model when applied to the validation dataset.
In all subplots, the dashed 1:1 line represents perfect agreement between predicted and measured values. Points located closer to this line indicate higher predictive accuracy, as the predicted FOS closely matches the Spencer-calculated reference. Conversely, deviations from the line represent prediction errors, with the magnitude of deviation reflecting the absolute residual. The performance indicators R², RMSE, MSE, and MAE reported in each subplot quantitatively summarize model performance, where higher R² values and lower error metrics correspond to better predictive fidelity.
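The four indicators can be computed directly from paired measured and predicted FOS values. The following is a minimal sketch using the standard definitions; the function name `regression_metrics` and the sample values are illustrative assumptions and are not taken from the study's datasets.

```python
import math

def regression_metrics(measured, predicted):
    """Return R2, RMSE, MSE and MAE for predicted vs. measured FOS."""
    n = len(measured)
    residuals = [p - m for m, p in zip(measured, predicted)]
    mse = sum(r ** 2 for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_measured = sum(measured) / n
    ss_res = sum(r ** 2 for r in residuals)                   # unexplained variation
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)  # total variation
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "MAE": mae,
    }

# Illustrative values only (placeholders, not results from the paper).
measured_fos = [1.0, 1.2, 1.5, 2.0]
predicted_fos = [1.1, 1.2, 1.4, 2.0]
metrics = regression_metrics(measured_fos, predicted_fos)
```

Points lying exactly on the 1:1 line give zero residuals, so R² equals 1 and all three error metrics equal 0.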
Comparatively, the GPR model exhibits the strongest agreement with the Spencer results (R² = 0.988, RMSE = 0.06), as shown in Figure 24, with nearly all points tightly aligned with the reference line. The GA-MLP model also achieves strong performance (R² = 0.940) but displays a few moderate deviations at higher FOS values. SVR and RF show lower overall agreement, with R² values of 0.862 and 0.814, respectively, and a wider scatter of points around the reference line, indicating less consistent generalization.
From an engineering perspective, this plot highlights not only the general predictive capability of each model but also its reliability in reproducing deterministic stability assessments. Since the Spencer method is widely regarded as a robust LEM approach for slope stability analysis, close alignment between predicted and measured FOS strengthens the confidence in a model’s applicability for real-world geotechnical decision-making.
Another critical assessment is shown in Figure 25, which displays the residual-based outlier detection results for the four machine learning models evaluated in this study. Outliers are defined as predictions whose absolute residual exceeds 0.20 in Factor of Safety (FOS), a practical engineering tolerance threshold. These points are annotated with their dataset indices and their measured and predicted FOS values, allowing direct traceability to specific slope cases.
The distribution and frequency of outliers provide important insights into each model’s predictive robustness. The GA-MLP model exhibits several under-predictions and over-predictions, suggesting that while it captures nonlinear relationships effectively, it is more sensitive to cases underrepresented in the training set. RF shows fewer outliers but still displays notable deviations in specific instances, likely due to localized overfitting in certain decision tree partitions. SVR, while generally accurate, produces a small cluster of over-predictions in the higher FOS range, indicating potential difficulty in extrapolating beyond the most represented stability conditions. GPR demonstrates the most consistent performance, with no detected outliers at the selected threshold, reflecting its ability to capture both the central trend and the local variability in the data.
From a geotechnical risk perspective, under-predictions are conservative but may penalize design optimization, whereas over-predictions present a more critical concern as they may lead to unsafe stability assessments. The residual analysis suggests that, for the dataset and conditions evaluated, GPR offers the most reliable balance between accuracy and error dispersion, followed closely by RF and SVR, with GA-MLP showing greater variability in extreme cases. This reinforces the importance of combining global performance metrics (e.g., R², RMSE) with residual-based outlier inspection to ensure that model selection accounts not only for average accuracy but also for the frequency and magnitude of potentially critical prediction errors.
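The residual-based screening described above can be sketched as follows. The 0.20 threshold comes from the text; the function name `find_outliers`, the output layout, and the sample values are assumptions made for illustration only.

```python
def find_outliers(measured, predicted, threshold=0.20):
    """Flag predictions whose absolute residual exceeds the threshold."""
    outliers = []
    for index, (m, p) in enumerate(zip(measured, predicted)):
        residual = p - m  # positive: model predicts a higher FOS than the reference
        if abs(residual) > threshold:
            direction = "over-prediction" if residual > 0 else "under-prediction"
            outliers.append({
                "index": index,       # dataset index, for traceability
                "measured": m,
                "predicted": p,
                "residual": residual,
                "direction": direction,
            })
    return outliers

# Illustrative values only (placeholders, not results from the paper).
measured_fos = [1.00, 1.30, 1.80, 2.10]
predicted_fos = [1.05, 1.60, 1.50, 2.15]
flagged = find_outliers(measured_fos, predicted_fos)
```

Keeping the sign of the residual makes it possible to separate the conservative under-predictions from the over-predictions, which are the more critical case for safety.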
Table 7 summarizes the comparative performance of the evaluated models using multiple statistical indicators, with scores assigned to each metric and summed to produce an overall ranking. The ranking framework allows for an integrated assessment that balances accuracy and error measures, providing a clear picture of overall predictive capability.
Figure 26 illustrates the ranking scores assigned to each ML model.
The results indicate that the Gaussian Process Regressor (GPR) consistently outperformed the other models across all evaluation criteria, achieving the highest overall score and demonstrating superior predictive reliability. The GA-MLP model followed closely, also showing strong performance. In contrast, the SVR model achieved moderate results, and the RF model ranked lowest due to comparatively weaker accuracy and higher errors. This scoring approach offers a systematic basis for selecting the most suitable method for predicting the FOS.
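One possible way to implement such a scoring scheme is a rank-based sum, sketched below. The exact scoring rule used for Table 7 is not specified here, so the rule in this sketch (the best model on a metric receives N points and the worst receives 1, with ties not handled) is an assumption, as are the function name `rank_scores`, the model labels, and the metric values.

```python
def rank_scores(metrics_by_model, higher_is_better=("R2",)):
    """Sum per-metric rank scores: best model on a metric gets N points, worst gets 1.

    Assumed scoring rule for illustration; ties are not handled.
    """
    models = list(metrics_by_model)
    metric_names = list(next(iter(metrics_by_model.values())))
    totals = {model: 0 for model in models}
    for metric in metric_names:
        # Order models from worst to best on this metric.
        worst_first = sorted(
            models,
            key=lambda model: metrics_by_model[model][metric],
            reverse=metric not in higher_is_better,  # error metrics: largest is worst
        )
        for points, model in enumerate(worst_first, start=1):
            totals[model] += points
    return totals

# Illustrative values only (placeholders, not the values in Table 7).
example_metrics = {
    "model_a": {"R2": 0.95, "RMSE": 0.10, "MAE": 0.08},
    "model_b": {"R2": 0.90, "RMSE": 0.15, "MAE": 0.12},
    "model_c": {"R2": 0.85, "RMSE": 0.12, "MAE": 0.10},
}
totals = rank_scores(example_metrics)
ranking = sorted(totals, key=totals.get, reverse=True)
```

In this example `model_b` has a higher R² than `model_c` but larger errors, so it ends up with the lower total, which illustrates how summing scores balances accuracy against error measures.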
In this study, machine learning is positioned as a complementary tool to established limit-equilibrium and numerical methods, not a replacement. First, ML enables rapid, scalable screening of many slopes using a small set of readily available inputs, which is critical in open pit operations where prioritization drives safety and productivity. Second, probabilistic models such as GPR provide confidence intervals around FOS predictions, directly supporting risk communication and decision thresholds—an attribute not natively offered by deterministic LEM outputs. Third, trained ML models act as surrogates to accelerate parametric sweeps and “what-if” analyses, reducing turnaround time before detailed numerical back-analysis. Fourth, ML models can be updated as monitoring data arrive (e.g., lab updates, in situ measurements), improving adaptability over static design assumptions. In our results, SVR delivered the highest average accuracy on the mining dataset, making it well-suited for fast screening; GPR consistently provided the most stable predictions and uncertainty bounds on the highway dataset, making it valuable for conservative, safety-critical decisions. Together, these capabilities highlight the practical significance of ML as an efficient front-end to guide where—and how—numerical modelling effort should be concentrated.
3.3.2. Analysis for Dataset #2
For the second dataset, the scatter plots in Figure 27 present the predictive performance of the evaluated models across the four test groups (G1–G4). While the overall analysis follows the same predicted-versus-measured comparison as in the first dataset, some variations in relative performance are evident. Notably, the Support Vector Regressor (SVR) shows improved alignment with measured values across most groups, surpassing the Gaussian Process Regressor (GPR) in predictive accuracy under this data configuration. The GA-MLP model maintains a strong and consistent performance, remaining competitive across all groups. In contrast, the Random Forest model exhibits greater variability and larger deviations in several cases, indicating less stable predictions for this dataset. These results suggest that model effectiveness can be dataset-dependent, with SVR emerging as the top performer here, followed closely by GA-MLP, while GPR and RF occupy the subsequent ranks.
Figure 28 presents the predicted versus measured FOS values for the second dataset, highlighting the detected outliers based on residual thresholds. Across all models and groups, most predictions closely align with the 1:1 reference line, indicating generally strong predictive performance. However, certain groups exhibit a higher concentration of outliers—particularly in GA-MLP for G2 and G3—suggesting occasional deviations under specific input conditions. In contrast, SVR and GPR show fewer extreme residuals, reflecting more stable behavior across the evaluated subsets.
SVR and GPR are methodologically distinct but share enough common ground to make a direct comparison meaningful. Both learn from data to predict continuous, real-valued outcomes, and both capture nonlinear dependencies by implicitly mapping the inputs through kernels. SVR is a deterministic approach that optimizes a margin-based cost function and returns point predictions only. GPR, by contrast, is a probabilistic Bayesian framework that returns point predictions together with quantified uncertainty, obtained from the posterior variance. In this study, the difference in performance between the two datasets may stem from these methodological strengths. The margin-based formulation of SVR may have suited the second dataset better if its structure favors more linear extrapolation outside the training domain and less smooth local fits. The Gaussian process, whose smoothness is governed by the kernel, its hyperparameters, and the resulting covariance structure, tends to smooth more strongly, which may explain its better performance on the first dataset and possible underfitting on the second. The two methods are therefore complementary, and their relative performance is data-dependent.
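To make the GPR side of this comparison concrete, the sketch below computes the posterior mean and standard deviation of a Gaussian process for a single scalar input. Everything specific in it is an assumption for illustration: the squared-exponential (RBF) kernel with fixed hyperparameters, the one-dimensional input, the helper names, and the sample values. It is not the configuration used in this study.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel for scalar inputs (assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - partial) / aug[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise=1e-6, length_scale=1.0):
    """Posterior mean and standard deviation at x_new (targets are mean-centred)."""
    n = len(x_train)
    y_mean = sum(y_train) / n
    centred = [y - y_mean for y in y_train]
    K = [[rbf(x_train[i], x_train[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_new, length_scale) for x in x_train]
    alpha = solve(K, centred)   # K^-1 (y - mean)
    v = solve(K, k_star)        # K^-1 k_star
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new, length_scale) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))

# Illustrative inputs only: a hypothetical scalar feature and FOS targets.
x_train = [0.0, 1.0, 2.0]
y_train = [1.1, 1.4, 1.9]
mean_near, std_near = gp_predict(x_train, y_train, 1.0)   # at a training point
mean_far, std_far = gp_predict(x_train, y_train, 10.0)    # far from the data
interval_far = (mean_far - 1.96 * std_far, mean_far + 1.96 * std_far)
```

Near the training data the posterior standard deviation is small, while far from it the prediction reverts to the prior and the interval widens. A margin-based SVR would return only the point estimate in both cases.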
The boxplots in Figure 29 display the variability of R² and RMSE across all groups for each model, where the red dot represents the mean. The SVR model has slightly better average values of R² and RMSE. However, the boxplots also show that SVR has a much wider range than GPR. This means that while SVR may achieve higher performance in certain groups, its performance is less stable than that of GPR.
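The trade-off between average performance and stability can be quantified by summarizing each metric across the groups. The following is a minimal sketch; the function name `summarize`, the model labels, and the R² values are placeholders chosen to illustrate the pattern and are not the study's results.

```python
from statistics import mean

def summarize(values):
    """Mean and range (max - min) of one metric across the test groups."""
    return {"mean": mean(values), "range": max(values) - min(values)}

# Illustrative R2 values for groups G1-G4 (placeholders, not results from the paper).
r2_by_model = {
    "model_x": [0.99, 0.80, 0.97, 0.96],  # higher mean, wide spread
    "model_y": [0.92, 0.91, 0.93, 0.92],  # slightly lower mean, narrow spread
}
summary = {name: summarize(values) for name, values in r2_by_model.items()}
```

Here `model_x` has the better mean but a much larger range, which is the same pattern the boxplots reveal when a model performs well on average but inconsistently across groups.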
To enable a fair and consistent comparison of model performance across all groups, the statistical indicators for each model (R², RMSE, MSE, and MAE) were computed for each of the four groups (G1–G4) and then averaged. This averaging process yields a single representative value for each metric per model, reflecting its overall predictive capability over the entire dataset division rather than relying on a single group’s performance. The resulting average metrics for each model are summarized in Table 8 and shown in Figure 30, providing a consolidated basis for ranking and evaluating the models.
Table 9 and Figure 31 present a comparative evaluation of the machine learning models by aggregating their average performance across all groups of the second dataset. The scoring framework assigns values to each performance metric and sums them to produce an overall ranking, allowing for a balanced assessment between accuracy and error measures. The results reveal that the Support Vector Regressor (SVR) achieved the highest total score, indicating superior predictive performance in this dataset, followed closely by the Gaussian Process Regressor (GPR). The GA-MLP model demonstrated moderate performance, while the Random Forest (RF) model ranked lowest due to comparatively lower accuracy and higher error values. This combined tabular and graphical representation provides a clear visual and numerical basis for selecting the most effective predictive method for FOS estimation in this scenario.