Application of Machine Learning Methods for Predicting the Factor of Safety in Rock Slopes

Trinidad, Miguel; Momayez, Moe

doi:10.3390/geotechnics6010015

by Miguel Trinidad * and Moe Momayez

School of Mining Engineering and Mineral Resources, University of Arizona, 1235 E James E Rogers Way, Tucson, AZ 85721, USA

* Author to whom correspondence should be addressed.

Geotechnics 2026, 6(1), 15; https://doi.org/10.3390/geotechnics6010015

Submission received: 21 October 2025 / Revised: 10 January 2026 / Accepted: 11 January 2026 / Published: 3 February 2026

Abstract

The Factor of Safety (FOS) is a key index of the stability of a rock slope in mining and civil engineering. In this paper, we evaluate and compare four machine learning models, Gaussian Process Regressor (GPR), Support Vector Regressor (SVR), Random Forest (RF), and a hybrid genetic algorithm–multi-layer perceptron (GA-MLP), using two separate real-world datasets. The first dataset comes from a previously published study of highway rock-cut excavations in China, and the second from a mining site in Peru. Both use five geotechnical properties as inputs: slope height, slope angle, unit weight, cohesion, and friction angle. Each dataset was divided into training, validation, and testing subsets. The testing subset consists of data unseen by the models and is used to assess their performance in an unbiased manner. The results show that SVR had the highest prediction accuracy for the mining dataset, followed by GPR, whereas GPR performed best among all the models for the highway excavation dataset. The boxplots show that SVR, despite having the highest predictive accuracy, has a larger prediction variance than GPR for the mining dataset.

Keywords:

factor of safety (FOS); slope stability; machine learning models; geotechnical parameters; limit equilibrium method; model performance

1. Introduction

In open pit mining and civil engineering projects that involve slope excavations in rock masses, predicting and maintaining slope stability are essential for safe and cost-effective operations. Slope failures not only cause significant economic losses and environmental damage but also pose a direct threat to the lives of workers [1,2]. Accurately determining the factor of safety (FOS) for slope designs is crucial to achieving a balance between efficient excavation and acceptable risk thresholds. Traditionally, limit equilibrium analyses such as Spencer’s method are used to determine FOS under predefined conditions [3]. These deterministic methods are based on static equilibrium equations and yield accurate results in assessing the stability of a slope for known failure surfaces [4]. However, they are sensitive to geotechnical parameter uncertainties and are limited by a need to specify the slip surface geometry a priori, which may not always accurately reflect real-world failure mechanisms under complex, heterogeneous conditions [5,6].

In recent years, there has been a growing interest in probabilistic and data-driven methods to address some of these limitations. Machine learning (ML) approaches, for example, offer the potential to model the nonlinear relationships between geotechnical parameters and slope stability outcomes without making explicit a priori assumptions about the failure mechanism. ML models can be trained directly from empirical datasets to capture the underlying dependencies between input parameters and output FOS values, while accounting for complex interactions and nonlinearities. In this manuscript, four different ML algorithms have been tested for FOS prediction, including Random Forests (RF), Support Vector Regression (SVR), Gaussian Process Regression (GPR), and a hybrid approach such as Genetic Algorithm-optimized Multi-Layer Perceptrons (GA-MLP) [7]. Machine learning algorithms are often distinguished as either “black box” or “white box” approaches. Black box models, such as neural networks and support vector machines, are powerful tools for capturing complex nonlinear relationships between geotechnical parameters and slope stability outcomes. However, they provide limited interpretability, which can be a drawback in engineering practice where transparent decision-making is essential. White box models, such as decision trees or linear regression, offer greater interpretability by explicitly showing parameter contributions, but they may struggle to capture the nonlinearities inherent in slope stability problems.

The algorithms selected in this study represent a deliberate balance between these two categories. Random Forest (RF) is often considered semi-transparent, as it allows the extraction of feature importance to provide insights into the relative influence of geotechnical parameters. Gaussian Process Regression (GPR) offers a probabilistic framework that not only predicts FOS values but also provides confidence intervals, thereby enhancing transparency in uncertainty quantification. Support Vector Regression (SVR) has demonstrated strong generalization ability on small datasets, which is particularly relevant given the limited number of slope cases available in practice. Finally, the GA-MLP hybrid neural network is a black-box approach optimized for convergence and nonlinear pattern recognition, offering high predictive accuracy at the expense of interpretability. The results are generally encouraging, but vary based on the algorithm, input dataset, training and testing procedures, making it difficult to identify best-performing techniques for general rock slope conditions.

The definition of a training dataset and the way it is processed and partitioned for model training and evaluation can have a significant impact on the perceived accuracy and reliability of the final model. While many previous studies use aggregated datasets from multiple sites, which can contain hidden biases or site-specific inconsistencies, this study instead utilizes two independent datasets that are analyzed separately. Both datasets consist of real field data: Dataset #1 is taken from previous research [8], and Dataset #2 comes from an open pit operation located in Peru. In both cases, the FOS was calculated by Spencer’s method. These Spencer results serve as the reference values throughout and are used in the testing stage to evaluate each model’s performance. In addition to keeping the datasets separate, a three-stage data division strategy (training, validation, and testing), which is common in ML applications, was adopted to obtain realistic estimates of the models’ generalization capabilities. The training stage was used for model fitting and the validation stage for hyperparameter tuning and overfitting prevention, while the testing stage, reserved for data never seen by the models, provides the final assessment of generalization performance.

Normalization was employed during data preprocessing to place all the geotechnical parameters on the same scale, so that a model does not assign greater importance to features that simply have a wider range of raw values. This was particularly relevant for the algorithms based on distance computations and kernel functions, such as SVR and GPR. Five-fold cross-validation was used in the training and validation stages to increase the statistical robustness of the modelling procedure and limit the impact of random sampling variability [9]. The repeated partitioning of the training and validation data into different subsets allows for a more accurate estimate of the performance metrics than a single training/test split.

The geotechnical parameters in both datasets included unit weight (γ), cohesion (c′), internal friction angle (φ′), slope height (H) and slope angle (β). The selected geotechnical parameters are some of the most common slope stability inputs used in the literature for rock slopes and represent the most important aspects of the mechanical and geometrical conditions that control failure. Unit weight, cohesion and friction angle are the primary parameters used in the Mohr–Coulomb failure criterion, which is the most widely used strength model for rocks and soils. The slope height and slope angle are, of course, the main geometric parameters controlling the stress distribution and potential sliding modes in a slope. In addition to the mechanical properties of the slope, pore water pressure is generally considered a very important influence on slope stability due to its control of the effective stress and therefore shear strength. However, pore pressure information is not available for the current datasets, and this constitutes a major source of uncertainty that cannot be accounted for in the subsequent analysis.

The overarching goal of this research is to identify the ML models with the best accuracy and generalization capacity for predicting the factor of safety (FOS) of rock slopes, using the two independent real-world datasets described above. To this end, a review of the state of the art in ML applications in slope stability is first conducted to summarize methodological trends and key findings in the literature, as well as to highlight common strengths and weaknesses. A robust and transparent experimental design is then implemented to provide an unbiased evaluation of different models under realistic conditions. This includes a systematic and explicit modelling framework that integrates preprocessing (normalization), model selection (cross-validation), and performance evaluation (independent testing) steps to arrive at useful performance benchmarks. Finally, the predictive performance of each model is compared directly against the traditional LEM calculations to provide a grounded and quantitative assessment of their potential to serve as complementary or even alternative methods.

In so doing, this work can make several key contributions to the existing body of literature on data-driven approaches in geotechnical engineering. First, by avoiding the use of aggregated datasets, which can potentially compromise geological realism, this study allows for more representative performance evaluations of ML models. Second, by adopting a rigorous model evaluation protocol that clearly separates training/validation and testing data, this study ensures that the performance metrics reported are indicative of true generalization and not simply a reflection of the test set composition. Third, by providing a systematic and direct comparison of several ML algorithms under controlled conditions, this study clarifies the relative strengths and limitations of each for the specific task of FOS prediction in rock slopes. The practical value of this research is also significant for mining operations, as the ability to predict slope stability reliably from a limited set of easily measurable geotechnical parameters would help in the design of more efficient and robust excavation plans, monitoring programs, and risk management practices.

2. Dataset and Methodology

2.1. Dataset and Parameters

Five input parameters were considered for this study: slope angle (β), slope height (H), internal friction angle (φ), cohesion (c), and unit weight (γ). These parameters are chosen for their availability in most rock slope conditions, and they are also common in geotechnical practice. The output of interest is the Factor of Safety (FOS).

The influence of pore water pressure is well known for slope stability, and it is also among the most important input parameters for a stability analysis. Due to the lack of such data for all slope cases, it was not included in the study. If such data becomes available in the future, it can be easily accommodated in the ML models as another input parameter. For the most part, modifying the input parameters will affect the final predictions from trained ML models, but will not cause any substantial change to the forecasting procedure itself. However, the pore water pressure ratio (if applicable) needs to be taken into consideration for the overall slope stability evaluation [10].

The datasets used in this study were carefully prepared to remove unwanted components that could lead to inadequate training and validation of the predictive models. Before the machine learning tasks (MLTs) were applied, each dataset had to be collected, cleaned, transformed, and verified for dimensional consistency, among other steps aimed at ensuring the desired data quality and efficiency [11]. Two different datasets were used to train and test the ML models.

Dataset #1 was collected from the paper by Chen et al. [8], which provides a comprehensive study of slope failures along the Kaili–Sansui highway corridor in Guizhou Province, China. These slopes are developed in epimetamorphic rock, characterized by a thick weathered layer and a highly fractured rock mass, conditions that make them prone to instability. Failures in this region typically occur as rotational slips within the weathered epimetamorphic rock, consistent with the Mohr–Coulomb shear strength criterion. In total, 53 slope cases were collected from an on-site field investigation. Out of these, 41 cases with complete geotechnical input parameters (bulk density γ, slope height H, slope angle β, cohesion c, and friction angle φ) were selected for the model training and validation, with a 70/30 train-validation split. The remaining 12 cases with complete data were saved for an independent test set. The predicted FOS from the trained models was benchmarked against those values that were computed using the Spencer limit equilibrium method.

Dataset #2 was collected from an open pit mining operation in South America. It consisted of 129 slope cases, with complete data on five input variables for each case. The corresponding FOS values were computed for each case using the Spencer method. Then, 100 cases were separated to use for training and validation of the models, with a 70/30 split. The rest of the data, 29 cases, were put aside for independent testing. This set was used to evaluate the generalization capabilities of the ML models by comparing predicted FOS values with those computed using Spencer’s method. Geologically, the slopes in this dataset are developed in intrusive and volcanic rock masses typical of large copper mining operations in southern Peru (Arequipa region), composed mainly of andesite, diorite, and porphyry intrusives affected by alteration, fracturing, and regional fault systems. These conditions make the slopes prone to structurally controlled failures at the bench scale (planar, wedge, or toppling), while rotational slips in weathered or fractured rock masses are also observed at the inter-ramp scale. Importantly, the slope heights considered in this dataset correspond to inter-ramp configurations, i.e., failures involving several benches. Each bench has an approximate height of 30 m (for a double bench), and the dataset records slope heights that represent cumulative inter-ramp scale rather than single-bench failures. This distinction clarifies that the dataset captures larger-scale slope instabilities, consistent with rotational or structurally controlled mechanisms, and supports the use of Spencer’s method with Mohr–Coulomb parameters for analysis.

Each data point in both datasets corresponds to a distinct slope case with its own measured geotechnical parameters, rather than parametric variations in a single slope geometry. Importantly, both datasets represent rock slopes, not soil slopes, documented from field investigations and mining operations. Dataset #1 corresponds to weathered metamorphic rock slopes along the Kaili–Sansui highway corridor in China [8], while Dataset #2 corresponds to fractured rock slopes in a South American open pit mine. The input parameters were measured or estimated directly from site conditions, and the corresponding FOS values were calculated using Spencer’s limit equilibrium method, ensuring that the ML models are trained on meaningful geotechnical relationships rather than arbitrary numerical patterns.

2.2. Descriptive Statistics of Input Data

To ensure transparency of the input datasets utilized to train, validate and test the developed models, univariate statistics are presented for the five chosen geotechnical parameters: unit weight (kN/m³), cohesion (kPa), internal friction angle (°), slope angle (°), and slope height (m), for the two previously introduced datasets (namely, the metamorphic rock slope dataset from China and the open pit mine dataset from South America) in Table 1 and Table 2. This statistical summary offers a preliminary view of data variability, distribution patterns, and possible relationships with other variables, which is useful in order to get an overall understanding of how each feature may impact machine learning models’ performance in terms of predicting FOS [12].

Table 1 and Table 2 present univariate descriptive statistics for each dataset, including the sample size, minimum and maximum values, mean, and standard deviation for each of the five input variables. The statistics are computed over distinct slope cases, each with its own set of measured geotechnical parameters, rather than over parametric variations in a single slope geometry. They help to characterize dataset size and variability and to identify data errors and extreme values ahead of the normalization step.
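As an illustration of how such a univariate summary can be produced, the following minimal sketch computes the five reported quantities for one parameter. The function name `describe` and the cohesion values are hypothetical and are not taken from either dataset; the use of the sample standard deviation (n − 1 denominator) is also an assumption, since the text does not state which form was used.

```python
import statistics

def describe(values):
    """Return the univariate summary reported per parameter: n, min, max, mean, std."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumed choice)
    }

# Hypothetical cohesion values in kPa, for illustration only.
cohesion_kpa = [30.0, 32.5, 35.0, 38.0, 41.0]
summary = describe(cohesion_kpa)
print(summary)
```

Applying the same function to each of the five input columns would reproduce the layout of a descriptive statistics table with one row per parameter.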

The five input parameters are not independent; they interact to determine slope stability. Slope angle (β) and slope height (H) define the geometry and thus the driving forces, which are scaled by unit weight (γ). Cohesion (c) and friction angle (φ) define shear strength according to the Mohr–Coulomb criterion. The Factor of Safety (FOS) is therefore governed by the balance between driving forces (geometry and γ) and resisting forces (c and φ).
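For reference, the balance described above can be written in the standard textbook form of the Mohr–Coulomb criterion and the limit equilibrium definition of the FOS. The symbols σ_n (normal stress on the slip surface), τ_f (available shear strength), and τ_m (mobilized shear stress) are introduced here for illustration and do not appear elsewhere in the text.

```latex
\tau_f = c + \sigma_n \tan\varphi,
\qquad
\mathrm{FOS} = \frac{\tau_f}{\tau_m}
```

In this form, c and φ enter through the resisting term τ_f, while β, H, and γ control σ_n and τ_m through the geometry and weight of the sliding mass.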

Figure 1 and Figure 2 include violin plots that represent the distribution of each input variable from both datasets for deeper analysis. Violin plots are created from kernel density estimation (KDE) and contain a boxplot in the middle. The width of the violin plot at any given value indicates the probability density at that point. It provides a more in-depth visualization of the distribution’s shape, central tendency, variability, and outliers. For instance, Figure 1 presents violin plots for the input variables, which show that most of the unit weight measurements fall in the range of 21–26 kN/m³, while cohesion has a tighter distribution in the range 30–41 kPa, which are reasonable values for weathered metamorphic and mine slope materials.
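The density outline of a violin plot can be sketched with a one-dimensional Gaussian kernel density estimate, as below. This is a minimal illustration only: the function name `gaussian_kde_1d`, the unit weight values, and the bandwidth rule (Scott's rule of thumb) are assumptions, and the plotting software used for Figure 1 and Figure 2 may apply a different bandwidth.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Scott's rule of thumb for one-dimensional data (an assumed choice).
        bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # Standardized distance from every grid point to every sample: shape (len(grid), n).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)

# Hypothetical unit weights in kN/m^3, for illustration only.
unit_weight = [21.0, 22.5, 23.0, 24.0, 24.5, 25.0, 26.0]
grid = np.linspace(15.0, 32.0, 500)
density = gaussian_kde_1d(unit_weight, grid)

# The estimated density should integrate to approximately one over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
```

The width of the violin at a given value is proportional to `density` at that value, mirrored about the central axis.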

Additionally, histogram plots of each parameter were generated to provide a distributional view of the input variables within their respective ranges (Figure 3 and Figure 4). In each histogram, a vertical line is displayed that represents the median value, along with a symbol marking the mean value. In conjunction with their respective descriptive statistics, these overlays will be helpful in determining the skewness, multimodality, symmetry, and other statistical features of the distribution.

Pearson correlation matrices were also created and are presented in Figure 5 and Figure 6. These correlation plots provide a visual representation of the strength and direction of the relationship between each pair of variables, including the output FOS. The diagonal elements of the correlation matrix show self-correlation coefficients of 1.0 for each variable, and the off-diagonal elements display the correlations between different variables. Since the correlation matrix is symmetric, its upper and lower triangles are mirror images of each other. Values close to 1.0 or −1.0 indicate strong linear correlations: a positive value means that the two parameters tend to increase or decrease together, whereas a negative value means that one tends to decrease as the other increases [13,14]. These plots are used to support the interpretation of feature importance and to guide feature selection strategies later in the machine learning workflow.
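The properties of the correlation matrix described above (unit diagonal, symmetry, and signed off-diagonal entries) can be checked with a short sketch. The three columns and their values are hypothetical and chosen only to show one positive and one negative correlation; they are not taken from either dataset.

```python
import numpy as np

# Hypothetical slope cases (rows) with columns: H (m), beta (deg), FOS.
# Values are illustrative only.
cases = np.array([
    [20.0, 35.0, 1.60],
    [35.0, 40.0, 1.35],
    [50.0, 45.0, 1.20],
    [65.0, 50.0, 1.05],
    [80.0, 55.0, 0.95],
])
labels = ["H", "beta", "FOS"]

# rowvar=False treats each column as a variable and each row as an observation.
corr = np.corrcoef(cases, rowvar=False)

for name, row in zip(labels, corr):
    print(name, np.round(row, 2))
```

In this constructed example, H and beta rise together (positive coefficient), while FOS falls as H rises (negative coefficient).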

As a result of visually inspecting the numerical and graphical evidence above, potential outliers or clusters of data points were detected and remediated before proceeding to the training stage. Furthermore, this preprocessing analysis was used to inform and design data normalization procedures to ensure that all input features are provided to the learning algorithms on the same scale. This normalization is essential to avoid biasing the model toward certain features in ML algorithms that are sensitive to feature magnitude (e.g., models trained by gradient descent or built on distance-based kernels).

This statistical groundwork not only helps to calibrate the models better but also improves the interpretability, reliability and reproducibility of the overall machine learning workflow as it is applied to the problem of rock slope stability prediction.

2.3. Cross-Validation

A multi-level cross-validation framework was designed to guarantee a rigorous evaluation of the machine learning models while guarding against data leakage and overfitting. This framework consists of three levels: (1) keeping a hard separation between model development and testing; (2) splitting the model development set into an internal training/validation set; (3) applying k-fold cross-validation only to the training block.

In both datasets—the 53-case metamorphic rock slope dataset from Chen et al. [8] and the 129-case open pit mine dataset—the entire dataset was first split into a modelling set and a testing set. In dataset #1, 41 cases were used for developing the model, while 12 were set aside exclusively for final testing. In dataset #2, four data compositions were generated with repeated random sampling (without replacement). Each data composition was split into 100 cases for modelling and 29 for testing. This type of hold-out cross-validation, following Monte Carlo logic, provides a more robust estimate of a model’s generalization to varying data compositions.

Within each modelling set (41 or 100 cases), the data were further divided into 70% for training and 30% for validation. This internal validation set remained separate during training and was used to assess the model’s performance after development. Importantly, this 70/30 split preserves the independence of the validation data and prevents optimistic bias that could result from reusing the same data for both training and validation.
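The two splitting levels described above can be sketched as follows, using the Dataset #2 sizes given in the text (129 cases, 29 held out, 70/30 internal split). The function name `split_cases` and the seed values are hypothetical; the seeds simply stand in for the four random data compositions.

```python
import random

def split_cases(case_ids, n_test, train_fraction, seed):
    """Hold out a test set, then split the remainder into training and validation."""
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)                 # random sampling without replacement
    test = shuffled[:n_test]
    modelling = shuffled[n_test:]
    n_train = round(train_fraction * len(modelling))
    return modelling[:n_train], modelling[n_train:], test

# Four data compositions; the seed values are arbitrary assumptions.
compositions = [
    split_cases(range(129), n_test=29, train_fraction=0.7, seed=s)
    for s in (0, 1, 2, 3)
]
train, validation, test = compositions[0]
print(len(train), len(validation), len(test))  # prints: 70 30 29
```

Because the three subsets are slices of one shuffled list, no case can appear in more than one subset of the same composition.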

Kuhn and Johnson [15] recommend using 10-fold or 5-fold cross-validation. In the present manuscript, 5-fold cross-validation was performed only on the training set, as follows: the training set was divided into five parts with an equal number of observations, referred to as folds, as shown in Figure 7. The model was then trained five times, each time with four folds (80% of the data) for training and one fold (20% of the data) for internal validation. In this way, each training sample was used for validation exactly once and for training four times. The five validation results were averaged to provide a single, stable estimate of model performance during training.

Although 10-fold cross-validation is commonly recommended in the machine learning literature, the number of folds was limited to 5 because of the relatively small dataset sizes (53 and 129 cases). Using 10 folds would have produced validation subsets too small to provide stable estimates of model performance. The 5-fold structure therefore represents a compromise between statistical rigor and the sample size available within each fold. Importantly, the multi-level framework adopted here (strict separation of training, validation, and testing sets) ensures that the reported performance metrics are unbiased and not a result of data reuse. For larger datasets, the same framework can be extended to 10-fold cross-validation or other resampling strategies to further reduce variance and strengthen generalization assessment.
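The fold rotation described above can be sketched with a small index generator. The function name `k_fold_indices` is hypothetical, and the sketch assumes the training cases have already been shuffled; the 70 cases correspond to the 70% training block of Dataset #2.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(70, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Each fold uses 56 cases (80%) for fitting and 14 cases (20%) for internal validation.
    print(f"fold {i}: train={len(train_idx)} validation={len(val_idx)}")
```

Averaging a performance metric over the five validation folds gives the single training-stage estimate mentioned in the text.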

This strategy results in a more reliable assessment of the model’s generalization, as the training, internal validation, and final testing stages are strictly separated. The internal validation set (30%) mimics a real-world situation in which the model parameters are fine-tuned on data that were not used for training. The final test set (12 or 29 cases, depending on the dataset) was entirely withheld from the model development process and used only once for the final model prediction. Figure 8 shows the entire data processing workflow for both datasets.

The presented cross-validation framework provides a general, objective, and reproducible approach to validate machine learning models for rock slope stability analysis. It ensures that the predictions are free from overfitting and data reuse issues and that the models are tested on truly unseen and representative cases of slopes with real-world geotechnical conditions.

2.4. The Limit Equilibrium Method

In this study, the reference solution was obtained by the Spencer limit equilibrium method for FOS calculations. The Spencer method is a traditional analysis method widely recognized in the field of geotechnical engineering that satisfies both force and moment equilibrium for each slice of the discretized slope. It was chosen as the reference solution method due to its suitability for analyzing rock slope stability, which often involves complex geometries and heterogeneous material properties commonly encountered in mining and civil engineering projects [16].

FOS calculations were performed with Slide2® software (version 9.04, Rocscience Inc., Toronto, ON, Canada) and the open-source software Hyrcan (version 3.04) by GeoWizard. Slide2 is a two-dimensional limit equilibrium analysis program that has been used for a wide variety of research and practice applications. Slope geometry and material behavior were specified with the same five input parameters used to train machine learning models, namely slope angle, slope height, internal friction angle, cohesion, and unit weight. In this framework, slope geometry is fully defined by slope angle (β) and slope height (H), which vary across cases and capture the geometric diversity of the analyzed slopes, even though no explicit figures of slope geometries are included.

In this setup, stress conditions and boundary conditions are implicitly defined by the slope geometry and material properties within Spencer’s method, while the Mohr–Coulomb criterion provides the governing failure condition.

Figure 9 illustrates the conceptual slope geometry and failure mechanism considered in this study. Both datasets were analyzed using Spencer’s method under the assumption of circular slip surfaces, with slope geometry defined by height H and angle α, and shear strength governed by the Mohr–Coulomb criterion. This formulation was applied consistently across both datasets. Photographs of the slopes are unavailable due to confidentiality and data-source limitations; schematic diagrams are provided to illustrate the geometry.

The analysis parameters were set as follows:

Strength Type: Mohr-Coulomb
Number of slices: 25
Tolerance: 0.005
Maximum iterations: 75

These parameter settings are generally representative of a trade-off between computational accuracy and convergence robustness. The Mohr-Coulomb criterion was used to model shear strength behaviour since it is consistent with the input variables and is the conventional approach in traditional slope stability analysis.

The FOS values from Slide2 were used as the ground truth for training the machine learning models and evaluating their performance. Although the Spencer method assumes a constant interslice force inclination and does not account for strain-softening or progressive failure mechanisms, it remains one of the most robust and widely accepted methods for stability assessment of both soil and rock slopes.

Its use in this study serves two purposes:

To establish a reliable, physically based baseline for benchmarking the predictive accuracy of machine learning models; and
To assess whether data-driven approaches can approximate the results of a rigorous geotechnical method, thereby demonstrating their potential as practical tools for initial or supplementary slope stability assessments in real-world engineering applications.

2.5. Proposed Machine Learning Models

In the present study, four machine learning methods were used and compared independently with two datasets to determine which performs better. Supervised ML regression techniques were employed because the goal is to predict a numerical value (FOS).

The four supervised regression algorithms, MLP optimized via Genetic Algorithm (GA-MLP), Random Forest Regressor (RF), Support Vector Regressor (SVR), and Gaussian Process Regressor (GPR), were selected to evaluate and compare a diverse range of modelling approaches capable of capturing the complex, nonlinear relationships inherent in slope stability problems. These models have been used effectively in geotechnical engineering to predict the Factor of Safety (FOS) with high accuracy and generalization ability. For instance, Wang et al. [17] demonstrated that hybrid GA-MLP models perform better than traditional neural networks in slope failure modelling, as they improve convergence and help to avoid local minima. Zhang and Wei [18] showed that ensemble models such as Random Forest can yield precise forecasts on heterogeneous slope datasets. While interpretability and robustness to noise or multicollinearity were not directly investigated, the RF model exhibited high performance scores across multiple evaluation metrics. SVR has also been widely used because its kernel functions allow it to generalize well on small- and medium-sized datasets [19], and it is a suitable and effective regression method for slope stability analysis when only limited input parameters are available [20]. GPR is a suitable and comparatively recently adopted method that has been employed in many machine learning applications. Its probabilistic, kernel-based formulation makes it applicable to generic regression problems [20].

Each of the chosen models provides a different perspective: the GA-MLP captures nonlinear interactions with optimized neural networks; RF uses ensemble averaging to reduce overfitting; SVR applies margin-based learning for precise function approximation; and GPR offers a probabilistic approach, providing not only point predictions but also confidence intervals, which are crucial for decision-making in high-risk geotechnical applications.
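To illustrate how a Gaussian process yields both a point prediction and an interval, the following minimal sketch implements the textbook zero-mean GP posterior with a squared-exponential kernel. It is not the configuration used in this study: the kernel, its length scale, the noise level, and the one-feature data are all assumptions chosen for brevity.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gpr_predict(x_train, y_train, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_nt = rbf_kernel(x_new, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_nt @ alpha
    v = np.linalg.solve(chol, k_nt.T)
    var = np.diag(rbf_kernel(x_new, x_new)) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Hypothetical standardized inputs (one feature) and standardized FOS values.
x_train = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
y_train = np.array([1.3, 0.6, 0.0, -0.5, -1.2])
x_new = np.array([[0.0], [5.0]])   # one point inside the data range, one far outside

mean, std = gpr_predict(x_train, y_train, x_new)
lower, upper = mean - 1.96 * std, mean + 1.96 * std   # approximate 95% interval
```

The interval is narrow at the point that coincides with a training case and widens to roughly the prior spread far from the data, which is the behaviour that makes the uncertainty estimate informative.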

Additionally, before training each of the ML models, all the input geotechnical parameters and the output FOS were standardized using z-score normalization with the StandardScaler from the scikit-learn library. This process ensured each input variable had a mean of zero and a variance of one, based on the training dataset.

x_{\mathrm{norm}} = \frac{x - \mu}{\sigma} \qquad (1)

where x is the original value, and μ and σ are the mean and standard deviation of each parameter. This normalization improves numerical convergence and model stability. The same scaler was later applied to the testing data during model evaluation to ensure consistency. Figure 10 shows the data before and after normalization. Before normalization, the variables span much broader and very different ranges, which can influence the prediction results because variables with much larger magnitudes may exert a disproportionate influence.
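The fit-on-training, apply-to-testing pattern can be sketched directly from the z-score formula. The arrays below are hypothetical and are not taken from either dataset; the sketch uses the population standard deviation (ddof=0), which is the form scikit-learn's StandardScaler uses.

```python
import numpy as np

# Hypothetical cases with columns: gamma (kN/m^3), c (kPa), phi (deg), beta (deg), H (m).
x_train = np.array([
    [22.0, 30.0, 28.0, 40.0, 25.0],
    [24.0, 35.0, 32.0, 45.0, 60.0],
    [26.0, 41.0, 36.0, 50.0, 95.0],
])
x_test = np.array([[25.0, 33.0, 30.0, 42.0, 70.0]])

# Fit: the statistics come from the training rows only.
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)   # population standard deviation (ddof=0)

# Transform: the same mu and sigma are reused for the test rows.
x_train_norm = (x_train - mu) / sigma
x_test_norm = (x_test - mu) / sigma
```

Reusing the training statistics for the test rows keeps information about the test set out of the preprocessing step.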

2.5.1. Multi-Layer Perceptron Neural Network Optimized by Genetic Algorithm (GA-MLP)

The Genetic Algorithm–Multilayer Perceptron, or GA–MLP, is a hybrid machine learning approach that uses a Multilayer Perceptron (MLP) neural network for prediction and a Genetic Algorithm (GA) as the learning and optimization strategy that trains the network. The hybrid model is used to improve the learning performance and generalization capabilities of the neural network, particularly where the relationships between the input and output variables are complex and nonlinear, as is typical in geotechnical engineering applications [17].

The MLP is a type of feedforward artificial neural network, consisting of an input layer, one or more hidden layers with nonlinear activation functions, and an output layer, as shown in Figure 11. The network learns to approximate the underlying function by adjusting its internal weights and biases based on training data, which makes it powerful to model complex, nonlinear relationships in the data. However, the training process, typically performed with methods like backpropagation, may converge slowly or become trapped in local minima, especially when the problem is high-dimensional or in the presence of noise [21].

The GA is a population-based, stochastic optimization algorithm inspired by the process of natural evolution [22]. It operates on a population of candidate solutions, in this case encoded as real-valued genomes representing the weights and biases of the MLP. It uses operators such as selection, crossover, and mutation to iteratively improve the solutions over several generations. The fitness of each individual is typically evaluated with a loss function, such as the mean squared error, computed on the training data. In the GA–MLP hybrid approach, the GA replaces the gradient-based learning algorithm (such as backpropagation) typically used to train the MLP. Instead of relying on derivatives to update the parameters, the GA provides a global search method, exploring the solution space to find an optimal or near-optimal set of parameters for the MLP model [17]. In other words, training no longer requires the loss to be differentiable with respect to the network parameters, and the population-based search is less prone to becoming trapped in local minima.

2.5.2. Random Forest Regressor (RF)

The Random Forest Regressor (RF) is a machine learning ensemble algorithm for regression tasks that constructs numerous independent decision trees and aggregates their predictions to obtain a final, more accurate and stable result. It is a non-parametric, supervised learning method that can capture complex nonlinear relationships between input features and a continuous target variable [23].

Random Forest (RF) is a machine learning technique proposed by Breiman (2001) [24], which is an extension of the bagging (bootstrap aggregating) ensemble method. A set of models is trained on multiple random subsets of the original training data, which are created using resampling with replacement. The individual model in the forest is a decision tree that learns the mapping between inputs and outputs on its corresponding data subset.

In the case of regression problems, the prediction of the forest is the average prediction of all individual trees. The ensemble method is less variable than a single decision tree, thus improving generalization and reducing overfitting.
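The averaging step can be checked directly with scikit-learn. The following is a small sketch on synthetic data (the features and target are invented for illustration); it compares the forest prediction with the mean of the predictions of its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in data with five features (illustrative only).
X = rng.uniform(size=(120, 5))
y = 1.0 + X[:, 3] + 0.5 * X[:, 4] - 0.8 * X[:, 1] + rng.normal(scale=0.05, size=120)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Each fitted tree was trained on its own bootstrap sample of (X, y).
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
forest_pred = forest.predict(X[:5])

# The regression forest output is the mean over the individual trees.
print(np.allclose(per_tree.mean(axis=0), forest_pred))
```

The spread of `per_tree` along its first axis also gives a rough view of how much the individual trees disagree for a given input.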

2.5.3. Support Vector Regressor (SVR)

The Support Vector Regression (SVR) is an algorithm that is based on the concept of Support Vector Machines (SVM) and is used for regression problems. The main objective of SVR is to find a function that approximates the target values with maximum flatness. This is achieved by minimizing the norm of the weight vector, subject to a regularization parameter that controls the amount of deviation allowed [20].

The unique aspect of SVR is that it tries to fit the function within a specific tolerance range or epsilon-insensitive zone and does not penalize errors that fall within this range. This means that SVR only considers the support vectors or the points that lie outside this zone and tries to minimize the error between the predicted and the actual values for these points. This makes SVR more robust and less sensitive to outliers and noise in the data.

SVR has a powerful ability to capture nonlinear relationships in the data by using kernel functions, which are functions that map the input data into a higher-dimensional feature space. Some of the standard kernel functions used in SVR are radial basis function (RBF), polynomial, and sigmoid. By using these kernel functions, SVR can perform linear regression in the transformed feature space, while still being computationally efficient. This makes SVR suitable for high-dimensional, noisy, or complex geotechnical datasets.

Unlike many predictive models that directly minimize the calculated error (i.e., the difference between the target and system outputs), SVR improves its performance by optimizing the generalization bound of the regression. To this end, an ε-insensitive loss function (LF) ignores errors smaller than a predetermined value. Given a training dataset formed by pairs of samples, S = {(x_i, y_i) | i = 1, 2, …, N}, the SVR model with this LF (i.e., ε-SVR) attempts to find the optimum hyperplane that has the minimum distance from all sample points. More specifically, ε-SVR seeks a function g(x) that has the minimum distance from the target data y_i [25]. The linear regression is implemented in a high-dimensional feature space using the ε-insensitive LF. In addition, the lower the value of ‖w‖², the less complex the model [26]. For nonlinear problems, the input data are transformed into a high-dimensional space by means of a kernel mapping denoted by γ(x_i). A linear approach is then applied to the data in the feature space by solving the following convex optimization problem.

Minimize \frac{1}{2} ‖w‖^{2} + C \sum_{i = 1}^{z} (ϕ_{i} + ϕ_{i}^{*})

(2)

Subject to:

\left\{ \begin{matrix} y_{i} - w \cdot γ(x_{i}) - b \leq ε + ϕ_{i} \\ w \cdot γ(x_{i}) + b - y_{i} \leq ε + ϕ_{i}^{*} \\ ϕ_{i}, ϕ_{i}^{*} \geq 0 \end{matrix} \right.

(3)

where w and C denote the weight vector (in the feature space) and the penalty parameter, respectively. The constant C sets the trade-off between the training error and the complexity of the model. Also, b defines the bias, and the parameters ϕ_i and ϕ_i* are the slack variables measuring the deviation of training data outside the ε-insensitive zone. The precision factor is denoted by ε in the above formulation; in other words, only samples with a deviation greater than ε contribute to the error function [27]. Finally, the SVR prediction is expressed as a linear combination involving the Lagrange multipliers ρ_i and ρ_i*:

f(x) = w \cdot γ(x) + b = \sum_{i = 1}^{z} (ρ_{i} - ρ_{i}^{*}) γ(x_{i}) \cdot γ(x) + b

(4)
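Equation (4) can be verified numerically with scikit-learn's SVR, whose `dual_coef_` attribute stores the differences of the Lagrange multipliers for the support vectors. The sketch below uses synthetic data and arbitrary values for C, gamma, and epsilon (all assumptions made for illustration); it rebuilds the prediction from the kernel expansion and checks that training points which are not support vectors lie inside the ε-insensitive zone, up to the solver tolerance.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

rng = np.random.default_rng(2)

# Synthetic two-feature regression problem (illustrative only).
X = rng.uniform(-1, 1, size=(60, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=60)

gamma, eps = 0.5, 0.1  # assumed values, not the study's settings
model = SVR(kernel="rbf", C=10.0, gamma=gamma, epsilon=eps).fit(X, y)

# Kernel expansion of Equation (4): sum_i (rho_i - rho_i*) k(x_i, x) + b.
X_new = rng.uniform(-1, 1, size=(5, 2))
K = rbf_kernel(X_new, model.support_vectors_, gamma=gamma)
manual = K @ model.dual_coef_.ravel() + model.intercept_[0]

# Training points that are not support vectors sit inside the epsilon tube.
inside = np.setdiff1d(np.arange(len(X)), model.support_)
max_resid_inside = (
    np.abs(y[inside] - model.predict(X[inside])).max() if inside.size else 0.0
)

print(np.allclose(manual, model.predict(X_new)), len(model.support_), max_resid_inside)
```

Only the support vectors enter the sum, which is why the fitted model can be much sparser than the training set.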

2.5.4. Gaussian Process Regressor (GPR)

GPR is a non-parametric Bayesian approach that defines a prior over functions and updates this prior based on observed data, resulting in a posterior distribution over functions [28]. Gaussian processes can be applied to both classification and regression problems.

The Gaussian process is a generalization of the Gaussian distribution: whereas the Gaussian distribution describes the distribution of random variables, the Gaussian process describes a distribution over functions. A Gaussian process is specified in function space by its mean function j(z) and its covariance function n(z, z′).

j (z) = E (g (z))

(5)

n(z, z^{'}) = E((g(z) - j(z)) (g(z^{'}) - j(z^{'})))

(6)

The Gaussian process may be expressed as:

g(z) \sim GP(j(z), n(z, z^{'}))

(7)

For notational simplicity, as discussed by Wang et al. [29], the mean function is taken as zero. Consider a dataset L = {(z_i, o_i) | i = 1, 2, 3, …, n} containing n observations, where o_i denotes the scalar output and z_i is the M-dimensional input vector. Each observation is modeled as

o = g (z) + ε

(8)

where g(z) is the random regression function and ε is independent, identically distributed Gaussian noise with variance σ_n², i.e., ε ~ N(0, σ_n²). Two matrices, Z = [z_1, z_2, …, z_n] and O = [o_1, o_2, …, o_n], represent the input and output data, respectively. The vector of function values g = [g(z_1), g(z_2), …, g(z_n)]^T follows a Gaussian process with q(g | Z) = N(0, K), where K is the covariance matrix with entries n(z, z′):

K (Z, Z) = [\begin{matrix} n (z_{1}, z_{1}) & \dots & n (z_{1}, z_{n}) \\ ⋮ & ⋱ & ⋮ \\ n (z_{n}, z_{1}) & \dots & n (z_{n}, z_{n}) \end{matrix}]

(9)

The training outputs o and the test outputs o* jointly follow a multivariate normal distribution. For a given set of test points, the joint distribution of the observed target values o and the predicted values o* is expressed as follows.

[\begin{matrix} o \\ o^{*} \end{matrix}] ~ N (0, [\begin{matrix} K (Z, Z) & K (Z, Z^{*}) \\ K (Z^{*}, Z) & K (Z^{*}, Z^{*}) \end{matrix}])

(10)

Then, the predictive distribution of the function values o* at the test points Z* = [z_1*, z_2*, …, z_n*] is calculated using GPR.

q(o^{*} | Z^{*}, Z, O) \sim N(\bar{g}^{*}, cov(g^{*}))

(11)

\bar{g}^{*} = K(Z^{*}, Z) K(Z, Z)^{-1} O

(12)

cov(g^{*}) = K(Z^{*}, Z^{*}) - K(Z^{*}, Z) K(Z, Z)^{-1} K(Z, Z^{*})

(13)
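Equations (12) and (13) translate almost line by line into NumPy. The sketch below is illustrative only: the squared-exponential kernel, its length scale, the one-dimensional toy inputs, and the small diagonal jitter added for numerical stability are all assumptions and are not taken from the study.

```python
import numpy as np

def rbf(A, B, length_scale=0.5):
    """Squared-exponential covariance n(z, z') between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

# Toy one-dimensional training data (illustrative only).
Z = np.linspace(-2.0, 2.0, 8).reshape(-1, 1)
O = np.sin(Z).ravel()

# Test points: three training inputs plus one point far from all data.
Z_star = np.vstack([Z[:3], [[10.0]]])

jitter = 1e-8  # assumed small diagonal term to keep K(Z, Z) invertible
K = rbf(Z, Z) + jitter * np.eye(len(Z))
K_s = rbf(Z_star, Z)        # K(Z*, Z)
K_ss = rbf(Z_star, Z_star)  # K(Z*, Z*)

mean = K_s @ np.linalg.solve(K, O)            # Equation (12)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)  # Equation (13)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print(mean.round(4), std.round(4))
```

At the training inputs the posterior mean reproduces the observations and the standard deviation collapses towards zero, whereas far from the data the prediction reverts to the zero prior mean with a standard deviation close to one.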

One of the most popular uses of Gaussian process regression is Bayesian optimization, as discussed by Sonek et al. [30]. By using a Gaussian process model of the objective function to estimate the next evaluation point, Bayesian optimization techniques efficiently find the optimal value when evaluating a costly objective function.

In addition to the individual descriptions above, Table 3 summarizes the main advantages and disadvantages of each algorithm in the context of slope stability prediction. This comparative overview highlights the trade-offs between accuracy, interpretability, and computational efficiency, and clarifies the rationale for their inclusion in this study.

By selecting these four algorithms, the study ensures coverage of both black box and semi-transparent approaches, probabilistic and deterministic methods, and ensemble versus kernel-based techniques. This diversity allows for a balanced evaluation of predictive accuracy, generalization capability, and practical applicability in slope stability analysis.

2.6. Model Performance Evaluation

The accuracy of the regression model is evaluated through two aspects. Firstly, the magnitude of the error between the predicted FOS and the actual FOS is used as a measure of the model’s performance. Secondly, the fit between the predicted FOS and the actual FOS is assessed to evaluate the model’s performance [18]. Several metrics, including the coefficient of determination (R²), root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE), are used to gauge the accuracy and precision of the model in predicting FOS.

The patterns identified by the model from the data can be evaluated using R², which indicates the model’s goodness of fit, with values usually ranging from 0 to 1. A value closer to 1 signifies a better fit between the model’s predictions and the actual FOS [31,32].

R^{2} = 1 - \frac{\sum_{i = 1}^{S} (Y_{i, predicted} - Y_{i, observed})^{2}}{\sum_{i = 1}^{S} (Y_{i, observed} - \bar{Y}_{observed})^{2}}

(14)

RMSE represents the sample standard deviation of the difference between the actual FOS and the predicted FOS.

RMSE = \sqrt{\frac{1}{S} \sum_{i = 1}^{S} (Y_{i, observed} - Y_{i, predicted})^{2}}

(15)

MSE reflects the deviation between the predicted FOS and the actual FOS. A value closer to 0 indicates higher accuracy of the model.

MSE = \frac{1}{S} \sum_{i = 1}^{S} (Y_{i, observed} - Y_{i, predicted})^{2}

(16)

MAE measures explicitly the magnitude of the error between the predicted FOS and the actual FOS.

MAE = \frac{1}{S} \sum_{i = 1}^{S} |Y_{i, observed} - Y_{i, predicted}|

(17)

In all the above equations, Y_{i, observed} and Y_{i, predicted} stand for the actual and predicted values of the FOS, respectively. The term S is the number of data points, and \bar{Y}_{observed} is the average of the actual values of the FOS.
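As a worked example of Equations (14) to (17), the sketch below evaluates the four metrics on five hypothetical FOS pairs (invented numbers, not results from the study) and cross-checks them against the corresponding scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed and predicted FOS values (illustrative only).
y_obs = np.array([1.10, 1.45, 0.95, 2.00, 1.30])
y_pred = np.array([1.00, 1.50, 1.05, 1.80, 1.30])

S = len(y_obs)
resid = y_obs - y_pred

mse = np.sum(resid**2) / S                                       # Equation (16)
rmse = np.sqrt(mse)                                              # Equation (15)
mae = np.sum(np.abs(resid)) / S                                  # Equation (17)
r2 = 1 - np.sum(resid**2) / np.sum((y_obs - y_obs.mean()) ** 2)  # Equation (14)

print(f"R2={r2:.4f}  RMSE={rmse:.4f}  MSE={mse:.4f}  MAE={mae:.4f}")
```

Because RMSE is the square root of MSE, the two always rank models identically; MAE can rank them differently because it weights large residuals less heavily.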

3. Results and Discussion

3.1. Evaluation of Machine Learning Model Results

The main goal of the present study is to predict the Factor of Safety (FOS) in rock slopes. It was therefore necessary to investigate the feasibility of four machine learning-based methods, namely GA-MLP, RF, SVR, and GPR, to evaluate their performance and to decide which one is the most suitable based on the numerical results. Each machine learning model was assessed using two independent datasets: one compiled from slope failures along the Kaili–Sansui highway corridor in China (metamorphic slopes), and the other from a large-scale open pit mining operation in South America. Each dataset was processed and evaluated separately to preserve geological and geotechnical consistency and to test model generalization under different real-world conditions.

3.1.1. Results of GA-MLP

A hybrid machine learning model was developed to capture the nonlinear relationships between geotechnical input parameters and the Factor of Safety (FOS). It was inspired by the framework proposed by Wang et al. [17] and adapted in this study for application to rock slope stability analysis. The entire development process was divided into three sequential and interdependent phases: neural network architecture calibration, population size optimization, and final training with evolutionary optimization. The steps applied to Dataset #1 are explained below, together with the corresponding graphs and results. The same procedure was applied to Dataset #2, but it had to be repeated four times, once for each of the four groups into which Dataset #2 was divided (Section 2.3).

The first step involved selecting an appropriate architecture for the MLP. The input layer received the five standardized geotechnical features, and the output layer consisted of a single neuron predicting the FOS. To determine the optimal number of hidden neurons, a parametric study was conducted varying the hidden node count from 1 to 12. For each configuration, the MLP model was trained six times using different random seeds to reduce the influence of stochastic effects. In each repetition, the dataset was split into 70% training and 30% validation, maintaining fixed partitions across runs. Performance was evaluated using the R² and RMSE on the validation set. The architecture that achieved the best average performance with the least variability was selected for the next phase. This process helped avoid overfitting and ensured model robustness prior to GA-based training. Figure 12 shows the sensitivity analysis for the MLP model (Dataset #1); the best number of nodes in the hidden layer was 4.
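A parametric study of this kind could be sketched as follows. This is not the authors' code: the synthetic data, the use of scikit-learn's `MLPRegressor` with the L-BFGS solver, and the iteration limit are assumptions made for illustration. Only the sweep from 1 to 12 hidden nodes, the six seeds, and the fixed 70/30 split follow the description above.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

warnings.filterwarnings("ignore", category=ConvergenceWarning)

rng = np.random.default_rng(7)

# Synthetic standardized inputs and target (illustrative only).
X = rng.normal(size=(100, 5))
y = np.tanh(X[:, 3]) + 0.3 * X[:, 4] - 0.2 * X[:, 0] * X[:, 1]

# Fixed 70/30 partition reused for every configuration and seed.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

summary = {}
for n_hidden in range(1, 13):
    r2s, rmses = [], []
    for seed in range(6):  # six repetitions with different random seeds
        mlp = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="tanh",
                           solver="lbfgs", max_iter=300, random_state=seed)
        pred = mlp.fit(X_tr, y_tr).predict(X_va)
        r2s.append(r2_score(y_va, pred))
        rmses.append(np.sqrt(mean_squared_error(y_va, pred)))
    # Mean and spread of validation R2, plus mean validation RMSE.
    summary[n_hidden] = (np.mean(r2s), np.std(r2s), np.mean(rmses))

best_n = max(summary, key=lambda n: summary[n][0])
print(best_n, summary[best_n])
```

The sketch selects on mean validation R² only; the stored standard deviation lets the variability criterion mentioned above be applied by inspection.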

Once the MLP architecture was fixed, the second phase investigated the impact of GA population size on optimization quality. The finalized architecture—comprising five input neurons, four hidden neurons, and one output neuron—resulted in 29 trainable parameters, encoded in a real-valued genome:

20 weights (5 inputs × 4 hidden nodes)
4 biases for the hidden layer
4 weights from hidden to output
1 bias for the output node
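The parameter count and the genome layout can be made concrete with a short sketch. The ordering of the genes inside the genome (input weights, hidden biases, output weights, output bias) is an assumption for illustration; the source only gives the four block sizes. The tanh hidden activation follows the description of the fitness evaluation later in this section.

```python
import numpy as np

N_IN, N_HID, N_OUT = 5, 4, 1

# 20 input weights + 4 hidden biases + 4 output weights + 1 output bias
GENOME_LEN = N_IN * N_HID + N_HID + N_HID * N_OUT + N_OUT

def decode(genome):
    """Split a flat genome into MLP weights and biases (assumed gene order)."""
    genome = np.asarray(genome, dtype=float)
    i = 0
    W1 = genome[i:i + N_IN * N_HID].reshape(N_IN, N_HID)
    i += N_IN * N_HID
    b1 = genome[i:i + N_HID]
    i += N_HID
    W2 = genome[i:i + N_HID * N_OUT].reshape(N_HID, N_OUT)
    i += N_HID * N_OUT
    b2 = genome[i:i + N_OUT]
    return W1, b1, W2, b2

def forward(genome, X):
    """Forward pass with a tanh hidden layer and a linear output neuron."""
    W1, b1, W2, b2 = decode(genome)
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()

print(GENOME_LEN)
print(forward(np.zeros(GENOME_LEN), np.ones((3, 5))))
```

Any fixed ordering works as long as encoding and decoding agree, since the GA treats the genome as an unstructured real vector.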

Multiple GA–MLP models were trained using different population sizes while keeping the number of generations constant. Figure 13 shows performance outcomes with variation in the GA model population sizes for Dataset #1. For each configuration, the model was evaluated using R² and RMSE on both training and validation sets.

A ranking-based scoring system was used to objectively compare performance: each model was ranked by its four metrics, and the scores were aggregated to identify the best-performing population size. In addition, this empirical approach ensured that the selected population size maximized optimization quality while maintaining computational efficiency.

To validate the generated ranking, curves of R² and RMSE against population size were plotted for testing and validation, as shown in Figure 14. For R², the highest point on the curve marks the best value, and its position on the x-axis gives the corresponding population size. For RMSE, the lowest point on the curve is considered optimal.

For Dataset #1, the GA-MLP model with a population size of 325, as indicated in Table 4, gave the best predictive performance, with the highest R² and the lowest RMSE. The results for the other population sizes were very close, suggesting that the GA-MLP hybrid model was implemented correctly.

With the architecture and population size established, the final model training involved executing a reproducible evolutionary optimization procedure. Each individual in the GA population represented a candidate solution encoded as a 29-dimensional real-valued genome. The optimization loop followed standard evolutionary steps:

Fitness Evaluation: Each genome was decoded into MLP weights and biases, followed by a forward pass through the network using the tanh activation function in the hidden layer. The Mean Squared Error (MSE) between predicted and actual FOS values on the normalized training data was used as the fitness metric.
Selection: Parent individuals were selected via tournament selection (k = 3), favoring candidates with lower error (lower MSE and, equivalently, lower RMSE).
Crossover: A uniform crossover mechanism was applied to combine genes from two parents to form new offspring.
Mutation: A small Gaussian mutation (mean = 0, standard deviation = 0.05) was introduced per gene with a fixed mutation probability, promoting genetic diversity.

The GA was executed for a fixed number of generations. Although no explicit elitism was implemented, the global best-performing genome across all generations was preserved and used for final model evaluation.
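The optimization loop described above can be sketched in plain NumPy. This is an illustrative reconstruction, not the authors' implementation: the synthetic data, the gene ordering, the initial weight scale, the mutation probability, and the reduced population size and generation count are assumptions chosen so the example runs quickly. Tournament selection with k = 3, uniform crossover, Gaussian mutation with a standard deviation of 0.05, MSE fitness, and tracking of the global best genome without elitism follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID = 5, 4
GENOME_LEN = N_IN * N_HID + N_HID + N_HID + 1  # 29 parameters

def predict(genome, X):
    """Decode the genome (assumed gene order) and run a tanh-hidden forward pass."""
    W1 = genome[:20].reshape(N_IN, N_HID)
    b1, W2, b2 = genome[20:24], genome[24:28], genome[28]
    return np.tanh(X @ W1 + b1) @ W2 + b2

def mse(genome, X, y):
    return float(np.mean((predict(genome, X) - y) ** 2))

def tournament(pop, fit, k=3):
    """Pick k random individuals and return the one with the lowest error."""
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmin(fit[idx])]]

def run_ga(X, y, pop_size=60, generations=40, p_mut=0.1, sigma=0.05):
    pop = rng.normal(scale=0.5, size=(pop_size, GENOME_LEN))
    best_genome, best_fit, history = None, np.inf, []
    for _gen in range(generations):
        fit = np.array([mse(g, X, y) for g in pop])
        if fit.min() < best_fit:  # keep the global best, without elitism
            best_fit = float(fit.min())
            best_genome = pop[fit.argmin()].copy()
        history.append(best_fit)
        children = []
        for _ in range(pop_size):
            p1, p2 = tournament(pop, fit), tournament(pop, fit)
            mask = rng.random(GENOME_LEN) < 0.5   # uniform crossover
            child = np.where(mask, p1, p2)
            mutate = rng.random(GENOME_LEN) < p_mut
            child = child + mutate * rng.normal(0.0, sigma, GENOME_LEN)
            children.append(child)
        pop = np.array(children)
    return best_genome, best_fit, history

# Synthetic standardized data (illustrative only).
X = rng.normal(size=(80, 5))
y = np.tanh(X[:, 0] - 0.5 * X[:, 1]) + 0.3 * X[:, 4]

best_genome, best_fit, history = run_ga(X, y)
print(f"best training MSE: {best_fit:.4f}")
```

Because the best genome is stored outside the population, the best-so-far curve in `history` never increases even though individual generations may lose their best member.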

For the final evaluation, the best genome was decoded to retrieve the optimized weights and biases of the final GA–MLP model.

Figure 15 shows a scatter plot created with the Sklearn library in Python (version 3.11). These plots display a two-dimensional density visualization for the training and validation Dataset #1. The x-axis represents the true values, while the y-axis shows the predicted values of the GA-MLP regression model for the FOS.

After performing the same analysis and applying the GA-MLP hybrid model to Dataset #2, Figure 16 shows the results that were obtained for the four groups generated as explained in Section 2.3.

3.1.2. Results of Random Forest Regressor (RF)

The RF model was trained using the same two datasets and five geotechnical input variables adopted throughout this study. To ensure consistency across models and eliminate the risk of data leakage, all input features were standardized using z-score normalization. This transformation, applied via the StandardScaler from Scikit-learn, adjusts each variable to have zero mean and unit variance, with parameters fitted exclusively to the training data. The standardization process was integrated into a pipeline structure, where the scaler and RF model were combined into a unified workflow to maintain consistent preprocessing during training, cross-validation, and testing. In addition, the model was configured using the following hyperparameters:

Number of trees (n_estimators): 100
Maximum tree depth (max_depth): 5
Minimum samples to split a node (min_samples_split): 2
Minimum samples per leaf (min_samples_leaf): 1
Loss function (criterion): Squared error

These settings were selected to balance model complexity, predictive performance, and computational cost. A relatively shallow depth was chosen to reduce the risk of overfitting, particularly given the limited size of the training subsets within each group.

The results of the training and validation after applying the model to Dataset #1 are shown in Figure 17.

The results obtained in the four groups of Dataset #2 are shown on Figure 18.

3.1.3. Results of Support Vector Regressor (SVR)

Unlike tree-based models, SVR attempts to construct a regression function that approximates the target variable within a specified margin of tolerance (ε), while simultaneously maximizing model generalization by controlling model complexity through regularization. The SVR model was configured with an RBF kernel, which is well suited to geotechnical data due to its ability to approximate complex, nonlinear mappings. A grid search with five-fold cross-validation explored combinations of the following hyperparameters:

C: The regularization parameter controlling the trade-off between minimizing training error and maximizing margin (values tested: 1, 10, 100),
gamma: The kernel coefficient defining the influence of individual training examples (values tested: ‘scale’, 0.01, 0.1),
epsilon (ε): The margin of tolerance within which no penalty is given for prediction error (values tested: 0.01, 0.05, 0.1).

The optimal set of hyperparameters was selected based on the average R² score across the five cross-validation folds. This approach provided a statistically grounded basis for model selection and minimized the risk of overfitting. Figure 19 displays the results from Dataset #1, and Figure 20 shows the results from Dataset #2.
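The search described above maps naturally onto scikit-learn's `GridSearchCV`. In the sketch below the data are synthetic, and placing a `StandardScaler` in the same pipeline as the SVR is an assumption; the grid values, the RBF kernel, the five folds, and the R² scoring follow the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(5)

# Synthetic five-feature regression data (illustrative only).
X = rng.uniform(size=(100, 5))
y = 1.0 + np.sin(3 * X[:, 3]) + 0.5 * X[:, 4] - 0.4 * X[:, 1]

pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR(kernel="rbf"))])

param_grid = {
    "svr__C": [1, 10, 100],
    "svr__gamma": ["scale", 0.01, 0.1],
    "svr__epsilon": [0.01, 0.05, 0.1],
}

# 3 x 3 x 3 = 27 candidates, each scored by mean R2 over five folds.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

By default the best configuration is refitted on all the data passed to `fit`, so `search.predict` can be used directly on held-out samples.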

3.1.4. Results of Gaussian Process Regressor (GPR)

GPR was incorporated due to its strong ability to model uncertainty in regression tasks. GPR is a non-parametric, Bayesian method that estimates outputs based on a distribution over functions instead of a fixed function. This allows for probabilistic predictions with associated confidence intervals—an essential feature in slope stability applications where safety margins are critical and data availability may be limited.

The following hyperparameters were also configured:

alpha = 1 × 10⁻¹⁰: Introduced a small diagonal noise term to ensure matrix invertibility and numerical conditioning.
normalize_y = True: The output (FOS) values were normalized internally prior to fitting to improve convergence.
n_restarts_optimizer = 10: The optimizer restarted from multiple initializations to avoid local optima during marginal likelihood maximization.

Beyond predictive accuracy, GPR uniquely provides confidence intervals around each prediction, offering a probabilistic measure of uncertainty—especially valuable in geotechnical engineering, where decisions often involve safety-critical thresholds. Furthermore, its nonparametric nature and adaptive kernel setup make it effective even with limited data. The scatter plots of training and validation are shown in Figure 21 for Dataset #1 and Figure 22 for Dataset #2.
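The configuration and the resulting confidence intervals can be sketched with scikit-learn. The kernel is not specified in this section, so the constant-times-RBF kernel with an added white-noise term used below is an assumption, as are the synthetic data and the 95% interval width; `alpha`, `normalize_y`, and `n_restarts_optimizer` follow the values listed above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(6)

# Synthetic standardized inputs and noisy FOS-like target (illustrative only).
X_train = rng.uniform(-2, 2, size=(40, 5))
y_train = (1.2 + 0.4 * np.tanh(X_train[:, 3]) + 0.2 * X_train[:, 4]
           + rng.normal(scale=0.05, size=40))
X_new = rng.uniform(-2, 2, size=(5, 5))

# Assumed kernel: signal variance * RBF + learned observation noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)

gpr = GaussianProcessRegressor(
    kernel=kernel,
    alpha=1e-10,
    normalize_y=True,
    n_restarts_optimizer=10,
    random_state=0,  # assumed seed so the optimizer restarts are reproducible
)
gpr.fit(X_train, y_train)

mean, std = gpr.predict(X_new, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # approximate 95% interval

for m, lo, hi in zip(mean, lower, upper):
    print(f"FOS = {m:.3f}  [{lo:.3f}, {hi:.3f}]")
```

In a screening workflow, one option would be to compare the lower bound rather than the mean against the design threshold, which makes the decision more conservative where the model is uncertain.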

3.2. Performance Evaluation of Machine Learning Models During Training and Validation

To quantitatively assess the predictive performance of each machine learning model, four well-established regression metrics were employed: Coefficient of Determination (R²), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Squared Error (MSE). These metrics provide complementary insights regarding accuracy, error magnitude, and model robustness. The resulting R² values indicate the proportion of variance in the Factor of Safety (FOS) that the input features for each model can explain, while RMSE, MAE, and MSE quantify prediction error from different perspectives. These metrics were aggregated and compared across all models to identify which approach offers the best balance of accuracy and stability.

Table 5 and Table 6 summarize the performance outcomes for Dataset #1 and Dataset #2, respectively, across the four machine learning model techniques. Note that Dataset #2 has four different groups formed by random selection, with each group containing 29 data points for testing, and no data point is repeated across the groups.

3.3. Comparison Between Machine Learning Models and the Limit Equilibrium Method

With the four machine learning models trained on the two datasets, the next step is to test them on data that the models have never seen before, thereby assessing the actual performance of each one. The partitioning of the data was explained in Section 2.3. The ultimate objective is to determine the most accurate model for predicting the Factor of Safety (FOS), where both the limit equilibrium method and the machine learning technique are evaluated in both datasets.

3.3.1. Analysis for Dataset #1

Figure 23 presents the predicted Factor of Safety (FOS) obtained from four machine learning models—GA-MLP, RF, SVR, and GPR—plotted against the measured FOS values calculated using the Spencer Limit Equilibrium Method (LEM). The x-axis (“Measured FOS”) represents the reference stability values computed through the deterministic Spencer method, which serves as the benchmark for model evaluation. The y-axis (“Predicted FOS”) indicates the outputs from each trained machine learning model when applied to the validation dataset.

In all subplots, the dashed 1:1 line represents perfect agreement between predicted and measured values. Points located closer to this line indicate higher predictive accuracy, as the predicted FOS closely matches the Spencer-calculated reference. Conversely, deviations from the line represent prediction errors, with the magnitude of deviation reflecting the absolute residual. The performance indicators R², RMSE, MSE, and MAE reported in each subplot quantitatively summarize model performance, where higher R² values and lower error metrics correspond to better predictive fidelity.

Comparatively, the GPR model exhibits the strongest agreement with the Spencer results (R² = 0.988, RMSE = 0.06; shown in Figure 24), with nearly all points tightly aligned with the reference line. The GA-MLP model also achieves strong performance (R² = 0.940) but displays a few moderate deviations at higher FOS values. SVR and RF show lower overall agreement, with R² values of 0.862 and 0.814, respectively, and a wider scatter of points around the reference line, indicating less consistent generalization.

From an engineering perspective, this plot highlights not only the general predictive capability of each model but also its reliability in reproducing deterministic stability assessments. Since the Spencer method is widely regarded as a robust LEM approach for slope stability analysis, close alignment between predicted and measured FOS strengthens the confidence in a model’s applicability for real-world geotechnical decision-making.

Another critical assessment that must be considered is shown in Figure 25, which displays the residual-based outlier detection results for the four machine learning models evaluated in this study. Outliers are defined as predictions with an absolute residual exceeding 0.20 in Factor of Safety (FOS), representing a practical engineering tolerance threshold. These points are annotated with their dataset indices and measured–predicted FOS values, allowing direct traceability to specific slope cases.
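The outlier rule is simple enough to state in a few lines of code. The measured and predicted values below are hypothetical and serve only to show the mechanics; the 0.20 threshold is the one defined above, and the residual is taken here as predicted minus measured so that positive values are over-predictions.

```python
import numpy as np

THRESHOLD = 0.20  # absolute residual tolerance in FOS units

# Hypothetical measured (LEM) and predicted FOS values (illustrative only).
measured = np.array([1.05, 1.32, 0.98, 1.75, 2.10, 1.20])
predicted = np.array([1.10, 1.05, 1.00, 1.80, 2.45, 1.15])

residual = predicted - measured
is_outlier = np.abs(residual) > THRESHOLD

flagged = []
for idx in np.flatnonzero(is_outlier):
    kind = "over-prediction" if residual[idx] > 0 else "under-prediction"
    flagged.append((int(idx), kind))
    print(f"index {idx}: measured={measured[idx]:.2f}, "
          f"predicted={predicted[idx]:.2f}, {kind}")
```

Keeping the sign of the residual separates the unconservative over-predictions from the conservative under-predictions, which matters for the risk interpretation of the flagged cases.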

The distribution and frequency of outliers provide important insights into each model’s predictive robustness. The GA-MLP model exhibits several under-predictions and over-predictions, suggesting that while it captures nonlinear relationships effectively, it is more sensitive to cases underrepresented in the training set. RF shows fewer outliers but still displays notable deviations in specific instances, likely due to localized overfitting in certain decision tree partitions. SVR, while generally accurate, produces a small cluster of over-predictions in the higher FOS range, indicating potential difficulty in extrapolating beyond the most represented stability conditions. GPR demonstrates the most consistent performance, with no detected outliers at the selected threshold, reflecting its ability to capture both the central trend and the local variability in the data.

From a geotechnical risk perspective, under-predictions are conservative but may penalize design optimization, whereas over-predictions present a more critical concern as they may lead to unsafe stability assessments. The residual analysis suggests that, for the dataset and conditions evaluated, GPR offers the most reliable balance between accuracy and error dispersion, followed closely by RF and SVR, with GA-MLP showing greater variability in extreme cases. This reinforces the importance of combining global performance metrics (e.g., R², RMSE) with residual-based outlier inspection to ensure that model selection accounts not only for average accuracy but also for the frequency and magnitude of potentially critical prediction errors.

Table 7 summarizes the comparative performance of the evaluated models using multiple statistical indicators, with scores assigned to each metric and summed to produce an overall ranking. The ranking framework allows for an integrated assessment that balances accuracy and error measures, providing a clear picture of overall predictive capability. Figure 26 illustrates the ranking scores assigned to each ML model.
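One way to implement such a scoring scheme is sketched below. The exact scoring rule behind Table 7 is not given in the text, so this sketch assumes that, for each metric, the best of n models receives n points and the worst receives 1 point, with ties not handled. The model names and metric values are placeholders, not results from the study.

```python
import numpy as np

models = ["model_A", "model_B", "model_C", "model_D"]

# Placeholder metric values; the flag states whether higher is better.
metrics = {
    "R2":   (np.array([0.95, 0.80, 0.88, 0.91]), True),
    "RMSE": (np.array([0.08, 0.20, 0.15, 0.11]), False),
    "MSE":  (np.array([0.0064, 0.0400, 0.0225, 0.0121]), False),
    "MAE":  (np.array([0.07, 0.16, 0.12, 0.06]), False),
}

def rank_points(values, higher_is_better):
    """Give 1 point to the worst model and n points to the best (no tie handling)."""
    order = np.argsort(values if higher_is_better else -values)  # worst first
    points = np.empty(len(values), dtype=int)
    points[order] = np.arange(1, len(values) + 1)
    return points

total = sum(rank_points(v, hib) for v, hib in metrics.values())
ranking = [models[i] for i in np.argsort(-total)]

print(dict(zip(models, total.tolist())))
print(ranking)
```

Rank sums discard the size of the gaps between models, so two models with nearly identical metrics can still end up a full point apart on each criterion.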

The results indicate that the Gaussian Process Regressor (GPR) consistently outperformed the other models across all evaluation criteria, achieving the highest overall score and demonstrating superior predictive reliability. The GA-MLP model followed closely, also showing strong performance. In contrast, the SVR model achieved moderate results, and the RF model ranked lowest due to comparatively weaker accuracy and higher errors. This scoring approach offers a systematic basis for selecting the most suitable predictive method to predict the FOS.

In this study, machine learning is positioned as a complementary tool to established limit-equilibrium and numerical methods, not a replacement. First, ML enables rapid, scalable screening of many slopes using a small set of readily available inputs, which is critical in open pit operations where prioritization drives safety and productivity. Second, probabilistic models such as GPR provide confidence intervals around FOS predictions, directly supporting risk communication and decision thresholds—an attribute not natively offered by deterministic LEM outputs. Third, trained ML models act as surrogates to accelerate parametric sweeps and “what-if” analyses, reducing turnaround time before detailed numerical back-analysis. Fourth, ML models can be updated as monitoring data arrive (e.g., lab updates, in situ measurements), improving adaptability over static design assumptions. In our results, SVR delivered the highest average accuracy on the mining dataset, making it well-suited for fast screening; GPR consistently provided the most stable predictions and uncertainty bounds on the highway dataset, making it valuable for conservative, safety-critical decisions. Together, these capabilities highlight the practical significance of ML as an efficient front-end to guide where—and how—numerical modelling effort should be concentrated.

3.3.2. Analysis for Dataset #2

For the second dataset, the scatter plots present the predictive performance of the evaluated models across the four test groups (G1–G4) as shown in Figure 27. While the overall analysis follows the same predicted-versus-measured comparison as in the first dataset, some variations in relative performance are evident. Notably, the Support Vector Regressor (SVR) shows improved alignment with measured values across most groups, surpassing the Gaussian Process Regressor (GPR) in predictive accuracy under this data configuration. The GA-MLP model maintains a strong and consistent performance, remaining competitive across all groups. In contrast, the Random Forest model exhibits greater variability and larger deviations in several cases, indicating less stable predictions for this dataset. These results suggest that model effectiveness can be dataset-dependent, with SVR emerging as the top performer here, followed closely by GA-MLP, while GPR and RF occupy the subsequent ranks.

Figure 28 presents the predicted versus measured FOS values for the second dataset, highlighting the detected outliers based on residual thresholds. Across all models and groups, most predictions closely align with the 1:1 reference line, indicating generally strong predictive performance. However, certain groups exhibit a higher concentration of outliers—particularly in GA-MLP for G2 and G3—suggesting occasional deviations under specific input conditions. In contrast, SVR and GPR show fewer extreme residuals, reflecting more stable behavior across the evaluated subsets.
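Residual-based outlier flagging of this kind can be sketched as follows. The 0.2 threshold is taken from the caption of Figure 28; the function name and the sample values are hypothetical, and the residual is assumed to be the predicted minus the measured FOS.

```python
def flag_outliers(measured, predicted, threshold=0.2):
    """Return (index, residual) pairs whose absolute residual exceeds the threshold."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    outliers = []
    for i, (y, y_hat) in enumerate(zip(measured, predicted)):
        residual = y_hat - y  # positive means the model over-predicts FOS
        if abs(residual) > threshold:
            outliers.append((i, residual))
    return outliers


# Hypothetical measured and predicted FOS values (not from the paper).
measured = [1.10, 1.45, 0.95, 1.80]
predicted = [1.15, 1.10, 1.00, 2.10]
outliers = flag_outliers(measured, predicted)
```

Keeping the sign of the residual distinguishes over-prediction of FOS, which is unconservative for design, from under-prediction.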

SVR and GPR differ in formulation but share enough common ground to make a direct comparison meaningful. Both learn from data to predict continuous, real-valued outcomes, and both capture nonlinear dependencies by implicitly transforming the input space through kernels. SVR is a deterministic approach that optimizes a margin-based cost function and returns point predictions only. GPR, by contrast, is a probabilistic Bayesian framework that returns point predictions together with quantified uncertainty in the form of posterior variances. In this study, the difference in performance between the two datasets may reflect these methodological strengths. The margin-based approach of SVR may have suited the second dataset better if its structure favors more linear extrapolation outside the training domain and less smooth local fits. GPR smooths more strongly, depending on its kernel, hyperparameters, and covariance structure, which may explain its better performance on the first dataset and its possible underfitting on the second. The two methods are therefore complementary, and their relative performance is data dependent.
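For reference, the contrast between the two formulations can be written compactly in the standard textbook forms (see Smola and Schölkopf for SVR and Rasmussen and Williams for GPR in the reference list). The notation below is an assumption introduced here, not taken from the paper: f is the SVR regression function, ε the tube width, K the kernel matrix on the training inputs, k_* the vector of kernel values between the training inputs and a new input x_*, and σ_n² the noise variance.

```latex
% SVR: epsilon-insensitive loss, which ignores residuals smaller than epsilon
L_{\varepsilon}\bigl(y, f(x)\bigr) = \max\bigl(0,\; |y - f(x)| - \varepsilon\bigr)

% GPR: posterior mean and variance at a new input x_*
\mu_* = k_*^{\top}\,(K + \sigma_n^{2} I)^{-1}\, y
\qquad
\sigma_*^{2} = k(x_*, x_*) - k_*^{\top}\,(K + \sigma_n^{2} I)^{-1}\, k_*
```

The SVR loss yields only a point estimate f(x), whereas the GPR posterior variance is the quantity from which prediction intervals are derived.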

The boxplots in Figure 29 display the variability of R² and RMSE of all the groups for each model, where the red dot represents the mean. It can be seen that the SVR model has slightly better average values of R² and RMSE. However, by examining the boxplots, it can be noted that the SVR has a much wider range compared to GPR. This means that while the SVR model might have higher performance in certain groups, its performance is less stable compared to GPR.

To enable a fair and consistent comparison of model performance across all groups, the statistical indicators for each model—namely R², RMSE, MSE, and MAE—were computed for each of the four groups (G1–G4) and then averaged. This averaging process yields a single representative value for each metric per model, reflecting its overall predictive capability over the entire dataset division rather than relying on a single group’s performance. The results of these average metrics for each model are summarized in Table 8 and shown in Figure 30, providing a consolidated basis for ranking and evaluating the models.
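The per-group computation and averaging described above can be sketched as follows, using the usual definitions of the four indicators. The function names are hypothetical, and the paper does not state which software was used for this step.

```python
from math import sqrt


def regression_metrics(measured, predicted):
    """Compute R2, RMSE, MSE, and MAE for one group of predictions."""
    n = len(measured)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [y - p for y, p in zip(measured, predicted)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_y = sum(measured) / n
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all measured values are equal")
    r2 = 1.0 - (mse * n) / ss_tot
    return {"R2": r2, "RMSE": sqrt(mse), "MSE": mse, "MAE": mae}


def average_metrics(per_group):
    """Average each metric over a list of per-group metric dictionaries."""
    return {
        key: sum(group[key] for group in per_group) / len(per_group)
        for key in per_group[0]
    }
```

Note that the average of the per-group RMSE values is in general not the square root of the average MSE, so the two columns of an averaged table need not satisfy RMSE² = MSE exactly.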

Table 9 and Figure 31 present a comparative evaluation of the machine learning models by aggregating their average performance across all groups of the second dataset. The scoring framework assigns values to each performance metric and sums them to produce an overall ranking, allowing for a balanced assessment between accuracy and error measures. The results reveal that the Support Vector Regressor (SVR) achieved the highest total score, indicating superior predictive performance in this dataset, followed closely by the Gaussian Process Regressor (GPR). The GA-MLP model demonstrated moderate performance, while the Random Forest (RF) model ranked lowest due to comparatively lower accuracy and higher error values. This combined tabular and graphical representation provides a clear visual and numerical basis for selecting the most effective predictive method for FOS estimation in this scenario.
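One way to implement such a scoring framework is sketched below with the averaged values from Table 8. The scoring rule is an assumption: each model receives, for each metric, one point plus one point for every model it strictly beats, so tied models share the same score. With this rule the totals come out as 8 (GA-MLP), 4 (RF), 15 (SVR), and 12 (GPR), which agrees with the RANK column of Table 9.

```python
# Averaged test metrics per model, copied from Table 8.
TABLE_8 = {
    "GA-MLP": {"R2": 0.765, "RMSE": 0.191, "MSE": 0.039, "MAE": 0.120},
    "RF": {"R2": 0.720, "RMSE": 0.208, "MSE": 0.160, "MAE": 0.140},
    "SVR": {"R2": 0.866, "RMSE": 0.144, "MSE": 0.023, "MAE": 0.070},
    "GPR": {"R2": 0.855, "RMSE": 0.151, "MSE": 0.023, "MAE": 0.099},
}
HIGHER_IS_BETTER = {"R2": True, "RMSE": False, "MSE": False, "MAE": False}


def metric_scores(table, metric, higher_is_better):
    """Score = 1 + number of models that are strictly worse on this metric."""
    scores = {}
    for model, row in table.items():
        value = row[metric]
        if higher_is_better:
            worse = sum(1 for other in table.values() if other[metric] < value)
        else:
            worse = sum(1 for other in table.values() if other[metric] > value)
        scores[model] = worse + 1
    return scores


def total_scores(table):
    """Sum the per-metric scores of every model."""
    totals = {model: 0 for model in table}
    for metric, higher in HIGHER_IS_BETTER.items():
        for model, score in metric_scores(table, metric, higher).items():
            totals[model] += score
    return totals


totals = total_scores(TABLE_8)
```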

3.4. Sensitivity Analysis Results

Figure 32 illustrates the sensitivity scores of the five input parameters using the trained Gaussian Process Regression (GPR) model. For Dataset #1, the most influential parameter is cohesion (c), and the friction angle (φ) shows relatively lower impact on the predicted FOS, suggesting the model relies more heavily on strength and geometric characteristics for this specific dataset.

In contrast, Dataset #2 shows a slightly different pattern. While cohesion (c) remains the most dominant factor, the influence of unit weight (γ) increases, and the importance of slope height (H) decreases. This shift in sensitivity suggests that the relationship between input parameters and FOS varies depending on the geological and geotechnical context of the dataset.

Overall, this comparison highlights the adaptability of the GPR model in capturing the influence of different parameters across distinct slope stability scenarios.
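The exact sensitivity procedure behind Figure 32 is not restated in this section. As one possible illustration, a one-at-a-time perturbation around a baseline input could be sketched as follows. The baseline uses the Dataset #1 means from Table 1; the `toy_fos` predictor and its coefficients are arbitrary placeholders standing in for a trained model and do not reproduce the scores in Figure 32.

```python
def oat_sensitivity(predict, baseline, rel_step=0.10):
    """One-at-a-time sensitivity: normalized absolute change in the prediction
    when each input moves from (1 - rel_step) to (1 + rel_step) times baseline."""
    effects = {}
    for name, value in baseline.items():
        low = dict(baseline, **{name: value * (1.0 - rel_step)})
        high = dict(baseline, **{name: value * (1.0 + rel_step)})
        effects[name] = abs(predict(high) - predict(low))
    total = sum(effects.values())
    if total == 0:
        return {name: 0.0 for name in effects}
    return {name: effect / total for name, effect in effects.items()}


def toy_fos(x):
    """Hypothetical stand-in for a trained model (arbitrary coefficients)."""
    return (0.5 + 0.03 * x["c"] + 0.015 * x["phi"]
            - 0.01 * x["alpha"] - 0.001 * x["H"] - 0.005 * x["gamma"])


# Mean input values of Dataset #1 (Table 1).
BASELINE = {"gamma": 23.88, "c": 33.69, "phi": 29.94, "alpha": 32.68, "H": 46.45}
scores = oat_sensitivity(toy_fos, BASELINE)
```

In practice `predict` would wrap the trained GPR model, and the scores would depend on the chosen baseline and step size.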

The residual cumulative distribution analysis (RCDA) plots, shown in Figure 33, compare residuals from four ML models across Dataset #1 and Dataset #2. In Dataset #1, GPR exhibits the best performance with nearly all residuals below 0.1, while RF shows the largest spread of errors. In Dataset #2, residuals increase across all models; however, GPR and GA-MLP remain more consistent, whereas RF again exhibits high variability. These results confirm that GPR provides more accurate and stable predictions, particularly in Dataset #1, and highlight the variation in model performance across different datasets.
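A residual cumulative distribution of this kind can be computed as sketched below. The function names and sample values are hypothetical; the 0.1 limit mirrors the residual level mentioned above for GPR on Dataset #1.

```python
def residual_cdf(measured, predicted):
    """Sorted absolute residuals paired with their cumulative fractions."""
    abs_res = sorted(abs(y - p) for y, p in zip(measured, predicted))
    n = len(abs_res)
    return [(r, (i + 1) / n) for i, r in enumerate(abs_res)]


def fraction_below(measured, predicted, limit=0.1):
    """Share of predictions whose absolute residual is below the limit."""
    abs_res = [abs(y - p) for y, p in zip(measured, predicted)]
    return sum(1 for r in abs_res if r < limit) / len(abs_res)


# Hypothetical measured and predicted FOS values (not from the paper).
measured = [1.0, 1.2, 1.5, 2.0]
predicted = [1.05, 1.05, 1.5, 2.3]
cdf = residual_cdf(measured, predicted)
share = fraction_below(measured, predicted)
```

A curve that rises quickly toward 1 at small residuals indicates a model whose errors are concentrated near zero.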

3.5. Recommendations for Future Research

The performance evaluation and ranking showed that several aspects of the GA-MLP, RF, SVR, and GPR models need further investigation for FOS prediction. The SVR model achieved the highest average accuracy on Dataset #2, but its performance varied more across the groups than that of the GPR model. Exploring more complex kernels, adaptive parameter tuning, or hybrid models that combine SVR with ensemble techniques could reduce these fluctuations and increase stability.

The GPR model proved very stable and achieved a good average ranking, which makes it a candidate for real-world deployment in slope stability problems. One disadvantage worth exploring in future work is its computational cost relative to the other methods. Sparse Gaussian process models or inducing-point methods could improve its computational efficiency.

The approach taken in this study to combine model performance across several groups to form an overall average ranking could be repeated using data which span a wider variety of geological/geotechnical scenarios to further validate the ranking process and enhance the general applicability of the resulting model selection criteria.

Other potential metrics for future research beyond the statistical measures used in this work include uncertainty quantification measures, leveraging the probabilistic nature of the GPR model, to consider both the prediction and the confidence in that prediction.

3.6. Practical Recommendations for Mining Operations

The most direct and immediate use of the results in this research for mine operations and engineering practices is to consider the integration of ML models in the typical decision-making process of slope stability assessment. This could be through either developing a faster, adaptable slope stability analysis or implementing a simple user interface, such as software or a mobile application, with the model already trained and implemented.

ML models could be an excellent addition to the established suite of analysis tools, such as the Spencer method, which would remain the basis for more rigorous analyses. The ML-based prediction could identify slopes that need immediate attention and so help prioritize which slopes are analyzed first. This prioritization could help streamline analysis and remediation processes to meet slope stability and safety requirements.

The software or mobile application could be designed to take in key geotechnical and geometric information about a slope that can be measured quickly and easily, and to use the trained model to predict the FOS and the corresponding confidence interval. In addition to the predicted FOS and its confidence range, it may be helpful to output the model rankings and relative errors, including those of other models trained on the data for that specific input combination, if available.
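The input side of such a tool could look like the sketch below. The field names and the range check are assumptions introduced for illustration; the limits are the minimum and maximum values of Dataset #2 reported in Table 2. Warning the user when an input falls outside the training range is one possible safeguard, since a data-driven model is less reliable outside the range of its training data.

```python
from dataclasses import asdict, dataclass

# Minimum and maximum of each input in Dataset #2 (Table 2).
TRAINING_RANGES = {
    "unit_weight": (19.31, 32.64),     # kN/m^3
    "cohesion": (0.07, 126.86),        # kPa
    "friction_angle": (9.28, 44.11),   # degrees
    "slope_angle": (19.90, 27.82),     # degrees
    "slope_height": (20.50, 901.80),   # m
}


@dataclass(frozen=True)
class SlopeInput:
    """The five inputs used by the models in this study."""
    unit_weight: float
    cohesion: float
    friction_angle: float
    slope_angle: float
    slope_height: float


def out_of_range_fields(slope, ranges=TRAINING_RANGES):
    """Names of inputs that fall outside the range seen during training."""
    values = asdict(slope)
    return [
        name for name, (low, high) in ranges.items()
        if not low <= values[name] <= high
    ]


# Hypothetical slope; the 35 degree slope angle exceeds the Table 2 maximum.
candidate = SlopeInput(25.0, 40.0, 30.0, 35.0, 150.0)
warnings = out_of_range_fields(candidate)
```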

The mining industry could use this software or application as part of a typical slope monitoring process in the operations center to provide feedback on prioritizing follow-up investigations or remediation work or to identify areas of concern that may require more attention during routine checks.

It is important to distinguish machine learning analysis from traditional sensitivity analysis. Sensitivity analysis explores how variations in individual parameters affect stability in a specific slope model, but it does not generalize across multiple slopes. Machine learning, by contrast, learns nonlinear, multivariate relationships between geotechnical parameters and FOS from diverse datasets. This allows ML models to predict FOS for new slope cases without re-running full parametric studies, offering efficiency and broader applicability. Thus, ML complements traditional sensitivity analysis by providing predictive capability across datasets, while sensitivity analysis remains valuable for detailed case-specific exploration.

4. Conclusions and Recommendations

4.1. Summary of Findings

A comparative assessment of four supervised learning models (GA-MLP, RF, SVR, and GPR) was performed on two independent datasets to predict the FOS of rock slopes from major geotechnical parameters. The datasets were split into training, validation, and (for Dataset #2 only) testing groups through Monte Carlo sampling so that the results would be statistically robust and unaffected by possible sampling bias.
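A repeated random split of this kind can be sketched as follows. The group sizes (70 training, 30 validation, 29 testing, four groups) follow Tables 6 and 8 for Dataset #2. Holding out one fixed test set and reshuffling only the remaining samples for each group is an assumption of this sketch, as are the function name and the seed.

```python
import random


def monte_carlo_splits(n_samples, n_test=29, n_train=70, n_groups=4, seed=0):
    """Hold out a fixed test set, then draw repeated random train/validation
    splits from the remaining samples. Returns index lists."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    test = sorted(indices[:n_test])
    remaining = indices[n_test:]
    if n_train > len(remaining):
        raise ValueError("not enough samples left for the training set")
    groups = []
    for _ in range(n_groups):
        shuffled = remaining[:]
        rng.shuffle(shuffled)
        train = sorted(shuffled[:n_train])
        validation = sorted(shuffled[n_train:])
        groups.append((train, validation))
    return test, groups


# Dataset #2 contains 129 samples (Table 2).
test_idx, groups = monte_carlo_splits(129)
```

Fixing the seed makes the splits reproducible, so every model can be trained and evaluated on identical groups.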

Both sets of results confirm that the performance of a model is not necessarily consistent across different datasets, as shown by the difference in the prediction errors when different data splits were used for the same model. For Dataset #1, the GPR algorithm had the best predictive performance, followed by GA-MLP, while the SVR and RF models gave the worst results. For Dataset #2, the SVR was more accurate than the GPR on average but showed higher variance across the multiple independent splits. The GPR model, however, performed better in terms of stability and certainty of the prediction, as can be seen from the residual and boxplot figures. Both models provide a reasonable trade-off between performance and computational efficiency, and the choice between them should depend on the use case: when the highest accuracy is the priority, the SVR model should be preferred; when model stability and certainty of the prediction are the primary objectives, the GPR should be used.

Much of this performance stems from the kernel-based nature of both methods, as kernel-based regression is known to be particularly effective for complex nonlinear predictor-target relationships in small or noisy datasets. GPR and SVR both implicitly map the input space into a higher-dimensional feature space in which the predictors and outputs are more linearly related, allowing for improved generalization and prediction stability. By contrast, the RF model is limited in its ability to extrapolate beyond the range of the training data, which increases the risk of poor predictions for inputs outside that range. While GA-MLP can be a very powerful learner, this particular hybrid model was less stable than the GPR and SVR models because the genetic algorithm component introduced some randomness in the network configuration, which may be more problematic for smaller samples.

Overall, the results obtained for both datasets showed that the best results out of all four algorithms are close to the values calculated using Spencer’s LEM, which can be a good indication of their potential in practical applications. The average performance across four independent groupings of data achieved through Monte Carlo sampling and presented across four key evaluation metrics demonstrates the stability of GPR and SVR as the most consistent and reliable approaches in different data and target settings.

4.2. Contributions of Research

This study makes several contributions to the field of rock slope stability analysis and ongoing efforts to improve the efficiency of its practice through the application of machine learning algorithms and methodologies. This research is an example of the systematic use of supervised machine learning—GA-MLP, RF, SVR, and GPR—to develop algorithms that can accurately predict the Factor of Safety (FOS) for mining applications with full reproducibility and transparency. By building and comparing the performance of models on multiple datasets with well-chosen input and target parameters, the study established that GPR and SVR were the most reliable and stable learning models for this purpose, outperforming the others. This conclusion demonstrates the value of kernel-based and probabilistic learning models for capturing the nonlinear and multivariate nature of the underlying relationships in the geotechnical dataset more effectively than purely ensemble-based or hybrid-neural approaches.

The study also established a procedure for rapid FOS prediction that is suitable for practical, operational implementation in a decision support framework in a mining environment, taking into account all the relevant factors, including statistical validation (cross-validation, analysis of residuals, identification of outliers, etc.), proper preparation of the training dataset, and integration with other information sources. The code and data prepared for this study also allow us to present a domain-specific dataset structure that can be further explored in future research and used as a ground truth for comparison between multiple learning methods.

In contrast with previous research, this work treated the two datasets independently and, within the primary dataset, reserved a portion for final testing on previously unseen data in addition to the training and validation groups. This testing group forms the basis for the performance metrics of all models built on the primary dataset and should therefore be understood as a completely independent benchmark. This enforces a three-stage approach (training, validation, and testing) that is in line with real-world application of the proposed technique, where a trained model is expected to generalize to new data, and it demonstrates the robustness of the results across the datasets and associated metrics.

4.3. Limitations of Study

The current research is subject to several limitations that must be considered when interpreting the results and planning future work. The choice of input parameters for FOS prediction was deliberately narrowed to five geotechnical variables, namely slope angle (α), slope height (H), internal friction angle (φ), cohesion (c), and unit weight (γ), to ensure consistency across datasets and comparability of results. While these parameters are fundamental to classical limit equilibrium analyses, this approach excludes other important influences such as pore-water pressure, rock mass discontinuities, groundwater conditions, and time-dependent effects, all of which can significantly affect slope stability in practice.

In particular, the absence of pore-water pressure data represents a major source of uncertainty, since pore pressure directly affects effective stress and shear strength. Likewise, geological factors such as discontinuity orientation and spacing, degree of weathering, lithology, and anisotropy were not available in the datasets used, limiting the realism of the models. Consequently, the machine learning models developed here should be viewed as complementary tools for preliminary assessment rather than substitutes for detailed geological investigation or advanced numerical modelling. Their value lies in providing rapid, data-driven predictions from a limited set of measurable inputs, which is useful for screening and prioritization in mining and civil engineering projects.
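The effect of pore pressure on shear strength follows from the Mohr-Coulomb criterion written in terms of effective stress, τ = c′ + (σ − u) tan φ′. The short sketch below evaluates it for a dry and a wet case; the function name and the input values are hypothetical and are not taken from either dataset.

```python
from math import radians, tan


def shear_strength(cohesion, normal_stress, friction_angle_deg, pore_pressure=0.0):
    """Mohr-Coulomb shear strength in kPa: c' + (sigma_n - u) * tan(phi')."""
    effective_stress = normal_stress - pore_pressure
    return cohesion + effective_stress * tan(radians(friction_angle_deg))


# Hypothetical values: c' = 40 kPa, sigma_n = 200 kPa, phi' = 30 degrees.
dry = shear_strength(40.0, 200.0, 30.0)
wet = shear_strength(40.0, 200.0, 30.0, pore_pressure=80.0)
```

With these numbers the available shear strength drops from about 155.5 kPa to about 109.3 kPa once a pore pressure of 80 kPa is included, which illustrates why a model trained without pore pressure data cannot represent wet conditions.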

The datasets themselves are of limited size and representativeness, being derived from two specific case studies: epimetamorphic rock slopes from a highway excavation in China (Dataset #1) and an open pit mine in Peru (Dataset #2). While these datasets capture real-world conditions, they are unlikely to represent the full diversity of lithologies, climates, or mining practices. Similarly, the use of Spencer’s limit equilibrium method in a two-dimensional Mohr–Coulomb framework as the sole reference ensured methodological consistency but restricted the models to emulating a simplified stability analysis, excluding possible 3D effects, strain softening, or progressive failure.

Finally, limitations of the selected ML approaches should be noted. Although GPR and SVR performed well, their kernel-based nature makes them less scalable to very large datasets (e.g., GPR scales cubically with training size). While this is not a concern for the relatively small datasets used here, larger-scale applications will require approximation strategies or alternative learners.

Future research should prioritize integrating geological and hydrogeological parameters—such as discontinuity orientation, lithology, weathering, and pore-water pressure—into machine learning frameworks, as this will be essential for advancing realism, robustness, and practical applicability of ML-based slope stability prediction.

Author Contributions

M.T. conducted the literature review and wrote the initial draft of the manuscript. M.M. revised and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Acknowledgments

The authors sincerely thank the University of Arizona, which provides essential access to journal publications and databases crucial for this review.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hoek, E.; Bray, J.D. Rock Slope Engineering, 3rd ed.; CRC Press: Boca Raton, FL, USA, 1981; ISBN 978-0-429-18219-8.
2. Wyllie, D.C. Rock Slope Engineering: Civil Applications, 5th ed.; CRC Press: Boca Raton, FL, USA; Taylor & Francis: Oxfordshire, UK, 2017; ISBN 978-1-315-15403-9.
3. Duncan, J.M. State of the Art: Limit Equilibrium and Finite-Element Analysis of Slopes. J. Geotech. Eng. 1996, 122, 577–596.
4. Spencer, E. A Method of Analysis of the Stability of Embankments Assuming Parallel Inter-Slice Forces. Géotechnique 1967, 17, 11–26.
5. Christian, J.T.; Ladd, C.C.; Baecher, G.B. Reliability Applied to Slope Stability Analysis. J. Geotech. Eng. 1994, 120, 2180–2207.
6. Phoon, K.-K.; Kulhawy, F.H. Characterization of Geotechnical Variability. Can. Geotech. J. 1999, 36, 612–624.
7. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; The MIT Press: Cambridge, MA, USA, 2005; ISBN 978-0-262-25683-4.
8. Chen, C.; Xiao, Z.; Zhang, G. Stability Assessment Model for Epimetamorphic Rock Slopes Based on Adaptive Neuro-Fuzzy Inference System. Electron. J. Geotech. Eng. 2011, 16, 93–107.
9. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, Québec, QC, Canada, 20–25 August 1995; pp. 1137–1143.
10. Nanehkaran, Y.A.; Pusatlı, T.; Chengyong, J.; Chen, J.; Cemiloglu, A.; Azarafza, M.; Derakhshani, R. Application of Machine Learning Techniques for the Estimation of the Safety Factor in Slope Stability Analysis. Water 2022, 14, 3743.
11. Bai, G.; Hou, Y.; Wan, B.; An, N.; Yan, Y.; Tang, Z.; Yan, M.; Zhang, Y.; Sun, D. Performance Evaluation and Engineering Verification of Machine Learning Based Prediction Models for Slope Stability. Appl. Sci. 2022, 12, 7890.
12. Mahmoodzadeh, A.; Mohammadi, M. Forecasting Factor of Safety of Slopes Stability Using Several Machine Learning Techniques. Res. Sq. 2021.
13. Lin, S.; Zheng, H.; Han, C.; Han, B.; Li, W. Evaluation and Prediction of Slope Stability Using Machine Learning Approaches. Front. Struct. Civ. Eng. 2021, 15, 821–833.
14. Ragam, P.; Kumar, N.; Ajith, J.; Karthik, G.; Himanshu, V.K.; Machupalli, D.S.; Murlidhar, B.R. Estimation of Slope Stability Using Ensemble-Based Hybrid Machine Learning Approaches. Front. Mater. 2024, 11, 1330609.
15. Kühn, M.; Johnson, K. Applied Predictive Modeling; Springer Ebooks: New York, NY, USA, 2013.
16. Nouri, M.; Sihag, P.; Salmasi, F.; Abraham, J. Prediction of Homogeneous Earthen Slope Safety Factors Using the Forest and Tree Based Modelling. Geotech. Geol. Eng. 2021, 39, 2849–2862.
17. Wang, H.; Moayedi, H.; Kok Foong, L. Genetic Algorithm Hybridized with Multilayer Perceptron to Have an Economical Slope Stability Design. Eng. Comput. 2021, 37, 3067–3078.
18. Zhang, M.; Wei, J. Analysis of Slope Stability Based on Four Machine Learning Models: An Example of 188 Slopes. Period. Polytech. Civ. Eng. 2025, 69, 505–518.
19. Khajehzadeh, M.; Keawsawasvong, S. Predicting Slope Safety Using an Optimized Machine Learning Model. Heliyon 2023, 9, e23012.
20. Tien Bui, D.; Moayedi, H.; Gör, M.; Jaafari, A.; Foong, L.K. Predicting Slope Stability Failure through Machine Learning Paradigms. Int. J. Geo-Inf. 2019, 8, 395.
21. Meng, J.; Mattsson, H.; Laue, J. Three-dimensional Slope Stability Predictions Using Artificial Neural Networks. Int. J. Numer. Anal. Methods Geomech. 2021, 45, 1988–2000.
22. Miller, A. Ship Model Identification with Genetic Algorithm Tuning. Appl. Sci. 2021, 11, 5504.
23. Xu, H.; He, X.; Shan, F.; Niu, G.; Sheng, D. Machine Learning in the Stochastic Analysis of Slope Stability: A State-of-the-Art Review. Modelling 2023, 4, 426–453.
24. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
25. Smola, A.J.; Schölkopf, B. A Tutorial on Support Vector Regression. Stat. Comput. 2004, 14, 199–222.
26. Cherkassky, V.; Ma, Y. Practical Selection of SVM Parameters and Noise Estimation for SVM Regression. Neural Netw. 2004, 17, 113–126.
27. Xie, X.; Liu, W.; Tang, B. Spacebased Estimation of Moisture Transport in Marine Atmosphere Using Support Vector Regression. Remote Sens. Environ. 2008, 112, 1846–1855.
28. Zhu, B.; Hiraishi, T.; Pei, H.; Yang, Q. Efficient Reliability Analysis of Slopes Integrating the Random Field Method and a Gaussian Process Regression-based Surrogate Model. Int. J. Numer. Anal. Methods Geomech. 2021, 45, 478–501.
29. Wang, C.; Wu, X.; Kozlowski, T. Gaussian Process–Based Inverse Uncertainty Quantification for TRACE Physical Model Parameters Using Steady-State PSBT Benchmark. Nucl. Sci. Eng. 2019, 193, 100–114.
30. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. Adv. Neural Inf. Process. Syst. 2012, 25, 1–9.
31. Moayedi, H.; Tien Bui, D.; Kalantar, B.; Kok Foong, L. Machine-Learning-Based Classification Approaches toward Recognizing Slope Stability Failure. Appl. Sci. 2019, 9, 4638.
32. Trinidad, M.; Momayez, M. Machine Learning in Slope Stability: A Review with Implications for Landslide Hazard Assessment. GeoHazards 2025, 6, 67.

Figure 1. Violin plots for input parameters from the study by Chen et al. [8].

Figure 2. Violin plots for input parameters from a real open pit operation.

Figure 3. Histogram plots for input parameters from the study by Chen et al. [8].

Figure 4. Histogram plots for input parameters from a real open pit operation.

Figure 5. Correlation matrix of the dataset from the study by Chen et al. [8].

Figure 6. Correlation matrix of the dataset from a real open pit operation.

Figure 7. Schematic of 5-fold cross-validation.

Figure 8. Algorithm that synthesizes data processing.

Figure 9. Conceptual diagram of circular slope failure showing slope height H, slope angle α, and a representative slip surface.

Figure 10. Boxplot comparison of geotechnical variables and FOS before and after normalization.

Figure 11. ANN architecture with two hidden layers [21].

Figure 12. The RMSE sensitivity analysis of the proposed MLP with respect to the number of neurons in each hidden layer.

Figure 13. Performance results with change in GA-MLP model population sizes.

Figure 14. The variation in training and testing accuracy criteria versus population size.

Figure 15. GA-MLP scatter plots for training and validation (Dataset #1).

Figure 16. GA-MLP scatter plots for training and validation (Dataset #2).

Figure 17. RF scatter plots for training and validation (Dataset #1).

Figure 18. RF scatter plots for training and validation (Dataset #2).

Figure 19. SVR scatter plots for training and validation (Dataset #1).

Figure 20. SVR scatter plots for training and validation (Dataset #2).

Figure 21. GPR scatter plots for training and validation (Dataset #1).

Figure 22. GPR scatter plots for training and validation (Dataset #2).

Figure 23. Comparison between the predicted values and true values of the FOS using four ML models.

Figure 24. Comparison of R² and RMSE values for testing data.

Figure 25. Outlier plots with residual > 0.2 for each ML model.

Figure 26. Ranking scores of ML models based on multiple performance metrics—Dataset #1.

Figure 27. Comparison between the predicted values and true values of the FOS using four ML models.

Figure 28. Outlier plots with residual > 0.2 for each ML model.

Figure 29. Boxplot Comparison of R² and RMSE Distributions for ML Models.

Figure 30. Average and comparison of R² and RMSE values for testing data.

Figure 31. Ranking scores of ML models based on multiple performance metrics—Dataset #2.

Figure 32. Sensitivity analysis for GPR model.

Figure 33. Residual cumulative distribution of FOS prediction.

Table 1. Statistical description of the dataset from the study by Chen et al. [8].

Parameter	Unit	Number	Maximum	Minimum	Mean	SD
Unit weight (γ)	kN/m³	53	27.40	19.60	23.88	2.60
Cohesion (c)	kPa	53	44.10	6.50	33.69	8.78
Friction angle (φ)	Degree	53	38.00	19.00	29.94	5.86
Slope angle (α)	Degree	53	50.00	10.00	32.68	7.60
Slope height (H)	m	53	99.00	10.00	46.45	16.09

Table 2. Statistical description of the dataset from a real open pit mine.

Parameter	Unit	Number	Maximum	Minimum	Mean	SD
Unit weight (γ)	kN/m³	129	32.64	19.31	25.68	2.49
Cohesion (c)	kPa	129	126.86	0.07	42.13	26.95
Friction angle (φ)	Degree	129	44.11	9.28	25.81	6.52
Slope angle (α)	Degree	129	27.82	19.90	24.58	1.82
Slope height (H)	m	129	901.80	20.50	271.29	174.85

Table 3. Advantages and disadvantages of selected machine learning algorithms for predicting the FOS in rock slopes.

Model	Advantages	Disadvantages
GA-MLP	Models highly nonlinear interactions; GA optimization improves convergence and avoids local minima; Suitable for complex geotechnical problems.	Considered a “black box” model with limited transparency; Requires larger datasets for stable training; Higher computational cost due to GA optimization.
RF	Ensemble averaging reduces overfitting; Robust to noise and multicollinearity; Feature importance can be extracted for interpretability.	Predictions are less smooth than kernel-based methods; Interpretability is lower than that of single decision trees; May degrade with highly imbalanced datasets.
SVR	Strong generalization on small/medium datasets; Captures nonlinear relationships via kernel functions; Effective with limited geotechnical inputs.	Limited interpretability (“black box”); Sensitive to kernel and hyperparameter selection; May show higher variance in predictions.
GPR	Provides probabilistic predictions with confidence intervals; Robust with small datasets; Flexible kernel functions capture complex nonlinearities.	Computationally expensive for large datasets; Performance is highly dependent on kernel choice; Less interpretable compared to tree-based methods.

Table 4. The variation in different GA-MLP structures with population size.

Population Size	Train R²	Train RMSE	Validation R²	Validation RMSE	Rank: Train R²	Rank: Train RMSE	Rank: Validation R²	Rank: Validation RMSE	Total Rank	Final Rank
50	0.960	0.091	0.776	0.241	12	12	12	12	48	12
100	0.976	0.070	0.818	0.218	11	11	11	11	44	11
150	0.985	0.056	0.858	0.192	4	4	7	7	22	3
200	0.985	0.056	0.837	0.206	3	3	10	10	26	8
250	0.979	0.066	0.850	0.197	10	10	9	9	38	10
300	0.983	0.059	0.886	0.172	7	7	4	4	22	3
325	0.989	0.048	0.927	0.138	1	1	1	1	4	1
350	0.984	0.057	0.852	0.196	5	5	8	8	26	8
375	0.984	0.058	0.860	0.191	6	6	6	6	24	5
400	0.988	0.050	0.915	0.149	2	2	2	2	8	2
450	0.983	0.059	0.888	0.171	9	9	3	3	24	5
500	0.983	0.059	0.880	0.177	8	8	5	5	26	8

Table 5. Evaluation Metrics of ML Models for FOS Prediction—Dataset #1.

Model	Data	Samples	R²	RMSE	MSE	MAE
GA-MLP	Training Set	28	0.920	0.164	0.027	0.094
GA-MLP	Validation Set	13	0.797	0.263	0.069	0.192
RF	Training Set	28	0.740	0.296	0.088	0.151
RF	Validation Set	13	0.829	0.241	0.058	0.160
SVR	Training Set	28	0.991	0.056	0.003	0.042
SVR	Validation Set	13	0.954	0.124	0.015	0.100
GPR	Training Set	28	0.988	0.065	0.004	0.052
GPR	Validation Set	13	0.902	0.182	0.033	0.152

Table 6. Evaluation Metrics of ML Models for FOS Prediction—Dataset #2.

Model	Group	Data	Samples	R²	RMSE	MSE	MAE
GA-MLP	1	Training Set	70	0.856	0.151	0.023	0.096
	1	Validation Set	30	0.874	0.141	0.020	0.107
	2	Training Set	70	0.891	0.133	0.018	0.082
	2	Validation Set	30	0.887	0.142	0.020	0.086
	3	Training Set	70	0.864	0.132	0.017	0.096
	3	Validation Set	30	0.780	0.205	0.042	0.149
	4	Training Set	70	0.876	0.158	0.025	0.097
	4	Validation Set	30	0.871	0.141	0.020	0.104
RF	1	Training Set	70	0.952	0.087	0.008	0.065
	1	Validation Set	30	0.758	0.196	0.038	0.152
	2	Training Set	70	0.952	0.089	0.008	0.066
	2	Validation Set	30	0.810	0.184	0.034	0.114
	3	Training Set	70	0.940	0.088	0.008	0.066
	3	Validation Set	30	0.834	0.178	0.032	0.133
	4	Training Set	70	0.962	0.087	0.008	0.064
	4	Validation Set	30	0.732	0.204	0.041	0.142
SVR	1	Training Set	70	0.884	0.135	0.018	0.062
	1	Validation Set	30	0.948	0.091	0.008	0.060
	2	Training Set	70	0.884	0.137	0.019	0.052
	2	Validation Set	30	0.901	0.133	0.018	0.050
	3	Training Set	70	0.891	0.118	0.014	0.041
	3	Validation Set	30	0.965	0.082	0.007	0.052
	4	Training Set	70	0.891	0.148	0.022	0.048
	4	Validation Set	30	0.923	0.109	0.012	0.057
GPR	1	Training Set	70	0.923	0.110	0.012	0.070
	1	Validation Set	30	0.912	0.119	0.014	0.089
	2	Training Set	70	0.941	0.097	0.009	0.059
	2	Validation Set	30	0.901	0.133	0.018	0.081
	3	Training Set	70	0.930	0.095	0.009	0.055
	3	Validation Set	30	0.956	0.091	0.008	0.071
	4	Training Set	70	0.956	0.094	0.009	0.058
	4	Validation Set	30	0.916	0.114	0.013	0.086

Table 7. Performance assessment of the ML models for predicting FOS by evaluating the statistical index results—Dataset #1.

Model	R²	Score	RMSE	Score	MSE	Score	MAE	Score	RANK
GA-MLP	0.940	3	0.137	3	0.019	3	0.110	3	12
RF	0.814	1	0.241	1	0.058	1	0.160	1	4
SVR	0.862	2	0.208	2	0.043	2	0.154	2	8
GPR	0.988	4	0.061	4	0.004	4	0.055	4	16

Table 8. Average Performance Metrics of Machine Learning Models Across All Data Groups.

Model	Data	Samples	R²	RMSE	MSE	MAE
GA-MLP	Testing Set	29	0.765	0.191	0.039	0.120
RF	Testing Set	29	0.720	0.208	0.160	0.140
SVR	Testing Set	29	0.866	0.144	0.023	0.070
GPR	Testing Set	29	0.855	0.151	0.023	0.099
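
Averaging across data groups can be sketched as follows. The per-group testing-set values behind Table 8 are not listed here, so this example uses the GA-MLP validation-set values from Table 6 purely for illustration, and it assumes that the average is a simple arithmetic mean over the four groups. It also shows that an averaged MSE need not equal the square of the averaged RMSE.

```python
from statistics import mean

# GA-MLP validation-set RMSE and MSE for groups 1-4, copied from Table 6.
rmse_by_group = [0.141, 0.142, 0.205, 0.141]
mse_by_group = [0.020, 0.020, 0.042, 0.020]

# Assumption: each metric is averaged independently with an arithmetic mean.
mean_rmse = mean(rmse_by_group)
mean_mse = mean(mse_by_group)

# Squaring the averaged RMSE gives a smaller value than averaging the MSEs
# whenever the per-group RMSE values differ.
squared_mean_rmse = mean_rmse ** 2
gap = mean_mse - squared_mean_rmse
```

For this reason, the MSE and RMSE columns of an averaged table are only approximately related by squaring.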

Table 9. Performance assessment of the ML models for predicting FOS by evaluating the statistical index results—Dataset #2.

Model	R²	Score	RMSE	Score	MSE	Score	MAE	Score	RANK
GA-MLP	0.765	2	0.191	2	0.039	2	0.120	2	8
RF	0.720	1	0.208	1	0.160	1	0.140	1	4
SVR	0.866	4	0.144	4	0.023	3	0.070	4	15
GPR	0.855	3	0.151	3	0.023	3	0.099	3	12
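
The scoring behind the RANK column appears to work as follows: for each index, the worst model receives a score of 1 and the best receives the highest score, and RANK is the sum of the four scores. The sketch below applies that reading to the Table 9 values. The tie rule (models with equal values share a score, as SVR and GPR do for MSE) is inferred from the table rather than stated, and the names `score_models` and `HIGHER_IS_BETTER` are hypothetical.

```python
# Testing-set metrics for Dataset #2, copied from Table 9.
metrics = {
    "GA-MLP": {"R2": 0.765, "RMSE": 0.191, "MSE": 0.039, "MAE": 0.120},
    "RF":     {"R2": 0.720, "RMSE": 0.208, "MSE": 0.160, "MAE": 0.140},
    "SVR":    {"R2": 0.866, "RMSE": 0.144, "MSE": 0.023, "MAE": 0.070},
    "GPR":    {"R2": 0.855, "RMSE": 0.151, "MSE": 0.023, "MAE": 0.099},
}

# R2 is better when larger; the three error indices are better when smaller.
HIGHER_IS_BETTER = {"R2": True, "RMSE": False, "MSE": False, "MAE": False}


def score_models(metrics, higher_is_better):
    """Sum per-metric scores (1 = worst) into a total score per model."""
    totals = {model: 0 for model in metrics}
    for name, better_high in higher_is_better.items():
        # Distinct values ordered from worst to best; tied models share
        # a score (assumption inferred from the table).
        ordered = sorted({m[name] for m in metrics.values()},
                         reverse=not better_high)
        score_of = {value: i + 1 for i, value in enumerate(ordered)}
        for model, values in metrics.items():
            totals[model] += score_of[values[name]]
    return totals


totals = score_models(metrics, HIGHER_IS_BETTER)
```

Under these assumptions, `totals` gives 8 for GA-MLP, 4 for RF, 15 for SVR, and 12 for GPR, which matches the RANK column.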

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

