Prediction of Rockfill Materials’ Shear Strength Using Various Kernel Function-Based Regression Models—A Comparative Perspective

The mechanical behavior of the rockfill materials (RFMs) used in a dam's shell must be evaluated for the safe and cost-effective design of embankment dams. However, the characterization of RFMs with specific reference to shear strength is challenging and costly, as the materials may contain particles larger than 500 mm in diameter. This study explores the potential of various kernel function-based Gaussian process regression (GPR) models to predict the shear strength of RFMs. A total of 165 data samples compiled from the literature were selected to train and test the proposed models. A comparison of the developed GPR models shows that the best-performing model was the Pearson universal kernel (PUK) model, with an R-squared (R2) of 0.9806, a correlation coefficient (r) of 0.9903, a mean absolute error (MAE) of 0.0646 MPa, a root mean square error (RMSE) of 0.0965 MPa, a relative absolute error (RAE) of 13.0776%, and a root relative squared error (RRSE) of 14.6311% in the training phase. It also performed well in the testing phase, with R2 = 0.9455, r = 0.9724, MAE = 0.1048 MPa, RMSE = 0.1443 MPa, RAE = 21.8554%, and RRSE = 23.6865%. The predictions of the GPR-PUK model are more accurate than those of the other models and are in good agreement with the actual shear strength of RFMs, supporting the feasibility and effectiveness of the model.


Introduction
In civil engineering projects, such as rockfill dams, slopes, and embankments, rockfill materials (RFMs) are often used as filling materials. RFMs consist of coarse gravels, cobbles, and boulders mined from rock quarries or riverbeds. Quarried materials are angular to sub-angular, whereas riverbed materials are rounded to sub-rounded. Mineral composition, particle size, shape, gradation, individual particle strength, void content, relative density, and surface roughness of the particles all influence the behavior of the RFMs used in the construction of rockfill dams. Several studies in geotechnical engineering have been carried out, such as one examining the contact between the soils and concrete used in earth and rockfill dams [1]. Inverse analysis provides a means to better understand dam

• To examine the capability of various kernel function-based GPR computing techniques, namely, radial basis function kernel, polynomial kernel, and Pearson universal kernel, in predicting the shear strength of RFMs;
• To undertake a comparative study of the shear strength prediction of rockfill materials and select the best outcomes provided by the developed GPR models based on the performance metrics;
• To conduct sensitivity analyses to determine the effect of each input parameter on the RFMs' shear strength.
The remainder of the paper is organized as follows. Section 2 presents the three kernel function-based GPR computing techniques used to predict the shear strength of rockfill materials. Section 3 describes the data catalog and correlation analysis. Performance evaluation measures are presented in Section 4. Section 5 reports the developed models' results. Based on the observations and results of these models, Section 6 draws conclusions and outlines future research directions.

Gaussian Process Regression
Gaussian process regression (GPR) is a probabilistic regression method that has been used in a variety of machine learning applications [30]. Its probabilistic formulation provides a solution to general kernel regression problems. The training process can be viewed as Bayesian: the model function is assumed to follow a Gaussian distribution, which encodes prior information about the output function [31]. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution [32]. Figure 1 illustrates the development scheme of the selected methodology.
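To make the idea concrete, the following is a minimal sketch of the GPR posterior mean for a zero-mean prior. It is not the WEKA implementation used in this study: the kernel width `gamma`, the `noise` term, the helper names, and the toy one-dimensional data are all assumptions chosen for illustration.

```python
import math

def rbf(a, b, gamma=2.0):
    """Squared-exponential covariance between two feature vectors (assumed kernel)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gpr_fit(X, y, kernel, noise=1e-6):
    """Return alpha = (K + noise * I)^-1 y for a zero-mean Gaussian process."""
    K = [[kernel(xi, xj) + (noise if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    return solve(K, y)

def gpr_predict(X, alpha, kernel, x_new):
    """Posterior mean at x_new: sum_i alpha_i * k(x_i, x_new)."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X))

# Toy data (illustrative only, not from the RFM database): y = x^2
X_train = [[0.0], [0.5], [1.0], [1.5], [2.0]]
y_train = [x[0] ** 2 for x in X_train]
alpha = gpr_fit(X_train, y_train, rbf)
print(gpr_predict(X_train, alpha, rbf, [1.0]))  # approximately 1.0, a training target
```

Swapping `rbf` for another kernel function changes the model without touching the fitting code, which is why the choice of kernel is the main point of comparison in this study.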
Materials 2022, 15, 1739

Radial Basis Function Kernel
The radial basis function (RBF) kernel is a typical kernel function used in several kernel-based machine learning algorithms. It is widely used in support vector machine classification and is generally a reasonable first choice. Unlike the linear kernel, the RBF kernel maps samples nonlinearly into a higher-dimensional space, so it can handle a nonlinear relationship between class labels and attributes [33]. Furthermore, the linear kernel is a special case of the RBF kernel, because a linear kernel with a penalty parameter C performs the same as an RBF kernel with suitably chosen parameters.

In comparison, the values of polynomial kernels can grow toward infinity when $\gamma x_i^{T} x_j + r > 1$ and the degree is high [34], whereas RBF kernel values remain bounded. Furthermore, under some parameters the sigmoid kernel is not valid, i.e., it is not the inner product of two vectors. The RBF kernel is nevertheless not appropriate in every case. In particular, when the number of features is very large, the linear kernel can be used instead [35].
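The contrast between bounded RBF values and growing polynomial values can be checked numerically. The sketch below assumes the common forms exp(-gamma * ||x - y||^2) for the RBF kernel and (gamma * x.y + r)^degree for the polynomial kernel; the function names, default parameters, and sample vectors are illustrative assumptions.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); always lies in the interval (0, 1]."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

def poly_kernel(x, y, gamma=1.0, r=1.0, degree=2):
    """(gamma * x.y + r)^degree; unbounded when the base exceeds 1."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (gamma * dot + r) ** degree

x = [1.0, 2.0]
y = [2.0, 4.0]
print(rbf_kernel(x, x))  # 1.0 for identical samples
print(rbf_kernel(x, y))  # exp(-2.5), about 0.082
for d in (1, 5, 20):
    # The base is 1.0 * 10.0 + 1.0 = 11.0, so the value explodes with the degree
    print(d, poly_kernel(x, y, degree=d))
```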

Polynomial Kernel
The polynomial kernel (poly kernel) is a kernel function often used with support vector machines (SVMs) and other kernelized models. It represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables. The polynomial kernel measures similarity not only on the given features of the input samples, but also on their combinations. In the context of regression analysis, such combinations are known as interaction features. The implicit feature space of the polynomial kernel is the same as that of polynomial regression, but without the combinatorial blow-up in the number of parameters to be learned. If the input features are binary, the features correspond to logical conjunctions of the input features [36]. For degree-d polynomials, the polynomial kernel is defined as

$$K(x, y) = \left(x^{T} y + C\right)^{d}$$

where x and y are vectors in the input space, i.e., feature vectors computed from training or test samples, and C ≥ 0 is a free parameter that trades off the influence of higher-order versus lower-order terms in the polynomial. When C = 0, the kernel is said to be homogeneous.
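The link between the polynomial kernel and interaction features can be illustrated for the homogeneous degree-2 case on two-dimensional inputs. In this sketch, the explicit feature map `phi` and the sample vectors are assumptions introduced only for the demonstration; the kernel reaches the same value without ever building the expanded features.

```python
import math

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel (x.y + c)^d."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + c) ** d

def phi(x):
    """Explicit degree-2 feature map for 2-D input, homogeneous case (c = 0).

    The middle component is the interaction feature x1 * x2.
    """
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

x = [1.0, 2.0]
y = [3.0, 0.5]
implicit = poly_kernel(x, y, c=0.0, d=2)                  # kernel evaluation
explicit = sum(p * q for p, q in zip(phi(x), phi(y)))     # dot product of expanded features
print(implicit, explicit)  # both equal 16 up to rounding
```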

Pearson Universal Kernel
The Pearson universal kernel (PUK) is a flexible kernel function, based on the Pearson VII function, that can be adapted to a wide variety of data types. The Pearson VII function is considered a general form for curve fitting and is given by

$$f(x) = \frac{H}{\left[1 + \left(\frac{2\,(x - x_0)\sqrt{2^{1/\omega} - 1}}{\sigma}\right)^{2}\right]^{\omega}}$$

where H is the peak height at the center x0 of the peak, and x represents the independent variable. The parameters σ and ω control the half-width and the tailing factor of the peak, respectively. To be a valid kernel function, the resulting kernel matrix must be symmetric and positive semi-definite; to show that PUK satisfies these conditions, Uestuen [37] rewrote the Pearson VII function (Equation (3)) as a function of two vectors:

$$K(x_i, x_j) = \frac{1}{\left[1 + \left(\frac{2\sqrt{\lVert x_i - x_j \rVert^{2}}\,\sqrt{2^{1/\omega} - 1}}{\sigma}\right)^{2}\right]^{\omega}}$$
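A minimal sketch of the PUK in its commonly published two-vector form follows. The function name and the default values σ = 1 and ω = 1 are assumptions for illustration. The example also checks the role of σ as a width parameter: at a distance of σ/2 the kernel value is 0.5 for any ω.

```python
import math

def puk_kernel(x, y, sigma=1.0, omega=1.0):
    """Pearson VII universal kernel between two feature vectors."""
    dist = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
    scaled = 2.0 * dist * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega

print(puk_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0 for identical samples
for omega in (0.5, 1.0, 3.0):
    # Points separated by sigma / 2 give a kernel value of 0.5 (up to rounding)
    print(omega, puk_kernel([0.0], [0.5], sigma=1.0, omega=omega))
```

Small ω gives heavy, Lorentzian-like tails, while large ω makes the kernel approach a Gaussian shape.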

Data Catalog and Correlation Analysis
The dataset collected from Kaunda [27] was separated into training (80 percent of the total data) and testing (the remaining 20 percent) datasets for this investigation. The database is presented in detail in Kaunda [27]. Table 1 summarizes the 165-sample dataset, which comprises numerous RFM shear strength tests, and gives the minimum (min), maximum (max), mean, and standard deviation (SD) of the input and output variables. As can be seen in the table, the database includes the input parameters, i.e., particle size (sieve) gradation, fineness modulus, gradation modulus, material hardness, relative density, and confining (normal) stress, and one output parameter, i.e., shear strength.
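For reference, an 80/20 division of 165 samples yields 132 training and 33 testing samples. The sketch below assumes a seeded random shuffle; the source does not state how the samples were assigned, so the function name, the seed, and the use of sample indices are illustrative assumptions.

```python
import random

def train_test_split(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

sample_ids = list(range(165))        # stand-ins for the 165 database rows
train, test = train_test_split(sample_ids)
print(len(train), len(test))         # 132 33
```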
To choose the most robust representation, a statistical study of the input and output variables of the training and testing data was performed using a trial-and-error strategy (see Table 2). Previous studies show that the shear strength (τ) of RFM is a function of the following variables: D10, D30, D60, and D90, which correspond to the 10%, 30%, 60%, and 90% passing sieve sizes; UCSmin and UCSmax, the minimum and maximum uniaxial compressive strengths (MPa); FM and GM, the fineness modulus and gradation modulus, respectively; γ, the dry unit weight (kN/m³); σn, the normal stress (MPa); and R, the International Society of Rock Mechanics (ISRM) hardness rating [27,29]. As a result, the current study's GPR models are constructed using these input variables.

Table 1. The inputs and output of the present study.

Understanding the relationship between each input and the output facilitates the development of a proper prediction model. Among the correlation measures described in the literature, the Pearson correlation coefficient is the most commonly used. As stated in Equation (4), the correlation coefficient equals the covariance of two parameters divided by the product of their standard deviations:

$$\rho_{m,n} = \frac{\operatorname{cov}(m, n)}{\sqrt{\operatorname{var}(m)\,\operatorname{var}(n)}} \tag{4}$$

A value of ρm,n close to 1 in magnitude represents a high degree of linear interdependence between two variables m and n, while ρm,n ≈ 0 indicates that there is no linear relationship between them. The Pearson correlation coefficients for the variable inputs and the target output are reported in Table 3.
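The calculation can be sketched in a few lines. The helper name and the toy values below are assumptions used only to show the limiting cases; they are not taken from the RFM database.

```python
import math

def pearson(m, n):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    mean_m = sum(m) / len(m)
    mean_n = sum(n) / len(n)
    cov = sum((a - mean_m) * (b - mean_n) for a, b in zip(m, n))
    spread_m = math.sqrt(sum((a - mean_m) ** 2 for a in m))
    spread_n = math.sqrt(sum((b - mean_n) ** 2 for b in n))
    # The common 1/len factors of covariance and standard deviations cancel out
    return cov / (spread_m * spread_n)

stress = [0.1, 0.2, 0.4, 0.8]          # illustrative input values
strength = [0.2, 0.4, 0.8, 1.6]        # exactly proportional to the input
print(pearson(stress, strength))       # 1.0 up to rounding: perfect positive relation
print(pearson(stress, strength[::-1])) # negative: the trend is reversed
```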
The values provided in Table 3 reveal that σn and D90 have the most significant influence on τ, while FM, Cu, and UCSmax have little effect on the output.

Table 3. Pearson correlation coefficients for variable inputs and the target output.

Performance Evaluation Measures
The coefficient of determination (R2), Pearson's correlation coefficient (r), mean absolute error (MAE), root mean square error (RMSE), relative absolute error (RAE), and root relative squared error (RRSE) are used to evaluate the data-driven models in this study. In the definitions of these measures, $y_i^{o}$ and $y_i^{p}$ represent the measured and predicted shear strength of RFM, respectively, $\bar{y}^{o}$ is the average of the measured values, and n is the number of data samples.

The degree of collinearity between predicted and measured data is described by the coefficient of determination (R2) and the correlation coefficient (r). The correlation coefficient, which varies from −1 to 1, measures the degree to which observed and predicted data are linearly related. There is no linear relationship if r = 0, and a perfect positive or negative linear relationship if r = 1 or −1. Similarly, R2 denotes the proportion of variance in the measured data that the model can explain. R2 spans from 0 to 1, with higher values indicating less error variance, and values above 0.5 are usually regarded as acceptable [38,39].

The MAE is the average absolute difference between the predicted and actual values. The closer the MAE is to 0, the more accurately the prediction model describes the data [40]. As a single measure of predictive power, the RMSE is the square root of the mean squared prediction error over all observations. The RMSE is greater than or equal to 0, with 0 indicating a perfect fit to the observed data. The RAE is the total absolute error of the model divided by the total absolute error of a simple predictor that always returns the mean of the actual values. It ranges from 0 to infinity, with 0 being the best value. The RRSE is the analogous ratio for squared errors, taken under a square root. Both are expressed as percentages, and a value of 100% or more indicates a model that performs no better than predicting the mean.

Consequently, the lower the values of the error measures (MAE, RMSE, RAE, and RRSE), the better the model. Furthermore, visual inspections, such as scatter plots, were used to compare the performance of the developed models.
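The six measures can be computed as in the following sketch. It assumes the standard definitions (RAE and RRSE relative to a mean predictor, as percentages) and takes R2 as the square of r, which is consistent with the reported values (0.9903 squared is approximately 0.9807). The function name and the toy data are illustrative assumptions, not the study's data.

```python
import math

def evaluate(obs, pred):
    """Return R2, r, MAE, RMSE, RAE (%), and RRSE (%) for measured vs. predicted values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    abs_err = sum(abs(o - p) for o, p in zip(obs, pred))
    sq_err = sum((o - p) ** 2 for o, p in zip(obs, pred))
    abs_dev = sum(abs(o - mean_o) for o in obs)   # error of the mean predictor
    sq_dev = sum((o - mean_o) ** 2 for o in obs)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    r = cov / math.sqrt(sq_dev * sum((p - mean_p) ** 2 for p in pred))
    return {
        "R2": r * r,                               # assumed: squared correlation
        "r": r,
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": 100.0 * abs_err / abs_dev,
        "RRSE": 100.0 * math.sqrt(sq_err / sq_dev),
    }

measured = [0.2, 0.5, 0.9, 1.4]      # illustrative shear strengths (MPa)
predicted = [0.25, 0.45, 1.0, 1.3]
for name, value in evaluate(measured, predicted).items():
    print(f"{name}: {value:.4f}")
```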

Comparative Performance
In this paper, the kernel function-based Gaussian process regression techniques were implemented using the Waikato Environment for Knowledge Analysis (WEKA). WEKA is open-source software that consists of a collection of machine learning algorithms for data mining tasks. Most machine learning algorithms have hyperparameters that must be tuned. The critical hyperparameters tuned in this study for the GPR-RBF, GPR-Poly, and GPR-PUK models are shown in Table 4. Initial values of the tuning parameters were chosen first and then varied over successive trials until the best fitness measures were obtained (Table 4).

Table 5 shows the R2, r, MAE, RMSE, RAE, and RRSE values for shear strength estimation after hyperparameter tuning for the training and testing phases. In both phases, GPR-PUK is the top-ranked model. Compared with the results of Kaunda [27] and Andjelkovic et al. [41] for the test data, the proposed GPR-PUK (R2 = 0.9455) has better prediction capacity. In general, GPR-PUK generalizes well and is reliable, and larger datasets may yield better prediction results. In addition, for the GPR-RBF, GPR-Poly, and GPR-PUK models, the measured (horizontal axis) versus predicted (vertical axis) shear strength is plotted in Figure 2a for the training dataset and Figure 3a for the testing dataset. A comparison of the regression trends in Figures 2a and 3a shows that the GPR-PUK predictions lie closest to the line y = x (R2 = 0.9806 in the training phase).
The accuracy of the developed models, i.e., GPR-RBF, GPR-Poly, and GPR-PUK, in predicting RFM shear strength is illustrated in Figure 2b for the training dataset and Figure 3b for the testing dataset. In these plots, the closer the points lie to the zero-error line (y = 0), the higher the accuracy in both the training and testing datasets. The GPR-PUK model gives the most reliable predictions, as shown by the tighter clustering of its errors around y = 0, except for a few noise points. Compared with the GPR-RBF and GPR-Poly models, the GPR-PUK results are sufficiently consistent for the model to be used to predict RFM shear strength values.

Rank Analysis
Rank analysis assigns score values to statistical parameters, using their ideal values as a benchmark, based on the number of models considered. The model with the best performance receives the highest score, and vice versa. Table 6 shows the overall efficiency ranking of the developed models obtained from the rank analysis (i.e., the summation of the ranks of R2, r, MAE, RMSE, RAE, and RRSE into a single ranking score for the training and testing datasets). Based on the total ranking scores of 12, 24, and 36 (for GPR-RBF, GPR-Poly, and GPR-PUK, respectively), the superiority of GPR-PUK can be concluded, as it has the highest total ranking score. The GPR-Poly model consistently ranked as the second most accurate model. The most significant point about Table 6 is that each of the GPR-RBF, GPR-Poly, and GPR-PUK predictive models achieved the same rank on every measure. GPR-PUK and GPR-RBF, for instance, had the highest (3) and lowest (1) levels of accuracy, respectively, on the R2, r, MAE, RMSE, RAE, and RRSE indices simultaneously.

Table 6. The results of the employed models' rank analysis.

Parameter     GPR-RBF               GPR-Poly              GPR-PUK
              Training   Testing    Training   Testing    Training   Testing
R2            1          1          2          2          3          3
r             1          1          2          2          3          3
MAE           1          1          2          2          3          3
RMSE          1          1          2          2          3          3
RAE           1          1          2          2          3          3
RRSE          1          1          2          2          3          3
Rank Score    6          6          12         12         18         18
Total Ranking Score (Training and Testing)
              12                    24                    36
Total Rank    3                     2                     1
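The scoring scheme can be sketched as follows. The `assign_ranks` helper and the values for the three placeholder models A, B, and C are made up for illustration; only the per-measure ranks in the second half are taken from Table 6.

```python
def assign_ranks(values, higher_is_better):
    """Rank models on one measure: the worst gets 1, the best gets len(values)."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {model: rank for rank, model in enumerate(ordered, start=1)}

# Made-up error values for three placeholder models (lower is better)
example = assign_ranks({"A": 0.30, "B": 0.20, "C": 0.10}, higher_is_better=False)
print(example)  # {'A': 1, 'B': 2, 'C': 3}

# Table 6: each model holds the same rank on all six measures in both phases
table6_rank = {"GPR-RBF": 1, "GPR-Poly": 2, "GPR-PUK": 3}
n_measures, n_phases = 6, 2
totals = {model: rank * n_measures * n_phases for model, rank in table6_rank.items()}
print(totals)   # {'GPR-RBF': 12, 'GPR-Poly': 24, 'GPR-PUK': 36}
```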

Sensitivity Analysis
Yang and Zang's [42] sensitivity analysis was used to assess the impact of the input variables on the shear strength of rockfill material. This method has been used in several research studies [43][44][45][46], and is as follows:

$$r_{ij} = \frac{\sum_{m=1}^{n} \left(y_{im} \times y_{om}\right)}{\sqrt{\sum_{m=1}^{n} y_{im}^{2} \sum_{m=1}^{n} y_{om}^{2}}} \tag{11}$$

where n is the number of data values, and y_im and y_om are the input and output parameters. For each input parameter, the r_ij value varies from zero to one, with higher values indicating a stronger influence of that input on the output parameter (τ in this study). An r_ij value close to 1 indicates a strong relationship between the input and output variables. Figure 4 shows the degree of importance of the input variables based on the experimental (actual) and predicted values of the shear strength. As can be seen, the importance of the parameters can be ordered as σn > D90 > γ > R > FM > UCSmin > D60 > GM > UCSmax > D30 > Cc > Cu > D10. In other words, σn is the most important parameter, and D10 is the least important parameter for predicting the shear strength of the RFMs. Furthermore, Table 3 shows that the normal stress σn has the highest ρ, 0.966, among all parameters, supporting the sensitivity analysis results.
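Equation (11) amounts to the cosine of the angle between the input and output data vectors. A minimal sketch, with a hypothetical function name and toy values that are not from the RFM database:

```python
import math

def cosine_amplitude(inputs, outputs):
    """Strength of relation r_ij between one input variable and the output."""
    numerator = sum(a * b for a, b in zip(inputs, outputs))
    denominator = math.sqrt(sum(a * a for a in inputs) * sum(b * b for b in outputs))
    return numerator / denominator

tau = [0.2, 0.4, 0.8, 1.6]              # illustrative output values
proportional = [0.1, 0.2, 0.4, 0.8]     # input exactly proportional to the output
unrelated = [0.9, 0.1, 0.8, 0.2]        # input with no clear trend
print(cosine_amplitude(proportional, tau))  # 1.0 up to rounding
print(cosine_amplitude(unrelated, tau))     # clearly below 1
```

Because all values are non-negative here, the result stays between zero and one, as stated for r_ij above.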

Taylor Diagram
The Taylor diagram [47] is a simple visual depiction of a model's performance compared with other models. Three indices are represented in the Taylor diagram: the correlation coefficient, the standard deviation, and the root mean square difference (RMSD). The model outcomes are compared in the Taylor diagram displayed in Figure 5 for a more in-depth examination of the results. The diagram, which shows the standard deviation (vertical and horizontal axes), correlation coefficient (radial lines), and RMSD (green circular lines), is a valuable tool for illustrating the accuracy of prediction models. The most accurate model, GPR-PUK (indicated by a pink dot), has a standard deviation similar to that of the actual values, a higher correlation, and a lower RMSD in both the training and testing datasets. Figure 5 shows that the GPR-PUK model is closer to the red dot (actual/reference values) than the GPR-RBF and GPR-Poly models, indicating that it is the most accurate of the three.
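The three indices can share one two-dimensional plot because they are tied together by the law of cosines: the squared centered RMSD equals the sum of the two squared standard deviations minus twice their product times the correlation coefficient. The sketch below checks this identity numerically; the function name and toy data are illustrative assumptions.

```python
import math

def taylor_stats(obs, pred):
    """Return (sd_obs, sd_pred, r, centered RMSD) as used in a Taylor diagram."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred) / n)
    r = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred)) / (n * sd_o * sd_p)
    # Centered RMSD: the means are removed before differencing
    rmsd = math.sqrt(sum(((p - mean_p) - (o - mean_o)) ** 2
                         for o, p in zip(obs, pred)) / n)
    return sd_o, sd_p, r, rmsd

measured = [0.2, 0.5, 0.9, 1.4]      # illustrative values
predicted = [0.25, 0.45, 1.0, 1.3]
sd_o, sd_p, r, rmsd = taylor_stats(measured, predicted)
law_of_cosines = sd_o ** 2 + sd_p ** 2 - 2.0 * sd_o * sd_p * r
print(rmsd ** 2, law_of_cosines)     # the two values agree up to rounding
```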



Summary and Conclusions
In this research, efforts have been made to create various kernel function-based regression models, i.e., GPR-RBF, GPR-Poly, and GPR-PUK, that may be used to predict the shear strength of RFMs. To train and test the developed models, a database from the published literature with different values of the parameters that influence RFMs, including D10, D30, D60, D90, UCSmin, UCSmax, FM, GM, γ, σn, and R, was considered. The data were split into two categories: a training set (80%) and a testing set (20%). The output shear strength (τ) of the developed models was evaluated using statistical parameters, including R2, r, MAE, RMSE, RAE, and RRSE. Furthermore, visual inspection, such as with scatter plots, was also used to assess the effectiveness of the developed models. The applications of these models for predicting the shear strength of RFMs were compared and discussed. The following conclusions are drawn from the results of this study:

1. GPR-PUK achieved an R-squared (R2) of 0.9806, a correlation coefficient (r) of 0.9903, a mean absolute error (MAE) of 0.0646 MPa, a root mean square error (RMSE) of 0.0965 MPa, a relative absolute error (RAE) of 13.0776%, and a root relative squared error (RRSE) of 14.6311% in the training phase. In the testing phase, it also performed well, with R2 = 0.9455, r = 0.9724, MAE = 0.1048 MPa, RMSE = 0.1443 MPa, RAE = 21.8554%, and RRSE = 23.6865%. The GPR-PUK model was found to be more accurate and stable than the other models. Furthermore, the PUK kernel model showed superior agreement with the observed data in the scatter plots of actual and predicted values, indicating that it has the potential for wider applications in the prediction of RFM properties.

2. The results of the sensitivity analysis show that the degree of importance of the input parameters for predicting the shear strength of RFMs is as follows: σn > D90 > γ > R > FM > UCSmin > D60 > GM > UCSmax > D30 > Cc > Cu > D10.

3. The developed PUK kernel model makes predictions as accurate as those made by other soft computing techniques. This research also indicates that these machine learning techniques are a potential approach for estimating basic soil parameters, such as the soil permeability coefficient.
According to this study, GPR-PUK can be used to predict the shear strength of RFMs with high accuracy. The sample size is, however, limited, so this study should be extended to include a larger sample. Furthermore, future studies should evaluate other algorithms, such as XGBoost, evolutionary polynomial regression, and gene expression programming, to assess their effectiveness and gain a comprehensive understanding of the techniques used for predicting the shear strength of RFMs.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data that support the findings of this study are available from the corresponding author upon reasonable request.