Comparison of Six Machine-Learning Methods for Predicting the Tensile Strength (Brazilian) of Evaporitic Rocks

Featured Application: Determination of rock tensile strength (TS) is an important task, especially during the initial design stage of engineering applications such as tunneling, slope stability, and foundation design. Owing to its simplicity, the Brazilian tensile strength (BTS) test is widely used to assess the TS of rocks indirectly. Powerful regularization techniques such as the Elastic Net, Ridge, and Lasso, as well as Keras sequential models based on TensorFlow neural networks, can be successfully used to predict BTS.

Abstract: Rock tensile strength (TS) is an important parameter for the initial design of engineering applications. The Brazilian tensile strength (BTS) test is suggested by the International Society for Rock Mechanics and the American Society for Testing and Materials and is widely used to assess the TS of rocks indirectly. Evaporitic rock blocks were collected from Al Ain city in the United Arab Emirates. Samples were tested, and a database of 48 samples was created. Although previous studies have applied different methods, such as the adaptive neuro-fuzzy inference system and linear regression, for BTS prediction, we are not aware of any study that employed regularization techniques, such as the Elastic Net, Ridge, and Lasso, or Keras-based sequential neural network models. These techniques are powerful feature selection tools that can prevent overfitting to improve model performance and prediction accuracy. In this study, six algorithms, namely, classical best subsets, three regularization techniques, and artificial neural networks with two application programming interfaces (Keras on TensorFlow and neuralnet), were used to determine the best predictive model for the BTS. The models were compared through ten-fold cross-validation. The results revealed that the model based on Keras on TensorFlow outperformed all the other considered models.


Introduction
The TS of a rock is a critical variable for geotechnical, mining, and geological engineering applications such as foundation design, tunneling, slope stability, rock blasting, underground excavation, and mining [1][2][3][4]. Two types of methods, direct and indirect, are available for determining the TS of rocks. The direct methods are difficult, expensive, time-consuming, and require high-quality core samples. Alternatively, the TS of rocks can be estimated using empirical equations [2,[4][5][6][7][8][9]. Indirect methods are preferred because they are simple, economical, and faster at predicting the TS of rocks, and they reduce the burden that direct TS testing places on laboratory facilities. The BTS test suggested by the International Society for Rock Mechanics is widely used, as it is a simple and easy-to-perform test [10]. In addition, various empirical relationships between the BTS and index properties such as the point load index (PLI), Shore hardness index, and Schmidt hammer rebound have been reported.

Samples were inspected to avoid failures due to preexisting veins, macro cracks, and fissures, as such defects can cause measurement bias. Then, 124 test samples were prepared and tested according to the ASTM standards [15]. In addition, index tests, namely, the Id2 and Gs tests [39,40], as well as the BTS test, were performed. The partial test results of this study are listed in Table 1. Id2 (%) values were in the range of 8.42-60.77% with a mean value of 36.42%. BTS was in the range of 1.47-4.39 MPa with a mean value of 2.58 MPa. The Gs values were in the range of 2.06-2.36 with a mean value of 2.16. For the analysis, the BTS was defined as the target variable, and Id2 and Gs were the input variables.

Figure 1. Geological map of the study area and sampling location [4].

Methodology
After the evaporitic rock samples were tested, qualitative and quantitative assessments were conducted. Figure 2 shows the scatterplot and probability plots of the three variables (BTS, Id2, and Gs) along with their correlations. The probability plots show that BTS and Gs were unimodal, whereas Id2 was bimodal with modes at 20.7 and 45.7. Figure 3 shows the normal probability plot of the BTS. An Anderson-Darling normality test was conducted; it produced a p-value of 0.278, indicating no significant deviation of the BTS from normality. Quantitative summary statistics of the data are presented in Table 2; the means of the three variables, BTS, Id2, and Gs, were 2.58, 36.29, and 2.16, respectively, whereas their medians were 2.52, 42.9, and 2.15, respectively. In addition, 95% confidence intervals for their true means are listed in the same table.
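For reference, the 95% confidence intervals reported in Table 2 can be reproduced with one-sample t-intervals. The following is a minimal R sketch, assuming the 48 observations are held in a data frame named rock_data (an illustrative name, not from the original study):

```r
# Hedged sketch: 95% t-based confidence intervals for the true means
# of the three variables, as reported in Table 2.
sapply(rock_data[, c("BTS", "Id2", "Gs")],
       function(v) t.test(v, conf.level = 0.95)$conf.int)
```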

Model Development
The aim of this study was to compare the effectiveness of four regression mo a machine-learning setup and ANN models to explain and predict BTS. Statistic tionships between the response variable, BTS, and Id2 and Gs were established to es the BTS. The predictive performances of multiple linear regression (MLR), ANN panelized regression models, including Ridge regression, Lasso, and Elastic Ne compared.

A. ANN
ANN is one of the most commonly used supervised machine-learning me These computational models have been applied to a variety of problems in many ANN comprise three main parts: input layers, hidden layers, and an output laye structure of an ANN plays a major role in determining its performance [41]: the ch the number of hidden layers and neurons is crucial. Many software packages, inc deepnet, neuralnet, mxnet, h2o, keras, and tensorflow, implement ANN. In this two of the most commonly used packages in R, namely, Neural Net and Keras on T Flow were employed. A Keras sequential model with two hidden layers with thr two neurons respectively, was found to be the optimal ANN model. Details abo limiting number of hidden layers and neurons that can be used for any given set o layers are available in the literature [42][43][44][45][46].

B. Regularization
Ridge, Lasso, and Elastic Net belong to a family of regression techniques that u norm and L2-norm regularization penalty terms; a tuning parameter λ contr strengths of these penalty terms. These techniques were used as an alternative to t

Model Development
The aim of this study was to compare the effectiveness of four regression models in a machine-learning setup and two ANN models in explaining and predicting BTS. Statistical relationships between the response variable, BTS, and the predictors Id2 and Gs were established to estimate the BTS. The predictive performances of multiple linear regression (MLR), ANNs, and penalized regression models, including Ridge regression, Lasso, and Elastic Net, were compared.

A. ANN
ANN is one of the most commonly used supervised machine-learning methods. These computational models have been applied to a variety of problems in many fields. ANNs comprise three main parts: an input layer, hidden layers, and an output layer. The structure of an ANN plays a major role in determining its performance [41]: the choice of the number of hidden layers and neurons is crucial. Many software packages, including deepnet, neuralnet, mxnet, h2o, keras, and tensorflow, implement ANNs. In this study, two of the most commonly used packages in R, namely, neuralnet and Keras on TensorFlow, were employed. A Keras sequential model with two hidden layers containing three and two neurons, respectively, was found to be the optimal ANN model. Details about the limiting number of hidden layers and neurons that can be used for any given set of input layers are available in the literature [42][43][44][45][46].
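As an illustration, the selected architecture (two inputs, hidden layers of three and two ReLU neurons, and a single linear output) can be specified with the keras R package roughly as follows. This is a sketch, not the authors' exact code; the object name model is a placeholder:

```r
library(keras)

# Two-hidden-layer sequential model: 3 and 2 ReLU neurons, linear output.
# With 2 inputs this gives 9 + 8 + 3 = 20 trainable parameters.
model <- keras_model_sequential() %>%
  layer_dense(units = 3, activation = "relu", input_shape = 2) %>%
  layer_dense(units = 2, activation = "relu") %>%
  layer_dense(units = 1)
```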

B. Regularization
Ridge, Lasso, and Elastic Net belong to a family of regression techniques that use L1-norm and L2-norm regularization penalty terms; a tuning parameter λ controls the strengths of these penalty terms. These techniques were used as an alternative to best subsets. Ridge regression was introduced by [47,48] to improve the prediction accuracy of the regression model by minimizing the following loss function:

$$L_{Ridge}(\beta) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

If λ = 0, the resulting estimates are the ordinary least squares estimates of the MLR. In Ridge regression, the L2-norm penalty term shrinks the regression coefficients toward nonzero values to prevent overfitting, but it does not perform feature selection.
Lasso regression originated in the field of geophysics in 1986 and was later popularized in statistics in 1996 [49][50][51][52]. Lasso performs both feature selection and regularization to improve prediction accuracy. It combats multicollinearity by selecting the most important predictor from any group of highly correlated independent variables and removing all the others. The L1-norm penalty term shrinks the regression coefficients, some of them to exactly zero, thereby guaranteeing the selection of the most important explanatory variables. Another advantage of Lasso is that if a dataset of size n is fitted to a regression model with p parameters and p > n, the Lasso model can select at most n parameters [53]. The following loss function is minimized to obtain the regression estimates:

$$L_{Lasso}(\beta) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}|\beta_j|$$

Elastic Net is a variant of Ridge and Lasso and was introduced by [54]; its penalty term is a mixture of the Ridge and Lasso penalty terms, giving the following loss function:

$$L_{ENet}(\beta) = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\left(\alpha\sum_{j=1}^{p}\beta_j^2 + (1-\alpha)\sum_{j=1}^{p}|\beta_j|\right)$$

where 0 ≤ α ≤ 1; α = 0 yields Lasso, whereas α = 1 yields Ridge regression [54]. Some of the coefficients can be shrunk as in Ridge, and some can be set to zero as in Lasso.
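In R, these three penalties are available through the glmnet package. The following is a minimal sketch, assuming a training data frame named train with columns BTS, Id2, and Gs (placeholder names). Note that glmnet parameterizes the mixing in the opposite direction to the formulation above: in glmnet, alpha = 1 is Lasso and alpha = 0 is Ridge.

```r
library(glmnet)

# Design matrix for the predictors; train is a placeholder name.
x <- model.matrix(BTS ~ Id2 + Gs, data = train)[, -1]
y <- train$BTS

# glmnet convention: alpha = 0 Ridge, alpha = 1 Lasso, 0 < alpha < 1 Elastic Net.
fit_ridge <- glmnet(x, y, alpha = 0)
fit_lasso <- glmnet(x, y, alpha = 1)
fit_enet  <- glmnet(x, y, alpha = 0.5)  # illustrative mixing value

plot(fit_lasso, xvar = "lambda", label = TRUE)  # coefficient paths vs. log(lambda)
```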

Results and Discussion
After the data were collected, they were randomly split into training and test sets at an 80:20 ratio (80% training and 20% testing [55]), and the independent variables in the training data were normalized by subtracting their means and dividing by their standard deviations. In machine learning, normalization in the preprocessing stage transforms the values of each independent variable into z-scores with a mean of zero and unit variance to reduce the variability among the different variables. This normalization method is widely used to improve the convergence of machine-learning algorithms [56][57][58]. After data normalization, cross-validation (CV) techniques were used to choose the best model. Similar to the bootstrap procedure, CV is a resampling method used to validate the performance of a fitted model. In K-fold CV, the data are divided into K subsamples; a (K - 1)/K proportion of the data is used to build the model, and the remaining 1/K proportion is used as a test set. This procedure is repeated K times.
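A minimal R sketch of this preprocessing step follows, again assuming the observations are stored in a data frame named rock_data (a placeholder name):

```r
set.seed(1)  # illustrative seed for reproducibility

n <- nrow(rock_data)
train_idx <- sample(n, size = floor(0.8 * n))  # 80:20 split
train <- rock_data[train_idx, ]
test  <- rock_data[-train_idx, ]

# z-scores computed from the training means and standard deviations,
# then applied unchanged to the test set to avoid information leakage.
mu <- colMeans(train[, c("Id2", "Gs")])
s  <- apply(train[, c("Id2", "Gs")], 2, sd)
train[, c("Id2", "Gs")] <- scale(train[, c("Id2", "Gs")], center = mu, scale = s)
test[, c("Id2", "Gs")]  <- scale(test[, c("Id2", "Gs")],  center = mu, scale = s)
```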
In this study, CV was used to compare the performances of the six competing models and identify the best model for BTS prediction. The root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R 2 ) were used to determine the best model for predicting BTS.
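For reference, these accuracy measures are defined as follows, where $y_i$ are the observed BTS values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the observed values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$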

A. ANN Model
Two R packages, namely, Keras on TensorFlow and neuralnet, were used to build the ANN models. Keras is a high-level neural network application programming interface (API) written in Python, and neuralnet is a well-known ANN package written in R. Keras runs on TensorFlow for the development and implementation of deep-learning models. TensorFlow is an open-source platform for machine learning developed by the Google Brain team. A Keras sequential model with the rectified linear unit (ReLU) activation function and neuralnet were used to determine the best model for predicting BTS.
The loss function (MSE), number of epochs, batch size, and learning rate are the training parameters of the ANN sequential (ANNS) model. The number of epochs indicates how many times the dataset is passed through the network. The best ANNS model identified by the accuracy measurements had a learning rate of 0.01, two hidden layers with three and two neurons, respectively, 100 epochs, a batch size of 16, and a validation split of 0.20. The model had 20 parameters: 9 for the first hidden layer (2 inputs × 3 neurons plus 3 biases), 8 for the second hidden layer (3 × 2 plus 2 biases), and 3 for the output layer (2 weights plus 1 bias). The model trained very well on the data, and the training error decreased very sharply, as seen in Figure 4; both MSE and MAE decreased exponentially before 60 epochs and stabilized thereafter.
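Under these settings, the training step might look as follows in the keras R package, continuing the model object sketched earlier. This is a hedged illustration, not the authors' code: x_train and y_train are placeholders for the normalized predictor matrix and BTS targets, and the Adam optimizer is an assumption, as the paper does not state which optimizer was used.

```r
model %>% compile(
  loss = "mse",                                      # training loss
  optimizer = optimizer_adam(learning_rate = 0.01),  # optimizer choice is an assumption
  metrics = "mae"
)

history <- model %>% fit(
  x_train, y_train,        # placeholders: normalized predictors and BTS targets
  epochs = 100,
  batch_size = 16,
  validation_split = 0.20,
  verbose = 0
)
plot(history)  # MSE and MAE per epoch, cf. Figure 4
```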
The best ANN neuralnet (ANNN) model had the same number of hidden layers as the ANNS model. The coefficients of determination (R 2 ) for the two models, ANNN and ANNS, were 62% and 69%, respectively.

B. Regression Model
To determine the best regression model for the normalized training set, we used regression feature selection methods, namely, forward selection, backward elimination, and best subsets. All these methods unanimously selected the second-order regression model with the two explanatory variables Id2 and Gs. All the parameters were highly significant (see Table 3), and the coefficient of determination R 2 and the adjusted R 2 were 51.8% and 50.5%, respectively. Figure 5 shows the predicted values from the interpolated regression model.

The normality test of the residuals is shown in Figure 6. The p-value of the Kolmogorov-Smirnov test exceeded 15%, which clearly shows that there was no deviation from normality. In addition, the variance inflation factor (VIF) was very low (1.14), indicating that no multicollinearity was detected; VIF values exceeding 10 are generally regarded as indicative of multicollinearity.

Figure 7 shows a diagnostic plot of the residuals of the model. The residual plot in Figure 7 shows no pattern of heteroscedasticity (nonconstant error variance). To test for correlation among the residuals, the Durbin-Watson test was performed, and a test statistic of d = 2.50 was obtained. At the 5% significance level, the upper critical value of the test was dU,0.025 = 1.51; the observed value of the test statistic was larger than dU,0.025, and based on these critical values, the errors can be regarded as uncorrelated.
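These diagnostics can be reproduced in R along the following lines. This is a sketch assuming the fitted lm object is named fit; the exact second-order terms of the selected model are not spelled out in the text, so the formula below is illustrative. The car and lmtest packages provide vif() and dwtest().

```r
library(car)     # vif()
library(lmtest)  # dwtest()

# fit is a placeholder for the selected second-order regression model, e.g.:
# fit <- lm(BTS ~ Id2 + Gs + I(Id2^2), data = train)   # illustrative form

ks.test(rstandard(fit), "pnorm")  # normality of standardized residuals, cf. Figure 6
vif(fit)                          # variance inflation factors
dwtest(fit)                       # Durbin-Watson test for residual autocorrelation
plot(fitted(fit), resid(fit))     # residuals vs. fitted values, cf. Figure 7
```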
The family of penalized regression techniques is an alternative to MLR models; examples from this family include Ridge, Lasso, and Elastic Net. Lasso and Elastic Net are feature selection tools as well as predictive modeling techniques. These models apply regularization constraints (λ, α) to the model coefficients and shrink some of them toward zero. To determine the optimal regularization parameter λ for these models, the cv.glmnet and glmnet functions of the glmnet R package were used. These functions fit generalized linear models via penalized maximum likelihood. The Ridge, Lasso, and Elastic Net model paths were fitted using the mean-squared-error CV criterion. The workflow of the methodology of the study is summarized in Figure 8.
First, a sequence of n lambda values was generated, and the training dataset was divided into K = 10 folds. The model was cross-validated using nine subsamples as the training set and the remaining subsample as the test set; each time the MSE was computed, one fold was held out and a different fold was chosen next. The lambda value with the smallest CV MSE was chosen, and the best model was fitted. The upper panels of Figure 9 show the CV MSE as a function of log(λ). The vertical dashed lines in these plots represent the log(λ) value with the minimum MSE and the largest log(λ) value within one standard error of it, for the Ridge and Lasso models, respectively.

The plots in the lower panels of Figure 9 show the shrinking of the coefficients of the Ridge and Lasso models as a function of log(λ). All the coefficients of the Ridge and Lasso models approach zero at log(λ) = 4 and -1.5, respectively, whereas those of the Elastic Net approach zero at log(λ) = 1. For the Elastic Net, the best estimate of the regularization parameter is λ = 0.007, corresponding to α = 0.1.
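A sketch of this λ selection with cv.glmnet, continuing the earlier placeholder objects x and y:

```r
set.seed(1)  # illustrative seed; the CV fold assignment is random

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10,
                    type.measure = "mse")  # Lasso path under 10-fold CV

cv_fit$lambda.min  # lambda with the minimum CV MSE
cv_fit$lambda.1se  # largest lambda within one standard error of the minimum

plot(cv_fit)  # CV MSE vs. log(lambda), cf. the upper panels of Figure 9
```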
The mean squared error of each model was calculated, and the (λ, α) values with the minimum mean squared error were chosen to build the best model. The estimates of these parameters for the four competing regression models, with their root mean square errors (RMSE), are listed in Table 4.
The mean square error of each model was calculated, and the (λ, α) values with the minimum mean square error were chosen to build the best model. The estimates of these parameters for the four competing regression models with their root mean squares (RMSE) are listed in Table 4. The results of the information criteria for the 10-fold CV for the four regression models are listed in Table 5. These results show no apparent differences among the four regressions models; all the accuracy measurement results are close.  Figure 10 shows the performances of the compared models. The R 2 values of the models indicate that the ANNS outperformed all the other models.  Figure 10 shows the performances of the compared models. The R 2 values of the models indicate that the ANNS outperformed all the other models. The results of the accuracy measurements, i.e., RMSE, MAE, and R 2 , are listed in Table 6 for comparing the performances of the six models. The R 2 and MAE values indicate that the ANNS model outperformed all the other fitted models. With regard to RMSE, the Lasso and MLR models have a slight advantage over the other models. Penalized regression methods work very well when the number of explanatory variables is large, whereas ANNS performs best when the sample size is large. The results of the accuracy measurements, i.e., RMSE, MAE, and R 2 , are listed in Table 6 for comparing the performances of the six models. The R 2 and MAE values indicate that the ANNS model outperformed all the other fitted models. With regard to RMSE, the Lasso and MLR models have a slight advantage over the other models. Penalized regression methods work very well when the number of explanatory variables is large, whereas ANNS performs best when the sample size is large.

Conclusions
In this study, six methods, namely, ANNS, ANNN, Ridge regression, Lasso regression, MLR, and Elastic Net regression, were examined to build a model for predicting BTS. Most of these methods perform variable selection, the process by which a reduced number of independent variables is chosen, as well as prediction. Both ordinary least squares and maximum likelihood methods were used to fit the BTS data to these models. These are well-known methods that can provide highly accurate predictions. Their limitations were investigated using 10-fold CV criteria, and the results demonstrated that all the methods are useful and competitive alongside other existing modeling methods. A key limitation of this study is the size of the dataset; however, the samples were averages of large numbers of measurements with unequal lengths. Such cases are common when the cost of extraction is very high or it is difficult to obtain enough samples. The prediction results of the six best models produced by the above techniques were compared using the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R 2 ). Based on the RMSE results, the accuracies of the BTS predictions obtained from all the competing models are very close, but the MAE and R 2 results show that the Keras sequential model outperformed the other competing models. Although this dataset indicated the simplicity and potential superiority of the ANNS model, ANNS is closely adapted to the training data, and exploiting its broad flexibility demands ingenuity in choosing the estimation method to achieve highly accurate predictions. In addition, Elastic Net and Lasso play an important role in studies with small sample sizes and large numbers of parameters. In such cases, these techniques, which are well suited to the analysis of small samples, are the best candidates for modeling and predicting such data.

Funding:
The field and laboratory work of this study was supported by a grant from the United Arab Emirates University, Research Affairs, under the UPAR 2016-31S252 program. The obtained data were used to prepare this paper.