Soil Liquefaction Assessment Using Soft Computing Approaches Based on Capacity Energy Concept

: Soil liquefaction is one of the most complicated phenomena to assess in geotechnical earthquake engineering. The conventional procedures developed to determine the liquefaction potential of sandy soil deposits can be categorized into three main groups: Stress-based, strain-based, and energy-based procedures. The main advantage of the energy-based approach over the remaining two methods is the fact that it considers the e ﬀ ects of strain and stress concurrently unlike the stress or strain-based methods. Several liquefaction evaluation procedures and approaches have been developed relating the capacity energy to the initial soil parameters, such as the relative density, initial e ﬀ ective conﬁning pressure, ﬁne contents, and soil textural properties. In this study, based on the capacity energy database by Baziar et al. (2011), analyses have been carried out on a total of 405 previously published tests using soft computing approaches, including Ridge, Lasso & LassoCV, Random Forest, eXtreme Gradient Boost (XGBoost), and Multivariate Adaptive Regression Splines (MARS) approaches, to assess the capacity energy required to trigger liquefaction in sand and silty sands. The results clearly prove the capability of the proposed models and the capacity energy concept to assess liquefaction resistance of soils. It is also proposed that these approaches should be used as cross-validation against each other. The result shows that the capacity energy is most sensitive to the relative density.


Introduction
Liquefaction is a catastrophic ground failure, which usually occurs in loose saturated soil deposits under earthquake excitations. After the disastrous damage observed during the Niigata and Alaska 1964 earthquakes, the liquefaction phenomenon became the scope of many studies in the field of geotechnical earthquake engineering [1][2][3]. A few systems have been created to assess the liquefaction potential in the field [4]. The accessible assessment systems can be arranged into three main gatherings: (1) Stress-based techniques, (2) strain-based methodology, and (3) energy-based strategies.
The stress-based methodology is the most broadly received strategy for liquefaction evaluation [2,5]. This strategy is predominantly empirical and depends on lab and field studies. The cyclic shear stress real seismic motion to the laboratory harmonic loading conditions, the equivalent stress intensity and the loading cycles must be characterized [2]. Seed et al. [3] suggested the equivalent stress as 65% of the maximum shear stress, while Ishihara and Yasuda [6] proposed 57% for 20 cycles of loading. Some probabilistic frameworks for assessing liquefaction potential of soils based on in situ tests, such as the cone penetration test (CPT) or standard penetration test (SPT), were also carried out [7][8][9][10][11]. Despite the fact that the stress-based procedure has been continuously revised and extended in subsequent studies and the database of liquefaction case histories expanded, the uncertainty concerning random loading still persists [12,13].
Several probabilistic studies have been published that have examined the liquefaction potential of soils dependent on in situ tests, for example, the CPT or SPT [7][8][9][10]. Even though the stress-based approach has been continuously updated and the liquefaction case histories expanded, there are still a number of obstacles to this process, for instance, the uncertainty of random loading [12,13].
Dobry et al. [14] suggested a strain-based approach which originated from the model of two sand grain system and then extend to the actual soil layers. In this model, pore water pressure started to evolve only when the shear strain surpassed a threshold shear strain (around 0.01%), irrespective of sand type, relative density, initial effective confining pressure, and sample preparation method. This strain-based approach is less popular than the stress-based procedure because it is much more difficult to estimate the cyclic strain compared with the cyclic shear stress [15]. Davis and Berrill [16] introduced an energy-based approach for liquefaction potential assessment following the assumption by Nemat-Nasser and Shokooh [17] that the pore-pressure buildup had direct relations with the seismic energy dissipated in the unit volume of soil. The energybased method absorbs the basic of both the stress and strain approach. Total strain energy at the onset of liquefaction can be calculated from laboratory or field tests. Figure 1 shows typical stress-strain relationship of undrained cyclic triaxial test. The hysteretic area enveloped by a thick dashed curve A-B-C-D represents the dissipated energy per unit volume for a kth stress cycle. The total dissipated energy summed up from the start to the onset of liquefaction cycle is [18]: (1) Figure 1. Typical stress-strain relationship during undrained cyclic triaxial test [18].
The summation of the energy describes the capacity of the soil sample against initial liquefaction. The energy-based approach has some main advantages in evaluating the liquefaction possibility of soils [13,19]: (1) Energy associates with both shear stress and shear strain;  [18].
The summation of the energy describes the capacity of the soil sample against initial liquefaction.
Geosciences 2020, 10, 330 3 of 31 The energy-based approach has some main advantages in evaluating the liquefaction possibility of soils [13,19]: (1) Energy associates with both shear stress and shear strain; (2) The dissipated energy correlates very well with the pore pressure buildup, no matter how complicated the stress-strain history is, how many cycles, or how large the applied stress ratio.
(3) Energy has strong correlation with the character of earthquake, for example, the duration and the number of cycles of the seismic wave. Furthermore, it considers the entire spectrum of ground motions, while the stress-based approach uses only the peak value of ground acceleration.
The energy-based liquefaction evaluation methods fall into two main categories: One using earthquake case histories, and the other based on laboratory cyclic shear and centrifuge tests [12]. Many researchers have developed the soil liquefaction prediction model considering the effective mean confining pressure and initial relative density [20][21][22], mainly by performing multiple linear regression (MLR) analysis. These models can obtain high correlation coefficient R values, but generally they are based on a limited number of tests and fail to consider the important role of the fines content.
Besides, Baziar and Jafarian [13] showed that these formulas could not work well in a large data set of various types of sand. Using their updated database, Baziar and Jafarian [13] demonstrated the unacceptable performance of MLR-based relationship and highlighted the need of developing an artificial neural network (ANN)-based model. Moreover, Chen et al. [23] proposed a seismic wave energy-based method with back-propagation neural networks to evaluate the liquefaction probability. Despite the good performance of the ANN-based models, the black-box nature restricts their practical applications.
Baziar et al. [19] suggested a genetic programming (GP) evolutionary approach to predict the capacity energy of liquefiable soils after expanding the database. Alavi and Gandomi [24] improved the GP and provided linear genetic programming (LGP) and multi-expression programming (MEP) to evaluate the liquefaction resistance of sandy soils. Cabalar et al. [25] presented an Adaptive Neuro-Fuzzy Inference System (ANFIS) for the prediction of liquefaction triggering, which possesses the natural language description of fuzzy systems and the learning capability of neural networks.
The multivariate adaptive regression splines (MARS) model can give the explicit expression, is much easier to interpret, and can provide the relative importance of the input variables [26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41]. In this study, based on the capacity energy database by Baziar et al. [19], analyses have been carried out on a total of 405 previously published tests using soft computing techniques, including Ridge, Lasso & Lasso CV, Random Forest, eXtreme Gradient Boost (XGBoost) and MARS approaches, to assess the capacity energy required to trigger liquefaction in sand and silty sands. The capacity energies estimated by these proposed models compare favorably with the testing data sets used for validation purpose. It is also proposed that these approaches should be used as cross-validation against each other.

Ridge Regression Algorithm
The method of Ridge regression, proposed by Hoerl and Kennard [42], has been used extensively for dealing with the problem of multicollinearity in regression. Multicollinearity is the existence of near-linear relationships among the independent variables. The multicollinearity may cause a great error in the ordinary least squares (OLS) method by giving large regression coefficients with unstable signs [43]. The ridge regression method introduces an additional ridge parameter α to control this regression bias. For example, a widely used linear regression model has the form of: In which Y is a n × 1 vector of observations on a response variable. β is a p × 1 vector of regression coefficients to be estimated, X is a matrix of order (n × p) of observations on 'p' regressor variables. ε represents an n × 1 vector of residuals (errors).
In ordinary least squares, the regression coefficient β is estimated using the formula: Whenever the multicollinearity presents in the data, the OLS estimator may create inaccurate estimates of the regression coefficients. The matrix of X T X is singular. The eigenvalues of the matrix are small, which results in very large diagonal elements in the inverse of the matrix X T X and degrades the predictability of the model. A tiny variation in the input X in Equation (2) will lead to great changes in the output Y.
Ridge regression proceeds by adding a small value α to the diagonal elements of matrix X T X. This has the effect of increasing the eigenvalue of the matrix, converting the singular matrix into a non-singular matrix to some extent, and improving the stability of parameter estimation. Theregression coefficient β is solved as: In which α is the Ridge regression parameter. If α = 0, then it becomes the least squares estimate, which is unbiased at this time. It is a biased estimate when α 0. The larger α, the smaller the effect of multicollinearity on the stability of the regression parameters. But as α increases, the variance of the prediction also increases. Though a groups of estimators result in biased, for certain value of α, they yield a minimum mean squared error (MMSE) compared to the OLS estimator [42,44]. Hoerl and Kennard [45] suggested using a graphic which they called the Ridge trace to choose a proper α. In this study, the Spearman rank correlation coefficient was used to calculate the correlation coefficient of each two input variable, as shown in Figure 2. The results showed that correlations between each feature variable were not significant, which means that the input data did not have multivariate collinearity and were promising for modeling. The input feature importance of a Ridge model can be evaluated through the weight coefficient using the L 2 -norm based sparse linear model.
Whenever the multicollinearity presents in the data, the OLS estimator may create inaccurate estimates of the regression coefficients. The matrix of X T X is singular. The eigenvalues of the matrix are small, which results in very large diagonal elements in the inverse of the matrix X T X and degrades the predictability of the model. A tiny variation in the input X in Equation (2) will lead to great changes in the output Y.
Ridge regression proceeds by adding a small value α to the diagonal elements of matrix X T X. This has the effect of increasing the eigenvalue of the matrix, converting the singular matrix into a non-singular matrix to some extent, and improving the stability of parameter estimation. The regression coefficient β is solved as: In which α is the Ridge regression parameter. If α = 0, then it becomes the least squares estimate, which is unbiased at this time. It is a biased estimate when α ≠ 0. The larger α, the smaller the effect of multicollinearity on the stability of the regression parameters. But as α increases, the variance of the prediction also increases. Though a groups of estimators result in biased, for certain value of α, they yield a minimum mean squared error (MMSE) compared to the OLS estimator [42,44]. Hoerl and Kennard [45] suggested using a graphic which they called the Ridge trace to choose a proper α . In this study, the Spearman rank correlation coefficient was used to calculate the correlation coefficient of each two input variable, as shown in Figure 2. The results showed that correlations between each feature variable were not significant, which means that the input data did not have multivariate collinearity and were promising for modeling. The input feature importance of a Ridge model can be evaluated through the weight coefficient using the L2-norm based sparse linear model.

Lasso and LassoCV Algorithm
The Least Absolute Shrinkage and Selection Operator (Lasso), proposed by Tibshirani [46], is a convenient and frequently used algorithm for variable selection and prediction in linear regression problems. Lasso improves the model fitting process by choosing only a subset of the initial input variables for using in the final model rather than using all of them [46,47]. The main task of Lasso estimate is the solution of the L 1 optimization problem: In which t is the upper limit of the sum of the coefficients. This optimization equals to the parameter estimation as:β A larger value of λ will lead to greater variable shrinkage. In addition, because of the bias-variance tradeoff, variable shrinkage can improve prediction accuracy in a certain level. The work of Donoho, Johnstone, Kerkyacharian, and Picard [48] shown that the lasso shrinkage had the optimality which can approach the lowest of a set of maximum values when using orthogonal predictors.
Research also proved that with proper operation, the L 1 method can find suitable models with sparse coefficients [49][50][51]. The work of Meinshausen and Bühlmann [52] gave the conditions for the stability of variable selection with the lasso.
Nevertheless, it is almost impossible to get independent test data because of the lack of data. A practical solution to this problem is the cross-validation method, which divides the available data into several groups and uses each data group as an independent test data set. The LassoCV is a combination of the original Lasso algorithm with the cross-validation method to deal with the scarcity of data.
Generally speaking, the procedures to perform an n-fold cross-validation are as follows: 1. Randomly divide the existing collected data into n equal or approximately equal groups (folds), e.g., divide the data by 10-fold.
2. For the n th fold, use the other n-1 folds as a training set to fit each candidate model. 3. Use the n th fold as testing set to evaluate the performance index of each candidate model fitted in step 2.
4. Repeat steps 2-3 to assess the performance index for each candidate model. 5. Calculate the average performance index for each candidate model over the n folds. 6. The candidate model with the best average performance index is selected as the final model, and naturally, its corresponding value of λ is the optimalλ.
The input feature importance of a Lasso model can be evaluated through the weight coefficient using the L 1 -norm based sparse linear model or using the approach known as analysis of variance (ANOVA) decomposition.

Random Forest Regression Algorithm
The Random Forest regression (RF), proposed by Breiman [53], is an ensemble method which gives the prediction based on the average of a series of decision trees trained from the supervised data. The word 'forest' means this method uses multiple decision trees for prediction. The word 'random' represents the training sample sets, and the splitting features are chosen randomly. Because of the introduction of these two random variables, the RF has the following advantages: (1) Robustness to the noise, (2) ability to handle discrete and continues data and no need for regularization, Geosciences 2020, 10, 330 6 of 31 (3) good performance with large features and no dimensionality reduction, and (4) lower reliance on cross-validation according to some studies [54].
Breiman [55] first introduced the concept of Decision Tree, which is also called Classification and Regression Tree (CART). A decision tree is a nonparametric model which does not need to assume any prior classification parameters or the tree structure. A decision tree is composed of decision nodes and leaf nodes. Figure 3 illustrates the process of RF classification with N trees. Starts from the root node ν n , samples are passed to the right node (ν R ) or the left node (ν L ) after comparing with some criteria or threshold values. Repeat this partition until reaching a terminal node and get a classification label (classes A or B in this case). For classification problem, the ensemble prediction is obtained as a combination of the results of the individual trees by majority voting rule [56].
Geosciences 2020, 10, x FOR PEER REVIEW 6 of 33 of the introduction of these two random variables, the RF has the following advantages: (1) Robustness to the noise, (2) ability to handle discrete and continues data and no need for regularization, (3) good performance with large features and no dimensionality reduction, and (4) lower reliance on cross-validation according to some studies [54]. Breiman [55] first introduced the concept of Decision Tree, which is also called Classification and Regression Tree (CART). A decision tree is a nonparametric model which does not need to assume any prior classification parameters or the tree structure. A decision tree is composed of decision nodes and leaf nodes. Figure 3 illustrates the process of RF classification with N trees. Starts from the root node νn, samples are passed to the right node (νR) or the left node (νL) after comparing with some criteria or threshold values. Repeat this partition until reaching a terminal node and get a classification label (classes A or B in this case). For classification problem, the ensemble prediction is obtained as a combination of the results of the individual trees by majority voting rule [56]. RF combines all the generated decision trees by a method called 'bagging' or 'bootstrap aggregation' to decrease the variance of model. The main idea of bagging/ bootstrap is to train several different models from randomly sampling feature subset or training data subset separately, and then let all models vote on the output of the test samples. Bootstrap is a sampling method with replacement. For acquiring a bootstrap sample, it randomly selects a subset of observations from the complementary set Sn. Therefore, the probability of each observation to be picked is 1/n.
The ensemble prediction given by entire forest is obtained as a combination of the results from k trees. Therefore, for regression problem, the estimation of Y can be calculated as:

Majority Voting
Final Class ν n ν L ν R RF combines all the generated decision trees by a method called 'bagging' or 'bootstrap aggregation' to decrease the variance of model. The main idea of bagging/ bootstrap is to train several different models from randomly sampling feature subset or training data subset separately, and then let all models vote on the output of the test samples. Bootstrap is a sampling method with replacement. For acquiring a bootstrap sample, it randomly selects a subset of observations from the complementary set S n . Therefore, the probability of each observation to be picked is 1/n. First, RF uses bagging algorithm to generate several bootstrap samples S n are constructed and the RF is built from these collection of trees. The predictions corresponding to each tree from the previous tree decision algorithm arê The ensemble prediction given by entire forest is obtained as a combination of the results from k trees. Therefore, for regression problem, the estimation ofŶ can be calculated as:Ŷ whereŶ is the prediction of i-th tree, and i = 1, 2, . . . , k. The framework of using RF regression for prediction is illustrated in Figure 4. where Ŷ is the prediction of i-th tree, and i= 1, 2, …, k. The framework of using RF regression for prediction is illustrated in Figure 4. Usually, RF uses the Residual Sum Squares (RSS=∑ − ) as split criteria in regression. The optimization process tries to find the split that maximizes the RSS between-groups in an analysis of variance [57]. The relative importance of input features of a RF model can be evaluated by collecting each feature's contribution from the prediction trees in the model. A higher value of this metric, compared to another, implies greater importance of such feature for generating a prediction.

XGBoost Algorithm
XGBoost (eXtreme Gradient Boost), proposed by Chen et al. [58], is an integrated machine learning algorithm based on decision tree. Recently, XGBoost has been one of the most widely used methods in Kaggle's data science contest. It possesses many outstanding properties, including high efficiency and promising prediction accuracy [40,41]. Essentially, XGBoost can be regarded as an improved GBDT (Gradient Boosting Decision Tree) algorithm, because they are all composed of multiple decision trees. Nowadays, XGBoost is widely used in many areas, such as data science competitions and industry, because it has the following advantages: (1) Many strategies are used to prevent overfitting, including normalization in the objective function, shrinkage and column subsampling.
(2) Second-order Taylor expansion is applied to the loss function, which can make the gradient convergence faster and more accurate.
(3) The parallel computing is implemented in the process of node splitting, which improves the speed of calculation effectively.

Selected Features
... The optimization process tries to find the split that maximizes the RSS between-groups in an analysis of variance [57]. The relative importance of input features of a RF model can be evaluated by collecting each feature's contribution from the prediction trees in the model. A higher value of this metric, compared to another, implies greater importance of such feature for generating a prediction.

XGBoost Algorithm
XGBoost (eXtreme Gradient Boost), proposed by Chen et al. [58], is an integrated machine learning algorithm based on decision tree. Recently, XGBoost has been one of the most widely used methods in Kaggle's data science contest. It possesses many outstanding properties, including high efficiency and promising prediction accuracy [40,41]. Essentially, XGBoost can be regarded as an improved GBDT (Gradient Boosting Decision Tree) algorithm, because they are all composed of multiple decision trees. Nowadays, XGBoost is widely used in many areas, such as data science competitions and industry, because it has the following advantages: (1) Many strategies are used to prevent overfitting, including normalization in the objective function, shrinkage and column subsampling.
(2) Second-order Taylor expansion is applied to the loss function, which can make the gradient convergence faster and more accurate.
(3) The parallel computing is implemented in the process of node splitting, which improves the speed of calculation effectively.
(4) The techniques for handling of missing/sparse data, cross-validation and early stop in building trees, all of which can contribute to higher speed and accuracy. The schematic of XGBoost algorithm is illustrated in Figure 5. XGBoost constructs decision trees one after one so that each successor tree is built to minimize errors in the previous tree. Therefore, the next tree in the sequence will try to decrease the updated residual from the previous tree. Considering a model with k trees finishes training in the dataset S n = (x i , y i ) (i = 1, 2, . . . , n), the outputŷ i of the model should be:ŷ Geosciences 2020, 10, x FOR PEER REVIEW 8 of 33 (4) The techniques for handling of missing/sparse data, cross-validation and early stop in building trees, all of which can contribute to higher speed and accuracy.
The schematic of XGBoost algorithm is illustrated in Figure 5. XGBoost constructs decision trees one after one so that each successor tree is built to minimize errors in the previous tree. Therefore, the next tree in the sequence will try to decrease the updated residual from the previous tree.
In which f(x) is a regression tree, K is the total number of trees, k represents the k th tree, xi is the input variable corresponding to sample i, ŷ i corresponds to the predicted score from this tree, and F is the space of regression trees. XGBoost's objective function includes two parts: the training error and the regularization: where L(.) is the loss function quantifying the deviation of the predicted values from the actual values. Ω (fk) is the regularization function penalizes the complexity of the model from overfitting: In which T represents the total number of leaf nodes, and j ω is the score of each leaf node. γ and λ are the controlling factors employed to avoid overfitting. γ indicates the complexity In which f (x) is a regression tree, K is the total number of trees, k represents the k th tree, x i is the input variable corresponding to sample i,ŷ i corresponds to the predicted score from this tree, and F is the space of regression trees.
XGBoost's objective function includes two parts: the training error and the regularization: where L(.) is the loss function quantifying the deviation of the predicted values from the actual values. Ω (f k ) is the regularization function penalizes the complexity of the model from overfitting : In which T represents the total number of leaf nodes, and ω j is the score of each leaf node. γ and λ are the controlling factors employed to avoid overfitting. γ indicates the complexity cost by introducing additional leaf.
T j=1 ω 2 j evaluates the performance of a decision tree. Therefore, this objective function tends to choose a simple predictive function from candidate models.
When a new tree is created to fit residual errors of last iteration, the prediction for the t th tree can be expressed as:ŷ In whichŷ (t) i is the prediction of the i th instance at the t th iteration. Therefore, the objective function can be expressed as: Then, the objective function is approximated by the second-order Taylor expansion as: where g i and h i are first and second order gradient statistics of the loss function, respectively: Because the previous (t−1) trees' residual errors have minimal influence on the modification of the objective function, an approximation in step t can be: Then, the t th tree concerning the model parameters and predictions can be determined by optimization of Equation (16). In the same way, the successor trees are built until the whole XGBoost is constructed satisfying predefined finishing standards [58]. The input feature importance of a XGBoost model can be evaluated through the gain criterion. The gain is obtained by collecting each feature's contribution from each tree in the model. A feature with higher value indicates greater relative contribution and importance.

MARS Methodology
Friedman [59] presented a Multivariate Adaptive Regression Splines model (MARS) as a flexible regression method for modeling of high dimensional data. MARS has become particularly popular in the area of data mining because it does not assume any particular type or class of relationship between the predictor variables and the outcome variable. MARS partitions the training data sets into continuous regions and tries to fit the data in each region with separate piecewise linear segments (splines) of differing gradients (slope). The endpoints of the segments are called knots. A knot marks the end of one region of data and the beginning of another. The resulting piecewise linear segments (known as basis functions), give greater flexibility to the model, allowing for bends, thresholds, and other departures from linear functions.
Let y be the target output and X = (X 1 , . . . , X P ) be a matrix of P input variables. In case of a continuous response, this would be: In which e is the distribution of the error. MARS approximates the function f by applying basis functions (BFs). BFs are splines (smooth polynomials), including piecewise linear and piecewise cubic functions. For simplicity, only the piecewise linear function is expressed here. Piecewise linear functions in MARS are of the form with a knot occurring at value a as: The MARS model f (X) is constructed as a linear combination of BFs as: where each λ m (X) is a basis function. It can be a spline function, or the product of two or more spline functions already contained in the model. The coefficients β are constants estimated using the least-squares method. Figure 6 shows a simple example of how MARS would use piecewise linear spline functions to attempt to fit data. The MARS mathematical equation is expressed as: where BF1 = max(0, 16−x), BF2 = max(0, x−11), BF3 = max(0, x−6.5), and BF4 = max(0,6.5−x). The knots are located at x = 6.5, 11 and 16. They define four intervals with different linear relationships.
Geosciences 2020, 10, x FOR PEER REVIEW 10 of 33 ( ) ( ) 1 ,..., P y f X X e f X e = + = + (17) In which e is the distribution of the error. MARS approximates the function f by applying basis functions (BFs). BFs are splines (smooth polynomials), including piecewise linear and piecewise cubic functions. For simplicity, only the piecewise linear function is expressed here. Piecewise linear functions in MARS are of the form with a knot occurring at value a as: The MARS model f(X) is constructed as a linear combination of BFs as: where each λm (X) is a basis function. It can be a spline function, or the product of two or more spline functions already contained in the model. The coefficients β are constants estimated using the leastsquares method. Figure 6 shows a simple example of how MARS would use piecewise linear spline functions to attempt to fit data. The MARS mathematical equation is expressed as: where BF1 = max(0, 16−x), BF2 = max(0, x−11), BF3 = max(0, x−6.5), and BF4 = max(0,6.5−x). The knots are located at x = 6.5, 11 and 16. They define four intervals with different linear relationships. In general, MARS follows three basic steps to fit the model in Equation (19) In which the coefficients β can be obtained through the method of least squares. Stop adding BFs when the model has more terms than a predefined number in order to construct an overfit model intentionally. In general, MARS follows three basic steps to fit the model in Equation (19): (i) Constructive phase (forward phase). The intercept β 0 and the basis functions that generate the maximum training error decrease are added to the Model. For a current model with M basis functions, the following terms to be added can take the form of: In which the coefficients β can be obtained through the method of least squares. Stop adding BFs when the model has more terms than a predefined number in order to construct an overfit model intentionally.
(ii) Pruning phase (backward phase). The least significant BFs that does not meet the standard of contribution is removed to find an optimal submodel. Candidate models are compared by the computationally effective method of Generalized Cross-Validation (GCV). GCV reduces the chance of overfitting by penalizing the model with large numbers of BFs. Hastie et al. [60] gave the formula to calculate the GCV for a MARS model as: In which N is the number of observations in the training set, M is the number of BFs, andŷ i denotes the predicted values of the MARS model. The numerator is the mean squared error (MSE) of a candidate model. The denominator takes into account the larger uncertainty with increasing model complexity. d is the penalizing parameter with a default value of 3 [59]. The GCV can penalize both the error of the model, the number of BFs and number of knots. At each step in the pruning phase, an extraneous BF is removed to minimize the error of the model until a satisfactory model is gained.
As long as the final MARS model is constructed, the method known as analysis of variance (ANOVA) decomposition (Friedman, 1991) [59] can be utilized to evaluate the contributions from the input variables and the BFs through comparing (testing) variables for statistical significance. This is done by collecting together all the BFs that involve one variable and another grouping of BFs that involve pairwise interactions.

Performance Measures
Assessments of the performance of models are evaluated according to the performance measures. In the following equations, y is the mean of the target values of y i , Y is the mean of the predicted Y i , and N is the total number of data points in the training set or testing set.
The root mean square error (RMSE) evaluates the errors between prediction values and true values as [66]: The mean absolute error (MAE) is the average of absolute errors of predictions, calculated as: The coefficient of determination R 2 (R2) measures the deviation of a group of data and characterizes the goodness of regression fitting [67]:

The Database
The database used for modeling comprises of 271 cyclic triaxial tests, 116 cyclic torsional shear tests, and 18 cyclic simple shear tests giving a total of 405 tests. A summary of the laboratory tests, as well as the parameter statistics, are listed in Table 1. The specific details of the 405 tests, including the test type, values of parameters, and the failure mode, can be found in the studies by Jafarian (2007, 2011) [13,19]. Of the 405 data sets, 324 (approximately 80% of the total data sets) were randomly selected as the training patterns while the remaining 81 were used for testing purposes. As the previously proposed regression and soft computing models did not indicate the specific information of the training and testing patterns, for performance comparison, the criterion of data pattern selection used in this study was based on ensuring that the statistical properties, including the mean and standard deviations of the training and testing subsets, were similar to each other.

Ridge Modeling Results
The ridge regression parameter α has strong influence on the performance of the model. Figures 7 and 8 shows the variation of performance measures with α. It was found that when α was between 0.1 and 0.001, the RMSE, MAE, and coefficient of determination curves remained almost constant in both the training and testing set. Therefore, a ridge regression parameter α of 0.01 was picked finally.
Geosciences 2020, 10, x FOR PEER REVIEW 13 of 33 feature variable, followed by D50 and mean σ′ . FC had a very small relative importance index (4.42%), and Cu had a negative importance index of −10.1%.      The comparison between the true value and Ridge model predicted value of Log(W) is shown in Figure 9. The main part of the estimations of Log(W) were within ±10% of the target values, but quite a lot of points were still outside these two limit lines. The relative errors (defined as the ratio of the difference between the Ridge predicted and the target Log(W) values divided by the target value, in percentage) for the training and testing patterns are plotted in Figure 10. It can be seen that some relative errors were above 20%. The RMSE, MAE, and R 2 scores were 0.271, 0.221, and 0.582 for the training set, with the corresponding scores of 0.258, 0.201, and 0.382 for the testing set. The results show that this model could not provide satisfactory predictions, especially in the testing set.      Figure 11 gives the plots of the relative importance of the input variables for the Ridge model, which was evaluated by the regression coefficient. For simplicity, percentage was used to sort the importance scores of the five variables from high to low. It can be seen that Dr was the most important feature variable, followed by D50 and σ mean . FC had a very small relative importance index (4.42%), and Cu had a negative importance index of −10.1%.
Geosciences 2020, 10, x FOR PEER REVIEW 15 of 33 Figure 11. Relative importance of the input variables for the developed Ridge model.

Lasso and LassoCV Analysis
The performance of Lasso model greatly depended on the parameter λ in Equation (6). Figures 12  and 13 shows the variation of performance measures with λ. It can be found that the RMSE, MAE, and R 2 curve become constant when λ was smaller than 0.01. The LassoCV algorithm automatically picked 0.001053 as an optimal value of λ. The true value and LassoCV model predicted value of Log(W) are compared in Figure 14. The main part of the estimations of Log(W) were within ±10% of the target values, but quite a few data points were still located outside the limit lines. The relative errors for the training and testing patterns are plotted in Figure 15. It can be seen that some relative errors were above 20%.
It should be noticed that the RMSE, MAE, and R 2 scores were 0.275, 0.225, and 0.578 for the training set, with the corresponding scores of 0.255, 0.201, and 0.581 for the testing set. The results show that this model still could not produce reasonable predictions. Its RMSE and MAE were almost the same as that of the Ridge model, although with a R 2 score 51.3% higher than that of the Ridge model. This again proves the insufficiency of this linear regression type algorithm in modeling the complex problem of soil liquefaction.
The relative importance of the input variables for Lasso & LassoCV are shown in Figure 16. It can be seen that Dr was the most important feature variable, followed by D50 and mean σ′ . FC was automatically removed by the LassoCV algorithm by having a zero-regression coefficient. Cu had a negative importance index of −2.2%. Figure 11. Relative importance of the input variables for the developed Ridge model.

Lasso and LassoCV Analysis
The performance of Lasso model greatly depended on the parameter λ in Equation (6). Figures 12 and 13 shows the variation of performance measures with λ. It can be found that the RMSE, MAE, and R 2 curve become constant when λ was smaller than 0.01. The LassoCV algorithm automatically picked 0.001053 as an optimal value of λ. The true value and LassoCV model predicted value of Log(W) are compared in Figure 14. The main part of the estimations of Log(W) were within ±10% of the target values, but quite a few data points were still located outside the limit lines. The relative errors for the training and testing patterns are plotted in Figure 15. It can be seen that some relative errors were above 20%.        show that this model still could not produce reasonable predictions. Its RMSE and MAE were almost the same as that of the Ridge model, although with a R 2 score 51.3% higher than that of the Ridge model. This again proves the insufficiency of this linear regression type algorithm in modeling the complex problem of soil liquefaction.
The relative importance of the input variables for Lasso & LassoCV are shown in Figure 16. It can be seen that Dr was the most important feature variable, followed by D 50 and σ mean . FC was automatically removed by the LassoCV algorithm by having a zero-regression coefficient. Cu had a negative importance index of −2.2%.

RF Modeling Result
The parameter nestimators is related to the number of decision trees in the RF model. Figures 17 and  18 give the RMSE, MAE, and R 2 change curves of RF model with parameter nestimators. It can be seen that when nestimators was greater than 100, the curves of performance measures had only minor fluctuations.
The other two important parameters for the RF model are min_samples_split and min_samples_leaf. The min_samples_split limits the conditions for the subtree to continue to be

RF Modeling Result
The parameter n estimators is related to the number of decision trees in the RF model. Figures 17 and 18 give the RMSE, MAE, and R 2 change curves of RF model with parameter n estimators . It can be seen that when n estimators was greater than 100, the curves of performance measures had only minor fluctuations.     The other two important parameters for the RF model are min_samples_split and min_samples_leaf. The min_samples_split limits the conditions for the subtree to continue to be partitioned. If the number of samples in a node is less than min_samples_split, it will not continue to try to select the optimal feature to partition. The min_samples_leaf limits the minimum number of samples of leaf nodes. If the number of leaf nodes is smaller than the number of samples, it will be pruned together with the sibling nodes. It can be found that the R 2 score reaches the maximum value when min_samples_split equals 2 and min_samples_split equals 1, as shown in Figure 19.   With these optimal pamaters, the comparison between true value and RF model predicted value of Log(W) is shown in Figure 20. It can be seen that almost all the data points of the training set were located in the 100% agreement line, and majority of the testing set data were within ±10% of the target values. The relative errors for the training and testing patterns are plotted in Figure 21. Almost all of the RF estimations of the training set data fell within ± 5% of the target values, and most of the predictions of the testing data were within −5% to +10% of the target values.

XGBoost Modeling Result
In the XGBoost model, the most important parameter is nestimators, which controls the number of decision trees. Figures 23 and 24 give the RMSE, MAE, and R 2 change curves of XGBoost model with parameter nestimators. It can be seen that for the training set, when nestimators increased, RMSE and MAE decreased and the R 2 score increased. However, for the testing set, when nestimators was greater than 200, the curves of performance measures had only minor fluctuations. The optimal nestimators was chosen as 352. Figure 25 shows the comparison between true value and XGBoost model predicted value of Log(W). It can be seen that almost all the data points of the training set were located in the 100% agreement line, and majority of the testing set data were within ± 10% of the target values. The relative errors for the training and testing patterns are plotted in Figure 26. Almost all of the XGBoost estimations of the training set data fell within ± 5% of the target values, and most of the predictions of the testing data were within −5% to +10% of the target values. It should be noticed that the RMSE, MAE, and R 2 scores were 0.075, 0.052, and 0.97 for the training set, with the corresponding scores of 0.175, 0.124, and 0.82 for the testing set. The lower RMSE/MAE and large R 2 scores indicate a higher fitting performance than the Ridge model and Lasso model.
The relative importance of the input variables for RF model is plotted in Figure 22. It can be seen that Cu and Dr were the most important two feature variables, followed by σ mean , D 50 , and FC.

XGBoost Modeling Result
In the XGBoost model, the most important parameter is nestimators, which controls the number of decision trees. Figures 23 and 24 give the RMSE, MAE, and R 2 change curves of XGBoost model with parameter nestimators. It can be seen that for the training set, when nestimators increased, RMSE and MAE decreased and the R 2 score increased. However, for the testing set, when nestimators was greater than 200, the curves of performance measures had only minor fluctuations. The optimal nestimators was chosen as 352. Figure 25 shows the comparison between true value and XGBoost model predicted value of Log(W). It can be seen that almost all the data points of the training set were located in the 100% agreement line, and majority of the testing set data were within ± 10% of the target values. The relative errors for the training and testing patterns are plotted in Figure 26. Almost all of the XGBoost estimations of the training set data fell within ± 5% of the target values, and most of the predictions of the testing data were within −5% to +10% of the target values.

XGBoost Modeling Result
In the XGBoost model, the most important parameter is n estimators , which controls the number of decision trees. Figures 23 and 24 give the RMSE, MAE, and R 2 change curves of XGBoost model with parameter n estimators . It can be seen that for the training set, when n estimators increased, RMSE and MAE decreased and the R 2 score increased. However, for the testing set, when n estimators was greater than 200, the curves of performance measures had only minor fluctuations. The optimal n estimators was chosen as 352.
RMSE/MAE and large R 2 scores indicate a higher fitting performance than the Ridge model and Lasso model.
The relative importance of the input variables for XGBoost model is shown in Figure 27. It can be seen that Dr (100%) was the most important input variable, followed by mean σ′ (68.5%). Cu, D50, and FC had the relative importance of 21.6%, 20.9%, and 17.4%. All were quite small compared to that of Dr.   RMSE/MAE and large R 2 scores indicate a higher fitting performance than the Ridge model and Lasso model. The relative importance of the input variables for XGBoost model is shown in Figure 27. It can be seen that Dr (100%) was the most important input variable, followed by mean σ′ (68.5%). Cu, D50, and FC had the relative importance of 21.6%, 20.9%, and 17.4%. All were quite small compared to that of Dr.    Figure 25 shows the comparison between true value and XGBoost model predicted value of Log(W). It can be seen that almost all the data points of the training set were located in the 100% agreement line, and majority of the testing set data were within ±10% of the target values. The relative errors for the training and testing patterns are plotted in Figure 26. Almost all of the XGBoost estimations of the training set data fell within ±5% of the target values, and most of the predictions of the testing data were within −5% to +10% of the target values.     The relative importance of the input variables for XGBoost model is shown in Figure 27. It can be seen that Dr (100%) was the most important input variable, followed by σ mean (68.5%). Cu, D 50 , and FC had the relative importance of 21.6%, 20.9%, and 17.4%. All were quite small compared to that of Dr.

MARS Modeling Result
In the MARS model, the most important parameter is Maxterms, which controls the maximum number of terms generated by the forward pass. Figures 28 and 29 give the RMSE, MAE, and R 2 change curves of MARS model with parameter Maxterms. It can be seen that RMSE and MAE decreased and the R 2 score increased with the increasing Maxterms until the value of 50, then the curves remained almost unchanged. The optimal Maxterms was chosen as 60. Figure 30 shows the comparison between true value and MARS model predicted value of Log(W). It can be seen that majority of the data points were within ± 10% of the target values. The relative errors for the training and testing patterns are plotted in Figure 31. Almost all of the MARS estimations of the training and testing set data fell within ± 10% of the target values. It should be noticed that the RMSE, MAE, and R 2 scores were 0.148, 0.072, and 0.879 for the training set, with the corresponding scores of 0.165, 0.134, and 0.84 for the testing set. The results prove the satisfactory predictive capacities of MARS model.

MARS Modeling Result
In the MARS model, the most important parameter is Max terms , which controls the maximum number of terms generated by the forward pass. Figures 28 and 29 give the RMSE, MAE, and R 2 change curves of MARS model with parameter Max terms . It can be seen that RMSE and MAE decreased and the R 2 score increased with the increasing Max terms until the value of 50, then the curves remained almost unchanged. The optimal Max terms was chosen as 60.             (26).

Comparison of the Results from Different Models
The performance indications of the Ridge, Lasso, RF, XGBoost, and MARS model for Soil Liquefaction Assessment using Capacity Energy Concept are listed in Figures 33 and 34. It can be seen that the Ridge model performs worse among these five models. Its RMSE and MAE scores were 0.271 and 0.221 for the training set, with the corresponding scores of 0.258 and 0.201 for the testing set. The R 2 scores were 0.582 and 0.382 for training set and testing set of Ridge model. The Ridge mode had the highest MRSE/MAE and lowest R 2 score among these five models. This reflects the shortcoming of linear regression nature of Ridge model. The R 2 scores were 0.578 and 0.581 for training set and testing set, respectively. The RF, XGBoost, and MARS models achieved better results, whose RMSE and MAE values were smaller than 0.16 and 0.14, with the R 2 scores were all above 0.8.
Overall, the R 2 scores of RF and XGBoost models were higher than the MARS model, e.g., the R 2 scores were 0.97 and 0.977 for training data for these two models, while the MARS was 0.879. This means that the RF and XGBoost model fit better than the MARS model in the training set. However, the R 2 scores of the testing set for MARS model were 0.84, which is the same as the XGBoost model, and even higher than the RF model (0.825).     The Lasso & LassoCV model also did not perform well. The RMSE and MAE scores were 0.275 and 0.225 for the training set, and 0.255, 0.201 for the testing set. The R 2 scores were 0.578 and 0.581 for training set and testing set, respectively. The RF, XGBoost, and MARS models achieved better results, whose RMSE and MAE values were smaller than 0.16 and 0.14, with the R 2 scores were all above 0.8.
Overall, the R 2 scores of RF and XGBoost models were higher than the MARS model, e.g., the R 2 scores were 0.97 and 0.977 for training data for these two models, while the MARS was 0.879. This means that the RF and XGBoost model fit better than the MARS model in the training set. However, the R 2 scores of the testing set for MARS model were 0.84, which is the same as the XGBoost model, and even higher than the RF model (0.825). Figure 35 displays the relative importance of the input variables for the soil liquefaction assessment models developed by Ridge, Lasso & LassoCV, RF, XGBoost, and MARS. The results indicate that the capacity energy Log(W) was most sensitive to Dr, whose relative importance was almost 100% for all the five machine learning methods. The second important input variable should be either σ mean or D 50 , because their average relative importance were quite close and around 60%. But if we only focus on the RF, XGBoost, and MARS model whose performance measures were above 0.8, it can be found that the σ mean had the average relative importance of 56.4% while the D 50 had only 36.4%. The FC and Cu did not play a significant role, because their relative importance were all quite low for the five machine learning models. In addition, the LassoCV algorithm automatically removed the FC as input variable by giving it a zero regression coefficient. The relative importance of Cu were negative for Ridge and Lasso & LassoCV, and 21.6% and 4.34% for XGBoost model, which are all quite low.
It should be noticed that the performance of the MARS models were not only desirable, but the relative importance obtained by MARS model were also quite close to the average value of the five models. In addition, compared to the black box model of RF and XGBoost model, the MARS model had an advantage of the capacity to output explicit expressions.
of Cu were negative for Ridge and Lasso & LassoCV, and 21.6% and 4.34% for XGBoost model, which are all quite low.
It should be noticed that the performance of the MARS models were not only desirable, but the relative importance obtained by MARS model were also quite close to the average value of the five models. In addition, compared to the black box model of RF and XGBoost model, the MARS model had an advantage of the capacity to output explicit expressions.

Conclusions
In the present study, analyses have been carried out on a total of 405 previously published tests using machine learning algorithm, including Ridge, Lasso & LassoCV, Random Forest, XGBoost, and MARS approaches, to assess the capacity energy required to trigger liquefaction in sand and sandy soils. Major findings obtained in this research include: (1) The performance measures for Ridge and Lasso & LassoCV models were all below 0.6, reflecting the shortage of linear regression methods in handling the complex data mapping in highdimensional data sets.
(2) RF, XGBoost, and MARS models were capable of capturing the nonlinear relationships involving a multitude of variables with interaction among each other without making any specific assumption about the underlying functional relationship between the input variables and the response.

Conclusions
In the present study, analyses have been carried out on a total of 405 previously published tests using machine learning algorithm, including Ridge, Lasso & LassoCV, Random Forest, XGBoost, and MARS approaches, to assess the capacity energy required to trigger liquefaction in sand and sandy soils. Major findings obtained in this research include: (1) The performance measures for Ridge and Lasso & LassoCV models were all below 0.6, reflecting the shortage of linear regression methods in handling the complex data mapping in high-dimensional data sets.
(2) RF, XGBoost, and MARS models were capable of capturing the nonlinear relationships involving a multitude of variables with interaction among each other without making any specific assumption about the underlying functional relationship between the input variables and the response.
(3) Using the results of Ridge, Lasso & LassoCV, Random Forest, XGBoost, and MARS as cross-validation, it can be found that the capacity energy Log(W) was most sensitive to Dr, whose relative importance were almost 100% for all the five machine learning methods.
(4) The R 2 scores of the testing set for MARS model was the same as the XGBoost model, and even higher than the RF model. The relative importance obtained by MARS model were quite close to the average value of the five models. In addition, compared to the black box model of RF and XGBoost model, the MARS model had an advantage of the capacity to output explicit expressions.