Prediction of Swelling Index Using Advanced Machine Learning Techniques for Cohesive Soils

: Several attempts have been made for estimating the vital swelling index parameter conducted by the expensive and time-consuming Oedometer test. However, they have only focused on the neuron network neglecting other advanced methods that could have increased the predictive capability of models. In order to overcome this limitation, the current study aims to elaborate an alternative model for estimating the swelling index from geotechnical physical parameters. The reliability of the approach is tested through several advanced machine learning methods like Extreme Learning Machine, Deep Neural Network, Support Vector Regression, Random Forest, LASSO regression, Partial Least Square Regression, Ridge Regression, Kernel Ridge, Stepwise Regression, Least Square Regression, and genetic Programing. These methods have been applied for modeling samples consisting of 875 Oedometer tests. Firstly, principal component analysis, Gamma test, and forward selection are utilized to reduce the input variable numbers. Afterward, the advanced techniques have been applied for modeling the proposed optimal inputs, and their accuracy models were evaluated through six statistical indicators and using K-fold cross validation approach. The comparative study shows the efﬁciency of FS-RF model. This elaborated model provided the most appropriate prediction, closest to the experimental values compared with other models and formulae proposed by the previous studies.


Introduction
The geotechnical study is the first phase of any structural project. It addresses the morphological, geological, local, and regional site conditions through a set of successive steps, aimed mainly at providing efficiently the necessary and sufficient information on the natural characteristics of a site. Moreover, it allows for determining all information and parameters necessary for the computation of foundations, leading to the stability of the structures under different risks, such as swelling [1,2]. The latter is considered a notoriously problematic phenomenon for any infrastructure foundation because of the considerable volume changes experienced by soils upon drying and wetting [3]. However, the integrated components of the geotechnical process for detecting swelling risk are very diverse. Moreover, these experimental processes are often time-consuming and, in many cases, the computation costs are very high. On the other hand, the Oedometer test suffers from several disadvantages, such as the requirement of sophisticated and expensive equipment, being very time-consuming, and requiring highly qualified laboratory workers [4,5]. A well-known example is the consolidated clay, which could take up to 30 days when the Oedometer test is conducted [6]. However, the mechanical soil properties are sometimes inexistent or insufficient, in circumstances where the engineer must provide a quick and important decision to rapidly handle the critical situation. These disadvantages Table 1. Correlations proposed in the literature to estimate the swelling index C s .

Variables
Correlations Comments References C s (W L , G s ) C s = 0.0463 W L 100 G s (1) fine-grained soils [18] C s (I P ) C s = 0.00194(I P − 4.6) (2) fine-grained soils [19] C s (W n ) C s = 0.0133e 0.036W n (3) fine-grained soils Isik 1 [20] C s (e 0 ) C s = 0.0121e 1 (5) fine grained soils Isik 3 [20] On the other hand, the use of the ANN method in geotechnical engineering has witnessed a major development since the early nineties [21]. Several attempts using neural networks have led to impressive results. Between the important research works dealing with the swelling of soil, Işık has utilized one hidden layer of the ANN model by analyzing a database consisting of 42 test data for fine-grained soil. The selected input parameters included the initial void ratio (e 0 ) and W. The (2-8-1) ANN model (meaning two inputs, eight nodes in the hidden layer and one output) efficiently predicted C s [20]. Das et al. have predicted the swelling pressure using an ANN model with an input layer containing W, dry density (Y d ), WL, PI, and clay fraction [22]. Kumar and Rani have developed an ANN model with one hidden layer to predict C s and the C c of clay by learning from 68 samples. They used the FC, WL, PI, maximum dry density, and optimum moisture content as an input layer. The suggested ANN model (5-8-1) provided better predictability in comparison with the multiple regression analysis (MRA) model [23]. Kurnaz et al. have used ANN models to estimate C s and C c from input layer including the W, e 0 , WL, and PI. The proposed ANN model (4-6-2) has proven its efficiency in the prediction of C c . Nevertheless, the predicted C s values were not satisfactorily compared to the compression index [24]. Table 2 recapitulates the aforementioned proposed ANN models in the literature to estimate the swelling index C s . W, e 0 , WL, and PI C s and C c 4-6-2 246 [24] The quality, learning ability, and effectiveness have made the use of the ANN method very useful. According to the author's knowledge, the previous studies have used only the ANN method for estimating C s , although recent studies have showed that other techniques could have yielded more effective and accurate results than the ANN method in geotechnical applications [25][26][27]. Furthermore, the aforementioned studies have modeled C s using a few input parameters and, therefore, ignored the different soil parameters that could increase the learning capacity of the network. Consequently, the complicated mechanism of the swelling phenomena has been oversimplified. Moreover, few samples have been used, meaning that the proposed models have a limited capacity to generalize new data not used in the few training data. Furthermore, they evaluated the predictive capacity of proposed models based on only one split to validate data learning. Therefore, the capacity of their model to overcome the over-fitting and under-fitting problems cannot be confirmed.
In view of the aforementioned shortcoming, this research sheds the light on the capability of the advanced machine learning methods to generate a reliable model, contributing to easily and effectively predict C s , filling a gap in the literature where there is a lack in the use of the advanced machine learning methods in modeling swelling phenomenal. Consequently, the elaborated model offers plenty of benefits such as its reliability, and lowering the budget used to predict C s from the easily obtained soil parameters and without the need to operate the odometer test.

Overview of the Methodology
Several-advanced machine learning techniques like Extreme Learning Machine (ELM), Deep Neural Network (DNN), Support Vector Regression (SVR), Random Forest (RF), LASSO regression (LASSO), Partial Least Square Regression (PLS), Ridge Regression (Ridge) Kernel Ridge (KRidge), Stepwise Regression (Stepwise), Least Square Regression (LS), and Genetic Programming (GP) have been applied for modeling 875 samples. Multiple input parameters, including the wet density (Y h ), the dry density (Y d ), the degree of saturation (Sr), the plasticity index (PI), the water content (w), the void ratio (e 0 ), the liquid limit (WL), sample depth (Z), and the fine contents (FC) have been used. Firstly, from an effective viewpoint, the suitable input variables and nonlinear components are of considerable importance for efficient prediction. Thus, the Principal Component Analysis (PSA), Gamma Test (GT) and Forward selection (FS) methods have been used to select the optimal set of input variables. Afterward, the advanced machine learning techniques have been applied for modeling optimal inputs, and their accuracy models were evaluated through numerous statistical indicators. To assess the predictive capability of the best model, the k-fold crossvalidation approach based on 10 splits has been utilized. Finally, to answer the question "Which input variables have the most or less influence on C s through the proposed model?", a sensitivity analysis has been carried out using the step-by-step selection method.

Oedometer Test
The compressibility of soils is one of the most important phenomena in civil engineering. The Oedometer test is used to determine the compressibility properties of soil, which are usually described using C c , C s , and the coefficient of consolidation (C v ) [20,24]. The compressibility properties are used to predict how the settlement and the swelling will be held. A number of parameters influencing the swelling behavior have been reported in the past, such as W, e, WL, PI, the type and amount of clay (FC), and others [8]. C s is usually determined using the graphical analysis of compression and recompression curves in void ratio effective stress (e = f(log(σ))) plots [20,28] (see Figure 1), and used typically to estimate the consolidation settlement for soil layers using these formulae: where e 0 : initial void ratio, ∆σ v : load increment, σ' c : pre-consolidation pressure, σ' v0 : initial vertical effective stress, C c : compression index, and C s : swelling index.
of soil, which are usually described using Cc, Cs, and the coefficient of con [20,24]. The compressibility properties are used to predict how the settle swelling will be held. A number of parameters influencing the swelling been reported in the past, such as W, e, WL, PI, the type and amount of others [8]. Cs is usually determined using the graphical analysis of com recompression curves in void ratio effective stress (e=f(log(σ))) plots [20,2 1), and used typically to estimate the consolidation settlement for soil laye formulae: where e0: initial void ratio, Δσv: load increment, σ'c: pre-consolidation press vertical effective stress, Cc: compression index, and Cs: swelling index.

Case Study
In the current study, a database of 875 Oedometer and other geotech been collected from 570 boreholes. The reason for choosing the Northeast A attributed to the widespread clay and marl geological formations, which fered from shrinkage-swelling phenomena, resulting into a database inc range of data, sufficient for a reliable study. The sample depth (Z) ranges 45 m with an average of 6.66 m. Figure 2 illustrates the distribution of the holes in the study area. The soil parameters used in this study were m lab depending on the international or European standards.

Case Study
In the current study, a database of 875 Oedometer and other geotechnical tests has been collected from 570 boreholes. The reason for choosing the Northeast Algerian area is attributed to the widespread clay and marl geological formations, which generally suffered from shrinkage-swelling phenomena, resulting into a database including a wide range of data, sufficient for a reliable study. The sample depth (Z) ranges between 0.5 to 45 m with an average of 6.66 m. Figure 2 illustrates the distribution of the collected boreholes in the study area. The soil parameters used in this study were measured in the lab depending on the international or European standards. PCA is an exploratory statistics approach [29] considered between the multivariate statistical methods, which is generally utilized to decrease the complexity of the input variables. PCA is used in the practical case when we have a great number of information, and, in many fields, sometimes in conjunction with other methods [30]. Furthermore, this method reduces the input variables into a better interpretation of independent principal components (PCs) [31,32]. Instead of the direct use of input variables, we reduce them into PCs in order to use them as lower-dimensional inputs. Details for mastering the art of applying the PCA are published by other studies [33][34][35][36].

Overview of Gamma Test (GT)
The Gamma Test (GT) is an advanced approach for evaluating the variance of the noise, or the mean square error (MSE), which could be fulfilled without over-fitting. This method is very advantageous for assessing the nonlinear relationship between two random variables (input and target). Koncar (1997) [37] and Agalbjörn et al. (1997) [38] were the first reporting gamma test method and later many investigators have developed and detailed [35,36,39]. Only summarized information about GT is discussed here. Details for mastering the art of GT are presented in the aforementioned articles for the interested readers. By using necessary conditions detailed in the above-mentioned papers, the variance of the noise is specified by the bias of the regression between γ(k) and δ(k), where 1 ≤ k ≤ p. This variance is dubbed Γ. γ(k) and δ(k) are presented in Equations (9) and (10): [ , ] is the kth nearest neighbor of xi, and [ , ] is the corresponding target. In order to calculate Γ, a least squares fit line should be carried out for the p points (δ(k), γ(k)). Afterward, the bias of the regression line will be easily estimated, which represents the gamma statistics parameter Γ. It is worth mentioning that Γ provides useful findings for building an accurate model. The smaller the value of Γ, the more appropriate is the input set. In addition, Vratio presents an important indication to assess the predictability of the selected target depending on utilized inputs, and illustrated as:  PCA is an exploratory statistics approach [29] considered between the multivariate statistical methods, which is generally utilized to decrease the complexity of the input variables. PCA is used in the practical case when we have a great number of information, and, in many fields, sometimes in conjunction with other methods [30]. Furthermore, this method reduces the input variables into a better interpretation of independent principal components (PCs) [31,32]. Instead of the direct use of input variables, we reduce them into PCs in order to use them as lower-dimensional inputs. Details for mastering the art of applying the PCA are published by other studies [33][34][35][36].

Overview of Gamma Test (GT)
The Gamma Test (GT) is an advanced approach for evaluating the variance of the noise, or the mean square error (MSE), which could be fulfilled without over-fitting. This method is very advantageous for assessing the nonlinear relationship between two random variables (input and target). Koncar (1997) [37] and Agalbjörn et al. (1997) [38] were the first reporting gamma test method and later many investigators have developed and detailed [35,36,39]. Only summarized information about GT is discussed here. Details for mastering the art of GT are presented in the aforementioned articles for the interested readers. By using necessary conditions detailed in the above-mentioned papers, the variance of the noise is specified by the bias of the regression between γ(k) and δ(k), where 1 ≤ k ≤ p. This variance is dubbed Γ. γ(k) and δ(k) are presented in Equations (9) and (10): x N[i,K] is the kth nearest neighbor of x i , and y N[i,K] is the corresponding target. In order to calculate Γ, a least squares fit line should be carried out for the p points (δ(k), γ(k)). Afterward, the bias of the regression line will be easily estimated, which represents the gamma statistics parameter Γ. It is worth mentioning that Γ provides useful findings for building an accurate model. The smaller the value of Γ, the more appropriate is the input set. In addition, V ratio presents an important indication to assess the predictability of the selected target depending on utilized inputs, and illustrated as: where σ 2 (y) is the output variance. Overall, in the current paper, by using the least value of Γ and V ratio , the best-input combinations were chosen.

Overview of Forward Selection (FS)
The forward selection method is based on regression modelling. Firstly, input variables are considered in ascending order according to their correlation with the target variable (from the best to the weak correlated variable). Then, the best correlated variable with the target is chosen as the first one. The other variables are afterward added one by one as the second input, and we assess the predictive capability of the set according to their correlation development. Generally, the most significant variable increases the determination coefficient (R 2 ) is selected and added to the input set. In other words, if the R 2 is increased more than 5%, the new variable is accepted and added to the optimal input set. This step is repeated N − 1 times for assessing the impact of each parameter on modelling the target. Finally, among N tested inputs, the ones with optimum R 2 are accepted as the model input subset.

Machine Learning Methods
In the current study, several machine-learning methods have been used in order to conduct a reliable study and to propose a high performance model. Only the methods actually used are cited, followed by relevant references, which can be examined by the interested readers to better understand each one. The used methods are: Extreme Learning Machine (ELM) [40], Deep Neural Network (DNN) [6,41], Support Vector Regression (SVR) [42], Random Forest (RF) [43], LASSO regression (LASSO) [44], Partial Least Square Regression (PLS) [45], Ridge Regression (Ridge) [46], Kernel Ridge Regression (KRidge) [47], Stepwise Regression (Stepwise) [48], and Genetic Programing (GP) [6]. Matlab software has been used for programming the algorithms corresponding to each method, except GP when the HeuristicLab Interface has been utilized [49]. The controlling parameters of the ELM, DNN, SVR, RF, LASSO, PLS, Ridge, KRidge, Stepwise, and PG algorithms used in this study are listed in Table 3.

Statistical Performance Indicators
The prediction accuracy of the proposed models was evaluated through various statistical performance indicators and using graphical presentation. The statistical performance indicators are Mean absolute error (MAE), Root mean square error (RMSE), Index of scattering (IOS), Nash-Sutcliffe efficiency (NSE), Pearson correlation coefficient (R), and Index of agreement (IOA). They are expressed as follows [50]:

5.
Pearson correlation coefficient (R): 6. Index of agreement (IOA): where Y tar,i , Y out,i , Y tar , and Y out present the target, output, mean of target, and mean of output swelling index values for N data samples, respectively. Furthermore, the proposed machine learning model having the lowest value of RMSE, IOS, and MAE and the highest value of IOA, NSE, and R presents the better one and the closest to the experimental data.
Subsequently, after selecting the best model using statistical performance indicators, its predictive capacity is assessed using the K-fold cross-validation approach. This shows more accuracy and robustness for evaluating the capability of performance predictive of the best model by testing the existence of over-fitting and under-fitting problem in data learning [51,52]. The method consists of separating the data set into k equal sizes splits. Hence, for each one, K−1-folds are used for training and the last one for validation. This operation is repeated successively until the use of all split for the validation step [53]. The main advantage of this method is that all collected data are utilized in both the training and validation steps [52]. Breiman and Spector (1992) have demonstrated that K = 10 or K = 5-fold cross validation is the best choice for model evaluation [51]. In this work, we chose K-fold cross-validation with K = 10 for comparing the predictive capacity of each model.

Methodology
In order to find the most appropriate model to predict C s of soil using previously mentioned geotechnical parameters as an input, the methodology consisted of the following steps: 1.
Creation of a geotechnical database of Algerian soil, collected from different laboratories around the geotechnical constructions projects in progress or completed before.

2.
Selecting the optimal input variables using Principal component analysis (OSA), Gamma Test (GT), and Forward selection (FS) has been used.

3.
Analyzing selected optimal inputs using several machine learning methods. The ELM, DNN, SVR, RF, LASSO, PLS, Ridge, KRidge, Stepwise, and PG methods have been used in this step for proposing 30 models.

4.
Determine the most appropriate model for predicting the C s value between the thirty proposed models using important statistical performance indicators as MAE, RMSE, IOS, NSE, R, and IOA.

5.
Assessing the predictive capacity of the best model to overcome under-fitting and over-fitting problem by using the K-fold cross validation approach with K = 10. 6.
Doing a sensitivity analysis by utilizing the step-by-step method to know the most or less influenced input on C s through the proposed model.
The research methodological for determining the most appropriate model to estimate C s is systematically described in Figure 3. Appl. Sci. 2021, 11, x FOR PEER REVIEW 9 of 32

Database Compilation
In the current study, a database of 875 Oedometer and other geotechnical tests has been collected resulting into a database including a wide range of data, sufficient for a reliable study. Table 4 displays the descriptive statistics of collected samples, determined using SPSS such as the mean, median, mode, standard deviation, variance, skewness, error of skewness, kurtosis, kurtosis error, range, minimum, and maximum. The skewness values indicate that all variables are evenly distributed. Moreover, the results point out that the database includes a wide range of data. Therefore, the collected dataset can be used to enhance novel empirical models and assess the predictive capacity of existing formulae. North East Algerian soil can be characterized as a dense soil with an average wet density of 1.67 according to the European norms [54]. Moreover, the soil appears to be plastic with an average plastic index of 26.09 according to SONGLIRAT's classification [55]. In the classification based on swelling, according to the European norms, the soils could be classified as a swelling soil, with an average Cs equal to 0.044 [55].

Database Compilation
In the current study, a database of 875 Oedometer and other geotechnical tests has been collected resulting into a database including a wide range of data, sufficient for a reliable study. Table 4 displays the descriptive statistics of collected samples, determined using SPSS such as the mean, median, mode, standard deviation, variance, skewness, error of skewness, kurtosis, kurtosis error, range, minimum, and maximum. The skewness values indicate that all variables are evenly distributed. Moreover, the results point out that the database includes a wide range of data. Therefore, the collected dataset can be used to enhance novel empirical models and assess the predictive capacity of existing formulae. North East Algerian soil can be characterized as a dense soil with an average wet density of 1.67 according to the European norms [54]. Moreover, the soil appears to be plastic with an average plastic index of 26.09 according to SONGLIRAT's classification [55]. In the classification based on swelling, according to the European norms, the soils could be classified as a swelling soil, with an average C s equal to 0.044 [55].

Correlation between C s and Geotechnical Parameters
To statistically predict the correlation between C s and soil properties, the SPSS software has used. The cross-correlation between C s and soil parameters is presented in Figure 4, which could provide a descriptive overview of the data distribution. The results indicate a positive relationship between C s and others parameters, except for Y d and Y h , which seem to have a negative relationship (see Figure 4); this indicates that an increase in these parameters tends to proportionally increase C s . The Spearman correlation coefficient R and its significance between C s and other geotechnical parameters are presented in Table 5. The results show that the significance is less than 0.05 except Z, meaning that the majority of correlations are statistically significant. On the other hand, according to Smith' classification (1986), the C s is moderately correlated to the aforementioned soil parameters, except Sr and Z which are poorly correlated. The results indicate that these parameters could have a complex nonlinear correlation with C s . Moreover, in order to mathematically simulate the complex swellings phenomena, new advanced machine learning methods should be applied.

Optimal Input Selection Using Principal Component Analysis
According to our knowledge, several methods were utilized in the literature to characterize the proper factors that affect the modeling precision. Therefore, the current study used the approach of selecting eigenvalues equal to or greater than 1, as illustrated in Table 6. Holland (2008) stated according to [56] that the eigenvalues are applied to condense the variance where the highest eigenvalues (1 and above) are deemed for any analysis by eigenvectors ranking in any correlation matrix. Figure 5 presents eigenvalues of each factor, which show that nine input variables correspond to nine eigenvalues. Likewise, Table 6 displays the Eigenvalue and percentage of data in each factor. It is obvious from the table that the main first three variables explain more than 75% of the factors. Similarly, the results showed that five factors have a significant percentage contribution of more than 94. The findings indicate that the best model includes three PCs as input variables according to principal component analysis.

Optimal Input Selection Using Principal Component Analysis
According to our knowledge, several methods were utilized in the literature to characterize the proper factors that affect the modeling precision. Therefore, the current study used the approach of selecting eigenvalues equal to or greater than 1, as illustrated in Table  6. Holland (2008) stated according to [56] that the eigenvalues are applied to condense the variance where the highest eigenvalues (1 and above) are deemed for any analysis by eigenvectors ranking in any correlation matrix. Figure 5 presents eigenvalues of each factor, which show that nine input variables correspond to nine eigenvalues. Likewise, Table 6 displays the Eigenvalue and percentage of data in each factor. It is obvious from the table that the main first three variables explain more than 75% of the factors. Similarly, the results showed that five factors have a significant percentage contribution of more than 94. The findings indicate that the best model includes three PCs as input variables according to principal component analysis.    Table 7 that the first combination contains all nine inputs (dubbed an initial set). Similarly, the second one included eight input parameters (All-Sr),

Optimal Input Selection Using the Gamma Test
In this part, the influence of the separated inputs was evaluated by building the ten various combinations (Table 7) through diverse input parameters (Sr, Z, Y h , Y d , W, e 0 , FC, WL and PI). It was noticed from Table 7 that the first combination contains all nine inputs (dubbed an initial set). Similarly, the second one included eight input parameters (All-Sr), which means omitted Sr from the initial set; the fourth combination comprises all inputs except Y h , and so on for rest of the combinations as presented in Table 7. The results of GT analysis shown in Table 7 prove that the parameters W, FC, WL, and PI have an important effect on the target (C s ). The four input parameters are chosen according to the maximum value of gamma statistics (Γ), and V ratio . Based on this finding, four new combinations were tested in Table 8 for the sake of determining the optimal input set. In this case, the best set was designated based on the minimum of Γ and V ratio . The outcomes of GT on four diverse combinations are illustrated in Table 8. The findings indicate that the WL, PI, FC, and W set had the lowest value of gamma statistics (Γ = 1.3524 × 10 −4 , V ratio = 0.3714, and Mask = 1111), and used as input variables for modelling C s according to the Gamma Test method. In this part, the FS technique is utilized as a nonlinear input selection approach in order to choose the optimal input set between nine parameters. The ANN method with one layer is used to implement the nonlinear analysis of each set according to the empirical rule proposed by Kanellopoulos and Wilkinson (1997). The authors have demonstrated that the optimal node number in a simple network is twice the number of input parameters [57]. Firstly, correlation between target and other geotechnical parameters is estimated. Secondly, the parameter having the highest correlation coefficient, (i.e., WL with R = 0.55) is chosen as the first and the most significant parameter. Then, the other inputs are added into the model one by one and launch a new ANN modelling. The one providing the best modeling result (high determination coefficient (R 2 )) is selected as a new input and gathered into the formerly selected parameters. This work is repeated several times until that appending other parameter to the input set does not improve the modeling performance. Hence, if the determination coefficient increases more than 5%, the novel parameter is selected. Finally, input parameters having the most heavy influence on the target are selected and others ones are rejected. Table 9 shows the results of the forward selection procedure of different input models. The findings indicate that WL, Yd, W, and PI are selected as inputs for modeling C s according to the forward selection procedure, and the other parameters are eliminated. Table 9. Results of forward selection procedure of different input models.

Swelling Index Prediction through AI Models
The controlling parameters of the ELM, DNN, SVR, RF, LASSO, PLS, Ridge, KRidge, Stepwise, and GP algorithms used in this study are listed in Table 3. The performance of each models for selected optimal inputs (PSO, GT and FS) during training and validation phases is presented in Table 10. Six statistical performance indicators have been used to compare between proposed models in order to select the best one. We mention the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Index of Scattering (IOS), Nash-Sutcliffe Efficiency (NSE), Pearson correlation coefficient (R), and Index of Agreement (IOA). The data were divided into two parts, i.e., 80% for training and 20% for validation (700 samples for training and 175 for validation). It was indicated from Table 10

Evaluating the Best Fitted Model Using the K-fold Cross Validation Approach
The 10-fold cross validation approach was efficiently used to assess the predictive capacity of the proposed model. We must stress out the fact that previous studies interested in estimating C s [20,[22][23][24] have evaluated the predictive capacity of their proposed models depending on a single split. Subsequently, the capacity of their models in overcoming the over-fitting and under-fitting problems could not be ascertained. Figure 6 shows the performance measures of the best FS-RF models using 10-fold cross validation depending on training and validation data for each split. The findings prove the performance of the proposed model. The fact that R ranges between 0.92 and 0.94 for training data, and between 0.62 and 0.75 for validation data in the 10 splits, indicates the predictive capacity of the most appropriate FS-RF model for learning data, generating new validation data, and overcoming the over-fitting or under-fitting problems.

Evaluating the Best Fitted Model Using the K-fold Cross Validation Approach
The 10-fold cross validation approach was efficiently used to assess the predictive capacity of the proposed model. We must stress out the fact that previous studies interested in estimating Cs [20,[22][23][24] have evaluated the predictive capacity of their proposed models depending on a single split. Subsequently, the capacity of their models in overcoming the over-fitting and under-fitting problems could not be ascertained. Figure 6 shows the performance measures of the best FS-RF models using 10-fold cross validation depending on training and validation data for each split. The findings prove the performance of the proposed model. The fact that R ranges between 0.92 and 0.94 for training data, and between 0.62 and 0.75 for validation data in the 10 splits, indicates the predictive capacity of the most appropriate FS-RF model for learning data, generating new validation data, and overcoming the over-fitting or under-fitting problems.

Comparison between the Proposed Models and Empirical Formulae
In order to test the effectiveness of the most appropriate FS-RF model, a comparative study was performed with some empirical formulae suggested by the previous studies for predicting Cs. These formulae were presented in Table 1. Table 11 illustrates the statis-

Comparison between the Proposed Models and Empirical Formulae
In order to test the effectiveness of the most appropriate FS-RF model, a comparative study was performed with some empirical formulae suggested by the previous studies for predicting C s . These formulae were presented in Table 1. Table 11 illustrates the statistical results of C s,estimated /C s,measured for aforementioned empirical equations in comparison with the proposed FS-RF and FS-DNN models. The mean and standard deviation of the ratio C s,estimated /C s,measured could be useful evidence for evaluating the predictive capacity. The closer the mean value to one and the standard deviation to zero, the better is the model. The results show that the most appropriate FS-RF model is the one with the minimum standard deviation σ FS-RF = 0.25, in addition to being the closer mean value to 1 average(FS-RF) = 1.07. The other equations indicate a poor predictive capacity, yielding a mean value in the range of 0.45-1.7, and standard deviation value between 0.26-0.99. Equally, the box plot of C s,estimated /C s,measured of aforementioned formulae is displayed in Figure 7. This graphical representation provides an overview of the dispersion and skewness of every model. The scattering of the most appropriate FS-RF model appears only to be slightly regular and close to one, and is described by a shorter box than the others. The large box distant from 1, increasing to five in certain models, shows little variation in predicting C s . Data characterized by circles and stars in the figure denote the extreme and extra-extreme value.  Figure 7. This graphical representation provides an overview of the dispersion and skewness of every model. The scattering of the most appropriate FS-RF model appears only to be slightly regular and close to one, and is described by a shorter box than the others. The large box distant from 1, increasing to five in certain models, shows little variation in predicting Cs. Data characterized by circles and stars in the figure denote the extreme and extra-extreme value.

Sensitivity Analysis
To answer the question "Which input variables have the most or less influence on Cs in the proposed model?", a sensitivity analysis has been carried out using the step-by-step method [58]. In this approach, the normalized input neurons vary at a constant rate, one at a time, while the other variables are held constant. Different constant rates (0.3, 0.6 and

Sensitivity Analysis
To answer the question "Which input variables have the most or less influence on C s in the proposed model?", a sensitivity analysis has been carried out using the step-by-step method [58]. In this approach, the normalized input neurons vary at a constant rate, one at a time, while the other variables are held constant. Different constant rates (0.3, 0.6 and 0.9) are selected in the current study. For every input, the percentage of change in the output, as a result of the change in the input, is recorded. The sensitivity of each input is computed based on Equation (18): where K is the number of data sets used in the study (K = 875). The results of the sensitivity analysis of proposed FS-RF model are shown in Figure 8. It can be noticed that C s is significantly influenced by WL, and its sensibility ratio is between 32-38%. This parameter is closely followed by the PI, which gives a moderate sensitivity and is ranked second. In addition, W and Y d have little effect on C s .
Appl. Sci. 2021, 11, x FOR PEER REVIEW 19 of 32 0.9) are selected in the current study. For every input, the percentage of change in the output, as a result of the change in the input, is recorded. The sensitivity of each input is computed based on Equation (18): where K is the number of data sets used in the study (K = 875). The results of the sensitivity analysis of proposed FS-RF model are shown in Figure 8. It can be noticed that Cs is significantly influenced by WL, and its sensibility ratio is between 32-38%. This parameter is closely followed by the PI, which gives a moderate sensitivity and is ranked second. In addition, W and Yd have little effect on Cs.

Significance of the Findings and Cross-Validation of the Results
The main motivation of this study is to explore the capability of advanced machine learning methods to generate a reliable model aimed at easily predicting Cs. Needless to say, Cs is one of the most indispensable geotechnical parameters required to estimate the

Significance of the Findings and Cross-Validation of the Results
The main motivation of this study is to explore the capability of advanced machine learning methods to generate a reliable model aimed at easily predicting C s . Needless to say, C s is one of the most indispensable geotechnical parameters required to estimate the settlement and the swelling degree in the every site. Firstly, in order to identify the optimal input parameters, which have the ideal influence on C s three advanced methods have been used. PSO proposed three lower-dimensional parameters as an optimal input. GT indicated that WL, PI, FC, and W could formalize the optimal input set. However, FS showed that WL, Y d , W, and PI are the best ones. The reason behind the difference between the three approaches lies in the philosophy of each one in handling the data. Based on that, ten advanced machine learning methods (ELM, DNN, SVR, RF, LASSO, PLS, Ridge, KRidge, Stepwise, and GP) have applied for modeling the three selected optimal input set (PSO, GT, and FS). The findings clearly indicate that the optimal input is the one chosen by FS and trained by the RF method (FS-RF). The latter presents the most appropriate model, which gave the minimum values of error metrics (MEA, RMSE, and IOS) and higher values of NSE, R, and IOA compared to other models. Furthermore, the emerging model was evaluated by the K-fold cross validation approach and compared with other proposed formulae. The conclusion is that the FS-RF model could generate new data without over-fitting or under-fitting, and being more effective than the empirical formulae. The other most interesting aspect is the optimal input set related to the best FS-RF model (WL, Y d , W, PI). Interestingly, this also accords with several studies, which have showed that physical parameters indirectly affected the swelling phenomena [59]. It is known that the micro-scale features of swelling soils comprise the mineral composition of clay particles, their reaction with the water chemistry, and the cations attracted to the clay particle by electrical forces. These micro-scale factors influence macro-scale physical factors, such as density, plasticity, and water content to control the engineering comportment of soil [59]. Additionally, the last part consists of the sensitivity analysis, which gives an overview about the more influenced parameters on C s according to the proposed model. The findings indicate that WL and PI are respectively the most affected factors on C s , meaning that swelling phenomena are primarily influenced by plasticity parameters. In addition, water content and dry density have little effect on C s .

Scientific Importance of the Findings and Novelty of the Research
Our findings represent a crucial contribution to the geotechnical field. The elaborated model building in our study represents a reliable tool for estimating C s without doing an Oedomter test. The performance of the estimation has been highly developed compared with other models and formulae proposed in the literature, which are based just on simple regression or neural networks. According to these data, we can infer that the Random Forest method, which is applied in this study for the first time for modeling swelling index, could yield more effective and accurate results than the DeepANN and ANN method in modelling geotechnical phenomena. These results provide further support for the hypothesis that macroscale physical factors, such as density, plasticity, and water content, are the parameters that affected the swelling phenomena. Moreover, the sensitivity analysis of the proposed model revealed the most influenced parameters between them for better understanding the complex behavior of the swelling. This investigation enhances our understanding that the plasticity of soil consisting of the WL and PI are the most affected factors on swelling phenomena. In addition to these conceptual advantages, for enhancing the training phase, a large number of samples (875 tests) and multiple input parameters have been used in this study. The sophisticated k-fold cross-validation approach was utilized to test the capability of the best model to overcome under-fitting and over-fitting problems.

Limitations of the Study and Future Research Directions
Despite the impressive multiple results presented in this study, a number of important limitations need to be considered. The most important limitation lies in the fact that the proposed machine learning models suffered from the hard fitting used in the future study. Generally, to overcome this limitation, researchers have presented elaborated models in the form of programed Interface or simple script by a known programming language like Matlab and Python for generating the proposed model. Another limitation of using this kind of data is due to its inability to generalize new conditions or circumstances that are not used in training data. Investigators generally used big collected data by transferring knowledge between them. This is an important issue for future research to use more data gathered from multiple countries for better learning and more reliable results. A further study using meta-heuristic algorithms on estimating C s is therefore suggested. We note, for example, Particle Swarm Optimization (PSO) and Gravitational Search Algorithm (GSA), bee colony algorithm (ABC), Bio-geography-Based Optimization (BBO), Whale Optimization Algorithm (WOA), Ant Colony Optimization (ACO), and Grey Wolf Optimizer (GWO). These algorithms have proved high-performance results combined with machine learning techniques leading to improving their learning and rapidly converging to the best solution. The application of these meta-heuristic algorithms combined with machine learning methods have shown very impressive results in the abroad fields [60][61][62].

Conclusions
This study set out to optimize the swelling index parameter conducted by the expensive and time-consuming Oedometer test, contributing to elaborating on a new accurate model for predicting the swelling index (C s ) from easily obtained geotechnical physical parameters. To achieve our aim, several advanced machine learning methods were used for a practical analysis aimed at modeling the physical parameters including the wet density (Y h ), the dry density (Y d ), the degree of saturation (Sr), the plasticity index (PI), the water content (w), the void ratio (e), the liquid limit (WL), sample depth (Z), and the fine contents (FC). Firstly, principal component analysis (PCA), Gamma test (GT), and the forward selection (FS) approach are utilized to reduce the input variable numbers and choose the optimal ones. The results indicate the reduction of nine input variables to four (using FS and GT) and three (using PCA techniques). Afterward, the advanced machine learning techniques have applied for modeling the proposed optimal inputs and their accuracy models were evaluated through six statistical indicators (MAE, RMSE, IOS, NSE, R, and IOA). The comparison of results assessment between different proposed models revealed the superiority of the FS-RF model, which gives the highest accuracy in terms of MAE (5.6 × 10 −3 /10.6 × 10 −3 ), RMSE (0.007/0.013), IOS (0.165/0.298), NSE (0.75/0.13), R (0.94/0.71), and IOA (0.95/0.82) during the training/validation phase. For assessing the predictive capacity of the proposed FS-RF model, the K-fold cross validation approach with K = 10 has been carried out. The results show that this model has a high correlation coefficient, ranging between 0.92 and 0.94 for training data, and 0.62 to 0.75 for validation data in the 10 splits, meaning that any over-fitting or under-fitting have been found. Three criteria were used to compare the performances of the most appropriate FS-RF model with the proposed formulas in the literature: the mean, the standard deviation, and the Box plot of the ratio C s,estimated /C s,measured . The findings indicate that the aforementioned FS-RF model is more effective than the empirical formulae. Finally, a sensitivity analysis was carried out in order to assess the impacts of the soil parameter inputs on the model performance. The results proved that WL has the most important effect on the prediction of C s . PI has a moderate influence and ranked second. In addition, W and Y d have little effect on C s .
In the future studies, it is recommended that meta-heuristic algorithms be undertaken

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The scatter plots between target and output swelling index value by advanced machine learning models:

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The scatter plots between target and output swelling index value by advanced machine learning models: