1. Introduction
The compression index (Cc) is a crucial parameter in geotechnical engineering for predicting the settlement behavior of clayey soils under consolidation. Its accurate estimation is fundamental for foundation design and soil stability analysis. Conventionally, Cc is empirically estimated using correlations based on soil index properties (e.g., the liquid limit, the plasticity index) or determined directly through laboratory consolidation tests [1,2]. While these traditional methods provide useful estimations, they are often limited by regional soil characteristics and laboratory constraints [3]. Recent advancements in machine learning (ML) have enabled more accurate, adaptable predictions of geotechnical parameters like Cc [4].
Many empirical models have been developed to correlate Cc with soil properties such as the natural water content, the liquid limit, and the void ratio. Skempton [2] proposed one of the earliest models, correlating Cc with the liquid limit. In 1944, Skempton [2] suggested the following empirical formula to estimate the Cc of remolded clays based on their liquid limit (LL):

Cc = 0.007 (LL − 10)

This equation suggests that the compression index increases linearly with the liquid limit. For normally consolidated clays, Terzaghi and Peck [1] recommended a similar relationship with a slightly higher coefficient:

Cc = 0.009 (LL − 10)

These correlations provide a straightforward method to estimate the compressibility of clays using their liquid limit.
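For example, for a clay with LL = 50, Skempton's relation gives Cc = 0.007 × (50 − 10) = 0.28, while Terzaghi and Peck's relation gives Cc = 0.009 × (50 − 10) = 0.36.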
Other studies have since refined these relationships and incorporated more variables for improved accuracy [5,6]. However, empirical models are often highly dependent on soil-specific characteristics and lead to less reliable predictions across diverse soil profiles [7]. Table 1 shows the main empirical correlations for determining Cc.
A recent study developed a novel gene expression programming (GEP) model to predict the compression index (Cc) of fine-grained soils using the liquid limit (LL), the plastic limit (PL), and the initial void ratio (e0), providing a cost-effective and time-efficient alternative to conventional methods while demonstrating superior performance in terms of R2, the RMSE, and the MAE [12]. Another study utilized single and multiple linear regression analyses to predict the compression index (Cc) of fine-grained remolded soils using basic soil properties, such as the liquid limit (LL), the plasticity index (PI), the optimum moisture content (OMC), the maximum dry density (MDD), and the DFS. The best-proposed equations showed an R2 of 0.95 and an average variation of −13.67% to +9.62%, making them a reliable tool when combined with engineering judgment [13].
Furthermore, a study evaluated the accuracy of an artificial neural network (ANN) model for predicting the compression index (Cc) by comparing it to laboratory values and the models proposed by Widodo and Singh [14]. The proposed ANN model achieved a mean target value of 0.5409 and a correlation coefficient (R2) of 0.939, outperforming the models by Slamet Widodo (R2 = 0.929) and Amardeep Singh (R2 = 0.892). The predicted Cc values from the ANN model showed a better distribution around the trend line, highlighting its superior accuracy and strong agreement with laboratory results. In another study [15], artificial neural network (ANN) methodologies were proposed as efficient alternatives to traditional 15-day consolidation tests for predicting the compression index (Cc) in fine-grained soils. Another study trained an ANN using a dataset of 560 high- and low-plasticity soil samples from Turkey, with input parameters such as the natural water content, the LL, the PL, the PI, and the initial void ratio. Using Matlab 2023a's Regression Learner program, the model achieved an R2 of 0.81, demonstrating its ability to provide reliable Cc predictions with fewer experiments and significantly shorter timeframes, making it a valuable tool for geotechnical engineering.
Genetic Programming (GP) is becoming increasingly popular in geotechnical engineering because of its flexibility and ability to model non-linear relationships. It has been successfully used to predict soil properties such as shear strength, permeability, and bearing capacity [16,17,18]. Research by Pham et al. [19] and Ahmadi et al. [20] shows that GP can accurately model soil parameters, especially when traditional methods fall short. Its strength lies in its evolutionary approach, in which candidate solutions adapt without needing predefined formulas, making the method ideal for handling complex, non-linear geotechnical data [21,22,23].
XGBoost, a powerful gradient-boosting technique, has gained recognition for its efficiency and accuracy in analyzing large, high-dimensional datasets. It has been widely used in civil and geotechnical engineering to predict properties like soil compaction and undrained shear strength [24,25]. Studies by Pal and Deswal [26] and Ma et al. [27] demonstrated that XGBoost often outperforms traditional machine learning models due to its robust regularization features and resistance to overfitting. Additionally, its interpretability, through feature importance metrics, provides valuable insights into how soil properties influence Cc [28,29,30].
Combining GP and XGBoost into a hybrid model addresses the limitations of each standalone approach. This hybridization leverages GP's adaptability and XGBoost's computational power, leading to improved accuracy and reliability in predicting complex parameters like soil strength and stability under varying conditions [17,31,32]. Recent studies have shown that hybrid models significantly enhance performance and can accommodate the variability seen across different soil types and conditions, making them highly adaptable solutions for predicting Cc [33].
Comparative studies, such as those by Deng et al. [34] and Shahin et al. [35], highlight that hybrid machine learning models consistently outperform traditional methods. By combining multiple techniques, hybrid models handle non-linear relationships more effectively and reduce prediction errors for parameters like Cc [36]. Despite these advancements, challenges remain in generalizing these models across various soil profiles and conditions, and further research is needed to validate their applicability in diverse geotechnical settings [29,37]. Additionally, incorporating interpretability techniques like SHAP (SHapley Additive exPlanations) could further clarify the relationship between soil properties and Cc, improving their acceptance in practice [38,39,40].
This paper explores the use of GP, XGBoost, and their hybrid model to predict Cc using a comprehensive geotechnical database, and demonstrates their potential in advancing soil behavior prediction.
The proposed method fills critical gaps in the prediction of the compression index (Cc) of fine-grained soils by addressing the limitations of traditional consolidation testing and empirical models. Conventional methods for determining Cc are time-consuming, requiring up to 15 days for test preparation, execution, and parameter calculation, which can significantly delay construction projects. Empirical formulas, while useful for initial estimates, often fail to generalize across diverse datasets due to their reliance on simplified assumptions and limited variables. These challenges underscore the need for advanced, efficient methodologies capable of delivering accurate and reliable Cc predictions while reducing time and resource demands.
To bridge these gaps, this study aims to develop and validate a novel hybrid machine learning model combining Genetic Programming (GP) and XGBoost. This hybrid approach uses the interpretability of GP and the computational efficiency of XGBoost to accurately predict Cc using easily measurable soil properties such as the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), and the water content (w). The research seeks to overcome the shortcomings of traditional methods by creating a model that not only delivers superior predictive accuracy but also generalizes effectively across diverse soil profiles. By validating the hybrid model against standalone GP and XGBoost models, as well as traditional empirical approaches, the study provides a robust and adaptable tool for geotechnical engineers, enabling faster, more cost-effective, and more reliable predictions of soil behavior.
2. Materials and Methods
2.1. Database
The database used in this study (352 data sets) comprises a detailed collection of geotechnical data focused on the properties of clay soils relevant to the prediction of the compression index (Cc). The key parameters in the database are the initial void ratio (e0), the liquid limit (LL), the plasticity index (PI), and the water content (w). The data were collected from Alhaji et al. [41], Benbouras et al. [42], McCabe et al. [43], Mitachi and Ono [44], Widodo and Ibrahim [45], and Zaman et al. [46].
These parameters were carefully chosen for their critical role in understanding and predicting soil compressibility. The liquid limit reflects the clay mineralogy and water-holding capacity and directly impacts compressibility during consolidation. The plasticity index indicates the range of moisture content over which the soil remains plastic and correlates with its deformation potential under load. The initial void ratio measures soil structure and density and plays a fundamental role in determining how much a soil will compress. Studies by Skempton [2], Terzaghi and Peck [1], and others have confirmed the importance of these parameters in empirical models for estimating Cc. Beyond theory, they are practical for field investigations, as they can be measured using standard laboratory methods. Together, these parameters provide a strong foundation for integrating established soil mechanics principles with advanced predictive modeling techniques.
In this study, an 80/20 train-test split was used, selected through a random sampling process with a fixed seed to ensure reproducibility. This ratio was chosen based on common machine learning practices, balancing the need for sufficient training data with reliable testing performance. Although no formal optimization algorithm was applied, we experimented with different splits (e.g., 70/30 and 90/10) to assess their impact on model performance. The 80/20 split consistently provided stable and accurate results. Additionally, 5-fold cross-validation was employed to enhance model robustness and mitigate the effects of data splitting bias.
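As an illustration, this splitting and cross-validation scheme can be reproduced with scikit-learn; the file name and column labels below are hypothetical placeholders rather than the study's actual files, making this a minimal sketch of the procedure described above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, KFold

# Load the database; file name and column labels are hypothetical.
df = pd.read_csv("cc_database.csv")
X = df[["LL", "PI", "e0", "w"]]
y = df["Cc"]

# 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# 5-fold cross-validation on the training portion.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for tr_idx, val_idx in kf.split(X_train):
    X_tr, X_val = X_train.iloc[tr_idx], X_train.iloc[val_idx]
    y_tr, y_val = y_train.iloc[tr_idx], y_train.iloc[val_idx]
    # ...fit a candidate model on (X_tr, y_tr) and score it on (X_val, y_val)...
```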
Table 2 shows the full database, which includes 352 observations with Cc values ranging from 0.050 to 1.64 and a mean of 0.241. Table 3 focuses on the training subset of 282 observations, which shows a slightly higher mean Cc of 0.244. Table 4 highlights the testing subset of 70 observations, with a slightly lower mean Cc of 0.230 and narrower parameter ranges. These metrics reflect consistent patterns across the subsets, which is essential for building reliable predictive models. The complete database is presented in Appendix A.
Figure 1 shows the Pearson correlation coefficients between the key variables: the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), the water content (w), and Cc. The Pearson correlation measures the strength of the linear relationship between two variables, with values ranging from −1 to +1: a value of +1 means a perfect positive relationship, −1 indicates a perfect negative relationship, and 0 means no linear relationship.
In the heatmap, strong correlations are shown in red, while weaker ones appear in blue. The strongest correlations with Cc are seen for e0 and w, with coefficients of 0.83 and 0.77, respectively. The LL and the PI also show moderate correlations with Cc, at 0.65 and 0.56; the two are themselves interdependent, as expected from their mathematical relationship (PI = LL − PL). While no feature was removed, in order to maintain the physical relevance of the geotechnical parameters, this dependency is acknowledged as a limitation of the study. These findings highlight that e0 and w are the most influential factors in predicting Cc.
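A minimal sketch of how the correlation matrix behind Figure 1 can be computed and visualized, assuming pandas and matplotlib and the hypothetical DataFrame `df` from the previous sketch:

```python
import matplotlib.pyplot as plt

# Pearson correlation matrix for the five variables (df as loaded above).
corr = df[["LL", "PI", "e0", "w", "Cc"]].corr(method="pearson")

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)  # red = strong positive
ax.set_xticks(range(len(corr.columns)), labels=corr.columns)
ax.set_yticks(range(len(corr.columns)), labels=corr.columns)
fig.colorbar(im, label="Pearson r")
plt.show()
```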
To prepare the dataset for analysis, data cleaning steps such as outlier removal using Z-scores (z = (x − x̄)/σ) and missing value imputation were conducted.
Normalization was applied to scale the data between 0 and 1, ensuring uniformity:

Xnorm = (X − Xmin) / (Xmax − Xmin)
This ensures that all variables contribute equally to model training.
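A minimal sketch of this scaling step, assuming scikit-learn's MinMaxScaler as one standard implementation of the equation above and the train/test arrays from the earlier sketch:

```python
from sklearn.preprocessing import MinMaxScaler

# Min-max scaling to [0, 1]; fit on the training data only so that the
# test set is scaled with the training minima/maxima.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```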
To improve the quality of the dataset and enhance the robustness of the predictive models, outlier detection and removal were conducted using the Boxplot method, a widely used statistical technique based on the interquartile range (IQR). This method helps to identify data points that deviate significantly from the central distribution, which could otherwise negatively impact model performance.
The Boxplot method identifies outliers using the following criteria:

Lower bound = Q1 − 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR

where Q1 is the first quartile (25th percentile), Q3 is the third quartile (75th percentile), and IQR is the interquartile range (Q3 − Q1). Data points falling below the lower bound or above the upper bound were considered outliers. This method was applied to all continuous input features, including the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), the natural moisture content (w), and the fines content.
The outlier removal procedure involved four key steps to ensure data consistency and improve model performance. First, boxplots were generated for each feature to visualize data distribution and detect extreme values. Next, outliers were identified using the interquartile range (IQR) thresholds, which helped in pinpointing data points that deviated significantly from the norm. These identified outliers were then removed from the dataset to minimize the risk of skewed model predictions. Finally, the dataset was re-evaluated to confirm the absence of any remaining influential outliers that could bias the models. The impact of outlier removal was significant, as it enhanced model accuracy by reducing the influence of extreme values, improved generalization capability by ensuring the model was trained on data representative of typical geotechnical conditions, and stabilized the feature importance rankings, leading to more reliable interpretations of the factors affecting the compression index (Cc).
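The four-step IQR screening described above can be sketched as follows; the DataFrame and column names are assumptions carried over from the earlier examples:

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Drop rows falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in any column."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[mask]

df_clean = remove_iqr_outliers(df, ["LL", "PI", "e0", "w"])
```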
In this study, GP and XGBoost were selected based on their complementary strengths in handling complex, non-linear relationships inherent in geotechnical datasets. GP excels at generating interpretable symbolic regression models, providing insights into the mathematical relationships between soil properties and the compression index (Cc). XGBoost, on the other hand, is a robust ensemble learning algorithm known for its high predictive accuracy, scalability, and ability to handle non-linear feature interactions effectively.
2.2. Multiple Linear Regression (MLR)
Multiple linear regression models the relationship between the input features and the target variable (Cc) by assuming a linear relationship. The general formula for a multivariate linear regression model is as follows:

Cc = β0 + Σi βiXi + ϵ

where Cc is the compression index, Xi are the input features (e.g., e0, the LL, the PI), β0 and βi are the coefficients determined via least squares, and ϵ is the error term.
The model's coefficients were calculated by minimizing the residual sum of squares (RSS):

RSS = Σi (Cc,i − Ĉc,i)²

where Cc,i is the actual compression index and Ĉc,i is the predicted compression index.
Multiple linear regression (MLR) is often used as a baseline model to explore the simplest relationships in the data. Its performance is evaluated using metrics like the coefficient of determination (R2), the mean squared error (MSE), and the mean absolute error (MAE). R2 indicates how much of the variation in the dependent variable is explained by the independent variables, with values closer to 1 showing a better fit.
The MSE measures the average squared difference between the observed and predicted values, with lower values indicating greater accuracy. The MAE, on the other hand, calculates the average absolute difference between the observed and predicted values, providing a straightforward measure of prediction errors. The Mean Absolute Percentage Error (MAPE) measures the accuracy of a predictive model by calculating the average of the absolute percentage differences between the actual values and the predicted values. The Root Mean Square Error (RMSE) measures the square root of the average of the squared differences between the actual and predicted values. Together, these metrics offer a comprehensive understanding of the model’s accuracy and reliability.
R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²
MSE = (1/n) Σi (yi − ŷi)²
RMSE = √[(1/n) Σi (yi − ŷi)²]
MAE = (1/n) Σi |yi − ŷi|
MAPE = (100%/n) Σi |(yi − ŷi)/yi|

where n is the number of observations, yi represents the actual value, ŷi represents the predicted value, and ȳ represents the mean of the actual values.
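For illustration, these metrics can be computed as follows; scikit-learn provides R2, the MSE, and the MAE directly, while the RMSE and MAPE are derived here, and the arrays are placeholders rather than the study's data:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([0.21, 0.35, 0.18, 0.42])   # placeholder actual Cc values
y_pred = np.array([0.24, 0.31, 0.20, 0.40])   # placeholder predictions

r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
print(f"R2={r2:.3f} MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} MAPE={mape:.1f}%")
```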
2.3. Genetic Programming (GP)
Genetic Programming (GP) is an evolutionary algorithm that generates symbolic models to predict Cc. It begins with a population of random equations and iteratively refines them using crossover, mutation, and selection. The fitness of each candidate equation is evaluated through the mean squared error (MSE) between its predicted and the actual Cc values, so that lower-error equations are favored for reproduction. Operations such as crossover combine parts of two parent models.
Mutation introduces diversity by randomly altering parts of the equation. This process continues until convergence to an optimal model or a predefined termination criterion (e.g., maximum iterations).
GP excels in capturing nonlinear relationships and produces interpretable equations for Cc. However, it is computationally expensive compared to simpler methods.
Figure 2 illustrates the basic workflow of GP, which is inspired by the principles of natural selection and evolution. The process begins with the generation of an initial random population of potential solutions, often represented as symbolic mathematical expressions or trees. Each individual in this population is evaluated, and a fitness score is assigned based on its ability to accurately model the target outcome (e.g., predicting the compression index, Cc). The selection function then identifies the most "fit" individuals, which are chosen to participate in the next generation. These selected individuals undergo genetic operations such as crossover (a recombination of parts from two parent solutions) and mutation (random alterations to introduce diversity), creating a new population with potentially improved solutions. The algorithm continues to iterate through these steps (selection, crossover, mutation, and fitness evaluation) until a specified termination condition is met, such as reaching a maximum number of generations or achieving an acceptable fitness level. Once the process concludes, the algorithm outputs the best-performing solution, representing the optimized predictive model.
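A hedged sketch of such a GP workflow using the gplearn library; the hyperparameter values are illustrative only, since the study does not report its exact GP settings:

```python
from gplearn.genetic import SymbolicRegressor

gp = SymbolicRegressor(
    population_size=1000,            # size of the random initial population
    generations=50,                  # evolution budget / termination criterion
    function_set=("add", "sub", "mul", "div", "sqrt", "log"),
    parsimony_coefficient=0.001,     # penalizes overly long equations
    random_state=42,
)
gp.fit(X_train, y_train)             # X_train, y_train from the split sketch
print(gp._program)                   # the best evolved symbolic expression
```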
2.4. XGBoost
XGBoost is a gradient-boosting framework that constructs a series of decision trees to minimize prediction errors. It uses a regularized objective function:

Obj = Σi l(yi, ŷi) + Σk Ω(fk)

where l is the loss function (e.g., the squared error loss) and Ω is a regularization term penalizing the complexity of each tree fk. At each iteration, a new tree is fitted to the residuals of the previous trees:

ŷi(t) = ŷi(t−1) + ft(xi)

The final prediction in XGBoost is a weighted sum of all the decision trees. XGBoost is particularly strong because it can handle missing data and incorporates regularization to prevent overfitting; it also uses parallel processing to optimize performance. In our study, key hyperparameters, such as the learning rate and tree depth, were carefully adjusted through cross-validation to enhance accuracy. Metrics like R2 and RMSE were used to evaluate the model's performance.
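A minimal sketch of this tuning procedure with the xgboost and scikit-learn libraries; the parameter grid is illustrative, not the grid used in the study:

```python
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    XGBRegressor(n_estimators=500, objective="reg:squarederror"),
    param_grid={"learning_rate": [0.01, 0.05, 0.1], "max_depth": [3, 5, 7]},
    cv=5,                 # 5-fold cross-validation, as used in the study
    scoring="r2",
)
search.fit(X_train, y_train)
xgb = search.best_estimator_   # tuned model used in the later sketches
```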
Figure 3 illustrates the workflow of XGBoost, an advanced machine learning algorithm based on the gradient boosting framework. The process begins with a dataset that is used to train an ensemble of decision trees f1, f2, …, fK, each represented by its unique parameters θ1, θ2, …, θK. The first tree (f1) generates initial predictions, and the residuals (the errors between the predicted and actual values) are calculated. These residuals are then passed to the next tree (f2), which aims to correct the previous errors. This iterative process continues, with each subsequent tree learning from the residuals of the previous model. The node splitting in each tree is optimized based on an objective function, which improves predictive accuracy. Finally, the outputs from all trees are combined through an additive process, resulting in a high-performing predictive model. This ensemble approach enhances accuracy, reduces overfitting, and ensures efficient computation.
2.5. Hybrid (GP-XGBoost)
Combining GP and XGBoost creates a powerful hybrid methodology that integrates the interpretability of GP with the robust predictive capabilities of XGBoost. This combination is particularly valuable for solving complex problems where understanding the underlying relationships is as important as achieving high predictive accuracy. Below is a comprehensive outline of how these methods can be combined. In this research, Method 2 was employed, as it showed better predictive performance.
Method 1. Sequential Hybrid Approach
In this approach, GP is used as a preprocessing step to create features or refine the input data for XGBoost, which then focuses on predictive modeling.
Step 1: Feature Engineering with GP
1. Symbolic Regression: GP is employed to explore the relationships between the input variables (e.g., the LL, the PI, and e0) and the target variable (Cc), producing symbolic equations that describe these relationships.
2. Feature Transformation: The symbolic equations are converted into new features.
3. Feature Selection: The symbolic features are evaluated based on their importance and correlation with the target variable, and only the most relevant features are retained for the next step.
Step 2: Predictive Modeling with XGBoost
1. Dataset Augmentation: The refined features from GP are added to the original dataset, enriching the input space for XGBoost.
2. Training XGBoost: XGBoost is trained on the augmented dataset. The combination of raw and GP-derived features allows XGBoost to capture complex patterns and residual errors that are not fully explained by GP.
3. Evaluation: The performance of the model is assessed using metrics such as the MSE and R2.
By using GP's symbolic equations, this approach improves XGBoost's ability to generalize complex nonlinear relationships, as sketched below.
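A compact sketch of Method 1, assuming the `gp` model and the train/test arrays from the earlier examples; here, the GP output is appended as a single engineered feature:

```python
import numpy as np
from xgboost import XGBRegressor

# GP's symbolic prediction becomes an additional engineered feature.
X_train_aug = np.column_stack([X_train, gp.predict(X_train)])
X_test_aug = np.column_stack([X_test, gp.predict(X_test)])

xgb_seq = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=5)
xgb_seq.fit(X_train_aug, y_train)
score = xgb_seq.score(X_test_aug, y_test)   # R2 on the held-out set
```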
Method 2. Integrated Hybrid Approach
This approach iteratively combines GP and XGBoost, with feedback loops allowing both methods to influence each other.
Step 1: Initial Training with XGBoost
1. XGBoost is trained on the raw features to establish a baseline model.
2. The feature importances from XGBoost are extracted to identify which features contribute most to the predictions.
Step 2: Feedback to GP for Feature Discovery
1. GP uses the ranked features from XGBoost to focus on the most influential variables.
2. It generates symbolic relationships, such as Equations (11) and (12).
3. The GP-derived features are added to the original dataset.
Step 3: Iterative Refinement
1. The new dataset, enriched by GP, is fed back into XGBoost for retraining.
2. XGBoost's predictions and feature importances are analyzed, and further refinement of the features is performed by GP.
3. This feedback loop continues until convergence or until the performance improvement stagnates, as sketched below.
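The loop can be sketched as follows, with an illustrative refinement budget and a held-out validation slice standing in for the study's convergence criterion; all names and settings are assumptions, not the study's reported configuration:

```python
import numpy as np
from xgboost import XGBRegressor
from gplearn.genetic import SymbolicRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hold out a validation slice of the training data to monitor the loop.
X_tr, X_val, y_tr, y_val = train_test_split(
    np.asarray(X_train, dtype=float), y_train, test_size=0.2, random_state=0
)

best_r2 = -np.inf
for iteration in range(3):  # small, illustrative refinement budget
    model = XGBRegressor(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
    r2 = r2_score(y_val, model.predict(X_val))
    if r2 <= best_r2:       # stop when improvement stagnates
        break
    best_r2 = r2
    # GP focuses on the three most influential inputs ranked by XGBoost.
    top = np.argsort(model.feature_importances_)[::-1][:3]
    gp_fb = SymbolicRegressor(generations=20, random_state=0).fit(X_tr[:, top], y_tr)
    # Append the GP-derived feature and retrain in the next iteration.
    X_tr = np.column_stack([X_tr, gp_fb.predict(X_tr[:, top])])
    X_val = np.column_stack([X_val, gp_fb.predict(X_val[:, top])])
```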
Method 3. Parallel Hybrid Approach
In this approach, GP and XGBoost work independently, and their outputs are combined for the final prediction.
Step 1: Independent Training
1. GP is trained independently to create symbolic models for predicting Cc, generating the output CcGP.
2. XGBoost is trained independently on the same dataset to produce the output CcXGBoost.
Step 2: Weighted Combination of Outputs
1. The predictions from both models are combined using a weighted averaging approach:

Cc,hybrid = w1 · CcGP + w2 · CcXGBoost

where w1 and w2 are the weights determined through cross-validation.
2. Alternatively, a stacking ensemble can be used, where the outputs of GP and XGBoost serve as inputs to a meta-model that learns how to combine them optimally.
Step 3: Final Model Evaluation
The combined predictions are evaluated using standard performance metrics, ensuring that the complementary strengths of GP (interpretability) and XGBoost (accuracy) are fully utilized.
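A minimal sketch of the weighted combination, assuming the fitted `gp` and `xgb` models from the earlier examples and a simple grid search over w1 (with w2 = 1 − w1) in place of the study's full cross-validation; a separate validation split would normally be used for selecting the weight:

```python
import numpy as np

pred_gp = gp.predict(X_test)      # Cc_GP
pred_xgb = xgb.predict(X_test)    # Cc_XGBoost
y_arr = np.asarray(y_test)

best_w, best_mse = 0.0, np.inf
for w1 in np.linspace(0.0, 1.0, 21):          # coarse grid over the weight
    blend = w1 * pred_gp + (1.0 - w1) * pred_xgb
    mse = np.mean((y_arr - blend) ** 2)
    if mse < best_mse:
        best_w, best_mse = w1, mse

cc_hybrid = best_w * pred_gp + (1.0 - best_w) * pred_xgb
```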
The hybrid GP-XGBoost approach combines the best features of two powerful methods and offers distinct advantages. First, it improves interpretability through GP's symbolic equations, which clearly show the relationships between the input variables and the target. This is especially useful in geotechnical engineering, where understanding soil behavior is key to making informed decisions. Second, the hybrid model boosts predictive accuracy by using XGBoost's ability to handle complex, nonlinear patterns and residual errors. The GP-derived features enrich the dataset and allow XGBoost to work with a richer, more meaningful input space, which often leads to better generalization and reduced overfitting. The method is also highly adaptable, making it suitable for various datasets and applications where both precision and clarity are important.
The GP-XGBoost model is particularly effective for problems that involve complex nonlinear relationships and require both high accuracy and easy interpretation. GP’s symbolic equations help engineers understand the key factors driving these properties, while XGBoost ensures dependable predictions. Beyond geotechnics, the hybrid model is valuable in areas like environmental science where interactions between soil, water, and plants are critical. By combining symbolic reasoning with advanced machine learning, this hybrid approach bridges the gap between explainable and high-performance predictive models.
4. Discussion
4.1. Residual Analysis and Error Distribution for Model Performance Evaluation
Figure 8 compares the results of the different models using several graphs. The residual plots show the differences between the actual and predicted values, which help assess accuracy and bias. Ideally, the residuals should be evenly spread around zero, indicating no bias in the model's predictions. Models like GP and XGBoost perform well, with residuals mostly close to zero, showing good accuracy; scattered residuals, on the other hand, highlight areas where predictions were less accurate. These graphs also reveal the range of errors: narrower distributions around zero indicate better performance, as seen for GP and XGBoost, while wider or skewed distributions suggest lower accuracy or potential biases.
The scatter matrix in Figure 8 compares the actual Cc values with the predicted ones for each model. Points closer to the diagonal line indicate more accurate predictions. The GP and XGBoost models show tight clustering along the diagonal, reflecting higher accuracy, while scattered points in the other models reveal more errors. This visualization is useful for identifying how well each model generalizes to new data and whether there are any consistent issues. It offers a clear view of the models' strengths and weaknesses, making it easier to understand their overall performance.
4.2. Feature Importance
Feature importance is a key concept in machine learning that helps explain which factors have the biggest impact on a model’s predictions. It shows how much each variable contributes to the model’s performance and helps researchers understand the relationships within the data. The method for calculating feature importance depends on the model. For example, in tree-based models like XGBoost, it can be measured by how much a feature reduces errors or how often it is used in decision splits. In GP models, importance is assessed by analyzing symbolic equations to see how strongly a feature influences the output. For hybrid models like GP-XGBoost, these methods are combined with techniques like permutation importance or SHAP values, which quantify how much each feature contributes to the predictions. These tools make it easier to rank features, interpret results, and improve the model.
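For illustration, both of the model-agnostic techniques mentioned above are available off the shelf; the sketch below assumes the fitted `xgb` model and the test arrays from the earlier examples:

```python
from sklearn.inspection import permutation_importance
import shap

# Permutation importance: mean drop in score when each feature is shuffled.
perm = permutation_importance(xgb, X_test, y_test, n_repeats=10, random_state=0)
print(dict(zip(["LL", "PI", "e0", "w"], perm.importances_mean)))

# SHAP values: per-prediction attribution, efficient for tree ensembles.
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=["LL", "PI", "e0", "w"])
```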
The feature importance results in Figure 9 reveal how different soil properties, such as the initial void ratio (e0), the liquid limit (LL), the plasticity index (PI), and the water content (w), affect the prediction of the compression index (Cc). The initial void ratio (e0) stands out as the most important factor, especially in the GP-XGBoost hybrid model, where it has the highest importance score of 0.55. This highlights its critical role in soil compressibility, as it reflects the soil's structure and potential for volume change. The LL and the PI also contribute significantly, particularly in the GP-XGBoost and GP models, with importance scores of 0.240 and 0.220, respectively. These properties are linked to soil composition and plasticity, making them reliable indicators of compressibility. Interestingly, the water content (w) is moderately important in the hybrid model, with a score of 0.270, showing its effect on soil behavior. MLR assigns less importance to all features, especially the LL and w, due to its limited ability to handle complex relationships. These results demonstrate how GP-XGBoost combines the strengths of both GP and XGBoost to give a detailed understanding of soil behavior through feature analysis.
It is important to note that the dependency between the liquid limit (LL) and the plasticity index (PI) may have influenced the feature importance rankings. Since the PI is derived from the LL and the PL, this relationship introduces collinearity, potentially inflating the importance of these features in the predictive models. Although the GP-XGBoost model is robust against multicollinearity, this dependency could affect the interpretability of the results, and it is acknowledged as a limitation of this study.
This feature importance ranking aligns with fundamental geotechnical principles and reflects the critical role of soil composition and structure in compressibility behavior.
The LL and the PI are indicators of a soil’s clay mineral content and plasticity, which directly affect compressibility. Higher LL values are typically associated with fine-grained soils rich in montmorillonite or kaolinite, which have a greater capacity to absorb water and undergo volume changes under load. The PI reflects the soil’s ability to deform plastically, and a higher PI usually correlates with the greater rearrangement of particles during consolidation, resulting in higher Cc values. The strong feature importance of LL and PI in the model suggests that mineralogical composition and interparticle bonding are key mechanisms driving compressibility.
The initial void ratio is a fundamental measure of the porosity and packing density of soil particles. Soils with a high e0 have more interparticle voids that make them more susceptible to settlement when subjected to load. The model’s emphasis on e0 highlights the significance of soil structure, particularly the arrangement of particles and pore spaces, in controlling the magnitude of primary consolidation.
The natural moisture content influences the pore water pressure, effective stress, and the ease with which particles can rearrange under loading conditions. Soils with higher moisture content tend to have weaker interparticle forces and facilitate greater compression.
The feature importance analysis suggests that the compression index (Cc) is governed by a combination of the following features:
- Mineralogical properties (the LL, the PI, and the specific gravity), which affect plasticity and particle interaction;
- Structural characteristics (the initial void ratio), which influence particle arrangement and porosity;
- Hydrological conditions (the natural moisture content), which impact pore water dynamics and effective stress.
These findings align with the classical consolidation theory, where Cc is a function of both soil composition and initial structural conditions. Machine learning models not only confirm these geotechnical principles but also provide quantitative evidence of the relative importance of each factor.
4.3. Comparison with the Literature
Figure 10 compares the Cc values predicted by the different models with the actual observed values. The models include GP-XGBoost, Terzaghi and Peck [1], Azzouz et al. [8], Koppula [9], and Yoshinaka and Osada [11]. The GP-XGBoost model closely matches the actual values, especially in the lower Cc ranges. In contrast, the empirical models show larger deviations: Terzaghi and Peck [1] often overestimates Cc, while Azzouz et al. [8] significantly underestimates it. Koppula [9] and Yoshinaka and Osada [11] provide more consistent results but lack the precision of GP-XGBoost.
These results highlight the strength of machine learning models like GP-XGBoost in capturing complex relationships in the data. They outperform empirical models, which are limited by their assumptions and simplified formulas. The graph also shows the variability in soil behavior and the difficulty that empirical methods have in generalizing across different conditions. While traditional models are useful for initial estimates, GP-XGBoost offers higher reliability and accuracy. This makes it a valuable tool for geotechnical engineering and emphasizes the importance of modern, data-driven techniques.
The results in Table 14 show that the GP-XGBoost hybrid model performs better than the empirical models in predicting Cc. The GP-XGBoost model achieved a high R2 of 0.927 and a low MSE of 0.027, indicating strong accuracy and minimal error. In comparison, the empirical models had much lower R2 values: Terzaghi and Peck [1] had the highest R2 at 0.149, while the others had R2 values below 0.1, indicating a poor fit or overestimation. The MAE values for the empirical models were also much higher, ranging from 0.090 for Terzaghi and Peck [1] to 0.189 for Azzouz et al. [8]. These results show that GP-XGBoost is effective at capturing complex patterns in the data, whereas the traditional empirical models fail to represent these relationships accurately.
4.4. Limitations and Future Work
The hybrid GP-XGBoost approach is effective but faces challenges in computation and practical use. GP’s symbolic regression can create complex equations that are hard to use in real-time. XGBoost also needs significant computing power for its iterative training on enhanced datasets. This may not be feasible for small engineering firms or in developing areas. To address this, future studies can focus on simplifying GP-generated equations or applying dimensionality reduction techniques. Cloud-based or easy-to-use software can make the models more accessible.
Overfitting is another concern in machine learning: a model can produce good results on training data but poor results on new data. This study used normalization and cross-validation, but more work is needed. Testing with data from different regions would help, and techniques like dropout regularization and ensemble methods can reduce overfitting further. Tools like SHAP values can show how features affect predictions, which will build trust and support the wider use of the models.
One limitation of this study is the inherent dependency between some input parameters, particularly the LL and the PI. Since the PI is calculated directly from the LL and the PL, this introduces collinearity, which may affect the stability and interpretability of feature importance in the models. Although the hybrid GP-XGBoost model is capable of handling such dependencies, future research could benefit from employing dimensionality reduction techniques or regularization methods to mitigate these effects and enhance model robustness.
Building upon the findings and acknowledging the limitations of this study, several directions for future research are proposed to enhance the robustness, generalizability, and interpretability of the predictive models for the compression index (Cc):
- Future studies should explore dimensionality reduction techniques, such as Principal Component Analysis (PCA), to transform correlated features into uncorrelated components. Alternatively, regularization methods like LASSO regression can be applied to reduce the influence of redundant variables.
- The current dataset, while comprehensive, lacks geographical and geological diversity. Future work should incorporate larger, multi-regional datasets that cover a wider range of soil types, climatic conditions, and geotechnical properties.
- To ensure the generalizability of the proposed model, future research should involve external validation using independent datasets from different projects or geographical locations.
- Although the GP-XGBoost hybrid model offers high accuracy, its complex symbolic equations can hinder interpretability. Future studies could focus on developing simplified symbolic models by incorporating genetic simplification algorithms or rule-based pruning techniques to generate more concise, physically interpretable expressions that align with geotechnical principles.
- While the hybrid GP-XGBoost model demonstrated strong performance, future research could explore advanced ensemble techniques such as stacking, blending, or meta-learning frameworks to further enhance predictive accuracy. Additionally, integrating deep learning models like Recurrent Neural Networks (RNNs) or attention mechanisms could improve the capture of complex temporal and spatial patterns in geotechnical data.
4.5. Statistical Significance Analysis and Model Comparison
To ensure the reliability of the predictive models, 30 independent runs were conducted for each model (MLR, GP, XGBoost, and GP-XGBoost) with different random seeds to account for variability in data splitting and model initialization. The performance metrics (R2, MSE, and MAE) were calculated for each run, allowing a comprehensive assessment of each model's performance. The mean and standard deviation of these metrics were computed to quantify the models' central tendency and variability, as shown in Table 15.
These results indicate that the hybrid GP-XGBoost model consistently outperformed other models with the highest mean R2 and the lowest MSE and MAE.
4.5.1. One-Way ANOVA Test
To determine whether the observed differences in model performance were statistically significant, a one-way Analysis of Variance (ANOVA) test was applied. ANOVA is a robust statistical method used to compare the means of multiple groups (in this case, the performance metrics of the different models) and assess whether at least one model performs differently from the others. The null hypothesis (H0) assumes no significant difference among the models, while the alternative hypothesis (H1) suggests that at least one model shows superior performance. The F-statistic and corresponding p-values were calculated for each performance metric (R2, MSE, and MAE). A p-value less than 0.05 indicates that the null hypothesis can be confidently rejected, confirming that significant performance differences exist between the models (refer to Table 16).
4.5.2. Post-Hoc Tukey’s HSD Test
Following the ANOVA test, which identified significant differences among the models, a Tukey's Honestly Significant Difference (HSD) test was conducted as a post-hoc analysis. Tukey's HSD test is designed to determine which specific model pairs exhibit statistically significant differences. The method compares all possible pairs of models and adjusts for multiple comparisons to control the family-wise error rate. The test provides p-values that indicate the likelihood that the performance differences between two models occurred by chance; p-values less than 0.05 suggest a statistically significant difference between model performances. This analysis offers a detailed understanding of how each model compares against the others, as summarized in Table 17.
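These two tests can be reproduced with scipy and statsmodels; the 30-run score arrays below are random placeholders for illustration, not the study's results:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Placeholder R2 scores from 30 runs per model (replace with real results).
r2_mlr, r2_gp, r2_xgb, r2_hyb = (rng.uniform(0.6, 0.95, 30) for _ in range(4))

# One-way ANOVA: H0 = all models have the same mean R2.
f_stat, p_value = f_oneway(r2_mlr, r2_gp, r2_xgb, r2_hyb)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Post-hoc Tukey HSD: identifies which specific model pairs differ.
if p_value < 0.05:
    scores = np.concatenate([r2_mlr, r2_gp, r2_xgb, r2_hyb])
    labels = np.repeat(["MLR", "GP", "XGBoost", "GP-XGBoost"], 30)
    print(pairwise_tukeyhsd(scores, labels, alpha=0.05))
```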
4.5.3. Model Validation Techniques
The model validation techniques employed in this study were designed to rigorously assess the predictive performance and reliability of the proposed models. Initially, the dataset was randomly split into 80% training and 20% testing subsets to evaluate model generalization, with random seeds used to ensure reproducibility across multiple runs. To further reduce the risk of overfitting and obtain robust performance metrics, a 5-fold cross-validation approach was implemented, in which the dataset was divided into five subsets, using four for training and one for validation in a rotating manner. To enhance the reliability of the results, the entire modeling process was repeated 30 times with different random splits, and the mean and standard deviation of the key performance metrics (R2, MSE, and MAE) were calculated to assess model stability. Finally, statistical significance tests, including ANOVA and Tukey's HSD, were conducted to verify that the performance differences between models were statistically significant, adding an extra layer of validation to this study's findings.
5. Conclusions
This study proposed a hybrid GP-XGBoost model for predicting the compression index (Cc) of clayey soils, integrating the symbolic regression capabilities of Genetic Programming (GP) with the robust predictive power of Extreme Gradient Boosting (XGBoost). The model demonstrated superior performance compared to traditional empirical methods and standalone machine learning models, achieving an R2 of 0.927 with significantly reduced prediction errors. The feature importance analysis highlighted key geotechnical parameters—such as the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), and the natural moisture content (wn)—as critical factors influencing soil compressibility, aligning with established soil mechanics principles.
Despite the promising results, this study acknowledges several limitations, including parameter dependencies (e.g., the LL and the PI), limited dataset diversity, and potential overfitting risks. To address these challenges, future research should incorporate dimensionality reduction techniques, external validation with independent datasets, and simplified symbolic models for enhanced interpretability. Additionally, expanding the dataset to include diverse soil types and environmental conditions will improve the model’s generalizability in real-world applications.
The findings of this research offer valuable insights for geotechnical engineers involved in foundation design, settlement analysis, and infrastructure planning, providing a data-driven approach to complement traditional soil mechanics theories. By advancing predictive modeling techniques and addressing the identified limitations, future studies can further improve the accuracy, reliability, and practical applicability of machine learning models in geotechnical engineering.