Article

Predicting the Compression Index of Clayey Soils Using a Hybrid Genetic Programming and XGBoost Model

1
School of Engineering, Deakin University, Geelong, VIC 3216, Australia
2
Department of Civil Engineering, La Trobe University, Bundoora, VIC 3086, Australia
3
Melbourne School of Design, The University of Melbourne, Parkville, VIC 3052, Australia
4
School of Civil Engineering, Guangzhou University, Guangzhou 510006, China
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(4), 1926; https://doi.org/10.3390/app15041926
Submission received: 1 January 2025 / Revised: 2 February 2025 / Accepted: 6 February 2025 / Published: 13 February 2025
(This article belongs to the Special Issue AI-Based Data Science and Database Systems)

Featured Application

This study introduces a novel hybrid method combining Genetic Programming (GP) and XGBoost to accurately predict the compression index (Cc) of clayey soils. The proposed model offers a powerful tool for geotechnical engineers to assess soil compressibility with higher precision and aids in the design and analysis of foundations, earth structures, and settlement calculations. The method’s ability to handle complex, nonlinear relationships makes it particularly valuable for projects involving diverse soil types and challenging site conditions. Its application can significantly enhance the reliability of geotechnical assessments and streamline the design process for critical infrastructure projects.

Abstract

The accurate prediction of the compression index (Cc) is crucial for understanding the settlement behavior of clayey soils, which is a key factor in geotechnical design. Traditional empirical models, while widely used, often fail to generalize across diverse soil conditions due to their reliance on simplified assumptions and regional dependencies. This study proposes a novel hybrid method combining Genetic Programming (GP) and XGBoost. A large database (385 records) of geotechnical properties, including the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), and the water content (w), was used. The hybrid GP-XGBoost model achieved remarkable predictive performance, with R2 values of 0.966 and 0.927 and mean squared error (MSE) values of 0.001 and 0.001 for the training and testing datasets, respectively. The mean absolute error (MAE) was also exceptionally low at 0.030 for training and 0.028 for testing. Comparative analysis showed that the hybrid model outperformed the standalone GP (R2 = 0.934, MSE = 0.003) and XGBoost (R2 = 0.939, MSE = 0.002) models, as well as traditional empirical methods such as Terzaghi and Peck (R2 = 0.149, MSE = 0.090). Key findings highlighted that the initial void ratio and water content are the most influential predictors of Cc, with feature importance scores of 0.55 and 0.27, respectively. The novelty of the proposed method lies in its ability to combine the interpretability of GP with the computational efficiency of XGBoost, resulting in a robust and adaptable predictive tool. This hybrid approach has the potential to advance geotechnical engineering practices by providing accurate and interpretable models for diverse soil profiles and complex site conditions.

1. Introduction

The compression index (Cc) is a crucial parameter in geotechnical engineering for predicting the settlement behavior of clayey soils under consolidation. Its accurate estimation is fundamental for foundation design and soil stability analysis. Conventionally, Cc is empirically estimated using correlations based on soil index properties (e.g., the liquid limit, the plasticity index) or determined directly through laboratory consolidation tests [1,2]. While these traditional methods provide useful estimations, they are often limited by regional soil characteristics and laboratory constraints [3]. Recent advancements in machine learning (ML) have enabled more accurate, adaptable predictions of geotechnical parameters like Cc [4].
Many empirical models have been developed to correlate Cc with soil properties such as the natural water content, the liquid limit, and the void ratio. Skempton [2] proposed one of the earliest models, correlating Cc with the liquid limit for normally consolidated clays.
In 1944, Skempton [2] proposed an empirical formula to estimate the Cc of remolded clay based on its liquid limit (LL):
Cc = 0.007 × (LL − 10)
This equation suggests that the compression index increases linearly with the liquid limit.
For normally consolidated clays, Terzaghi and Peck [1] recommended a similar relationship with a slightly higher coefficient:
Cc = 0.009 × (LL − 10)
These correlations provide a straightforward method to estimate the compressibility of clays using their liquid limit.
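Because both correlations are simple linear functions of the liquid limit, they are straightforward to express in code. The sketch below implements the two equations above; the function names are ours, not from any standard library:

```python
def cc_skempton(ll):
    """Skempton (1944): Cc = 0.007 * (LL - 10), for remolded clays."""
    return 0.007 * (ll - 10)

def cc_terzaghi_peck(ll):
    """Terzaghi and Peck: Cc = 0.009 * (LL - 10), for normally consolidated clays."""
    return 0.009 * (ll - 10)

# Example: a clay with a liquid limit of 50%
print(round(cc_skempton(50), 3))       # 0.28
print(round(cc_terzaghi_peck(50), 3))  # 0.36
```

Note that both formulas return zero at LL = 10, reflecting the assumption of negligible compressibility for soils at or below that liquid limit.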
Other studies have since refined these relationships and incorporated more variables for improved accuracy [5,6]. However, empirical models are often highly dependent on soil-specific characteristics and lead to less reliable predictions across diverse soil profiles [7]. Table 1 shows the main empirical correlations for determining Cc.
A recent study developed a novel gene expression programming (GEP) model to predict the compression index (Cc) of fine-grained soils using the liquid limit (LL), the plastic limit (PL), and the initial void ratio (e0), providing a cost-effective and time-efficient alternative to conventional methods while demonstrating superior performance in terms of R2, the RMSE, and the MAE [12]. Another study applied single and multiple linear regression (MLR) analyses to predict the compression index (Cc) of fine-grained remolded soils using basic soil properties, such as the liquid limit (LL), the plasticity index (PI), the optimum moisture content (OMC), the maximum dry density (MDD), and the DFS. The best-performing equations showed an R2 of 0.95 and an average variation of −13.67% to +9.62%, making them a reliable tool when combined with engineering judgment [13].
Furthermore, a study evaluated the accuracy of an artificial neural network (ANN) model for predicting the compression index (Cc) by comparing it to laboratory values and the models proposed by Widodo and Singh [14]. The proposed ANN model achieved a mean target value of 0.5409 and a coefficient of determination (R2) of 0.939, outperforming the models by Slamet Widodo (R2 = 0.929) and Amardeep Singh (R2 = 0.892). The predicted Cc values from the ANN model demonstrated a better distribution around the trend line, highlighting its superior accuracy and strong agreement with laboratory results. In another study [15], artificial neural network (ANN) methodologies were proposed as efficient alternatives to traditional 15-day consolidation tests for predicting the compression index (Cc) in fine-grained soils. Another study trained an ANN using a dataset of 560 high- and low-plasticity soil samples from Turkey, with input parameters such as the natural water content, the LL, the PL, the PI, and the initial void ratio. Using Matlab 2023a’s regression learner program, the model achieved an R2 of 0.81, demonstrating its ability to provide reliable Cc predictions with fewer experiments and significantly shorter timeframes, making it a valuable tool for geotechnical engineering.
Genetic Programming (GP) is becoming increasingly popular in geotechnical engineering because of its flexibility and ability to model non-linear relationships. It has been successfully used to predict soil properties such as shear strength, permeability, and bearing capacity [16,17,18]. Research by Pham et al. [19] and Ahmadi et al. [20] shows that GP can accurately model soil parameters, especially when traditional methods fall short. Its strength lies in its evolutionary approach where solutions adapt without needing predefined formulas. It makes this method ideal for handling complex, non-linear geotechnical data [21,22,23].
XGBoost, as a powerful gradient-boosting technique, has gained recognition for its efficiency and accuracy in analyzing large, high-dimensional datasets. It has been widely used in civil and geotechnical engineering to predict properties like soil compaction and undrained shear strength [24,25]. Studies by Pal and Deswal [26] and Ma et al. [27] demonstrated that XGBoost often outperforms traditional machine learning models due to its robust regularization features and resistance to overfitting. Additionally, its interpretability, through feature importance metrics, provides valuable insights into how soil properties influence Cc [28,29,30].
Combining GP and XGBoost into a hybrid model addresses the limitations of each standalone approach. This hybridization leverages GP’s adaptability and XGBoost’s computational power, leading to improved accuracy and reliability in predicting complex parameters like soil strength and stability under varying conditions [17,31,32]. Recent studies have shown that hybrid models significantly enhance performance and can accommodate the variability seen across different soil types and conditions, making them highly adaptable solutions for predicting Cc [33].
Comparative studies, such as those by Deng et al. [34] and Shahin et al. [35], highlight that hybrid machine learning models consistently outperform traditional methods. By combining multiple techniques, hybrid models handle non-linear relationships more effectively and reduce prediction errors for parameters like Cc [36]. Despite these advancements, challenges remain in generalizing these models across various soil profiles and conditions. Further research is needed to validate their applicability in diverse geotechnical settings [29,37]. Additionally, incorporating interpretability techniques, like SHAP (SHapley Additive exPlanations), could further clarify the relationship between soil properties and Cc, improving their acceptance in practice [38,39,40].
This paper explores the use of GP, XGBoost, and their hybrid model to predict Cc using a comprehensive geotechnical database, and demonstrates their potential in advancing soil behavior prediction.
The proposed method fills critical gaps in the prediction of the compression index (Cc) of fine-grained soils by addressing the limitations of traditional consolidation testing and empirical models. Conventional methods for determining Cc are time-consuming, requiring up to 15 days for test preparation, execution, and parameter calculation, which can significantly delay construction projects. Empirical formulas, while useful for initial estimates, often fail to generalize across diverse datasets due to their reliance on simplified assumptions and limited variables. These challenges underscore the need for advanced, efficient methodologies capable of delivering accurate and reliable Cc predictions while reducing time and resource demands.
To bridge these gaps, this study aims to develop and validate a novel hybrid machine learning model combining Genetic Programming (GP) and XGBoost. This hybrid approach uses the interpretability of GP and the computational efficiency of XGBoost to accurately predict Cc using easily measurable soil properties such as the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), and the water content (w). The research seeks to overcome the shortcomings of traditional methods by creating a model that not only delivers superior predictive accuracy but also generalizes effectively across diverse soil profiles. By validating the hybrid model against standalone GP and XGBoost models, as well as traditional empirical approaches, the study provides a robust and adaptable tool for geotechnical engineers, enabling faster, more cost-effective, and more reliable predictions of soil behavior.

2. Materials and Methods

2.1. Database

The database (including 352 sets of data) used in this study includes a detailed collection of geotechnical data focused on the properties of clay soils related to the prediction of the compression index (Cc). The key parameters in the database are the initial void ratio (e0), the liquid limit (LL), the plasticity index (PI), and the water content (w). This database was collected from Alhaji et al. [41]; Benbouras et al. [42]; McCabe et al. [43]; Mitachi and Ono [44]; Widodo and Ibrahim [45]; and Zaman et al. [46].
These parameters were carefully chosen for their critical role in the understanding and prediction of soil compressibility. The liquid limit reflects the clay mineralogy and water-holding capacity and directly impacts compressibility during consolidation. The plasticity index indicates the range of moisture content where the soil remains plastic and correlates with its deformation potential under load. The initial void ratio measures soil structure and density. This parameter plays a fundamental role in determining how much soil will compress. Studies by experts like Skempton [2], Terzaghi and Peck [1], and others have confirmed the importance of these parameters in empirical models for estimating Cc. Beyond theory, they are practical for field investigations as they can be measured using standard laboratory methods. Together, these parameters provide a great and strong foundation for integrating established soil mechanics principles with advanced predictive modeling techniques.
In this study, an 80/20 train-test split was used, selected through a random sampling process with a fixed seed to ensure reproducibility. This ratio was chosen based on common machine learning practices, balancing the need for sufficient training data with reliable testing performance. Although no formal optimization algorithm was applied, we experimented with different splits (e.g., 70/30 and 90/10) to assess their impact on model performance. The 80/20 split consistently provided stable and accurate results. Additionally, 5-fold cross-validation was employed to enhance model robustness and mitigate the effects of data splitting bias.
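A seeded 80/20 split and k-fold index generation of the kind described above can be sketched as follows. This is an illustrative stand-in for the study's pipeline (which likely used a library implementation), written with only the standard library:

```python
import random

def train_test_split(data, test_frac=0.2, seed=42):
    """Random train/test split with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_frac)
    test_idx = set(idx[:n_test])
    train = [data[i] for i in idx if i not in test_idx]
    test = [data[i] for i in idx[:n_test]]
    return train, test

def kfold_indices(n, k=5, seed=42):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```

Applied to the 352-record database, this split yields 70 testing and 282 training records, matching the subset sizes reported in Tables 3 and 4.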
Table 2 shows the full database, which includes 352 observations with Cc values ranging from 0.050 to 1.64 and a mean of 0.241. Table 3 focuses on the training subset with 282 observations and shows a slightly higher mean Cc of 0.244. Table 4 highlights the testing subset of 70 observations with a slightly lower mean Cc of 0.230 and narrower parameter ranges. These metrics reflect consistent patterns across the subsets, essential for building reliable predictive models. The complete database is presented in Appendix A.
Figure 1 shows the Pearson correlation coefficients between key variables: the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), the water content (w), and Cc. Pearson correlation measures the strength of the linear relationship between two variables, with values ranging from −1 to +1. A value of +1 means a perfect positive relationship, −1 indicates a perfect negative relationship, and 0 means no linear relationship.
In the heatmap, strong correlations are shown in red, while weaker ones appear in blue. The strongest connections with Cc are seen for e0 and w, with coefficients of 0.83 and 0.77, respectively. The LL and the PI also show moderate correlations at 0.65 and 0.56, which is expected due to their inherent mathematical relationship (PI = LL − PL). While no feature was removed to maintain the physical relevance of the geotechnical parameters, this dependency was acknowledged as a limitation in the study. These findings highlight that e0 and w are the most influential factors in predicting Cc.
To prepare the dataset for analysis, data cleaning steps such as outlier removal using Z-scores (Z = (X − μ) / σ) and missing value imputation were conducted.
Normalization was applied to scale the data between 0 and 1, ensuring uniformity:
X_normalised = (X − min(X)) / (max(X) − min(X))
This ensures that all variables contribute equally to model training.
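The min-max scaling above is a one-liner in practice; a minimal sketch:

```python
def min_max_normalise(values):
    """Scale a list of values to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Example: liquid limit values 20, 40, 60 map to 0.0, 0.5, 1.0
print(min_max_normalise([20, 40, 60]))
```

Note that the minimum and maximum must come from the training data only and be reused on the testing data, otherwise information leaks across the split.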
To improve the quality of the dataset and enhance the robustness of the predictive models, outlier detection and removal was conducted using the Boxplot method, a widely used statistical technique based on the interquartile range (IQR). This method helps to identify data points that deviate significantly from the central distribution, which could otherwise negatively impact model performance.
The Boxplot method identifies outliers using the following criteria:
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
where Q1 is the first quartile (25th percentile), Q3 is the third quartile (75th percentile), and IQR is the interquartile range (Q3 − Q1).
Data points falling below the lower bound or above the upper bound were considered outliers. This method was applied to all continuous input features, including the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), the natural moisture content (w), and the fines content.
The outlier removal procedure involved four key steps to ensure data consistency and improve model performance. First, boxplots were generated for each feature to visualize data distribution and detect extreme values. Next, outliers were identified using the interquartile range (IQR) thresholds, which helped in pinpointing data points that deviated significantly from the norm. These identified outliers were then removed from the dataset to minimize the risk of skewed model predictions. Finally, the dataset was re-evaluated to confirm the absence of any remaining influential outliers that could bias the models. The impact of outlier removal was significant, as it enhanced model accuracy by reducing the influence of extreme values, improved generalization capability by ensuring the model was trained on data representative of typical geotechnical conditions, and stabilized the feature importance rankings, leading to more reliable interpretations of the factors affecting the compression index (Cc).
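The IQR criterion above translates directly into code. A minimal sketch for a single feature (the study applied it per feature across the dataset):

```python
def quartiles(values):
    """First and third quartiles via linear interpolation of the sorted data."""
    s = sorted(values)
    def pct(p):
        k = (len(s) - 1) * p
        f = int(k)
        c = min(f + 1, len(s) - 1)
        return s[f] + (s[c] - s[f]) * (k - f)
    return pct(0.25), pct(0.75)

def remove_outliers_iqr(values):
    """Drop points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]
```

For instance, in the sample [1, 2, 3, 4, 5, 100], the value 100 falls above the upper bound and is removed, while the central values are retained.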
In this study, GP and XGBoost were selected based on their complementary strengths in handling complex, non-linear relationships inherent in geotechnical datasets. GP excels at generating interpretable symbolic regression models, providing insights into the mathematical relationships between soil properties and the compression index (Cc). XGBoost, on the other hand, is a robust ensemble learning algorithm known for its high predictive accuracy, scalability, and ability to handle non-linear feature interactions effectively.

2.2. Multiple Linear Regression (MLR)

Multiple linear regression models the relationship between input features and the target variable (Cc) by assuming a linear relationship. The general formula for a multivariate linear regression model is as follows:
Cc = β0 + Σ(i=1..n) βi·Xi + ϵ
where Cc: the compression index, Xi: the input features (e.g., e0, the LL, the PI), β0, βi: coefficients determined via least squares, and ϵ: the error term
The model’s coefficients were calculated by minimizing the residual sum of squares (RSS):
RSS = Σ(i=1..N) (Cc,i − Ĉc,i)²
where Cc,i: the actual compression index, and C ^ c ,   i : the predicted compression index.
Multiple linear regression (MLR) is often used as a baseline model to explore the simplest relationships in the data. Its performance is evaluated using metrics like the coefficient of determination (R2), the mean squared error (MSE), and the mean absolute error (MAE). R2 indicates how much of the variation in the dependent variable is explained by the independent variables, with values closer to 1 showing a better fit.
The MSE measures the average squared difference between the observed and predicted values, with lower values indicating greater accuracy. The MAE, on the other hand, calculates the average absolute difference between the observed and predicted values, providing a straightforward measure of prediction errors. The Mean Absolute Percentage Error (MAPE) measures the accuracy of a predictive model by calculating the average of the absolute percentage differences between the actual values and the predicted values. The Root Mean Square Error (RMSE) measures the square root of the average of the squared differences between the actual and predicted values. Together, these metrics offer a comprehensive understanding of the model’s accuracy and reliability.
R² = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²
MSE = (1/n) Σ(i=1..n) (yi − ŷi)²
RMSE = √[(1/n) Σ(i=1..n) (yi − ŷi)²]
MAE = (1/n) Σ(i=1..n) |yi − ŷi|
MAPE = (100/n) Σ(i=1..n) |(yi − ŷi)/yi|
where n is the number of observations, yi represents the actual value, ŷi represents the predicted value, and ȳ represents the mean of the actual values.
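These five metrics can be computed directly from their definitions; a self-contained sketch:

```python
import math

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def mse(y, yhat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(mse(y, yhat))

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    """Mean absolute percentage error (actual values must be nonzero)."""
    return 100 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, yhat))
```

A model that always predicts the mean of the targets gets R² = 0 by construction, which is a useful sanity check when validating an implementation.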

2.3. Genetic Programming (GP)

Genetic Programming (GP) is an evolutionary algorithm that generates symbolic models to predict Cc. It begins with a population of random equations and iteratively refines them using crossover, mutation, and selection. The fitness function is defined as follows:
Fitness = 1 / (1 + MSE)
where MSE is the mean squared error between the predicted and actual Cc. Operations such as crossover combine parts of two parent models:
f_offspring = α·f_parent1 + (1 − α)·f_parent2
Mutation introduces diversity by randomly altering parts of the equation. This process continues until convergence to an optimal model or a predefined termination criterion (e.g., maximum iterations).
GP excels in capturing nonlinear relationships and produces interpretable equations for Cc. However, it is computationally expensive compared to simpler methods.
Figure 2 illustrates the basic workflow of GP, which is inspired by the principles of natural selection and evolution. The process begins with the generation of an initial random population of potential solutions, often represented as symbolic mathematical expressions or trees. Each individual in this population is evaluated, and a fitness score is assigned based on its ability to accurately model the target outcome (e.g., predicting the compression index, Cc). The selection function then identifies the most “fit” individuals, which are chosen to participate in the next generation. These selected individuals undergo genetic operations such as crossover (a recombination of parts from two parent solutions) and mutation (random alterations to introduce diversity), creating a new population with potentially improved solutions. The algorithm continues to iterate through these steps—selection, crossover, mutation, and fitness evaluation—until a specified termination condition is met, such as reaching a maximum number of generations or achieving an acceptable fitness level. Once the process concludes, the algorithm outputs the best-performing solution, representing the optimized predictive model.
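The loop just described (random initial population, fitness = 1/(1 + MSE), tournament selection, subtree crossover, and mutation) can be sketched for a single input variable. This is a toy illustration of the GP mechanics, not the authors' implementation, and all names below are ours:

```python
import random

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def random_tree(rng, depth=3):
    """Grow a random expression tree over variable 'x' and small constants."""
    if depth == 0 or rng.random() < 0.3:
        return 'x' if rng.random() < 0.5 else rng.uniform(-2, 2)
    op = rng.choice(list(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree, xs, ys):
    """Fitness = 1 / (1 + MSE), as defined in the text; 0 for degenerate trees."""
    err = 0.0
    for x, y in zip(xs, ys):
        d = evaluate(tree, x) - y
        err += d * d
    m = err / len(xs)
    if m != m or m == float('inf'):   # NaN or overflowed tree
        return 0.0
    return 1.0 / (1.0 + m)

def subtree(rng, t):
    """Pick a random subtree of t."""
    while isinstance(t, tuple) and rng.random() < 0.5:
        t = rng.choice(t[1:])
    return t

def crossover(rng, a, b):
    """Replace a random subtree of 'a' with a random subtree of 'b'."""
    if not isinstance(a, tuple) or rng.random() < 0.3:
        return subtree(rng, b)
    op, l, r = a
    if rng.random() < 0.5:
        return (op, crossover(rng, l, b), r)
    return (op, l, crossover(rng, r, b))

def evolve(xs, ys, pop_size=40, generations=15, seed=1):
    """Selection, crossover, mutation, evaluation - repeated until termination."""
    rng = random.Random(seed)
    pop = [random_tree(rng) for _ in range(pop_size)]
    best = max(pop, key=lambda t: fitness(t, xs, ys))
    for _ in range(generations):
        nxt = [best]                          # elitism: keep the best so far
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 3), key=lambda t: fitness(t, xs, ys))
            b = max(rng.sample(pop, 3), key=lambda t: fitness(t, xs, ys))
            child = crossover(rng, a, b)
            if rng.random() < 0.1:            # mutation: fresh random subtree
                child = random_tree(rng, 2)
            nxt.append(child)
        pop = nxt
        best = max(pop, key=lambda t: fitness(t, xs, ys))
    return best
```

With elitism, the best fitness is non-decreasing across generations, mirroring the convergence criterion in the figure.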

2.4. XGBoost

XGBoost is a gradient-boosting framework that constructs a series of decision trees to minimize prediction errors. It uses a regularized objective function:
Objective = Σ(i=1..n) l(Cc,i, Ĉc,i) + Σ(k=1..K) Ω(fk)
where l(Cc,i, Ĉc,i) is the loss function (e.g., squared error loss), and Ω(fk) is the regularization term penalizing tree complexity:
Ω(fk) = γT + (λ/2)‖ω‖²
where T is the number of leaves in tree fk and ω are its leaf weights.
At each iteration, a new tree is fitted to the residuals of the previous trees:
ri = Cc,i − Ĉc,i
The final prediction in XGBoost is a weighted sum of all the decision trees. XGBoost is particularly strong because it can handle missing data and incorporates regularization to prevent overfitting. Also, it uses parallel processing to optimize performance. In our study, key hyperparameters, such as the learning rate and tree depth, were carefully adjusted through cross-validation to enhance accuracy. Metrics like R2 and RMSE were used to evaluate the model’s performance.
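The residual-fitting idea can be demonstrated with one-split regression trees (stumps) on a single feature. This is a minimal sketch of gradient boosting for squared error loss, not the actual XGBoost library (which adds regularization, parallel split search, and much more):

```python
def fit_stump(xs, residuals):
    """One-split regression tree (stump) minimising squared error."""
    best = None
    for split in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def boost(xs, ys, rounds=60, lr=0.3):
    """Additive ensemble: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    def predict(x):
        return base + sum(lr * s(x) for s in stumps)
    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict
```

Each round shrinks the residuals by a factor controlled by the learning rate, which is why the learning rate and tree depth are the key hyperparameters tuned by cross-validation.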
Figure 3 illustrates the workflow of XGBoost, an advanced machine learning algorithm based on the gradient boosting framework. The process begins with a dataset that is used to train an ensemble of decision trees Tree1, Tree2, …, Treek, each represented by its unique parameters φ1, φ2, …, φk. The first tree f1(X, φ1) generates initial predictions, and the residuals (the errors between the predicted and actual values) are calculated. These residuals are then passed to the next tree f2(X, φ2), which aims to correct the previous errors. This iterative process continues, with each subsequent tree learning from the residuals of the previous model. The node splitting in each tree is optimized based on an objective function, which improves predictive accuracy. Finally, the outputs from all trees are combined through an additive process fk(X, φk), resulting in a high-performing predictive model. This ensemble approach enhances accuracy, reduces overfitting, and ensures efficient computation.

2.5. Hybrid (GP-XGBoost)

Combining GP and XGBoost creates a powerful hybrid methodology that integrates the interpretability of GP with the robust predictive capabilities of XGBoost. This combination is particularly valuable for solving complex problems where understanding the underlying relationships is as important as achieving high predictive accuracy. Below is a comprehensive outline of how these methods can be combined. In this research, Method 2 was employed as it showed better performance in prediction.
Method 1. Sequential Hybrid Approach
In this approach, GP is used as a preprocessing step to create features or refine the input data for XGBoost, which then focuses on predictive modeling.
Step 1: Feature Engineering with GP
1.
Symbolic Regression: GP is employed to explore relationships between input variables (e.g., the LL, the PI, and e0) and the target variable (Cc). The output is symbolic equations that describe the following relationships:
Cc = LL/PI + e0
Cc ∝ LL·log(PI)
2.
Feature Transformation: The symbolic equations are converted into new features, such as the following:
f1 = LL/PI + e0
f2 = LL·log(PI)
3.
Feature Selection: The symbolic features are evaluated based on their importance and correlation with the target variable, and only the most relevant features are retained for the next step.
Step 2: Predictive Modeling with XGBoost
1.
Dataset Augmentation: The refined features from GP are added to the original dataset, enriching the input space for XGBoost.
2.
Training XGBoost: XGBoost is trained on the augmented dataset. The combination of raw and GP-derived features allows XGBoost to capture complex patterns and residual errors that are not fully explained by GP.
3.
Evaluation: The performance of the model is assessed using metrics such as the MSE and R2.
By using GP’s symbolic equations, this approach improves XGBoost’s ability to generalize complex nonlinear relationships.
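The augmentation step in Method 1 amounts to computing the symbolic features for every record and appending them to the raw inputs. A minimal sketch, using the illustrative feature forms shown above (the dictionary keys and function names are our assumptions, not the study's code):

```python
import math

def gp_features(row):
    """Hypothetical GP-derived features of the form shown in the text:
    f1 = LL / PI + e0,  f2 = LL * log(PI)."""
    return {
        'f1': row['LL'] / row['PI'] + row['e0'],
        'f2': row['LL'] * math.log(row['PI']),
    }

def augment(dataset):
    """Append the symbolic features to each record before training XGBoost."""
    return [{**row, **gp_features(row)} for row in dataset]
```

The augmented records keep the raw inputs intact, so XGBoost can still learn patterns the symbolic features miss.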
Method 2. Integrated Hybrid Approach
This approach iteratively combines GP and XGBoost, with feedback loops allowing both methods to influence each other.
Step 1: Initial Training with XGBoost
1.
XGBoost is trained on the raw features to establish a baseline model.
2.
The feature importance from XGBoost is extracted to identify which features contribute most to the predictions.
Step 2: Feedback to GP for Feature Discovery
1.
GP uses the ranked features from XGBoost to focus on the most influential variables.
2.
It generates symbolic relationships, such as Equations (11) and (12).
3.
GP-derived features are added to the original dataset.
Step 3: Iterative Refinement
1.
The new dataset, enriched by GP, is fed back into XGBoost for retraining.
2.
XGBoost’s predictions and feature importances are analyzed, and further refinement of features is performed by GP.
3.
This feedback loop continues until convergence or performance improvement stagnates.
Method 3. Parallel Hybrid Approach
In this approach, GP and XGBoost work independently, and their outputs are combined for the final prediction.
Step 1: Independent Training
1.
GP is trained independently to create symbolic models for predicting Cc, generating outputs (CcGP).
2.
XGBoost is trained independently on the same dataset to produce output (CcXGBoost).
Step 2: Weighted Combination of Outputs
1.
The predictions from both models are combined using a weighted averaging approach:
Cc = w1·Cc,GP + w2·Cc,XGBoost
where w1 and w2 are the weights determined through cross-validation.
2.
Alternatively, a stacking ensemble can be used, where the outputs of GP and XGBoost serve as inputs to a meta-model that learns how to combine them optimally.
Step 3: Final Model Evaluation
The combined predictions are evaluated using standard performance metrics, ensuring that the complementary strengths of GP (interpretability) and XGBoost (accuracy) are fully utilized.
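The weighted combination of Method 3 is a few lines of code; the sketch below uses a simple validation-set grid search over w1 (with w2 = 1 − w1) as a stand-in for the cross-validation described above:

```python
def combine(pred_gp, pred_xgb, w1, w2):
    """Weighted average of the two model outputs (w1 + w2 = 1)."""
    return [w1 * a + w2 * b for a, b in zip(pred_gp, pred_xgb)]

def pick_weights(pred_gp, pred_xgb, y_val, step=0.05):
    """Grid-search w1 on a validation set, minimising MSE."""
    best_w, best_err = 0.0, float('inf')
    w = 0.0
    while w <= 1.0 + 1e-9:
        c = combine(pred_gp, pred_xgb, w, 1 - w)
        err = sum((a - b) ** 2 for a, b in zip(c, y_val)) / len(y_val)
        if err < best_err:
            best_w, best_err = w, err
        w += step
    return best_w
```

If one model consistently over-predicts and the other under-predicts, the optimal weight sits between them; a stacking meta-model generalizes this by learning the combination rule instead of fixing it to a weighted average.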
The hybrid GP-XGBoost approach combines the best features of two powerful methods and can offer distinct advantages. First, it improves model interpretability by using symbolic equations from GP which clearly show the relationships between input variables and the target. This is especially useful in geotechnical engineering where understanding soil behavior is key to making informed decisions. Second, the hybrid model boosts predictive accuracy by using XGBoost’s ability to handle complex, nonlinear patterns and residual errors. GP-derived features enhance the dataset and allow XGBoost to work with a richer, more meaningful input space. This often leads to better generalization and reduced overfitting. The method is also highly adaptable and makes it suitable for various datasets and applications where both precision and clarity are important.
The GP-XGBoost model is particularly effective for problems that involve complex nonlinear relationships and require both high accuracy and easy interpretation. GP’s symbolic equations help engineers understand the key factors driving these properties, while XGBoost ensures dependable predictions. Beyond geotechnics, the hybrid model is valuable in areas like environmental science where interactions between soil, water, and plants are critical. By combining symbolic reasoning with advanced machine learning, this hybrid approach bridges the gap between explainable and high-performance predictive models.

3. Results

3.1. Multiple Linear Regression (MLR) Predictions

The results from the MLR model reveal its limitations in predicting Cc, especially for the testing database. Figure 4 shows that the training data align reasonably well with the 1:1 reference line, indicating a decent fit for known data. However, many points fall outside the ±10% boundaries, showing that MLR struggles to generalize to new data. This is because MLR relies on linear relationships, which are not sufficient to capture the complex, nonlinear interactions often seen in soil compressibility behavior.
The boxplots and violin plots in Figure 4 further illustrate MLR’s shortcomings. The boxplots show wider interquartile ranges and more outliers in both the training and testing predictions compared to the actual values, and thus indicate less precision. The violin plots reveal broader and misaligned distributions for predicted values, especially in the testing dataset. This reflects MLR’s inability to handle the variability and complexity of real data. These results show the need for more advanced methods, such as machine learning models like XGBoost or GP, which are better suited to capture nonlinear relationships and improve prediction accuracy in complex geotechnical datasets.
Table 5 shows the performance of the MLR method in predicting Cc values for both training and testing databases. The model performs well and achieved an R2 of 0.879 for training and 0.843 for testing and indicated strong predictive accuracy and good generalization to new data.
The error metrics tell a similar story. The MAE values are 0.054 for training and 0.059 for testing, while the MSE values are 0.006 and 0.008, respectively. These errors are low in absolute terms, but, as the following sections show, they remain consistently higher than those of the nonlinear models.
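As a point of reference, an MLR baseline of this kind can be reproduced with ordinary least squares. The snippet below is a minimal sketch on synthetic stand-in data (the real database appears in Appendix A); the four features mimic the paper’s inputs (LL, PI, e0, w), and the coefficients are illustrative, not the fitted ones from Table 5.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the soil database: columns are LL, PI, e0, w.
X = rng.uniform([22, 3, 0.35, 10], [135, 82, 3.0, 131], size=(200, 4))
# Hypothetical linear target with noise, purely for illustration.
y = (0.001 * X[:, 0] + 0.001 * X[:, 1] + 0.25 * X[:, 2]
     + 0.001 * X[:, 3] + rng.normal(0, 0.02, 200))

# Ordinary least squares with an intercept column, as in MLR.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficient of determination (R^2) on the fitted data.
pred = A @ coef
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

Because the target here is linear by construction, the fit is good; on real compressibility data the same procedure yields the weaker generalization discussed above.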

3.2. Genetic Programming (GP) Predictions

Figure 5 highlights how well the GP model predicts Cc values, showing a strong match between the predicted and actual results. In the scatter plot, most data points from both the training and testing datasets align closely with the 1:1 reference line, reflecting excellent accuracy. Most points also fall within the ±10% deviation bands, showing that the model generalizes well to new data. However, the testing database shows slightly more scatter than the training data, suggesting that the model could be improved for more complex or extreme cases.
The histogram, boxplot, and violin plot in Figure 5 provide more evidence of the GP model’s reliability. The histogram shows that the predicted values closely match the actual data distribution, especially for lower Cc values. The boxplots reveal very little difference between the median and interquartile ranges of the actual and predicted values, with only a few outliers. The violin plots show that the predicted data have a similar shape and spread to the actual values, further confirming the model’s consistency. These results show that the GP model captures nonlinear relationships well, and its symbolic regression makes it easy to interpret. This combination of accuracy and interpretability makes the GP model a powerful tool for predicting soil compressibility.
The equation proposed by GP is given below; Table 6 lists its parameters and constants:
Y = (r1² + (x1 − x2)² × 4 × r1)² × (x3 − r3² × x1 + (r2 + x3) × (x3 + x1) − x4⁴ + (x2² − x3²) × (2 × x1 − r3 − x4) × ((x1 − r1) × (2 × x2) − (3 × x3 + r1)))
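For readers who want to evaluate this expression programmatically, the sketch below implements one plausible reading of the GP equation (interpreting r1², r3², x4⁴, etc. as exponents). The input values and the constants r1–r3 used in the example call are placeholders; the actual constants are those reported in Table 6.

```python
def gp_cc(x1, x2, x3, x4, r1, r2, r3):
    """One reading of the GP expression for Y (the predicted Cc)."""
    outer = (r1 ** 2 + (x1 - x2) ** 2 * 4 * r1) ** 2
    inner = (
        x3
        - r3 ** 2 * x1
        + (r2 + x3) * (x3 + x1)
        - x4 ** 4
        + (x2 ** 2 - x3 ** 2)
        * (2 * x1 - r3 - x4)
        * ((x1 - r1) * (2 * x2) - (3 * x3 + r1))
    )
    return outer * inner

# Placeholder inputs and constants, purely illustrative.
print(gp_cc(x1=0.5, x2=0.3, x3=0.2, x4=0.1, r1=0.1, r2=0.2, r3=0.3))
```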
Table 7 presents the performance metrics of the GP method in predicting the Cc for the training and testing datasets. The GP method demonstrates strong predictive performance, with a high R2 value of 0.934 for the training database, indicating an excellent fit to the data. For the testing database, the R2 value is 0.827, reflecting good generalization capabilities. Additionally, the low mean absolute error (MAE) values of 0.039 and 0.040, along with mean squared error (MSE) values of 0.003 for both datasets, highlight the GP method’s ability to capture complex relationships and provide accurate predictions for both training and testing datasets.
The GP model was optimized using an evolutionary strategy that balances accuracy and interpretability. The best configuration was obtained by tournament selection and elitism, ensuring diversity while avoiding premature convergence. Table 8 shows optimized hyperparameters for GP.

3.3. XGBoost Predictions

Figure 6 illustrates the strong performance of the XGBoost model in predicting Cc values. Most data points for both the training and testing sets are closely aligned with the 1:1 line, showing a high correlation between the predicted and actual values. The minimal scatter of points indicates excellent accuracy and consistency in the model’s predictions. The addition of ±10% deviation lines further highlights that the majority of predictions fall within an acceptable error range, demonstrating XGBoost’s ability to capture complex and nonlinear patterns in the data. A few points outside these lines suggest areas where fine-tuning hyperparameters or adding features could enhance performance.
The histogram, boxplot, and violin plot in Figure 6 add more evidence of XGBoost’s reliability. The histogram shows a close match between the distributions of predicted and actual Cc values for both the training and testing datasets, suggesting strong generalization to unseen data. The boxplots reveal similar median values and interquartile ranges for the predicted and actual results, with only a few outliers. The violin plots provide a deeper look at data density, showing consistent shapes and patterns between actual and predicted values. These results confirm that XGBoost not only delivers accurate predictions but also maintains robust and consistent performance, making it a dependable choice for geotechnical applications.
Table 9 highlights how well the XGBoost model predicts Cc values for both training and testing datasets. The model performs strongly, with an R2 of 0.939 for the training set and 0.916 for the testing set, showing that it not only makes accurate predictions but also generalizes well to new data.
The error values add to this confidence. The MAE is 0.038 for the training set and 0.028 for the testing set, while the MSE is 0.002 and 0.001, respectively. These low error rates demonstrate the model’s precision and reliability. Overall, XGBoost stands out as a highly effective and consistent tool for geotechnical predictions.
To achieve the best predictive performance, Bayesian Optimization (TPE) was used for hyperparameter tuning. A 5-fold cross-validation approach was applied, and the best parameters were selected based on the lowest MSE. The optimized parameters for the XGBoost model are presented in Table 10.
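The study tuned XGBoost with TPE-based Bayesian Optimization; as a more accessible stand-in, the sketch below illustrates the same selection criterion (lowest MSE under 5-fold cross-validation) using scikit-learn’s GradientBoostingRegressor and an exhaustive grid search on synthetic data. The grid values are illustrative, not the optimized parameters of Table 10.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 4))                                  # stand-in features (LL, PI, e0, w)
y = 0.3 * X[:, 2] + 0.05 * X[:, 0] + rng.normal(0, 0.02, 150)   # toy target

# Small illustrative grid; the paper searched an XGBoost space with TPE.
grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    grid,
    scoring="neg_mean_squared_error",  # best params = lowest 5-fold CV MSE
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(-search.best_score_, 4))
```

TPE explores the same space adaptively rather than exhaustively, which matters once the grid grows beyond a handful of parameters.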

3.4. Hybrid (GP-XGBoost) Predictions

Figure 7 shows the scatter plot comparing actual and predicted Cc values for the GP-XGBoost model and highlights the strong predictive performance of this hybrid method. Most points from both the training and testing datasets align closely with the 1:1 reference line, indicating a high level of agreement between predictions and actual values. The points are tightly distributed within the ±10% deviation lines, confirming the model’s accuracy and consistency and demonstrating its ability to handle nonlinear relationships in the data, making it a reliable choice for predicting Cc. A few slight deviations suggest areas for potential improvement in feature representation or model assumptions.
The histogram, boxplot, and violin plot in Figure 7 provide additional insights into the model’s performance. The histogram shows that the predicted values closely follow the actual data distribution and peak around the mean Cc. The boxplot reveals minimal differences between median values and tightly grouped interquartile ranges, further confirming the model’s accuracy. The violin plots give a more detailed view of the data distribution, showing consistent density and only a handful of outliers. Together, these visualizations emphasize the GP-XGBoost model’s robustness, precision, and reliability, making it a strong tool for geotechnical predictions that require high confidence.
Table 11 shows how well the Hybrid GP-XGBoost model predicts Cc values for both training and testing datasets. The model delivers excellent results, with an R2 of 0.966 for the training set and 0.927 for the testing set, reflecting high accuracy and strong generalization. The low mean absolute error (MAE) values of 0.030 for the training set and 0.028 for the testing set, along with a minimal mean squared error (MSE) of 0.001 for both datasets, demonstrate its precision in handling complex relationships.
This performance highlights the model’s ability to outperform individual methods, proving its reliability and robustness. By effectively capturing the complexities of the data, the Hybrid GP-XGBoost model stands out as a highly dependable approach for accurate predictions.
The equation proposed by the GP-XGBoost approach is given below; Table 12 lists its parameters and constants:
Y = ((R1 + R2) × X3 × R1 + ((4 × X3 − R3) × (X3 × X2 + X1 × X4))²) + (((R1 + R2 × X4) × (R1 + R2) × (X4 × R1) × X2 × R1) − ((X3 × X4) × ((X3 × X2 + R3 × X4) − ((X2 × X3) × (2 × X3 × X1)))))
For the GP-XGBoost hybrid model, the best approach involved iterative refinement:
- Feature selection: only the top four most important GP-derived features were used in XGBoost.
- Ensemble weighting: a stacking ensemble approach was used, where a meta-model determined the optimal contribution of the GP and XGBoost outputs, and a grid search was performed to find the best weight combination. Table 13 shows the optimized hyperparameters for GP-XGBoost.
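The grid search over the contribution of each base model can be sketched as a simple convex blend, w × GP + (1 − w) × XGBoost, scored by MSE. The predictions below are synthetic placeholders standing in for the two models’ outputs; the actual meta-model in the study may weight the outputs differently.

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.uniform(0.05, 0.6, size=100)        # stand-in Cc values
pred_gp = y_true + rng.normal(0, 0.04, 100)      # hypothetical GP predictions
pred_xgb = y_true + rng.normal(0, 0.02, 100)     # hypothetical XGBoost predictions

# Grid search over the blend weight w: final = w * GP + (1 - w) * XGBoost.
best_w, best_mse = None, np.inf
for w in np.linspace(0.0, 1.0, 101):
    blend = w * pred_gp + (1 - w) * pred_xgb
    mse = np.mean((y_true - blend) ** 2)
    if mse < best_mse:
        best_w, best_mse = w, mse

print(round(best_w, 2), round(best_mse, 5))
```

Because w = 0 and w = 1 are included in the grid, the blended model can never do worse than the better of the two base models on the search set.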

4. Discussion

4.1. Residual Analysis and Error Distribution for Model Performance Evaluation

Figure 8 compares the results of the different models using several plots. Residual plots show the differences between the actual and predicted values, which help assess accuracy and bias. Ideally, the residuals should be evenly spread around zero, indicating no bias in the model’s predictions. Models like GP and XGBoost perform well, with residuals mostly close to zero, showing good accuracy; scattered residuals, in contrast, highlight areas where predictions are less accurate. These plots also reveal the range of errors: narrower distributions around zero indicate better performance, as seen for GP and XGBoost, while wider or skewed distributions suggest lower accuracy or potential biases.
The scatter matrix in Figure 8 compares actual Cc values with the predicted ones for each model. Points closer to the diagonal line indicate more accurate predictions. GP and XGBoost models show tight clustering along the diagonal direction, reflecting higher accuracy, while scattered points in other models reveal more errors. This visualization is useful for identifying how well each model generalizes to new data and whether there are any consistent issues. It offers a clear view of the models’ strengths and weaknesses, making it easier to understand their overall performance.

4.2. Feature Importance

Feature importance is a key concept in machine learning that helps explain which factors have the biggest impact on a model’s predictions. It shows how much each variable contributes to the model’s performance and helps researchers understand the relationships within the data. The method for calculating feature importance depends on the model. For example, in tree-based models like XGBoost, it can be measured by how much a feature reduces errors or how often it is used in decision splits. In GP models, importance is assessed by analyzing symbolic equations to see how strongly a feature influences the output. For hybrid models like GP-XGBoost, these methods are combined with techniques like permutation importance or SHAP values, which quantify how much each feature contributes to the predictions. These tools make it easier to rank features, interpret results, and improve the model.
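The permutation-importance idea mentioned above can be sketched in a few lines: shuffle one feature at a time and measure how much the model’s error grows. The sketch below is model-agnostic and uses a linear model on synthetic data purely for illustration; any fitted regressor could take its place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))                        # stand-in features
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(X, y)
base_mse = np.mean((y - model.predict(X)) ** 2)

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])             # break feature j only
    mse = np.mean((y - model.predict(Xp)) ** 2)
    importances.append(mse - base_mse)               # error increase = importance

print([round(v, 3) for v in importances])
```

Features that the model relies on heavily (here, the first column) show the largest error increase when permuted; irrelevant features show almost none.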
The feature importance results in Figure 9 reveal how different soil properties—such as the initial void ratio (e0), the liquid limit (LL), the plasticity index (PI), and the water content (w)—affect the prediction of the compression index (Cc). The initial void ratio (e0) stands out as the most important factor, especially in the GP-XGBoost hybrid model, where it has the highest importance score of 0.55. This highlights its critical role in soil compressibility, as it reflects the soil’s structure and potential for volume change. The LL and the PI also contribute significantly, particularly in the GP-XGBoost and GP models, with importance scores of 0.240 and 0.220. These properties are linked to soil composition and plasticity, making them reliable indicators of compressibility. Water content (w) also carries substantial weight in the hybrid model, with a score of 0.270, reflecting its effect on soil behavior. MLR assigns less importance to all features, especially the LL and the w, due to its limited ability to handle complex relationships. These results demonstrate how GP-XGBoost combines the strengths of both GP and XGBoost to give a detailed understanding of soil behavior through feature analysis.
It is important to note that the dependency between the liquid limit (LL) and the plasticity index (PI) may have influenced the feature importance rankings. Since the PI is derived from the LL and the PL, this relationship introduces collinearity, potentially inflating the importance of these features in the predictive models. Although the GP-XGBoost model is robust against multicollinearity, this dependency could affect the interpretability of the results, and it is acknowledged as a limitation of this study.
This feature importance ranking aligns with fundamental geotechnical principles and reflects the critical role of soil composition and structure in compressibility behavior.
The LL and the PI are indicators of a soil’s clay mineral content and plasticity, which directly affect compressibility. Higher LL values are typically associated with fine-grained soils rich in montmorillonite or kaolinite, which have a greater capacity to absorb water and undergo volume changes under load. The PI reflects the soil’s ability to deform plastically, and a higher PI usually correlates with the greater rearrangement of particles during consolidation, resulting in higher Cc values. The strong feature importance of LL and PI in the model suggests that mineralogical composition and interparticle bonding are key mechanisms driving compressibility.
The initial void ratio is a fundamental measure of the porosity and packing density of soil particles. Soils with a high e0 contain more interparticle voids, making them more susceptible to settlement under load. The model’s emphasis on e0 highlights the significance of soil structure, particularly the arrangement of particles and pore spaces, in controlling the magnitude of primary consolidation.
The natural moisture content influences the pore water pressure, effective stress, and the ease with which particles can rearrange under loading conditions. Soils with higher moisture content tend to have weaker interparticle forces, facilitating greater compression.
The feature importance analysis suggests that the compression index (Cc) is governed by a combination of the following features:
- Mineralogical properties (the LL, the PI, and specific gravity) affect plasticity and particle interaction.
- Structural characteristics (the initial void ratio) influence particle arrangement and porosity.
- Hydrological conditions (the natural moisture content) impact pore water dynamics and effective stress.
These findings align with the classical consolidation theory, where Cc is a function of both soil composition and initial structural conditions. Machine learning models not only confirm these geotechnical principles but also provide quantitative evidence of the relative importance of each factor.

4.3. Comparison with the Literature

Figure 10 compares the Cc values predicted by different models with the actual observed values. The models include GP-XGBoost, Terzaghi and Peck [1], Azzouz et al. [8], Koppula [9], and Yoshinaka and Osada [11]. The GP-XGBoost model closely matches actual values, especially for lower Cc ranges. In contrast, the empirical models show larger deviations. Terzaghi and Peck [1] often overestimate Cc, while Azzouz et al. [8] significantly underestimates it. Koppula [9] and Yoshinaka and Osada [11] provide more consistent results but lack the precision of GP-XGBoost.
These results highlight the strength of machine learning models like GP-XGBoost in capturing complex relationships in the data. They outperform empirical models, which are limited by their assumptions and simplified formulas. The graph also shows the variability in soil behavior and the difficulty that empirical methods have in generalizing across different conditions. While traditional models are useful for initial estimates, GP-XGBoost offers higher reliability and accuracy. This makes it a valuable tool for geotechnical engineering and emphasizes the importance of modern, data-driven techniques.
The results in Table 14 show that the GP-XGBoost hybrid model outperforms the empirical models in predicting Cc. The GP-XGBoost model achieved a high R2 of 0.927 and a low MSE of 0.027, indicating strong accuracy and minimal error. In comparison, the empirical models had much lower R2 values: Terzaghi and Peck [1] had the highest at 0.149, while the others fell below 0.1, indicating a poor fit or systematic overestimation.
MAE values for the empirical models were also much higher. They ranged from 0.090 for Terzaghi and Peck [1] to 0.189 for Azzouz et al. [8]. These results show that GP-XGBoost is effective at capturing complex patterns in the data. Traditional empirical models fail to represent these relationships accurately.

4.4. Limitations and Future Work

The hybrid GP-XGBoost approach is effective but faces challenges in computation and practical use. GP’s symbolic regression can create complex equations that are hard to use in real-time. XGBoost also needs significant computing power for its iterative training on enhanced datasets. This may not be feasible for small engineering firms or in developing areas. To address this, future studies can focus on simplifying GP-generated equations or applying dimensionality reduction techniques. Cloud-based or easy-to-use software can make the models more accessible.
Overfitting is another common issue in machine learning: a model can perform well on training data but poorly on new data. This study used normalization and cross-validation, but more work is needed, and testing with data from different regions would help. Techniques such as dropout regularization and ensemble methods can reduce overfitting, while tools like SHAP values can show how features affect predictions, building trust and supporting the wider use of the models.
One limitation of this study is the inherent dependency between some input parameters, particularly the LL and the PI. Since the PI is calculated directly from the LL and the PL, this introduces collinearity, which may affect the stability and interpretability of feature importance in the models. Although the hybrid GP-XGBoost model is capable of handling such dependencies, future research could benefit from employing dimensionality reduction techniques or regularization methods to mitigate these effects and enhance model robustness.
Building upon the findings and acknowledging the limitations of this study, several directions for future research are proposed to enhance the robustness, generalizability, and interpretability of the predictive models for the compression index (Cc):
- Future studies should explore dimensionality reduction techniques, such as Principal Component Analysis (PCA), to transform correlated features into uncorrelated components. Alternatively, regularization methods like LASSO regression can be applied to reduce the influence of redundant variables.
- The current dataset, while comprehensive, lacks geographical and geological diversity. Future work should incorporate larger, multi-regional datasets that cover a wider range of soil types, climatic conditions, and geotechnical properties.
- To ensure the generalizability of the proposed model, future research should involve external validation using independent datasets from different projects or geographical locations.
- Although the GP-XGBoost hybrid model offers high accuracy, its complex symbolic equations can hinder interpretability. Future studies could focus on developing simplified symbolic models by incorporating genetic simplification algorithms or rule-based pruning techniques to generate more concise, physically interpretable expressions that align with geotechnical principles.
- While the hybrid GP-XGBoost model demonstrated strong performance, future research could explore advanced ensemble techniques such as stacking, blending, or meta-learning frameworks to further enhance predictive accuracy. Additionally, integrating deep learning models like Recurrent Neural Networks (RNNs) or attention mechanisms could improve the capture of complex temporal and spatial patterns in geotechnical data.

4.5. Statistical Significance Analysis and Model Comparison

To ensure the reliability of the predictive models, 30 independent runs were conducted for each model (MLR, GP, XGBoost, and GP-XGBoost) with different random seeds to account for variability in data splitting and model initialization. The performance metrics—R2, MSE, and MAE—were calculated for each run. This approach allows for a comprehensive assessment of each model’s performance. The mean and standard deviation of these metrics were computed to quantify the models’ central tendency and variability, as shown in Table 15.
These results indicate that the hybrid GP-XGBoost model consistently outperformed other models with the highest mean R2 and the lowest MSE and MAE.

4.5.1. One-Way ANOVA Test

To determine whether the observed differences in model performance were statistically significant, a one-way Analysis of Variance (ANOVA) test was applied. ANOVA is a robust statistical method used to compare the means of multiple groups (in this case, the performance metrics of the different models) and assess whether at least one model performs differently from the others. The null hypothesis (H0) assumes no significant difference among the models, while the alternative hypothesis (H1) suggests that at least one model shows superior performance. The F-statistic and corresponding p-values were calculated for each performance metric (R2, MSE, and MAE). A p-value below 0.05 indicates that the null hypothesis can be rejected, confirming that significant performance differences exist between models (refer to Table 16).
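This test can be reproduced with SciPy’s `f_oneway`. In the sketch below, the four per-model samples are synthetic stand-ins for the 30-run R2 results (the means and spreads are illustrative, not the paper’s values in Table 15).

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)
# Hypothetical R^2 samples from 30 runs of each model (stand-ins only).
mlr = rng.normal(0.84, 0.010, 30)
gp = rng.normal(0.83, 0.012, 30)
xgb = rng.normal(0.92, 0.008, 30)
hybrid = rng.normal(0.93, 0.007, 30)

# One-way ANOVA across the four groups: H0 = all group means are equal.
f_stat, p_value = f_oneway(mlr, gp, xgb, hybrid)
print(round(f_stat, 2), p_value < 0.05)
```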

4.5.2. Post-Hoc Tukey’s HSD Test

Following the ANOVA test, which identified significant differences among models, we conducted a Tukey’s Honestly Significant Difference (HSD) test as a post-hoc analysis. Tukey’s HSD test is designed to determine which specific model pairs exhibit statistically significant differences. This method compares all possible pairs of models and adjusts for multiple comparisons to control the family-wise error rate. The test provides p-values that indicate the likelihood that performance differences between two models occurred by chance. p-values less than 0.05 suggest a statistically significant difference between model performances. This analysis offers a detailed understanding of how each model compares against others, as summarized in Table 17.
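SciPy (1.8+) provides `tukey_hsd` for exactly this pairwise comparison. The sketch below again uses synthetic stand-in R2 samples for the 30 runs of each model; the resulting p-value matrix plays the role of Table 17.

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(4)
# Stand-in R^2 samples for 30 runs of each model (illustrative only).
samples = {
    "MLR": rng.normal(0.84, 0.010, 30),
    "GP": rng.normal(0.83, 0.012, 30),
    "XGBoost": rng.normal(0.92, 0.008, 30),
    "GP-XGBoost": rng.normal(0.93, 0.007, 30),
}

# All pairwise comparisons with family-wise error rate control.
res = tukey_hsd(*samples.values())
names = list(samples)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], "vs", names[j], "p =", round(res.pvalue[i, j], 4))
```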

4.5.3. Model Validation Techniques

The model validation techniques employed in this study were designed to rigorously assess the predictive performance and reliability of the proposed models. Initially, the dataset was randomly split into 80% training and 20% testing subsets to evaluate model generalization, with random seeds used to ensure reproducibility across multiple runs. To further reduce the risk of overfitting and obtain robust performance metrics, a 5-fold cross-validation approach was implemented, where the dataset was divided into five subsets, using four for training and one for validation in a rotating manner. To enhance the reliability of the results, the entire modelling process was repeated 30 times with different random splits, and the mean and standard deviation of key performance metrics (R2, MSE, and MAE) were calculated to assess model stability. Finally, statistical significance tests, including ANOVA and Tukey’s HSD, were conducted to verify that performance differences between models were statistically significant, adding an extra layer of validation to this study’s findings.
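The repeated-split part of this protocol (an 80/20 split, repeated 30 times with different seeds, reporting the mean and standard deviation of a metric) can be sketched as follows. An ordinary-least-squares model on synthetic data stands in for the actual models; only the validation mechanics are the point here.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(size=(200, 4))                       # stand-in features
y = 0.3 * X[:, 2] + rng.normal(0, 0.02, 200)         # toy target

r2_scores = []
for seed in range(30):                               # 30 repeats, new split each time
    idx = np.random.default_rng(seed).permutation(len(X))
    split = int(0.8 * len(X))                        # 80% train / 20% test
    tr, te = idx[:split], idx[split:]

    # Ordinary least squares (with intercept) as a stand-in model.
    A = np.column_stack([np.ones(len(tr)), X[tr]])
    coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    pred = np.column_stack([np.ones(len(te)), X[te]]) @ coef

    ss_res = np.sum((y[te] - pred) ** 2)
    ss_tot = np.sum((y[te] - y[te].mean()) ** 2)
    r2_scores.append(1 - ss_res / ss_tot)

print(round(np.mean(r2_scores), 3), "±", round(np.std(r2_scores), 3))
```

A small standard deviation across the 30 splits is the stability signal the study reports alongside each mean metric.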

5. Conclusions

This study proposed a hybrid GP-XGBoost model for predicting the compression index (Cc) of clayey soils, integrating the symbolic regression capabilities of Genetic Programming (GP) with the robust predictive power of Extreme Gradient Boosting (XGBoost). The model demonstrated superior performance compared to traditional empirical methods and standalone machine learning models, achieving an R2 of 0.927 with significantly reduced prediction errors. The feature importance analysis highlighted key geotechnical parameters—such as the liquid limit (LL), the plasticity index (PI), the initial void ratio (e0), and the natural moisture content (wn)—as critical factors influencing soil compressibility, aligning with established soil mechanics principles.
Despite the promising results, this study acknowledges several limitations, including parameter dependencies (e.g., the LL and the PI), limited dataset diversity, and potential overfitting risks. To address these challenges, future research should incorporate dimensionality reduction techniques, external validation with independent datasets, and simplified symbolic models for enhanced interpretability. Additionally, expanding the dataset to include diverse soil types and environmental conditions will improve the model’s generalizability in real-world applications.
The findings of this research offer valuable insights for geotechnical engineers involved in foundation design, settlement analysis, and infrastructure planning, providing a data-driven approach to complement traditional soil mechanics theories. By advancing predictive modeling techniques and addressing the identified limitations, future studies can further improve the accuracy, reliability, and practical applicability of machine learning models in geotechnical engineering.

Author Contributions

Conceptualization, A.B. and K.K.; methodology, A.B.; software, K.K.; validation, A.B., H.A.-N. and Y.L.; formal analysis, A.B., H.A.-N. and Y.L.; investigation, K.K.; resources, K.K.; data curation, K.K.; writing—original draft preparation, A.B.; writing—review and editing, H.A.-N. and Y.L.; visualization, K.K.; supervision, H.A.-N.; project administration, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Cc: Compression Index
MLR: Multiple Linear Regression
GP: Genetic Programming
XGBoost: eXtreme Gradient Boosting
MAE: Mean Absolute Error
MSE: Mean Squared Error
R2: Coefficient of Determination
ML: Machine Learning
LL: Liquid Limit
PI: Plasticity Index
e0: Initial Void Ratio
w: Water Content

Appendix A

Table A1. Used database in this study.
No. | LL (%) | PI (%) | e0 | w (%) | Cc
156360.49820.20.169
230100.71829.70.11
335150.88336.60.246
42970.75332.10.213
540220.69524.40.123
639190.77290.279
748250.72425.60.163
833100.63222.60.116
93080.54620.30.149
1040210.58523.70.22
1152280.73129.40.22
1254310.75529.80.149
1358360.867340.196
1450290.7431.20.259
1549280.9736.50.29
1636170.67626.50.159
1747290.82631.40.179
1838170.52919.40.11
1945220.80927.10.22
2046250.77530.30.173
2147260.71728.90.246
2229110.7124.70.19
2347240.91537.50.29
242980.71720.50.146
2534120.66824.10.106
2636170.67626.50.159
2735130.82530.80.2
2831120.62123.90.156
2943200.73226.90.163
303080.60523.90.133
3154310.75529.80.149
322550.86930.80.2
3353280.58322.40.169
3452280.51719.50.14
3545220.65227.10.18
3652280.80628.90.28
372551.22248.70.41
3834120.67525.20.229
3933100.76831.40.252
4036140.75329.70.2
4153290.98390.26
4234130.73425.30.2
432680.82428.90.199
442790.63200.193
4539190.66726.40.279
4633130.78933.60.279
4735160.66325.20.14
482780.67524.10.126
4931110.73625.90.21
5035140.67525.30.13
5147260.78529.80.173
522750.51911.10.126
5340180.58825.20.16
5454300.77631.20.183
552980.66326.70.12
562970.63722.70.183
5744240.629270.166
582780.66125.30.156
5956340.61222.60.15
6036150.697250.183
6131110.83129.80.176
6243220.798290.209
6333140.6924.90.13
6453350.64223.10.169
6546220.80128.60.153
6653270.9739.80.252
6734150.76123.70.189
6831110.72325.30.169
6958350.80727.30.229
7037180.80227.30.179
7136140.77727.40.229
7242220.65821.70.22
7331110.76927.20.176
7437160.77631.80.166
7534120.82428.10.269
76 34 140.57322.10.123
772760.64322.40.189
7843210.64322.10.203
7932100.644270.196
8035130.65223.20.206
8139200.65328.50.186
822680.82428.90.199
833080.60523.90.133
842990.77728.40.14
8543200.73226.90.163
8639170.7325.80.196
8729110.65825.10.149
8837160.73825.10.203
8931140.61926.40.14
9030150.74626.20.183
9160300.73326.30.196
9230100.60222.60.13
933170.73827.50.173
9430100.87431.10.209
9533120.703280.103
9643190.82829.80.186
9732100.71825.70.166
9831110.63523.10.146
9956280.90936.90.27
10034150.73927.50.146
10137150.76129.60.173
10242200.71626.90.216
10337170.723260.21
10444220.66726.70.123
1052660.70424.80.226
1062440.558220.103
10734130.74524.70.18
10837150.87327.60.329
10939170.99341.10.259
11060361.00849.20.249
11151270.85434.80.249
11236190.67827.20.153
11333140.67727.20.17
11439170.83731.90.2
11544240.96537.80.229
11662441.01438.40.326
11741200.90935.30.226
11837170.721280.163
11937170.73428.30.159
12039190.76124.50.183
12138170.56321.80.103
122 33 160.894340.329
12347270.73629.80.25
12445220.53711.50.13
12552280.61517.60.21
12646230.61118.50.173
12751250.58619.20.186
12830100.78232.20.159
12935130.84135.30.256
13043220.96431.10.365
13149290.80532.10.233
13279451.58757.40.628
13335150.50723.50.11
13437100.92840.90.31
13549280.79732.80.309
13651320.829320.31
13738170.7930.50.249
13842200.75530.30.193
13941220.81631.40.266
14039190.72528.50.183
1416334.151.247.10.34
14246240.84733.60.296
14334130.675220.216
14457.533.20.806250.216
1455633.40.70422.80.141
14658.229.40.57120.80.143
14757.228.50.82122.60.246
14851.921.70.65621.80.179
14945.622.60.74712.40.188
15040.520.90.66316.70.188
15140.923.20.87211.20.206
15250.223.50.671190.194
15347.625.60.61217.20.158
15450.4260.78224.30.176
15548.724.10.84424.90.221
15640.2180.64417.70.15
15744.615.50.71418.60.198
15834.814.70.5221.90.135
1594424.30.5617.70.176
16049.1300.51116.20.123
16141.122.10.7813.30.228
16249.818.70.96833.30.241
16352.311.70.786370.211
16442.79.20.74527.60.158
16561.419.80.93831.20.199
[Appendix dataset, records 166–352 (columns as in Table 2: No., LL (%), PI (%), e0, w (%), Cc); the numeric values of these rows ran together during extraction and are not reproduced here.]

References

  1. Terzaghi, K.; Peck, R.B. Soil Mechanics in Engineering Practice, 2nd ed.; John Wiley & Sons: New York, NY, USA, 1967. [Google Scholar]
  2. Skempton, A.W. Notes on the compressibility of clays. Q. J. Geol. Soc. 1944, 100, 119–135. [Google Scholar] [CrossRef]
  3. Mesri, G.; Castro, A. The Coefficient of Secondary Compression. J. Geotech. Eng. 1987, 113, 1001–1016. [Google Scholar] [CrossRef]
  4. Zhang, L.; Tan, Z.; Li, J. ML Applications in Predicting Soil Consolidation. Geotech. Res. 2021, 34, 1123–1132. [Google Scholar]
  5. Carrier, W.D. Geotechnical Properties of Soils. J. Geotech. Geoenviron. Eng. 2003, 129, 307–320. [Google Scholar]
  6. Bowles, J.E. Foundation Analysis and Design, 5th ed.; McGraw-Hill: New York, NY, USA, 1996. [Google Scholar]
  7. Mesri, G. New Trends in Soil Compressibility. Geotech. Geol. Eng. 2001, 19, 285–305. [Google Scholar]
  8. Azzouz, A.S.; Krizek, R.J.; Corotis, R.B. Regression Analysis of Soil Compressibility. Soils Found. 1976, 16, 19–29. [Google Scholar] [CrossRef]
  9. Koppula, S.D. Compression Index of Soils and Its Relationship with Soil Properties. Indian Geotech. J. 1984, 14, 327–342. [Google Scholar]
  10. Sridharan, A.; Prakash, K. Mechanisms Controlling the Undrained Shear Strength Behavior of Clays. Can. Geotech. J. 1999, 36, 1030–1038. [Google Scholar] [CrossRef]
  11. Yoshinaka, R.; Osada, M. Empirical equations for predicting soil compressibility. Soils Found. 2005, 45, 111–120. [Google Scholar]
  12. Mohammadzadeh, S.D.; Kazemi, S.F.; Mosavi, A.; Nasseralshariati, E.; Tah, J.H. Prediction of compression index of fine-grained soils using a gene expression programming model. Infrastructures 2019, 4, 26. [Google Scholar] [CrossRef]
  13. Kumar, R.; Jain, P.K.; Dwivedi, P. Prediction of compression index (Cc) of fine grained remolded soils from basic soil properties. Int. J. Appl. Eng. Res. 2016, 11, 592–598. [Google Scholar]
  14. Nesamatha, R.; Arumairaj, P.D. Numerical modeling for prediction of compression index from soil index properties. Electron. J. Geotech. Eng. 2015, 20, 4369–4378. [Google Scholar]
  15. Uzer, A.U. Accurate Prediction of Compression Index of Normally Consolidated Soils Using Artificial Neural Networks. Buildings 2024, 14, 2688. [Google Scholar] [CrossRef]
  16. Koza, J.R. Genetic Programming: On the Programming of Computers by Means of Natural Selection; MIT Press: Cambridge, MA, USA, 1992. [Google Scholar]
  17. Gandomi, A.H.; Alavi, A.H. Genetic Programming and its Applications in Engineering. Comput. Struct. 2012, 89, 2513–2525. [Google Scholar]
  18. Baghbani, A.; Costa, S.; Lu, Y.; Soltani, A.; Abuel-Naga, H.; Samui, P. Effects of particle shape on shear modulus of sand using dynamic simple shear testing. Arab. J. Geosci. 2023, 16, 422. [Google Scholar] [CrossRef]
  19. Pham, T.A.; Ly, H.B.; Tran, V.Q.; Giap, L.V.; Vu, H.L.T.; Duong, H.A.T. Prediction of pile axial bearing capacity using artificial neural network and random forest. Appl. Sci. 2020, 10, 1871. [Google Scholar] [CrossRef]
  20. Ahmadi, H.; Behbahani, H.; Zeynali, M. Using Genetic Programming to Predict Soil Shear Strength. J. Geotech. Eng. 2014, 140, 04014032. [Google Scholar]
  21. Nashaat, I.; Mohammed, I.; Dessouky, M.; Said, H. Mismatch-aware Placement of Device Arrays using Genetic Optimization. In Proceedings of the 2018 15th International Conference on Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD), Prague, Czech Republic, 2–5 July 2018; pp. 177–180. [Google Scholar]
  22. Kiany, K.; Baghbani, A.; Abuel-Naga, H.; Baghbani, H.; Arabani, M.; Shalchian, M.M. Enhancing Ultimate Bearing Capacity Prediction of Cohesionless Soils Beneath Shallow Foundations with Grey Box and Hybrid AI Models. Algorithms 2023, 16, 456. [Google Scholar] [CrossRef]
  23. Nong, X.; Bai, W.; Yi, S.; Baghbani, A.; Lu, Y. Vibration mitigation performance of a novel grouting material in the tunnel environment. Constr. Build. Mater. 2024, 452, 138995. [Google Scholar] [CrossRef]
  24. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  25. Tiwari, A.; Kumar, A. Modeling of Soil Properties Using XGBoost. J. Soil Sci. 2021, 26, 511–528. [Google Scholar]
  26. Pal, S.; Deswal, S. Prediction of Soil Compaction Characteristics Using ML Techniques. Constr. Build. Mater. 2020, 230, 117032. [Google Scholar]
  27. Ma, J.; Yu, Z.; Qu, Y.; Xu, J.; Cao, Y. Application of the XGBoost machine learning method in PM2.5 prediction: A case study of Shanghai. Aerosol Air Qual. Res. 2020, 20, 128–138. [Google Scholar] [CrossRef]
  28. Zhao, H.; Chen, X.; Jiang, L. Feature Engineering for Soil Property Prediction Using XGBoost. J. Soil Mech. 2021, 17, 234–241. [Google Scholar]
  29. Alavi, A.H.; Gandomi, A.H. Applications of Artificial Intelligence in Geotechnical Engineering; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  30. Baghbani, A.; Faradonbeh, R.S.; Lu, Y.; Soltani, A.; Kiany, K.; Baghbani, H.; Abuel-Naga, H.; Samui, P. Enhancing earth dam slope stability prediction with integrated AI and statistical models. Appl. Soft Comput. 2024, 164, 111999. [Google Scholar] [CrossRef]
  31. Safa, M.; Taha, M.R.; Najafi, M. Hybrid ML Approaches for Soil Behavior Prediction. Environ. Earth Sci. 2020, 79, 602. [Google Scholar]
  32. Kafle, B.; Baghbani, A.; Pempeit, R.; Shrestha, K. Investigating the Mechanical Behavior of Unbound Granular Material (UGM) for Road Pavement Construction Applications: A Western Victoria Case Study. Int. J. Geosynth. Ground Eng. 2024, 10, 29. [Google Scholar] [CrossRef]
  33. Baghbani, A.; Soltani, A.; Kiany, K.; Daghistani, F. Predicting the Strength Performance of Hydrated-Lime Activated Rice Husk Ash-Treated Soil Using Two Grey-Box Machine Learning Models. Geotechnics 2023, 3, 894–920. [Google Scholar] [CrossRef]
  34. Deng, X.; Li, M.; Deng, S.; Wang, L. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Med. Biol. Eng. Comput. 2022, 60, 663–681. [Google Scholar] [CrossRef]
  35. Shahin, M.A.; Jaksa, M.B. Machine Learning in Geotechnics: Comparative Study. J. Comput. Civ. Eng. 2019, 33, 04019022. [Google Scholar]
  36. Wang, X.; Tran, Q. ML Algorithms for Predicting Geotechnical Parameters. Geosci. Front. 2020, 12, 79–89. [Google Scholar]
  37. Li, Y.; Zhang, D. Empirical Correlations for Compression Index. Geotech. Eng. J. 2020, 32, 219–226. [Google Scholar]
  38. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
  39. Baghbani, A.; Abuel-Naga, H.; Shirkavand, D. Accurately Predicting Quartz Sand Thermal Conductivity Using Machine Learning and Grey-Box AI Models. Geotechnics 2023, 3, 638–660. [Google Scholar] [CrossRef]
  40. Soltani, A.; Azimi, M.; O’Kelly, B.C.; Baghbani, A.; Taheri, A. Statistical Reappraisal of the Wax and Mercury Methods for Shrinkage Limit Determinations of Fine-Grained Soils. Geotech. Geol. Eng. 2024, 42, 5317–5333. [Google Scholar] [CrossRef]
  41. Alhaji, M.M.; Alhassan, M.; Tsado, T.Y.; Mohammed, Y.A. Compression Index Prediction Models for Fine-grained Soil Deposits in Nigeria. In Proceedings of the 2nd International Engineering Conference, Minna, Nigeria; Federal University of Technology: Minna, Nigeria, 2017. [Google Scholar]
  42. Benbouras, M.A.; Kettab Mitiche, R.; Zedira, H.; Petrisor, A.I.; Mezouar, N.; Debiche, F. A new approach to predict the compression index using artificial intelligence methods. Mar. Georesour. Geotechnol. 2019, 37, 704–720. [Google Scholar] [CrossRef]
  43. McCabe, B.A.; Sheil, B.B.; Long, M.M.; Buggy, F.J.; Farrell, E.R. Empirical correlations for the compression index of Irish soft soils. Proc. Inst. Civ. Eng. Geotech. Eng. 2014, 167, 510–517. [Google Scholar] [CrossRef]
  44. Mitachi, T.; Ono, T. Prediction of undrained shear strength of overconsolidated clay. Tsuchi Kiso JSSMFE 1985, 33, 21–26. [Google Scholar]
  45. Widodo, S.; Ibrahim, A. Estimation of primary compression index (Cc) using physical properties of Pontianak soft clay. Int. J. Eng. Res. Appl. 2012, 2, 2231–2235. [Google Scholar]
  46. Zaman, M.W.; Hossain, M.R.; Shahin, H.; Alam, A.A. A study on correlation between consolidation properties of soil with liquid limit, in situ water content, void ratio and plasticity index. Geotech. Sustain. Infrastruct. Dev. 2016, 5, 899–902. [Google Scholar]
Figure 1. Pearson correlation heatmap of variables influencing Cc.
Figure 2. Workflow of GP for model optimization.
Figure 3. Workflow of XGBoost algorithm for model development.
Figure 4. Evaluation of actual vs. predicted Cc using MLR through various visualizations.
Figure 5. Evaluation of actual vs. predicted Cc using GP through various visualizations.
Figure 6. Evaluation of actual vs. predicted Cc using XGBoost through various visualizations.
Figure 7. Evaluation of actual vs. predicted Cc using hybrid GP-XGBoost model through various visualizations.
Figure 8. Residual and error distribution analysis of Cc predictions across GP, XGBoost, GP-XGBoost, and MLR models.
Figure 9. Feature importance comparison for predicting Cc across MLR, GP, XGBoost, and GP-XGBoost models.
Figure 10. Scatter matrix plot of actual and predicted Cc for testing data using different models [1,8,9,11].
Table 1. Empirical correlations to determine Cc.
Reference | Inputs | Type of Soil | Formula
Skempton [2] | Liquid Limit (LL) | Normally Consolidated Clays | Cc = 0.007 × (LL − 10)
Terzaghi and Peck [1] | Liquid Limit (LL) | Normally Consolidated Clays | Cc = 0.009 × (LL − 10)
Bowles [6] | Liquid Limit (LL) | Normally Consolidated Clays | Cc = 0.0046 × (LL − 9)
Azzouz et al. [8] | Natural Water Content (w) | Clayey Soils | Cc = 0.3 × w − 0.05
Koppula [9] | Initial Void Ratio (e0) | Normally Consolidated Silty Clays | Cc = 0.15 × e0
Sridharan and Prakash [10] | Liquid Limit (LL) | General | Cc = 0.012 × LL
Yoshinaka and Osada [11] | Liquid Limit (LL), Plastic Limit (PL) | General | Cc = 0.0042 × (LL − PL)
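For reference, the single-input correlations in Table 1 can be evaluated directly; the sketch below encodes three of them as plain functions (the function names are illustrative, not from the paper):

```python
# Illustrative encodings of three single-input correlations from Table 1.
def cc_skempton(ll):
    """Skempton [2], normally consolidated clays: Cc = 0.007 (LL - 10)."""
    return 0.007 * (ll - 10)

def cc_terzaghi_peck(ll):
    """Terzaghi and Peck [1], normally consolidated clays: Cc = 0.009 (LL - 10)."""
    return 0.009 * (ll - 10)

def cc_koppula(e0):
    """Koppula [9], normally consolidated silty clays: Cc = 0.15 e0."""
    return 0.15 * e0

# For a clay near the dataset means (LL = 43%, e0 = 0.82):
# cc_terzaghi_peck(43) -> 0.297, cc_koppula(0.82) -> 0.123
```

Because each correlation uses a single index property, their predictions diverge quickly for soils whose LL, e0 and w are not mutually consistent, which is the gap the data-driven models target.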
Table 2. Statistical metrics for the complete dataset used in predicting Cc.
Variable | Observations | Minimum | Maximum | Mean | Std. Deviation
Cc | 352 | 0.050 | 1.64 | 0.241 | 0.189
LL (%) | 352 | 22.000 | 135.00 | 42.963 | 14.247
PI (%) | 352 | 3.000 | 82.00 | 20.304 | 10.764
e0 | 352 | 0.357 | 3.00 | 0.819 | 0.318
w (%) | 352 | 10.200 | 131.00 | 30.179 | 13.582
Table 3. Statistical metrics for the training dataset used in predicting Cc.
Variable | Observations | Minimum | Maximum | Mean | Std. Deviation
Cc | 282 | 0.050 | 1.64 | 0.244 | 0.201
LL (%) | 282 | 22.000 | 135.00 | 42.889 | 14.640
PI (%) | 282 | 3.000 | 82.00 | 20.278 | 10.943
e0 | 282 | 0.368 | 3.00 | 0.822 | 0.335
w (%) | 282 | 10.600 | 131.00 | 30.352 | 14.293
Table 4. Statistical metrics for the testing dataset used in predicting Cc.
Variable | Observations | Minimum | Maximum | Mean | Std. Deviation
Cc | 70 | 0.080 | 0.85 | 0.230 | 0.128
LL (%) | 70 | 24.000 | 87.00 | 43.261 | 12.636
PI (%) | 70 | 4.000 | 47.00 | 20.411 | 10.088
e0 | 70 | 0.357 | 1.80 | 0.809 | 0.236
w (%) | 70 | 10.200 | 73.00 | 29.479 | 10.288
Table 5. The performance of the MLR method to predict Cc for training and testing databases.
Performance Metric | Training Database | Testing Database
MAE | 0.054 | 0.059
MSE | 0.006 | 0.008
R2 | 0.879 | 0.843
RMSE | 0.065 | 0.070
MAPE | 25.901 | 28.184
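The five metrics reported in Tables 5, 7, 9 and 11 are standard; a minimal stdlib-only sketch of their computation is below (MAPE divides by the observed values, so this assumes strictly positive Cc, as in this dataset):

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2 and MAPE (%) for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - (mse * n) / ss_tot                      # 1 - SS_res / SS_tot
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2, "MAPE": mape}
```

These definitions are the conventional ones; the paper does not publish its metric code, so this is offered only as a reading aid for the tables.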
Table 6. Variables and parameters in GP proposed equation.
Variables in Equation (18) | Parameters/Values
Y | Cc
X1 | LL
X2 | PI
X3 | e0
X4 | w
R1 | Constant = 0.782
R2 | Constant = 0.221
R3 | Constant = 0.320
Table 7. The performance of the GP method to predict Cc for training and testing databases.
Performance Metric | Training Database | Testing Database
MAE | 0.039 | 0.040
MSE | 0.003 | 0.003
R2 | 0.934 | 0.827
RMSE | 0.052 | 0.053
MAPE | 17.012 | 17.541
Table 8. Optimized hyperparameters for GP.
Parameter | Optimized Value
Population Size | 1000
Generations | 60
Mutation Rate | 0.08
Crossover Rate | 0.75
Selection Method | Tournament (Size = 7)
Fitness Function | MSE
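Tournament selection with size 7 (Table 8) works by sampling seven candidate programs at random and keeping the one with the lowest MSE; a hypothetical pure-Python sketch follows (the paper does not specify its GP implementation at this level of detail):

```python
import random

def tournament_select(population, mse_scores, size=7, rng=random):
    """Sample `size` candidate indices at random; return the lowest-MSE individual."""
    contenders = rng.sample(range(len(population)), size)
    best = min(contenders, key=lambda i: mse_scores[i])
    return population[best]

# With size equal to the population size, selection is deterministic:
# tournament_select(["a", "b", "c"], [0.3, 0.1, 0.2], size=3) -> "b"
```

In each of the 60 generations, the winners of repeated tournaments would then be recombined by crossover (rate 0.75) and perturbed by mutation (rate 0.08) to form the next population.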
Table 9. The performance of XGBoost method to predict Cc for training and testing databases.
Performance Metric | Training Database | Testing Database
MAE | 0.038 | 0.028
MSE | 0.002 | 0.001
R2 | 0.939 | 0.916
RMSE | 0.049 | 0.037
MAPE | 16.012 | 14.541
Table 10. Optimized hyperparameters for XGBoost.
Parameter | Optimized Value
Learning Rate (η) | 0.05
Max Depth (d) | 7
Number of Trees (N_estimators) | 400
Subsample Ratio | 0.85
Colsample_bytree | 0.8
Regularization (λ) | 1.2
Min Child Weight | 5
Gamma | 0.1
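Assuming the scikit-learn style interface of the xgboost library, the tuned settings in Table 10 map onto a parameter dictionary along these lines (the keys follow the library's naming, not the paper's):

```python
# Hyperparameters from Table 10, expressed with xgboost's parameter names.
xgb_params = {
    "learning_rate": 0.05,     # η, shrinkage applied to each tree's contribution
    "max_depth": 7,            # d, maximum depth per tree
    "n_estimators": 400,       # number of boosted trees
    "subsample": 0.85,         # row subsampling ratio per tree
    "colsample_bytree": 0.8,   # feature subsampling ratio per tree
    "reg_lambda": 1.2,         # L2 regularization λ on leaf weights
    "min_child_weight": 5,     # minimum hessian sum needed in a leaf
    "gamma": 0.1,              # minimum loss reduction required to split
}
# model = xgboost.XGBRegressor(**xgb_params)   # then model.fit(X_train, y_train)
```

The relatively small learning rate with 400 trees is a common trade-off: slower per-tree learning compensated by more boosting rounds.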
Table 11. The performance of the hybrid GP-XGBoost method to predict Cc for training and testing databases.
Performance Metric | Training Database | Testing Database
MAE | 0.030 | 0.028
MSE | 0.001 | 0.001
R2 | 0.966 | 0.927
RMSE | 0.037 | 0.034
MAPE | 12.901 | 11.184
Table 12. Variables and parameters in GP-XGBoost proposed equation.
Variables in Equation (19) | Parameters/Values
Y | Cc
X1 | LL
X2 | PI
X3 | e0
X4 | w
R1 | 0.648
R2 | 0.264
R3 | 0.213
Table 13. Optimized hyperparameters for GP-XGBoost.
Parameter | Optimized Value
GP Feature Selection | Top 4 Features
Ensemble Weight (w1) | 0.35
Ensemble Weight (w2) | 0.65
Number of XGBoost Trees | 400
Feature Engineering | Yes
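With the weights in Table 13, the hybrid prediction is a weighted blend of the two base models' outputs; a minimal sketch, assuming the blend is a simple element-wise weighted sum (variable names are illustrative):

```python
W_GP, W_XGB = 0.35, 0.65  # ensemble weights w1 and w2 from Table 13

def hybrid_predict(gp_pred, xgb_pred, w_gp=W_GP, w_xgb=W_XGB):
    """Blend GP and XGBoost predictions of Cc element-wise."""
    return [w_gp * g + w_xgb * x for g, x in zip(gp_pred, xgb_pred)]
```

The heavier weight on XGBoost (0.65) is consistent with its stronger stand-alone testing performance in Table 9 relative to GP in Table 7.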
Table 14. Performance comparison of GP and XGBoost with empirical models for predicting Cc.
Source | R2 | MSE
GP and XGBoost | 0.927383 | 0.027343
Terzaghi and Peck [1] | 0.149181 | 0.09012
Azzouz et al. [8] | 0.091134 | 0.189196
Koppula [9] | 0.029736 | 0.106369
Yoshinaka and Osada [11] | 0.093919 | 0.131074
Table 15. Statistical summary of model performance (30 runs).
Model | R2 (Mean ± SD) | MSE (Mean ± SD) | MAE (Mean ± SD)
MLR | 0.843 ± 0.019 | 0.0084 ± 0.0012 | 0.059 ± 0.004
GP | 0.827 ± 0.025 | 0.0033 ± 0.0008 | 0.040 ± 0.003
XGBoost | 0.916 ± 0.015 | 0.0012 ± 0.0005 | 0.028 ± 0.0002
GP-XGBoost | 0.927 ± 0.011 | 0.0010 ± 0.0004 | 0.027 ± 0.001
Table 16. ANOVA results.
Metric | F-Statistic | p-Value
R2 | 87.42 | <0.001
MSE | 132.67 | <0.001
MAE | 109.83 | <0.001
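The one-way ANOVA F-statistics in Table 16 compare the per-run metric values across the four models; a stdlib-only sketch of the standard computation is below (the reported values come from the authors' 30-run experiments, whose raw data are not reproduced here):

```python
def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

A large F with p < 0.001, as in Table 16, indicates that at least one model's mean metric differs from the others, which is what motivates the pairwise Tukey HSD comparisons in Table 17.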
Table 17. Tukey’s HSD test results (p-values).
Model Comparison | R2 (p-Value) | MSE (p-Value) | MAE (p-Value)
GP vs. MLR | 0.027 | 0.002 | 0.001
GP-XGBoost vs. MLR | 0.149181 | 0.09012 | <0.001
XGBoost vs. GP | 0.091134 | 0.189196 | <0.001
GP-XGBoost vs. GP | 0.029736 | 0.106369 | 0.001
GP-XGBoost vs. XGBoost | 0.432 | 0.289 | 0.315

Share and Cite

MDPI and ACS Style

Baghbani, A.; Kiany, K.; Abuel-Naga, H.; Lu, Y. Predicting the Compression Index of Clayey Soils Using a Hybrid Genetic Programming and XGBoost Model. Appl. Sci. 2025, 15, 1926. https://doi.org/10.3390/app15041926
