Article

Comparative Study of Machine Learning Techniques for Predicting UCS Values Using Basic Soil Index Parameters in Pavement Construction

Civil Engineering (CE) Department, The University of Mississippi (UM), University, MS 38677, USA
* Author to whom correspondence should be addressed.
Infrastructures 2025, 10(7), 153; https://doi.org/10.3390/infrastructures10070153
Submission received: 20 April 2025 / Revised: 11 June 2025 / Accepted: 20 June 2025 / Published: 24 June 2025

Abstract

This study investigated the prediction of unconfined compressive strength (UCS), a common measure of soil’s undrained shear strength, using fundamental soil characteristics. While traditional pavement subgrade design often relies on parameters like the resilient modulus and California bearing ratio (CBR), researchers are exploring the potential of incorporating more easily obtainable strength indicators, such as UCS. To evaluate the potential effectiveness of UCS for pavement engineering applications, a dataset of 152 laboratory-tested soil samples was compiled to develop predictive models. For each sample, geotechnical properties including the Atterberg limits (liquid limit (LL) and plastic limit (PL)), water content (WC), and bulk density (determined using the Harvard miniature compaction apparatus), alongside the UCS, were measured. This dataset served to train various models to estimate the UCS from basic soil parameters. The methods employed included multi-linear regression (MLR), multi-nonlinear regression (MNLR), and several machine learning techniques: backpropagation artificial neural networks (ANNs), gradient boosting (GB), random forest (RF), support vector machine (SVM), and K-nearest neighbor (KNN). The aim was to establish a relationship between the dependent variable (UCS) and the independent basic geotechnical properties and to test the effectiveness of each ML algorithm in predicting UCS. The results indicate that the ANN-based model provided the most accurate predictions for UCS, achieving an R2 of 0.83, a root-mean-squared error (RMSE) of 1.11, and a mean absolute relative error (MARE) of 0.42. The performance ranking of the other models, from best to worst, was RF, GB, SVM, KNN, MLR, and MNLR.

1. Introduction

The UCS test, a fundamental component of geotechnical engineering, measures the soil’s undrained shear strength, a crucial factor in the planning and evaluation of several engineering applications. Because it is effectively a triaxial test with zero confinement, the UCS test provides a more straightforward and conservative strength value than a conventional triaxial strength test. Although it has been used in railway engineering, slope stability analysis, and foundation design, its function in pavement construction is especially noteworthy. In pavement applications, UCS represents the subgrade strength under undrained conditions. Ref. [1] indicates that UCS values are correlated with the CBR for treated swelling soils. Therefore, it is essential to determine the UCS accurately to guarantee the long-term performance and structural integrity of pavement systems.
To support mechanistic pavement design techniques, the elastic modulus for each layer of the pavement structure must be determined or estimated based on elastic layer theory [2]. The soil subgrade is often characterized by its resilient modulus, which is determined through repeated stress testing. However, due to the complexities involved in cyclic testing, the resilient modulus is frequently approximated using shear strength measures, which may not fully capture the stress-dependent behavior of soils. Ref. [3] developed several models to predict the resilient modulus of silty and clayey subgrades from UCS values. The model that incorporated UCS values, along with basic soil parameters, Atterberg limits, and soil texture, achieved a higher correlation accuracy, with an R2 of 73%.
Recent developments in ML have opened new opportunities in pavement design and performance monitoring. For instance, ref. [4] trained a random forest machine learning model to simplify the pavement design procedure. The training dataset was produced based on the AASHTOWare Pavement ME Design tool, with 79,600 design cases for both rigid and flexible pavements. The prediction model helps predict pavement thickness and distress with excellent accuracy. Similarly, refs. [5,6] employed machine learning algorithms such as SVM, K-NN, RF, and ANN to monitor pavement performance. With good accuracy, the results confirmed the potential value of ML models for pavement engineers in the design and lifecycle monitoring of pavements.
In parallel, many studies have explored how basic geotechnical parameters such as bulk density, water content, particle size distribution, and Atterberg limits can be used to estimate soil shear strength, thereby reducing the need for time-consuming laboratory testing. The impact of several parameters on soil shear strength has been investigated in earlier studies. For example, ref. [7] studied the influence of soil texture, moisture level, and density on the strength parameters, cohesion, and friction angle of paddy soils. Ref. [8] implemented machine learning methods to measure rock samples’ uniaxial strength in real time while drilling. In addition to ANN models, they trained an SV and the adaptive neuro-fuzzy inference system (ANFIS) using data from prior field rock sampling; the best model had an R2 of 99%.
Among ML techniques, GB has recently been exploited in civil engineering applications because it can capture complex relations between dependent and independent variables. It consists of a number of base learners, such as decision trees, smooth models, and linear models, that serve as building blocks for the predictive model [9]. The approach combines base learners in successive stages, each stage improving the prediction by following the gradient of the loss function, which is where the name gradient boosting comes from. GB is highly flexible and can adapt to different tasks, including classification and regression, owing to the wide range of loss functions it can minimize during training. This study used a Gaussian loss function based on the squared residual, as shown in Equation (1), where (y − f) is the residual; it is widely used when y is a continuous variable.
ψ(y, f) = ½ (y − f)²    (1)
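To make the role of this loss concrete, the short R sketch below (illustrative only; the vectors y and f are hypothetical values, not data from this study) shows that the negative gradient of the squared-error loss is simply the residual (y − f), which each successive base learner is fitted to.

```r
# Gaussian (squared-error) loss of Equation (1) and its negative gradient.
gaussian_loss     <- function(y, f) 0.5 * (y - f)^2
negative_gradient <- function(y, f) y - f   # the residual that the next base learner fits

y <- c(6.65, 6.90, 7.60)   # observed UCS values (ksf), taken from Table 1 for illustration
f <- c(5.00, 6.00, 8.00)   # current ensemble predictions (hypothetical)
gaussian_loss(y, f)
negative_gradient(y, f)    # residuals passed to the next boosting stage
```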
Another widely used ML method is RF, which consists of multiple decision trees trained independently on random data subsets through a process called bagging. This technique enhances randomization, reduces overfitting, and improves generalization [10]. Each tree in the forest is trained to predict using a random subdivision of data features, making them less likely to make the same error and behave differently. Each tree votes for a classification or prediction, and the forest determines the expected outcome based on voting across the trees [11]. In geotechnical applications, RF is used for prediction and classification. For example, ref. [12] used RF to classify soil based on the USCS (Unified Soil Classification System), using input variables like clay, sand, silt, and water content with less than 2% error.
An SVM is another ML technique that is constructed based on the principle of separating data into subcategories according to their similarities. The separation line, or hyperplane, is optimized by maximizing the margin (distance) between the closest data points and the separation line, reducing the potential structural risk (error) of misclassifying new data points [13]. The data points nearest to the hyperplane are called support vectors [14]. In cases of nonlinear separation, the SVM uses a kernel function to represent the data in a higher-dimensional space, where the data become linearly separable by identifying the optimal hyperplane. This process, called mapping, transforms nonlinear, non-separable data into linearly separable data in higher dimensions [15]. SVMs are also widely used in geotechnical engineering, particularly for assessing landslide and liquefaction potential in various soil types. Ref. [16] found that the SVM outperformed other machine learning algorithms, such as ANNs, in predicting the probability of soil liquefaction under dynamic excitation.
The K-nearest neighbor (KNN) algorithm operates by examining surrounding data points and averaging them to make predictions. The core assumption of KNN is that neighboring points share similar characteristics. However, this assumption can be a disadvantage when the problem is complex and varies significantly from one point to another, such as with soil behavior. KNN is considered a lazy algorithm because it does not generalize during the training stage, meaning it requires the entire training set during the testing stage. This can lead to memory issues when dealing with large datasets [17]. However, KNN performs well in mid-sized datasets, which is applicable to our case. For instance, ref. [18] trained KNN on 110 soil samples to predict subgrade CBR values using basic soil indices such as the OMC (optimum moisture content), density, soil gradation, and Atterberg limits. Among other ML algorithms, KNN demonstrated strong prediction capabilities, achieving an R2 of 90.74% during training and 90.23% for testing data.
Taking advantage of the correlations between shear strength parameters and basic geotechnical properties, previous studies have minimized the cost and effort required for direct shear strength tests. This study focused on relating basic geotechnical parameters to the UCS of soil using statistically based and machine learning approaches. These efforts will set the framework for building a prediction tool for pavement subgrade strength by incorporating the UCS values for a wide range of soils. To achieve this, both MLR and MNLR models, along with various ML algorithms, were employed. The primary objective of this study was to utilize a database of 152 independent cases to develop an accurate prediction model for UCS based on basic soil parameters. Additional goals included testing the capabilities of MLR, MNLR, and ML to predict the UCS; analyzing the relative importance of the input variables; and identifying the most effective model based on accuracy metrics. By addressing these objectives, this study sought to bridge the gap between geotechnical engineering research and practical field applications.

2. Methodology

2.1. Material

Two native soils obtained from the National Sedimentation Laboratory (NSL) in Oxford, Mississippi, were used in this study. The soils were classified as silty clay with low plasticity (CL-ML) and silty clay with high plasticity (CH-MH). To capture a broad range of soil behaviors, different soil mixtures were prepared by varying the proportions of coarse and fine materials. These mixtures ranged from coarse-dominant (95% coarse content–5% fine content) to fine-dominant (95% fine content–5% coarse content). All soil blends were oven-dried and stored in sealed plastic containers. The optimum moisture content of the parent soil was determined to be 15%, as shown in the compaction curve in Figure 1. Accordingly, water contents of 10% and 15% were used during mixing to represent conditions on the dry and optimum sides of the compaction curve, reflecting the typical construction moisture contents of subgrade soils.

2.2. Laboratory Experiments

Soil mixtures were used to conduct a laboratory testing program in accordance with ASTM standards. Various soil properties were determined, including bulk density following [19], UCS following [20], and Atterberg limits following [21]. Table 1 shows sample results from the laboratory program. The UCS value is taken as the peak stress on the stress–strain curve. Figure 2 shows a UCS curve with a maximum uniaxial stress of 10.8 ksf at 1% axial strain.
Figure 3 presents a sample of particle size distribution (PSD) curves for the different soil mixtures prepared in this study. The curves illustrate a range of gradation patterns, from well-graded to gap-graded mixtures, indicating that a wide spectrum of soil gradations was included in this study. For the model input, soil texture was simplified by using the percentages passing and retained on a sieve #200, representing the fine and coarse materials, respectively. This approach enhanced the practicality of the developed models, as the no. 200 sieve (0.075 mm) is commonly used in engineering practice to distinguish between fine-grained and coarse-grained soils. This classification threshold is integral to standard soil classification systems, such as the USCS and the AASHTO Soil Classification System, and is widely adopted in geotechnical engineering applications [22].

2.3. Data Split

The database consisted of 152 experimentally based datasets, separated into three classes—training, testing, and validation—using the cross-validation concept [23]. The training dataset is crucial for each regression model as it includes all input variables and the target output, allowing the models to learn the data trends. The testing and validation datasets are equally important, as they assess the model for potential errors and validate its performance on unseen data, ensuring generalization and robustness. Approximately 50%, 25%, and 25% of the total dataset were allocated to the training, testing, and validation categories, respectively.
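A minimal R sketch of such a split is given below; the file name and random seed are illustrative assumptions, not details reported in the study.

```r
set.seed(123)                                   # illustrative seed for reproducibility
soil_data <- read.csv("soil_lab_results.csv")   # hypothetical file holding the 152 lab records

n   <- nrow(soil_data)
idx <- sample(seq_len(n))                       # shuffle row indices

train_set <- soil_data[idx[1:floor(0.50 * n)], ]                      # ~50% training
test_set  <- soil_data[idx[(floor(0.50 * n) + 1):floor(0.75 * n)], ]  # ~25% testing
valid_set <- soil_data[idx[(floor(0.75 * n) + 1):n], ]                # ~25% validation
```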

2.4. Multiple Linear Regression (MLR)

MLR is commonly used to correlate a dependent variable with multiple independent variables. It helps establish a benchmark for accuracy and provides insight into the relationships between variables. In this study, the UCS was the dependent variable, correlated with basic soil parameters. Excel’s data analysis tool was used to calculate the correlation and regression between UCS and the independent variables. Table 2 presents the Pearson correlation values for UCS and each independent variable across 152 datasets. The Atterberg limits exhibited a strong negative correlation with UCS, which is consistent with geotechnical principles [24]. Higher PL and LL values are typically associated with expansive and compressible soils, which tend to exhibit a lower shear strength and, consequently, lower UCS values. Water content showed a moderate negative correlation with UCS, as expected. An increase in the moisture content generally leads to a reduction in the effective stress, thereby weakening the soil structure and reducing its undrained shear strength.
Bulk density also displayed a negative correlation with UCS, which may initially appear counterintuitive, as a higher density is often linked to increased strength. However, in this study, a higher bulk density was associated with a greater proportion of coarse material, which indicated lower soil cohesion. Since UCS reflects the undrained shear strength of soil, a property more influenced by cohesion and plasticity, fine-grained mixtures with higher clay contents and lower bulk densities tend to exhibit greater UCS values.
Soil texture exhibited a weaker correlation with UCS. However, for practicality and because soil texture data are widely available, all independent variables were included in this model. Since the fine soil fraction, passing sieve #200, was perfectly complementary to the coarse soil fraction, retained on sieve #200, these parameters exhibited multicollinearity. Multicollinearity made it difficult for the regression model to accurately capture the individual effects of each independent variable on the UCS [18]. As a result, only the percentage of retained soil was considered in the prediction model.
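The study performed this regression with Excel’s Data Analysis ToolPak; an equivalent fit in R is sketched below for completeness. The column names are assumed, and only the coarse fraction retained on sieve #200 is included, per the multicollinearity discussion above.

```r
# Multi-linear regression of UCS on the basic soil indices (column names assumed).
mlr_model <- lm(UCS ~ Retained200 + Density + WC + LL + PL, data = train_set)
summary(mlr_model)                           # coefficients and R-squared
mlr_pred  <- predict(mlr_model, newdata = test_set)
```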

2.5. Multiple Nonlinear Regression (MNLR)

The minpack.lm package in R was used to create a nonlinear regression using its nls.lm() function. This package implements the Levenberg–Marquardt algorithm, which iteratively adjusts the predicted values to lower the least-squares error between the observed and predicted data [25]. The algorithm reached a convergence tolerance of 1.00 × 10−8 after 50 iterations, at which point the iterations stopped. The structure of the nonlinear regression model was determined through a trial-and-error approach guided by model performance. Various combinations of functional forms were tested to evaluate how different soil parameters influenced UCS predictions; exponential forms, higher-order polynomials, and mixes of linear and nonlinear functions were tried, and the mixed linear–exponential form achieved the best accuracy.
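The sketch below illustrates one way such a mixed linear–exponential form could be fitted with the Levenberg–Marquardt algorithm through the minpack.lm package; the functional form, starting values, and column names are assumptions for illustration and are not the study’s final fitted model.

```r
library(minpack.lm)

# Mixed linear-exponential form fitted by Levenberg-Marquardt least squares.
mnlr_model <- nlsLM(
  UCS ~ a * exp(b * Retained200) + exp(p * Density) + d * WC + q * LL + r * PL,
  data    = train_set,
  start   = list(a = 10, b = -1, p = -1, d = 0.1, q = -0.1, r = 0.1),  # illustrative starts
  control = nls.lm.control(ftol = 1e-8, maxiter = 50)  # tolerance and iteration cap as reported
)
summary(mnlr_model)
mnlr_pred <- predict(mnlr_model, newdata = test_set)
```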

2.6. Random Forest Model

R was used to train a forest of decision trees via the randomForest package. The default setting in R uses 500 trees, and the optimal number of decision trees was determined manually by testing various tree counts: 100, 300, 500, 700, 1000, and 2000. Increasing the number of trees beyond 500 did not improve accuracy, so 500 trees was retained as the smallest count that produced the best-trained model. The forest was trained on 50% of the data, with 25% reserved for testing and 25% for validation. The parameters used to predict UCS were the percentage retained on sieve #200, LL, PL, WC, and density, consistent across all models.
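A sketch of this setup with the randomForest package is shown below; the column names and seed are illustrative assumptions.

```r
library(randomForest)

set.seed(123)  # illustrative seed
rf_model <- randomForest(
  UCS ~ Retained200 + LL + PL + WC + Density,   # assumed column names
  data       = train_set,
  ntree      = 500,            # tree count retained after testing 100-2000 trees
  importance = TRUE
)
rf_pred_test  <- predict(rf_model, newdata = test_set)
rf_pred_valid <- predict(rf_model, newdata = valid_set)
importance(rf_model)           # relative importance of each input variable
```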

2.7. ANN

Given the inherent complexity and uncertainty of soil properties, a more advanced modeling technique was employed in this study to improve prediction accuracy. The backpropagation feed-forward ANN was chosen as a widely recognized method for exploring complex relationships between inputs and outputs through hidden layers of computational nodes. The training of the ANN models was conducted using TR-SEQ1, a C++-based program [26,27]. Previous work has shown that, in many scenarios, a single hidden layer of nodes can perform as well as or better than multiple layers; consequently, the ANN models utilized in this research incorporated only one hidden layer with numerous nodes.
The optimal number of hidden nodes and training iterations for the ANN model was adaptively determined based on statistical indicators of prediction accuracy, such as R2 and the mean absolute relative error (MARE), evaluated on the training and testing datasets. After establishing the optimal model, it was validated using the validation dataset. Following the 4-step procedure described by [28], the developed ANN model was then re-trained on the entire dataset (once the optimal hidden node count and iterations were fixed) to better capture patterns present across the testing and validation data. This 4-step method often yields ANN models with enhanced performance compared with standard training–testing–validation approaches [28].

2.8. Gradient Boosting Regression

The gradient boosting package (gbm) in R was employed to introduce another regression model. The algorithm reduced the loss function to a minimum value, with the mean squared error (MSE) used as the loss function due to its improved accuracy in regression applications [29]. According to [30], error in regression models is generally preferred to follow a Gaussian (normal) distribution, while a Bernoulli distribution is more suitable for classification. Therefore, this study assumed a Gaussian distribution for the error function.
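A minimal gbm call consistent with this description is sketched below; the tree count, shrinkage, interaction depth, and column names are illustrative assumptions rather than the study’s final tuned values.

```r
library(gbm)

set.seed(123)  # illustrative seed
gb_model <- gbm(
  UCS ~ Retained200 + LL + PL + WC + Density,   # assumed column names
  data              = train_set,
  distribution      = "gaussian",   # squared-error loss, consistent with Equation (1)
  n.trees           = 500,
  shrinkage         = 0.01,         # learning rate
  interaction.depth = 3,            # illustrative value
  cv.folds          = 5
)
best_iter    <- gbm.perf(gb_model, method = "cv")               # iteration with lowest CV error
gb_pred_test <- predict(gb_model, newdata = test_set, n.trees = best_iter)
```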

2.9. KNN

The classification and regression training (Caret) package in R was applied to train the K-nearest neighbors (KNN) model. The optimal value for K was determined by testing various values, with K = 5 being selected as the best fit. A cross-validation method, with 10 folds, was applied to the training data to reduce overfitting and improve the model’s generalization capability. The algorithm computes the Euclidean distance between points to identify the nearest neighbors, and in this study, the predicted value was based on the average of the closest 5 neighbors.
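A caret-based sketch of this procedure is shown below; the candidate k values and 10-fold cross-validation follow the text, while the column names and the centering/scaling step are assumptions added for illustration.

```r
library(caret)

set.seed(123)  # illustrative seed
knn_model <- train(
  UCS ~ Retained200 + LL + PL + WC + Density,    # assumed column names
  data       = train_set,
  method     = "knn",
  preProcess = c("center", "scale"),             # common practice for distance-based models (assumed)
  trControl  = trainControl(method = "cv", number = 10),
  tuneGrid   = data.frame(k = c(5, 7, 9, 11, 13))
)
knn_model$bestTune                               # k = 5 was selected in this study
knn_pred_test <- predict(knn_model, newdata = test_set)
```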

2.10. SVM

SVM was implemented using the e1071 package in R, employing the svm function to train the model. Given the nonlinear nature of the problem, the algorithm utilized a radial basis function (RBF) to map the training set into a high-dimensional space, facilitating hyperplane separation among the training datasets [31]. The similarity distance between points was calculated using an exponential function, as represented in Equation (2). Support vectors (SVs) were identified as the nearest points to the hyperplane, with a margin of error (ε) set at 0.1. The model was subsequently tested for accuracy and validated.
S(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)    (2)
where S represents the similarity between points x_i and x_j, with γ set at 0.167, defining the influence of each training data point in establishing the separation boundary. A lower γ gives each point a broader influence on the separation boundary, whereas a higher value confines its impact to the point’s immediate vicinity.
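The corresponding e1071 call is sketched below; γ and ε follow the values reported here, the column names are assumed, and all other settings are package defaults.

```r
library(e1071)

svm_model <- svm(
  UCS ~ Retained200 + LL + PL + WC + Density,   # assumed column names
  data    = train_set,
  type    = "eps-regression",
  kernel  = "radial",     # RBF kernel of Equation (2)
  gamma   = 0.167,
  epsilon = 0.1
)
svm_pred_test <- predict(svm_model, newdata = test_set)
```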

3. Results and Discussion

3.1. Experimental Results

A statistical summary derived from the laboratory test outcomes is presented in Table 3. The UCS values recorded ranged from 0.1306 to 18.37 Ksf, with an average value of 5.014 Ksf. According to the classification criteria provided by [32], the soil was deemed very soft if its UCS was below 0.25 Ksf and classified as hard when its UCS surpassed 8 Ksf. This observed range suggested that the soil types incorporated within our database represented a wide spectrum of strengths. Furthermore, the Atterberg limits displayed significant variance, which is likely attributable to the considerable differences in the fine material content across the samples, varying from 95% fine in one mixture down to as low as 5% fine in another.
As per [33], the typical density of stiff clay is around 0.125 kcf, and for soft clay, it is about 0.11 kcf. These values fall within the range of our database, further illustrating the diverse nature of the soil types studied. Moreover, the observed failure modes of the UCS samples aligned well with the findings reported in the literature.

3.2. Multi-Linear Regression Model

A common technique employed to relate a dependent variable (Y) to one or multiple independent variables (X) is the multi-linear regression (MLR) model. The preliminary processing of the laboratory results involved conducting linear regression using Excel’s Data Analysis ToolPak. For this study, the MLR model was utilized as a benchmark to evaluate the predictive accuracy of the other models tested. Equation (3) shows the empirical formula for the MLR model.
UCS = 1.43 + 0.80·C + 7.39·ρ_bulk − 0.16·W.C − 0.27·L.Limit + 0.62·P.Limit    (3)
Figure 4 illustrates the comparison between the experimental UCS values and those estimated by the model. The linear model achieved an R2 value of 37%. Given its low precision, the linear model failed to capture the nonlinear behavior inherent in geotechnical parameters such as the Atterberg limits and bulk density. This aligned with the findings of [34], which highlighted MLR’s limited predictive accuracy for soil compaction parameters. This suggests that more sophisticated methodologies may be better equipped to characterize the underlying relationships within the dataset. Therefore, this study involved comparing the accuracy metrics of all implemented models to identify the most effective one.

3.3. Multi-Nonlinear Regression Model

A nonlinear model was generated using R, as shown in Equation (4):
UCS = 10.14·e^(−44·C) + e^(−35.6·ρ_bulk) + 0.25·W.C − 0.25·L.Limit + 0.41·P.Limit    (4)
Demonstrating better accuracy than the MLR model, the MNLR model attained an R2 value of 47%, a 10% improvement over the linear model’s performance on the training set. This improvement can be linked to its ability to model the nonlinear relationships that the MLR method oversimplified. The MNLR showed good predictive accuracy for UCS values between 3 ksf and 10 ksf, represented by the data clustering in Figure 5. However, the predictions scattered for UCS values higher than 10 ksf, which indicates that the model lacks the higher-order terms needed to fully capture these values. Figure 5 visually depicts the accuracy by plotting the actual UCS values versus the predicted values from the MNLR model.

3.4. Random Forest Model

Out of several models that were trained, the 500-tree model demonstrated the highest accuracy with the least number of trees. This model predicted the UCS with an R2 of 87% for the training set, 67% for the testing set, and 74% for the validation set. This represented a 20% improvement in the R2 over the nonlinear model. This was attributed to the RF model’s ability to mitigate the effects of multicollinearity among input parameters, making it less sensitive to highly correlated predictors, as stated by [35]. The model maintained strong predictive capability on the validation (unseen) dataset. Figure 6 illustrates the model’s predictions plotted against the actual values at each stage.

3.5. ANN Model

The optimum ANN model was obtained with 9 internal hidden nodes and 5000 training iterations (structure 9-9-1). Figure 7 presents the predicted UCS compared with the actual values. The internal connection weights from the input nodes to the hidden nodes and from the hidden nodes to the output were implemented in an Excel GUI to calculate the UCS for any additional soil, as shown in Figure 8. This tool allows practitioners to estimate UCS values for new soil samples within the recommended input ranges, ensuring higher accuracy through interpolation. The optimum ANN model showed the best prediction capabilities among all the trained ML algorithms, with an R2 of 77% for the training set, 75% for the testing set, and 76% for the validation set. These results are consistent with findings reported in the literature, such as in [36], where an R2 of 99% was achieved for training and 98% for testing. While their reported accuracy was higher, it is important to note that their dataset involved geopolymer-stabilized soils, which typically exhibit more predictable strength behavior due to chemically bonded stabilization.
Moreover, the GUI was based on Equation (5), which represents a sigmoidal value of the connection weight in addition to the bias at each node.
UCS_predicted = Σ_{k=1}^{m} sigmoidal_k ( Σ_{j=1}^{N} sigmoidal_j ( Σ_{i=1}^{s} (w_ij · X_i) + θ_j ) · w_jk + θ_k )    (5)
where UCS_predicted is the normalized value of the predicted UCS, and N is the number of internal hidden nodes. w_ij and w_jk are the connection weights from the input nodes to the internal nodes and from the internal nodes to the output nodes, respectively. X_i is the normalized value of each input, θ_j is the internal node bias, and θ_k is the bias of output node k. m is the number of outputs, which in our case is m = 1. The final step was to denormalize the output to obtain the actual predicted UCS. Figure 9 shows a generic layout of the ANN structure.
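To illustrate how the GUI evaluates Equation (5), the R sketch below propagates a normalized input vector through a single hidden layer of sigmoidal nodes; the weights, biases, and denormalization range are placeholders, not the trained values embedded in the study’s Excel tool.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Feed-forward evaluation of Equation (5) for one output node (m = 1).
# w_ih: s x N input-to-hidden weights; b_h: hidden biases; w_ho: hidden-to-output
# weights; b_o: output bias. All numeric values below are placeholders.
predict_ucs_norm <- function(x, w_ih, b_h, w_ho, b_o) {
  hidden <- sigmoid(as.vector(t(w_ih) %*% x) + b_h)  # hidden-node activations
  sigmoid(sum(hidden * w_ho) + b_o)                  # normalized UCS prediction
}

s <- 5; N <- 9                       # 9 hidden nodes as reported; 5 inputs assumed for illustration
set.seed(1)
w_ih <- matrix(rnorm(s * N), s, N); b_h <- rnorm(N)
w_ho <- rnorm(N);                   b_o <- rnorm(1)
x    <- runif(s)                     # normalized inputs (placeholder values)

ucs_norm <- predict_ucs_norm(x, w_ih, b_h, w_ho, b_o)
ucs_pred <- 0.13 + ucs_norm * (18.37 - 0.13)   # denormalize over the approximate UCS range (ksf)
```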

3.6. Gradient Boosting Regression

The optimal gradient boosting (GB) model was identified with 500 base learners, an interaction depth of 0.1, and a learning rate of 0.01. This model demonstrated high predictive capabilities, ranking second after the ANN model. The GB model achieved an R2 of 87% for the training set, 70% for the testing set, and 71% for the validation set. Figure 10 illustrates the performance of the GB model at each stage, showcasing its goodness of fit. Although limited literature exists specifically on UCS prediction using GB, a study by [37] reported comparable accuracy in predicting subgrade CBR, with an R2 of 79% for training and 73% for testing. These findings support the suitability of GB for geotechnical applications, particularly given its effectiveness in modeling complex feature interactions and handling nonlinearity and multicollinearity within soil behavior data.

3.7. KNN

The final K-nearest neighbor (KNN) model predicted UCS with an R2 of 55% for the training set, 59% for the testing set, and 44% for the validation set. Although KNN showed some improvement over the nonlinear models, it lagged behind the other machine learning approaches in this study. Increasing the dataset size could enhance the model’s accuracy, but this would likely require more memory due to KNN’s reliance on the full training data during both the testing and validation stages. The model was tested with K values of 5, 7, 9, 11, and 13, with the best performance observed at K = 5. Figure 11 depicts the KNN model’s accuracy at each stage. The authors of [38] used KNN to predict soil compaction parameters and reported an R2 of 59% on the training set and 48% on the testing set. The close agreement between studies supports the conclusion that, while KNN can provide moderate predictive capability, it is often outperformed by more advanced algorithms in complex geotechnical prediction tasks.

3.8. SVM

The SVM model utilized a radial basis function (RBF) kernel to predict the UCS, with an R2 of 61% for the training set, 50% for the testing set, and 59% for the validation set. While the SVM demonstrated better accuracy than the KNN and nonlinear models, it still fell short of the other machine learning models, such as the ANN and GB. The model’s performance was enhanced by tuning the kernel parameter (γ = 0.17) and the error margin (ε = 0.1), which helped capture the complex relationships within the data. Figure 12 shows the SVM model’s prediction accuracy across the training, testing, and validation stages. The research by [38] applied an SVM to predict soil compaction parameters such as the OMC and maximum dry density (MDD). Unlike our approach, their RBF-based model achieved an R2 of only 43% on training and 34% on testing. This underperformance can be attributed to the absence of hyperparameter tuning, as constant γ and ε values of 1 were used across all kernel types. In contrast, our model optimized the kernel parameter (γ = 0.17), which contributed to capturing the complex nonlinear relationships inherent in UCS prediction.

3.9. Model Accuracy

Each model’s performance was measured based on statistical analysis parameters such as R2, root-mean-square error (RMSE), and mean absolute relative error (MARE), following Equations (6)–(8) for each parameter. Table 4 summarizes the statistical analysis for all models.
R2 = 1 − [ Σ_{i=1}^{n} (UCS_actual − UCS_predicted)² / Σ_{i=1}^{n} (UCS_actual − mean(UCS_actual))² ]    (6)
RMSE = √[ Σ_{i=1}^{n} (UCS_actual − UCS_predicted)² / n ]    (7)
MARE = (1/n) Σ_{i=1}^{n} |UCS_actual − UCS_predicted| / UCS_actual    (8)
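These three metrics can be computed with the short R helpers below (argument names are illustrative).

```r
# Accuracy metrics of Equations (6)-(8).
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mare <- function(actual, predicted) mean(abs(actual - predicted) / actual)
```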
To determine the best prediction model, a ranking scheme was used in which the best-performing model for each metric received a rank of 7 and the worst a rank of 1 at the training, testing, validation, and overall stages. The models were then ordered by their total rank.
Table 5 summarizes the ranking of each model based on the predictive accuracy for UCS. Among the tested models, the ANN model emerged as the top performer, consistent with other studies [39], followed closely by RF and GB, with overall ranks of 72 and 61, respectively, compared with 75 for ANN. SVM demonstrated notable improvement, enhancing the prediction accuracy by 9% over MNLR and 18% over MLR in terms of the overall R2. KNN, however, did not achieve significant improvement compared with MNLR and MLR, likely due to its reliance on averaging nearest neighbors from the trained data, resulting in lower generalization and limited learning capacity.

4. Conclusions

This study evaluated several predictive models for estimating UCS using fundamental soil properties. A dataset of 152 lab results was utilized for training and evaluating these models. Among the models tested, the ANN model achieved the highest predictive accuracy with an overall R2 of 83%, followed by the RF and GB models, each with R2 values of 77%. Traditional models such as MLR and MNLR yielded lower accuracy levels, with overall R2 values of 37% and 47%, respectively, while the SVM and KNN models showed moderate performance, with overall R2 values of 56% and 53%, respectively.
The ANN model was further implemented into a practical Excel-based GUI tool, allowing users to estimate UCS from readily available soil parameters, offering a simplified and time-efficient alternative to laboratory testing. These findings demonstrate the potential for machine learning models to enhance pavement engineering applications, particularly in preliminary design phases or field applications where lab testing is limited.
Future work will focus on incorporating additional variables, such as deviatoric stress and saturation levels, to expand the model’s utility in predicting other shear strength parameters under various conditions.

Author Contributions

Conceptualization, M.A., Y.N. and H.Y.; methodology, M.A., Y.N. and H.S.; software, M.A. and Y.N.; validation, M.A., Y.N. and H.Y.; formal analysis, Y.N. and M.A; investigation, M.A., Y.N. and H.S.; resources, M.A., H.S., H.Y. and A.A.-O.; data curation, M.A. and Y.N.; writing—original draft preparation, M.A.; writing—review and editing, M.A. and Y.N.; visualization, M.A. and Y.N.; supervision, Y.N. and A.A.-O.; project administration, A.A.-O. and Y.N.; funding acquisition, A.A.-O. and Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

This is ongoing research; the data will be available at the end of the project and with the stakeholder’s permission.

Acknowledgments

The authors acknowledge the support provided by the U.S. Army Engineer Research and Development Center (ERDC). Permission to publish was granted by the ERDC Geotechnical and Structures Laboratory (GSL).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rabab’ah, S.R.; Sharo, A.A.; Alqudah, M.M.; Ashteyat, A.M.; Saleh, H.O. Effect of using Oil Shale Ash on geotechnical properties of cement-stabilized expansive soil for pavement applications. Case Stud. Constr. Mater. 2023, 19, e02508. [Google Scholar] [CrossRef]
  2. Vásquez, V.; Luis, R.; Francisco, J.; García, O. An overview of asphalt pavement design for streets and roads. Rev. Fac. Ing. Univ. Antioquia 2021, 10–26. [Google Scholar] [CrossRef]
  3. Hossain, M.S.; Kim, W.S. Estimation of Subgrade Resilient Modulus Using the Unconfined Compression Test (No. FHWA/VCTIR 15-R12). Virginia Center for Transportation Innovation and Research. 2014. Available online: https://scholar.google.com/scholar?q=3.+Hossain,+M.S.+Estimation+of+Subgrade+Resilient+Modulus+Using+the+Unconfined+Compression+Test&hl=en&as_sdt=0&as_vis=1&oi=scholart (accessed on 19 April 2025).
  4. Yang, G.; Mahboub, K.C.; Renfro, R.L.; Graves, C.; Wang, K.C.P. A Machine Learning Tool for Pavement Design and Analysis. KSCE J. Civ. Eng. 2023, 27, 207–217. [Google Scholar] [CrossRef]
  5. Zeiada, W.; Dabous, S.A.; Hamad, K.; Al-Ruzouq, R.; Khalil, M.A. Machine Learning for Pavement Performance Modelling in Warm Climate Regions. Arab. J. Sci. Eng. 2020, 45, 4091–4109. [Google Scholar] [CrossRef]
  6. Cano-Ortiz, S.; Pascual-Muñoz, P.; Castro-Fresno, D. Machine learning algorithms for monitoring pavement performance. Autom. Constr. 2022, 139, 104309. [Google Scholar] [CrossRef]
  7. Jiang, Q.; Cao, M.; Wang, Y.; Wang, J.; He, Z. Estimation of Soil Shear Strength Indicators Using Soil Physical Properties of Paddy Soils in the Plastic State. Appl. Sci. 2021, 11, 5609. [Google Scholar] [CrossRef]
  8. Gowadi, A.; Elkatatny, S.; Gamal, H. Unconfined Compressive Strength (UCS) Prediction in Real-Time While Drilling Using Artificial Intelligence Tools. Neural Comput. Appl. Available online: https://link.springer.com/article/10.1007/s00521-020-05546-7 (accessed on 6 February 2025).
  9. Natekin, A.; Knoll, A. Gradient Boosting Machines, a Tutorial. Front. Neurorobot. 2013, 7, 21. Available online: https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2013.00021/full (accessed on 6 February 2025).
  10. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  11. Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22. [Google Scholar]
  12. Gambill, D.R.; Wall, W.A.; Fulton, A.J.; Howard, H.R. Predicting USCS soil classification from soil property variables using Random Forest. J. Terramechanics 2016, 65, 85–92. [Google Scholar] [CrossRef]
  13. Park, J.H.; Kim, Y.S.; Eom, I.K.; Lee, K.Y. Economic load dispatch for piecewise quadratic cost function using Hopfield neural network. IEEE Trans. Power Syst. 1993, 8, 1030–1038. [Google Scholar] [CrossRef]
  14. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  15. Ma, G.; Chao, Z.; Zhang, Y.; Zhu, Y.; Hu, H. The application of support vector machine in geotechnical engineering. IOP Conf. Ser. Earth Environ. Sci. 2018, 189, 022055. [Google Scholar] [CrossRef]
  16. Samui, P.; Jagan, J.; Hariharan, R. An Alternative Method for Determination of Liquefaction Susceptibility of Soil. Geotech. Geol. Eng. 2016, 34, 735–738. [Google Scholar] [CrossRef]
  17. Song, Y.; Liang, J.; Lu, J.; Zhao, X. An efficient instance selection algorithm for k nearest neighbor regression. Neurocomputing 2017, 251, 26–34. [Google Scholar] [CrossRef]
  18. Khasawneh, M.; Al-Akhrass, H.; Rabab’ah, S.; Al-sugaier, A. Prediction of California Bearing Ratio Using Soil Index Properties by Regression and Machine-Learning Techniques. Int. J. Pavement Res. Technol. 2022, 1–19. [Google Scholar] [CrossRef]
  19. ASTM D698-12; Standard Test Methods for Laboratory Compaction Characteristics of Soil Using Standard Effort (12,400 ft-lbf/ft3 (600 kN-m/m3)). ASTM International: West Conshohocken, PA, USA, 2009. [CrossRef]
  20. ASTM D2166; Test Method for Unconfined Compressive Strength of Cohesive Soil. ASTM International: West Conshohocken, PA, USA, 2013. [CrossRef]
  21. ASTM D4318; Standard Test Methods for Liquid Limit, Plastic Limit, and Plasticity Index of Soils. ASTM International: West Conshohocken, PA, USA, 2018. Available online: https://www.astm.org/d4318-17e01.html (accessed on 6 February 2025).
  22. ASTM D422-63; Test Method for Particle-Size Analysis of Soils. ASTM International: West Conshohocken, PA, USA, 2007. [CrossRef]
  23. Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. Ser. B Stat. Methodol. 1974, 36, 111–133. [Google Scholar] [CrossRef]
  24. Widodo, S.; Ibrahim, A.; Hong, S. Analysis of Different Equations of Undrained Shear Strength Estimations Using Atterberg Limits on Pontianak Soft Clay. Available online: https://www.researchgate.net/publication/262025745_Analysis_of_different_equations_of_undrained_shear_strength_estimations_using_Atterberg_Limits_on_Pontianak_Soft_Clay (accessed on 19 April 2025).
  25. Lourakis, M.I.A. A Brief Description of the Levenberg-Marquardt Algorithm Implemented by Levmar. 2005. Available online: https://www.researchgate.net/publication/239328019_A_Brief_Description_of_the_Levenberg-Marquardt_Algorithm_Implemened_by_levmar (accessed on 19 April 2025).
  26. Najjar, Y.M.; Ali, H.E.; Basheer, I.A. On the use of neuronets for simulating the stress-strain behavior of soils. In Numerical Models in Geomechanics. Proceedings of the 7th International Symposium, Graz, September 1999; CRC PRESS: Boca Raton, FL, USA, 1999; pp. 657–662. [Google Scholar]
  27. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; ISBN 9780262035613. Available online: https://www.amazon.com/Deep-Learning-Adaptive-Computation-Machine/dp/0262035618 (accessed on 6 February 2025).
  28. Najjar, Y.M.; Huang, C. Simulating the stress–strain behavior of Georgia kaolin via recurrent neuronet approach. Comput. Geotech. 2007, 34, 346–361. [Google Scholar] [CrossRef]
  29. Nasiboglu, R.; Nasibov, E. WABL method as a universal defuzzifier in the fuzzy gradient boosting regression model. Expert Syst. Appl. 2023, 212, 118771. [Google Scholar] [CrossRef]
  30. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  31. Ding, X.; Liu, J.; Yang, F.; Cao, J. Random radial basis function kernel-based support vector machine. J. Frankl. Inst. 2021, 358, 10121–10140. [Google Scholar] [CrossRef]
  32. Hastuty, I.P. Comparison of the Use of Cement, Gypsum, and Limestone on the Improvement of Clay through Unconfined Compression Test. J. Civ. Eng. Forum 2019, 5, 131. [Google Scholar] [CrossRef]
  33. Lindeburg, K. Silty Mantles and Fragipans in Pennsylvania Soils. 2011. Available online: https://www.researchgate.net/publication/263350712_Silty_Mantles_and_Fragipans_in_Pennsylvania_Soils (accessed on 19 April 2025).
  34. Gour, S.S.; Muthekar, D.V.V.; Saner, A.B. A Comparative Study of ANN Models developed for predicting Soil Compaction Parameters using MS Excel and MATLAB. GIS Sci. J. 2022, 9, 1975. [Google Scholar]
  35. Genuer, R.; Poggi, J.-M.; Tuleau-Malot, C. Variable selection using random forests. Pattern Recognit. Lett. 2010, 31, 2225–2236. [Google Scholar] [CrossRef]
  36. Mozumder, R.A.; Laskar, A.I. Prediction of unconfined compressive strength of geopolymer stabilized clayey soil using Artificial Neural Network. Comput. Geotech. 2015, 69, 291–300. [Google Scholar] [CrossRef]
  37. Ikeagwuani, C.C. Estimation of modified expansive soil CBR with multivariate adaptive regression splines, random forest and gradient boosting machine. Innov. Infrastruct. Solut. 2021, 6, 199. [Google Scholar] [CrossRef]
  38. Akshaya, R.; Premalatha, K. Prediction of Soil Compression Index using SVM and kNN. IOP Conf. Ser. Earth Environ. Sci. 2024, 1326, 012121. [Google Scholar] [CrossRef]
  39. Mohamad, E.T.; Jahed Armaghani, D.; Momeni, E.; Alavi Nezhad Khalil Abad, S.V. Prediction of the unconfined compressive strength of soft rocks: A PSO-based ANN approach. Bull. Eng. Geol. Environ. 2015, 74, 745–757. [Google Scholar] [CrossRef]
Figure 1. Compaction curve for native soil.
Figure 2. UCS curve for mixture: 95% fine–5% coarse.
Figure 3. Particle size distributions for sample mixtures.
Figure 4. MLR prediction model.
Figure 5. MNLR prediction model.
Figure 6. RF prediction model of UCS: (a) training data, (b) testing data, (c) validation data, and (d) entire dataset.
Figure 7. ANN prediction model of UCS: (a) training data, (b) testing data, (c) validation data, and (d) entire dataset.
Figure 8. ANN GUI: available upon request.
Figure 9. Generic structure of ANN.
Figure 10. GB prediction model of UCS: (a) training data, (b) testing data, (c) validation data, and (d) entire dataset.
Figure 11. KNN prediction model of UCS: (a) training data, (b) testing data, (c) validation data, and (d) entire dataset.
Figure 12. SVM prediction model of UCS: (a) training data, (b) testing data, (c) validation data, and (d) entire dataset.
Table 1. Sample lab results.

# | Coarse Content | Fine Content | Density (kip/ft3) | WC (%) | LL (%) | PL (%) | UCS (ksf)
1 | 0.95 | 0.05 | 0.1223 | 10 | 29.25 | 23.80 | 6.65
2 | 0.95 | 0.05 | 0.1351 | 15 | 29.25 | 23.80 | 6.90
3 | 0.90 | 0.10 | 0.1222 | 10 | 27.75 | 24.14 | 7.60
4 | 0.90 | 0.10 | 0.1344 | 15 | 27.75 | 24.14 | 4.30
5 | 0.10 | 0.90 | 0.1242 | 10 | 71.55 | 37.56 | 4.88
6 | 0.10 | 0.90 | 0.1386 | 15 | 71.55 | 37.56 | 2.48
7 | 0.05 | 0.95 | 0.1261 | 10 | 73.00 | 38.00 | 2.86
8 | 0.05 | 0.95 | 0.1355 | 15 | 73.00 | 38.00 | 2.39
Table 2. Pearson correlation value of soil independent variables and UCS.

Independent Variable | Pearson Correlation
Plastic limit | −0.312
Liquid limit | −0.447
Water content | −0.248
Bulk density | −0.391
Retained sieve #200 | −0.056
Passed sieve #200 | 0.056
Table 3. Statistical summary for the geotechnical properties of the tested mixtures.

Parameter | Range | Mean | St. Dev. | CV (%)
UCS (ksf) | 0.1306–18.37 | 5.014 | 2.7725 | 55.30
PL (%) | 18.01–38 | 25.550 | 5.3594 | 20.98
LL (%) | 27–73 | 40.362 | 14.900 | 36.92
WC (%) | 10–15 | 12.5 | 2.5083 | 20.07
Density (kip/ft3) | 0.1173–0.1398 | 0.129 | 0.0060 | 4.65
Retained sieve #200 (%) | 5–95 | 50 | 0.2748 | 0.55
Table 4. Models’ performance for regressions and ML techniques.

Model | R2 (%): Training / Testing / Validation / All | RMSE: Training / Testing / Validation / All | MARE: Training / Testing / Validation / All
ANN | 76 / 75 / 76 / 80 | 1.32 / 1.60 / 1.33 / 1.25 | 0.65 / 0.25 / 0.14 / 0.33
GB | 87 / 70 / 71 / 77 | 0.96 / 1.73 / 1.61 / 1.36 | 0.56 / 0.24 / 0.75 / 0.53
RF | 87 / 67 / 74 / 77 | 1.10 / 1.83 / 1.19 / 1.34 | 0.43 / 0.23 / 0.52 / 0.41
SVM | 61 / 50 / 59 / 56 | 1.84 / 2.39 / 1.45 / 1.91 | 0.77 / 0.31 / 0.82 / 0.67
KNN | 55 / 59 / 44 / 53 | 2.19 / 2.01 / 1.99 / 1.93 | 1.24 / 0.33 / 0.90 / 0.71
MNLR | 48 / 44 / 50 / 47 | 1.96 / 2.36 / 1.67 / 2.00 | 0.87 / 0.33 / 0.98 / 0.76
MLR | 36 / 31 / 27 / 37 | 1.24 / 2.76 / 3.08 / 2.25 | 0.76 / 0.86 / 0.29 / 0.67
Table 5. Models ranking for regressions and ML techniques.

Model | R2 rank: Training / Testing / Validation / All | RMSE rank: Training / Testing / Validation / All | MARE rank: Training / Testing / Validation / All | Total Rank
ANN | 5 / 7 / 7 / 7 | 5 / 7 / 6 / 7 | 5 / 5 / 7 / 7 | 75
GB | 7 / 6 / 5 / 6 | 7 / 6 / 4 / 5 | 6 / 6 / 4 / 5 | 61
RF | 6 / 5 / 6 / 6 | 6 / 5 / 7 / 6 | 7 / 7 / 5 / 6 | 72
SVM | 4 / 4 / 4 / 4 | 3 / 3 / 5 / 4 | 3 / 4 / 3 / 4 | 41
KNN | 3 / 3 / 3 / 3 | 1 / 4 / 2 / 3 | 1 / 3 / 2 / 2 | 30
MNLR | 2 / 2 / 2 / 2 | 2 / 2 / 3 / 2 | 2 / 3 / 1 / 1 | 24
MLR | 1 / 1 / 1 / 1 | 4 / 1 / 1 / 1 | 4 / 1 / 6 / 3 | 26
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
