Article

Machine Learning Estimation of the Unit Weight of Organic Soils

by
Artur Borowiec
1,*,
Grzegorz Straż
2 and
Maria Jolanta Sulewska
3
1
Department of Structural Mechanics, Faculty of Civil and Environmental Engineering and Architecture, Rzeszow University of Technology, Powstanców Warszawy 12 Av., 35-959 Rzeszow, Poland
2
Department of Geodesy and Geotechnics, Faculty of Civil and Environmental Engineering and Architecture, Rzeszow University of Technology, Powstanców Warszawy 12 Av., 35-959 Rzeszow, Poland
3
Department of Geotechnics, Roads and Geodesy, Faculty of Civil Engineering and Environmental Sciences, Bialystok University of Technology, Wiejska 45E Str., 15-351 Bialystok, Poland
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9079; https://doi.org/10.3390/app15169079
Submission received: 30 June 2025 / Revised: 27 July 2025 / Accepted: 7 August 2025 / Published: 18 August 2025

Abstract

The aim of this study is to search for and verify regression models of selected geotechnical parameters of organic soils that are useful in engineering practice. Various machine learning methods were employed, including decision trees, ensembles of trees, support vector regression, Gaussian processes, and neural networks. The work was based on two qualitatively different examples of estimating the unit weight of soil (γt). In the first example, the results of cone penetration test (CPT) probing (cone resistance qc and friction resistance fs) were used. In the second example, the results of laboratory tests of other physical properties of these soils (organic content LOIT and moisture content w) were used. The analysis covered 135 sets of test results obtained at a test site in Rzeszów, Poland, from in situ CPT soundings and laboratory tests. A statistical analysis was carried out to initially determine the relationships between the variables. This work compares multiple linear regression models with regression models obtained using machine learning (ML) methods. The ML models achieved smaller mean absolute percentage errors (MAPE) than the statistical models: for the CPT sounding data, the MAPE decreased from 13.57% to 7.37%, and, for the second data set, from 7.87% to 1.25%. STATISTICA version 13.3 and the Regression Learner™ app from MATLAB R2024b were used to analyze the soil data.

1. Introduction

The main objective of soil tests is to determine the geotechnical characteristics of an area for various types of investments. Soil parameters are used in the analysis of soil and water conditions, to check the stability of slopes, and in the design of geotechnical structures such as foundations, piles, retaining walls, and embankments. One of the basic soil parameters used in design is the unit weight of the soil (γt), from which the vertical component of stress in the soil medium or the bearing capacity of the soil can be calculated. The best method of estimating it is the direct testing of soil samples under laboratory conditions. Such tests are laborious and time-consuming, and they require an appropriate number of samples with intact structures from control boreholes. Collecting the samples is a significant difficulty, as they may be located at considerable depths. In the case of organic soils, which lie mainly below the groundwater table, this process can be exceptionally burdensome and is often ineffective because of their characteristics, local properties, and locations. The structure of samples intended for the laboratory determination of γt may be disturbed during collection and transport [1]. Samples for the laboratory determination of LOIT and w may have a disturbed structure, provided that their moisture content is unchanged.
As an alternative to laboratory tests, field tests using the cone penetration test (CPT) are used to estimate unit weight [2]. In most soundings, the depth at which the soil lies does not affect the estimation of its parameters. In situ CPT tests examine the soil with its structure undisturbed, which allows for results that are more reliable than those obtained from laboratory research. The relative simplicity and speed of penetration tests are further advantages. The interpretation of probing results should be based on empirical formulas derived from laboratory tests of the properties of a given soil. For organic soils, guidelines for the interpretation of test results are currently very limited because few such studies exist. Therefore, new, reliable empirical correlations with CPT results, verified on the basis of laboratory tests, are needed. In engineering practice, verification tests are performed very rarely because of their cost.
Based on research over the past four decades, several empirical formulas have been developed to calculate the unit weight of soils γt based on parameters recorded during CPT soundings. The most commonly used formulas in basic works [2,3] using CPT probing do not take into account organic soils. Further formulas can be found in the work of other authors [4,5,6]. Only a few studies [7,8] take into account the use of CPT probes in organic soils. A query was performed in the Web of Science (WoS) database to explore the subject of soil testing using CPT or CPTU soundings. The search yielded 1138 publications (Topic Search = (“SOILS”) and (“CPT” or “CPTU”)), among which a mere 24 addressed organic soils. This provides impetus for further research on organic soils, expanding the existing knowledge base, and meeting practical requirements in the construction field.
This study presents various regression models for estimating the unit weight of organic soils using machine learning. Machine learning is a subdiscipline of artificial intelligence that focuses on developing algorithms and models that enable systems to acquire knowledge from data that may come from scientific experiments, sensors, or computer simulations. Key techniques include supervised learning, where the model is trained on labeled data (regression and classification). In recent years, two further approaches have been intensively developed: unsupervised learning, in which the algorithm tries to find hidden relationships in a set of unlabeled data, and reinforcement learning [9].
The dynamic development of data-processing methods, including machine learning, has led to wide interest among researchers in their use in construction [10,11,12]. In the last 40 years, there has been an increase in the number of publications using ML in geotechnics, especially after 2017 [13], mainly in developed countries [14]. The ML algorithms most commonly used in geotechnics in the last decade are artificial neural networks (ANN), random forests (RF), support vector machines (SVM), and Bayesian networks (BN), which account for 35%, 19%, 17%, and 7% of applications, respectively [15].
The WoS database contains 400 articles in the field of geotechnical engineering using ML, including 200 articles related to determining unit weight and 30 articles related to organic soils. At the same time, the total number of articles in the field of geotechnical engineering on this topic indexed in the WoS database in the last 30 years was over 6500 (Topic Search = “UNIT WEIGHT” or “ORGANIC SOILS”). Works using ML include, for example, those based on data obtained from CPT soundings [16,17], including the prediction of unit weight [6,18,19], and works related to organic soils [20,21,22]. In each of the mentioned works, the use of ML methods improved the prediction in relation to linear or nonlinear regression models. Machine learning can accelerate the search for existing relationships by automating data analysis, reducing errors, and enabling the verification of new models. The examples of machine learning regression models presented in this study can be used as a guide to expanding the analytical toolkit of geotechnical engineers.

2. Description of Geotechnical Research

2.1. The Local Subsoil

Data for the analysis were obtained from a survey of the investment site of the Theological and Pastoral Institute on Witolda Street in Rzeszów, located in the Rzeszów Foothills, within the macro-region of the Sandomierz Basin and so-called Northern Subcarpathia. Morphologically, the studied area is located within the terrace of the Młynówka River, which rises in this region to a height of approximately 206–206.5 m above sea level. The terrain surface is generally flat. The subsoil was identified using a static probing method with a CPT probe; the location of the research points is presented in Figure 1.
In geological terms, the site is located in the southern part of the Pre-Carpathian Foredeep, filled with marine Miocene sediments mainly in the form of clayey formations, the top of which was drilled at a depth of 9.1–9.9 m below ground level. Above these is a series of Quaternary sediments (Pleistocene and Holocene) of fluvial and stagnant origin, which are layers of medium-compacted gravelly-sandy formations from 0.5 to 2.2 m thick. A layer of organic sediments (muds) lies directly on top of them, with a layer of humus fen soils (alluvium, mada). The alluvial deposits are clayey with numerous peat inserts of variable thickness and irregular occurrence. Locally, at a depth of approximately 2.3–5.2 m below ground level, a 3.1 m thick peat layer was identified in the form of young Holocene formations, which are covered with a layer of river muds and alluvial soils, lithologically developed as compact silty clays and silts. These soils have a soft to very soft consistency. The whole area is covered with a layer of soil approximately 0.3 m thick (Figure 2). The principal aquifer is located at a depth of approximately 1.0 m below ground level and is associated with sand–gravel deposits. The second aquifer, at a depth of 1.0 to 2.5 m below sea level, occurs in the form of intense intra-clay seepage originating from the infiltration of rainwater within the upper and middle layers of madas, muds, and peats [23,24].

2.2. The Geotechnical Research Methods

2.2.1. In Situ Testing with CPT Static Probe (fs, qc)

Probing with a CPT-type static probe was carried out using an electric cone, measuring the resistance of the cone qc [MPa] and the friction resistance of the soil against the side surface of the friction sleeve fs [MPa]. The dimensions of the cone, as well as the test procedure, were in accordance with international standards [25,26,27,28,29]. Fifteen static soundings (CPT1-CPT15) were made, reaching a depth of 8.6 to 15.4 m below ground level and a total length of approximately 180 running meters. Six research boreholes were drilled to a depth of 12.0–15.0 m below ground level with a total length of 78 m. For the purposes of this smaller study, the interpretation of the sounding results was based on soil profiles determined from research boreholes located in the immediate vicinity of three selected soundings designated as: CPT(S1); CPT(S7); and CPT(S15). Samples of organic soils with a natural structure obtained from the test holes were taken for laboratory tests.
Geotechnical profiles corresponding to the points of selected static soundings are shown in Figure 2. The designations of organic soils as T and Nmg were adopted in accordance with the standards PN-B-02480:1986 [30] and PN-B-04481:1988 [31], which, in accordance with the European classification PN-EN ISO 14688-1:2018 [32] and PN-EN ISO 14688-2:2018 [33], correspond to the designations Pt (peat), orCl, or orSi (muds).
Figure 2. Geological–engineering cross-sections for the locations of selected soundings: CPT(S1), CPT(S7), CPT(S15) [23]. The figure shows the names according to the Polish classification [30,31], where organic soils are marked with the symbols T and Nmg, which correspond to the European classification according to [32,33], respectively, as Pt, orCl or orSi.
The summary and range of values of the results of organic soil tests using the CPT probe are presented in Table 2.

2.2.2. The Laboratory Test (w, LOIT, γt)

The basic parameters characterizing organic soils are natural moisture, bulk density, and organic content. Natural moisture (w) is defined by the ratio of the mass of free water (mw) in the sample to the mass of its soil skeleton (md). It was calculated according to the following Formula (1) [34]:
$$w = \frac{m_w}{m_d}\cdot 100\% = \frac{m_1 - m_2}{m_2 - m_c}\cdot 100\% \tag{1}$$
where m1 is the mass of the sample with natural moisture content including the evaporator, m2 is the mass of the sample dried to constant mass including the evaporator, and mc is the mass of the evaporator.
The organic content (LOIT) was determined by roasting the sample at 800 °C for 3 h. The amount of soil mass loss resulting from this process was then determined [30,35]. The organic content was calculated according to the following Formula (2) [36]:
$$LOI_T = \frac{DW_{105} - DW_T}{DW_{105}}\cdot 100\% \tag{2}$$
where DW105 is the weight of the sample after drying to a constant mass at 105–110 °C and DWT is the weight of the sample roasted to a constant mass at approximately 800 °C.
Based on the results of the organic content tests, local organic soils were identified as low (2–6%), medium (6–20%) and high organic (>20%), i.e., the tested samples represented the full classification spectrum in terms of organic soils [32,33].
The unit weight (γt) of the soil was calculated based on its bulk density (ρ), which is the ratio of mass (m) to the total volume of the soil sample (V) [37], assuming that the acceleration of gravity is g = 9.81 m/s², according to Formula (3):
$$\gamma_t = \rho\, g = \frac{m}{V}\, g \tag{3}$$
A summary of the geotechnical parameters of the organic soils determined in laboratory tests is given in Table 2. The analyzed data set was also used in the co-author’s work to estimate the unit weight of the organic soils using classical methods and ANN [38,39].
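As a quick arithmetic check of Formulas (1)–(3), the short Python sketch below evaluates them for one sample. The function names and all input masses and volumes are hypothetical values chosen only to illustrate the calculation; they are not data from the study, and Python is used here purely for illustration.

```python
G = 9.81  # acceleration of gravity [m/s^2], as assumed in the text


def moisture_content(m1, m2, mc):
    """Formula (1): natural moisture content w [%].

    m1 - mass of the moist sample with the evaporator,
    m2 - mass of the dried sample with the evaporator,
    mc - mass of the evaporator (all in the same unit).
    """
    return (m1 - m2) / (m2 - mc) * 100.0


def organic_content(dw105, dwt):
    """Formula (2): organic content LOI_T [%] from the dried (dw105)
    and ignited (dwt) sample weights."""
    return (dw105 - dwt) / dw105 * 100.0


def unit_weight(mass_g, volume_cm3):
    """Formula (3): unit weight [kN/m^3] from mass [g] and volume [cm^3]."""
    rho = mass_g / volume_cm3  # bulk density, g/cm^3 (numerically t/m^3)
    return rho * G             # t/m^3 * m/s^2 = kN/m^3


# Hypothetical sample, for illustration only
w = moisture_content(m1=150.0, m2=90.0, mc=50.0)
loi = organic_content(dw105=40.0, dwt=30.0)
gamma_t = unit_weight(mass_g=130.0, volume_cm3=100.0)
print(f"w = {w:.1f} %, LOI_T = {loi:.1f} %, gamma_t = {gamma_t:.2f} kN/m^3")
```

Note that the moisture content can exceed 100% because it is referred to the dry mass of the soil skeleton, which is common for peats.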

3. Description of Statistical and Numerical Analysis Tools

Figure 3 shows a diagram explaining the adopted methodology for analyzing the research results. The first stage was to prepare a database of laboratory and field research results. In the second stage, statistical data analysis was performed, allowing for a preliminary assessment of possible relationships between the data, including the linear correlation matrix. The next stage was to divide the data into training and testing patterns. Then, the Regression Learner™ tool in MATLAB was used to build and compare 28 regression models. At this stage of the research, the cross-validation procedure was used, in which the data were randomly divided into equal parts. One part was used as a validation set and the rest were used as a training set, and the process was repeated until each part had been used as a validation set. The best ML regression model was selected based on a comparison of validation error measures (VeML). As part of the verification analyses, the testing results (TeML) were compared with those of the linear regression model available in the software (TeL). Each of the ML models could then be further improved through hyperparameter optimization guided by the validation MSE.

3.1. Statistical Analysis Tools

Correlation and regression analyses are basic statistical procedures that aim to model the relationship between variables. In scientific research, regression equations are often used to predict the value of one variable (dependent variable) based on one or more independent variables. In regression analysis, there are many assumptions about the data such as linearity, independence, and the normality of residuals. Failure to meet these assumptions can lead to incorrect interpretation of the analysis results and to erroneous conclusions [40]. The most popular model is simple linear regression, where the relationship is presented graphically as a straight line, expressed by Formula (4) or linear multiple regression, expressed by Formula (5):
Y = a + bX + ε
Y = a0 + a1X1 + … + anXn + ε
where Y is the dependent (explained) variable, X and Xi (i = 1, …, n) are independent (explanatory) variables, aj (j = 0, …, n) are regression coefficients, and ε is the standard error of estimation. In more advanced tasks, nonlinear, polynomial, exponential, logarithmic regression, or other complex models can be used [41]. One of the most commonly used approaches for estimating regression model parameters is the least squares method, which minimizes the error function in the form of the sum of squared differences between observed values (dp) and values predicted by the model (yp). An important element in regression analysis is the assessment of the model’s quality. The most commonly used measures of prediction error are mean squared error (6), mean absolute percentage error (7), and the coefficient of determination (8):
$$MSE = \frac{1}{n}\sum_{p=1}^{n}\left(y_p - d_p\right)^2 \tag{6}$$
$$MAPE = \frac{100\%}{n}\sum_{p=1}^{n}\left|\frac{d_p - y_p}{d_p}\right| \tag{7}$$
$$R^2 = 1 - \frac{\sum_{p=1}^{n}\left(d_p - y_p\right)^2}{\sum_{p=1}^{n}\left(d_p - \bar{d}_p\right)^2} = \frac{\sum_{p=1}^{n}\left(y_p - \bar{d}_p\right)^2}{\sum_{p=1}^{n}\left(d_p - \bar{d}_p\right)^2} \tag{8}$$
where n is the number of cases, dp are the measured values, yp are the predicted values, and $\bar{d}_p$ is the mean of the measured values. The statistical analyses were performed in STATISTICA version 13.3 [42] and included:
  • Calculation of basic statistical parameters;
  • Checking of the normality of variable distributions;
  • Calculation of coefficients of linear correlation matrices;
  • Development of linear multiple regression models.
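The three error measures (6)–(8) can be written directly from their definitions. The Python sketch below is a minimal illustration; the function names and the four measured/predicted pairs are hypothetical and do not come from the study. The coefficient of determination is computed from the first form of Formula (8).

```python
def mse(d, y):
    """Mean squared error, Formula (6)."""
    return sum((yp - dp) ** 2 for dp, yp in zip(d, y)) / len(d)


def mape(d, y):
    """Mean absolute percentage error [%], Formula (7).
    Measured values d must be non-zero."""
    return 100.0 / len(d) * sum(abs((dp - yp) / dp) for dp, yp in zip(d, y))


def r2(d, y):
    """Coefficient of determination, first form of Formula (8)."""
    d_mean = sum(d) / len(d)
    ss_res = sum((dp - yp) ** 2 for dp, yp in zip(d, y))
    ss_tot = sum((dp - d_mean) ** 2 for dp in d)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted unit weights [kN/m^3]
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]
print(mse(measured, predicted))   # squared errors 1, 0, 1, 0 averaged over 4
print(mape(measured, predicted))  # 25 * (1/10 + 1/14)
print(r2(measured, predicted))    # 1 - 2/20
```

MAPE weights each error by the measured value, so the same absolute error counts for more on a light peat sample than on a heavier mud sample.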

3.2. Numerical Analysis

MATLAB (Matrix Laboratory) is a popular software program used by scientists, engineers, and analysts for numerical calculations, data analysis, modeling, and visualization [43]. Since version R2017a, MATLAB has included a built-in Regression Learner application. This tool enables users to verify various regression models using supervised machine learning. Most often, the analyzed data set is divided into three subsets (training, validation, testing). During the analyses, 15% of the data (20 out of 135 patterns) were used for testing. For the remaining 115 patterns, division into 9 subsets for cross-validation was considered. Each of the analyzed ML regression models was verified for identical divisions. The cross-validation errors presented in the application are values averaged over the subsets, whereas the testing errors are values for a single testing set. In the application itself, the user can prepare data sets, define validation schemes, select learning algorithms, train models, and evaluate testing results. Based on the MATLAB documentation, Table 1 presents the 28 models available in the Regression Learner. The second column of the table groups the models according to the applied ML method. The third column contains individual types of regression models associated with predefined (preset) numbers and values of hyperparameters. Depending on the type of ML regression model, these are parameters of the applied algorithms or are related to the model structure (architecture, functions). Hyperparameter optimization is performed by a grid search over all combinations, a random search, or Bayesian optimization [44]. In order to present the specificity of individual models, their explanations are provided in the subsequent columns of the table. The interpretability column refers to the human ability to understand and explain how the model makes decisions and how input variables affect the model’s predictions.
The flexibility column concerns the ability of a given model to capture complex relationships and patterns in the data.
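The data division described above (20 of 135 patterns held out for testing, the remaining 115 split into 9 cross-validation subsets) can be sketched in a few lines of Python. This is only an illustration of the scheme under stated assumptions: the function names, the random seed, and the way patterns are assigned to folds are hypothetical, and the Regression Learner may partition the data differently.

```python
import random


def split_train_test(n_samples, n_test, seed=0):
    """Randomly hold out n_test pattern indices for testing."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return idx[n_test:], idx[:n_test]


def k_fold(indices, k):
    """Yield (training, validation) index lists; each fold is used
    exactly once as the validation set."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation


train_idx, test_idx = split_train_test(n_samples=135, n_test=20)
folds = list(k_fold(train_idx, k=9))
for n, (training, validation) in enumerate(folds, start=1):
    print(f"fold {n}: {len(training)} training, {len(validation)} validation")
```

Since 115 is not divisible by 9, the folds cannot be exactly equal; in this sketch seven folds hold 13 patterns and two hold 12. A cross-validation error is then the mean of the nine per-fold validation errors.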
In practice, all models available in the tool (M1–M28) or selected groups of models can be used first. Then, automatic hyperparameter optimization can be performed for some or all models. The software allows for the manual selection of the hyperparameter change ranges. The obtained results can be sorted to create rankings using calculated validation or testing error measures. It should be noted that the usefulness of regression models depends on the amount of data used. Some models will achieve better generalization results for a number of hyperparameters adjusted to the number of learning patterns. The MSE error in the validation set is used to evaluate optimized models. The Regression Learner offers the classic linear regression method (M1), which fits a straight line to the data using the least squares method. It was used as a reference model for comparative analyses. Extended linear regression models are available, which take into account interactions between variables (M2), are resistant to outliers (M3), or regulate the variable selection process (M4). The M14 and M15 models are linearized versions of more complex models.
Decision tree models (M5–M7) are available and are popular in many fields, including statistics, machine learning, and data analysis, due to their intuitive nature and effectiveness in solving classification and regression problems. Decision trees [45] are built from decision branch nodes that create a structure for dividing data into smaller subgroups. The division in subsequent branch nodes is based on established criteria for independent variables. At the end of the tree structure are leaf nodes containing numerical values of the dependent variable. The results of the regression tree model are easy to interpret, can be fitted and predicted quickly, and use little memory. During model building, the algorithm adjusts the tree structure. The hyperparameters of the decision tree model are the number of branches and leaves. Trees with nodes with a large number of leaves tend to overfit.
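The core step of building a regression tree, choosing the split of one predictor that minimizes the squared error, can be illustrated with a one-level tree (a "stump"). The Python sketch below is a simplified illustration and not MATLAB's implementation; the function names and the moisture/unit-weight pairs are hypothetical.

```python
def best_split(x, y):
    """Return (threshold, left_mean, right_mean, sse) for the split of a
    single predictor x that minimizes the sum of squared errors."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal x values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[3]:
            best = (threshold, left_mean, right_mean, sse)
    return best


def predict(stump, x_new):
    """A leaf node returns the mean of the training targets it contains."""
    threshold, left_mean, right_mean, _ = stump
    return left_mean if x_new <= threshold else right_mean


# Hypothetical data: moisture content w [%] and unit weight [kN/m^3]
w = [50, 80, 120, 300, 400, 500]
gamma_t = [16.0, 15.5, 15.0, 11.0, 10.5, 10.0]
stump = best_split(w, gamma_t)
print(stump)
print(predict(stump, 100), predict(stump, 350))
```

A full tree applies the same search recursively inside each branch. Letting the recursion continue until every leaf holds only one or two samples is what produces the overfitting mentioned above.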
The Regression Learner module enables the user to train ensembles of trees for regression (M16–M17). Combining the results of many individual decision trees yields better quality models. In practice, two basic methods are used to create ensembles of trees: bagging [46] and boosting [47]. In bagging (used, for example, in random forests), multiple decision trees are trained on different subsets of the data and of the features. Their results are then combined, which leads to better stability and lower variance. In boosting, the model is built by adding successive decision trees, each of which is trained on the errors of the trees before it and tries to correct them. The hyperparameters of tree ensemble models include the numbers of trees, branches and leaf nodes, and the pattern-splitting coefficient.
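The boosting idea, in which each new tree is fitted to the errors left by the previous ones, can be sketched with one-level trees and a squared-error loss. The Python code below is a simplified illustration under stated assumptions (single predictor, hypothetical data, a hypothetical learning rate of 0.3); it is not the algorithm configuration used by the Regression Learner.

```python
def fit_stump(x, r):
    """Least-squares stump: threshold and the two leaf means for targets r."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = (sum((v - lm) ** 2 for v in left)
               + sum((v - rm) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1], best[2], best[3]


def boost(x, y, n_trees, learning_rate=0.3):
    """Start from the mean and add stumps fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, lm, rm = fit_stump(x, residuals)
        stumps.append((thr, lm, rm))
        pred = [pi + learning_rate * (lm if xi <= thr else rm)
                for xi, pi in zip(x, pred)]
    return base, learning_rate, stumps


def predict(model, x_new):
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x_new <= thr else rm)
                      for thr, lm, rm in stumps)


def train_mse(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical data: moisture content w [%] and unit weight [kN/m^3]
w = [50, 80, 120, 300, 400, 500]
gamma_t = [16.0, 15.5, 15.0, 11.0, 10.5, 10.0]
for n in (0, 1, 50):
    print(n, "trees, training MSE:", train_mse(boost(w, gamma_t, n), w, gamma_t))
```

The training error can only decrease as trees are added, which is why the number of trees and the learning rate have to be chosen on validation data rather than on the training set. Bagging differs in that the trees are fitted independently on resampled data and their predictions are averaged.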
Support vector machine (SVM) models (M8–M13) are popular tools in machine learning that can be very effective at solving regression problems, especially in cases where the relationship between input and output variables is nonlinear. The basic SVM algorithm, as defined by Vladimir Vapnik [48], uses an error margin ε, which defines a tolerance region for errors. Points that fall within this margin are considered acceptable errors and do not affect the model training process. This allows the SVM to focus on significant differences in points that lie outside this margin and have a significant impact on the shape of the regression function. These points are referred to as “support vectors”. Regression tasks use various kernel functions (linear, radial basis function, polynomial, sigmoid), which enable the modeling of complex, nonlinear relationships in the data. When working with SVM, data preprocessing and transformation (e.g., standardization, normalization) are recommended, as these affect the model’s performance. The hyperparameters of the SVM model are as follows: margin ε; regularization parameter (C); type and number of kernel functions; and the scale of the adopted kernel functions.
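The role of the margin ε can be made concrete with the ε-insensitive loss that underlies support vector regression. The Python sketch below is a minimal illustration; the function names, the value ε = 0.5, and the measured/predicted values are hypothetical. Following the description above, it flags the points lying outside the margin, which are the ones that shape the regression function.

```python
def eps_insensitive_loss(d, y, eps):
    """Zero inside the tolerance region |d - y| <= eps, linear outside it."""
    return max(0.0, abs(d - y) - eps)


def outside_margin(measured, predicted, eps):
    """Indices of points whose error exceeds the margin eps."""
    return [i for i, (d, y) in enumerate(zip(measured, predicted))
            if abs(d - y) > eps]


# Hypothetical measured and predicted unit weights [kN/m^3]
measured = [14.0, 14.0, 12.0, 10.0]
predicted = [14.3, 15.2, 12.1, 9.0]
eps = 0.5

losses = [eps_insensitive_loss(d, y, eps) for d, y in zip(measured, predicted)]
print(losses)                                   # only errors above 0.5 contribute
print(outside_margin(measured, predicted, eps)) # points 1 and 3
```

In training, the sum of these losses is traded off against the flatness of the regression function through the regularization parameter C.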
In regression, the Gaussian process algorithm is based on the assumptions of normal probability distributions for function values at a given point and on modeling the dependencies between points using kernel functions [49]. Additionally, Gaussian processes provide information about the uncertainty of the prediction, which is represented as its variance. Basis functions are used to transform the input data, which allows for complex patterns in the modeling process to be better captured. These properties make Gaussian processes a flexible tool for data modeling in applications where uncertainty is important. In the models (M18–M21) in MATLAB, various kernel functions (rational quadratic, squared exponential, matern, exponential) are used. These functions allow for various patterns in the data to be captured and are crucial for the effective modeling of nonlinear relationships. The hyperparameters of the Gaussian process regression (GPR) model include the types of basis and kernel functions and their number, as well as the scale of the functions and their variances.
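How a Gaussian process returns both a prediction and its variance can be shown with a small one-dimensional example. The Python sketch below assumes a zero-mean prior, a squared exponential kernel with hypothetical length scale and signal variance equal to 1, and a small hypothetical noise variance; it uses a plain Gaussian elimination instead of a numerical library and is not the implementation used in MATLAB.

```python
import math


def sq_exp_kernel(a, b, length=1.0, sigma_f=1.0):
    """Squared exponential kernel for scalar inputs."""
    return sigma_f ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_predict(x_train, y_train, x_new, noise_var=1e-6, length=1.0):
    """Posterior mean and variance at x_new for a zero-mean GP prior."""
    K = [[sq_exp_kernel(a, b, length) + (noise_var if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    k_star = [sq_exp_kernel(a, x_new, length) for a in x_train]
    alpha = solve(K, y_train)   # K^-1 y
    v = solve(K, k_star)        # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = sq_exp_kernel(x_new, x_new, length) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var


# Hypothetical training data sampled from a smooth function
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [math.sin(v) for v in x_train]
mean_near, var_near = gp_predict(x_train, y_train, 2.0)
mean_far, var_far = gp_predict(x_train, y_train, 10.0)
print(mean_near, var_near)  # close to sin(2.0), with very small variance
print(mean_far, var_far)    # reverts to the prior mean 0, variance near 1
```

The predictive variance is small close to the training points and grows back to the prior variance far from them, which is the uncertainty information referred to above.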
Artificial neural networks (ANN) work by connecting neurons into layers that process data hierarchically [8]. Each neuron calculates the weighted sum of the inputs it receives, adds a bias, and applies an activation function that determines the signal sent to the neurons in the next layer. The input layer has as many neurons as there are independent variables (predictors). In regression tasks, the last layer has one or more outputs, representing the dependent variable or variables. Thanks to supervised learning and optimization processes, networks are able to adjust their weights, which allows for the efficient modeling of complex functions and patterns based on data. In the Regression Learner module, the models (M22–M26) differ in the number of layers, the number of neurons in the layers, and the type of activation function (ReLU, tanh, sigmoid).
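The computation performed by each neuron (weighted sum, bias, activation) can be written out for a network with two predictors, one hidden layer, and a single linear output. The Python sketch below is only an illustration; the function names, weights, biases, and inputs are hypothetical and are not taken from any trained model in the study.

```python
import math

ACTIVATIONS = {
    "relu": lambda z: max(0.0, z),
    "tanh": math.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
}


def dense_layer(inputs, weights, biases, activation):
    """Each neuron: activation(weighted sum of inputs + bias)."""
    act = ACTIVATIONS[activation]
    return [act(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, hidden_w, hidden_b, out_w, out_b, activation="relu"):
    """One hidden layer followed by a linear output neuron (regression)."""
    hidden = dense_layer(x, hidden_w, hidden_b, activation)
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


# Hypothetical network: 2 predictors -> 2 hidden neurons -> 1 output
x = [1.0, 2.0]
hidden_w = [[0.5, -1.0], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
out_w = [1.0, 0.5]
out_b = 0.25

# Hidden sums are -1.5 and 2.0; ReLU passes 0.0 and 2.0 to the output neuron
y = forward(x, hidden_w, hidden_b, out_w, out_b, activation="relu")
print(y)
```

Training consists of adjusting the weights and biases so that outputs such as `y` approach the measured values; predictors are usually standardized first so that no input dominates the weighted sums.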
The last group of models (M27–M28) uses Gaussian kernels to map the predictors from a low-dimensional space into a high-dimensional space, in which a linear model is then fitted.
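One common way to realize such a mapping is the random Fourier feature expansion, in which inner products of the expanded vectors approximate a Gaussian kernel. The source does not state which expansion the software uses, so the Python sketch below should be read as an illustration of the general idea only; the function names, the number of features, the kernel width, and the two sample points are all hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))


def make_feature_map(dim, n_features, sigma=1.0, seed=0):
    """Random Fourier features: phi(x) . phi(y) approximates the kernel."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
         for _ in range(n_features)]
    B = [rng.uniform(0.0, 2 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(W, B)]

    return phi


# Two hypothetical (standardized) predictor vectors
x, y = [0.2, 0.5], [0.4, 0.1]
phi = make_feature_map(dim=2, n_features=2000)
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = gaussian_kernel(x, y)
print(exact, approx)
```

Once the data are expanded with `phi`, an ordinary linear model (least squares or linear SVM) fitted in the expanded space behaves like a nonlinear kernel model in the original two-dimensional space.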

4. Results and Discussion

4.1. Statistical Analysis

Using STATISTICA [42] software, statistical tests of the data sets were carried out. Table 2 shows the basic statistics of the analyzed data set (count, mean, minimum, maximum, standard deviation). The unit weight was assumed to be the dependent variable (Y). The measured soil parameters were assumed to be independent variables (X1–X4).
The hypotheses that the variables (Y, X1, X2, X3, X4) had normal distributions were verified. The normality of the distributions was checked at the significance level of α = 0.05. The p-values of the Kolmogorov–Smirnov test (K-S), the Kolmogorov–Smirnov test with Lilliefors correction, and the Shapiro–Wilk W test (W S-W) were calculated. The analyzed variables did not follow the normal distribution because all the calculated p-values were less than 0.05. Figure 4 presents data correlation graphs with the values of Spearman’s rank correlation coefficient R. Statistically significant correlation coefficients are marked in bold (p < 0.05).
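Because the variables are not normally distributed, a rank-based coefficient is a natural choice. Spearman’s coefficient is simply Pearson’s coefficient computed on ranks, as the Python sketch below illustrates; the function names and the five data pairs are hypothetical and only demonstrate how a monotonic but nonlinear relationship is scored.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical monotonic but nonlinear data
w = [50, 100, 200, 400, 800]                # moisture content [%]
gamma_t = [16.0, 14.5, 12.5, 11.0, 10.4]    # unit weight [kN/m^3]
print(spearman(w, gamma_t))  # perfectly monotonic decreasing relationship
print(pearson(w, gamma_t))   # weaker, because the relationship is not linear
```

A large gap between the two coefficients is a simple indication that a nonlinear model may describe the data better than a straight line.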
Based on Figure 4 and the calculated values of Spearman’s rank correlation coefficient R, it can be seen that there is no linear correlation for the data from the first example, where, for the dependencies γt(fs) and γt(qc), the obtained R2 values were lower than 0.05. For the data from the second example, the dependent variable was strongly correlated with both independent variables: R2 equal to 0.834 was obtained for the dependency γt(LOIT) and R2 equal to 0.953 for the dependency γt(w). The scatterplots for the data from the second example indicate the existence of nonlinear correlation. Additionally, there was a strong linear correlation between the independent variables w(LOIT) with R2 equal to 0.787. For both examples, multiple linear regression models were developed: γt(fs, qc) and γt(LOIT, w). The obtained linear models (9) and (10) were formed according to the general Formula (5), as follows:
$$\gamma_t = 0.049 f_s - 0.015 q_c + 17.46 \pm 2.249 \tag{9}$$
$$\gamma_t = 0.047\, LOI_T - 0.031 w + 17.73 \pm 1.178 \tag{10}$$
Model (9) for the data from the first example poorly explains the dependent variable γt (R2 = 0.249) and is therefore of little practical use. For the data from the second example (R2 = 0.794), model (10) was no better than models using one independent variable. The presented statistical analysis indicates the necessity of using nonlinear models. The results of the linear regression were used to verify the regression models developed using machine learning methods.
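Fitting a multiple linear regression of the form of Formula (5) by least squares amounts to solving the normal equations. The Python sketch below shows this for two predictors using only the standard library. The data are synthetic: they are generated without noise from hypothetical coefficients (18.0, -0.05, -0.03) that are not the coefficients of models (9) or (10), so that the fit can be checked against a known answer.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_linear(X, y):
    """Least squares for Y = a0 + a1*X1 + ... + an*Xn via (X'X) a = X'y."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 gives the intercept a0
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)


# Synthetic predictors (X1, X2) and a noise-free linear response
X = [(10, 50), (20, 80), (30, 200), (40, 150), (50, 400), (60, 300)]
true_coeffs = [18.0, -0.05, -0.03]  # hypothetical a0, a1, a2
y = [true_coeffs[0] + true_coeffs[1] * x1 + true_coeffs[2] * x2 for x1, x2 in X]

coeffs = fit_linear(X, y)
print(coeffs)  # recovers a0, a1, a2 up to rounding error
```

With real measurements the recovered coefficients carry estimation error, and strongly correlated predictors, such as w and LOIT in the second example, make the individual coefficients less stable even when the overall fit is good.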

4.2. Machine Learning Analysis

The analysis results were compared for two examples of estimating the unit weight of soil (γt). The first one used the results of CPT probing (fs, qc), while the second one used the results of laboratory tests (LOIT, w).

4.2.1. Example One γt (fs, qc)

The first of the analyzed examples is characterized by a weak correlation. Figure 5 shows a comparison of MAPE errors for validation (5a) and testing (5b) of all the analyzed regression models. The results are grouped by colors in the figures, depending on the applied training method.
Analyzing the groups of ML methods, it can be seen that the ensemble of trees (M16–M17) and Gaussian process (M18–M21) models performed best during the training. The testing results were mostly comparable to the training results. The ANN models (M22–M26) achieved better results in testing than during training. Neural networks had better generalization properties; however, this may have also been due to overfitting. Table 3 contains selected results of the prediction of the unit weight of organic soils based on values from the CPT probe (fs, qc). The best models for each group of algorithms are listed. The second column shows the type of method used in machine learning; the third column indicates the best type of model from a given learning method. The following columns contain the errors calculated in the validation and testing processes. The first row contains the results for the multiple linear regression model (M1, the reference model). Compared with M1, all the remaining regression models in Table 3 achieved lower values of the validation MAPE, which indicates an improvement in the prediction accuracy of the machine learning models. The validation results in Table 3 are average values. For some results, the validation error values (MAPE, R2) are weak and worse than the testing errors. This may be due to very poor validation results for some subsets obtained in the cross-validation. The best ML model in validation is the rational quadratic Gaussian process (M21). The best ML model in testing is the trilayered neural network (M26), for which the MAPE is 7.37% and R2 = 0.663. This is a better result than that of the analyses in [38], which used only ANN and obtained a lower coefficient of determination in the test set, R2 = 0.564.
The R2 = 0.249 calculated for model (9) from the statistical analyses is comparable to the R2 = 0.253 coefficient of the reference model (M1) from Table 3.
Figure 6 shows the testing results for (a) the reference linear regression model (M1) and (b) the best ML model in validation, the rational quadratic Gaussian process (M21). Both panels compare the actual (expected) values from the studies with the values predicted by the regression models. An ideal prediction places the points on the diagonal. The red lines in the figures mark the range of relative error of ±20%. The results obtained in the first example indicate an improvement in the prediction of unit weight using ML.
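A relative error band of this kind can also be summarized numerically. The sketch below is only an illustration: the helper name `within_band` and the sample values are hypothetical and do not come from the study. It counts the share of predictions whose relative error stays within a chosen tolerance.

```python
def within_band(actual, predicted, tolerance=0.20):
    """Fraction of predictions whose relative error is within +/- tolerance."""
    hits = sum(
        1 for a, p in zip(actual, predicted) if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actual)

# Hypothetical unit weights in kN/m3 (illustrative only, not data from the study).
actual = [10.5, 12.0, 15.0, 17.5, 19.0]
predicted = [13.0, 12.9, 14.1, 17.0, 15.0]

share_20 = within_band(actual, predicted, tolerance=0.20)
share_5 = within_band(actual, predicted, tolerance=0.05)
print(f"within +/-20%: {share_20:.0%}")
print(f"within +/-5%: {share_5:.0%}")
```

Points that fall outside the band in a predicted-versus-actual plot are exactly the cases that this function does not count.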

4.2.2. Example Two γt (LOIT, w)

The second example analyzed is characterized by high correlation. Figure 7 shows a comparison of MAPE errors for validation (7a) and testing (7b) of all analyzed regression models. The results in the figures are grouped by colors, depending on the applied training method.
Analyzing the groups of ML methods, it can be seen that the Gaussian process (M18–M21) and neural network (M22–M26) models perform best during training and testing. Table 4 contains the results of unit weight prediction based on values from laboratory tests (LOIT, w). All of the other regression models in Table 4 achieved a lower validation MAPE than the reference model (M1), which indicates an improvement in the prediction accuracy of the ML models. The validation error values are similar to the testing errors, which indicates a good ability to generalize. The best of the ML models is the exponential Gaussian process (M20), for which the testing MAPE decreased from 7.87% (M1) to 1.25%. For this model, R2 = 0.992 was obtained. This is similar to the test result of R2 = 0.986 obtained in the earlier analyses [39], which used only neural networks. The coefficient R2 = 0.794 calculated for model (10) is comparable to that of the reference model M1 (R2 = 0.797) from Table 4.
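The size of the improvement in both examples can be expressed as a relative reduction in the test MAPE. The short sketch below uses only the test MAPE values reported in Tables 3 and 4; the function name is an illustrative choice.

```python
def relative_reduction(reference, improved):
    """Percentage by which `improved` is lower than `reference`."""
    return 100.0 * (reference - improved) / reference

# Test MAPE values [%] from Tables 3 and 4: reference model M1 versus best ML model.
cpt = relative_reduction(13.57, 7.37)   # gamma_t (fs, qc): M1 versus M26
lab = relative_reduction(7.87, 1.25)    # gamma_t (LOIT, w): M1 versus M20

print(f"CPT example: test MAPE lower by {cpt:.1f}%")         # 45.7%
print(f"Laboratory example: test MAPE lower by {lab:.1f}%")  # 84.1%
```

In relative terms, the gain is therefore larger for the strongly correlated laboratory data than for the weakly correlated CPT data.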
Figure 8 shows the testing results for (a) the reference linear regression model (M1) and (b) the best ML model, the exponential Gaussian process (M20). Both panels compare the actual (expected) values from the studies with the values predicted by the regression models. The red lines in the figures mark the range of relative error of ±5%.
The increase in the R2 value arises because linear regression cannot accurately model the curvilinear relationship visible in Figure 8a. The results obtained in the second example indicate an improvement in the prediction of unit weight using ML.
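Why a straight line loses accuracy on a curvilinear relationship can be shown with a small synthetic sketch. Everything in it is an assumption made for illustration: the inverse dependence of the response on a single predictor `w`, the perturbation values, and the function names are hypothetical. The sketch also does not reproduce the Gaussian process model used in the study; it swaps in an ordinary least-squares fit on a transformed predictor as the simplest curved alternative.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic curvilinear data with a small fixed perturbation (hypothetical values).
w = [25.0, 50.0, 75.0, 100.0, 150.0, 200.0, 300.0, 400.0]
noise = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2, 0.2, -0.2]
y = [10.0 + 200.0 / wi + ni for wi, ni in zip(w, noise)]

# Straight line in w.
a, b = fit_line(w, y)
r2_linear = r2(y, [a + b * wi for wi in w])

# Straight line in 1/w, i.e., a curve in w.
inv_w = [1.0 / wi for wi in w]
a2, b2 = fit_line(inv_w, y)
r2_curved = r2(y, [a2 + b2 * xi for xi in inv_w])

print(f"R2, linear in w: {r2_linear:.3f}")
print(f"R2, linear in 1/w: {r2_curved:.3f}")
```

The straight line in `w` leaves systematic residuals and a clearly lower R2, while the model that can follow the curvature explains almost all of the variance in this synthetic data.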

5. Conclusions

The main goal of this work was to identify and validate tools for constructing regression models for geotechnical applications. The task was based on the results of unique organic soil studies. For this purpose, the machine learning algorithms for multiple regression problems available in MATLAB were used. Two relations, γt (fs, qc) and γt (LOIT, w), were tested in the analyses. In the statistical analyses, the nature of these relations was qualitatively different, with the first showing a weak correlation and the second showing a strong one. The regression models were built to determine the unit weight of organic soils γt. Based on the results obtained, it can be stated that:
  • the use of machine learning to build regression models allowed for a more accurate estimation of the unit weight of organic soils for both analyzed examples;
  • machine learning tools can be an important element of a geotechnical engineer’s toolkit;
  • the use of advanced regression tools should be preceded by a statistical analysis of the data, on the basis of which comparative models can be obtained;
  • the presented results may be supplemented by a comparison with classical nonlinear regression models.

Author Contributions

Conceptualization, A.B. and M.J.S.; methodology, A.B., G.S. and M.J.S.; software, A.B.; validation, A.B., G.S. and M.J.S.; formal analysis, A.B. and M.J.S.; investigation, A.B. and G.S.; resources, A.B. and G.S.; writing—original draft preparation, A.B. and G.S.; writing—review and editing, M.J.S.; visualization, A.B.; funding acquisition, A.B. and G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financed by the Ministry of Science and Higher Education, Republic of Poland within the program “Regional Excellence Initiative”. The research leading to these results received funding from the commissioned task entitled “VIA CARPATIA Universities of Technology Network named after the President of the Republic of Poland Lech Kaczyński”, under the special purpose grant from the Minister of Science and Higher Education, contract no. MEiN/2022/DPI/2578, action entitled: “In the neighborhood-inter-university research internships and study visits”.

Data Availability Statement

Upon request, the authors are willing to provide the raw data supporting the conclusions contained in the article.

Acknowledgments

The authors are grateful to the Department of Geological Services and Design Construction and Environmental Protection GEOTECH Ltd. for providing data and allowing the data to be used for research purposes.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ML    Machine Learning
CPT   Cone Penetration Test
MAPE  Mean Absolute Percentage Error
SVM   Support Vector Machine
GPR   Gaussian Process Regression
ANN   Artificial Neural Network

References

  1. Gilbert, P.A. Effect of Sampling Disturbance on Laboratory-Measured Soil Properties; U.S. Army Engineer Waterways Experiment Station: Vicksburg, MS, USA, 1992.
  2. Mayne, P.W.; Peuchen, J.; Bouwmeester, D. Soil unit weight estimation from CPTs. In Proceedings of the 2nd International Symposium on Cone Penetration Testing, Huntington Beach, CA, USA, 9–11 May 2010.
  3. Robertson, P.; Cabal, K. Estimating soil unit weight from CPT. In Proceedings of the 2nd International Symposium on Cone Penetration Testing, Huntington Beach, CA, USA, 9–11 May 2010.
  4. Ozer, A.T.; Bartlett, S.F.; Lawton, E.C. CPTU and DMT for estimating soil unit weight of Lake Bonneville Clay. Geotech. Geophys. Site Charact. 2012, 4, 291–296.
  5. Ghanekar, R.K. Unit weight estimation from CPT for Indian offshore soft calcareous clay. In CPTU and DMT in Soft Clays and Organic Soils; Młynarek, Z., Wierzbicki, J., Eds.; Exemplum Press: Poznań, Poland, 2014; pp. 31–44.
  6. Kovacevic, M.S.; Gavin, K.G.; Reale, C.; Libric, L. The use of neural networks to develop CPT correlations for soils in northern Croatia. In Proceedings of the 4th International Symposium on Cone Penetration Testing (CPT’18), Delft, The Netherlands, 21–22 June 2018; CRC Press: Boca Raton, FL, USA, 2018.
  7. Lengkeek, H.J.; de Greef, J.; Joosten, S. CPT based unit weight estimation extended to soft organic soils and peat. In Proceedings of the 4th International Symposium on Cone Penetration Testing (CPT’18), Delft, The Netherlands, 21–22 June 2018; CRC Press: Boca Raton, FL, USA, 2018.
  8. Lengkeek, H.J.; Brinkgreve, R.B.J. CPT-based unit weight estimation extended to soft organic clays and peat: An update. In Cone Penetration Testing 2022; CRC Press: Boca Raton, FL, USA, 2022.
  9. Bishop, C.M.; Bishop, H.; Cham, S. Deep Learning—Foundations and Concepts; Springer Nature: London, UK, 2023.
  10. Deka, P.C. A Primer on Machine Learning Applications in Civil Engineering; CRC Press: Boca Raton, FL, USA, 2019.
  11. Sargiotis, D. Transforming Civil Engineering with AI and Machine Learning: Innovations, Applications, and Future Directions. Int. J. Res. Publ. Rev. 2025, 6, 3780–3805.
  12. Mobasheri, F.; Hosseinpoor, M.; Yahia, A.; Pourkamali-Anaraki, F. Machine Learning as an Innovative Engineering Tool for Controlling Concrete Performance: A Comprehensive Review. Arch. Comput. Methods Eng. 2025, 1–45.
  13. Ebid, A.M. 35 Years of (AI) in Geotechnical Engineering: State of the Art. Geotech. Geol. Eng. 2021, 39, 637–690.
  14. Padarian, J.; Minasny, B.; McBratney, A.B. Machine learning and soil sciences: A review aided by machine learning tools. Soil 2020, 6, 35–52.
  15. Liu, H.; Su, H.; Dias do Costa, D. State-of-the-art review on the use of AI-enhanced computational mechanics in geotechnical engineering. Artif. Intell. Rev. 2024, 57, 196.
  16. Rauter, S.; Tschuchnigg, F. CPT Data Interpretation Employing Different Machine Learning Techniques. Geosciences 2021, 11, 265.
  17. Abu-Farsakh, M.Y.; Shoaib, M.M. Machine Learning Models to Evaluate the Load-Settlement Behavior of Piles from Cone Penetration Test Data. Geotech. Geol. Eng. 2024, 42, 3433–3449.
  18. Nierwinski, H.P.; Pfitscher, R.J.; Barra, B.S.; Menegaz, T.; Odebrecht, E. A practical approach for soil unit weight estimation using artificial neural networks. J. South Am. Earth Sci. 2023, 131, 104648.
  19. Roy, S.; Abir, A.R.; Ansary, M.A. Use of CPT and other parameters for estimating soil unit weight using optimized machine learning models. Earth Sci. Inform. 2025, 18, 221.
  20. Lechowicz, Z.; Sulewska, M.J. Assessment of the Undrained Shear Strength and Settlement of Organic Soils under Embankment Loading Using Artificial Neural Networks. Materials 2023, 16, 125.
  21. Yousefpour, N.; Medina-Cetina, Z.; Hernandez-Martinez, F.G.; Al-Tabbaa, A. Stiffness and Strength of Stabilized Organic Soils—Part II/II: Parametric Analysis and Modeling with Machine Learning. Geosciences 2021, 11, 169.
  22. Ulloa, H.O.; Ramirez, A.; Jafari, N.H.; Kameshwar, S.; Harrouch, I. Machine Learning–Based Organic Soil Classification Using Cone Penetrometer Tests. J. Geotech. Geoenvironmental Eng. 2024, 150, 9.
  23. Czudec, G.; Bulanda, J. Geological and Engineering Geological Conditions for Recognition—Engineering for the Construction of Multi-Storey Building at Witolda Street in Rzeszów; Geotech, Ltd.: Aurora, ON, Canada; Department of Geological Services Design and Construction and the Environment: Rzeszow, Poland, 2010.
  24. Straż, G. Identification of local organic soils based on cone penetration test results. Constr. Archit. 2014, 13, 49–56.
  25. PN-B-04452:2002; Geotechnics. Field Studies. Polish Committee for Standardization: Warsaw, Poland, 2002.
  26. EN 1997-1: 2008; Eurocode 7: Geotechnical Design—Part 1: General Rules. European Committee for Standardization: Brussels, Belgium, 2008.
  27. EN 1997-2: 2009/10; Eurocode 7: Geotechnical Design—Part 2: Ground Investigation and Testing. European Committee for Standardization: Brussels, Belgium, 2009.
  28. PN-EN ISO 22476-12:2009; Geotechnical Investigation and Testing—Field Testing—Part 12: Mechanical Cone Penetration Test. Polish Committee for Standardization: Warsaw, Poland, 2009.
  29. PN-EN ISO 22476-1:2023-06; Geotechnical Investigation and Testing—Field Testing—Part 1: Electrical Cone and Piezocone Penetration Test. Polish Committee for Standardization: Warsaw, Poland, 2023.
  30. PN-B-04481: 1988; Building Soils. Laboratory Tests. Polish Committee for Standardization: Warsaw, Poland, 1988.
  31. PN-B-02480: 1986; Building Soils. Terms, Symbols, Division and Description of Soil. Polish Committee for Standardization: Warsaw, Poland, 1986.
  32. PN-EN ISO 14688-1: 2018-05; Geotechnical Investigation and Testing. Identification and Classification of Soil. Part 1: Identification and Description. European Committee for Standardization: Brussels, Belgium, 2018.
  33. PN-EN ISO 14688-2: 2018-05; Geotechnical Investigation and Testing. Identification and Classification of Soil. Part 2: Principles for a Classification. European Committee for Standardization: Brussels, Belgium, 2018.
  34. PN-EN ISO 17892-1: 2015; Geotechnical Investigation and Testing—Laboratory Testing of Soil—Part 1: Determination of Water Content. European Committee for Standardization: Brussels, Belgium, 2015.
  35. Marut, M.; Straż, G. Verification of standard guidelines for organic matter content determination in organic soils by the loss on ignition method. Geol. Rev. 2016, 64, 918–924.
  36. Heiri, O.; Lotter, A.F.; Lemcke, G. Loss on ignition as a method for estimating organic and carbonate content in sediments: Reproducibility and comparability of results. J. Paleolimnol. 2001, 25, 101–110.
  37. PN-EN ISO 17892-2:2014; Geotechnical Investigation and Testing—Laboratory Testing of Soil—Part 2: Determination of Bulk Density. European Committee for Standardization: Brussels, Belgium, 2014.
  38. Straż, G.; Borowiec, A. Evaluation of the unit weight of organic soils from a CPTM using an Artificial Neural Networks. Arch. Civ. Eng. 2021, 67, 259–281.
  39. Straż, G.; Borowiec, A. Estimating the unit weight of local organic soils from laboratory tests using artificial neural networks. Appl. Sci. 2020, 10, 2261.
  40. Stanisz, A. An Accessible Course in Statistics Using STATISTICA PL with Examples from Medicine. In Basic Statistics; StatSoft: Kraków, Poland, 2006; Volume 1: Basic Statistics.
  41. Stanisz, A. An Accessible Course in Statistics Using STATISTICA PL with Examples from Medicine. In Basic Statistics; StatSoft: Kraków, Poland, 2007; Volume 2: Linear and nonlinear models.
  42. TIBCO Software Inc. Statistica (Data Analysis Software System), v. 13. 2017. Available online: https://docs.tibco.com/pub/stat/tibco-statistica-13-3-0_documentation.zip (accessed on 6 August 2025).
  43. MathWorks: MATLAB Documentation (R2024b). Available online: https://ww2.mathworks.cn/products/new_products/release2024b.html (accessed on 13 December 2024).
  44. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. Adv. Neural Inf. Process. Syst. 2012, 25, 2960–2968.
  45. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.
  46. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 26, 123–140.
  47. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
  48. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
  49. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006.
Figure 1. Location of selected CPT research points and exploratory holes on the investment situation map [23].
Figure 3. Regression models verification scheme.
Figure 4. Data correlation matrix.
Figure 5. The MAPE error of models γt (fs, qc): (a) validation; and (b) testing.
Figure 6. Model testing results γt (fs, qc): (a) linear regression (M1); and (b) Gaussian process (M21).
Figure 7. The MAPE error of models γt (LOIT, w): (a) validation; and (b) testing.
Figure 8. Results of testing the γt (LOIT, w) models: (a) linear regression (M1); and (b) Gaussian process (M20).
Table 1. Regression models in Regression Learner [42].
Nr   Model Type                  Preset                     Interpretability  Flexibility
M1   Linear Regression           Linear                     Easy              Very low
M2   Linear Regression           Interactions Linear        Easy              Medium
M3   Linear Regression           Robust Linear              Easy              Very low
M4   Linear Regression           Stepwise Linear            Easy              Medium
M5   Decision Tree               Fine Tree                  Easy              High
M6   Decision Tree               Medium Tree                Easy              Medium
M7   Decision Tree               Coarse Tree                Easy              Low
M8   SVM                         Linear SVM                 Easy              Low
M9   SVM                         Quadratic SVM              Hard              Medium
M10  SVM                         Cubic SVM                  Hard              Medium
M11  SVM                         Fine Gaussian SVM          Hard              High
M12  SVM                         Medium Gaussian SVM        Hard              Medium
M13  SVM                         Coarse Gaussian SVM        Hard              Low
M14  Efficiently Trained Linear  Least Squares              Easy              Medium
M15  Efficiently Trained Linear  SVM                        Easy              Medium
M16  Ensemble of Trees           Boosted Trees              Easy              Medium
M17  Ensemble of Trees           Bagged Trees               Easy              Medium
M18  Gaussian Process            Squared Exponential GPR    Hard              Automatic
M19  Gaussian Process            Matern 5/2 GPR             Hard              Automatic
M20  Gaussian Process            Exponential GPR            Hard              Automatic
M21  Gaussian Process            Rational Quadratic GPR     Hard              Automatic
M22  Neural Network              Narrow Neural Network      Hard              Medium
M23  Neural Network              Medium Neural Network      Hard              Medium
M24  Neural Network              Wide Neural Network        Hard              Medium
M25  Neural Network              Bilayered Neural Network   Hard              High
M26  Neural Network              Trilayered Neural Network  Hard              High
M27  Kernel                      SVM Kernel                 Hard              Medium
M28  Kernel                      Least Squares Kernel       Hard              Medium
Table 2. Basic statistics of soil data.
Variable        Count  Mean    Min     Max     Std Dev
X1: fs [kPa]    135    47.59   9.80    209.03  42.90
X2: qc [kPa]    135    314.64  140.00  644.00  106.88
X3: LOIT [%]    135    20.72   5.02    84.93   20.73
X4: w [%]       135    103.34  23.52   417.91  106.22
Y: γt [kN/m3]   135    15.25   10.27   19.86   2.60
Table 3. Machine learning results for CPT probing data γt (fs, qc).
Nr   Model Type         Preset               R2 (Validation)  R2 (Test)  MAPE [%] (Validation)  MAPE [%] (Test)
M1   Reference model    Linear               0.211            0.253      13.87                  13.57
M2   Linear Regression  Interactions Linear  0.208            0.249      13.84                  13.55
M6   Decision Tree      Medium Tree          0.483            0.285      9.59                   11.06
M11  SVM                Fine Gaussian        0.373            0.633      10.19                  8.52
M17  Ensemble           Bagged Trees         0.498            0.363      9.40                   10.62
M21  Gaussian Process   Rational Quadratic   0.437            0.585      9.29                   9.57
M26  Neural Network     Trilayered           0.317            0.663      9.93                   7.37
Table 4. Machine learning results for laboratory data γt (LOIT, w).
Nr   Model Type         Preset               R2 (Validation)  R2 (Test)  MAPE [%] (Validation)  MAPE [%] (Test)
M1   Reference model    Linear               0.787            0.797      6.10                   7.87
M2   Linear Regression  Interactions Linear  0.927            0.934      3.40                   3.90
M5   Decision Tree      Fine Tree            0.967            0.968      1.97                   2.95
M12  SVM                Fine Gaussian        0.963            0.968      2.07                   2.38
M17  Ensemble           Bagged Trees         0.973            0.973      1.80                   2.63
M20  Gaussian Process   Exponential          0.980            0.992      1.47                   1.25
M23  Neural Network     Medium               0.971            0.989      1.75                   2.48
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Borowiec, A.; Straż, G.; Sulewska, M.J. Machine Learning Estimation of the Unit Weight of Organic Soils. Appl. Sci. 2025, 15, 9079. https://doi.org/10.3390/app15169079
