Exploring Database Quality Through Shapley Values: Application to Dynamic Soil Parameters Databases

Borderon, Julien; Dufour, Nathalie; Régnier, Julie

doi:10.3390/geotechnics5030061

Open Access Article

by Julien Borderon ¹, Nathalie Dufour ¹,* and Julie Régnier ²

¹ GéoCoD, Cerema, 30 rue Albert Einstein, 13290 Aix-en-Provence, France
² Cerema, Observatoire de la Côte d’Azur, Université Côte d’Azur, CNRS, IRD, 06560 Valbonne, France
* Author to whom correspondence should be addressed.

Geotechnics 2025, 5(3), 61; https://doi.org/10.3390/geotechnics5030061

Submission received: 19 May 2025 / Revised: 22 July 2025 / Accepted: 1 September 2025 / Published: 4 September 2025


Abstract

Geotechnical engineering faces data-related challenges, especially for dynamic soil behavior (i.e., the variation of shear modulus reduction and damping ratio curves with strain): only a few datasets are available in open-access format, and the transition to data-driven methods is slow. This lack of data, combined with variations in data collection methods, makes it difficult to build accurate predictive models. These challenges arose while developing a model, trained on three different databases, to predict shear modulus reduction curves, an important soil property for better understanding seismic hazard. Combining multiple databases can sometimes degrade model performance. To address this, an approach that is novel in geotechnics, based on Shapley values computed from an XGBoostRegressor model, is introduced. This game-theoretic method quantifies each database’s marginal contribution to the model’s R² across all possible combinations, making it possible to identify which databases contribute most to improving performance. As the number of available databases continues to grow, this method will become increasingly useful. For shear modulus reduction curves, two of the three databases explored have Shapley values of 0.341 and 0.339, while the third reaches only 0.320. This suggests that the first two databases contribute more to the model’s performance.

Keywords:

seismic hazard; nonlinear soil behavior; Shapley values; machine learning; ensemble learning; database selection; XGBoost

1. Introduction

Geotechnical engineering frequently contends with significant data challenges. As highlighted by recent studies, this field is historically a “data-poor field” with only a few datasets available and a slow transition to so-called “data-centric geotechnics” [1,2]. Moreover, the current heterogeneity of available datasets, often gathered for diverse objectives, in different regions, and by various laboratories, continues to hinder the development of accurate and generalizable predictive models. However, such models are crucial for advancing both research and practice in geotechnical engineering.

Faced with these limitations, the harmonization of geotechnical datasets becomes essential for the development of reliable models. This involves numerous data preprocessing steps, including outlier management, normalization, and data selection. Nevertheless, even with these adjustments, choosing appropriate databases remains a major challenge.

In this study, an innovative approach is introduced in geotechnical engineering by using Shapley values to quantify the individual contribution of each database to the performance of a machine learning model predicting soil behavior parameters. Initially developed within the framework of cooperative game theory [3], this method is commonly used in machine learning to evaluate the feature importance [4,5,6,7], particularly with tools like the SHAP library [8]. However, it has rarely been applied to evaluate the quality of different datasets, mainly due to the high computational cost associated with large datasets. This approach is particularly valuable as available datasets in geotechnics continue to grow, as it offers a structured method to guide future database selection, similar to its successful application in other disciplines such as medicine. Indeed, similar challenges have been observed in this field, where data collected from multiple sources are often biased by equipment or measurement errors. In such cases, Shapley values have demonstrated their effectiveness, for example, in distinguishing poor-quality data from useful data to improve pneumonia detection from radiographic image databases [9,10].

More specifically, Shapley values are employed here to objectively assess the contribution of three published Italian datasets (Facciorusso, 2020 [11]; Ciancimino et al., 2023 [12]; Gaudiosi et al., 2023 [13]) in predicting dimensionless shear modulus reduction curves (G/G_max, where G is the shear modulus and G_max is its maximum value) and the evolution of damping ratio (D) with increasing shear strain. These dynamic soil parameters are essential, particularly for understanding site effects in seismology [14]. In recent years, researchers have sought to clarify discrepancies in their measurement and to develop standardized protocols for data acquisition [15]. Meanwhile, researchers must still contend with data heterogeneity and rely on geotechnical databases such as those compiled by Facciorusso, Ciancimino, and Gaudiosi, which continue to serve as valuable resources for analyzing the dynamic behavior of Italian soils.

The article will first present these initial databases with a brief description of the laboratory methods employed to derive the dynamic soil properties, then the data preparation and methodology, and finally, the results of the Shapley value analysis.

2. Presentation of Initial Geotechnical Databases

Three geotechnical databases containing collections of shear modulus reduction and damping ratio curves versus distortion have recently been published (Facciorusso, 2020 [11]; Ciancimino et al., 2023 [12]; Gaudiosi et al., 2023 [13]). These databases were built from different sources. The Facciorusso database draws on data from the Università degli Studi di Firenze (UNIFI). The Ciancimino database draws on data from the Politecnico di Torino (POLITO), the Università degli Studi di Firenze (UNIFI), the Università degli Studi di Enna (UNIKORE), the Università degli Studi di Messina (UNIME), the Università degli Studi di Napoli (UNINA), the Sapienza Università di Roma (UNIROMA1), and the Università degli Studi (UNICH). The Gaudiosi database draws on multiple laboratories, as its data were retrieved from seismic microzonation studies.

2.1. Soil Specimens and Tests

In this study, only the results of tests performed on fine-grained soils, such as clayey and silty soils (detailed in Table 1), are analyzed. Two different types of apparatus were used: a resonant column apparatus and a cyclic double specimen direct simple shear (CDSDSS) device.

Resonant shear tests were conducted using a resonant column apparatus (Figure 1). It is used to evaluate the evolution of shear modulus and damping ratio for small ranges of deformation by applying torsional vibration stresses to the soil specimen.

In this setup, a cylindrical soil specimen is prepared and installed in a compressed air cell. Then, the specimen is saturated through a back-pressure process and consolidated with an isotropic effective confining stress. To determine the shear modulus, the test is carried out in forced vibrations (Figure 2). This involves exciting the top of the specimen by applying a sustained torsional movement with a predetermined amplitude over a specified frequency range, while the bottom of the specimen remains fixed. The frequency is adjusted in order to reach the resonance of the specimen. The first natural frequency of the specimen is estimated from the peak of the amplitude/frequency curve (Figure 3). Then, an accelerometer installed at the top cap provides a measurement of the specimen response. The resonance of the specimen under forced vibration stresses occurs for a quarter of a wavelength, which allows, in the case of torsion, the calculation of the velocity $V_S$ of the shear wave in the specimen as follows:

$$V_S = \frac{2 \pi f L}{\beta} \qquad (1)$$

In Equation (1), $f$ is the resonance frequency, $L$ is the height of the test specimen, and $\beta$ is determined using the following equation:

$$\frac{I}{I_0} = \beta \tan(\beta) \qquad (2)$$

where $I_0$ is the moment of inertia of the system, calculated during the calibration of the apparatus, and $I$ is the moment of inertia of the specimen, with $D$ its diameter and $m$ its mass, given by the following equation in terms of rotation:

$$I = \frac{m D^2}{8} \qquad (3)$$

Then, the shear modulus $G$ can be calculated from Equation (4) and the distortion $\gamma$ from Equation (5):

$$G = \rho V_S^2 \qquad (4)$$

$$\gamma = \frac{4.596 \, A_{max} D}{f L} \qquad (5)$$

where the maximum amplitude $A_{max}$ corresponds to the resonance frequency (Figure 3) [16] and where $\rho$ is the density.
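As an illustration of how Equations (1) to (4) chain together, the following minimal sketch solves Equation (2) for $\beta$ by bisection and then evaluates $V_S$ and $G$. The function names, the bisection solver, the numerical inputs, and the choice to take the density as specimen mass over cylinder volume are all assumptions made for this example; they are not taken from the databases or from the test standard.

```python
import math


def solve_beta(inertia_ratio, tol=1e-12):
    """First root of beta * tan(beta) = inertia_ratio on (0, pi/2), by bisection.

    beta * tan(beta) increases monotonically from 0 to +inf on this interval,
    so a single root exists for any positive inertia ratio.
    """
    lo, hi = 1e-12, math.pi / 2 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * math.tan(mid) < inertia_ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def resonant_column(f_hz, height_m, diameter_m, mass_kg, i0_kgm2):
    """Return (beta, V_S in m/s, G in Pa) from one resonance measurement."""
    inertia = mass_kg * diameter_m ** 2 / 8.0        # Eq. (3)
    beta = solve_beta(inertia / i0_kgm2)             # Eq. (2)
    vs = 2.0 * math.pi * f_hz * height_m / beta      # Eq. (1)
    # Assumption: bulk density = mass / volume of the cylindrical specimen.
    volume = math.pi * diameter_m ** 2 / 4.0 * height_m
    rho = mass_kg / volume
    g_pa = rho * vs ** 2                             # Eq. (4)
    return beta, vs, g_pa


# Hypothetical specimen and apparatus values, for illustration only.
beta, vs, g_pa = resonant_column(
    f_hz=50.0, height_m=0.10, diameter_m=0.05, mass_kg=0.38, i0_kgm2=2.0e-3
)
print(f"beta = {beta:.4f}, V_S = {vs:.1f} m/s, G = {g_pa / 1e6:.1f} MPa")
```

Equation (5) is left out of the sketch because the units attached to the constant 4.596 are not stated here.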

For each resonant frequency, the soil damping ratio is estimated by using either the Steady-State Vibration (SSV) or Free Vibration Decay (FVD) method. The SSV is also known as half-power bandwidth, and FVD is known as the logarithmic decrement method. Consequently, the damping ratio can be determined either from the width of the frequency response curve or the free vibration decay curve in the resonant column test.

The resonance curve obtained in forced vibrations (Figure 3) is characterized by a bandwidth $\Delta f$. The damping ratio is deduced from the following expression [17]:

$$D = \frac{\Delta f}{2 f} \qquad (6)$$

It is also possible to perform free vibration tests by stopping the vibration load instantaneously (Figure 2). The soil damping ratio can then be estimated by analyzing the time decay of the amplitude, defined using the logarithmic decrement $\delta$ in the form:

$$\delta = \frac{1}{n} \ln \left( \frac{z_1}{z_{n+1}} \right) \qquad (7)$$

where $n$ is the number of cycles separating the two peaks considered in the record, and $z_1$ and $z_{n+1}$ are the amplitudes of cycles 1 and $n + 1$.

Finally, the damping ratio is computed from the following:

$$D = \sqrt{\frac{\delta^2}{4 \pi^2 + \delta^2}} \qquad (8)$$
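Equations (6) to (8) can be written as three small functions. The sketch below is only illustrative: the function names and the numerical inputs are hypothetical, and the free-decay amplitudes are generated synthetically for a known damping ratio rather than read from a test record.

```python
import math


def damping_half_power(delta_f, f_res):
    """Steady-State Vibration (half-power bandwidth) estimate, Eq. (6)."""
    return delta_f / (2.0 * f_res)


def log_decrement(z_1, z_n_plus_1, n):
    """Logarithmic decrement over n cycles, Eq. (7)."""
    return math.log(z_1 / z_n_plus_1) / n


def damping_from_decrement(delta):
    """Free Vibration Decay estimate of the damping ratio, Eq. (8)."""
    return math.sqrt(delta ** 2 / (4.0 * math.pi ** 2 + delta ** 2))


# Hypothetical resonance curve: 3 Hz bandwidth around a 50 Hz peak.
d_ssv = damping_half_power(delta_f=3.0, f_res=50.0)

# Synthetic free decay for a viscously damped oscillator with D = 3 %:
# successive peaks shrink by exp(-2*pi*D / sqrt(1 - D**2)) per cycle.
true_d = 0.03
per_cycle = 2.0 * math.pi * true_d / math.sqrt(1.0 - true_d ** 2)
z1, z4 = 1.0, math.exp(-3 * per_cycle)
d_fvd = damping_from_decrement(log_decrement(z1, z4, n=3))

print(f"SSV: D = {d_ssv:.3f}, FVD: D = {d_fvd:.3f}")
```

For this idealized decay, Equation (8) recovers the damping ratio used to generate the peaks, which is a convenient sanity check before applying the functions to measured records.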

The same apparatus was also employed to perform closed-loop torsional shear tests (Figure 4). These tests are typically conducted at a fixed frequency of around 0.5 Hz, and the driving system applies a fixed number of sinusoidal cycles.

Unlike the resonant shear tests, the cyclic torsional shear test setup uses two displacement transducers (i.e., proximity sensors) mounted at the top of the specimen. The resulting torsional response is captured in the form of hysteresis loops on the shear stress–shear strain (τ–γ) plane (Figure 4). From these loops, the shear modulus (G) and damping ratio (D) are calculated for each loading cycle, based on the following defining equations:

$$G = \frac{\tau_{pp}}{\gamma_{pp}} \qquad (9)$$

In this equation, with reference to Figure 4, $\tau_{pp}$ and $\gamma_{pp}$ denote the double-amplitude maximum shear stress and shear strain, respectively. $W$ represents the elastic energy stored during the loading cycle (hatched area in Figure 4), while $\Delta W$ refers to the energy dissipated within the cycle (area enclosed by the hysteresis loop in Figure 4).

Finally, the damping ratio is calculated from the following:

$$D = \frac{1}{4 \pi} \frac{\Delta W}{W} \qquad (10)$$

Accordingly, for each amplitude of the applied cyclic torsional loading, the shear modulus (G), the damping ratio (D), and the induced maximum distortion (γ) are computed using Equations (9) and (10) at predefined loading cycles (i.e., 1st, 5th, 15th, 20th, and 25th). The evolution of G and D with respect to γ is then characterized by plotting G–γ and D–γ curves, which are constructed by repeating the test at progressively increasing loading amplitudes.
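The loop-based definitions in Equations (9) and (10) can be sketched numerically as follows. This is a hedged illustration, not the procedure used by the laboratories: the function name is hypothetical, the dissipated energy $\Delta W$ is taken as the polygon (shoelace) area of the sampled loop, and the stored energy $W$ is assumed to be the triangle area built on the single-amplitude stress and strain, since Figure 4 is not reproduced here. The loop itself is synthetic.

```python
import math


def loop_properties(gamma, tau):
    """Secant modulus (Eq. 9) and damping ratio (Eq. 10) for one closed loop.

    gamma: shear strain samples (as a fraction, not %), tau: shear stress (Pa).
    """
    gamma_pp = max(gamma) - min(gamma)   # double-amplitude strain
    tau_pp = max(tau) - min(tau)         # double-amplitude stress
    g_sec = tau_pp / gamma_pp            # Eq. (9)

    # Dissipated energy: area enclosed by the loop (shoelace formula).
    n = len(gamma)
    twice_area = 0.0
    for k in range(n):
        j = (k + 1) % n
        twice_area += gamma[k] * tau[j] - gamma[j] * tau[k]
    delta_w = abs(twice_area) / 2.0

    # Assumption: stored energy = triangle on the single amplitudes.
    w = 0.5 * (tau_pp / 2.0) * (gamma_pp / 2.0)
    return g_sec, delta_w / (4.0 * math.pi * w)   # Eq. (10)


# Synthetic elliptical loop (hypothetical stiffness and loss values).
g_storage, g_loss, gamma_0 = 50e6, 5e6, 1e-4
thetas = [2.0 * math.pi * k / 3600 for k in range(3600)]
gamma = [gamma_0 * math.sin(t) for t in thetas]
tau = [g_storage * gamma_0 * math.sin(t) + g_loss * gamma_0 * math.cos(t)
       for t in thetas]

g_sec, damping = loop_properties(gamma, tau)
print(f"G = {g_sec / 1e6:.2f} MPa, D = {100 * damping:.2f} %")
```

Repeating this calculation at progressively larger loading amplitudes would give the points of the G–γ and D–γ curves described above.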

Additionally, a cyclic double specimen direct simple shear (CDSDSS) device was used. Tests are carried out under constant volume conditions, and a horizontal piston is used to apply the cyclic loading [18,19]. This device operates by placing two identical cylindrical soil specimens in a stacked configuration, enclosed between a shared central platen and top/bottom caps, with lateral confinement provided by flexible stacked rings to simulate simple shear conditions. A constant vertical stress is applied, while cyclic horizontal displacement is imposed on the central platen, generating shear strains in opposite directions within the two specimens. Shear stress is measured using force transducers, and shear strain is computed from horizontal displacements relative to specimen height, allowing for the construction of hysteresis loops from which the shear modulus (G) and damping ratio (D) are derived, as in a cyclic torsional shear test.

All three tests (resonant shear tests, closed-loop torsional shear tests, and cyclic double specimen direct simple shear tests) provide the dynamic properties of soil at very small strains, namely the shear modulus G and the damping ratio D. However, differences in test protocols and inter-laboratory variability can significantly impact the comparability of data obtained from these tests. Each test involves distinct loading mechanisms and boundary conditions. Resonant shear tests determine dynamic properties by exciting the specimen at its natural frequency, typically at very small strains, under controlled resonant conditions. Cyclic torsional shear tests apply cyclic rotational shear stresses, allowing for control over shear strain amplitude and frequency, and are often used to characterize the nonlinear and damping behavior over a wider strain range. Cyclic double specimen direct simple shear (CDSDSS) tests impose cyclic shear directly on stacked specimens under constant normal stress, providing direct measurements of shear stress–strain relationships under more uniform shear conditions. As a result, differences in strain uniformity, loading paths, and strain rates inherently affect the measured shear modulus and damping ratio.

Moreover, using multiple laboratories with potentially varying protocols can introduce both systematic and random biases. First, biases may be introduced during the coring of natural samples. Disturbances caused by sampling methods, changes in stress conditions, or microcracks induced during coring can alter the intrinsic properties of the samples, leading to results that may not fully represent in situ conditions. Then, in the laboratory, differences in equipment calibration, testing procedures, sample preparation, and data analysis methods may all affect the outcomes. Variations in operator expertise and local environmental conditions (such as temperature and humidity) can further influence measurements.

In the following study, the raw data from the laboratory tests are used without any interpolation to maintain high fidelity to the data.

2.2. Soil Properties Available

In addition to the dynamic soil properties, there is significant variability in the available features among the databases. Given the specific focus of this study and the limited features in Gaudiosi, it was decided to retain only the eight features listed in Table 2.

These features provide a representation of the soil’s mechanical and physical properties. For example, the confining pressure, which simulates the in situ pressure of the soil, is applied to the soil specimen during testing. The plasticity index and the liquid limit provide information about the soil plasticity and its ability to retain water, while the water content and void ratio reflect its moisture content and porosity. Finally, depth helps to place the specimen within the soil profile. Previous studies have shown that these features have a significant impact on dynamic soil properties. The normalized shear modulus depends mainly on the mean confining pressure ($\sigma'_c$) and the plasticity index (PI): an increase in either parameter shifts the curves upward, i.e., the soil behavior tends to be more linear. Vucetic & Dobry (1991) [20] indicated that the void ratio and two testing parameters (maximal deformation and the number of cycles) also have a great influence. The damping ratio mainly depends on the mean confining pressure ($\sigma'_c$) [21] and the plasticity index (PI) [22,23,24,25].

3. Data Preparation

Several steps were taken to ensure that the datasets exclusively contained soils relevant to the study before applying any model. First, the datasets were checked to include only soil specimens classified (USCS) as “CL” (low plasticity clay), “CH” (high plasticity clay), “MH” (high plasticity silt), “ML” (low plasticity silt), or any mix between these types. Then, duplicate rows were identified and removed to ensure that each point from any test is represented only once. In the same way, rows with missing values for USCS classification, G/G_max, and D (only for the part of the analysis that focuses on this parameter) were also removed. Then, to ensure that the analysis focused on a significant amount of data and to minimize error, extensive cleaning steps were applied. Outliers were detected and removed using the z-score method, eliminating specimens with an absolute z-score greater than 3. This value is widely accepted and utilized in data science. The statistical basis for this threshold lies in the empirical rule, which states that 99.7% of data points fall within ±3 standard deviations ($\sigma$) of the mean in a normal distribution. This choice strikes a balance between maintaining data quality and ensuring that the dataset remains representative of the underlying population.
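The z-score rule described above can be sketched in a few lines. The sketch makes several assumptions that the text does not settle: the function name and the sample rows are hypothetical, the population standard deviation is used, and the filter is applied to one column at a time.

```python
from statistics import mean, pstdev


def zscore_filter(rows, column, threshold=3.0):
    """Keep rows whose |z-score| on `column` does not exceed `threshold`.

    rows: list of dicts (one per data point). Uses the population standard
    deviation, which is an assumption of this sketch.
    """
    values = [row[column] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(rows)  # constant column: nothing to flag
    return [row for row in rows if abs(row[column] - mu) / sigma <= threshold]


# Hypothetical plasticity-index column with one implausible entry.
rows = [{"PI": 10.0 + 0.1 * (i % 5)} for i in range(50)] + [{"PI": 100.0}]
cleaned = zscore_filter(rows, "PI")
print(len(rows), "->", len(cleaned))
```

Note that with very few rows a single extreme value cannot reach a z-score of 3, so the rule is only meaningful on reasonably large samples such as the databases considered here.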

Moreover, tests with fewer than six points were removed to guarantee a minimum resolution. This threshold was selected to ensure that each test is representative of the shear modulus and damping curves: below it, the curves may not be well described, while a higher threshold would discard an excessive amount of data. In addition, to ensure a monotonic relationship between γ (%) and G/G_max, tests showing a rise in G/G_max greater than 0.075 between two consecutive γ (%) values were also removed. This threshold was chosen to tolerate a small rise that could be attributed to measurement uncertainty while preventing the model from learning a relationship that is not monotonically decreasing. Finally, after completing the data cleaning process, the number of rows in each database was reduced: from 2861 to 2686 for Facciorusso, from 3113 to 2321 for Gaudiosi, and from 2063 to 1628 for Ciancimino. The distribution of each feature was then checked to ensure consistency between datasets.
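The two curve-level rules (at least six points, no rise in G/G_max above 0.075 between consecutive strain levels) can be expressed as a single predicate. The function name and the example curves below are hypothetical, and sorting the points by strain before checking is an assumption of this sketch.

```python
def keep_curve(points, min_points=6, max_rise=0.075):
    """Decide whether one test curve passes the two cleaning rules.

    points: list of (gamma_percent, g_over_gmax) pairs for a single test.
    """
    if len(points) < min_points:
        return False
    ordered = sorted(points)  # increasing shear strain
    for (_, g_prev), (_, g_next) in zip(ordered, ordered[1:]):
        if g_next - g_prev > max_rise:
            return False  # rise too large to be measurement scatter
    return True


# Hypothetical curves, for illustration only.
smooth = [(1e-4, 1.00), (1e-3, 0.98), (1e-2, 0.85),
          (5e-2, 0.60), (1e-1, 0.45), (5e-1, 0.20)]
noisy_ok = [(1e-4, 0.97), (1e-3, 1.00), (1e-2, 0.85),
            (5e-2, 0.60), (1e-1, 0.45), (5e-1, 0.20)]
jumpy = [(1e-4, 0.90), (1e-3, 1.00), (1e-2, 0.85),
         (5e-2, 0.60), (1e-1, 0.45), (5e-1, 0.20)]

for name, curve in [("smooth", smooth), ("noisy_ok", noisy_ok),
                    ("jumpy", jumpy), ("short", smooth[:5])]:
    print(name, keep_curve(curve))
```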

As shown in Figure 5, the distributions of the input parameters overlap well, indicating consistency across datasets. On a log₁₀ scale, shear strain γ (%) ranges primarily from −5 to 1, with all datasets peaking around −2.5. Confining pressure $\sigma'_c$ (kPa) is right-skewed, mostly between 50 kPa and 400 kPa. Depth, z (m), is concentrated between 5 m and 20 m. Plasticity Index (PI) and Liquid Limit (LL) are mostly within 10–40 and 30–60, respectively, with overlapping peaks around 20 for PI and 45 for LL. The initial void ratio $e_0$ lies mainly between 0.5 and 1.0, peaking near 0.75 in all datasets. Water content (w) is tightly clustered between 15% and 30%, and the density $\rho$ (t/m³) is mostly between 1.6 t/m³ and 2.1 t/m³.

Finally, a test dataset comprising 20% of each database was created and remained constant across all tests for calculating Shapley values, which allows us to have a consistent evaluation of the model’s performance.

4. Method

4.1. Shapley Values

In this study, Shapley values are used to evaluate the contribution of each dataset as if they were players in a cooperative game where the goal is to optimize the model’s performance. In this scenario, the payoff is the improvement in model accuracy, quantified through the R² metric calculated on the test set. Each database’s contribution is quantified by calculating its impact on the model’s performance when it is included versus when it is excluded, considering all possible combinations of databases. This allows us to assign a precise value to each database’s contribution, providing a unique perspective on their value in enhancing the overall accuracy.

Although our model is non-linear, and metrics such as MAE or RMSE are often used in conjunction with R² for regression tasks, we chose to focus on R² as it remains the standard and most widely reported metric in regression-based performance evaluation. This choice is consistent with prior work, including studies by Wu et al. (2023) [26] and Huang et al. (2023) [27], where R² was used as the primary performance metric.

The Shapley value for each database is calculated using the following equation:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left( v(S \cup \{i\}) - v(S) \right) \qquad (11)$$

where:

$\phi_i$ is the Shapley value for the $i$-th database
$S$ is a subset of all databases, excluding the $i$-th database
$N$ is the total set of databases
$n$ is the total number of databases
$v(S)$ is the payoff function, defined herein as the $R^2$ of the model trained with the subset $S$ of databases
$v(S \cup \{i\})$ represents the payoff when database $i$ is added to subset $S$

Thus, by evaluating the marginal contribution of each database across all potential combinations, a new metric that allows us to measure the impact of each dataset on the chosen model is created. This impact is a direct way to measure the quality of a dataset for training predictive models.
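Equation (11) can be evaluated exhaustively when only a few databases are involved. The sketch below is a minimal illustration under stated assumptions: the three players "A", "B", and "C" and their R² payoffs are invented for the example (they are not results from this study), and the payoff of the empty set is assumed to be 0. In practice, `payoff(S)` would train the model on the union of the databases in `S` and return the R² on the fixed test set.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, payoff):
    """Exact Shapley values (Eq. 11) by enumerating every subset."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (payoff(s | {i}) - payoff(s))
        phi[i] = total
    return phi


# Invented R^2 table for three hypothetical databases (illustration only).
r2_table = {
    frozenset(): 0.0,  # assumption: no training data gives zero payoff
    frozenset("A"): 0.90, frozenset("B"): 0.89, frozenset("C"): 0.80,
    frozenset("AB"): 0.95, frozenset("AC"): 0.92, frozenset("BC"): 0.91,
    frozenset("ABC"): 0.94,
}

phi = shapley_values(["A", "B", "C"], lambda s: r2_table[frozenset(s)])
for name, value in phi.items():
    print(name, round(value, 3))
```

By construction, the values sum to $v(N) - v(\emptyset)$. The Shapley values quoted in the abstract (0.341, 0.339, and 0.320) sum to 1.000, which would be consistent with an additional normalization step, although this is not stated explicitly here.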

4.2. Model Chosen

In this study, different families of machine learning models were evaluated, including support vector machines (SVMs), nearest neighbor algorithms (KNN regressor), decision trees, and ensemble tree-based approaches (like RandomForest). Among them, Shapley values were used with an XGBoostRegressor model [28] (Figure 6), which proved to be the most efficient approach, consistent with the findings of Huang et al. (2023) [27]. Since its introduction in 2016, this model has remained a leading machine learning solution for nonlinear regression tasks on classical tabular data. Its dominance is evidenced by its repeated use among Kaggle competition winners [29], where it consistently outperforms deep learning models, particularly on datasets with limited specimens [30].

XGBoostRegressor is part of the ensemble learning family and is based on the principle of boosting. As illustrated in Figure 6, the model builds decision trees sequentially, where each new tree is trained to reduce the residual errors left by the previous trees. The process begins with an initial prediction, y₀, typically set to a constant value optimized by the model depending on the inputs (see the base score optimization in XGBoostRegressor documentation). The first tree, f₁(x), is trained to minimize the error between this initial prediction and the true target values. Residuals are then calculated and used as the target for training the next tree, f₂(x). Each subsequent tree, f_i(x), is constructed to predict the residuals left by the ensemble of all previous trees. This sequential process ensures that the model focuses on data points where earlier predictions were less accurate, thereby boosting its overall performance.

At the leaf level, a weight l_j is computed for the leaf j using the gradient g_k and the Hessian h_k of the loss function computed from the k residuals in the leaf, penalized by a regularization term λ, which helps prevent overfitting. The predictions are updated iteratively by adding the contributions from all of the trees scaled by η, the learning rate that controls the contribution of each tree.
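Written out with the notation of the paragraph above, the standard XGBoost leaf weight and prediction update take the following form. This is the general expression from the XGBoost formulation, given here as a reading aid rather than as a detail specific to this study; the iteration index $t$ is introduced for the example.

$$l_j = -\frac{\sum_{k \in \text{leaf } j} g_k}{\sum_{k \in \text{leaf } j} h_k + \lambda}, \qquad \hat{y}^{(t)}(x) = \hat{y}^{(t-1)}(x) + \eta \, f_t(x)$$

For a squared-error loss $\tfrac{1}{2}(y_k - \hat{y}_k)^2$, the gradient is $g_k = \hat{y}_k - y_k$ and the Hessian is $h_k = 1$, so the leaf weight reduces to the sum of the residuals in the leaf divided by the number of points plus $\lambda$. A larger $\lambda$ therefore shrinks every leaf weight toward zero.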

This architecture allows the model to handle correlated input features and to deal with missing values, making it robust and adaptable to different geotechnical datasets.

To ensure optimal performance, hyperparameter optimization was performed using GridSearchCV at each run on each database combination, as shown in Figure 7. The grid included the following search space: ‘xgbregressor__n_estimators’: range(10, 80, 10), ‘xgbregressor__learning_rate’: [0.001, 0.005, 0.01, 0.1], and ‘xgbregressor__max_depth’: range(3, 7, 1), allowing us to select the best combination for each experimental scenario.
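To make the size of this search space concrete, the sketch below expands the grid in plain Python. The dictionary keys are copied from the text; the `xgbregressor__` prefix suggests a scikit-learn pipeline step with that name, which is an assumption, and the helper `expand_grid` is hypothetical. The same dictionary could be passed as the `param_grid` argument of `GridSearchCV`.

```python
from itertools import product

# Search space as listed in the text. Python ranges exclude the stop value,
# so n_estimators runs from 10 to 70 and max_depth from 3 to 6.
param_grid = {
    "xgbregressor__n_estimators": list(range(10, 80, 10)),
    "xgbregressor__learning_rate": [0.001, 0.005, 0.01, 0.1],
    "xgbregressor__max_depth": list(range(3, 7, 1)),
}


def expand_grid(grid):
    """List every hyperparameter combination in the grid."""
    keys = list(grid)
    return [dict(zip(keys, combo))
            for combo in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
print(len(candidates), "candidate settings per database combination")
```

Each candidate is refitted once per cross-validation fold, and the whole search is repeated for every database combination needed by the Shapley computation.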

5. Results

As seen in the data presentation section, the Gaudiosi database lacks the variables e₀ and w, which are present in the Facciorusso and Ciancimino databases. To evaluate the impact of these missing features on the Shapley values, four scenarios were considered. For predicting G/G_max, one scenario used only the first six variables listed in Table 1, while another included all of the available variables. Similarly, for predicting damping ratio D, one scenario incorporated the six variables plus G/G_max as an additional feature (seven inputs), and the other used all of the available variables, including G/G_max. Additionally, as explained in Section 3, data-cleaning choices were made on each database, and to measure this impact, the Shapley values were computed before and after this cleaning. Given the size of the datasets involved, no computational time issues were encountered, and the Shapley values could be computed exhaustively (the R² values are in Appendix A).

5.1. Without Extensive Cleaning

Table 3 shows the results obtained on the datasets prior to the physical cleaning choices implemented. It can be observed that the absence of the variables e₀ and w does not lead to a greater contribution of Gaudiosi in the model predicting G/G_max. This is not surprising, as these two variables are not used in the parametric reference models [23,24,25] and are therefore not expected to play a significant role. Furthermore, it is noticeable that the contributions vary depending on whether the goal is to predict G/G_max or D. For predicting G/G_max, Facciorusso and Ciancimino contribute almost equally (with Shapley values of 0.346 and 0.342, respectively, when using all inputs), whereas Gaudiosi achieves a lower Shapley value of 0.312. Conversely, for the prediction of damping ratio D, Gaudiosi’s contribution increases to 0.340 (with all inputs).

5.2. With Extensive Cleaning

After the implementation of the extensive cleaning process, we can observe in Table 4 that the contributions of Facciorusso and Ciancimino slightly decrease when compared to previous results in predicting G/G_max, unlike the contribution of Gaudiosi, which increases from 0.312 to 0.319 with the 6 inputs and from 0.312 to 0.320 when all inputs are considered. The cleaning choices that have been made had a slight impact and helped harmonize the datasets for the prediction of this parameter.

For the prediction of damping ratio D, the cleaning choices applied increased the role of Facciorusso while reducing the impact of Gaudiosi. However, as was the case without extensive cleaning, the relative ranking of Ciancimino and Gaudiosi remains reversed depending on whether the prediction target is G/G_max or D.

This difference in each dataset’s influence on model training can partly be explained by the origin of each dataset and the methods used to build it. Indeed, the most comparable databases (Ciancimino and Facciorusso) provided data that are directly dynamic soil test results from Italian laboratories, while for the Gaudiosi database, the data come from various sources (different laboratories and even literature data) and were preprocessed to be useful in a microzoning study.

Thus, the best R² with a value of 0.97, for the prediction of G/G_max, is obtained by using only the two highest quality datasets, Facciorusso and Ciancimino, in the training sets as shown in Table A1 (Appendix A) where all R² values are summarized. For the prediction of D, the best R², with a value of 0.78, is obtained this time by using all three datasets. This parameter is much harder to measure than G/G_max and therefore exhibits greater variability, as demonstrated in laboratory cross-validation tests [15]. Indeed, the shear modulus is generally more straightforward to determine, as it primarily depends on the amplitude and phase of the applied load relative to the induced deformation. It can be derived from the shear stress–strain curve or estimated from resonant frequency measurements, both of which are recognized for their precision and repeatability. Conversely, the damping ratio, which characterizes energy dissipation under cyclic loading, is intrinsically more susceptible to measurement uncertainties due to several factors. At low strain levels, commonly encountered in resonant column or small-amplitude cyclic tests, the hysteresis loops are narrow, and the enclosed area (representing energy loss per cycle) becomes highly sensitive to noise and minor inaccuracies in displacement or load measurements. Furthermore, the apparent damping encompasses not only material specimen damping but also contributions from system effects, including equipment friction, compliance, and electronic noise, which are often difficult to decouple. Consequently, while both parameters are critical for defining the dynamic behavior of soils, damping measurements typically display greater variability and demand more rigorous interpretation and stringent testing protocols. According to Mog and Anbazhagan (2022) [31], the number of successive loading cycles has a substantial impact on the measured damping ratio at small-to-medium strain amplitudes. 
For medium strains (>0.005%) in FVD tests, it is recommended to use the damping ratio from the second or third cycle, which provides better consistency. Moreover, the discrepancy between damping values obtained from SSV and FVD methods can exceed ±15%, partly due to ambient noise at low strains and partly because of the lower number of cycles in FVD tests compared to SSV tests. These factors contribute to the inherent scatter in damping measurements, making them more difficult to predict accurately.

5.3. Sensitivity of Shapley Values to Gaussian Noise Added to the Training Target (G/G_max)

To further test the ability of the Shapley value framework to reflect data quality, a perturbation analysis was performed by introducing artificial noise in the output variable G/G_max while using the datasets after extensive cleaning with all inputs. Specifically, Gaussian noise with a standard deviation $\sigma_{noise}$ = 0.1 was added to the training target of each database independently. This level of distortion is representative of a strong degradation in label precision, potentially simulating experimental or transcription errors. The expectation was that a degradation would be captured by a decrease in the associated Shapley value.
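The perturbation step can be sketched as follows. The function name, the seed handling, and the placeholder targets are hypothetical, and the sketch assumes zero-mean noise with no clipping of the perturbed G/G_max values to the range [0, 1], which is not specified here.

```python
import random
from statistics import pstdev


def add_gaussian_noise(targets, sigma_noise=0.1, seed=0):
    """Return a noisy copy of the training targets (inputs are left untouched)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [t + rng.gauss(0.0, sigma_noise) for t in targets]


# Placeholder G/G_max training targets for one database (illustration only).
clean = [0.5] * 20000
noisy = add_gaussian_noise(clean, sigma_noise=0.1, seed=42)

spread = pstdev(n - c for n, c in zip(noisy, clean))
print(f"empirical noise standard deviation: {spread:.3f}")
```

In the analysis described above, the noisy targets would replace the clean ones for a single database at a time before the Shapley values are recomputed, while the test set stays unperturbed.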

As shown in Table 5, this behavior is indeed observed, although the decrease remains moderate. For instance, the contribution of the Ciancimino database drops from 0.341 to 0.334 when noise is added to its target values, while that of the Facciorusso database drops slightly to 0.336 under similar conditions. That of the Gaudiosi database, already associated with a lower baseline contribution, decreases from 0.320 to 0.311 when affected by noise. All R² associated values are summarized in Table A2 (Appendix A).

These results confirm that the Shapley value method is able to detect a loss of data reliability, even when the underlying input features remain unchanged. This property reinforces the potential of Shapley values as a tool not only for measuring the utility of datasets in terms of model performance but also for quantifying their reliability and resilience to noise.

6. Conclusions

This study demonstrates a new way of using Shapley values as a tool for evaluating and optimizing the contributions of different databases in predictive modeling within geotechnical engineering. By quantifying the impact of each database, it is possible to decide which datasets are most useful for predicting deformability or resistance parameters.

The demonstration used three recently published databases, comprising around 540 results from dynamic tests conducted in soil mechanics laboratories across Italy. These databases offer detailed data, including the evolution of shear modulus and damping with strain, alongside key parameters such as state conditions, physical properties of the tested specimens, and loading conditions. Three distinct testing methods were used to build these databases: the resonant column test, the cyclic torsional shear test, and the cyclic double-specimen direct simple shear (CDSDSS) test.

Analysis of the results obtained with the Shapley values method shows that the contribution of each database varies with the prediction target (here, either the normalized shear modulus reduction curve or the damping ratio, both as functions of shear strain). The Ciancimino and Facciorusso databases are more useful than the Gaudiosi database for training a model that predicts normalized shear modulus reduction curves. Conversely, for damping prediction, the contributions of the Facciorusso and Gaudiosi databases appear to be predominant.

This new approach offers a promising framework for the future, particularly when more geotechnical data become available. It could be applied at two levels within the workflow of geotechnical studies: first, in the selection of laboratory tests, and more broadly, in assessing the impact of different in situ or laboratory testing campaigns on the design of geotechnical structures.

First, the method should prompt a critical examination of the heterogeneity observed in test results (whether from laboratory or in situ investigations) conducted by different operators. In particular, concerning the determination of dynamic parameters in the laboratory, the American standard (ASTM, 2021) [32] underscores the necessity of cross-check testing to harmonize procedures. The method presented herein could thus serve as a tool to systematically analyze a large number of interlaboratory tests in this regard.

Second, the method can be used to select the dynamic soil parameters (or other hydro-mechanical characteristics of soils) when establishing the geotechnical model of a site. Indeed, several site investigation campaigns (including in situ and laboratory tests) may have been carried out on the same site at different times and by different operators.

By applying the Shapley values method, researchers and engineers can better assess the heterogeneity of datasets and ensure that the most appropriate data sources are used to improve predictive results. This methodology provides a valuable decision-making tool for data selection, paving the way for more robust and accurate models in the field of geotechnical engineering.

Author Contributions

Conceptualization, J.B., N.D. and J.R.; methodology, J.B., N.D. and J.R.; software, J.B.; validation, J.B., N.D. and J.R.; data curation, J.B.; writing—original draft preparation, J.B.; writing—review and editing, J.B., N.D. and J.R.; supervision, N.D. and J.R.; project administration, N.D.; funding acquisition, N.D. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to express their gratitude to the Carnot Clim’adapt Institute of Cerema (MedIA project) for supporting this research.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article. The code is available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. R² values for G/G_max and D predictions under different cleaning and feature availability scenarios (Section 5.1 and Section 5.2).

	G/G_max				D
	Without Extensive Cleaning		With Extensive Cleaning		Without Extensive Cleaning		With Extensive Cleaning
	Without $e_{0}$ and w	With $e_{0}$ and w	Without $e_{0}$ and w	With $e_{0}$ and w	Without $e_{0}$ and w	With $e_{0}$ and w	Without $e_{0}$ and w	With $e_{0}$ and w
Ciancimino	0.934	0.929	0.954	0.950	0.686	0.695	0.635	0.637
Facciorusso	0.935	0.935	0.952	0.950	0.742	0.744	0.723	0.736
Gaudiosi	0.885	0.885	0.927	0.927	0.735	0.735	0.671	0.671
Ciancimino + Facciorusso	0.939	0.939	0.958	0.958	0.745	0.752	0.719	0.726
Ciancimino + Gaudiosi	0.925	0.924	0.943	0.945	0.749	0.737	0.687	0.711
Facciorusso + Gaudiosi	0.932	0.927	0.943	0.940	0.771	0.764	0.737	0.726
Ciancimino + Facciorusso + Gaudiosi	0.936	0.935	0.951	0.952	0.776	0.777	0.744	0.779
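As a worked check of how these R² values map to the reported contributions, the three-player Shapley formula can be applied to the G/G_max column "With Extensive Cleaning, With $e_{0}$ and w". Two conventions are assumed here because they are not restated in this appendix: the empty coalition is given a value of zero, $v(\emptyset) = 0$, and the normalized value is the Shapley value divided by the sum over the three databases. The symbols C, F, and G are shorthand for Ciancimino, Facciorusso, and Gaudiosi.

$$
\begin{aligned}
\phi_{C} &= \tfrac{1}{3}\,v(\{C\}) + \tfrac{1}{6}\,\big[v(\{C,F\}) - v(\{F\})\big] + \tfrac{1}{6}\,\big[v(\{C,G\}) - v(\{G\})\big] + \tfrac{1}{3}\,\big[v(\{C,F,G\}) - v(\{F,G\})\big] \\
&= \tfrac{0.950}{3} + \tfrac{0.958 - 0.950}{6} + \tfrac{0.945 - 0.927}{6} + \tfrac{0.952 - 0.940}{3} = 0.325, \\
\phi_{F} &= 0.3225, \qquad \phi_{G} = 0.3045, \qquad \phi_{C} + \phi_{F} + \phi_{G} = 0.952 = v(\{C,F,G\}), \\
\hat{\phi}_{C} &= \frac{0.325}{0.952} \approx 0.341, \qquad \hat{\phi}_{F} \approx 0.339, \qquad \hat{\phi}_{G} \approx 0.320 .
\end{aligned}
$$

Under these assumptions, the normalized values agree with the "All Inputs" G/G_max column of Table 4.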

Table A2. R² values for G/G_max prediction with Gaussian noise (σ = 0.1) added to individual training datasets (as discussed in Section 5.3).

	G/G_max
	Noise on Ciancimino	Noise on Facciorusso	Noise on Gaudiosi
Ciancimino	0.929	0.950	0.950
Facciorusso	0.950	0.941	0.950
Gaudiosi	0.927	0.927	0.904
Ciancimino + Facciorusso	0.956	0.955	0.958
Ciancimino + Gaudiosi	0.944	0.945	0.940
Facciorusso + Gaudiosi	0.940	0.945	0.941
Ciancimino + Facciorusso + Gaudiosi	0.953	0.954	0.948
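The link between Table A2 and Table 5 can be sketched in a few lines of Python. This is an illustrative computation, not the authors' implementation: the function names are hypothetical, the empty coalition is assumed to score 0, and the normalization is assumed to divide each Shapley value by their sum. The R² values are copied from Table A2.

```python
from itertools import combinations
from math import factorial

C, F, G = "Ciancimino", "Facciorusso", "Gaudiosi"
PLAYERS = [C, F, G]


def shapley_values(players, score):
    """Exact Shapley values from a dict mapping frozenset(coalition) -> score."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        phi[p] = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                phi[p] += weight * (score[s | {p}] - score[s])
    return phi


def normalize(phi):
    """Divide each Shapley value by the sum over all players."""
    total = sum(phi.values())
    return {p: value / total for p, value in phi.items()}


def column(c, f, g, cf, cg, fg, cfg):
    """Build the coalition -> R² mapping for one column of Table A2."""
    return {
        frozenset(): 0.0,  # assumption: empty coalition scores 0
        frozenset({C}): c, frozenset({F}): f, frozenset({G}): g,
        frozenset({C, F}): cf, frozenset({C, G}): cg, frozenset({F, G}): fg,
        frozenset({C, F, G}): cfg,
    }


TABLE_A2 = {
    "Noise on Ciancimino": column(0.929, 0.950, 0.927, 0.956, 0.944, 0.940, 0.953),
    "Noise on Facciorusso": column(0.950, 0.941, 0.927, 0.955, 0.945, 0.945, 0.954),
    "Noise on Gaudiosi": column(0.950, 0.950, 0.904, 0.958, 0.940, 0.941, 0.948),
}

normalized = {
    name: normalize(shapley_values(PLAYERS, scores))
    for name, scores in TABLE_A2.items()
}

for name, values in normalized.items():
    print(name, {p: round(v, 3) for p, v in values.items()})
```

Under the stated assumptions, the rounded output agrees with Table 5 to three decimals (for example, 0.334, 0.342, and 0.324 for noise on Ciancimino). The exact enumeration is practical here because three databases give only eight coalitions; the number of coalitions doubles with each added database.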

References

1. Bozorgzadeh, N.; Feng, Y. Evaluation Structures for Machine Learning Models in Geotechnical Engineering. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2024, 18, 52–59.
2. Phoon, K.-K. The Story of Statistics in Geotechnical Engineering. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2020, 14, 3–25.
3. Roth, A.E. The Shapley Value: Essays in Honor of Lloyd S. Shapley; Cambridge University Press: Cambridge, UK, 1988.
4. Lin, S.; Liang, Z.; Zhao, S.; Dong, M.; Guo, H.; Zheng, H. A Comprehensive Evaluation of Ensemble Machine Learning in Geotechnical Stability Analysis and Explainability. Int. J. Mech. Mater. Des. 2024, 20, 331–352.
5. Ngo, A.Q.; Nguyen, L.Q.; Tran, V.Q. Developing Interpretable Machine Learning-Shapley Additive Explanations Model for Unconfined Compressive Strength of Cohesive Soils Stabilized with Geopolymer. PLoS ONE 2023, 18, e0286950.
6. Hansen, T.F. Can We Trust the Machine Learning Based Geotechnical Model? In Proceedings of the Information Technology in Geo-Engineering, Golden, CO, USA, 5–8 August 2024; Gutierrez, M., Ed.; Springer Nature: Cham, Switzerland, 2025; pp. 332–340.
7. Kannangara, K.K.P.M.; Zhou, W.; Ding, Z.; Hong, Z. Investigation of Feature Contribution to Shield Tunneling-Induced Settlement Using Shapley Additive Explanations Method. J. Rock Mech. Geotech. Eng. 2022, 14, 1052–1063.
8. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874.
9. Tang, S.; Ghorbani, A.; Yamashita, R.; Rehman, S.; Dunnmon, J.A.; Zou, J.; Rubin, D.L. Data Valuation for Medical Imaging Using Shapley Value and Application to a Large-Scale Chest X-Ray Dataset. Sci. Rep. 2021, 11, 8366.
10. Pandl, K.D.; Feiland, F.; Thiebes, S.; Sunyaev, A. Trustworthy Machine Learning for Health Care: Scalable Data Valuation with the Shapley Value. In Proceedings of the Conference on Health, Inference, and Learning, Virtual, 8–10 April 2021; ACM: New York, NY, USA, 2021; pp. 47–57.
11. Facciorusso, J. An Archive of Data from Resonant Column and Cyclic Torsional Shear Tests Performed on Italian Clays. Earthq. Spectra 2021, 37, 545–562.
12. Ciancimino, A.; Cosentini, R.M.; Foti, S.; Lanzo, G.; Pagliaroli, A.; Pallara, O. The PoliTO–UniRoma1 Database of Cyclic and Dynamic Laboratory Tests: Assessment of Empirical Predictive Models. Bull. Earthq. Eng. 2023, 21, 2569–2601.
13. Gaudiosi, I.; Romagnoli, G.; Albarello, D.; Fortunato, C.; Imprescia, P.; Stigliano, F.; Moscatelli, M. Shear Modulus Reduction and Damping Ratios Curves Joined with Engineering Geological Units in Italy. Sci. Data 2023, 10, 625.
14. Régnier, J.; Cadet, H.; Bonilla, L.F.; Bertrand, E.; Semblat, J.-F. Assessing Nonlinear Behavior of Soils in Seismic Site Response: Statistical Analysis on KiK-Net Strong-Motion Data. Bull. Seismol. Soc. Am. 2013, 103, 1750–1770.
15. Dufour, N.; Calissano, H.; Batilliot, L.; Rogoff, I.; Simon, C.; Disantantonio, A.; Lando, T.; Bourguignon, E. Resonant Column Round-Robin Testing. In Geotechnical Engineering Challenges to Meet Current and Emerging Needs of Society; CRC Press: Boca Raton, FL, USA, 2024; pp. 1100–1103.
16. Pigeot, L.; Dufour, N.; Calissano, H.; Dermenonville, F.; Soive, A. Influence of the Curing Stress Effect on the Stiffness Degradation Curve of a Silt Stabilized with Lime and Cement. Eng. Geol. 2024, 337, 107574.
17. Semblat, J.-F.; Lenti, L.; Jacqueline, D.; Leblond, J.-J.; Grasso, E. Railway Vibrations Induced into the Soil: Experiments, Modelling and Isolation. - Vibrations Induites Dans Les Sols Par Le Trafic Ferroviaire: Expérimentations, Modélisations et Isolation. arXiv 2011, arXiv:1108.3404.
18. Doroudian, M.; Vucetic, M. A Direct Simple Shear Device for Measuring Small-Strain Behavior. Geotech. Test. J. 1995, 18, 69–85.
19. Doroudian, M.; Vučetić, M. Small-Strain Testing in an NGI-Type Direct Simple Shear Device. In Geotechnical Hazards; CRC Press: Boca Raton, FL, USA, 1998; ISBN 978-1-003-07817-3.
20. Vucetic, M.; Dobry, R. Effect of Soil Plasticity on Cyclic Response. J. Geotech. Engrg. 1991, 117, 89–107.
21. Matesic, L.; Hsu, C.-C.; D’Elia, M.; Vučetić, M. Development of Database of Cyclic Soil Properties from 94 Tests on 47 Soils; Missouri University of Science and Technology: San Diego, CA, USA, 2010.
22. Dobry, R. Dynamic Properties and Seismic Response of Soft Clay Deposits. Proc. Int. Symp. Geotech. Engrg. Soft Soils 1987, 2, 51–87.
23. Ishibashi, I.; Zhang, X. Unified Dynamic Shear Moduli and Damping Ratios of Sand and Clay. Soils Found. 1993, 33, 182–191.
24. Darendeli, M.B. Development of a New Family of Normalized Modulus Reduction and Material Damping Curves; The University of Texas at Austin: Austin, TX, USA, 2001.
25. Zhang, J.; Andrus, R.D.; Juang, C.H. Normalized Shear Modulus and Material Damping Ratio Relationships. J. Geotech. Geoenviron. Eng. 2005, 131, 453–464.
26. Wu, Q.; Wang, Z.; Qin, Y.; Yang, W. Intelligent Model for Dynamic Shear Modulus and Damping Ratio of Undisturbed Marine Clay Based on Back-Propagation Neural Network. J. Mar. Sci. Eng. 2023, 11, 249.
27. Huang, Y.; Wang, Y.; Xu, Z.; Wang, P. Prediction and Variable Importance Analysis for Small-Strain Stiffness of Soil Based on Ensemble Learning with Bayesian Optimization. Comput. Geotech. 2023, 162, 105688.
28. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
29. Nielsen, D. Tree Boosting with XGBoost. Master’s Thesis, NTNU, Taipei, Taiwan, 2016.
30. Shwartz-Ziv, R.; Armon, A. Tabular Data: Deep Learning Is Not All You Need. Inf. Fusion 2022, 81, 84–90.
31. Mog, K.; Anbazhagan, P. Evaluation of the Damping Ratio of Soils in a Resonant Column Using Different Methods. Soils Found. 2022, 62, 101091.
32. Standard Test Methods for Modulus and Damping of Soils by Fixed-Base Resonant Column Devices. Available online: https://store.astm.org/d4015-21.html (accessed on 15 July 2025).

Figure 1. Resonant column apparatus (a) and zoom (b) (source: Cerema, GéoCoD laboratory).

Figure 2. Type of loading: forced cyclic vibrations or free vibrations.

Figure 3. Typical response of a specimen under forced vibration: Amplitude–frequency curve obtained from a resonant shear test.

Figure 4. Shear modulus G and damping ratio D determined from the shear stress–strain hysteresis loop (torsional shear test).

Figure 5. Distribution of input features across the three datasets.

Figure 6. Flowchart of the XGBoostRegressor model.

Figure 7. Flowchart illustrating the computation of Shapley values.

Table 1. Number and type of tests performed in each database.

Database	Number of Tests	Type of Tests
(Facciorusso, 2020) [11]	170	Resonant shear and cyclic torsional shear tests
(Ciancimino, 2023) [12]	187	Resonant shear, cyclic torsional shear, and CDSDSS tests
(Gaudiosi, 2023) [13]	180	Resonant shear, cyclic torsional shear, and CDSDSS tests

Table 2. Additional features provided in the three databases.

Features		Ciancimino	Facciorusso	Gaudiosi
UCSC	Geological description	x	x	x
z	Depth of soil specimen (m)	x	x	x
$σ_{c}^{'}$	Confining pressure (kPa)	x	x	x
$ρ$	Density (t/m³)	x	x	x
PI	Plasticity index (%)	x	x	x
LL	Liquid limit (%)	x	x	x
w	Water content (%)	x	x
$e_{0}$	Void ratio (-)	x	x

Table 3. Normalized Shapley values without extensive cleaning.

Database	Normalized Shapley Value for G/G_max		Normalized Shapley Value for D
	6 Inputs	All Inputs	7 Inputs	All Inputs
Ciancimino	0.342	0.342	0.301	0.306
Facciorusso	0.346	0.346	0.350	0.354
Gaudiosi	0.312	0.312	0.349	0.340

Table 4. Normalized Shapley values with extensive cleaning.

Database	Normalized Shapley Value for G/G_max		Normalized Shapley Value for D
	6 Inputs	All Inputs	7 Inputs	All Inputs
Ciancimino	0.341	0.341	0.291	0.302
Facciorusso	0.340	0.339	0.383	0.375
Gaudiosi	0.319	0.320	0.326	0.323

Table 5. Normalized Shapley values with extensive cleaning and Gaussian noise.

Database	Noise with $\sigma_{noise} = 0.1$ on
	Ciancimino	Facciorusso	Gaudiosi
Ciancimino	0.334	0.341	0.344
Facciorusso	0.342	0.336	0.345
Gaudiosi	0.324	0.323	0.311

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
