Enhanced Pedotransfer Functions Through Optuna-Optimized Extreme Gradient Boosting: Application to Soil Water Retention Modeling

Monavvar Sabegh, Sanaz; Zarehaghi, Davoud; Samadianfard, Saeed; Sattari, Mohammad Taghi; Ahmad, Sajjad

doi:10.3390/earth7030094

by Sanaz Monavvar Sabegh ¹, Davoud Zarehaghi ¹, Saeed Samadianfard ²,³,⁴, Mohammad Taghi Sattari ²,³,⁵ and Sajjad Ahmad ⁶,*

¹ Department of Soil Science and Engineering, Faculty of Agriculture, University of Tabriz, Tabriz 5166616471, Iran

² Department of Water Engineering, Faculty of Agriculture, University of Tabriz, Tabriz 5166616471, Iran

³ Water Sciences and Hydroinformatics Research Center, Khazar University, Mahsati Str. 41, Baku 1096, Azerbaijan

⁴ Department of Environmental Engineering, Izmir Institute of Technology, Urla, Izmir 35430, Türkiye

⁵ Department of Agricultural Engineering, Ankara University, Ankara 06100, Türkiye

⁶ Department of Civil and Environmental Engineering and Construction, University of Nevada, Las Vegas, NV 89154, USA

* Author to whom correspondence should be addressed.

Earth 2026, 7(3), 94; https://doi.org/10.3390/earth7030094

Submission received: 25 April 2026 / Revised: 28 May 2026 / Accepted: 30 May 2026 / Published: 2 June 2026

Abstract

Soil water retention curves (SWRCs) are fundamental inputs for simulating vadose-zone processes, yet their direct measurement is labor-intensive and often impractical across large spatial domains. Pedotransfer functions (PTFs), therefore, provide an essential alternative for estimating SWRCs from readily measured soil properties. This study developed machine learning-based PTFs to estimate SWRCs using the UNSODA 2.0 database. An extreme gradient boosting (XGB) model was implemented and optimized using two Bayesian hyperparameter tuning frameworks, Hyperopt and Optuna, across eleven input scenarios incorporating combinations of textural, structural, and compositional soil attributes. Model performance was assessed using RMSE, R², and Kling–Gupta efficiency (KGE). To prevent data leakage from the hierarchical structure of the UNSODA 2.0 database, a nested grouped cross-validation framework was employed, ensuring an unbiased assessment of model generalization performance across independent soil samples. The Optuna-tuned XGB model trained on the full feature set achieved the highest accuracy, with a test RMSE of 0.0183, R² of 0.9815, and KGE of 0.9825, outperforming both the baseline and Hyperopt-optimized models. Feature importance and SHAP analyses indicated that soil texture dominated the estimations, while porosity, bulk density, and organic matter provided complementary improvements and particle density contributed marginally. These findings demonstrate that advanced hyperparameter optimization enhances the accuracy and interpretability of XGB-based PTFs, offering a robust framework for improved estimation of SWRCs in hydrological and soil-management applications.

Keywords: soil water retention curve; pedotransfer functions; extreme gradient boosting; machine learning; UNSODA database

1. Introduction

The vadose zone functions as a central regulator of global water, energy, and solute exchange, controlling infiltration, evaporation, groundwater recharge, plant water uptake, and contaminant transport [1,2,3]. These processes are fundamentally governed by soil hydraulic properties that determine the movement and retention of water under unsaturated conditions. Among them, the soil water retention curve (SWRC), which describes the relationship between volumetric water content (θ) and soil water matric potential or pressure head (h), constitutes a cornerstone of vadose-zone hydrology [2,4,5]. The SWRC not only characterizes moisture storage but also underpins predictive models of water flow and solute dynamics in soils.

Accurate representation of the SWRC is indispensable for simulations based on the Richards equation, which forms the theoretical backbone of most vadose-zone models [6]. Because unsaturated hydraulic conductivity is typically derived from the SWRC through nonlinear constitutive relationships, small biases in θ(h) estimation can propagate nonlinearly, and often exponentially, into conductivity predictions [4,7]. These propagated uncertainties directly affect simulated infiltration rates, drainage behavior, solute transport, and the timing and magnitude of vadose-zone fluxes [6,8]. Consequently, uncertainty in the SWRC is widely recognized as a dominant source of modeling error at plot, catchment, and regional scales [9,10]. Improving the reliability and physical realism of SWRC estimation, therefore, remains a critical challenge in hydrological science.

Direct measurement of the SWRC is labor-intensive, time-consuming, and costly, particularly when extensive spatial coverage is required. Conventional laboratory methods require undisturbed soil cores and specialized equipment, yet often fail to capture field-scale heterogeneity and dynamic structural evolution [9,11]. As a result, pedotransfer functions (PTFs) have emerged as practical tools for estimating hydraulic properties from more readily available soil attributes such as texture, bulk density (ρb), and organic matter (OM) content [12,13]. By linking basic soil properties to hydraulic parameters, PTFs allow soil hydraulic properties to be incorporated into hydrological and land-surface models without exhaustive measurements.

PTF development has progressed from empirical regression toward increasingly data-driven techniques. Early PTFs related easily measured soil properties either to water retention directly or to the parameters of analytical models such as the van Genuchten or Brooks–Corey functions [4,6]. Although computationally efficient and convenient in practice, parametric PTFs have limited predictive power because they assume a fixed functional form and because their input variables represent soil structure poorly [4,8].

Many natural soils have complex structures, with complex pore-size distributions and heterogeneous structural patterns arising from aggregation, root-channel formation, biological activity, and other factors. High-resolution image analysis shows that pore connectivity and three-dimensional structure play an important role in hydraulic behavior [5,14,15], yet these features are poorly represented by simple texture-based predictors. Classical parametric PTFs therefore tend to show systematic errors when extrapolated or applied to structurally complex soils [8,16].

A major limitation of traditional PTFs is their poor ability to capture the dynamics of the soil structural state. The vadose zone is not static: agricultural management such as tillage, compaction by machinery, grazing pressure, shrink–swell cycles, and other processes continually modify ρb, porosity (n), and pore connectivity, and thereby change the water retention and unsaturated hydraulic conductivity functions [14,17,18,19]. Several widely used PTFs ignore structural indicators or do not adjust the shape of the SWRC as ρb and n vary. Tian et al. [7] showed that many PTFs produced almost identical retention functions regardless of compaction level, even though experiments demonstrated significant SWRC flattening with increasing ρb. This insensitivity to dynamic structural change limits the usefulness of classical PTFs in agricultural applications and makes them less reliable for simulating water dynamics.

Recently, machine learning (ML) techniques have become increasingly popular for constructing PTFs because they can detect nonlinear dependencies without a predefined mathematical form. Approaches such as artificial neural networks (ANN), support vector machines (SVM), random forests (RF), and extreme gradient boosting (XGB) have shown higher predictive ability than traditional regression techniques [9,13]. Large soil hydraulic databases such as UNSODA 2.0 allow ML-based PTFs to be trained and evaluated. For instance, Rastgou et al. [20] achieved high predictability in SWRC estimation using optimized deep neural networks, whereas Pham et al. [9] reported that XGB modeled complicated retention behavior with considerable success.

Recent work extends beyond single-model approaches. Ensemble learning, hybrid metaheuristic optimization, and geographically informed machine learning models have been shown to provide more robust results [16]. For instance, Sun et al. [21] used stacked generalization to improve SWRC prediction accuracy. Taherdangkoo et al. [22] combined PSO–GA optimization with an XGBoost model to predict the water retention of compacted clay soils over an extremely wide suction range. Moreover, Niu et al. [23] accounted for geospatial heterogeneity when using machine learning (ML) algorithms to improve regional mapping of hydraulic properties. These developments indicate a clear trend toward flexible and geographically adaptive solutions.

However, two issues still hamper the wide adoption of ML-based PTFs in hydrology. The first is high sensitivity to hyperparameter configuration: most existing studies rely on manual tuning or grid search, which may yield suboptimal configurations [12,13]. Comparisons of hyperparameter optimization frameworks have shown large differences in search speed and solution quality [18]; nevertheless, advanced methods such as Bayesian optimization, available in the Optuna framework, are seldom used for SWRC models. The second issue is interpretability. Understanding how specific soil properties influence θ(h) predictions is necessary to verify the physical consistency of the resulting models. Emerging explainable ML techniques, including permutation importance and Shapley Additive Explanations (SHAP), provide quantitative insight into feature contributions, but systematic application of these tools across multiple input configurations remains limited.

Against this background, this study proposes an accurate yet interpretable XGB-based methodology for estimating SWRCs using the UNSODA 2.0 dataset. In contrast to traditional parametric methods such as the van Genuchten–Mualem model, which assume a specific continuous function, the present methodology predicts θ directly at designated matric potential points. The technique may be termed high-density pointwise estimation of SWRCs.

Three model configurations are compared: (i) a baseline XGB model with default settings, (ii) a Hyperopt-tuned model employing Bayesian optimization, and (iii) an Optuna-optimized model using an efficient tree-structured search strategy. Eleven input scenarios are evaluated, ranging from texture-only predictors to extended feature sets incorporating structural and compositional indicators such as ρb, n, OM, and particle density (ρp). Model performance is assessed using root mean square error (RMSE), coefficient of determination (R²), and Kling–Gupta efficiency (KGE), providing complementary perspectives on predictive accuracy and hydrological reliability.

To enhance physical transparency, permutation importance and SHAP analyses are conducted to quantify the contribution of individual soil properties across suction levels and input scenarios. By integrating structural soil indicators, advanced hyperparameter optimization, and explainable ML within a unified framework, this study aims to advance the accuracy, robustness, and interpretability of ML-based PTFs. In doing so, it addresses key methodological gaps in current SWRC modeling and provides a scalable pathway for improving vadose-zone simulations in both research and applied hydrological contexts.

2. Materials and Methods

2.1. UNSODA 2.0

All pedotransfer models were developed and evaluated using version 2.0 of the UNSODA database, a publicly available global compilation of measured soil hydraulic properties and associated pedological information. UNSODA 2.0 contains laboratory-measured SWRCs, hydraulic conductivity, and diffusivity data, together with supporting particle-size distribution and basic soil descriptors for 790 soil samples collected worldwide. Each sample is indexed by a unique soil identifier linking tables describing texture, ρb, ρp, n, OM content, and hydraulic measurements [17].

In this study, the drying branch of the SWRC was extracted for all soils with multi-point measurements. To avoid geographic or textural bias, all available samples meeting this criterion were included, yielding a diverse dataset spanning nine USDA textural classes. The resulting distribution across the soil texture triangle is shown in Figure 1.

Records with missing values in any variable required by a given scenario were dropped, so sample sizes differ among scenarios (see Table 1). For every scenario, the data were divided using the same 80/20 train–test split, generated with a fixed random seed so that the partition was consistent. The same test set was used for all scenarios to facilitate direct comparisons of model performance, and no test records were involved in the hyperparameter tuning process.
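The filtering and splitting steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the field names (`sand`, `bd`, and so on), the toy records, and the seed value 42 (borrowed from the `random_state` reported in Section 2.3) are assumptions.

```python
import random

def drop_incomplete(records, required):
    """Keep only records with a non-missing value for every required variable."""
    return [r for r in records if all(r.get(v) is not None for v in required)]

def split_80_20(records, seed=42):
    """Shuffle with a fixed seed and return (train, test) in an 80/20 ratio."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    cut = int(round(0.8 * len(idx)))
    train = [records[i] for i in idx[:cut]]
    test = [records[i] for i in idx[cut:]]
    return train, test

# Toy records; field names and values are hypothetical.
records = [
    {"h": 10.0 * (i + 1), "sand": 60.0, "silt": 30.0, "clay": 10.0,
     "bd": None if i % 5 == 0 else 1.45, "theta": 0.40 - 0.02 * i}
    for i in range(10)
]

texture_only = ["h", "sand", "silt", "clay"]   # a texture-only scenario
with_bd = texture_only + ["bd"]                # a scenario that also needs bulk density

scenario_a = drop_incomplete(records, texture_only)
scenario_b = drop_incomplete(records, with_bd)
train_a, test_a = split_80_20(scenario_a)
train_b, test_b = split_80_20(scenario_b)
```

Because records lacking bulk density are removed only when a scenario requires it, the two scenarios end up with different sample sizes, which mirrors the per-scenario counts reported in Table 1.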

2.2. Input Variables and Feature Scenarios

The objective of this study was to identify a sufficient set of input variables that enables high-fidelity SWRC estimation while preserving physical interpretability and practical applicability in hydrological modeling. Accordingly, eleven input scenarios were designed to incorporate textural, structural, and compositional soil variables commonly available from soil surveys and experimental studies.

All scenarios included h, which governs water retention, together with texture fractions (Fsand, Fsilt, Fclay) that control pore-size distribution. Structural and compositional variables, ρb, n, OM, and ρp, were added incrementally to assess their marginal contribution to estimation accuracy and to determine which combinations yield the greatest improvement relative to data requirements.

Eleven input scenarios were defined to evaluate the predictive value of progressively expanded soil-variable sets. Scenario 1 represents the texture-only baseline (Fsand, Fsilt, Fclay); subsequent scenarios incrementally incorporate structural and compositional variables, culminating in the full-feature configuration in Scenario 11 (Table 2). This structured design enables comparison of model performance across scenarios and supports identification of the smallest set of input variables capable of delivering accurate SWRC estimates.

Table 3 summarizes the statistical properties of the dataset, establishing the range and variability of the input space on which the eleven scenarios are based. These values help explain scenario performance: the wide ranges and large coefficients of variation (CV) of texture and OM provide contrasting data that help the XGB algorithm learn the mapping between θ and h. In contrast, the low variability of ρp (CV = 0.03) indicates that this variable carries little information, which contributes to its low predictive importance relative to the other features. Table 3 thus describes the variability of the physical variables, whereas the scenario sample sizes (Table 1) and selected input variables (Table 2) determine how this variability is distributed among scenarios.

2.3. Modeling Framework

The modeling objective was to estimate θ as a function of soil properties, thereby reconstructing the θ–h relationship from data. Each measured θ–h pair, together with its associated soil descriptors, was treated as an independent training sample. This formulation avoids reliance on predefined parametric retention models and allows the learning algorithm to infer nonlinear relationships from measurements.
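The pointwise formulation amounts to expanding each soil sample into one row per measured θ–h pair, with the soil descriptors repeated on every row. The sketch below illustrates this under assumed names: the record layout (`code`, `curve`) and all values are hypothetical.

```python
def to_rows(soil):
    """Expand one soil sample into one training row per measured (h, theta) pair."""
    descriptors = {k: v for k, v in soil.items() if k not in ("code", "curve")}
    return [
        {"code": soil["code"], "h": h, **descriptors, "theta": theta}
        for h, theta in soil["curve"]
    ]

# Hypothetical soil sample with three points on its drying curve.
soil = {
    "code": "S001", "sand": 62.0, "silt": 25.0, "clay": 13.0, "bd": 1.48,
    "curve": [(10.0, 0.38), (100.0, 0.27), (1000.0, 0.15)],
}
rows = to_rows(soil)
```

Keeping the soil code on every row is a design choice of this sketch; it makes it possible to group rows by sample later, for example when building grouped cross-validation folds.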

XGB was selected due to its strong performance on tabular data, ability to capture nonlinear interactions, and built-in regularization. All models were implemented in Python using the XGBoost library (version 1.7.6). The regression objective was set to reg:squarederror, and the histogram-based tree method (tree_method = “hist”) was used to improve computational efficiency. A fixed random seed (random_state = 42) ensured reproducibility.

For each scenario, three model configurations were evaluated:

Baseline XGB, trained using default hyperparameters.
Hyperopt-optimized XGB, tuned via Bayesian optimization.
Optuna-optimized XGB, tuned using an alternative Bayesian search strategy.

Baseline models were trained with 600 boosting rounds, selected based on preliminary experiments that balanced accuracy and training time. No early stopping was applied to ensure consistent comparison across configurations.
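For orientation, the baseline settings stated in the text map onto the scikit-learn style interface of XGBoost roughly as follows. This is a sketch of one plausible configuration, not the authors' script; the helper name `build_baseline` is hypothetical, and all parameters not listed are left at the library defaults.

```python
# Settings taken from the text: squared-error objective, histogram tree method,
# fixed seed, and 600 boosting rounds.
BASELINE_PARAMS = {
    "objective": "reg:squarederror",
    "tree_method": "hist",
    "random_state": 42,
    "n_estimators": 600,
}

def build_baseline():
    """Return an untrained XGBRegressor configured with BASELINE_PARAMS."""
    from xgboost import XGBRegressor  # imported lazily; requires xgboost to be installed
    return XGBRegressor(**BASELINE_PARAMS)
```

A model built this way would then be fitted with `model.fit(X_train, y_train)` and evaluated with `model.predict(X_test)`.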

The XGB models were trained to predict θ at the predetermined matric potential values across the full suction range covered by the UNSODA database. Predictions were produced individually for each suction value and were not subsequently fitted to a parametric curve. The reconstructed SWRC of each soil sample therefore consists of a set of model-predicted points rather than a single continuous hydraulic function, although the predicted points can be connected to visualize the retention curve.

2.4. Hyperparameter Optimization

Hyperparameter tuning was conducted using Hyperopt and Optuna to evaluate the impact of advanced Bayesian optimization frameworks on XGB performance. Both approaches employ tree-structured Parzen estimators but differ in sampling strategy and internal optimization mechanics.

The following hyperparameters were optimized in both frameworks: n_estimators (100–1000), max_depth (3–12), learning rate η (0.005–0.3, log-uniform), subsample (0.6–1.0), colsample_bytree (0.6–1.0), reg_alpha (0–10), and reg_lambda (0.1–10). These bounds reflect commonly recommended ranges for regression tasks and encompass values reported in soil-physics literature.

Each optimization was conducted for 50 trials per scenario. For Hyperopt, the TPE algorithm (tpe.suggest) was used with a fixed random seed (42). For Optuna, a separate study was created for each scenario with direction = “minimize”. No pruning or early stopping was applied in either framework. The optimization objective was the minimization of RMSE on the fixed test set. After optimization, the best hyperparameter configuration was used to retrain the model on the training set. The resulting configurations are summarized in Table 4.
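The search space listed above can be written against Optuna's trial interface as in the following sketch. The function names `suggest_params` and `run_study`, the user-supplied `score_fn` callable, and the sampler seed are assumptions of this illustration; the text reports a seed only for Hyperopt. The learning rate η is passed under XGBoost's parameter name `learning_rate`.

```python
def suggest_params(trial):
    """Sample one XGB configuration from the ranges given in the text."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        # log=True samples the learning rate log-uniformly
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 10.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.1, 10.0),
    }

def run_study(score_fn, n_trials=50, seed=42):
    """Minimize score_fn(params) -> RMSE using Optuna's TPE sampler."""
    import optuna  # imported lazily; requires optuna to be installed
    sampler = optuna.samplers.TPESampler(seed=seed)  # seed value is an assumption
    study = optuna.create_study(direction="minimize", sampler=sampler)
    study.optimize(lambda trial: score_fn(suggest_params(trial)), n_trials=n_trials)
    return study.best_params, study.best_value
```

Here `score_fn` is expected to train a model with the given parameters and return the RMSE on whichever partition is used for tuning; no pruner is configured, in line with the statement that pruning was not applied.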

2.5. Model Evaluation

Model performance was evaluated using the RMSE, mean absolute error (MAE), R², Willmott’s index of agreement (WI), and KGE. These complementary metrics collectively quantify accuracy, bias, variability, and overall agreement between measured and estimated θ. RMSE and MAE are expressed in cm³ cm⁻³, whereas R², WI, and KGE are dimensionless. Metric definitions are provided in Equations (1)–(5).

All metrics were computed on the test dataset to ensure unbiased comparison across input scenarios and optimization strategies.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{\theta}_{i} - \theta_{i}\right)^{2}} \tag{1}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{\theta}_{i} - \theta_{i} \right| \tag{2}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} \left(\theta_{i} - \hat{\theta}_{i}\right)^{2}}{\sum_{i=1}^{n} \left(\theta_{i} - \overline{\theta}\right)^{2}} \tag{3}$$

$$\mathrm{WI} = 1 - \frac{\sum_{i=1}^{n} \left(\hat{\theta}_{i} - \theta_{i}\right)^{2}}{\sum_{i=1}^{n} \left( \left| \hat{\theta}_{i} - \overline{\theta} \right| + \left| \theta_{i} - \overline{\theta} \right| \right)^{2}} \tag{4}$$

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^{2} + (\alpha - 1)^{2} + (\beta - 1)^{2}} \tag{5}$$

where $\overline{\theta}$ is the mean of the measured θ; $r$ is the Pearson correlation coefficient between $\hat{\theta}$ and $\theta$; $\alpha = \sigma_{\hat{\theta}} / \sigma_{\theta}$ is the ratio of estimated to measured standard deviations; $\beta = \mu_{\hat{\theta}} / \mu_{\theta}$ is the ratio of estimated to measured means; $\hat{\theta}_{i}$ and $\theta_{i}$ denote the estimated and measured volumetric water contents for the i-th sample, respectively; and n is the total number of test samples.
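Equations (1)–(5) translate directly into code. The following standard-library sketch is one possible implementation; the function names and the toy data are illustrative, and population standard deviations are assumed in α and r (the choice cancels in both ratios as long as it is applied consistently).

```python
import math

def _mean(x):
    return sum(x) / len(x)

def _std(x):
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def rmse(obs, est):
    return math.sqrt(_mean([(e - o) ** 2 for o, e in zip(obs, est)]))

def mae(obs, est):
    return _mean([abs(e - o) for o, e in zip(obs, est)])

def r2(obs, est):
    m = _mean(obs)
    ss_res = sum((o - e) ** 2 for o, e in zip(obs, est))
    ss_tot = sum((o - m) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def willmott(obs, est):
    m = _mean(obs)
    num = sum((e - o) ** 2 for o, e in zip(obs, est))
    den = sum((abs(e - m) + abs(o - m)) ** 2 for o, e in zip(obs, est))
    return 1.0 - num / den

def kge(obs, est):
    mo, me = _mean(obs), _mean(est)
    so, se = _std(obs), _std(est)
    cov = _mean([(o - mo) * (e - me) for o, e in zip(obs, est)])
    r = cov / (so * se)   # Pearson correlation
    alpha = se / so       # variability ratio
    beta = me / mo        # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Toy example: estimates carry a constant bias of +0.01 cm³ cm⁻³.
obs = [0.10, 0.20, 0.30, 0.40]
est = [o + 0.01 for o in obs]
```

With a constant bias, r and α remain at 1 and only β departs from unity, so KGE isolates the bias term while RMSE and MAE both equal the offset.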

2.6. Model Validation Using Nested Grouped Cross-Validation (GCV)

Because the UNSODA 2.0 database contains multiple θ–h observations from the same soil sample, a conventional random train–test split violates the assumption that samples are independent and can therefore overestimate model performance. In the data compiled for this analysis, there were 1955 θ–h observations from 175 unique soil sample codes, so each soil sample contributed multiple observations. In the conventional random split, 170 soil sample codes were present in both the training and testing partitions, which introduced information leakage between them.
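The kind of overlap described here can be checked by counting the group labels that land on both sides of a random split. The sketch below uses synthetic soil codes and an assumed seed; it does not reproduce the counts reported for UNSODA.

```python
import random

def shared_groups(groups, test_fraction=0.2, seed=42):
    """Randomly split row indices and return the groups found on both sides."""
    idx = list(range(len(groups)))
    random.Random(seed).shuffle(idx)
    cut = int(round((1.0 - test_fraction) * len(idx)))
    train_groups = {groups[i] for i in idx[:cut]}
    test_groups = {groups[i] for i in idx[cut:]}
    return train_groups & test_groups

# Toy data: 20 hypothetical soil codes with 10 observations each.
groups = [f"S{g:03d}" for g in range(20) for _ in range(10)]
overlap = shared_groups(groups)
print(f"{len(overlap)} of {len(set(groups))} soil codes appear in both partitions")
```

Any non-empty overlap means that the test partition contains points from retention curves the model has already partly seen during training.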

To address the potential bias arising from the hierarchical structure of the UNSODA 2.0 database (several observations per soil sample) and to obtain an unbiased estimate of model performance, a nested grouped cross-validation (GCV) procedure was adopted. The soil sample code was used as the grouping variable so that all observations from a given sample remained in the same fold. In the outer loop, out-of-sample generalization was estimated using 5-fold GroupKFold cross-validation. In the inner loop, hyperparameters were tuned within each outer fold using 3-fold GroupKFold. The outer validation partition was kept strictly independent throughout tuning, so hyperparameters were selected solely by minimizing the mean RMSE across the inner folds.
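The nested procedure can be sketched without external libraries as follows. This is a simplified stand-in: groups are dealt to folds round-robin, which does not reproduce the fold-size balancing of scikit-learn's `GroupKFold`, and `fit_score`, `candidates`, and the toy data are hypothetical.

```python
def group_kfold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    unique = sorted(set(groups))
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}  # round-robin assignment
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, test

def nested_grouped_cv(groups, candidates, fit_score, outer=5, inner=3):
    """Return one outer-fold score per outer split.

    fit_score(params, train_idx, test_idx) -> RMSE is a user-supplied callable.
    """
    outer_scores = []
    for tr, te in group_kfold(groups, outer):
        tr_groups = [groups[i] for i in tr]

        def inner_mean(params):
            # Inner folds are built from the outer training rows only.
            scores = [
                fit_score(params, [tr[i] for i in itr], [tr[i] for i in ite])
                for itr, ite in group_kfold(tr_groups, inner)
            ]
            return sum(scores) / len(scores)

        best = min(candidates, key=inner_mean)       # tuned on inner folds only
        outer_scores.append(fit_score(best, tr, te))  # scored once on the held-out fold
    return outer_scores

# Toy setup: 15 hypothetical soil codes with 4 observations each.
groups = [f"S{g:03d}" for g in range(15) for _ in range(4)]
```

The outer test indices never enter `inner_mean`, which is the property that keeps the outer-fold scores independent of the tuning step.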

2.7. Model Interpretability

To assess physical consistency and improve transparency, permutation importance and SHAP analyses were applied to the trained models. Permutation importance quantifies the reduction in R² resulting from random shuffling of each input variable (Equation (6)), thereby identifying variables essential for estimation performance. Each permutation was repeated ten times and averaged to reduce sampling variability.

$$\mathrm{PI}_{j} = \mathrm{Perf}(\hat{f}; D) - \mathrm{Perf}(\hat{f}; D_{\pi_{j}}) \tag{6}$$

where $\hat{f}$ denotes the trained model, $D$ is the original test dataset, and $D_{\pi_{j}}$ represents the same dataset with the j-th input variable randomly permuted. $\mathrm{Perf}(\cdot)$ denotes the performance metric.
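Equation (6), with R² as the performance metric and ten repeats per variable, can be sketched as below. The toy model, the data, and the seed are assumptions chosen so that the expected behavior is easy to see: the model ignores the second column, so its importance is zero.

```python
import random

def r2(obs, est):
    m = sum(obs) / len(obs)
    ss_res = sum((o - e) ** 2 for o, e in zip(obs, est))
    ss_tot = sum((o - m) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=10, seed=42):
    """Mean drop in R² when each column of X is shuffled, averaged over repeats."""
    rng = random.Random(seed)
    base = r2(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            # Rebuild the dataset with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - r2(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model that depends on column 0 only.
def toy_predict(row):
    return 0.45 - 0.03 * row[0]

X = [[float(i), float(i % 3)] for i in range(12)]
y = [toy_predict(row) for row in X]
importance = permutation_importance(toy_predict, X, y)
```

Shuffling a column breaks its association with the target while preserving its marginal distribution, so a large drop in R² indicates that the model relies on that variable.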

SHAP values were computed using TreeSHAP, which provides exact Shapley values for tree-based ensembles. Importance rankings were derived from mean absolute SHAP values, while explanations were visualized using waterfall plots for representative samples. These analyses clarify how texture, structure, and compositional variables influence θ predictions across the range of h.
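To make the quantities behind a waterfall plot concrete, the sketch below computes exact Shapley values by brute-force enumeration for a two-feature toy model. It is not TreeSHAP: absent features are simply replaced by a single reference value, which is a simplification of how SHAP handles missing features, and the model, feature names, and numbers are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, with absent features set to baseline."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                total += weight * (value(set(s) | {j}) - value(set(s)))
        phi.append(total)
    return phi

# Hypothetical model of theta as a function of clay and sand percentages.
def toy_model(z):
    clay, sand = z
    return 0.20 + 0.004 * clay - 0.001 * sand + 0.00002 * clay * sand

x = [30.0, 40.0]          # sample being explained
baseline = [20.0, 50.0]   # reference point
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction at `x` and the prediction at the reference point, which is the additive decomposition that a waterfall plot displays step by step.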

Because some predictors are highly collinear (for example, r = −0.94 between n and ρb), SHAP values are interpreted as marginal effects within the trained model rather than as the independent influence of each predictor. The attribution analysis is therefore based mainly on physically meaningful groups of predictors.

The overall workflow of data preprocessing, model training, optimization, evaluation, and interpretability analysis is summarized in Figure 2.

3. Results

Pearson correlation coefficients (r) between input variables and measured θ are illustrated in Figure 3. θ exhibited positive correlations with Fclay (r ≈ 0.42), n (r ≈ 0.38), and OM (r ≈ 0.31), reflecting increased water retention associated with finer textures, greater pore volume, and enhanced aggregation. In contrast, θ was negatively correlated with ρb (r ≈ −0.38), Fsand (r ≈ −0.37), and ρp (r ≈ −0.20), consistent with reduced storage capacity in coarser or more compacted soils.

Strong collinearity was evident among texture fractions, particularly between Fsand and Fsilt (r ≈ −0.88) and Fsand and Fclay (r ≈ −0.67). Structural variables also showed strong dependence: n and ρb were highly negatively correlated (r ≈ −0.94), indicating that they conveyed overlapping but not identical information regarding pore space. OM was moderately correlated with n (r ≈ 0.56) and negatively correlated with ρb (r ≈ −0.64). These relationships highlight the importance of regularization and feature selection in multivariate modeling.

The estimation performance of baseline, Hyperopt-optimized, and Optuna-optimized XGB models is summarized for the train and test datasets in Table 5 and Table 6, respectively. Across all scenarios, baseline XGB models provided reasonable accuracy, with a mean test RMSE of 0.0455 and a mean R² of 0.8837. The best baseline performance was obtained for Scenario 11, which yielded a test RMSE of 0.0356, R² = 0.9299, and KGE = 0.9267.

Bayesian hyperparameter optimization improved model performance. Averaged across scenarios, Hyperopt-XGB reduced the mean test RMSE to 0.0243 and increased the mean R² to 0.9658, while Optuna-XGB further reduced the mean RMSE to 0.0235 and increased R² to 0.9679. Relative to the baseline, these improvements correspond to average RMSE reductions of 46.9% (Hyperopt) and 48.8% (Optuna), with concurrent increases in R² of 0.082 and 0.084, respectively.

Train–test comparisons indicate that the tuned models exhibited slightly larger performance gaps than the baseline, reflecting increased model flexibility. However, the absolute differences remained small (mean RMSE gaps of −0.0142 for Hyperopt and −0.0144 for Optuna; mean R² gaps ≈ 0.026), indicating good generalization and no evidence of severe overfitting.

Baseline model performance varied systematically across input scenarios (Table 6). When only texture fractions were used (Scenario 1), the test RMSE was 0.0612, and R² was 0.7937. Adding ρb (Scenario 2) produced modest improvement (RMSE = 0.0492; R² = 0.8666), whereas adding n (Scenario 3) led to a more pronounced gain (RMSE = 0.0446; R² = 0.8902). Scenarios including only OM or ρp provided limited improvement relative to texture-only models.

Hyperparameter tuning reduced performance disparities across scenarios while preserving consistent trends. For Scenario 1, Hyperopt-XGB and Optuna-XGB reduced RMSE to 0.0400 and 0.0395, respectively, corresponding to error reductions exceeding 34%. In Scenario 3, tuned models achieved RMSE values near 0.021, representing reductions of more than 50% relative to the baseline. Scenarios incorporating n (Scenarios 3, 6, 9–11) yielded the highest accuracy, whereas scenarios relying on ρp alone remained weaker.

The most comprehensive scenario (Scenario 11) produced the best overall performance. Optuna-XGB achieved a test RMSE of 0.0183, R² = 0.9815, WI = 0.9953, and KGE = 0.9825. Relative to the baseline in the same scenario, this represents a 48.6% reduction in RMSE and an increase in R² of 0.0516. Compared with the simplest baseline configuration (Scenario 1), RMSE was reduced by 70.1%.
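The relative improvements quoted for Scenario 11 follow directly from the reported RMSE and R² values, as the short check below shows; the helper name `pct_reduction` is illustrative.

```python
def pct_reduction(reference, value):
    """Relative reduction of value with respect to reference, in percent."""
    return 100.0 * (reference - value) / reference

rmse_s1_baseline = 0.0612    # Scenario 1, baseline XGB (test)
rmse_s11_baseline = 0.0356   # Scenario 11, baseline XGB (test)
rmse_s11_optuna = 0.0183     # Scenario 11, Optuna-XGB (test)

print(round(pct_reduction(rmse_s11_baseline, rmse_s11_optuna), 1))  # 48.6
print(round(pct_reduction(rmse_s1_baseline, rmse_s11_optuna), 1))   # 70.1
print(round(0.9815 - 0.9299, 4))                                    # 0.0516
```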

To complement the quantitative evaluation, Figure 4 visually compares measured and estimated θ–h relationships for the best-performing Optuna-XGB-11 model across the test set. The estimated θ values generally overlapped the measured observations over the logarithmic h range, indicating that the model reproduced the expected decline in water content with increasing suction. This agreement was evident across contrasting textures, including the rapid drainage behavior of sand and loamy sand, the intermediate retention patterns of silty loam and sandy loam, and the more gradual water release of clay loam and clay. Minor local deviations occurred mainly near the wet and dry extremes, where measured hydraulic data are typically more variable. Overall, Figure 4 supports the numerical results by showing that the optimized model preserved physically plausible, texture-dependent SWRC behavior.

Figure 4 also shows how well the Optuna-XGB-11 model reproduced θ–h behavior in three hydrologically relevant suction ranges: low suction (h < 100 cm), representing near-saturated conditions; intermediate suction (100 ≤ h < 1000 cm), corresponding to the main drainage transition; and high suction (h ≥ 1000 cm), where retention is increasingly controlled by finer pore domains. In each range, estimated θ values generally followed the measured observations, with only local deviations near the wet and dry extremes, confirming physically plausible SWRC behavior across suction conditions as well as textures.
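The three suction ranges can be encoded as a simple classifier, which also makes it possible to report errors per range. The sketch below is illustrative only; the function names and the toy values are assumptions, and the source reports no per-range metrics.

```python
import math

def suction_range(h_cm):
    """Classify a suction value (cm) into the three ranges named in the text."""
    if h_cm < 100:
        return "low"
    if h_cm < 1000:
        return "intermediate"
    return "high"

def rmse_by_range(h, measured, estimated):
    """Return {range_name: RMSE} for the ranges that contain at least one point."""
    squared = {}
    for hi, m, e in zip(h, measured, estimated):
        squared.setdefault(suction_range(hi), []).append((e - m) ** 2)
    return {k: math.sqrt(sum(v) / len(v)) for k, v in squared.items()}

# Hypothetical values for illustration only.
h = [10, 50, 300, 800, 5000, 15000]
measured = [0.42, 0.38, 0.30, 0.24, 0.15, 0.10]
estimated = [0.41, 0.39, 0.31, 0.23, 0.15, 0.12]
per_range = rmse_by_range(h, measured, estimated)
```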

Texture-specific comparisons confirmed that the Optuna-XGB-11 model provided strong agreement between measured and estimated θ across all soil texture classes represented in the test set (Figure 5). The estimated values were closely distributed around the 1:1 agreement line, with high R² values ranging from 0.9656 for sandy loam to 0.9860 for clay, indicating stable predictive performance across contrasting textures. Loamy sand showed nearly unbiased behavior, with an intercept close to zero and a slope close to unity, while sand, silty loam, and sandy loam also exhibited low intercepts and slopes near one, demonstrating reliable prediction in coarse- and medium-textured soils. For clay loam and clay, the model retained high agreement across the broader θ range associated with finer-textured soils, although the positive intercepts and slopes slightly below unity indicate minor compression of predictions toward the central θ range. Overall, these results show that the optimized XGB model maintained robust texture-specific performance, with only limited class-dependent bias.
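The slope and intercept interpretation can be illustrated with an ordinary least-squares fit of estimated against measured θ. This sketch assumes that the estimated values are regressed on the measured ones, which the text does not state explicitly; the data are synthetic.

```python
def fit_line(measured, estimated):
    """Ordinary least-squares fit: estimated ≈ slope * measured + intercept."""
    n = len(measured)
    mx = sum(measured) / n
    my = sum(estimated) / n
    sxx = sum((x - mx) ** 2 for x in measured)
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, estimated))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic example with a slope below one and a positive intercept.
measured = [0.10, 0.20, 0.30, 0.40, 0.50]
estimated = [0.9 * m + 0.03 for m in measured]
slope, intercept = fit_line(measured, estimated)
```

In this example the fitted line crosses the 1:1 line at θ = 0.30, so drier points are overestimated and wetter points underestimated, which is the compression toward the central range that a positive intercept combined with a slope below unity implies.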

Residual distributions for the most accurate models are shown in Figure 6. The baseline model exhibits systematic bias, with negative residuals at high θ and positive residuals at intermediate values, indicating underestimation in wetter soils. Hyperopt-XGB and Optuna-XGB reduce both bias and variance, producing near-Gaussian residual distributions with improved symmetry.

Model accuracy is further summarized using a Taylor diagram (Figure 7), which jointly compares standard deviation, correlation coefficient, and centered RMSE for XGB-11, Hyperopt-XGB-11, and Optuna-XGB-11. The baseline model deviates from the reference point due to lower correlation and higher normalized RMSE. Hyperopt-XGB-11 moves closer to the reference, indicating improved correlation and reduced error. Optuna-XGB-11 lies closest to the reference point, reflecting the highest correlation and lowest centered RMSE among the evaluated models. This visualization confirms the superior overall performance of the Optuna-optimized configuration.

Compared with random splitting, GCV provided a more rigorous assessment of model generalization. The results in Table 7 and Figure 8 show that random splits produced optimistic performance estimates because observations from the same soil samples were present in both the training and testing sets.

Under GCV, the RMSE of the baseline XGB model rose from 0.0356 (random split) to 0.0557, and its R² fell from 0.9299 to 0.8205. The deterioration was more pronounced for the optimized models: the Hyperopt-XGB RMSE rose from 0.0227 to 0.0530 and the Optuna-XGB RMSE from 0.0216 to 0.0543, while R² fell from 0.9717 to 0.8374 for Hyperopt-XGB and from 0.9742 to 0.8288 for Optuna-XGB.

Even under this more conservative evaluation, both optimized models retained predictive power, with R² greater than 0.82, indicating that the models capture generalizable relationships between soil properties and water content. Among the three models, Hyperopt-XGB performed best under GCV (RMSE = 0.0530, R² = 0.8374), although the differences among models are smaller than the fold-to-fold standard deviations reported in Table 7.

SHAP waterfall plots for representative samples are shown in Figure 9. Negative contributions from Fsand reflect rapid drainage associated with coarse textures, whereas positive contributions from Fclay, n, and OM reflect enhanced storage in finer and more aggregated pore networks. Tuned models exhibit larger absolute contributions for structurally meaningful variables, indicating that optimization enhanced sensitivity to physically relevant soil properties.
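
A waterfall plot relies on the additivity of Shapley values: the contributions sum to the difference between the prediction for the sample and the baseline prediction. The sketch below computes exact Shapley values by enumerating feature coalitions for a toy three-feature function. It is an illustration only: `toy_theta`, its coefficients, and the single background point are assumptions, and SHAP tree explainers use more efficient algorithms and a background distribution rather than one reference point.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values of f at x relative to a single background point."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features in `subset` take the sample's values; the rest stay at background
        point = {k: (x[k] if k in subset else background[k]) for k in names}
        return f(point)

    phi = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

# Hypothetical toy retention function, not the fitted XGB model
def toy_theta(p):
    return 0.45 - 0.30 * p["Fsand"] + 0.20 * p["Fclay"] + 0.25 * p["n"] * p["Fclay"]

sample = {"Fsand": 0.8, "Fclay": 0.1, "n": 0.40}
background = {"Fsand": 0.5, "Fclay": 0.2, "n": 0.47}
phi = shapley_values(toy_theta, sample, background)
```

In this toy case the sandy sample receives a negative Fsand contribution, and the three contributions sum to `toy_theta(sample) - toy_theta(background)`.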

Permutation importance results (Figure 10) corroborate these patterns. Across all models, h is the most influential variable, followed by Fsand. Structural descriptors, particularly n, exhibit higher importance in tuned models than in the baseline. OM shows moderate influence, while ρp ranks lowest. Mean absolute SHAP values (Figure 11) confirm the same hierarchy and indicate reduced variance in feature effects for optimized models, suggesting improved stability. Given the near-deterministic relationship between n and ρb, their relative SHAP rankings are interpreted as reflecting a shared structural control on SWRC behavior rather than isolated variable importance.
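
Permutation importance, expressed as the decrease in R² after shuffling one variable, can be sketched in a few lines. The example below is a hedged illustration: the `predict` function is a hypothetical stand-in for a fitted model, and the feature names `log_h` and `rho_p` and the synthetic data are assumptions made for the demonstration.

```python
import random

def r2_score(y, yhat):
    my = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, rows, y, feature, n_repeats=20, seed=0):
    """Mean drop in R² when the values of one feature are shuffled across rows."""
    rng = random.Random(seed)
    base = r2_score(y, [predict(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        col = [r[feature] for r in rows]
        rng.shuffle(col)
        shuffled = [dict(r, **{feature: v}) for r, v in zip(rows, col)]
        drops.append(base - r2_score(y, [predict(r) for r in shuffled]))
    return sum(drops) / n_repeats

# Stand-in model: strong dependence on log10(h), weak on Fsand, none on rho_p
def predict(r):
    return 0.5 - 0.08 * r["log_h"] - 0.001 * r["Fsand"]

rng = random.Random(1)
rows = [{"log_h": rng.uniform(0, 4), "Fsand": rng.uniform(0, 99),
         "rho_p": rng.uniform(2.5, 2.7)} for _ in range(200)]
y = [predict(r) for r in rows]
importance = {f: permutation_importance(predict, rows, y, f)
              for f in ("log_h", "Fsand", "rho_p")}
```

A feature the model never uses, such as `rho_p` here, gets an importance of zero, which mirrors the low ranking of ρp reported above.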

4. Discussion

The present study demonstrates that Optuna-optimized XGB-based PTFs provide accurate, robust, and physically interpretable estimates of SWRCs across a wide range of soil textures and matric suctions. By integrating Bayesian hyperparameter optimization with XGB and interpretable diagnostics, this work addresses two persistent challenges in data-driven SWRC modeling: sensitivity to model configuration and limited physical transparency.

The results clearly indicate that Bayesian hyperparameter optimization enhanced XGB performance relative to the untuned baseline models. Averaged over the eleven input scenarios (Table 6), Optuna-optimized models reduced test RMSE by approximately 50% and increased R² by about 0.08. These improvements are consistent with previous ML studies showing that default or heuristically selected hyperparameters rarely yield optimal performance for complex environmental datasets [12,13]. The marginally better performance of Optuna relative to Hyperopt on the conventional test split may reflect its more efficient exploration of the hyperparameter space and adaptive trial selection, as reported in other hydrological and environmental applications.
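
The averages quoted above can be reproduced from the test metrics in Table 6. The short calculation below uses the baseline XGB and Optuna-XGB rows for Scenarios 1 to 11; the variable names are illustrative, and the averaging convention (ratio of scenario means) is an assumption, since the text does not state which one was used.

```python
from statistics import mean

# Test RMSE and R² from Table 6, Scenarios 1-11
baseline_rmse = [0.0612, 0.0492, 0.0446, 0.0476, 0.0496, 0.0443,
                 0.0434, 0.0445, 0.0380, 0.0423, 0.0356]
optuna_rmse = [0.0395, 0.0236, 0.0213, 0.0221, 0.0282, 0.0199,
               0.0219, 0.0217, 0.0203, 0.0215, 0.0183]
baseline_r2 = [0.7937, 0.8666, 0.8902, 0.8752, 0.8646, 0.8917,
               0.8963, 0.8909, 0.9204, 0.9012, 0.9299]
optuna_r2 = [0.9141, 0.9692, 0.9750, 0.9732, 0.9561, 0.9782,
             0.9737, 0.9741, 0.9773, 0.9746, 0.9815]

# Relative reduction of the mean RMSE and absolute gain in mean R²
rmse_reduction = 1.0 - mean(optuna_rmse) / mean(baseline_rmse)
r2_gain = mean(optuna_r2) - mean(baseline_r2)
```

With these values the RMSE reduction is about 48% and the R² gain about 0.084, consistent with the rounded figures in the text.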

From a hydrological perspective, such reductions in SWRC estimation error represent more than statistical improvement. Because unsaturated hydraulic conductivity is derived from the SWRC through nonlinear constitutive relationships, even modest reductions in θ(h) error can translate into disproportionately large improvements in simulated infiltration, drainage, and soil water storage [4,7]. The magnitude of RMSE reduction achieved here therefore constitutes a meaningful advance in the reliability of vadose-zone simulations under near-saturated conditions where model sensitivity is greatest [6,8].
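
The amplification of θ errors in hydraulic conductivity can be illustrated with one common constitutive model. The sketch below assumes the van Genuchten–Mualem relationship, which is not named in the text at this point, and uses hypothetical parameter values (`theta_r`, `theta_s`, `n_vg`); it is meant only to show the direction and rough size of the effect.

```python
def effective_saturation(theta, theta_r, theta_s):
    return (theta - theta_r) / (theta_s - theta_r)

def relative_conductivity(se, n_vg):
    """van Genuchten-Mualem K/Ks for effective saturation se in (0, 1]."""
    m = 1.0 - 1.0 / n_vg
    return se ** 0.5 * (1.0 - (1.0 - se ** (1.0 / m)) ** m) ** 2

# Hypothetical soil parameters (assumptions, not values from the study)
theta_r, theta_s, n_vg = 0.05, 0.45, 1.5
theta_true, theta_err = 0.30, 0.32  # a 0.02 overestimate of theta

k_true = relative_conductivity(effective_saturation(theta_true, theta_r, theta_s), n_vg)
k_err = relative_conductivity(effective_saturation(theta_err, theta_r, theta_s), n_vg)

rel_theta = (theta_err - theta_true) / theta_true  # relative error in theta
rel_k = (k_err - k_true) / k_true                  # resulting relative error in K
```

For these assumed parameters the relative error in K is several times larger than the relative error in θ, which is the nonlinearity the paragraph refers to.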

Residual analyses further show that hyperparameter optimization not only reduced random error but also mitigated systematic bias, particularly the underestimation of θ near saturation observed in untuned models. Accurate representation of the wet end of the SWRC is critical for simulating rainfall partitioning between infiltration and runoff and for capturing surface–subsurface exchange processes [2,6]. Inadequate representation of this region can lead to persistent underestimation of infiltration capacity and overprediction of runoff during high-intensity precipitation events, limitations that are reduced by the optimized models.

The difference between random splitting and GCV demonstrates the need to account for the hierarchical structure of soil databases. Datasets such as UNSODA include multiple measurements made on the same soil sample, which therefore share common physical characteristics. When such data are split at random, measurements from one sample can fall in both the training and testing sets, which leads to misleadingly optimistic accuracy estimates. Grouping observations by soil sample code guarantees that the test data comprise only soil samples not encountered during training. Consequently, the higher RMSE and lower R² obtained under GCV are a more realistic approximation of model generalization. Even so, predictive performance remains high (GCV R² > 0.82).
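
The grouping logic can be sketched without any ML library. The example below assigns whole soil samples to folds so that no sample code appears on both sides of a split. It is a minimal illustration under stated assumptions: `grouped_kfold`, the greedy balancing rule, and the sample codes are hypothetical, and the authors' actual implementation is not described here.

```python
from collections import defaultdict

def grouped_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group is split across sides."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    # Greedy balancing: place the largest groups first into the smallest fold
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda k: (-len(members[k]), str(k))):
        min(folds, key=len).extend(members[g])
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test

# Hypothetical sample codes: each soil sample contributes several (h, theta) rows
codes = [1010] * 8 + [1011] * 6 + [1012] * 9 + [2000] * 7 + [2001] * 5 + [3050] * 10 + [3051] * 4
splits = list(grouped_kfold(codes, n_splits=3))
```

In practice a library routine such as scikit-learn's `GroupKFold` serves the same purpose; the point of the sketch is only that the split unit is the soil sample, not the individual observation.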

The evaluation of eleven input scenarios clarifies the relative roles of textural and structural soil properties in controlling SWRC predictions. Texture fractions dominate estimation accuracy, reflecting the strong influence of particle size distribution on soil pore size distribution and, consequently, on water retention behavior. However, the inclusion of structural descriptors such as n and ρb further improves model performance beyond texture-only scenarios by accounting for variations in pore volume and soil structure [2,4].

Model results for the different scenarios should be interpreted with the unequal training-set sizes in mind, which arise from scenario-dependent filtering of incomplete records (Table 1). This may slightly influence estimation accuracy, since larger samples normally yield better estimates. To limit such effects, the test set and methodology were identical for all scenarios, so differences between scenarios mainly reflect the predictive power of the additional predictors. Nevertheless, any improvement obtained with richer input sets reflects both the quantity and the quality of the available data.

The variable n exerts a strong influence by constraining the upper limit of θ and shaping the wet end of the SWRC. The marked improvement observed when n was added (Scenario 3) relative to texture-only models confirms that descriptors of pore space provide information not captured by texture alone [19,24]. Although ρb is strongly correlated with n (r ≈ −0.94), it captures complementary information related to soil packing, aggregation, and mechanical disturbance, which influence pore connectivity and macroporosity beyond total pore volume [17,18,19]. The joint inclusion of n and ρb therefore enables sensitivity to structural changes induced by compaction, tillage, or land use, a capability that many traditional PTFs lack [24].
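
The strong negative correlation between n and ρb is expected from the standard mass–volume relation n = 1 − ρb/ρp. The sketch below applies it to the mean values in Table 3; whether the porosity values in the database were measured independently or derived from this relation is not stated here, so the calculation is illustrative only.

```python
def porosity(rho_b, rho_p):
    """Total porosity (fraction) from bulk and particle density in the same units."""
    return 1.0 - rho_b / rho_p

# Mean rho_b and rho_p from Table 3 (g cm^-3)
n_mean = porosity(1.45, 2.63)

# With rho_p nearly constant, n is almost a linear, decreasing function of rho_b
n_values = [porosity(rb, 2.63) for rb in (1.2, 1.4, 1.6, 1.8)]
```

The result for the mean densities, about 0.45, is close to the mean n of 0.47 in Table 3, and the narrow range of ρp explains why n and ρb carry largely overlapping information.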

OM contributed modest improvements when combined with structural variables. This behavior aligns with the established understanding that OM enhances aggregation and microporosity, thereby increasing water retention, especially in finer-textured soils [19,20]. In contrast, ρp exhibited negligible importance. Its limited variability within mineral soils constrains its explanatory power, and scenarios including only ρp showed marginal gains over texture-only models. This finding corroborates earlier studies indicating that ρp provides little additional information for SWRC estimation in mineral soil datasets [12,16].

Negative SHAP contributions associated with high Fsand reflect rapid drainage from macropores, while positive contributions from Fclay, n, and OM capture enhanced retention in finer pores and greater total pore volume [19,20]. The strong negative correlation (r ≈ −0.94) between n and ρb suggests that the two variables carry largely the same structural information about pore space, which complicates attribution analysis. Tree-based models distribute split frequency and gain somewhat arbitrarily between highly correlated predictors, and SHAP scores estimate marginal contributions conditional on the model structure rather than a unique physical effect. The larger SHAP importance assigned to n relative to ρb therefore does not indicate that n is intrinsically more important; it simply reflects the model's choice to use n as its representation of pore space.

Crucially, collinearity does not degrade predictive performance, but it does limit the interpretability of attributions. On physical grounds, the two features should be considered together as a joint structural control on water retention. Although other techniques, such as conditional or grouped feature importance, could be applied to further separate the contributions of these two features, both were retained as separate inputs to preserve physical meaning and comparability with previous pedotransfer studies. SHAP importance is therefore interpreted in terms of structural versus textural features rather than individual variables.

Hyperparameter optimization amplified the influence of structurally meaningful variables and reduced residual variance, indicating that tuning improved not only predictive accuracy but also the clarity with which the model distinguished the roles of texture and structure. Such interpretability is essential for the acceptance of ML-based PTFs in hydrological modeling workflows, where transparency and physical plausibility are critical.

The estimation accuracy achieved in this study is competitive with that reported for both conventional and ML-based PTFs in the literature. SVM models developed by Cisty and Povazanova [11] achieved an RMSE value of 0.018 for the wetting branch of the SWRC, outperforming classical parametric formulations such as the Mualem and Kool–Parker models. The Optuna-optimized XGB models presented here achieved comparable RMSE values (0.0183–0.0236) while explicitly modeling the entire drying branch across a broader range of soil types and using an extended set of input variables.

Boosted regression trees tuned via differential evolution have been shown to reduce RMSE relative to untuned models, yet their final accuracies (RMSE ≈ 0.11–0.17; R² ≈ 0.58–0.79) remain lower than those reported here [13]. ANN-based PTFs have also demonstrated strong performance; for example, Totola et al. [19] reported RMSE values near 0.045 for Brazilian soils, while Rastgou et al. [20] achieved R² values approaching 0.98 using deep learning architectures. The present results indicate that optimized XGB models can match or exceed the accuracy of more complex ANN approaches while offering superior interpretability through SHAP-based explanations.

Recent physics-informed neural network approaches that embed the Richards equation have reported improved performance at the dry end of the SWRC [10]. Incorporating analogous physical constraints into XGB-based or hybrid ensemble frameworks represents a logical extension of the present work and may contribute to reduced bias near saturation while preserving computational efficiency.

From an applied perspective, the results provide clear guidance for selecting input variables based on data availability and application requirements. Texture-only scenarios yield reasonable estimates but are associated with higher uncertainty. Including n leads to substantial performance gains and represents an effective compromise between data requirements and predictive accuracy. The inclusion of ρb further improves model accuracy for disturbed or compacted soils, and is therefore advisable when such information is available. The full feature set delivers the highest accuracy and is most appropriate for applications requiring precise SWRC estimates, including detailed vadose-zone simulations, agronomic planning, and geotechnical assessments.

The modeling strategy adopted in this study enables estimation of the θ–h relationship without imposing predefined functional forms, thereby addressing a long-recognized source of uncertainty in hydrological modeling [9,10]. The improved representation of both wet- and dry-end retention behavior supports increased confidence in downstream simulations of infiltration, evapotranspiration, drainage, and solute transport.

Several limitations merit consideration. Treating each h–property pair as an independent sample neglects potential autocorrelation within individual SWRCs, an assumption commonly adopted in PTF development. Future investigations could evaluate hierarchical or soil-specific modeling approaches to explicitly account for within-profile dependence.

In addition, reliance on complete-case analysis reduced the effective sample size for extended feature scenarios and may have led to the underrepresentation of highly organic or strongly structured soils. Expanding available datasets and incorporating additional descriptors of soil structure would likely improve model generalizability across a wider range of soil conditions.

Finally, although the proposed models exhibit strong physical consistency, they remain purely data-driven and do not explicitly enforce physical constraints such as the monotonic decrease of θ with increasing h. Integrating optimized XGB with physics-informed constraints constitutes a well-motivated direction for future research and may contribute to reduced bias near saturation while preserving computational efficiency.
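
A simple diagnostic for the monotonicity issue is to check whether estimated θ ever increases with h along a curve. The sketch below does this and also shows a crude post-hoc correction based on a running minimum. Both functions and the example curve are hypothetical; the running minimum is only an illustration and is not the physics-informed approach discussed above. Gradient-boosting libraries such as XGBoost also offer a `monotone_constraints` training parameter, which would be another way to enforce this behavior with respect to h.

```python
def monotonicity_violations(h, theta):
    """Indices of points where theta rises as h increases (points sorted by h)."""
    order = sorted(range(len(h)), key=lambda i: h[i])
    return [order[k] for k in range(1, len(order))
            if theta[order[k]] > theta[order[k - 1]]]

def enforce_non_increasing(h, theta):
    """Post-hoc fix: replace theta by its running minimum along increasing h."""
    order = sorted(range(len(h)), key=lambda i: h[i])
    fixed = list(theta)
    running = float("inf")
    for i in order:
        running = min(running, theta[i])
        fixed[i] = running
    return fixed

# Hypothetical estimated curve with one non-physical bump at h = 100 cm
h = [1, 10, 100, 1000, 15000]
theta_hat = [0.42, 0.40, 0.41, 0.22, 0.10]
bad = monotonicity_violations(h, theta_hat)
fixed = enforce_non_increasing(h, theta_hat)
```

Here the check flags the third point, and the corrected curve is flat between h = 10 and h = 100 instead of rising.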

5. Conclusions

Applying the optimized XGB algorithm to the UNSODA 2.0 dataset shows that data-driven PTFs can avoid the restrictive functional assumptions of classical parametric approaches when modeling the nonlinear soil water retention function. Bayesian hyperparameter optimization yielded the most accurate predictions and also allowed a hierarchy of the most relevant predictors to be established. Particle size distribution was the primary control, while additional descriptors such as n and ρb proved crucial for predicting soil water retention over the entire pressure-head range. We therefore suggest that the most practical configuration for large-scale hydrological modeling is a parsimonious input set consisting of the texture fractions, n, and ρb. GCV offers a more objective evaluation of model performance. Although performance under GCV is lower than under the conventional random split, a consequence of accounting for the soil sample hierarchy, the models retain useful predictive accuracy (R² > 0.82). This highlights the importance of robust validation of ML-based PTFs.

Author Contributions

S.M.S.: Data curation; formal analysis; investigation; methodology; validation; visualization; writing—original draft. D.Z.: Data curation; formal analysis; resources; supervision; writing—original draft. S.S. and M.T.S.: Conceptualization; formal analysis; methodology; validation; visualization; supervision; writing—original draft. S.A.: Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

The authors did not receive support from any organization for the submitted work.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Generative AI tools were used solely to assist with language refinement and graphical layout of conceptual figures. All textual content, scientific workflow design, data processing, analysis, and interpretations were developed and verified exclusively by the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

θ	Volumetric water content
h	Pressure head
SWRC	Soil water retention curve
PTF	Pedotransfer function
ρb	Bulk density
ML	Machine learning
ANN	Artificial neural network
SVM	Support vector machine
RF	Random forest
XGB	Extreme gradient boosting
PSO	Particle swarm optimization
n	Porosity
OM	Organic matter
RMSE	Root mean square error
MAE	Mean absolute error
R²	Coefficient of determination
WI	Willmott’s index of agreement
KGE	Kling–Gupta efficiency
UNSODA	Unsaturated Soil Hydraulic Database
SHAP	Shapley Additive Explanations

References

1. Yang, C.; Wu, J.; Li, P.; Wang, Y.; Yang, N. Evaluation of Soil-Water Characteristic Curves for Different Textural Soils Using Fractal Analysis. Water 2023, 15, 772.
2. Javanshir, S.; Bayat, H.; Gregory, A.S. Effect of Free Swelling Index on Improving Estimation of the Soil Moisture Retention Curve by Different Methods. Catena 2020, 189, 104479.
3. Botula, Y.-D.; Nemes, A.; Mafuka, P.; Van Ranst, E.; Cornelis, W.M. Prediction of Water Retention of Soils from the Humid Tropics by the Nonparametric K-Nearest Neighbor Approach. Vadose Zone J. 2013, 12, 1–17.
4. Weber, T.K.D.; Weihermüller, L.; Nemes, A.; Bechtold, M.; Degré, A.; Diamantopoulos, E.; Fatichi, S.; Filipović, V.; Gupta, S.; Hohenbrink, T.L. Hydro-Pedotransfer Functions: A Roadmap for Future Development. Hydrol. Earth Syst. Sci. 2024, 28, 3391–3433.
5. Wen, T.; Chen, X.; Luo, Y.; Shao, L.; Niu, G. Three-Dimensional Pore Structure Characteristics of Granite Residual Soil and Their Relationship with Hydraulic Properties under Different Particle Gradation by X-Ray Computed Tomography. J. Hydrol. 2023, 618, 129230.
6. Bouma, J. Transfer Functions and Threshold Values: From Soil Characteristics to Land Qualities. In Proceedings of the Quantified Land Evaluation. Proc. ISSS/SSSA Workshop, Washington, 1987; ITC Publication: Geneva, Switzerland, 1987.
7. Tian, Z.; Chen, J.; Cai, C.; Gao, W.; Ren, T.; Heitman, J.L.; Horton, R. New Pedotransfer Functions for Soil Water Retention Curves That Better Account for Bulk Density Effects. Soil Tillage Res. 2021, 205, 104812.
8. de Castro Moreira da Silva, L.; Amorim, R.S.S.; Fernandes Filho, E.I.; Bocuti, E.D.; da Silva, D.D. Pedotransfer Functions and Machine Learning: Advancements and Challenges in Tropical Soils. Geoderma Reg. 2023, 35, e00720.
9. Pham, K.; Kim, D.; Le, C.V.; Won, J. Machine Learning-Based Pedotransfer Functions to Predict Soil Water Characteristics Curves. Transp. Geotech. 2023, 42, 101052.
10. Norouzi, S.; Pesch, C.; Arthur, E.; Norgaard, T.; Møldrup, P.; Greve, M.H.; Beucher, A.; Sadeghi, M.; Zaresourmanabad, M.; Tuller, M.; et al. Physics-Informed Neural Networks for Estimating a Continuous form of the Soil Water Retention Curve from Basic Soil Properties. Water Resour. Res. 2025, 61, e2024WR038149.
11. Cisty, M.; Povazanova, B. Evaluation of Water Retention Curves by Regression and Machine Learning Methods. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1203, 032088.
12. Nazem, M.; Kardani, N.; Moridpour, S.; Zhou, A. Prediction of Soil-Water Characteristic Curve Using Optimised Machine Learning Approaches. In Proceedings of the 10th European Conference on Numerical Methods in Geotechnical Engineering; ISSMGE: London, UK, 2022.
13. Gebauer, A.; Ellinger, M.; Brito Gomez, V.M.; Ließ, M. Development of Pedotransfer Functions for Water Retention in Tropical Mountain Soil Landscapes: Spotlight on Parameter Tuning in Machine Learning. Soil 2020, 6, 215–229.
14. Wen, T.; Luo, Y.; Tang, M.; Chen, X.; Shao, L. Effects of Representative Elementary Volume Size on Three-Dimensional Pore Characteristics for Modified Granite Residual Soil. J. Hydrol. 2024, 643, 132006.
15. Luo, Y.; Wen, T.; Lin, X.; Chen, X.; Shao, L. Quantitative Analysis of Pore-Size Influence on Granite Residual Soil Permeability Using CT Scanning. J. Hydrol. 2024, 645, 132133.
16. dos Santos Pereira, S.A.; Gitirana, G.d.F.N.; Mendes, T.A.; Gomes, R.d.A. Artificial Neural Networks for the Prediction of the Soil-Water Characteristic Curve: An Overview. Soil Tillage Res. 2025, 248, 106466.
17. Nemes, A.; Schaap, M.; Leij, F.J.; Wösten, J.H.M. UNSODA 2.0: Unsaturated Soil Hydraulic Database. Database and Program for Indirect Methods of Estimating Unsaturated Hydraulic Properties; Dataset; US Salinity Laboratory—ARS—USDA: Riverside, CA, USA, 2015.
18. Shekhar, S.; Bansode, A.; Salim, A. A Comparative Study of Hyper-Parameter Optimization Tools. In Proceedings of the 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), Brisbane, Australia, 8–10 December 2021; pp. 1–6.
19. Totola, L.B.; Bicalho, K.V.; Hisatugu, W.H. Artificial Neural Networks for Predicting Soil Water Retention Data of Various Brazilian Soils. Earth Sci. Inf. 2023, 16, 3579–3595.
20. Rastgou, M.; Jin, X.; Jiang, Q.; Liu, S.; Lou, R.; Wang, J.; Tang, R.; He, Y. Optimizing Deep Neural Networks for Estimating Soil Water Retention Curves: A Comparison of Metaheuristic and Numerical Algorithms. Vadose Zone J. 2025, 24, e70035.
21. Sun, K.; Gao, Y.; He, W.; Wang, L.; Sun, X. Prediction of Soil-Water Retention Curves in Unsaturated Soils Based on Stacked Generalization. J. Rock Mech. Geotech. Eng. 2025, 18, 2421–2436.
22. Taherdangkoo, R.; Nagel, T.; Tyurin, V.; Chen, C.; Ardejani, F.D.; Butscher, C. Prediction of the Soil–Water Retention Curve of Compacted Clays Using PSO–GA XGBoost. Artif. Intell. Geosci. 2025, 7, 100173.
23. Niu, L.; Jia, X.; Dai, X.; Gao, L.; Huang, L.; Wei, X.; Yang, X.; Shao, M. Geospatial Heterogeneity-Informed Machine Learning for Mapping Soil Hydraulic Properties across China’s Drylands. Geoderma 2025, 464, 117622.
24. Awwal, Y.A.; Maniyunda, L.M.; Jimoh, I.A. Random Forest Modelling of Water Retention Patterns in Typic Plinthustalfs Soils of Sub-Humid High-Plain Zones in Nigeria. SSRN 2024.

Figure 1. Distribution of soil samples on the USDA soil texture triangle. Classification of soil samples based on measured sand, silt, and clay fractions. Colored background polygons indicate USDA soil textural classes, and orange circles represent individual soil samples. The distribution illustrates the range and heterogeneity of soil textures represented in the dataset.

Figure 2. Study workflow and modeling framework. Schematic overview of data sources, preprocessing steps, feature-scenario construction, model training (baseline XGB, Hyperopt-XGB, and Optuna-XGB), and evaluation procedures.

Figure 3. Pearson correlation coefficients between soil properties and measured θ. Heatmap of Pearson r illustrating relationships between texture fractions, ρb, ρp, n, OM, and measured θ. Color intensity denotes correlation strength.

Figure 4. Measured and estimated SWRCs using the Optuna-XGB-11 model. Gray squares indicate measured θ, and red circles indicate estimated θ for representative soil samples from each USDA texture class. The pressure head h is displayed on a logarithmic scale to show retention behavior across the suction range.

Figure 5. Texture-specific measured versus estimated θ using the Optuna-XGB-11 model. Each panel corresponds to one USDA soil texture class in the test set, and each dot represents an individual test-set observation. The dashed diagonal indicates the 1:1 agreement line, and the reported intercept, slope, and R² are from linear regression of estimated θ against measured θ. Identical x- and y-axis limits and a 1:1 aspect ratio were used to facilitate visual comparison and avoid boundary clipping.

Figure 6. Residual diagnostics for the best-performing models: (a) XGB-11, (b) Hyperopt-XGB-11, and (c) Optuna-XGB-11 evaluated on the test dataset. In the upper panels, blue dots represent individual test samples, and the dashed diagonal line represents the 1:1 agreement line between observed and predicted θ. The marginal density curves show the distributions of observed and predicted θ. In the lower panels, blue bars show the histogram of residuals, the red curve shows the smoothed residual-density distribution, and the vertical reference lines indicate residual bias relative to zero. Residuals are defined as estimated minus measured volumetric water content ($\hat{\theta} - \theta$).

Figure 7. Taylor diagram summarizing the performance of the XGB-11, Hyperopt-XGB-11, and Optuna-XGB-11 models relative to measured θ. The radial distance represents the standard deviation, the angular position indicates the Pearson correlation coefficient, and the concentric arcs represent the centered root mean square error.

Figure 8. Influence of GCV on the estimated generalization performance of the evaluated models. Model accuracy obtained from the conventional random train–test split is compared with that from GCV, where folds were grouped by soil sample code to prevent leakage of measurements from the same soil sample across training and validation sets. Panels show (a) RMSE and (b) R². Error bars denote the standard deviation across the five GCV folds.

Figure 9. SHAP explanations of model estimations. SHAP waterfall plots illustrating the contribution of individual variables to estimated θ for representative samples. Positive values increase, and negative values decrease the prediction relative to the model baseline.

Figure 10. Permutation-based feature importance. Permutation importance expressed as the decrease in R² resulting from random shuffling of each variable. Error bars represent variability across repeated permutations.

Figure 11. SHAP feature importance. Mean absolute SHAP values summarizing the overall influence of each variable on model predictions across the test dataset.

Table 1. Numbers of complete records used for each input-feature scenario and the test set after excluding records with missing required attributes.

Scenarios	1	2	3	4	5	6	7	8	9	10	11	Test
Data points	6080	5827	2331	3204	2368	2311	2967	2348	1399	2139	1368	587

Table 2. Definition of the eleven input-feature scenarios applied in SWRC modeling. Scenario 1 represents the texture-only baseline, whereas Scenario 11 includes all available predictors. The corresponding dataset sizes are listed in Table 1.

Scenario	Input Feature Combinations
1	h	Fsand	Fsilt	Fclay
2	h	Fsand	Fsilt	Fclay	ρb
3	h	Fsand	Fsilt	Fclay	n
4	h	Fsand	Fsilt	Fclay	OM
5	h	Fsand	Fsilt	Fclay	ρp
6	h	Fsand	Fsilt	Fclay	ρb	n
7	h	Fsand	Fsilt	Fclay	ρb	OM
8	h	Fsand	Fsilt	Fclay	ρb	ρp
9	h	Fsand	Fsilt	Fclay	ρb	n	OM
10	h	Fsand	Fsilt	Fclay	ρb	n	ρp
11	h	Fsand	Fsilt	Fclay	ρb	n	OM	ρp
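
The scenario definitions above, combined with complete-case filtering, explain why the record counts in Table 1 differ between scenarios. The sketch below encodes the eleven feature sets and keeps only records with all required attributes present. The field names (`rho_b`, `rho_p`) and the example records are hypothetical and do not come from UNSODA.

```python
BASE = ["h", "Fsand", "Fsilt", "Fclay"]
# Extra predictors per scenario, following Table 2
EXTRAS = {
    1: [], 2: ["rho_b"], 3: ["n"], 4: ["OM"], 5: ["rho_p"],
    6: ["rho_b", "n"], 7: ["rho_b", "OM"], 8: ["rho_b", "rho_p"],
    9: ["rho_b", "n", "OM"], 10: ["rho_b", "n", "rho_p"],
    11: ["rho_b", "n", "OM", "rho_p"],
}

def complete_cases(records, scenario):
    """Keep records in which every feature required by the scenario is present."""
    required = BASE + EXTRAS[scenario]
    # `is not None` keeps legitimate zeros such as h = 0
    return [r for r in records if all(r.get(k) is not None for k in required)]

# Hypothetical records with different patterns of missing attributes
records = [
    {"h": 10, "Fsand": 60, "Fsilt": 25, "Fclay": 15, "rho_b": 1.45, "n": 0.45, "OM": 1.2, "rho_p": 2.65},
    {"h": 100, "Fsand": 20, "Fsilt": 50, "Fclay": 30, "rho_b": 1.30, "n": 0.51, "OM": None, "rho_p": 2.66},
    {"h": 330, "Fsand": 85, "Fsilt": 10, "Fclay": 5, "rho_b": 1.60, "n": None, "OM": 0.4, "rho_p": None},
    {"h": 15000, "Fsand": 40, "Fsilt": 40, "Fclay": 20, "rho_b": None, "n": None, "OM": None, "rho_p": None},
]
counts = {s: len(complete_cases(records, s)) for s in EXTRAS}
```

As in Table 1, the usable sample shrinks as more attributes are required, which is why scenario comparisons mix the effect of extra predictors with the effect of sample size.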

Table 3. Summary statistics of soil properties and measured θ, characterizing the distributional representativeness and variability in the input space used across all modeling scenarios.

Features	Mean	Min	Max	Sx	CV	Csx
h (cm)	3634.13	0	4,248,000	69,983.49	0	0
θ (cm³ cm⁻³)	0.29	0.01	0.84	0.14	0.48	0.1
Fsand (%)	54.14	0	99	30.32	0	−0.06
Fsilt (%)	29.45	0	87	22.79	0	0.49
Fclay (%)	16.41	0	63	13.19	0	1.03
ρb (g cm⁻³)	1.45	0.46	1.97	0.22	0.15	−1.31
ρp (g cm⁻³)	2.63	1.98	2.83	0.08	0.03	−2.17
n (fraction)	0.47	0.26	0.92	0.09	0.2	1.44
OM (%)	1.55	0.01	21.4	1.97	1.27	3.31

Min: minimum, Max: maximum, Sx: standard deviation, CV: coefficient of variation, and Csx: skewness.

Table 4. Baseline and optimized hyperparameter values of the XGB models for each input scenario.

Model	n Estimators	Max Depth	Learning Rate	Subsample	Column Sample by Tree
XGB	100	3	0.1	-	-
Hyperopt-XGB-1	199	7	0.15	0.81	0.84
Hyperopt-XGB-2	398	8	0.13	0.58	0.78
Hyperopt-XGB-3	412	6	0.13	0.87	0.90
Hyperopt-XGB-4	276	7	0.16	0.75	0.86
Hyperopt-XGB-5	199	6	0.13	1.00	0.83
Hyperopt-XGB-6	214	7	0.15	0.80	0.84
Hyperopt-XGB-7	297	6	0.22	0.69	0.89
Hyperopt-XGB-8	273	7	0.16	0.81	0.83
Hyperopt-XGB-9	278	7	0.09	0.59	0.73
Hyperopt-XGB-10	266	6	0.11	0.66	0.83
Hyperopt-XGB-11	266	6	0.11	0.66	0.83
Optuna-XGB-1	348	7	0.09	0.65	0.93
Optuna-XGB-2	471	8	0.09	0.77	0.93
Optuna-XGB-3	425	6	0.20	0.86	0.96
Optuna-XGB-4	416	7	0.14	0.62	0.97
Optuna-XGB-5	297	6	0.14	0.52	0.71
Optuna-XGB-6	443	7	0.14	0.84	0.81
Optuna-XGB-7	295	6	0.20	0.73	0.94
Optuna-XGB-8	488	6	0.09	0.66	0.97
Optuna-XGB-9	349	7	0.11	0.65	0.78
Optuna-XGB-10	471	7	0.12	0.85	0.62
Optuna-XGB-11	476	6	0.08	0.73	0.95

Table 5. Training performance of baseline and optimized XGB models across feature scenarios.

Training	RMSE	MAE	R²	WI	KGE
XGB-1	0.0580	0.0425	0.8214	0.9473	0.8444
XGB-2	0.0486	0.0358	0.8744	0.9645	0.8852
XGB-3	0.0471	0.0352	0.8927	0.9697	0.8883
XGB-4	0.0402	0.0303	0.9164	0.9772	0.9207
XGB-5	0.0483	0.0361	0.8898	0.9689	0.8887
XGB-6	0.0454	0.0337	0.9011	0.9722	0.8942
XGB-7	0.0364	0.0275	0.9309	0.9813	0.9273
XGB-8	0.0423	0.0310	0.9157	0.9768	0.9124
XGB-9	0.0330	0.0258	0.9399	0.9837	0.9271
XGB-10	0.0429	0.0317	0.9128	0.9759	0.9098
XGB-11	0.0318	0.0248	0.9442	0.9849	0.9300
Hyperopt-XGB-1	0.0257	0.0167	0.9650	0.9909	0.9638
Hyperopt-XGB-2	0.0111	0.0065	0.9935	0.9984	0.9925
Hyperopt-XGB-3	0.0081	0.0050	0.9969	0.9992	0.9949
Hyperopt-XGB-4	0.0068	0.0044	0.9976	0.9994	0.9962
Hyperopt-XGB-5	0.0158	0.0104	0.9883	0.9970	0.9836
Hyperopt-XGB-6	0.0075	0.0044	0.9973	0.9993	0.9959
Hyperopt-XGB-7	0.0058	0.0041	0.9983	0.9996	0.9973
Hyperopt-XGB-8	0.0080	0.0045	0.9970	0.9992	0.9960
Hyperopt-XGB-9	0.0049	0.0036	0.9987	0.9997	0.9968
Hyperopt-XGB-10	0.0115	0.0079	0.9938	0.9984	0.9905
Hyperopt-XGB-11	0.0059	0.0044	0.9981	0.9995	0.9964
Optuna-XGB-1	0.0258	0.0169	0.9647	0.9908	0.9641
Optuna-XGB-2	0.0097	0.0049	0.9950	0.9987	0.9938
Optuna-XGB-3	0.0068	0.0037	0.9978	0.9994	0.9970
Optuna-XGB-4	0.0062	0.0039	0.9980	0.9995	0.9968
Optuna-XGB-5	0.0165	0.0114	0.9872	0.9968	0.9856
Optuna-XGB-6	0.0065	0.0032	0.9980	0.9995	0.9976
Optuna-XGB-7	0.0059	0.0041	0.9982	0.9995	0.9971
Optuna-XGB-8	0.0097	0.0063	0.9956	0.9989	0.9935
Optuna-XGB-9	0.0029	0.0022	0.9995	0.9999	0.9986
Optuna-XGB-10	0.0064	0.0035	0.9980	0.9995	0.9972
Optuna-XGB-11	0.0034	0.0026	0.9994	0.9998	0.9983

Table 6. Test performance of baseline and optimized XGB models across feature scenarios.

Testing	RMSE	MAE	R²	WI	KGE
XGB-1	0.0612	0.0457	0.7937	0.9379	0.8257
XGB-2	0.0492	0.0363	0.8666	0.9616	0.8704
XGB-3	0.0446	0.0328	0.8902	0.9695	0.9012
XGB-4	0.0476	0.0360	0.8752	0.9648	0.8891
XGB-5	0.0496	0.0380	0.8646	0.9625	0.9004
XGB-6	0.0443	0.0325	0.8917	0.9701	0.9054
XGB-7	0.0434	0.0326	0.8963	0.9712	0.9024
XGB-8	0.0445	0.0333	0.8909	0.9699	0.9083
XGB-9	0.0380	0.0286	0.9204	0.9784	0.9238
XGB-10	0.0423	0.0318	0.9012	0.9729	0.9149
XGB-11	0.0356	0.0271	0.9299	0.9810	0.9267
Hyperopt-XGB-1	0.0400	0.0283	0.9120	0.9764	0.9341
Hyperopt-XGB-2	0.0248	0.0176	0.9662	0.9912	0.9602
Hyperopt-XGB-3	0.0217	0.0152	0.9741	0.9933	0.9740
Hyperopt-XGB-4	0.0229	0.0164	0.9710	0.9924	0.9641
Hyperopt-XGB-5	0.0290	0.0202	0.9536	0.9879	0.9588
Hyperopt-XGB-6	0.0217	0.0145	0.9740	0.9933	0.9716
Hyperopt-XGB-7	0.0220	0.0150	0.9732	0.9932	0.9784
Hyperopt-XGB-8	0.0221	0.0155	0.9730	0.9931	0.9737
Hyperopt-XGB-9	0.0205	0.0142	0.9767	0.9940	0.9733
Hyperopt-XGB-10	0.0227	0.0157	0.9716	0.9927	0.9735
Hyperopt-XGB-11	0.0200	0.0137	0.9780	0.9944	0.9793
Optuna-XGB-1	0.0395	0.0284	0.9141	0.9770	0.9345
Optuna-XGB-2	0.0236	0.0157	0.9692	0.9920	0.9646
Optuna-XGB-3	0.0213	0.0148	0.9750	0.9936	0.9769
Optuna-XGB-4	0.0221	0.0155	0.9732	0.9931	0.9709
Optuna-XGB-5	0.0282	0.0207	0.9561	0.9885	0.9579
Optuna-XGB-6	0.0199	0.0135	0.9782	0.9944	0.9756
Optuna-XGB-7	0.0219	0.0143	0.9737	0.9933	0.9796
Optuna-XGB-8	0.0217	0.0150	0.9741	0.9933	0.9738
Optuna-XGB-9	0.0203	0.0139	0.9773	0.9941	0.9682
Optuna-XGB-10	0.0215	0.0143	0.9746	0.9934	0.9642
Optuna-XGB-11	0.0183	0.0124	0.9815	0.9953	0.9825
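
For readers who wish to reproduce the five scores reported in Table 6 from their own observed and estimated water contents, the following is a minimal sketch rather than the authors' implementation. It assumes the standard definitions: R² as the coefficient of determination, WI as Willmott's index of agreement, and KGE in its original 2009 form (correlation, variability ratio, bias ratio). The function name `evaluate` and the example values are hypothetical.

```python
import numpy as np


def evaluate(obs, pred):
    """Return RMSE, MAE, R2, WI and KGE for observed vs. estimated values.

    Assumed definitions (not confirmed against the article's equations):
    R2  = 1 - SS_res / SS_tot
    WI  = 1 - SS_res / sum((|pred - mean(obs)| + |obs - mean(obs)|)^2)
    KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)
    """
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - obs
    obs_mean = obs.mean()

    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))

    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((obs - obs_mean) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    # Willmott's index of agreement
    wi_den = float(np.sum((np.abs(pred - obs_mean) + np.abs(obs - obs_mean)) ** 2))
    wi = 1.0 - ss_res / wi_den

    # Kling-Gupta efficiency components
    r = float(np.corrcoef(obs, pred)[0, 1])   # linear correlation
    alpha = float(pred.std() / obs.std())     # variability ratio
    beta = float(pred.mean() / obs_mean)      # bias ratio
    kge = 1.0 - float(np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

    return {"RMSE": rmse, "MAE": mae, "R2": r2, "WI": wi, "KGE": kge}


# Hypothetical volumetric water contents, for illustration only
theta_obs = [0.10, 0.20, 0.30, 0.40]
theta_est = [0.11, 0.21, 0.31, 0.41]
scores = evaluate(theta_obs, theta_est)
```

Because KGE penalizes bias and variability separately from correlation, a model can rank differently by KGE than by R², as happens for several scenarios in Table 6.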

Table 7. Comparison between conventional train–test split and GCV using soil sample code as the grouping variable.

Model	Conventional RMSE	Conventional R²	GCV RMSE	GCV R²	ΔRMSE (GCV − conventional)	ΔR² (GCV − conventional)
XGB	0.0356	0.9299	0.0557 ± 0.0047	0.8205 ± 0.0456	+0.0200	−0.1094
Hyperopt-XGB	0.0227	0.9717	0.0530 ± 0.0044	0.8374 ± 0.0417	+0.0303	−0.1342
Optuna-XGB	0.0216	0.9742	0.0543 ± 0.0056	0.8288 ± 0.0484	+0.0327	−0.1454
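
The gap between the two validation schemes in Table 7 can be illustrated with a small splitting experiment. The sketch below is not the authors' pipeline: it uses synthetic soil sample codes (20 hypothetical samples with 6 retention points each), covers only a single grouped split rather than the full nested procedure, and fits no model. It counts, for each fold, how many soil sample codes appear in both the training and the test rows under a shuffled `KFold` and under scikit-learn's `GroupKFold`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical layout: 20 soil samples, each contributing 6 retention points
n_samples, points_per_sample = 20, 6
groups = np.repeat(np.arange(n_samples), points_per_sample)  # soil sample code per row
X = np.zeros((groups.size, 1))  # placeholder features; only the row count matters


def shared_codes(splitter, use_groups):
    """Per fold, count soil sample codes found in both train and test rows."""
    split_args = (X, None, groups) if use_groups else (X,)
    counts = []
    for train_idx, test_idx in splitter.split(*split_args):
        overlap = np.intersect1d(groups[train_idx], groups[test_idx])
        counts.append(int(overlap.size))
    return counts


# Row-wise shuffling: points from one soil sample can land on both sides
random_overlap = shared_codes(KFold(n_splits=5, shuffle=True, random_state=0), False)

# Grouped splitting: every soil sample is kept entirely in train or in test
grouped_overlap = shared_codes(GroupKFold(n_splits=5), True)

print("shared codes per fold, random split :", random_overlap)
print("shared codes per fold, grouped split:", grouped_overlap)
```

Under the grouped split, no soil sample contributes points to both sides of a fold, so test scores reflect estimation for soils the model has not seen. This is consistent with the higher RMSE and lower R² in the GCV columns.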


Share and Cite

MDPI and ACS Style

Monavvar Sabegh, S.; Zarehaghi, D.; Samadianfard, S.; Sattari, M.T.; Ahmad, S. Enhanced Pedotransfer Functions Through Optuna-Optimized Extreme Gradient Boosting: Application to Soil Water Retention Modeling. Earth 2026, 7, 94. https://doi.org/10.3390/earth7030094

