Parametric Modeling of the Unsaturated Soil Hydraulic Conductivity Function Using Tree-Based and Ensemble Machine Learning Algorithms: A Comparative Analysis of Cubist, Random Forest, and LightGBM

Wang, Peng; Rastgou, Mostafa; Qi, Zhiming; Jiang, Qianjing; He, Yong

doi:10.3390/agronomy16111116


by Peng Wang ¹, Mostafa Rastgou ¹, Zhiming Qi ², Qianjing Jiang ¹ and Yong He ¹,*

¹ Department of Biosystems Engineering, Zhejiang University, 866 Yuhangtang Road, Hangzhou 310058, China

² Department of Bioresource Engineering, McGill University, Ste-Anne-de-Bellevue, QC H9X 3V9, Canada

* Author to whom correspondence should be addressed.

Agronomy 2026, 16(11), 1116; https://doi.org/10.3390/agronomy16111116 (registering DOI)

Submission received: 19 March 2026 / Revised: 29 May 2026 / Accepted: 2 June 2026 / Published: 5 June 2026

(This article belongs to the Section Precision and Digital Agriculture)


Abstract

Modeling the unsaturated soil hydraulic conductivity function (SHCF) is essential for understanding water movement in unsaturated zones and supporting effective agricultural and environmental management. Accurate estimation of SHCF parameters, particularly the α and n parameters of the van Genuchten–Mualem (VGM) model, remains challenging due to the complex interplay of soil physical properties. Tree-based machine learning methods have shown promising capabilities in this area. To further assess and compare tree-based approaches, this study evaluated three algorithms, Cubist, random forest (RF), and light gradient boosting machine (LightGBM), for the parametric estimation of the SHCF using 196 soil samples from the UNSODA database. Input variables, including sand, clay, soil bulk density (BD), field capacity (FC), and permanent wilting point (PWP), were structured into four progressively complex pedotransfer functions (PTFs). Cubist demonstrated the best overall generalization during testing, achieving the lowest average RMSD (7.165) across the four PTFs compared to RF (7.602) and LightGBM (8.068), although RF and LightGBM achieved marginally better performance on individual PTF–metric combinations. All three algorithms achieved high coefficients of determination (R² ≥ 0.95) across all PTFs. Specifically, in PTF4, the best-performing model, Cubist achieved a 6.8% lower RMSD than RF and a 12.4% improvement over LightGBM. Shapley additive explanations (SHAP), computed via XGBoost surrogate models, suggested that FC and PWP were the most influential predictors of the SHCF among the variables examined. These findings suggest that Cubist is a viable approach for estimating the SHCF, particularly when input data are limited to basic soil properties.

Keywords:

cubist; LightGBM; pedotransfer functions; random forest; unsaturated soil hydraulic conductivity function

1. Introduction

The unsaturated soil hydraulic conductivity function (SHCF) illustrates the relationship between the soil water content (or matric potential) and the hydraulic conductivity of soil [1]. It is a fundamental physical function that governs the rate at which water flows through the soil matrix under a hydraulic gradient [2]. This function is essential for modeling water flow and solute transport in both saturated and unsaturated soil conditions [3]. The SHCF is a key hydraulic property with broad applications in soil science, hydrology, geotechnical engineering, and environmental science [4,5]. It is used to model soil water processes and unsaturated zone dynamics, including water infiltration, redistribution, pollutant transport, and groundwater contamination risks. Additionally, it supports analyses of soil stability, drainage system design, soil erosion, and settlement of foundations [6,7].

Determining the SHCF is challenging due to the labor-intensive and time-consuming measurement process [8]. This limitation necessitates the use of indirect estimation methods, among which pedotransfer functions (PTFs) have emerged as a practical and widely adopted approach [4]. PTFs serve as mathematical models that bridge the gap between easily obtainable soil properties and the more difficult-to-measure hydraulic properties [9]. In other words, the essence of PTFs lies in their ability to leverage the readily available data on soil granulometry, bulk density, and organic matter content to predict soil hydraulic properties [10]. The construction of PTFs has been revolutionized by the application of various machine learning techniques, providing improved accuracy and efficiency in predicting soil hydraulic properties [11]. Machine learning algorithms, with their ability to discern complex relationships within data, have proven particularly well-suited for developing PTFs [12].

The selection of an appropriate machine learning algorithm for PTF development depends on dataset characteristics and the required prediction accuracy. While regression-based and neural network approaches have been widely used to estimate the SHCF [13,14,15,16], they exhibit notable limitations. Conventional regression models often fail to capture complex, nonlinear relationships in soil hydraulic data [17], and classical neural networks may lack interpretability or computational efficiency [18]. To overcome these limitations, recent studies have increasingly investigated advanced machine learning techniques that can model intricate patterns more effectively. In this regard, Farasati et al. (2024) demonstrated the superiority of random forest (RF) over support vector machine (SVM) and least-squares SVM in modeling hydraulic conductivity [19]. Similarly, Sihag et al. (2019) found RF outperformed M5P and regression analysis for unsaturated hydraulic conductivity estimation [20], while Veloso et al. (2022) identified RF and SVM as top performers for saturated conductivity prediction [21]. Most recently, Mouaddine et al. (2025) highlighted the RF method as a robust and reliable tool for predicting and mapping soil hydraulic conductivity [22].

Despite the demonstrated success of the RF algorithm in recent studies on estimating soil hydraulic conductivity, its performance has not yet been systematically compared to newer, advanced regression tree-based methods such as Cubist and light gradient-boosting machine (LightGBM). This study presents a novel comparative analysis of these three machine learning approaches (RF, Cubist, and LightGBM) for predicting the SHCF. The innovation of this research lies in determining whether RF maintains its competitive advantage over these newer algorithms when estimating parametric PTFs of the SHCF using the UNSODA database (Supplementary Material S1). Therefore, the primary objective of this study was to systematically evaluate and compare the prediction accuracy and reliability of RF, Cubist, and LightGBM algorithms for SHCF estimation.

2. Materials and Methods

2.1. Fitting Soil Moisture Data and Estimating Hydraulic Conductivity Function

In this study, 196 soil moisture characteristic curve datasets were obtained from the UNSODA database (https://catalog.data.gov/dataset/unsoda-2-0-unsaturated-soil-hydraulic-database-database-and-program-for-indirect-methods-o, accessed on 1 June 2026) [23] and fitted to the van Genuchten moisture retention equation (Equation (1)) [24] using a nonlinear optimization method in the MATLAB 2020 environment. The van Genuchten equation is widely used for describing the soil water retention behavior and is expressed as:

θ (h) = θ_{r} + \frac{θ_{s} - θ_{r}}{{[1 + {|α h|}^{n}]}^{1 - \frac{1}{n}}}

(1)

where θ(h) is the volumetric water content at pressure head h (cm). θ_s and θ_r denote the saturated and residual water contents, respectively.

The parameters α (cm⁻¹) and n are fitting parameters related to the inverse of the air-entry suction and to the pore-size distribution, respectively. The curve fitting was carried out using MATLAB’s nonlinear least-squares optimization tools. The goodness of fit for each dataset was evaluated using the root mean square error, the coefficient of determination, and the corrected Akaike information criterion to assess model complexity and fit quality, as illustrated in Figure 1. Subsequently, the optimized parameters obtained from Equation (1) were integrated into the VGM hydraulic conductivity model [24,25] (Equation (2)) to calculate the hydraulic conductivity function over a suction range of 0 to 1500 kPa [25]:

\begin{array}{l} K(S_{e}) = K_{s} S_{e}^{L} {\left[1 - {\left(1 - S_{e}^{1 / m}\right)}^{m}\right]}^{2}, \quad m = 1 - \frac{1}{n} \\ S_{e} = \frac{θ(h) - θ_{r}}{θ_{s} - θ_{r}} \end{array}

(2)

where S_e and K_s are the effective saturation and saturated hydraulic conductivity (cm day⁻¹), respectively. L is an empirical parameter related to the pore size interaction term, which was fixed at 0.5 [26]. The parameter m is constrained by the Mualem restriction [25]. Throughout this study, soil suction is expressed in kPa for numerical analysis (0 to 1500 kPa), where 1 kPa ≈ 10 cm of water column. For consistency with common soil physics conventions, suction is presented in cm on logarithmic axes in selected figures, with the conversion noted in the respective captions. The K_s values were obtained directly from the UNSODA database as laboratory-measured data accompanying each soil sample and were not estimated by the PTFs. Accordingly, in the subsequent modeling workflow, K_s was treated as a known input parameter to Equation (2), while α and n served as the target variables to be predicted by the machine learning algorithms. Hereafter, the resulting hydraulic conductivity as a function of suction head h is denoted as K(h), and a complete K(h) curve spanning the suction range from 0 to 1500 kPa constitutes one realization of the SHCF for a given soil sample.
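As a rough illustration of Equations (1) and (2), the sketch below evaluates the retention curve and K(h) in Python. It is not the authors' MATLAB code. The function names and the example parameter values are assumptions chosen for demonstration only, and the effective saturation is written in the algebraically equivalent closed form S_e = (1 + |αh|^n)^(−m).

```python
def effective_saturation(h_cm, alpha, n):
    """S_e implied by Equations (1) and (2): (1 + |alpha*h|^n)^(-m), with m = 1 - 1/n."""
    m = 1.0 - 1.0 / n
    return (1.0 + abs(alpha * h_cm) ** n) ** (-m)

def vg_theta(h_cm, theta_r, theta_s, alpha, n):
    """Equation (1): volumetric water content at pressure head h (cm)."""
    return theta_r + (theta_s - theta_r) * effective_saturation(h_cm, alpha, n)

def vgm_conductivity(h_cm, alpha, n, k_s, L=0.5):
    """Equation (2): K as a function of suction, with L fixed at 0.5 as in the text."""
    m = 1.0 - 1.0 / n
    s_e = effective_saturation(h_cm, alpha, n)
    return k_s * s_e ** L * (1.0 - (1.0 - s_e ** (1.0 / m)) ** m) ** 2

# Illustrative values only (not taken from the study's dataset).
ALPHA, N, K_S = 0.036, 1.56, 25.0        # cm^-1, dimensionless, cm/day
THETA_R, THETA_S = 0.08, 0.43

for suction_kpa in (0.0, 33.0, 1500.0):
    h_cm = suction_kpa * 10.0            # 1 kPa is taken as about 10 cm of water
    theta = vg_theta(h_cm, THETA_R, THETA_S, ALPHA, N)
    k = vgm_conductivity(h_cm, ALPHA, N, K_S)
    print(f"{suction_kpa:7.1f} kPa  theta={theta:.3f}  K={k:.3e} cm/day")
```

At zero suction S_e equals 1, so the function returns K_s, which is a quick sanity check on any implementation of Equation (2).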

2.2. Development of PTFs

PTFs were developed using a hierarchical combination of input variables to estimate the SHCF. The predictors included sand and clay content, soil bulk density (BD), moisture content at 33 kPa suction (field capacity, FC), and moisture content at 1500 kPa suction (permanent wilting point, PWP). These soil properties were obtained from the UNSODA database [23], where water retention at 33 and 1500 kPa is primarily determined using the pressure plate apparatus. Four progressively complex PTFs were structured as follows: PTF1 = Sand + Clay, PTF2 = Sand + Clay + BD, PTF3 = Sand + Clay + BD + FC, and PTF4 = Sand + Clay + BD + FC + PWP. This stepwise formulation approach, which has been widely reported in the literature [27,28], provides a systematic framework for identifying the contribution of each soil property to the estimation accuracy of the hydraulic conductivity function. By incrementally adding predictors, one can evaluate model sensitivity and robustness, ensuring that the final models remain both interpretable and efficient.

The two van Genuchten–Mualem parameters, α and n, were modeled as independent regression targets. Each machine learning algorithm was trained separately for α and for n using the identical set of input predictors within each PTF configuration. This separate modeling approach was adopted because α and n represent physically distinct soil hydraulic properties that may exhibit different functional relationships with the basic soil attributes used as predictors. Following the independent prediction of α and n, the K(h) curve for each soil sample was reconstructed by substituting these estimates together with the measured K_s and the fitted θ_s and θ_r into the van Genuchten–Mualem model (Equation (2)).
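The four nested predictor sets and the two independent targets can be laid out as a small task table. The following is a minimal sketch; the column names (`sand`, `clay`, `bd`, `fc`, `pwp`, `alpha`, `n`) and the two sample records are hypothetical and serve only to show the structure.

```python
# Nested predictor sets for the four PTFs described above.
PTFS = {
    "PTF1": ["sand", "clay"],
    "PTF2": ["sand", "clay", "bd"],
    "PTF3": ["sand", "clay", "bd", "fc"],
    "PTF4": ["sand", "clay", "bd", "fc", "pwp"],
}
TARGETS = ["alpha", "n"]  # modeled separately, one regression per target

def build_tasks(samples):
    """Return one (ptf, target, X, y) regression task per PTF-target pair."""
    tasks = []
    for ptf, features in PTFS.items():
        X = [[s[f] for f in features] for s in samples]
        for target in TARGETS:
            y = [s[target] for s in samples]
            tasks.append((ptf, target, X, y))
    return tasks

# Two made-up records, for illustration only.
samples = [
    {"sand": 65.0, "clay": 10.0, "bd": 1.55, "fc": 0.18, "pwp": 0.07, "alpha": 0.05, "n": 1.8},
    {"sand": 20.0, "clay": 30.0, "bd": 1.35, "fc": 0.34, "pwp": 0.17, "alpha": 0.01, "n": 1.3},
]
tasks = build_tasks(samples)
print(len(tasks), "regression tasks per algorithm")
```

With four PTFs and two targets, each algorithm is therefore fitted eight times.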

It should be noted that FC and PWP are point measurements of soil water content at specific matric pressures, obtained directly from laboratory analysis of each soil sample. These values are not derived from, or dependent upon, the van Genuchten model fitting procedure described in Section 2.1; rather, they are independently measured data points on the same soil moisture characteristic curve from which α and n were subsequently estimated via nonlinear optimization. Consequently, using FC and PWP as predictors for α and n does not constitute circular inference; instead, it reflects the physically meaningful relationship between water retention at discrete pressure points and the continuous shape parameters of the full retention curve. From a practical standpoint, FC and PWP are among the most routinely measured and widely reported soil hydraulic properties in national and international soil databases, making them accessible input variables for real-world PTF applications [27,28].

2.3. Data Preprocessing

Data preprocessing is a crucial step in developing reliable machine learning models, as it improves the quality of input data, reduces noise, and ensures consistent scale and distribution. In this study, outliers were first identified using the interquartile range (IQR) method: values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles, were replaced with the respective lower or upper bound. Subsequently, to normalize the feature scales, all data were rescaled to a 0–1 range using min-max scaling [29]:

X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}

(3)

where X denotes the original values, while X_min and X_max denote the minimum and maximum values for each feature. This preprocessing step ensures comparability across features and reduces bias stemming from differences in magnitude. It should be noted that all preprocessing transformations were integrated into the cross-validation framework to avoid data leakage. In each iteration of the 10-fold cross-validation, the IQR bounds for outlier capping and the minimum and maximum values for min-max scaling were calculated exclusively from the training fold and subsequently applied to transform both the training and the corresponding held-out test fold. This within-fold fitting strategy ensures that no information from the test partition contributes to the preprocessing statistics, thereby preserving the independence of training and testing data and providing an unbiased evaluation of model generalization performance.
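The within-fold fitting strategy can be sketched as a fit/transform pair. This is a simplified single-feature illustration, not the authors' pipeline. The helper names are hypothetical, the quartile definition (`method="inclusive"`) is an assumption, and taking the minimum and maximum after capping is one plausible reading of the order described above.

```python
import statistics

def fit_preprocessor(train_column):
    """Learn IQR capping bounds and min-max limits from training values only."""
    q1, _, q3 = statistics.quantiles(train_column, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    capped = [min(max(v, lower), upper) for v in train_column]
    return {"lower": lower, "upper": upper, "min": min(capped), "max": max(capped)}

def transform(column, p):
    """Cap values to the training bounds, then apply Equation (3)."""
    span = p["max"] - p["min"]
    out = []
    for v in column:
        v = min(max(v, p["lower"]), p["upper"])
        out.append((v - p["min"]) / span if span > 0 else 0.0)
    return out

# Hypothetical bulk-density values; the last training value is an outlier.
train = [1.20, 1.35, 1.40, 1.45, 1.50, 1.55, 1.60, 2.90]
test = [1.30, 1.70, 3.10]

params = fit_preprocessor(train)          # statistics come from the training fold
train_scaled = transform(train, params)
test_scaled = transform(test, params)     # held-out fold reuses the training statistics
print(train_scaled)
print(test_scaled)
```

Because the bounds are learned on the training fold alone, a held-out value can fall slightly outside the 0–1 range when it lies between the capping bound and the training minimum or maximum; that is expected behavior rather than leakage.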

2.4. Machine Learning Algorithms

In this study, three machine learning algorithms, RF, Cubist, and LightGBM, were used to develop PTFs for estimating the SHCF. Each algorithm is described below along with its core mathematical concept and corresponding pseudocode.

2.4.1. RF Algorithm

RF is an ensemble learning method that builds multiple decision trees to improve prediction accuracy and reduce overfitting [30]. Figure 2A shows a visual representation of the RF algorithm (Algorithm 1).

Algorithm 1: Random forest (RF)

Parameters: Number of trees (T) = 500;

Procedure

1 For t = 1 to T:

2 - Draw a bootstrap sample from the data

3 - Grow a decision tree f_t using a random feature subset

4 - Store the decision tree f_t

5 End For

6 To predict:

7 - Return average of all tree predictions

Given a training dataset D = \{(x_{i}, y_{i})\}_{i = 1}^{N}, where x_i is a feature vector and y_i is a continuous target, RF generates T bootstrap samples, D₁, D₂, …, D_T, by randomly sampling with replacement from the original dataset. A decision tree, f_t, is trained on each bootstrap sample D_t. During the construction of each tree, only a random subset of m features (where m < total number of features) is considered at each split, which helps reduce correlation between the trees [31]. For regression, splits are chosen to minimize the mean squared error (MSE). For a node t (here t indexes a tree node rather than a tree), the MSE is computed as [32]:

\mathrm{MSE}(t) = \frac{1}{N_{t}} \sum_{i \in t} {(y_{i} - {\bar{y}}_{t})}^{2}

(4)

where N_t is the number of samples in node t, and \bar{y}_{t} is the mean target value in the node. The final output for a new input x is obtained by averaging the predictions from all individual trees [33]:

\hat{y} = \frac{1}{T} \sum_{t = 1}^{T} f_{t} (x)

(5)

where \hat{y} is the predicted value, f_t(x) is the prediction from the t-th tree, and T is the total number of trees.
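The bootstrap-and-average idea behind Algorithm 1 and Equations (4) and (5) can be sketched in plain Python. This is a deliberately reduced illustration, not the randomForest implementation: each "tree" is a one-split stump, the random feature subset has size one, and all function names and data are hypothetical.

```python
import random

def fit_stump(X, y, feature):
    """Best single split on one feature, minimizing summed squared error (cf. Equation (4))."""
    best = None
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [yi for row, yi in zip(X, y) if row[feature] <= thr]
        right = [yi for row, yi in zip(X, y) if row[feature] > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:  # feature is constant in this bootstrap sample
        mean = sum(y) / len(y)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])

def fit_forest(X, y, n_trees=500, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample with replacement
        feature = rng.randrange(len(X[0]))         # random feature subset of size 1
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, x):
    """Equation (5): average of the individual tree predictions."""
    preds = [ml if x[f] <= thr else mr for f, thr, ml, mr in forest]
    return sum(preds) / len(preds)

# Toy data: both features increase with the target.
X = [[float(i), float(i * i)] for i in range(10)]
y = [float(i) for i in range(10)]
forest = fit_forest(X, y, n_trees=50, seed=0)
print(predict(forest, [0.0, 0.0]), predict(forest, [9.0, 81.0]))
```

A full random forest grows deep trees and redraws the feature subset at every split; the stump version keeps only the two ingredients the text emphasizes, resampling and averaging.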

2.4.2. Cubist Algorithm

The Cubist algorithm (Algorithm 2) is an advanced regression technique that combines rule-based decision trees with linear models to improve predictive accuracy while maintaining interpretability [34]. Unlike traditional decision trees that assign constant values at the leaves, Cubist constructs model trees, where each internal node splits the data based on feature thresholds, and each leaf node contains a linear regression model (Figure 2B) [35]. After constructing the tree, Cubist transforms it into a set of if-then rules. A linear equation is then fitted to the data associated with each rule:

\hat{y} = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{n} x_{n}

(6)

where \hat{y} is the predicted value and β_i are the regression coefficients for input features x_i, fitted within each rule or path of the tree.

Algorithm 2: Cubist

Parameters: Number of committees = 10;

Neighbors = 5;

Procedure

1 Train a rule-based tree:

2 - Split nodes to minimize prediction error

3 - Fit linear models at leaves using least squares

4 Generate rules

5 - Convert tree paths into interpretable if-then rules

6 To predict:

7 - Identify matching rule/path

8 - Apply associated linear model
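The prediction step of Algorithm 2 (match a rule, then apply its linear model from Equation (6)) might look as follows. The rules, thresholds, and coefficients are invented for illustration, and committees and the nearest-neighbor adjustment are left out. Averaging over several matching rules is included on the assumption that rules may overlap.

```python
# Each rule: (condition on the inputs, intercept, coefficients). All values are hypothetical.
RULES = [
    (lambda x: x["sand"] > 50.0, 1.90, {"sand": 0.004, "clay": -0.010}),
    (lambda x: x["sand"] <= 50.0, 1.45, {"sand": 0.002, "clay": -0.004}),
]

def linear_model(x, intercept, coefs):
    """Equation (6) evaluated for one rule."""
    return intercept + sum(b * x[name] for name, b in coefs.items())

def predict(x, rules=RULES):
    """Average the linear models of every rule whose condition matches x."""
    matches = [linear_model(x, b0, coefs) for cond, b0, coefs in rules if cond(x)]
    if not matches:
        raise ValueError("no rule covers this sample")
    return sum(matches) / len(matches)

sandy = {"sand": 60.0, "clay": 10.0}
clayey = {"sand": 20.0, "clay": 40.0}
print(predict(sandy), predict(clayey))
```

The readable rule list is what gives Cubist its interpretability: each prediction can be traced to one condition and one linear equation.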

2.4.3. LightGBM Algorithm

LightGBM (Algorithm 3) is a gradient boosting algorithm that constructs trees leaf-wise instead of level-wise (Figure 2C) [36]. In contrast to level-wise growth, LightGBM expands the leaf with the highest potential for loss reduction. For a fixed number of leaves, this strategy tends to reach a lower loss than level-wise growth, although it can overfit small datasets unless tree complexity (for example, the number of leaves) is constrained. The algorithm employs gradient descent to minimize a differentiable loss function over successive boosting iterations [37]:

{\hat{y}}_{i}^{(t)} = {\hat{y}}_{i}^{(t - 1)} + η f_{t} (x_{i})

(7)

where \hat{y}_{i}^{(t)} is the prediction at iteration t, f_t is the decision tree added at step t, and η is the learning rate.

Algorithm 3: Light gradient-boosting machine (LightGBM)

Parameters: Learning rate (η) = 0.01;

Number of leaves = 31;

Number of rounds (T) = 1000;

Procedure

1 Initialize model with a constant value

2 For t = 1 to T:

3 - Compute the gradients of the loss function

4 - Fit regression tree f_t to the computed gradients

5 - Update prediction using the output of f_t

6 End For

7 Return the final prediction
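The boosting loop of Algorithm 3 and the update in Equation (7) can be sketched with one-split stumps and squared-error loss. This is generic gradient boosting for illustration, not LightGBM itself: leaf-wise growth and histogram binning are omitted, and the function names and data are hypothetical.

```python
def fit_stump(x, residuals):
    """Least-squares stump on a single feature: returns (threshold, left_mean, right_mean)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    return best[1:]

def boost(x, y, eta=0.01, rounds=1000):
    base = sum(y) / len(y)                    # step 1: constant initial model
    pred = [base] * len(y)
    trees = []
    for _ in range(rounds):
        # For squared-error loss the negative gradient is the residual y - prediction.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, ml, mr = fit_stump(x, residuals)
        trees.append((thr, ml, mr))
        # Equation (7): add the new tree's output, shrunk by the learning rate.
        pred = [pi + eta * (ml if xi <= thr else mr) for xi, pi in zip(x, pred)]
    return base, trees

def predict(base, trees, xi, eta=0.01):
    return base + eta * sum(ml if xi <= thr else mr for thr, ml, mr in trees)

# Toy one-feature data with a step-like response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 1.1, 3.0, 3.2, 2.9, 5.0, 5.1]
base, trees = boost(x, y)
print(predict(base, trees, 2.0), predict(base, trees, 7.5))
```

The small learning rate means each tree contributes only 1% of its fitted correction, which is why a large number of rounds accompanies η = 0.01 in the configuration above.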

2.5. Model Training and Implementation

In this study, the dataset was partitioned using 10-fold cross-validation (k = 10) for both training and testing phases. In each iteration, one fold was held out as the test set while the remaining nine folds constituted the training set; no separate, independent test set was reserved. Accordingly, all testing-phase metrics reported in this study represent cross-validated predictions aggregated across the ten held-out folds. Critically, the entire preprocessing pipeline, including IQR-based outlier capping and min-max normalization, was embedded within the cross-validation loop. For each fold, the preprocessing transformations were fitted solely on the training samples and then applied to the held-out test samples, thereby preventing any information leakage from the test set into model development. This approach aligns with the findings of Marcot and Hanea [38], who demonstrated that k = 10 provides an optimal balance between computational efficiency and reliable error estimation for datasets of moderate size.
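The aggregation of held-out predictions across the ten folds can be sketched as below. The fold assignment scheme, the random seed, and the placeholder learner are assumptions; any model with a fit and a predict step could be substituted.

```python
import random

def kfold_indices(n_samples, k=10, seed=42):
    """Shuffle sample indices once and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_predictions(X, y, fit, predict, k=10):
    """Return one held-out prediction per sample, aggregated over the k test folds."""
    out = [None] * len(y)
    for test_idx in kfold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Any preprocessing would be fitted here, on the training rows only.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            out[i] = predict(model, X[i])
    return out

# Placeholder learner (predicts the training mean), used only to exercise the loop.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x):
    return model

y = [float(i) for i in range(196)]
X = [[v] for v in y]
folds = kfold_indices(len(y))
cv_pred = cross_validated_predictions(X, y, fit_mean, predict_mean)
print(sorted(len(f) for f in folds))
```

With 196 samples and k = 10, six folds hold 20 samples and four hold 19, and every sample is predicted exactly once by a model that never saw it.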

All algorithms were implemented using their default or commonly recommended hyperparameter configurations without additional optimization via nested cross-validation or grid search. For RF, the number of trees was set to 500, as prior studies have demonstrated that this value is sufficient to achieve stable out-of-bag error estimates for datasets of comparable size, with negligible improvement beyond this threshold [31,39]. For Cubist, the number of committees was set to 10 and the number of neighbors to 5, corresponding to the default settings of the Cubist package, which have been shown to provide robust performance across a wide range of regression tasks without extensive tuning [34,40]. For LightGBM, a learning rate of 0.01, 31 leaves per tree, and 1000 boosting rounds were adopted; these values represent a conservative configuration that prioritizes generalization by limiting tree complexity and employing a small learning rate with a sufficiently large number of iterations, as recommended by Ke et al. [36] and Shi et al. [41]. This fixed-parameter strategy was deliberately chosen to ensure a fair, like-for-like comparison among the three algorithms under standardized conditions, avoiding the confounding effect that differential tuning effort could introduce.

2.6. Shapley Additive Explanation Analysis for Variable Importance

SHapley Additive exPlanations (SHAP) was employed to interpret the influence of input variables on the soil hydraulic parameters (α and n) derived from the van Genuchten–Mualem model. SHAP is a game-theoretic approach that quantifies the contribution of each feature to a model’s output by computing Shapley values from cooperative game theory. It provides a unified, model-agnostic framework for breaking down a prediction into additive feature contributions, facilitating both global feature importance rankings and local interpretability. The SHAP value for a feature represents the average marginal contribution of that feature across all possible combinations of inputs. Mathematically, the SHAP value ϕ_i for feature i is defined as [42]:

ϕ_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! (|F| - |S| - 1)!}{|F|!} \left(f(S \cup \{i\}) - f(S)\right)

(8)

where F is the set of all input features, S is a subset of F not containing feature i, f(S) is the model output when using only features in subset S, and |S| is the cardinality of subset S. In this study, SHAP values were computed using XGBoost surrogate models. Because Cubist does not natively support TreeSHAP, and to maintain methodological consistency across all three algorithms, an XGBoost regressor was trained to approximate each original model’s predictions on the same input data. TreeSHAP was then applied to the XGBoost surrogates to extract SHAP values. The resulting SHAP values are interpreted as approximating, rather than exactly replicating, the feature contributions of the original models. Mean absolute SHAP values were used for global feature importance ranking, and SHAP dependence plots were employed to visualize nonlinear effects and feature interactions.

Because the Cubist and randomForest packages in R 4.5.0 do not provide native SHAP value computation, a surrogate modeling approach was adopted for the SHAP analysis. Separate gradient-boosted tree models (XGBoost) were trained for α and for n using the same input features as the primary PTF models, serving as accurate and computationally efficient approximations of the tree-based modeling framework. SHAP values were then computed from these surrogate models using the fastshap package, which employs a Monte Carlo sampling strategy to estimate Shapley values for arbitrary prediction functions. This approach yields model-agnostic, global feature importance metrics and dependence relationships that are representative of the tree-based modeling pattern underlying Cubist, RF, and LightGBM.
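Equation (8) can be evaluated exactly when the number of features is small. The sketch below enumerates all subsets for a toy value function; it is neither TreeSHAP nor the fastshap sampling estimator, and the effect sizes and the FC–PWP interaction are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values via Equation (8); `value` maps a frozenset of features to f(S)."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

# Toy value function (hypothetical): additive effects plus one FC-PWP interaction.
EFFECT = {"sand": 0.10, "fc": 0.30, "pwp": 0.20}

def toy_value(s):
    bonus = 0.10 if {"fc", "pwp"} <= s else 0.0
    return sum(EFFECT[f] for f in s) + bonus

phi = shapley_values(list(EFFECT), toy_value)
print(phi)
```

The interaction bonus is split equally between FC and PWP, and the values sum to f(F) − f(∅), which is the additivity property that makes mean absolute SHAP values usable for global ranking.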

2.7. Evaluation Statistics and Taylor Diagram for Performance Comparison

The performance of the three algorithms (RF, Cubist, and LightGBM) was evaluated using the root mean squared deviation (RMSD) and the coefficient of determination (R²). The RMSD, calculated across the entire suction range (0 to 1500 kPa), provided a comprehensive measure of prediction error by integrating local deviations [43]:

\mathrm{RMSD} \ (\mathrm{cm \ day^{-1}}) = \sqrt{\frac{1}{a_{2} - a_{1}} \int_{a_{1}}^{a_{2}} {({\hat{K}}_{i} (h) - K_{i} (h))}^{2} \, d \log |h|}

(9)

where K_i(h) and \hat{K}_{i}(h) are the observed and predicted hydraulic conductivity values at suction h, respectively, and a₁ and a₂ are the lower and upper integration limits corresponding to the suction range from 0 to 1500 kPa. Since the integration in Equation (9) is performed with respect to d(log h), where h is the suction in kPa, the resulting RMSD retains the same physical units as K(h) (cm day⁻¹). The use of logarithmic integration weights deviations approximately equally across orders of magnitude of suction, which is appropriate given that K(h) varies over several orders of magnitude across the 0–1500 kPa range. R² quantified the proportion of variance in hydraulic conductivity explained by each algorithm, calculated as:

R^{2} = 1 - \frac{\sum_{i = 1}^{N} {(K_{i} (h) - {\hat{K}}_{i} (h))}^{2}}{\sum_{i = 1}^{N} {(K_{i} (h) - {\bar{K}}_{i} (h))}^{2}}

(10)

where \bar{K}_{i}(h) represents the mean observed conductivity and N denotes the number of data points. In addition to RMSD and R², the mean absolute error (MAE) was computed to provide a complementary measure of prediction accuracy that is less sensitive to large deviations than RMSD.
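The three metrics can be sketched numerically as follows. The integral in Equation (9) is approximated here by the trapezoidal rule on a log₁₀ suction grid. The grid, the strictly positive lower limit, and the sample values are assumptions; the base of the logarithm cancels in the normalization by a₂ − a₁.

```python
import math

def log_integrated_rmsd(h, k_obs, k_pred):
    """Equation (9) by the trapezoidal rule in log10(h); h must be positive and increasing."""
    u = [math.log10(v) for v in h]
    sq = [(p - o) ** 2 for o, p in zip(k_obs, k_pred)]
    area = sum(0.5 * (sq[j] + sq[j + 1]) * (u[j + 1] - u[j]) for j in range(len(u) - 1))
    return math.sqrt(area / (u[-1] - u[0]))

def r_squared(obs, pred):
    """Equation (10): 1 minus residual sum of squares over total sum of squares."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def mae(obs, pred):
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

# Hypothetical K(h) values (cm/day) on a coarse suction grid (kPa).
h = [0.1, 1.0, 10.0, 100.0, 1500.0]
k_obs = [10.0, 5.0, 1.0, 0.1, 0.01]
k_pred = [k + 2.0 for k in k_obs]         # constant offset of 2 cm/day

print(log_integrated_rmsd(h, k_obs, k_pred), r_squared(k_obs, k_pred), mae(k_obs, k_pred))
```

A constant offset gives an RMSD equal to that offset regardless of the grid, which makes it a convenient check that the normalization is implemented correctly.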

It is important to recognize that the RMSD computed by Equation (9) inherently reflects the propagation of prediction errors from the estimated parameters (α, n) to the final K(h) curve. Since the van Genuchten–Mualem model (Equation (2)) is a nonlinear function of α and n, any deviations between the ML-predicted and the reference parameter values are nonlinearly transmitted to the computed hydraulic conductivity across the suction range. The integration over the entire log-suction interval (0 to 1500 kPa) in Equation (9) provides a comprehensive, end-to-end measure of this aggregated error along the full K(h) curve. Consequently, a lower testing RMSD, as achieved by Cubist for three of four PTFs (Table 1), indicates not only higher accuracy in predicting α and n individually but also a more favorable error propagation behavior through the hydraulic model, yielding K(h) curves that more faithfully reproduce the observed conductivity response across the full suction domain.

To enable formal statistical comparison of algorithm performance, per-sample RMSD values were computed by applying Equation (9) to the test-phase K(h) predictions of each individual soil sample. These values (n = 196 per algorithm per PTF) constitute independent, sample-level observations suitable for statistical inference. The non-parametric Friedman rank-sum test was subsequently employed to assess whether significant overall differences existed among the three algorithms within each PTF configuration.
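For three algorithms compared on the same samples, the Friedman statistic and its p-value can be computed directly. The sketch below is a plain-Python illustration under stated assumptions: it uses the chi-square approximation, applies no tie correction, and relies on the closed-form upper tail exp(−Q/2) that holds only for two degrees of freedom (k = 3). The data are invented.

```python
import math

def friedman_test(rows):
    """Friedman statistic for rows of k related measurements (smallest value gets rank 1)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        pos = 0
        while pos < k:                        # assign average ranks to tied values
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            avg = (pos + end) / 2.0 + 1.0
            for j in order[pos:end + 1]:
                ranks[j] = avg
            pos = end + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    # With k = 3 the reference chi-square distribution has 2 degrees of freedom,
    # whose upper-tail probability is exp(-q / 2).
    p = math.exp(-q / 2.0) if k == 3 else None
    return q, p

# Hypothetical per-sample RMSD values: columns are three algorithms, rows are samples.
rows = [[1.0 + 0.1 * i, 1.5 + 0.1 * i, 2.0 + 0.1 * i] for i in range(10)]
q_stat, p_value = friedman_test(rows)
print(q_stat, p_value)
```

Because the test works on within-sample ranks, it is insensitive to the few samples with very large K(h) that dominate the aggregate RMSD.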

In addition to the per-sample statistics, the fold-level stability of model performance was assessed. For each PTF–algorithm combination, the 196 test samples were randomly assigned to 10 folds, and the mean RMSD within each fold was computed using Equation (9). The mean and standard deviation of these 10 fold-level means are reported as Fold-Mean and Fold-SD in the rightmost columns of Table 1. The Fold-SD measures the stability of model performance across different data partitions: a smaller value indicates that the model’s accuracy is less sensitive to how the samples are divided.
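The Fold-Mean and Fold-SD summaries can be sketched as follows. The random assignment, the seed, and the use of the sample standard deviation (rather than the population form) are assumptions, and the per-sample values are synthetic.

```python
import random
import statistics

def fold_stability(per_sample_rmsd, k=10, seed=1):
    """Randomly assign samples to k folds; return (mean, SD) of the k fold-level means."""
    idx = list(range(len(per_sample_rmsd)))
    random.Random(seed).shuffle(idx)
    fold_means = [
        statistics.fmean(per_sample_rmsd[i] for i in idx[f::k]) for f in range(k)
    ]
    return statistics.fmean(fold_means), statistics.stdev(fold_means)

# Synthetic per-sample RMSD values for 196 samples.
rng = random.Random(0)
values = [abs(rng.gauss(7.0, 2.0)) for _ in range(196)]
fold_mean, fold_sd = fold_stability(values)
print(round(fold_mean, 3), round(fold_sd, 3))
```

A Fold-SD close to zero would indicate that performance barely depends on how the samples are partitioned.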

In this study, alongside statistical metrics, Taylor diagrams were employed to visually compare the performance of the Cubist, RF, and LightGBM algorithms across four PTFs in estimating the SHCF. The Taylor diagram is a graphical tool that synthesizes multiple statistical metrics, including the Pearson correlation coefficient, root mean square error, and standard deviation, into a single polar plot, facilitating the comparison of model performance against a reference (observations) [44]. In this diagram, the radial distance from the origin indicates the standard deviation, the azimuthal angle represents the correlation coefficient, and the distance from the reference point (observed data) reflects the centered root mean square error. Algorithms that are nearer to the reference point demonstrate better alignment with observations. This method provides a comprehensive assessment of accuracy, reliability, variability, and bias in predictions under various combinations of input variables.
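The three quantities plotted on a Taylor diagram are linked by the law of cosines, E′² = σ_p² + σ_o² − 2σ_pσ_oR, where E′ is the centered root mean square error. The sketch below computes them for made-up data; the function name is hypothetical and population standard deviations are assumed.

```python
import math

def taylor_statistics(obs, pred):
    """Standard deviations, Pearson correlation, and centered RMSE for a Taylor diagram."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sd_obs = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    sd_pred = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    r = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / (n * sd_obs * sd_pred)
    crmse = math.sqrt(sum(((p - mp) - (o - mo)) ** 2 for o, p in zip(obs, pred)) / n)
    return {"sd_obs": sd_obs, "sd_pred": sd_pred, "r": r, "crmse": crmse}

# Made-up observed and predicted values.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.1, 1.9, 3.2, 3.8, 5.3]
stats = taylor_statistics(obs, pred)
print(stats)
```

Because the centered error removes the mean difference, a Taylor diagram does not display overall bias on its own, so a separate bias statistic is a useful complement.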

3. Results and Discussion

3.1. Descriptive Statistics and Data Distribution Analysis

Figure 3 shows probability density distributions of soil hydraulic parameters and their predictors. The distribution of soil properties showed that sand and clay content had the highest standard deviations (SD = 28.63 and 14.63, respectively), indicating considerable variability in particle size distribution across the sampled soils. This dispersion indicates a heterogeneous soil composition within the dataset, which is further corroborated by the texture triangle analysis (Figure 4). The USDA texture triangle analysis of 196 selected UNSODA soil samples revealed an uneven distribution across 10 textural classes, with the majority clustered in four dominant classes: silt loam (n = 47, 24.0%), sand (n = 32, 16.3%), sandy loam (n = 32, 16.3%), and loam (n = 32, 16.3%).

The soil samples spanned all USDA texture classes except sandy clay and sandy clay loam, demonstrating broad textural diversity suitable for PTF development. This textural diversity ensures representation of the wide hydraulic property ranges associated with distinct textural groups, which is critical for creating robust PTFs [16,45]. The absence of sandy clay and sandy clay loam textures may limit extrapolation to these specific classes, though their hydraulic behavior can often be approximated by neighboring textural groups in the triangle [27]. The predominance of loamy textures (Figure 4) aligns with the moderate skewness and kurtosis values in Figure 3 (e.g., skewness = 0.41 for sand, 1.06 for clay), indicating generally balanced particle-size distributions without extreme outliers. Bulk density approximated a normal distribution (kurtosis ≈ 3), while FC and PWP exhibited platykurtic distributions (kurtosis < 3), indicating weaker tail effects than those typically expected for pore-size-dependent properties. These deviations from normality highlight the need for machine learning approaches that can effectively manage non-linear relationships.

3.2. Performance Comparison of Cubist, RF, and LightGBM Across PTFs

Table 1 presents a quantitative assessment of the Cubist, RF, and LightGBM algorithms in both the training and testing phases for each of the four developed PTFs. Figure 5 and Figure 6 visually summarize these comparisons through Taylor diagrams in the training and testing phases, respectively. In these diagrams, algorithms closer to the reference point (observed data) perform better, showing higher correlation, lower centered root mean square error, and a standard deviation closely approximating that of the observations. As shown in Table 1 and Figure 5, during the training phase, the RF algorithm consistently outperformed Cubist and LightGBM across all PTFs.

For instance, in PTF1, RF achieved an RMSD approximately 19.6% lower and an R² approximately 1.6% higher than Cubist. This pattern persisted across the remaining PTFs (PTF2–PTF4), where RF maintained 29.4–32.9% lower RMSD values and 1.2–2.4% higher R² scores than the competing algorithms. However, as illustrated in Table 1 and Figure 6, the Cubist algorithm demonstrated better generalization in the testing phase for most PTFs. Specifically, for PTF1, Cubist reduced RMSD by 7.2% (8.004 vs. 8.625) while improving R² by 0.4% (0.953 vs. 0.950) compared to RF. This performance gap widened in PTF2 (12.5% lower RMSD, 6.920 vs. 7.912) and remained evident in PTF4 (6.8% lower RMSD, 6.490 vs. 6.962), with corresponding R² improvements of 0.8% and 0.1%, respectively. In PTF3, RF achieved a marginally lower aggregate RMSD (6.911 vs. 7.246) than Cubist, indicating that Cubist’s advantage is not uniform across all PTF–metric combinations. However, the per-sample statistical analysis confirmed that Cubist remained significantly better than RF in PTF3, and the R² also favored Cubist (Table 1). This illustrates that aggregate RMSD, which is more heavily weighted by samples with large K(h) magnitudes, does not always align with per-sample rankings or other performance indicators, and it reinforces the importance of consulting multiple complementary metrics when comparing algorithm performance.
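The relative differences quoted above follow from the testing-phase RMSD values given in the text. The short check below recomputes them; only the Cubist and RF values stated in this section are used, and the helper name is hypothetical.

```python
# Testing-phase RMSD values quoted in the text (Cubist vs. RF).
TEST_RMSD = {
    "PTF1": {"Cubist": 8.004, "RF": 8.625},
    "PTF2": {"Cubist": 6.920, "RF": 7.912},
    "PTF3": {"Cubist": 7.246, "RF": 6.911},
    "PTF4": {"Cubist": 6.490, "RF": 6.962},
}

def relative_reduction(candidate, baseline):
    """Percentage by which `candidate` is lower than `baseline` (negative if it is higher)."""
    return 100.0 * (baseline - candidate) / baseline

reductions = {
    ptf: round(relative_reduction(v["Cubist"], v["RF"]), 1) for ptf, v in TEST_RMSD.items()
}
mean_cubist = sum(v["Cubist"] for v in TEST_RMSD.values()) / len(TEST_RMSD)
mean_rf = sum(v["RF"] for v in TEST_RMSD.values()) / len(TEST_RMSD)

print(reductions)
print(round(mean_cubist, 3), round(mean_rf, 3))
```

The negative entry for PTF3 reflects the one configuration in which RF has the lower aggregate RMSD, and the two means match the averages reported in the Abstract to within rounding.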

It should be noted that the training metrics primarily reflect each algorithm’s capacity to fit the data on which it was trained, whereas the testing metrics assess generalization to unseen soil samples, the latter being the criterion of practical relevance for PTF applications. Among the testing indicators, RMSD is prioritized here because it integrates prediction error across the entire suction range (0–1500 kPa) and penalizes large deviations more heavily than MAE, which is critical for hydrological modeling where substantial errors in K(h) at specific suction levels can propagate to unreliable flux estimates. The consistently low bias observed for all algorithms (Table 2) indicates that none of the three methods exhibited systematic over- or under-prediction tendencies.
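The reason for prioritizing RMSD over MAE can be illustrated with a small numerical sketch. It assumes the standard definitions of RMSD, MAE, and bias (with bias taken as predicted minus observed, which is an assumed sign convention); the K(h) values are hypothetical and are not drawn from the UNSODA samples.

```python
import math


def rmsd(obs, pred):
    """Root mean square deviation between observed and predicted values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)


def bias(obs, pred):
    """Mean signed error (predicted minus observed; assumed sign convention)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


# Hypothetical K(h) values in arbitrary units, for illustration only.
observed = [10.0, 8.0, 6.0, 4.0, 2.0]
even_errors = [11.0, 7.0, 7.0, 3.0, 3.0]  # every point is off by 1 unit
one_outlier = [10.0, 8.0, 6.0, 4.0, 7.0]  # a single point is off by 5 units

if __name__ == "__main__":
    for name, pred in (("even errors", even_errors), ("one outlier", one_outlier)):
        print(
            f"{name}: MAE = {mae(observed, pred):.3f}, "
            f"RMSD = {rmsd(observed, pred):.3f}, bias = {bias(observed, pred):+.3f}"
        )
```

Both prediction sets have the same MAE, but the set with a single large deviation has the larger RMSD, which is the behavior that motivates using RMSD when large errors in K(h) at specific suction levels matter.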

While RF achieved the highest training accuracy through its ensemble approach, Cubist showed superior generalization capability during testing for three of four PTFs (PTF1, PTF2, and PTF4). The superior training performance of the RF algorithm can be attributed to its ensemble nature and high flexibility. RF tends to fit the training data very closely by constructing numerous decision trees and averaging their outputs, effectively reducing bias [39,46,47]. However, this very flexibility can lead to reduced performance when applied to testing data, particularly under default hyperparameter configurations without dedicated regularization [48,49]. As noted in Section 2.5, all algorithms in this study were evaluated under their default settings; the observed training–testing divergence therefore reflects differences in each algorithm’s default bias–variance trade-off rather than an inherent propensity toward overfitting. On the other hand, the Cubist model, a rule-based framework that builds piecewise linear models within decision trees, achieves a favorable balance between bias and variance under the conditions of this study. By integrating rule-based splits with linear regression at the leaves, Cubist maintains a relatively simple structure that imposes implicit regularization, allowing it to generalize more effectively to new data [50,51,52]. This visual and aggregate-metric evidence is corroborated by a per-sample statistical analysis: Friedman rank-sum tests confirmed significant overall differences among Cubist, RF, and LightGBM for all four PTFs (p < 0.001; Table 1). Across all PTFs, Cubist achieved the lowest or statistically tied-lowest mean per-sample RMSD, reinforcing its superior generalization capability.
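The Friedman rank-sum test mentioned above can be sketched as follows for three related samples. This is a minimal pure-Python illustration and not the authors' implementation: it omits the tie correction, it relies on the fact that the chi-square distribution with 2 degrees of freedom has the survival function exp(−x/2), and the per-sample RMSD values are hypothetical. In practice a library routine such as `scipy.stats.friedmanchisquare` would normally be preferred.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_three_groups(rows):
    """Friedman statistic, p-value, and mean ranks for k = 3 related samples.

    `rows` holds one (cubist, rf, lightgbm) tuple of per-sample RMSD values per
    soil sample. No tie correction is applied (simplifying assumption).
    """
    n, k = len(rows), 3
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(average_ranks(list(row))):
            rank_sums[j] += r
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    p_value = math.exp(-stat / 2.0)  # chi-square survival function for df = k - 1 = 2
    return stat, p_value, [r / n for r in rank_sums]


# Hypothetical per-sample RMSD values (Cubist, RF, LightGBM) for ten soil samples.
rows = [(1.0 + 0.1 * i, 1.2 + 0.1 * i, 1.5 + 0.1 * i) for i in range(10)]

if __name__ == "__main__":
    stat, p_value, mean_ranks = friedman_three_groups(rows)
    print(f"Friedman statistic = {stat:.2f}, p = {p_value:.2e}, mean ranks = {mean_ranks}")
```

Because the test operates on within-sample ranks, it is insensitive to the few samples with very large K(h) magnitudes that dominate the aggregate RMSD, which is why the two views can disagree.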

These results suggest that Cubist performs competitively in the testing phase for most PTFs, with modest advantages over RF and LightGBM in three of four input configurations. The comparable generalization of Cubist, despite its simpler structure, indicates it can adequately capture the underlying trends in soil hydraulic conductivity. LightGBM, while competitive, typically ranked slightly lower than Cubist and RF in both phases, potentially due to its boosting mechanism being more responsive to minor fluctuations and noise when the dataset size is moderate [53]. A comparison of the predicted hydraulic conductivity curves using the Cubist, RF, and LightGBM algorithms for a representative soil sample from each soil textural class during the testing phase based on PTF4 is presented in Figure 7.

3.3. Decoding Predictive Relationships: SHAP-Based Interpretation of SHCF Parameters

SHAP values quantify the contribution of each feature to model predictions and should be interpreted as model-based measures of feature importance, not as direct evidence of physical causality. Furthermore, the soil properties used as predictors in this study, including FC, PWP, texture, and bulk density, are inherently intercorrelated due to their shared dependence on pore-size distribution and particle packing. SHAP importance rankings may therefore reflect a feature’s role as a proxy for correlated but unmeasured soil attributes, rather than a unique causal effect.

Figure 8 presents the relative importance of five key soil attributes, clay and sand content, BD, FC, and PWP, in predicting the van Genuchten–Mualem model parameters α and n. These rankings are derived from mean absolute SHAP values, which quantify each feature’s contribution to the model output. The SHAP analysis reveals that FC and PWP are the highest-ranking features for both α and n, with consistently high mean absolute SHAP values (Figure 8). This result suggests that water retention characteristics (FC and PWP) are assigned higher model-based importance than basic textural properties (clay and sand) or BD. It should be noted that, while FC and PWP are measured from the same retention curves used to fit α and n, their dominant SHAP importance is not a statistical artifact of circularity: the nonlinear optimization in Equation (1) estimates α and n from the full curve, whereas FC and PWP are single-point measurements at fixed pressures, providing independent information that constrains the curve’s shape at its most hydrologically significant segments. Specifically, FC and PWP were identified as the most important predictors of α (which corresponds to the inverse of air-entry suction) and n (which represents the pore-size distribution), underscoring the pivotal role of water retention metrics in shaping the soil hydraulic parameters [54,55,56]. This predominant influence can be attributed to the integrative nature of FC and PWP: unlike basic textural fractions, which reflect only particle size distribution, water retention at 33 and 1500 kPa inherently captures the combined effects of soil mineralogy, specific surface area, organic matter content, and pore architecture [16,57]. In particular, PWP at 1500 kPa is governed primarily by adsorptive surface forces on clay minerals, making it sensitive to clay mineralogy rather than clay content alone. Consequently, FC and PWP encode the physicochemical determinants of soil hydraulic behavior more directly than texture or BD alone, accounting for their elevated SHAP importance.
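To make the roles of α and n concrete, the sketch below evaluates the widely used closed-form van Genuchten–Mualem expressions [24,25], Se(h) = [1 + (αh)^n]^(−m) and K(h) = Ks·Se^l·[1 − (1 − Se^(1/m))^m]², with m = 1 − 1/n. This is an assumption-laden illustration: the exact form of Equation (1) is not reproduced in this section and may differ (for example, a modified formulation [26]), and the values of α, n, Ks = 1, and l = 0.5 are hypothetical. The suctions of 330 cm and 15,000 cm correspond to FC and PWP under the approximation 1 kPa ≈ 10 cm.

```python
def effective_saturation(h_cm: float, alpha: float, n: float) -> float:
    """van Genuchten effective saturation for suction h >= 0 (cm), alpha in 1/cm, n > 1."""
    m = 1.0 - 1.0 / n
    return (1.0 + (alpha * h_cm) ** n) ** (-m)


def vgm_conductivity(h_cm: float, alpha: float, n: float,
                     k_s: float = 1.0, l: float = 0.5) -> float:
    """Unsaturated conductivity K(h) from the standard Mualem closed form.

    k_s (saturated conductivity) and l (pore-connectivity parameter) are
    illustrative defaults, not values taken from the study.
    """
    m = 1.0 - 1.0 / n
    se = effective_saturation(h_cm, alpha, n)
    return k_s * se ** l * (1.0 - (1.0 - se ** (1.0 / m)) ** m) ** 2


if __name__ == "__main__":
    alpha, n = 0.02, 1.5  # hypothetical parameter values (alpha in 1/cm)
    for label, h in (("saturation", 0.0), ("FC, ~330 cm", 330.0), ("PWP, ~15,000 cm", 15000.0)):
        se = effective_saturation(h, alpha, n)
        k_rel = vgm_conductivity(h, alpha, n)
        print(f"{label}: Se = {se:.4f}, K/Ks = {k_rel:.3e}")
```

In this form, a larger α shifts the drop in K(h) toward lower suctions, while n controls how steeply the curve falls, which is why single-point retention values at 33 and 1500 kPa carry information about both parameters.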

Figure 9 provides a deeper understanding of the relationships between soil properties and the VGM parameters by presenting SHAP dependence plots. These plots depict the detailed, individual effects of each input variable on the soil hydraulic parameters, highlighting not only the magnitude and direction of influence but also the presence of nonlinear behaviors.

Three distinct trends are evident from these visualizations. First, FC and PWP exhibit nonlinear, threshold-like behaviors. Specifically, FC exhibited peak SHAP values for α and n within a moderate moisture content range of 20–30%, while PWP showed the highest SHAP values at low moisture levels, particularly below 10% for n. These findings indicate that the contribution of water retention characteristics to soil hydraulic function is not uniform but varies sharply across moisture ranges. Second, the SHAP dependence patterns for the textural properties clay and sand differ by parameter. For α, the trends are largely linear and opposite in direction: clay was positively associated with α, while sand showed a negative association. However, their model-derived association with n follows a U-shaped relationship, suggesting that texture plays a more complex role in controlling pore-size distribution than in determining air-entry behavior. This duality in texture effects reflects earlier insights from the pedotransfer function modeling literature [27,55]. Lastly, BD consistently shows a negative relationship with both α and n. Soils with higher BD values, especially those exceeding 1.6 g cm⁻³, are associated with lower values of both parameters. This trend supports evidence that compaction reduces pore connectivity and restricts water movement, thereby diminishing hydraulic conductivity [57]. Together, these insights underscore the complex, feature-specific mechanisms that govern soil hydraulic behavior.

A limitation of this study concerns the SHAP interpretation methodology. Because the Cubist algorithm does not have a native TreeSHAP implementation, SHAP values were derived from XGBoost surrogate models rather than directly from the original Cubist, RF, and LightGBM models. While XGBoost surrogates can approximate the original model’s input–output mapping with reasonable fidelity, the SHAP values ultimately reflect the surrogate’s internal decision rules, which may differ from those of the original algorithms, particularly for Cubist, whose rule-based linear-model structure differs fundamentally from gradient-boosted trees. Consequently, the feature importance rankings (Figure 8) and dependence patterns (Figure 9) should be interpreted as indicative rather than definitive. Future studies could address this by employing model-agnostic Shapley methods that operate directly on the original model’s predict function, or by adopting Cubist-specific interpretability tools.
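As an illustration of what a model-agnostic Shapley computation on a predict function involves, the sketch below enumerates all feature coalitions for a five-feature model. It is not the XGBoost-surrogate procedure used in this study: `toy_predict` is a hypothetical stand-in for a fitted model, absent features are replaced by the values of a single baseline sample (practical tools typically average over a background dataset), and exact enumeration is feasible here only because there are five inputs.

```python
from itertools import combinations
from math import factorial


def exact_shapley(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to one `baseline` point.

    Features outside a coalition take their baseline value (simplifying
    assumption). Cost grows as 2**d model evaluations per feature.
    """
    d = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(d)]
        return predict(z)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mean_abs_importance(predict, samples, baseline):
    """Mean absolute Shapley value per feature across samples."""
    totals = [0.0] * len(baseline)
    for x in samples:
        for i, phi in enumerate(exact_shapley(predict, x, baseline)):
            totals[i] += abs(phi)
    return [t / len(samples) for t in totals]


def toy_predict(z):
    """Hypothetical model of [sand, clay, BD, FC, PWP]; not a fitted PTF."""
    sand, clay, bd, fc, pwp = z
    return 0.01 * sand - 0.02 * clay - 0.5 * bd + 2.0 * fc + 3.0 * fc * pwp


if __name__ == "__main__":
    baseline = [40.0, 25.0, 1.5, 0.30, 0.15]
    samples = [[60.0, 15.0, 1.4, 0.25, 0.10], [20.0, 45.0, 1.6, 0.38, 0.22]]
    names = ["sand", "clay", "BD", "FC", "PWP"]
    for name, imp in zip(names, mean_abs_importance(toy_predict, samples, baseline)):
        print(f"{name}: mean |Shapley| = {imp:.4f}")
```

Because the attribution is computed from calls to the predict function alone, the same routine could in principle be applied to a Cubist, RF, or LightGBM model without a surrogate, at the price of a much higher computational cost than TreeSHAP.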

3.4. Contextualization of Cubist Performance Within the Broader PTF Literature

To situate the performance of the Cubist algorithm in the present study within the broader landscape of PTF research, Table 3 provides a contextual overview of reported R² values from selected previous studies that employed machine learning or regression methods for estimating soil hydraulic conductivity. It must be emphasized that the studies listed in Table 3 vary considerably in their database source, sample size, input variable suite, and validation protocol. Consequently, the R² values reported across these studies are not directly comparable; rather, Table 3 is intended to illustrate the range of predictive accuracies achievable under diverse experimental conditions and to position the present results within this spectrum.

Within the specific context of this study (the UNSODA database, with the SHCF parameters α and n as prediction targets), Cubist achieved an R² of 0.96 in the testing phase. This high coefficient of determination suggests that the Cubist model can capture the complex, non-linear relationships between soil properties and hydraulic conductivity. While this value falls at the upper end of the range reported in the literature, it should be interpreted with the understanding that the present study benefits from well-curated input variables and a consistent modeling framework with rigorous cross-validation.

The favorable performance of Cubist observed in this study is consistent with the structural characteristics discussed in Section 3.2: its rule-based piecewise linear framework provides implicit regularization that balances global trend capture with local adaptation, making it well-suited for PTF development where input–output relationships are often heterogeneous [67,68].

The input variables selected in this study, sand, clay, BD, FC, and PWP, are directly related to soil structure, water retention, and porosity, all of which are critical determinants of hydraulic conductivity [69]. The inclusion of FC and PWP proved particularly consequential: as the SHAP analysis in Section 3.3 indicated, these two water-retention variables were consistently the highest-ranking predictors across all PTF configurations, and their inclusion in PTF3 and PTF4 yielded marked improvements in predictive accuracy (Table 1). Previous studies reporting higher R² values, such as Weynants et al. [60], Kashani et al. [61], and Mouaddine et al. [22], also incorporated variables closely tied to water retention, whereas studies with lower accuracies often relied on fewer or less directly relevant predictors [59,65].

3.5. Limitations and Future Work

Several limitations should be considered when interpreting these results. First, the dataset comprised 196 soil samples from the UNSODA database, which is a moderate sample size for machine learning applications. The fold-level standard deviations reported in Table 1 indicate that the observed differences in mean RMSD between algorithms are small relative to cross-validation variability. Consequently, the relative ranking of algorithms should be regarded as indicative rather than conclusive, and performance may vary with different data partitions or sample compositions.
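The comparison between mean differences and fold-level variability can be sketched as follows. The fold-level RMSD values and the choice of five folds are hypothetical and are not taken from Table 1; the ratio computed by `gap_in_sd_units` is an informal yardstick introduced here for illustration, not a statistical test used in the study.

```python
from statistics import mean, stdev

# Hypothetical fold-level testing RMSD values (five folds assumed; not from Table 1).
fold_rmsd = {
    "Cubist": [6.1, 7.4, 6.8, 5.9, 6.3],
    "RF": [6.6, 7.9, 6.5, 6.8, 7.0],
}


def summarize(values):
    """Mean and sample standard deviation of fold-level RMSD values."""
    return mean(values), stdev(values)


def gap_in_sd_units(a, b):
    """Difference in mean RMSD (b minus a) divided by the larger fold-level SD."""
    (mean_a, sd_a), (mean_b, sd_b) = summarize(a), summarize(b)
    return (mean_b - mean_a) / max(sd_a, sd_b)


if __name__ == "__main__":
    for name, values in fold_rmsd.items():
        m, s = summarize(values)
        print(f"{name}: mean RMSD = {m:.3f}, fold SD = {s:.3f}")
    ratio = gap_in_sd_units(fold_rmsd["Cubist"], fold_rmsd["RF"])
    print(f"RF minus Cubist gap = {ratio:.2f} fold-level SDs")
```

A ratio below one, as in this constructed example, indicates that the gap between two algorithms is smaller than the spread across data partitions, which is the situation that warrants treating the ranking as indicative.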

Second, all three algorithms were evaluated using fixed hyperparameters. While these values were selected based on common practice in the literature, they do not represent optimized configurations for the specific dataset. RF and LightGBM, in particular, are known to benefit substantially from hyperparameter tuning, whereas Cubist has fewer tunable parameters. The fixed-hyperparameter design may therefore underestimate the potential performance of RF and LightGBM relative to Cubist.

Third, all models were developed and validated using the UNSODA database without external validation on independent soil datasets. The textural diversity of the UNSODA samples provides reasonable internal representativeness, but the models’ transferability to soils from different geographic regions, parent materials, or management histories remains untested. Future studies should incorporate external datasets, systematic hyperparameter optimization, and larger sample sizes to more conclusively assess the comparative merits of tree-based algorithms for SHCF estimation.

4. Conclusions

This study demonstrated that tree-based machine learning algorithms can provide effective solutions for estimating the parameters of the van Genuchten–Mualem model. Among the three algorithms evaluated, Cubist showed competitive testing performance, achieving slightly lower RMSD than RF and LightGBM in three of four PTFs. The SHAP analysis, based on XGBoost surrogates, identified FC and PWP as the most influential predictors of the model outputs, ranking above traditional predictors such as texture and bulk density. This result is consistent with their known physical role in controlling soil water behavior, but it reflects model-based feature importance and does not necessarily imply physical causality, given the inherent correlations among soil hydraulic properties; the exact importance rankings may also be sensitive to the surrogate approximation. This highlights that, although soil texture and bulk density establish the fundamental hydraulic framework, changes in water retention characteristics, driven by land use, management practices, or climatic conditions, can substantially impact the α and n parameters. These observations suggest that FC and PWP are potentially important variables for characterizing soil water behavior, though confirmation with model-specific interpretation methods and larger datasets is warranted. Given these findings, future research should focus on developing hybrid and rule-based models, in combination with carefully selected and management-sensitive input variables, to further improve the robustness and applicability of SHCF estimations. This approach holds promise for improving irrigation planning, soil–water modeling, and sustainable agricultural water management.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agronomy16111116/s1.

Author Contributions

Methodology, P.W. and M.R.; writing—original draft, P.W. and M.R.; writing—review & editing, Z.Q., Y.H. and Q.J.; supervision, Y.H. and Q.J.; project administration, Q.J.; funding acquisition, Y.H. and Q.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (grant number: 2024YFD2301100), the National Natural Science Foundation of China (grant number: 32271980) and the Key Pioneer Research Project of Zhejiang Province (grant number: 2022C02014).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Scarfone, R.; Wheeler, S.J.; Lloret-Cabot, M. A hysteretic hydraulic constitutive model for unsaturated soils and application to capillary barrier systems. Geomech. Energy Environ. 2022, 30, 100224.
2. Fuentes, C.; Chávez, C.; Brambila, F. Relating Hydraulic Conductivity Curve to Soil-Water Retention Curve Using a Fractal Model. Mathematics 2020, 8, 2201.
3. Ontman, R.; Groffman, P.M.; Driscoll, C.T.; Cheng, Z. Surprising relationships between soil pH and microbial biomass and activity in a northern hardwood forest. Biogeochemistry 2023, 163, 265–277.
4. Dahunsi, J.; Pathirana, S.; Cheema, M.; Krishnapillai, M.; Galagedara, L. Estimating soil hydraulic conductivity from time-lapse ground-penetrating radar data in podzolic soils using the green-ampt model. J. Hydrol. 2025, 657, 133059.
5. Rastgou, M.; He, Y.; Jiang, Q. Implementation and efficient evaluation of backpropagation network training algorithms in parametric simulations of soil hydraulic conductivity curve. J. Hydrol. 2024, 636, 131302.
6. Cui, L.-X.; Cheng, Q.; So, P.S.; Tang, C.-S.; Tian, B.-G.; Li, C.-Y. Relationship between root characteristics and saturated hydraulic conductivity in a grassed clayey soil. J. Hydrol. 2024, 645, 132231.
7. Yuliana, Y.; Apriyono, A.; Kamchoom, V.; Boldrin, D.; Cheng, Q.; Tang, C.-S. Seasonal dynamics of root growth and desiccation cracks and their effects on soil hydraulic conductivity. Eng. Geol. 2025, 349, 107973.
8. Wang, Y.; Ma, R.; Zhu, G. Improved Prediction of Hydraulic Conductivity with a Soil Water Retention Curve That Accounts for Both Capillary and Adsorption Forces. Water Resour. Res. 2022, 58, e2021WR031297.
9. Albalasmeh, A.; Mohawesh, O.; Gharaibeh, M.; Deb, S.; Slaughter, L.; El Hanandeh, A. Artificial neural network optimization to predict saturated hydraulic conductivity in arid and semi-arid regions. CATENA 2022, 217, 106459.
10. Tong, Y.; Wang, Y.; Zhou, J.; Guo, X.; Wang, T.; Xu, Y.; Sun, H.; Zhang, P.; Li, Z.; Lauerwald, R. Expanding scales: Achieving prediction of van Genuchten model hydraulic parameters in deep profiles by incorporating broad in situ soil information in pedotransfer functions. J. Hydrol. 2025, 656, 132912.
11. Tartakovsky, A.M.; Marrero, C.O.; Perdikaris, P.; Tartakovsky, G.D.; Barajas-Solano, D. Physics-Informed Deep Neural Networks for Learning Parameters and Constitutive Relationships in Subsurface Flow Problems. Water Resour. Res. 2020, 56, e2019WR026731.
12. Arbor, A.; Schmidt, M.; Zhang, J.; Bulmer, C.; Filatow, D.; Kasraei, B.; Smukler, S.; Heung, B. Filling the gaps in soil data: A multi-model framework for addressing data gaps using pedotransfer functions and machine-learning with uncertainty estimates to estimate bulk density. CATENA 2024, 245, 108310.
13. Bayat, H.; Sedaghat, A.; Sinegani, A.A.S.; Gregory, A.S. Investigating the relationship between unsaturated hydraulic conductivity curve and confined compression curve. J. Hydrol. 2015, 522, 353–368.
14. Doussan, C.; Ruy, S. Prediction of unsaturated soil hydraulic conductivity with electrical conductivity. Water Resour. Res. 2009, 45, W10408.
15. Sedaghat, A.; Bayat, H.; Sinegani, A.S. Estimation of soil saturated hydraulic conductivity by artificial neural networks ensemble in smectitic soils. Eurasian Soil Sci. 2016, 49, 347–357.
16. Vereecken, H.; Weynants, M.; Javaux, M.; Pachepsky, Y.; Schaap, M.; Genuchten, M.T. Using pedotransfer functions to estimate the van Genuchten–Mualem soil hydraulic properties: A review. Vadose Zone J. 2010, 9, 795–820.
17. Rastgou, M.; Bayat, H.; Mansoorizadeh, M.; Gregory, A.S. Estimating the soil water retention curve: Comparison of multiple nonlinear regression approach and random forest data mining technique. Comput. Electron. Agric. 2020, 174, 105502.
18. Lee, J.; Cho, J.; Justice, D.; Kim, S. Towards Explainability of Classical Neural Network via Quantum Computing. In 2024 IEEE International Conference on Quantum Computing and Engineering (QCE); IEEE: New York, NY, USA, 2024; pp. 398–399.
19. Farasati, M.; Seyedian, M.; Fathaabadi, A. Predicting soil hydraulic conductivity using random forest, SVM, and LSSVM models. Nat. Resour. Model. 2024, 37, e12407.
20. Sihag, P.; Mohsenzadeh Karimi, S.; Angelaki, A. Random forest, M5P and regression analysis to estimate the field unsaturated hydraulic conductivity. Appl. Water Sci. 2019, 9, 129.
21. Veloso, M.F.; Rodrigues, L.N.; Filho, E.I.F. Evaluation of machine learning algorithms in the prediction of hydraulic conductivity and soil moisture at the Brazilian Savannah. Geoderma Reg. 2022, 30, e00569.
22. Mouaddine, A.; Barakat, A.; Hajaj, S.; Mosaid, H.; Bouzekraoui, H.; Bni, Z.; Hilali, A. Predicting and mapping soil saturated hydraulic conductivity in the Beni Moussa irrigated perimeter (Tadla Plain, Morocco) using Random Forest machine learning model. Model. Earth Syst. Environ. 2025, 11, 82.
23. Nemes, A.; Schaap, M.; Leij, F.; Wösten, J. Description of the unsaturated soil hydraulic database UNSODA version 2.0. J. Hydrol. 2001, 251, 151–162.
24. van Genuchten, M.T. A closed-form equation for predicting the hydraulic conductivity of unsaturated soils. Soil Sci. Soc. Am. J. 1980, 44, 892–898.
25. Mualem, Y. A new model for predicting the hydraulic conductivity of unsaturated porous media. Water Resour. Res. 1976, 12, 513–522.
26. Schaap, M.G.; Van Genuchten, M.T. A modified Mualem-van Genuchten formulation for improved description of the hydraulic conductivity near saturation. Vadose Zone J. 2006, 5, 27–34.
27. Wösten, J.; Pachepsky, Y.A.; Rawls, W. Pedotransfer functions: Bridging the gap between available basic soil data and missing soil hydraulic characteristics. J. Hydrol. 2001, 251, 123–150.
28. Zhang, Y.; Schaap, M.G. Weighted recalibration of the Rosetta pedotransfer model with improved estimates of hydraulic parameter distributions and summary statistics (Rosetta3). J. Hydrol. 2017, 547, 39–53.
29. Mazziotta, M.; Pareto, A. Normalization methods for spatio-temporal analysis of environmental performance: Revisiting the Min–Max method. Environmetrics 2022, 33, e2730.
30. Breiman, L.; Adele, C.; Andy, L.; Wiener, M. randomForest: Breiman and Cutlers Random Forests for Classification and Regression, Version 4.7-1.2; CRAN: Vienna, Austria, 2024.
31. Yang, K.; Wang, J.; Zhao, G.; Wang, X.; Cong, W.; Yuan, M.; Luo, J.; Dong, X.; Wang, J.; Tao, J. NIDS-CNNRF integrating CNN and random forest for efficient network intrusion detection model. Internet Things 2025, 32, 101607.
32. Zhou, R.; Chen, J.; Cui, S.; Li, L.; Qian, J.; Zhao, H.; Huang, G. A data-driven framework to identify influencing factors for soil heavy metal contaminations using random forest and bivariate local Moran’s I: A case study. J. Environ. Manag. 2025, 375, 124172.
33. Probst, P.; Boulesteix, A.-L. To tune or not to tune the number of trees in random forest. J. Mach. Learn. Res. 2018, 18, 6673–6690.
34. Butler, B.M. Mineral sources of aqua regia extractable base cations in Scottish soils interpreted from Cubist models trained with quantitative mineralogy data. Chem. Geol. 2020, 551, 119773.
35. Pouladi, N.; Møller, A.B.; Tabatabai, S.; Greve, M.H. Mapping soil organic matter contents at field level with Cubist, Random Forest and kriging. Geoderma 2019, 342, 85–92.
36. Thai, H.-T. Machine learning for structural engineering: A state-of-the-art review. In Structures; Elsevier: Amsterdam, The Netherlands, 2022; pp. 448–491.
37. Yin, Z.; Jiao, J.; Xie, P.; Luo, H.; Wei, L. Multi-objective optimization control for shield cutter wear and cutting performance using LightGBM and enhanced NSGA-II. Autom. Constr. 2025, 171, 105957.
38. Marcot, B.G.; Hanea, A.M. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput. Stat. 2021, 36, 2009–2031.
39. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
40. Kuhn, M.; Steve, W.; Chris, K.; Coulter, N.; Quinlan, R.; Rulequest Research Pty Ltd. Cubist: Rule- and Instance-Based Regression Modeling, Version 0.6.0; CRAN: Vienna, Austria, 2025.
41. Shi, Y.; Guolin, K.; Damien, S.; James, L.; Qi, M.; Thomas, F.; Taifeng, W.; Wei, C.; Weidong, M.; Qiwei, Y.; et al. Lightgbm: Light Gradient Boosting Machine, Version 4.6.0; CRAN: Vienna, Austria, 2025. Available online: https://github.com/Microsoft/LightGBM (accessed on 1 June 2026).
42. Han, J.; Guzman, J.A.; Chu, M.L. Prediction of gully erosion susceptibility through the lens of the SHapley Additive exPlanations (SHAP) method using a stacking ensemble model. J. Environ. Manag. 2025, 383, 125478.
43. Ungaro, F.; Calzolari, C.; Busoni, E. Development of pedotransfer functions using a group method of data handling for the soil of the Pianura Padano–Veneta region of North Italy: Water retention properties. Geoderma 2005, 124, 293–317.
44. Wadoux, A.M.J.C.; Walvoort, D.J.J.; Brus, D.J. An integrated approach for the evaluation of quantitative soil maps through Taylor and solar diagrams. Geoderma 2022, 405, 115332.
45. Pachepsky, Y.A.; Rawls, W.; Lin, H. Hydropedology and pedotransfer functions. Geoderma 2006, 131, 308–316.
46. Parhi, S.K.; Patro, S.K. Compressive strength prediction of PET fiber-reinforced concrete using Dolphin echolocation optimized decision tree-based machine learning algorithms. Asian J. Civ. Eng. 2024, 25, 977–996.
47. Probst, P.; Wright, M.N.; Boulesteix, A.L. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1301.
48. Barreñada, L.; Dhiman, P.; Timmerman, D.; Boulesteix, A.-L.; Van Calster, B. Understanding overfitting in random forest for probability estimation: A visualization and simulation study. Diagn. Progn. Res. 2024, 8, 14.
49. Zhou, Z.-H. Ensemble Learning. In Machine Learning; Zhou, Z.-H., Ed.; Springer: Singapore, 2021; pp. 181–210.
50. Chen, J.; He, Y.; Liang, Y.; Wang, W.; Duan, X. Estimation of gross calorific value of coal based on the cubist regression model. Sci. Rep. 2024, 14, 23176.
51. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013; Volume 26.
52. Quinlan, J.R. Combining instance-based and model-based learning. In Proceedings of the Tenth International Conference on Machine Learning; Elsevier: Amsterdam, The Netherlands, 1993; pp. 236–243.
53. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017.
54. Rastgou, M.; Bayat, H.; Mansoorizadeh, M.; Gregory, A.S. Estimating Soil Water Retention Curve by Extreme Learning Machine, Radial Basis Function, M5 Tree and Modified Group Method of Data Handling Approaches. Water Resour. Res. 2022, 58, e2021WR031059.
55. Schaap, M.G.; Leij, F.J.; van Genuchten, M.T. Rosetta: A computer program for estimating soil hydraulic parameters with hierarchical pedotransfer functions. J. Hydrol. 2001, 251, 163–176.
56. Touil, S.; Degre, A.; Chabaca, M.N. Sensitivity analysis of point and parametric pedotransfer functions for estimating water retention of soils in Algeria. Soil 2016, 2, 647.
57. Horn, R.; Smucker, A. Structure formation and its consequences for gas and water transport in unsaturated arable and forest soils. Soil Tillage Res. 2005, 82, 5–14.
58. Schaap, M.G.; Leij, F.J. Improved prediction of unsaturated hydraulic conductivity with the Mualem-van Genuchten model. Soil Sci. Soc. Am. J. 2000, 64, 843–851.
59. Merdun, H.; Çınar, Ö.; Meral, R.; Apan, M. Comparison of artificial neural network and regression pedotransfer functions for prediction of soil water retention and saturated hydraulic conductivity. Soil Tillage Res. 2006, 90, 108–116.
60. Weynants, M.; Vereecken, H.; Javaux, M. Revisiting Vereecken pedotransfer functions: Introducing a closed-form hydraulic model. Vadose Zone J. 2009, 8, 86–95.
61. Kashani, M.H.; Ghorbani, M.A.; Shahabi, M.; Naganna, S.R.; Diop, L. Multiple AI model integration strategy—Application to saturated hydraulic conductivity prediction from easily available soil properties. Soil Tillage Res. 2020, 196, 104449.
62. Azadmard, B.; Mosaddeghi, M.R.; Ayoubi, S.; Chavoshi, E.; Raoof, M. Estimation of near-saturated soil hydraulic properties using hybrid genetic algorithm-artificial neural network. Ecohydrol. Hydrobiol. 2020, 20, 437–449.
63. Williams, C.G.; Ojuri, O.O. Predictive modelling of soils’ hydraulic conductivity using artificial neural network and multiple linear regression. SN Appl. Sci. 2021, 3, 152.
64. Granata, F.; Di Nunno, F.; Modoni, G. Hybrid Machine Learning Models for Soil Saturated Conductivity Prediction. Water 2022, 14, 1729.
65. Adjuik, T.A.; Nokes, S.E.; Montross, M.; Sama, M.P.; Wendroth, O. Predictor Selection and Machine Learning Regression Methods to Predict Saturated Hydraulic Conductivity from a Large Public Soil Database. J. ASABE 2023, 66, 285–296.
66. Moosavi, A.A.; Nematollahi, M.A.; Omidifard, M. Comparing machine learning approaches for estimating soil saturated hydraulic conductivity. PLoS ONE 2024, 19, e0310622.
67. Demir, S.; Sahin, E.K. The effectiveness of data pre-processing methods on the performance of machine learning techniques using RF, SVR, Cubist and SGB: A study on undrained shear strength prediction. Stoch. Environ. Res. Risk Assess. 2024, 38, 3273–3290.
68. Nguyen, H.; Bui, X.-N.; Tran, Q.-H.; Mai, N.-L. A new soft computing model for estimating and controlling blast-produced ground vibration based on Hierarchical K-means clustering and Cubist algorithms. Appl. Soft Comput. 2019, 77, 376–386.
69. Schaap, M.G.; Leij, F.J.; van Genuchten, M.T. Neural network analysis for hierarchical prediction of soil hydraulic properties. Soil Sci. Soc. Am. J. 1998, 62, 847–855.

Figure 1. Distribution of fitting accuracy metrics for the van Genuchten model applied to 196 UNSODA datasets. Boxplots show quartiles and medians; violin shapes indicate data density. Red diamonds indicate mean value.

Figure 2. A visual representation of the random forest (RF), Cubist, and light gradient-boosting machine (LightGBM) algorithms.

Figure 3. Probability density distributions of soil hydraulic parameters and their predictors. The blue curves represent the empirical distributions of the observed data, while the red dashed curves show the theoretical normal distributions with matching mean and standard deviation. Shaded areas under each curve highlight the probability densities. Statistical moments (mean, standard deviation, skewness, and kurtosis) are provided for each parameter: sand content (%), clay content (%), bulk density (BD, g cm⁻³), field capacity (FC, cm³ cm⁻³), permanent wilting point (PWP, cm³ cm⁻³), and VGM model parameters (α, cm⁻¹ and n, dimensionless).

Figure 4. Distribution of 196 soil samples on USDA soil texture triangle. Cl, Clay; SaCl, Sandy Clay; SiCl, Silty Clay; ClLo, Clay Loam; SiClLo, Silty Clay Loam; SaClLo, Sandy Clay Loam; Sa, Sand; LoSa, Loamy Sand; SaLo, Sandy Loam; Lo, Loam; SiLo, Silty Loam; Si, Silt.

Figure 5. Taylor diagram comparing the performance of Cubist, random forest (RF), and LightGBM algorithms in the training phase for estimating soil hydraulic conductivity function (SHCF) using different pedotransfer functions (PTFs). The diagram displays the algorithms’ correlation, standard deviation, and root mean square error (represented by the brown dashed line) relative to the reference data. Closer proximity to the reference point indicates better algorithm performance. Blue, orange, and green markers denote RF, Cubist, and LightGBM algorithms, respectively.

Figure 6. Taylor diagram comparing the performance of Cubist, random forest (RF), and LightGBM algorithms in the testing phase for estimating soil hydraulic conductivity function (SHCF) using different pedotransfer functions (PTFs). The diagram displays the algorithms’ correlation, standard deviation, and root mean square error (represented by the brown dashed line) relative to the reference data. Closer proximity to the reference point indicates better algorithm performance. Blue, orange, and green markers denote RF, Cubist, and LightGBM algorithms, respectively.
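The three quantities shown on a Taylor diagram are linked by a law-of-cosines relation, which is why they can share one plot. The sketch below assumes the conventional construction, in which the plotted error is the centered root mean square difference E′ with E′² = σ_p² + σ_r² − 2σ_pσ_rR; whether Figures 5 and 6 follow exactly this convention is an assumption. The function name `taylor_stats` and the arrays `obs` and `pred` are hypothetical.

```python
import numpy as np

def taylor_stats(reference, predicted):
    """Return (sd_ref, sd_pred, correlation, centered RMS difference)."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sd_ref = reference.std()      # population SD of the reference data
    sd_pred = predicted.std()     # population SD of the predictions
    corr = np.corrcoef(reference, predicted)[0, 1]
    # Centered RMS difference: remove each series' mean before differencing.
    diff = (predicted - predicted.mean()) - (reference - reference.mean())
    crmsd = np.sqrt(np.mean(diff ** 2))
    return sd_ref, sd_pred, corr, crmsd

# Hypothetical values, for illustration only.
obs = np.array([0.5, 1.2, 3.4, 8.0, 15.0])
pred = np.array([0.7, 1.0, 3.9, 7.1, 16.2])

sd_ref, sd_pred, corr, crmsd = taylor_stats(obs, pred)
# The same error recovered from the two SDs and the correlation.
law_of_cosines = np.sqrt(sd_ref**2 + sd_pred**2 - 2.0 * sd_ref * sd_pred * corr)
```

A marker's distance from the reference point equals `crmsd`, so a high correlation combined with a standard deviation close to that of the reference data places an algorithm nearest the reference.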

Figure 7. Predicted soil hydraulic conductivity curves from Cubist, random forest (RF), and LightGBM algorithms for a representative soil sample from each soil textural class during the testing phase based on PTF4. Suction is displayed in cm on a logarithmic scale (1 kPa ≈ 10 cm); the modeled suction range is 0–1500 kPa (≈0–15,000 cm), and the axis extension to 100,000 cm serves visual continuity on the log scale.
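Curves of this kind can be generated from the VGM parameters once suction is expressed in cm. The sketch below uses the standard van Genuchten–Mualem closed form with m = 1 − 1/n and the conventional pore-connectivity value l = 0.5, together with the caption's approximation 1 kPa ≈ 10 cm. The paper's own equations are not reproduced in this section, so the exact form and the value of l are assumptions, and the parameter values (`ks`, `alpha`, `n`) are hypothetical rather than fitted values from the study.

```python
import numpy as np

def kpa_to_cm(kpa):
    """Convert suction from kPa to cm of water using 1 kPa ~ 10 cm."""
    return 10.0 * np.asarray(kpa, dtype=float)

def vgm_conductivity(h_cm, ks, alpha, n, l=0.5):
    """Mualem-van Genuchten K(h).

    h_cm  : suction in cm, given as a non-negative number
    ks    : saturated hydraulic conductivity (cm/day)
    alpha : VGM alpha parameter (1/cm)
    n     : VGM shape parameter (> 1)
    l     : pore-connectivity parameter (0.5 assumed here)
    """
    h = np.asarray(h_cm, dtype=float)
    m = 1.0 - 1.0 / n
    se = (1.0 + (alpha * h) ** n) ** (-m)   # effective saturation
    return ks * se ** l * (1.0 - (1.0 - se ** (1.0 / m)) ** m) ** 2

# Hypothetical parameters, chosen only to exercise the function.
suction_cm = kpa_to_cm([0.0, 1.0, 10.0, 100.0, 1500.0])
k = vgm_conductivity(suction_cm, ks=25.0, alpha=0.02, n=1.6)
```

At zero suction the function returns `ks`, and conductivity falls by several orders of magnitude toward 15,000 cm, which is why the figure uses a logarithmic suction axis.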

Figure 8. Importance of input variables (clay, sand, BD, FC, and PWP) on the VGM model parameters α and n, based on mean absolute SHAP values. Higher SHAP values indicate greater feature influence on model predictions.
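The importance score in this figure is the mean of the absolute SHAP values of each feature across samples. A minimal sketch of that aggregation follows, using NumPy only. The SHAP matrix is hypothetical illustrative data, not output from the study's models, and producing real SHAP values would require a fitted model and an explainer, which are outside this sketch.

```python
import numpy as np

features = ["clay", "sand", "BD", "FC", "PWP"]

# Hypothetical SHAP values: one row per soil sample, one column per feature.
shap_values = np.array([
    [ 0.02, -0.05,  0.01,  0.20, -0.10],
    [-0.04,  0.03, -0.02, -0.30,  0.15],
    [ 0.01, -0.01,  0.03,  0.25, -0.20],
])

# Mean absolute SHAP value per feature (the bar height in an importance plot).
importance = np.abs(shap_values).mean(axis=0)

# Features ordered from most to least influential.
ranking = [features[i] for i in np.argsort(importance)[::-1]]
```

Taking the absolute value before averaging prevents positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions still ranks as influential.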

Figure 9. SHAP dependence plots for α and n in relation to input variables. Each dot represents a soil sample. The x-axis denotes the measured value of the feature, while the y-axis shows the SHAP value. Red lines indicate thresholds or nonlinear transitions in influence.

Table 1. Comparison of Cubist, random forest (RF), and LightGBM algorithms in the training and testing phases across four pedotransfer functions (PTFs), based on root mean squared deviation (RMSD), coefficient of determination (R²), and 10-fold cross-validation stability metrics.

PTF	Algorithm	Training RMSD ± SD	Training R² ± SD	Testing Per-Sample RMSD ± SD	Testing R² ± SD	Testing Per-Sample MAE ± SD	Testing 10-Fold CV RMSD ± SD
PTF1	CU	6.02 ± 11.17	0.96 ± 0.08	8.00 ± 18.13	0.95 ± 0.10	1.75 ± 3.78	7.89 ± 3.84
	RF	4.84 ± 11.73	0.98 ± 0.05	8.63 ± 19.55	0.95 ± 0.10	2.59 ± 5.39	8.45 ± 3.58
	LGBM	6.65 ± 14.72	0.97 ± 0.07	8.62 ± 19.78	0.95 ± 0.10	1.27 ± 4.37	8.44 ± 4.38
PTF2	CU	6.72 ± 16.18	0.96 ± 0.10	6.92 ± 15.75	0.96 ± 0.09	1.56 ± 3.35	6.90 ± 3.30
	RF	4.74 ± 11.62	0.98 ± 0.04	7.91 ± 18.48	0.95 ± 0.10	2.29 ± 5.29	7.80 ± 3.84
	LGBM	6.28 ± 13.73	0.97 ± 0.07	8.59 ± 18.50	0.95 ± 0.10	1.60 ± 4.18	8.50 ± 4.55
PTF3	CU	6.05 ± 15.18	0.97 ± 0.08	7.25 ± 17.69	0.96 ± 0.01	1.22 ± 2.81	7.22 ± 4.56
	RF	4.06 ± 10.58	0.99 ± 0.04	6.91 ± 16.73	0.96 ± 0.09	2.32 ± 5.91	6.87 ± 2.95
	LGBM	4.93 ± 11.63	0.98 ± 0.06	7.66 ± 17.56	0.97 ± 0.09	1.40 ± 4.45	7.58 ± 3.63
PTF4	CU	5.57 ± 13.14	0.97 ± 0.05	6.49 ± 12.53	0.96 ± 0.08	1.06 ± 2.50	6.43 ± 3.55
	RF	3.89 ± 9.89	0.99 ± 0.03	6.96 ± 16.64	0.96 ± 0.09	2.22 ± 5.43	6.76 ± 3.82
	LGBM	4.66 ± 10.62	0.98 ± 0.04	7.41 ± 15.97	0.97 ± 0.09	1.38 ± 3.32	7.33 ± 3.53

PTF: pedotransfer function; CU: Cubist; RF: random forest; LGBM: light gradient-boosting machine; RMSD: root mean squared deviation; MAE: mean absolute error; SD: standard deviation; R²: coefficient of determination. Training metrics are the mean ± SD across the 10 cross-validation training folds. Testing metrics are computed from pooled K(h) predictions over the full suction range per Equations (9) and (10). The 10-Fold CV column reports the mean and standard deviation of the 10 per-fold mean RMSD values in the testing phase. Per-sample RMSD values were computed from individual test-phase K(h) predictions, and per-sample MAE was computed per sample over the same suction range as RMSD.
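The summary figures quoted in the abstract can be checked against the testing-phase per-sample RMSD column of Table 1. The sketch below transcribes those twelve values and recomputes the four-PTF averages and the PTF4 relative differences; the helper names are hypothetical. With the rounded table entries, the LightGBM average comes out as 8.07 rather than the 8.068 given in the abstract, a gap that presumably reflects unrounded values in the original calculation.

```python
# Testing-phase per-sample RMSD values transcribed from Table 1 (PTF1..PTF4).
rmsd_test = {
    "Cubist":   [8.00, 6.92, 7.25, 6.49],
    "RF":       [8.63, 7.91, 6.91, 6.96],
    "LightGBM": [8.62, 8.59, 7.66, 7.41],
}

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def relative_reduction(baseline, candidate):
    """Percentage by which `candidate` is lower than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline

# Average testing RMSD across the four PTFs for each algorithm.
averages = {name: mean(vals) for name, vals in rmsd_test.items()}

# PTF4 is the last entry in each list.
ptf4 = {name: vals[-1] for name, vals in rmsd_test.items()}
vs_rf = relative_reduction(ptf4["RF"], ptf4["Cubist"])
vs_lgbm = relative_reduction(ptf4["LightGBM"], ptf4["Cubist"])

print({k: round(v, 3) for k, v in averages.items()})
print(round(vs_rf, 1), round(vs_lgbm, 1))
```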

Table 2. Mean bias (predicted − observed K(h), cm day⁻¹ ± SD) across 196 test samples.

PTF	Cubist	RF	LightGBM
PTF1	−0.54 ± 4.13	0.12 ± 5.98	0.13 ± 4.55
PTF2	−0.19 ± 3.69	0.60 ± 5.73	−0.21 ± 4.47
PTF3	0.16 ± 3.05	1.07 ± 6.26	0.54 ± 4.63
PTF4	−0.01 ± 2.63	0.94 ± 5.76	0.24 ± 3.59

Table 3. Contextual overview of reported R² values from selected studies employing machine learning and regression methods for soil hydraulic conductivity estimation.

Study	Database-Location	Number of Samples	Input Variables	Algorithm	R²
[58]	UNSODA	235	θ_r, θ_s, α, n, K_s	ANN	0.64
[59]	Turkey	276	PSD, BD, and DPS	MLR	0.64
[60]	Belgium	166	ST, BD, and OC	MLR	0.91
[20]	India	20	ST, BD, and MC	RF	0.82
[61]	Iran	245	ST, BD, OM, EC, and pH	ANN	0.91
[62]	Iran	212	PSD, OM, EC, and BD	ANN	0.80
[63]	-	144	PSD, PI, MDD, and MC	ANN	0.95
[21]	Brazil	188	PSD, BD, PD, P, and PWP	RF	0.41
[64]	SWIG	640	ST, OM, BD, d_g, σ_g, and θ_s	ANN, RF and SVR	0.83
[65]	FSCD	4686	ST and BD	RF	0.71
[19]	Iran	-	ST, EC, pH, P, BD, and SAR	RF	0.89
[66]	Iran	25	θ_s, θ_i, BD, EC, pH, CCE, MWD, and GMD	ANN	0.92
[22]	Morocco	72	ST, MC, BD, P, OM, OC, and CC	RF	0.94
This study	UNSODA	196	ST, BD, FC, and PWP	Cubist	0.96

UNSODA: unsaturated soil hydraulic database; SWIG: soil water infiltration global database; FSCD: Florida soil characterization database; MLR: multiple linear regression; ANN: artificial neural network; RF: random forest; SVR: support vector regression; ST: soil texture; BD: bulk density; FC: field capacity; PWP: permanent wilting point; OM: organic matter; PSD: particle size distribution; EC: electrical conductivity; θ_r, θ_s, α, n, K_s: VGM parameters; MWD: mean weight diameter; GMD: geometric mean diameter; CCE: calcium carbonate equivalent; θ_i: initial water content; DPS: different pore sizes; MC: moisture content; PD: particle density; P: porosity; d_g: mean soil particle diameter; σ_g: standard deviation of soil particle diameter; OC: organic carbon; CC: calcium carbonate; MDD: maximum dry density; PI: plasticity index; SAR: sodium adsorption ratio.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, P.; Rastgou, M.; Qi, Z.; Jiang, Q.; He, Y. Parametric Modeling of the Unsaturated Soil Hydraulic Conductivity Function Using Tree-Based and Ensemble Machine Learning Algorithms: A Comparative Analysis of Cubist, Random Forest, and LightGBM. Agronomy 2026, 16, 1116. https://doi.org/10.3390/agronomy16111116

