remote sensing Global Mapping of Soil Water Characteristics Parameters— Fusing Curated Data with Machine Learning and Environmental Covariates

: Hydrological and climatic modeling of near-surface water and energy ﬂuxes is critically dependent on the availability of soil hydraulic parameters. Key among these parameters is the soil water characteristic curve (SWCC), a function relating soil water content ( θ ) to matric potential ( ψ ). The direct measurement of SWCC is laborious, hence, reported values of SWCC are spatially sparse and usually have only a small number of data pairs ( θ , ψ ) per sample. Pedotransfer function (PTF) models have been used to correlate SWCC with basic soil properties, but evidence suggests that SWCC is also shaped by vegetation-promoted soil structure and climate-modiﬁed clay minerals. To capture these effects in their spatial context, a machine learning framework (denoted as Covariate-based GeoTransfer Functions, CoGTFs) was trained using (a) a novel and comprehensive global dataset of SWCC parameters and (b) global maps of environmental covariates and soil properties at 1 km spatial resolution. Two CoGTF models were developed: one model (CoGTF-1) was based on predicted soil covariates because measured soil data are not generally available, and the other (CoGTF-2) used measured soil properties to model SWCC parameters. The spatial cross-validation of CoGTF-1 resulted, for the predicted van Genuchten SWCC parameters, in concordance correlation coefﬁcients (CCC) of 0.321–0.565. To validate the resulting global maps of SWCC parameters and to compare the CoGTF framework to two pedotransfer functions from the literature, the predicted water contents at 0.1 m, 3.3 m, and 150 m matric potential were evaluated. The accuracy metrics for CoGTF were considerably better than PTF-based maps. learning algorithms have been used to produce global maps of soil hydraulic properties based on well-curated spatially distributed measurements and environmental covariates. In this study, we have demonstrated the beneﬁt of (i) improving coverage of SWCC and the quality of the estimated parameters, (ii) better coverage of information from all climatic regions for ML training, and (iii) including local covariates such as climate, terrain, and vegetation metrics in the prediction. We have shown quantitatively (based on CCC, R 2 , RMSE, and BIAS values) that our new parameter maps perform better in predicting soil water retention compared to previous maps based on PTFs. We highlight that despite the possibility of calibrating a model with measured covariate values (e.g., soil texture), predicted values should be the default for the calibration step if measurements are not available at a global scale (i.e., for each pixel). One of the primary contributors to the improved predictions is the step increase in the number of SWCC data sets and the rigor by which the parameters have been derived. We conclude that including information based on local covariates implicitly injects the effects of soil formation processes on soil hydraulic properties (an easy to implement alternative to the explicit soil structure modeling reported in Fatichi et al. [6] or Bonetti et al. [7]).


Introduction
The quantification of hydrological processes in the soil unsaturated zone (infiltration, runoff, drainage, evaporation, and water storage) is critically dependent on the quality of parameters of the soil water characteristics curve (SWCC) describing the relationship between soil water content (θ) and matric potential (ψ) [1][2][3]. Computing climatic predictions by contemporary land surface models (LSMs) requires highly resolved maps of SWCC parameters for the terrestrial surface of the Earth.
This rapidly expanding need for spatially exhaustive and highly resolved maps of SWCC parameters has prompted the use of readily available soil information such as soil texture to estimate parameters of SWCCs by pedotransfer functions (PTFs) [4,5]. PTFs typically use soil texture, bulk density, and organic content to correlate with SWCC parameter values. A major shortcoming of these PTFs is the omission of the role that soil structure (aggregation of soil particles, biopores formed by plant roots and earthworms) exerts on surface fluxes [6,7]. PTFs are often trained using data on agricultural soils and are tailored for a specific region or country, and cannot be transferred to regions with different land-cover and soil-forming processes [8,9].
To overcome some of these inherent limitations in the estimation of soil hydraulic properties, new approaches that harness local information on soil-forming processes have been recently proposed to map the saturated hydraulic conductivity (Ksat) by linking Ksat measurements to environmental covariates on vegetation, climate, and topography using a machine learning algorithm [10]. Likewise, Chaney et al. [11] developed the maps of soil hydraulic properties (POLARIS dataset) at 30 m resolution using remote sensing covariates for the United States using the SSUGRO (Soil Survey Geographic) database. This study uses a similar approach and improves the spatial maps of SWCC parameters by injecting information on vegetation, topography, and climate along with spatial information on soil properties into SWCC mapping using a machine learning algorithm. To differentiate this approach from previous PTFs, we use the term Covariate-based GeoTransfer Functions (CoGTFs) [10].
The objectives of this study are: 1.
To link an extensive dataset of SWCC parameters to environmental covariates using the CoGTF framework to develop global maps of van Genuchten (vG) parameters.

2.
To compare the CoGTF derived parameter maps with earlier published maps based on PTFs (Rosetta 3 and HiHydroSoil v2.0).

3.
To highlight the limitations related to the application of a model based on measured soil properties for global mapping when only predicted soil information is available.

Covariate-Based GeoTransfer Functions (CoGTFs) Framework
We used here the CoGTF framework proposed by Gupta et al. [10] where soil response variables were combined with environmental covariates using machine learning (specifically, we used here a random forest approach). As described in Gupta et al. [12], we prepared a geo-referenced dataset of the van Genuchten (vG) parameters that included also values of basic soil properties. Then, we overlaid the observation sites with maps of the environmental covariates to produce a regression matrix and fitted a random forest (RF) model to the data. After evaluating the performance of the RF model, we produced spatial predictions of the vG parameters for a global 1 km mesh of locations. The R environment was used for the whole analysis [13].

Training Data-Expanding the Global SWCC Dataset
We compiled and curated a dataset of 15,259 SWCCs ( Figure 1) called the Global Soil Hydraulic Properties (GSHP) database [12]. This dataset is much larger than the training datasets used for two frequently applied PTFs, i.e., Rosetta 3 [14] and HiHydroSoil v2.0 [15] that is based on the PTF of Tóth et al. [16]. Rosetta 3 used 2134 SWCCs (mainly from North America and Europe) whereas HiHydroSoil v2.0 used 3773 SWCCs from Europe to develop the respective PTFs. Hence, maps of vG parameters deduced from Rosetta 3 or related to HiHydoSoil v2.0 have been calibrated by a rather limited dataset of SWCCs, and this motivated us to collect a more comprehensive dataset. We collected SWCC data from 2702 locations that represent all continents and climatic regions. The geographic and climatic distribution of locations with SWCC data was uneven, with most SWCCs measured in North America, followed by Africa, Europe, Asia, South America, and Australia/Oceania. Additionally, the majority of SWCCs (10,048 out of 15,259 curves) were obtained in temperate regions, and 2344, 1422, 1411, and 34 SWCCs were obtained from boreal, tropical, arid, and polar regions, respectively. We estimated the parameters of the simplified van Genuchten model with the restriction m = 1 − 1/n (Equation (1)) for all 15,259 SWCCs using the 'soihypfit' R package [17]: where θ(ψ) is the volumetric water content (m 3 /m 3 ) measured at matric potential ψ (m), θ s and θ r are the saturated and residual water contents, respectively (m 3 /m 3 ), and α (m −1 ) and n (dimensionless) are the inverse air entry pressure and shape parameters. To estimate the SWCC parameters, we restricted the parameter range for α to 0-100 m −1 and for n to 1-7. SWCCs, where the estimated values were equal to the upper limits of the parameter ranges were discarded. We further computed profile likelihood-based confidence intervals for the α and n parameters (see Chapter 4 in Uusipaikka [18]). The limits of the confidence intervals guided us to remove SWCCs that had a flat upper loglikelihood profile for α and n (i.e., SWCCs that had a high or indeterminate estimated standard error). The details of this analysis are explained in Gupta et al. [12].

Soil and Environmental Covariates
We combined spatial information on four soil properties (sand and clay content, soil depth, and bulk density) with 21 environmental covariates for modeling vG parameters, all globally available from https://www.openlandmap.org/ (accessed on 22 December 2021) [19]. The environmental covariates were selected to represent ecological conditions important for soil formation [20], including topography, precipitation patterns, and vegetation cover. The 25 spatial covariates are classified into five groups: (a) climate, (b) terrain, (c) surface reflectance, (d) vegetation, and (e) soil properties. The covariates were selected as described in Table A1 with respect to their hypothesized relevance for modeling vG parameter values.

CoGTF Models Based on Measured and Predicted Soil Covariates
As mentioned above, we trained the CoGTF model using environmental covariates and soil properties. However, related to soil covariates, we faced the problem that we knew measured soil properties for most SWCC locations but not for the locations of the prediction mesh. For global mapping, we completely relied on predicted soil covariates. On the one hand, models based on measured soil properties likely better reveal the 'true' relations between soil properties, environmental covariates, and SWCCs. On the other hand, the model should be trained with the same type of information that is available on a global scale otherwise different types of information will be incoherently mixed. To explore the effect of using different types of soil information on the predictions of the VG parameters, we built two models within the CoGTF framework. The first model CoGTF-1 was developed using predicted soil properties as covariates, extracted from https://www.openlandmap.org/ (accessed on 22 December 2021) and environmental covariates with 11,705 SWCCs parameter sets. The second model, CoGTF-2, was created using the same environmental covariates along with measured soil property data available only for 9958 SWCC parameters. A total of 1747 SWCCs were discarded due to a lack of measured soil properties data (Figures 1 and A1).

Computational Aspects of the Random Forest Model
Following the estimation of the vG parameters and the formation of regression matrix for both CoGTF-1, and CoGTF-2, the R package 'ranger' [21] was used to fit the RF model. Partial dependence plots (see Section 10.13.2 in Hastie et al. [22]) were generated by the R packages 'hexbin' [23], 'lattice' [24], and 'viridis' [25]. The optimal value of the most sensitive hyper-parameter mtry of RF was determined by spatial five-fold cross-validation (see Section 11.4 in Lovelace et al. [26]), and default values of ranger were used for the remaining hyper-parameters, e.g., number of trees, minimal node size, maximum tree depth, and splitting rule. For spatial cross-validation (SCV) the Earth's surface was divided into 1-degree by 1-degree blocks. We then formed the SCV partitions by randomly assigning the blocks to 5 subsets such that each contained about 20% of the data. The RF model was then fitted 5 times, leaving a subset out at a time and using it as a validation set. This SCV procedure was repeated 3 times, and predictions of the 3 SCV repetitions were averaged. The outcome of SCV analysis will be shown in the results section for the optimal mtry in a hexbin plot with a LOWESS (Locally Weighted Scatterplot Smoothing) line [27] to reveal the eventual conditional bias of the vG parameters' predictions.
The relative importance of the various covariates for modeling vG parameters was assessed by the node impurity, which is computed for RF regression problems as the decrease of residual sum of squares (RSS) when a particular covariate splits the data at the nodes of a tree (see Sections 10.13.1 and 15.3.2 in Hastie et al. [22]). The variable that provided maximum decline in RSS (and consequently increase in node purity) was considered the most important variable, the variable with the second-largest RSS decrease was considered the second most important, and so on.
The SCV and the list of important covariates were estimated for both CoGTF-1 and CoGTF-2. For the SCV of CoGTF-1, the predicted soil covariates were used in calibration and validation whereas, for CoGTF-2, the measured soil covariates were used to calibrate and validate the model. To show the effect of 'mixing' soil information in model formulation and application, we conducted another SCV using measured soil properties for the calibration but predicted soil properties for the validation. The results are presented in Figure A6.

Criteria to Assess Predictive Accuracy of SWCC Parameters
The accuracy of the SCV predictions was evaluated using BIAS, root mean square error (RMSE), coefficient of determination (R 2 ), and concordance correlation coefficient (CCC).
BIAS and RMSE are defined as is the sum of squared errors between the predictionsŷ i (predicted vG parameters and water content θ at matric potential ψ) and the measurements y i (vG parameter and water content θ measured at matric potential ψ), and N is the total number of observations. R 2 is defined as is the total sum of squares andȳ the arithmetic mean of the measurements. The concordance correlation coefficient (CCC) [28], a further measure of the agreement between observed and predicted vG parameter values and for validating predictions of water contents θ at prescribed matrix potentials ψ in CoGTF framework, is given by where ρ is the Pearson correlation coefficient betweenŷ i and y i , and sŷ and s y are the respective standard deviations. CCC is equal to 1 for a perfect model withŷ i = y i in which case we have BIAS = 0, ρ = 1 and sŷ = s y .

Validation of Van Genuchten Parameter Maps
The Earth's surface was again divided into 1-degree × 1-degree blocks. For a fair comparison of the CoGTF vG parameter maps with Rosetta 3 and HiHydroSoil v2.0, we used a validation dataset from all continents except North America and Europe because Rosetta 3 and HiHydroSoil v2.0 PTFs were calibrated with the data from North America and Europe. To compute the validation predictions by the CoGTF model, we split the validation data into five parts and predicted the parameters of each part in turn by an RF model that was calibrated with the data of the remaining four parts complemented by the data from North America and Europe. The predictions by Rosetta 3 and HiHydoSoil v2.0 were obtained by overlaying the locations of the validation data with the respective maps. To compute CCC, R 2 , RMSE, and BIAS, we compared the predicted water content with measured water content for three matric potentials (0.1 m, 3.3 m, and 150 m). Predicted water content at these matric potentials was calculated by inserting the predicted vG parameters into Equation (1).
As in the case of SCV, we quantified the model performance of both CoGTF models (CoGTF-1, and CoGTF2). The totals of 2621 SWCCs for COGTF-1 and 1817 SWCCs for CoGTF-2 were used as the validation datasets. Only for a subset of these SWCCs, the water content had been measured at 0.1, 3.3, and 150 m matric potential: 1572 (0.1 m), 719 (3.3 m), and 1184 (150 m) SWCCs for CoGTF-1 and 856 (0.1 m), 611 (3.3 m), and 732 (150 m) SWCCs for CoGTF-2, respectively. We evaluated the accuracy criteria for all of these subsets for either CoGTF-1 or CoGTF-2, Rosetta 2, and the HiHydoSoil v2.0 maps. Figure 2 lists the most important covariates for vG parameter modeling. The figure shows that the covariate importance differed between the various vG parameters (some parameters were dominated by climate covariates, others by soil properties) but was rather similar for CoGTF-1, and CoGTF-2. Climate covariates were the most important factors for the inverse air entry pressure parameter α (Figure 2a,e) whereas sand and clay were the key covariates for the shape parameter n (Figure 2b  Importance of the covariates for modeling vG parameters by a random forest model. The x-axis displays the average increase in node purity (the larger the value, the more important is a covariate). The 7 most important covariates are shown for both models (CoGTF-1 with predicted soil covariates and CoGTF-2 with measured soil covariates). The plots (a,e) show the importance for the common logarithm log 10 α of inverse air entry pressure parameter (unit of α m −1 ), (b,f) for the shape parameter n, (c,g) for the residual water content θ r , and (d,h) for the saturated water content θ s . Sand content, bulk density (BD), soil depth (SD), and clay content belong to soil covariates. Elevation is one of the terrain covariates. Temperature seasonality (TS), minimum temperature of warmest month (MTCM), annual average land surface temperature (LST), minimum temperature of coldest month (MTWM), precipitation of driest month (PDM), mean annual temperature (AMT), diffuse irradiation (DI), and mean annual precipitation (AMP) belong to the climate category.

Covariate Importance and Model Performance for the CoGTF Approach
The results of SCV for both models CoGTF-1 and CoGTF-2 are presented by hexbin density plots in Figure 3. For parameters α, the line of LOWESS fell close to the 1:1 line, hence predictions were approximately conditionally unbiased for both models. However, for θ s and θ r the LOWESS lines were unbiased up to approximately 0.15 and 0.60 m 3 /m 3 , respectively, but slightly biased for higher water content. SCV of CoGTF-1 showed concordance correlation coefficients (CCC) for log 10 α (unit of α m −1 ), n, θ r , and θ s of 0.432, 0.525, 0.321, and 0.565, respectively, and coefficients of determination (R 2 ) of 0.273, 0.325, 0.172, and 0.412 , respectively. The root mean square errors (RMSE) for the four parameters were equal to 0.442, 0.888, 0.076, and 0.082, respectively, and the BIASes equal to 0.008, 0.050, −0.001, and 0.001, respectively. Likewise, SCV for CoGTF-2 showed concordance correlation coefficients (CCC) for log 10 α (unit of α m −1 ), n, θ r , and θ s of 0.481, 0.753, 0.573, and 0.835, and coefficients of determination (R 2 ) of 0.323, 0.605, 0.421, and 0.735, respectively. The root mean square errors (RMSE) for the four parameters were equal to 0.400, 0.723, 0.060, and 0.050, and the BIASes equal to −0.009, 0.004, 0.005, and 0.002, respectively. The model performance of CoGTF-2 was seemingly better than for CoGTF-1, however, the SCV criteria of CoGTF-2 do not characterize the predictive quality honestly for locations where measured soil property data are lacking, i.e., for the locations of the global prediction mesh.

Global Maps of vG Parameters Based on
CoGTF-1 Figure 4 shows the global maps of the vG parameters at the soil surface (0.00 m depth) as calculated by the CoGTF-1 model. Note that the global maps with CoGTF-2 could be created as well, but are not shown here because they were built with 'mixed' soil information using measured soil covariates to calibrate the model but substituted predicted soil covariates for the missing soil property measurements when computing the predictions. CoGTF-1 predictions of the inverse air entry pressure parameter α (Figure 4a) ranged from 0.11 to 48 (m −1 ). High α values (corresponding to soils with dominantly large pores and small air entry pressure) were predicted for the Sahara, the Middle East (regions with high sand content), and tropical regions of northern South America and Central Africa. Low α values were found in the higher northern latitudes with a colder climate. We can conclude that predicted α values are large for coarse textures and for warm regions with high rainfall rates (more intense structure formation). High values of the shape parameter n (Figure 4b) were predicted for Sahara, the Middle East, Central Asia, and the Gobi Desert (regions with high sand content) but also in the higher northern latitudes where soils with high organic matter content and high porosity (low bulk density) are common. In tropical regions both saturated (Figure 4c) and residual water contents (Figure 4d) were high. As expected, the saturated water content was strongly related to bulk density with high values in the northern regions with low bulk density soils. Figure A2 shows the cumulative distribution function (CDF) for global maps of vG parameters at different depths (0, 30, 60, and 100 cm) for CoGTF-1. There is no clear change in vG parameters values with depth except for θ s with much higher values at the surface compared to the deeper soil layer. This is very likely due to low bulk density in topsoil horizons with high organic content.

Comparison with Alternate Global vG Parameters Maps
We compared the new parameter maps from CoGTF-1 with the maps generated with the PTFs Rosetta 3 [14] and HiHydroSoil v2.0 [15]. In Figure 5 we show the comparison for the shape parameter n and provide the other maps in Appendix A. Note that in the comparison, only CoGTF-1 maps are shown. In contrast to the CoGTF maps, Rosetta 3 predicted large values of n only for desert regions with high sand content. CoGTF predicted large values as well for arid regions in general and northern latitudes with low bulk densities in particular. The HiHydroSoil v2.0 map predicted low n values for the entire globe. The differences between the three maps are clearly visible in the probability density functions (PDFs) shown in Figure 6b. The medians of globally predicted n values were equal to 1.67 (CoGTF-1), 1.50 (Rosetta 3), and 1.34 (HiHydroSoil v2.0), respectively. CoGTFbased α and n maps showed a wider distribution compared to maps based on PTFs (see PDFs in Figure 6a Figures A3 and A4). The CoGTF-based θ s map appeared similar to the other two maps (see PDF in Figure 6d and global maps in Figure A5). with CoGTF-1 model based on predicted soil covariates. The maps (a) show inverse air entry pressure parameter log 10 α (unit of α m −1 ), (b) shape parameter n, (c) the residual water content θ r , and (d) the saturated water content θ s . The parameter n is high for sandy soils. Predictably, θ s is affected by soil bulk density map (high θ s values in high latitudes dominated by organic soils with low bulk density). In contrast, the parameter α shows low values in these northern latitudes (large pores and low capillarity).  Table 1 shows the results of the validation performed for CoGTF-1 (with predicted soil covariates), Rosetta 3, and HiHydroSoil v2.0 maps, where measured water content was compared with the predicted water content at 0.1 m, 3.3 m, and 150 m matric potentials. All the maps suffered from some negative bias. The bias was small and its contribution to RMSE was negligible for water content at 0.1 m and 150 m matric potentials, however for water content at 3.3 m matrix potential BIAS contributed substantially to the overall error (HiHydroSoil v2.0: BIAS 2 /RMSE 2 = 0.57). On the whole, the CoGTF predictions had the smallest RMSE and largest CCC and R 2 for water contents measured at the three matric potential values. Note that the R 2 was negative for the Rosetta and HiHydroSoil v2.0 predictions of the water content at 3.3 m matric potential, which means that the arithmetic mean of the validation data yielded more precise predictions than the respective maps. Likewise, the comparison of CoGTF-2 with other PTF-based maps is shown in Table A2. For the water content at higher absolute potential (3.3 and 150 m), the model performance was similar; however, for the water content close to saturation the model based on predicted soil covariates had higher CCC and R 2 values.  Table 1. Concordance correlation coefficient (CCC), coefficient of determination (R 2 ), root mean square error (RMSE), and BIAS of predicted water content (WC) at 0.1 m, 3.3 m, and 150 m matric potential using CoGTF-1 (based on predicted soil covariates), Rosetta 3, and HiHydroSoil v2.0. The criteria were computed for the validation data (SWCC curves outside from Europe and North America).

Models
Water

Characteristics of the CoGTF Global vG Parameters Maps
The study presents new global maps of vG parameters at a 1 km resolution based on the most comprehensive available dataset of SWCC measurements and harnessing local terrain, climate, and vegetation information. The effect of environmental covariates on SWCC parameters was most prominent for the inverse air entry pressure α. As shown in Figure 2a, climatic variables were more important for α than soil texture and bulk density. The size and continuity of the larger pores are controlled by soil structure formation and cannot be captured from basic soil properties alone. We can therefore conclude that for this parameter soil structure was more relevant than soil texture. A similar finding was reported for the saturated hydraulic conductivity Ksat in Gupta et al. [10], showing that large Ksat values were found not only in sandy textures but in tropical regions as well. In tropical regions, the wet and warm conditions result in different mineralogy of the clay-sized soil particle fraction [29,30]. We noticed evidence for the role of the soil-forming processes in the other vG parameters as well with higher saturated and residual water content values and lower n values for tropical regions (see maps in Figure 4). For the other three vG parameters, soil properties were the most important covariates. For example, the model of the shape parameter n was dominated by sand content, and this confirmed the findings of Haverkamp et al. [31] who showed that the shape parameter was closely correlated to soil texture. Similarly, bulk density and clay were the most important covariates for θ s and θ r , respectively. The same results were presented by Oleszczuk and Truba [32] and Feng and Zhang [33]. Remarkable were the high values of θ r in tropical regions. The high values of θ r are likely due to a large proportion of very fine pores in the Ferralsol (oxisols) soils. These pores remain water-filled at potentials as low as −4 MPa [34].

Prediction Accuracy Improves with Covariates and Larger and Well-Curated SWCC Dataset
The CoGTF-based vG parameter maps differed from PTF-based maps in the visual comparisons. As shown in Figure 6, vG parameter values for the CoGTF framework had a wider distribution than calculated with PTFs because only the CoGTF could take into account the variations of environmental covariates and their effect on soil properties. The high variations in vG parameter values were also related to the large dataset (11,705 for CoGTF-1 compared to 3773 for HiHydroSoil v2.0 and 2134 for Rosetta 3) used for training the model. The training data included samples from all soil textural classes and climatic regions. Moreover, the validation results clearly demonstrated that CoGTF performs better than the PTF based maps (Tables 1 and A2). Therefore, CoGTF overcomes the limitations of the Rosetta 3 PTF where the training data was dominated by loamy and sandy soil texture classes (see Figure 1 in [35]) from temperate climatic regions. Furthermore, Table A3 shows SCV results for a model CoGTF-A that was trained using all 15,259 SWCCs. Results showed that the CoGTF-1 model using only curated data predicts α more accurately than CoGTF-A. For the other vG parameters, the model performances of both models were similar. Likewise, Table A4 shows that the CCC and R 2 of CoGTF-A were slightly better than for CoGTF-1 except for water content at 0.1 m. This might be due to the fact that during testing most of the curves that had no wet-end (water content measured for matric potential ≤ 0.2 m) information and could lead to fallacious parameter values were removed.

Comparison of CoGTF Models Based on Measured and Predicted Soil Covariates
In this study, we compared two CoGTF based models (CoGTF-1 where predicted soil properties were used as covariates for model calibration and computing predictions, and CoGTF-2 where measured properties were used to calibrate the model but predicted soil covariates were used to compute the predictions). The reason for using CoGTF-1 was the lack of measured soil properties at the locations of the global prediction mesh. As stated in Section 2.4, the application of the CoGTF-2 model for global mapping was problematic because different information was used for model calibration and for computing the predictions. As expected, the CoGTF-2 models with measured soil properties (Figure 3e-h) predicted the vG parameter in the SCV exercise more precisely than the CoGTF-1 models (Figure 3a-d) because measured soil texture and bulk density are more directly related to SWCCs than the predicted soil covariates. However, these SCV results of CoGTF-2 did not allow us to assess how well CoGTF-2 predicted the vG parameters for the global prediction mesh, where measured soil property data was unavailable. Therefore, to test the loss of accuracy when applying a model based on measured soil covariates on sites with predicted soil information, we changed the SCV scheme as follows: we calibrated the CoGTF-2 model still using measured soil properties as covariates but predicted the vG parameter of the SCV subsets using predicted soil properties as covariates. Figure A6 showed reduced model performance compared to the SCV results shown in Figure 3e-h. The performance was also slightly reduced compared to the model calibrated with predicted soil covariates (Figure 3a-d). Due to the reduced model performance and the inherent inconsistency of mixing measured and predicted soil covariate data we only showed global maps created with CoGTF-1.

Use of the Global CoGTF vG Parameters Maps and Future Developments
The new maps of vG parameters presented in this study can be used in Earth system and land surface models that need information on spatially distributed soil hydraulic parameters. These maps can also be used more directly to compute other important soil physical properties like ponding time, characteristic length of evaporation, sorptivity, and soil strength. In addition, the plant's available water is typically deduced from water contents at specific matric potential values (3.3 m for field capacity and 150 m for wilting point) and can be computed globally with the new maps. The CoGTF maps presented here have a spatial resolution of 1 km. This resolution can likely be improved in the near future, considering various initiatives to estimate soil and environmental covariate information with higher spatial resolution. In addition, the maps can be improved if more comprehensive SWCCs data from Canada and Western and Northern Australia become available.

Conclusions
Machine learning algorithms have been used to produce global maps of soil hydraulic properties based on well-curated spatially distributed measurements and environmental covariates. In this study, we have demonstrated the benefit of (i) improving coverage of SWCC and the quality of the estimated parameters, (ii) better coverage of information from all climatic regions for ML training, and (iii) including local covariates such as climate, terrain, and vegetation metrics in the prediction. We have shown quantitatively (based on CCC, R 2 , RMSE, and BIAS values) that our new parameter maps perform better in predicting soil water retention compared to previous maps based on PTFs. We highlight that despite the possibility of calibrating a model with measured covariate values (e.g., soil texture), predicted values should be the default for the calibration step if measurements are not available at a global scale (i.e., for each pixel). One of the primary contributors to the improved predictions is the step increase in the number of SWCC data sets and the rigor by which the parameters have been derived. We conclude that including information based on local covariates implicitly injects the effects of soil formation processes on soil hydraulic properties (an easy to implement alternative to the explicit soil structure modeling reported in Fatichi et al. [6] or Bonetti et al. [7]).

Data Availability Statement:
The users can download the dataset using this link: https://zenodo. org/record/6348799, accessed on 8 April 2022.

Acknowledgments:
The study was supported by ETH Zurich (Grant ETH-18 18-1). We thank Zhongwang Wei, Associate professor at Sun Yat-Sen University, for helping to collect the datasets and for insightful discussions. We would like to thank Andrea Carmintai, Professor at ETH Zurich, for the insightful discussions.

Conflicts of Interest:
The authors declare no conflict of interest. Appendix A Figure A1. Conceptual workflow used to generate the different CoGTF models using the SWCCs shown in Figure 1. The CoGTF-A model was generated using all 15,259 SWCCs whereas to generate the CoGTF-1 model only 11,705 SWCCs were used with predicted soil properties. Furthermore, the CoGTF-2 model was produced using 9958 SWCCs that had measured soil properties.     Figure A6. Results of SCV of SWCC parameter predictions to show the effect of mixing soil information for model calibration (using measured soil properties) and computing predictions (using predicted soil properties). (a) inverse air entry pressure parameter log 10 α (unit of α m −1 ), (b) shape parameter n, (c) the residual water content θ r , and (d) the saturated water content θ s . The term 'measured' on the axes of the figures relates to parameter estimates obtained by fitting the vG model to measured SWCCs.