A Machine Learning-Based Approach for Surface Soil Moisture Estimations with Google Earth Engine

: Due to its relation to the Earth’s climate and weather and phenomena like drought, ﬂooding, or landslides, knowledge of the soil moisture content is valuable to many scientiﬁc and professional users. Remote-sensing offers the unique possibility for continuous measurements of this variable. Especially for agriculture, there is a strong demand for high spatial resolution mapping. However, operationally available soil moisture products exist with medium to coarse spatial resolution only ( ≥ 1 km). This study introduces a machine learning (ML)—based approach for the high spatial resolution (50 m) mapping of soil moisture based on the integration of Landsat-8 optical and thermal images, Copernicus Sentinel-1 C-Band SAR images, and modelled data, executable in the Google Earth Engine. The novelty of this approach lies in applying an entirely data-driven ML concept for global estimation of the surface soil moisture content. Globally distributed in situ data from the International Soil Moisture Network acted as an input for model training. Based on the independent validation dataset, the resulting overall estimation accuracy, in terms of Root-Mean-Squared-Error and R 2 , was 0.04 m 3 · m − 3 and 0.81, respectively. Beyond the retrieval model itself, this article introduces a framework for collecting training data and a stand-alone Python package for soil moisture mapping. The Google Earth Engine Python API facilitates the execution of data collection and retrieval which is entirely cloud-based. For soil moisture retrieval, it eliminates the requirement to download or preprocess any input datasets.


Introduction
The soil moisture content (SMC) is a crucial state variable in the complex global cycles of water, energy, and carbon, and is therefore very relevant for studying the Earth's climate and weather [1]. Furthermore, SMC plays a crucial role in natural hazards like drought, floods, and landslides [2]. Satellite remote sensing presents the only possibility for the spatially continuous measurement of surface SMC over large areas. Current, widely used approaches belong to two main categories: those based on active or passive microwave remote sensing, and those based on optical (i.e., shortwave and thermal radiation) remote sensing. The underlying methods for the estimation of SMC are fundamentally different. Most microwave-based retrieval algorithms rely on the same principle, exploiting the dielectric properties of water and its effect on the reflected microwave radiation [3]. For optical remote sensing, many different approaches exist, exploiting the relationship between SMC and surface reflectance, changes of vegetation indices, or surface temperature [4]. Independent of the type of underlying measurement, these approaches require complex retrieval models, often relying on assumptions and approximations.
The essential advantages of microwaves are their low sensitivity to atmospheric conditions, sun-illumination and clouds, and the fact that there is a direct, physical relationship generally applicable model, by modelling SMC with an ANN using coarse resolution (36 km) SMAP data as an input. ML presents an effective way to combine or fuse different types of data (e.g., from remote sensing, in situ measurements, or models). Results from a study by Liu et al. [34] show that the SMC estimation accuracies, for a farmland test area, by a combination of Sentinel-1 and Sentinel-2 and several different ML algorithms (SVR, deep neural networks, and generalized regression neural networks) were higher than those by traditional semi-empirical models based on either Sentinel-1 or Sentinel-2. Another example for data fusion was presented by Bhuyian et al. [35], with a study in which SMAP data were combined with MODIS data in an ML model to improve the estimation of precipitation.
The increasing availability of open data (e.g., from the Copernicus program), paired with the emergence of platforms like Google Earth Engine (GEE), which offer analysisready data and server-side processing capabilities, has further increased the popularity of ML for the estimation of biophysical parameters in recent years [36][37][38]. Due to the high volume of data from modern satellite missions, access and processing have become more complex [39], which means that this shift in the data exploitation strategies has become necessary.
The study presented hereafter followed a similar approach to that of Pasolli et al. [30], who used in situ SMC measurements as a target variable for the ML algorithm and the construction of an empirical model. One of the novel aspects of this work is to propose a data-driven approach that can be applied regardless of location, to be globally applicable but locally relevant. Instead of focusing on a specific study region like in previous studies, the model was trained and tested on the International Soil Moisture Network (ISMN), with global coverage. Chatterjee et al. followed a similar aim with the work presented in [38]. They introduced an approach to training a model for SMC estimations within the entire continental United States of America using US Climate Reference Network (USCRN) measurements as a training target. Some limitations described in [40] were related to the lack of accurate auxiliary data (e.g., land cover, soil type). The article presented here demonstrates how some of these limitations can be overcome by (1) further increasing the size of the training dataset using measurements from the International Soil Moisture Network (ISMN) and (2) combining Sentinel-1 SAR data with optical data from Landsat-8. It shows how ML can be applied in a data-driven approach to estimate high-resolution SMC based on a spatially dispersed training dataset. The proposed solution tackles a gap of the currently available operational datasets regarding their spatial resolutions. Existing operational satellite-based soil moisture products focus on the mapping at medium to coarse spatial resolutions [14][15][16][17]. Furthermore, one of the study outputs is a software toolbox allowing the fully cloud-based mapping of the SMC, enabling easy access for data users and integration in other studies.

Data and Study Area
The following section describes the datasets used in the present study, i.e., in situ data of the ISMN Network [41,42], S1 backscatter measurements, Landsat-8 (L8) shortwave reflectance and thermal radiance, and modelled surface parameters from the global landsurface model GLDAS. The analysis focuses on the period from October 2014 until mid-2020. Google Earth Engine (GEE) provided all datasets except ISMN. The training set encompassed approximately 30,000 samples.
The study area extent is, in principle, global (restricted to specific land-cover classes). Based on the Copernicus Global Land Layer (see Section 2.2.2), masking was carried out to include only the following classes (Section 3.2 describes the masking process in more detail): shrubs, herbaceous vegetation, cropland, bare/sparse vegetation, open forest.

The International Soil Moisture Network
The ISMN is an initiative to establish and maintain a global in situ soil moisture database. Through its website (https://ismn.geo.tuwien.ac.at, accessed on 28 February Remote Sens. 2021, 13, 2099 4 of 21 2020), data from hundreds of monitoring stations worldwide from many providers are available. This network's primary goal is to provide the basis for the large-scale validation of satellite-derived soil moisture products. All data provided through the ISMN are free to use for scientific purposes.
The ISMN dataset is heterogenous, i.e., stations are operated by different providers, using various measurement techniques at different depths, and the stations are located in different land-cover types and climate zones. Therefore, some of the ISMN data may not be suitable because it represents SMC variability that cannot be detected by satellite remote sensing. Due to the large number of stations, individual selection was impossible. Therefore, based on the known limitations of used satellite datasets, a set of rules was defined to filter the dataset (Section 3.2 describes the masking procedure). The number of 461 globally distributed ISMN stations from 13 in situ networks met the requirements to provide target measurements for the algorithm training. Most of the stations are located in North America. Table 1 shows a list of the used monitoring networks (with their location, provider, and available references) and Figure 1 their geographic distribution. The remaining measurement errors and uncertainties are part of the inherent noise, and contribute to the general uncertainty of the retrieval model. The open source python package pytesmo [43] provides reading and postprocessing tools for the ISMN dataset.   2.2. Google Earth Engine S1 A and B produce more than 1 TB of data daily [58]. During its operational lifespan, S1 will generate an enormous amount of data. Consequently, the user would have to handle a substantial volume of data and invest a significant amount of time for the preprocessing of low-level satellite data to fully exploit the potential of the S1 archive. For many other satellite, model-based, or geospatial datasets, users are facing the same problems. This problem has sparked a paradigm shift for the large-scale analytics of geospatial datasets. Recent years have shown the emergence of more and more providers offering cloud processing and online access to analysis-ready data. One of these platforms is GEE. GEE hosts the satellite and auxiliary data for this study. The GEE Python Application Programming Interface (API) allows convenient access to its data and processing functionality [39].
2.2.1. Sentinel-1 S1 is a C-Band Synthetic Aperture Radar (SAR) operated within the Copernicus program, which is a joint initiative of the European Commission (EC) and the European Space Agency (ESA). The standard acquisition mode over land is the Interferometric Wide Swath Mode (IW), with acquisitions at a 250 km wide swath and a spatial resolution of 5 by 20 m. S1 flies in a near-polar, sun-synchronous orbit with a 12-day repeat cycle. The two satellites A and B share the same orbit plane with a 180 • orbital phasing difference, which results in a 6-day repeat cycle for the S1 constellation. A description of all sensor and platform details can be found in ESAs S1 user handbook [59]. The data available on GEE provide σ 0 based on the dual-polarization (VV + VH) Ground Range Detected (GRD) product.

Copernicus Global Land Cover Layer (CGLS-LC100)
The CGLS-LC100 [60] delivers global-land cover maps at a spatial resolution of 100 m, derived from optical satellite remote sensing data. These include a discrete classification and the fractional cover for specific land-cover types. These maps are updated annually, starting from 2015. Land-cover data provided input for masking and were a feature candidate for the SMC retrieval model.

The Global Land Data Assimilation System (GLDAS)
GLDAS [61] ingests satellite and ground-based observational data products and uses advanced land surface modelling and data assimilation techniques to simulate many landsurface parameters. The dataset in version 2.1 covers the period from 1 January 2000, to the present, with about one month latency. Soil temperature and snow-water-equivalent Remote Sens. 2021, 13, 2099 6 of 21 (SOILTMP0_10cm_inst and SWE_inst) were required to mask in situ measurements and satellite data.

Landsat-8 Shortwave Reflectance and Thermal Radiance
L8 is a satellite operated by the USGS, providing imagery of the entire Earth every 16 days with a spatial resolution of 30 m to 100 m. It is acquiring data using two instruments, the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). Data in the GEE data collection USGS L8 Surface Reflectance Tier 1, which consists of atmospherically corrected surface reflectance for five visible and near-infrared (NIR) bands, two short wave infrared bands, and two thermal infrared bands [62], were considered as potential features for the retrieval model ( Table 2).

MOD13Q1 Enhanced Vegetation Index
The MODIS Enhanced Vegetation Index (EVI) is an improved version of the Normalized Difference Vegetation Index (NDVI), which minimizes canopy background variations and maintains better sensitivity over dense vegetation conditions by including also the blue band and information about atmospheric influences. The MOD13Q1.005 [63] product provides 16-day temporal composites at a spatial resolution of 250 m with global coverage. With its consistent temporal information, the EVI complemented L8 to capture vegetation dynamics.

OpenLandMap (OLM) Soil Information
The spatial distribution and patterns of SMC have a strong link to soil properties. The soil texture class [64], soil bulk density [65], clay content [66], and sand content [67] for the 0 cm layer of the OLM collection were acting as training feature candidates. This dataset provides maps with global coverage at 250 m spatial resolution.

Methods
This section describes the applied methods, covering feature extraction, data masking and merging, data preprocessing, and the ML model training ( Figure 2). The subsections describe the individual steps in more detail.

S1 Preprocessing
Several SAR preprocessing steps were already pre-applied to the data, which are available on GEE [68]:
Terrain correction using the SRTM 30, or the ASTER DEM for areas beyond ±60 • latitude, where SRTM is not available 5.
Resampling to a 10 m grid Beyond these steps above, the following custom processing steps were applied: 1.
Feature extraction

OpenLandMap (OLM) Soil Information
The spatial distribution and patterns of SMC have a strong link to soil properties. The soil texture class [64], soil bulk density [65], clay content [66], and sand content [67] for the 0 cm layer of the OLM collection were acting as training feature candidates. This dataset provides maps with global coverage at 250 m spatial resolution.

Methods
This section describes the applied methods, covering feature extraction, data masking and merging, data preprocessing, and the ML model training ( Figure 2). The subsections describe the individual steps in more detail.

Figure 2.
Flowchart with an overview of the processing steps for feature extraction, masking, and merging. Figure 2. Flowchart with an overview of the processing steps for feature extraction, masking, and merging.

Multitemporal Speckle Filter
SAR imagery is prone to a specific type of noise, i.e., the characteristic salt and pepper effect called speckle. Speckle is not a type of noise introduced by measurement system errors, but by the actual physical properties of the observed target, which cause a noiselike signal. It results from constructive and deconstructive interference of the coherent, but phase-shifted (caused by the random orientation of subpixel scatterers) returned signal. The GEE preprocessing does not include any speckle filtering. Quegan et al. [69] demonstrated the effectiveness of a multitemporal approach that incorporates both the spatial and temporal pixel neighborhood to determine the local value. The corrected intensity J at the pixel location (x, y) for image k, with the original pixel value I, and the spatial average of the pixel neighborhood denoted as <.> is defined as

Radiometric Terrain Correction
Due to the side-looking viewing geometry of S1, the data are particularly affected by topography, causing geometric and radiometric distortions. Vollrath et al. [70] developed a GEE based approach for the angular-based radiometric slope correction to correct these distortions. It employs two reference models, for volume-or surface-scattering dominated surfaces, computing the radiometrically corrected γ 0 f . The following equation is applied, in case of volume-scattering [71]: with the slope steepness in range direction α r and the incidence angle θ i . In case of surfacescattering case, the approach exploits a model by Ulander [72], which adds the tilt in azimuth direction α az as an additional quantity. The incidence angle corrected backscatter γ 0 is derived from the normalized radar cross-section σ 0 : The Shuttle Radar Topography Mission [73] (SRTM) digital elevation model (DEM) V3 provides the topographic reference.

Computation of Temporal Statistics
For bare soil, the backscatter intensity is determined mainly by SMC and surface roughness [3]. The structure and water content of vegetation, if present, create a signal, which adds to that of the soil. In [74], the authors demonstrated that in extreme cases, caused, for example, by the row patterns in agricultural fields, the effect of roughness could be as strong as 10 dB, dominating the SMC signal by far. The effect depends strongly on the local incidence angle but on average, a roughness related σ 0 variability of not less than 2 dB can be expected. Furthermore, in an experiment based on C-Band VV data from ERS-1, the sensitivity of σ 0 to SMC was quantified as 0.26 dB/0.01 m 3 ·m −3 , which shows that roughness information is crucial for the retrieval of SMC. For the change detection approach, Wagner et al. [75] assume that surface roughness remains constant over time or that changes occur on very large time scales, i.e., it is responsible for a constant background signal. Temporal backscatter statistics, median and second, third, and fourth central statistical moments were computed to characterize these static effects caused by surface roughness.

Feature Extraction, Masking, and Merging
The feature extraction was carried out in GEE based on the ISMN station locations and the measurements dates. Each dataset was resampled to a spatial resolution of 50 m using bilinear interpolation. Values were then extracted based on the average of all pixels touched by a 50 m diameter circular buffer around the sampling location. After applying the filtering criteria listed in Table 3, the training database contained approximately 30,000 samples in 62 features (Table 4)

Model Training
Studies [76,77] show that the choice of the best ML algorithm for retrieving biophysical parameters varies significantly, depending on the target variable and the structure of the training dataset. This study used a Gradient Boosted Regression Trees (GBRT) algorithm. GBRT belongs to the ensemble methods, which means several weak learners are built and combined into one powerful ensemble. Weak learners are combined sequentially, and each newly added model tries to correct the prediction bias of all previous models combined [78,79]. GBRT, like Random Forest, belongs to the family of tree-based methods, which means that it is naturally compatible with different data types and ordinal scales. Further advantages are its insensitivity to differently scaled features and the low computational cost associated with algorithm training and target prediction. Moreover, it has proven to perform very well in similar applications [80][81][82]. A comparison of GBRT with three other popular ML methods is provided in supplementary document S1. GBRT uses an additive model to combine weak learners: where M is the total number of models, γ is a multiplication factor (the so-called step size), and h(x) is the weak learner. An iterative process is building the model using decision trees of fixed size as weak learners. The loss L is minimized (with a least-square loss function, in this case) for each tree h m given the previous ensemble F m−1 : where x i and y i are the feature and target values of the training dataset. GBRT solves the minimization problem numerically via steepest descent [83]. The following equation determines the value of γ, based on line-search [84]: An in-depth discussion of GBRT is beyond the scope of this article. For further details, the reader should refer to [85]. This study used the algorithm implementation of Scikit-Learn [86]. Figure 3 shows the detailed workflow of the training procedure, which consists of the following steps: 1.
Generation of training and test datasets: the test dataset consisted of 20% of the samples and was randomly selected. The remaining 80% represented the training dataset.

2.
Tuning and training: cross-validation (CV) and a grid-search approach drove the optimization of hyperparameters [87]. To be specific, Leaf-One-Group-Out-Cross-Validation (LOGO-CV) was applied to find the optimal setting of hyperparameters. Feature selection: the training procedure was further nested in an automatic feature selection routine wherein the training was repeated iteratively with the least essential feature removed in each step. The ranking was based on the impurity-based feature importance, which is provided as an output of the Scikit-Learn implementations.

4.
Testing: the final assessment of the SMC estimation accuracy (the test score) was performed based on the independent test dataset extracted from the whole dataset before the training procedure. Therefore, the test dataset constituted an unseen set and allowed estimating the model's generalization capabilities. A sizeable negative difference between the validation score and the test score would indicate overfitting of the model to the training dataset.  3. Feature selection: the training procedure was further nested in an automatic feature selection routine wherein the training was repeated iteratively with the least essential feature removed in each step. The ranking was based on the impurity-based feature importance, which is provided as an output of the Scikit-Learn implementations. 4. Testing: the final assessment of the SMC estimation accuracy (the test score) was performed based on the independent test dataset extracted from the whole dataset before the training procedure. Therefore, the test dataset constituted an unseen set and allowed estimating the model's generalization capabilities. A sizeable negative difference between the validation score and the test score would indicate overfitting of the model to the training dataset.

Implementation
For the training, ISMN data were prefiltered, downloaded and stored locally. Based on the in situ locations (in space and time), the S1, L8, GLDAS and other auxiliary data

Implementation
For the training, ISMN data were prefiltered, downloaded and stored locally. Based on the in situ locations (in space and time), the S1, L8, GLDAS and other auxiliary data were retrieved from GEE using the Python API. All preprocessing and filtering steps were carried out server-side, i.e., only the filtered and preprocessed training data were downloaded to the local workstation. Scikit-learn [86] was used to solve the regression problem and derive the GBRT model. Figure 4 gives a schematic overview of the architecture. Depending on the number of training samples, the training procedure can be time-consuming, mainly due to GEE's data retrieval. The GEE based mapping of SMC was implemented as a stand-alone Python package, the PYthon Sentinel-1 Soil-Moisture Mapping Toolbox (PYSMM) [88]. PYSMM is freely available as open-source software and was delivered as Supplementary Material to this article.

Results and Discussion
The following chapters present the algorithm training results, assess the GBRT performance, and evaluate the SMC validation results. Table 6 presents the results of the LOGO-CV based hyperparameter selection described in Section 3.3. Table 6. Result of the hyperparameter selection.

Algorithm Hyperparameters
GBRT Learning-rate 0.1 Number of estimators 100 The fraction of samples used for fitting 0.5 Maximum depths of individual regression trees 10 Early stopping after n iterations with no change 10 A summary of validation and test scores is presented in Table 7. The similarity of the coefficient of determination (R²) and the Root-Mean-Squared-Error (RMSE) based on the cross-validation average and based on the test set dataset shows that the GBRT model is not subject to overfitting. It is noteworthy that a model with a relatively low level of com-

Results and Discussion
The following chapters present the algorithm training results, assess the GBRT performance, and evaluate the SMC validation results. Table 6 presents the results of the LOGO-CV based hyperparameter selection described in Section 3.3. A summary of validation and test scores is presented in Table 7. The similarity of the coefficient of determination (R 2 ) and the Root-Mean-Squared-Error (RMSE) based on the cross-validation average and based on the test set dataset shows that the GBRT model is not subject to overfitting. It is noteworthy that a model with a relatively low level of complexity (100 base estimators and a maximum depth of the individual trees of 10) achieved the best results. Fewer and shallower regression trees lead to a low computational cost of model training and target estimation. The sound overall predictive power of the GBRT model is evident in Figure 5

Assessment of the Temporal Accuracy
Validation and test scores presented above represent the overall predic combining spatial and temporal errors. The results presented in this secti on a separate analysis of temporal and spatial variabilities to estimate the a components. Table 8 reports the median scores obtained by calculating R MAE (the Spearman correlation ρ, the Pearson correlation R, and the Klin ciency KGE [89] were included in the table as well to make the results comp SMC products and the comparison in Section 4.5) for each location contai dataset separately, in order to reflect the sensitivity of predicted SMC to te tions. The results were averaged over all locations to estimate the overall t and grouped by the land-cover classes. Figure 6 visualizes the same analysi whisker plots, which describe the distribution of RMSE and R²

Assessment of the Temporal Accuracy
Validation and test scores presented above represent the overall prediction accuracy, combining spatial and temporal errors. The results presented in this section were based on a separate analysis of temporal and spatial variabilities to estimate the associated error components. Table 8 reports the median scores obtained by calculating R 2 , RMSE, and MAE (the Spearman correlation ρ, the Pearson correlation R, and the Kling-Gupta Efficiency KGE [89] were included in the table as well to make the results comparable to other SMC products and the comparison in Section 4.5) for each location contained in the test dataset separately, in order to reflect the sensitivity of predicted SMC to temporal variations. The results were averaged over all locations to estimate the overall temporal error and grouped by the land-cover classes. Figure 6 visualizes the same analysis with box and whisker plots, which describe the distribution of RMSE and R 2 the overall value ( Figure 5), which hints at a reduced sensitivity to temporal variations compared to spatial variations. The plot in Figure 6b groups the same results by landcover classes (see Table 3 for an overview of class labels). Locations belonging to the herbaceous vegetation or cropland class (30 or 40) or the two open-forest lasses (121 and 126) show similar R² values. However, the quartile range for herbaceous vegetation is large. The average temporal correlation was negatively impacted by the significantly worse results for the class bare/sparse vegetation (60). The boxplots in Figure 6c,d present the same analysis results with respect to MAE and RMSE. Overall, the errors are low, and the differences between the land-cover classes were less distinct compared to those in the analysis based on R². It is interesting to note that bare/sparse vegetation performed better, in terms of the error, than the other classes, even though it showed relatively low R² values. The low correlation in combination with a low error can be explained by the also low average SMC of 0.07 m 3 m −3 for this land-cover class, which indicates that it belongs to an area associated with an arid climate. In [90], Morrison and Wagner demonstrated that the relationship between SMC and radar backscatter in such areas is fundamentally different, caused by the strong effect of surface roughness and the response to subsurface features. Furthermore, the analysis showed that the forest classes have a higher error than vegetation or cropland, even though the results were similar in terms of R². By analyzing the temporally averaged SMC, the discrepancy between the temporal accuracies presented above and the overall accuracy can be explained. The correlation between the average of all actual and estimated SMC values of the same location ( Figure  7) shows a very accurate estimation of the static spatial differences (R 2 = 0.97, RMSE = 0.01 m 3 m −3 , MAE = 0.01 m 3 m −3 ), which contributes to the high overall accuracy of the results. With a median of 0.476, the average temporal correlation is significantly lower than the overall value ( Figure 5), which hints at a reduced sensitivity to temporal variations compared to spatial variations. The plot in Figure 6b groups the same results by land-cover classes (see Table 3 for an overview of class labels). Locations belonging to the herbaceous vegetation or cropland class (30 or 40) or the two open-forest lasses (121 and 126) show similar R 2 values. However, the quartile range for herbaceous vegetation is large. The average temporal correlation was negatively impacted by the significantly worse results for the class bare/sparse vegetation (60). The boxplots in Figure 6c,d present the same analysis results with respect to MAE and RMSE. Overall, the errors are low, and the differences between the land-cover classes were less distinct compared to those in the analysis based on R 2 . It is interesting to note that bare/sparse vegetation performed better, in terms of the error, than the other classes, even though it showed relatively low R 2 values. The low correlation in combination with a low error can be explained by the also low average SMC of 0.07 m 3 m −3 for this land-cover class, which indicates that it belongs to an area associated with an arid climate. In [90], Morrison and Wagner demonstrated that the relationship between SMC and radar backscatter in such areas is fundamentally different, caused by the strong effect of surface roughness and the response to subsurface features. Furthermore, the analysis showed that the forest classes have a higher error than vegetation or cropland, even though the results were similar in terms of R 2 .
By analyzing the temporally averaged SMC, the discrepancy between the temporal accuracies presented above and the overall accuracy can be explained. The correlation between the average of all actual and estimated SMC values of the same location ( Figure 7) shows a very accurate estimation of the static spatial differences (R 2 = 0.

Feature Importances
These results were achieved by building the GBRT model based on a ture selection from a large pool of 62 candidate features. Table 9 show optimal features. Omitting any of these features would result in a degrad dictive power of the estimation model. In the table's far-right column, the importance is reported, which is a direct output of the GBRT algorithm. I number of times a feature is used to decide in one of the tree nodes. The im of all features sum up to 1. Figure 8 enables a better interpretation of thi bution based on the accumulation by the sources of the features. It empha icant contribution of optical data and auxiliary datasets and that, despite t ical relationship between SMC and S1, estimations could not be based on tensities without these.

Feature Importances
These results were achieved by building the GBRT model based on an automatic feature selection from a large pool of 62 candidate features. Table 9 shows a list of the 16 optimal features. Omitting any of these features would result in a degradation of the predictive power of the estimation model. In the table's far-right column, the relative feature importance is reported, which is a direct output of the GBRT algorithm. It is based on the number of times a feature is used to decide in one of the tree nodes. The importance scores of all features sum up to 1. Figure 8 enables a better interpretation of this relative contribution based on the accumulation by the sources of the features. It emphasizes the significant contribution of optical data and auxiliary datasets and that, despite the proven physical relationship between SMC and S1, estimations could not be based on backscatter intensities without these.
Soil bulk density feature 0.047 No. of days between L8 and S1 acquisitions feature 0.044 S1 . feature 0.043 L8, band 5 feature 0.041 Percentage of sandy soils feature 0.040 L8, band 3 feature 0.032  Figure 9 presents further insights into the estimation model performance. The learning curve compares the training and cross-validation scores which are dependent on the size of the training dataset. It shows that a large dataset is necessary to increase the generalization capabilities of the retrieval model. In this case, the cross-validation score curve still shows a positive trend, even with the maximum number of samples; therefore, an increased estimation accuracy could be expected with a more extensive training dataset. The model's scalability shows that the cost (in terms of computational time) of increasing the number of training samples is linear. However, the gain, in terms of increased score, shows a nonlinear behavior. Due to the linear relationship between the number of training samples and computational effort, this curve shows identical behavior as the curve given by the cross-validation score.   Table 9. Overview of the automatically features ordered by their importance for the retrieval model.

Variable Name Variable Type Importance
In situ surface SMC from the ISMN covering the topmost layer of soil (0 to 5 cm) target - Similar results were achieved in other studies [30], which showed the high impact of auxiliary and optical data on SAR-based retrieval of SMC. The authors demonstrated that this phenomenon is due to the combination of an indirect relationship between vegetation properties and SMC and their high impact on the backscatter intensities. Research presented in several other articles [29,91,92] confirmed these findings by analyzing the impact of the vegetation phenology and surface roughness on Sentinel-1 backscatter intensities. Figure 9 presents further insights into the estimation model performance. The learning curve compares the training and cross-validation scores which are dependent on the size of the training dataset. It shows that a large dataset is necessary to increase the generalization capabilities of the retrieval model. In this case, the cross-validation score curve still shows a positive trend, even with the maximum number of samples; therefore, an increased estimation accuracy could be expected with a more extensive training dataset. The model's scalability shows that the cost (in terms of computational time) of increasing the number of training samples is linear. However, the gain, in terms of increased score, shows a nonlinear behavior. Due to the linear relationship between the number of training samples and computational effort, this curve shows identical behavior as the curve given by the cross-validation score. Figure 9 presents further insights into the estimation model performance. The learning curve compares the training and cross-validation scores which are dependent on the size of the training dataset. It shows that a large dataset is necessary to increase the generalization capabilities of the retrieval model. In this case, the cross-validation score curve still shows a positive trend, even with the maximum number of samples; therefore, an increased estimation accuracy could be expected with a more extensive training dataset. The model's scalability shows that the cost (in terms of computational time) of increasing the number of training samples is linear. However, the gain, in terms of increased score, shows a nonlinear behavior. Due to the linear relationship between the number of training samples and computational effort, this curve shows identical behavior as the curve given by the cross-validation score. Figure 9. GBRT model performance in terms of the learning curve, the model scalability, and model performance. Figure 9. GBRT model performance in terms of the learning curve, the model scalability, and model performance.

Comparison to Established Methods and Other Experimental Results
This section compares results presented in this article to those from established SMC products, as well as results in recently published articles. Certain limitations apply: quantification of the differences is difficult because the underlying SMC products are fundamentally different, for example, in terms of spatial resolution or the characteristics of the underlying remote-sensing data, and the validation approaches apply different methods and reference datasets. The numbers presented in Table 10 are, therefore, intended for a qualitative assessment. One of the freely available operational SMC products with the highest spatial resolution is the Copernicus Surface Soil Moisture (CSSM) product [19], based on Sentinel-1 data. It provides a mapping of the SMC with a spatial resolution of 1 km. A validation report [93] presents an assessment of the temporal correlation between estimated SMC and measurements from several in situ networks, which are part of the ISMN. The result was a median temporal correlation of ρ = 0.315. Like in the analysis presented in Section 4.2, the range of ρ values for the individual sites is significant.
The Soil Moisture Active Passive (SMAP) mission [15] provides several SMC products derived from passive only and a combination of active and passive microwave data. The L2SMAP product is providing SMC data at a spatial resolution of 9 km. In [94], Colliander et al. performed the validation of several SMAP products based on the so-called core validation sites, which are in situ networks established explicitly to validate SMAP. In terms of the Pearson correlation coefficient (R), the reported median temporal correlation for the L2SMAP product was R = 0.752.
Chatterjee et al. [40] presented a study with a similar aim, as presented in this article. The authors tested several ML approaches to estimate SMC for the continental USA, also based on Sentinel-1 data. Training and validation were performed based on in situ measurements from the USCRN network. The overall correlation between actual and estimated SMC, derived with a Random-Forest based model (RFS1), was R 2 = 0.682.

Conclusions
This study introduced an approach to estimate SMC at a high spatial resolution on a quasi-global scale. The novelty of this approach is the application of a data-driven model in a large-scale context. One of its strengths is that the mapping of SMC is cloud-based, which means that newly available reference data can be easily integrated, and the model retrained without the necessity to process further satellite data. A Python package, called PYSMM [88], for the online retrieval of SMC and a demonstrator dataset was developed to supplement this study. The validation demonstrated that, within certain boundary conditions, an overall high accuracy could be achieved. Compared to the performance of available SMC products as well as similar studies, the results are promising. The overall accuracy fulfils SMC monitoring requirements set by the Global Observing System (GCOS) in [95], which specifies an overall retrieval error of fewer than 0.04 m 3 m −3 . Certain main limitations do apply: (1) the irregular distribution of samples within the feature space leads to variable accuracies (as the results in Section 4.2 show based on the example of the landcover class bare/sparse vegetation); (2) the approach is limited to low vegetation density areas. Based on the CGLS classification, this reduces the mappable area, for example, in Europe, to about 55% of the total land area and about 66% in the USA, and as low as 15% in Indonesia, which is further reduced by masking high NDVI values, frozen soil, or snow; (3) the training dataset covers only some of the global climate zones (Figure 1). Especially point 1 could be tackled in future research by identifying and targeting the low sampled areas of the feature space. Future work should also focus on extending the model to incorporate other remote-sensing sensors like Sentinel-2 and the collection of reference data for currently missing climate zones.
Two studies achieved promising results based on SMC estimations derived with PYSMM, demonstrating spatial and temporal mapping potential. Lei et al. [96] showed how SMC maps could be assimilated into a hydrological model to improve the spatial details of the model simulations, and Greifeneder et al. [97] combined time-series of estimated SMC with GLDAS soil moisture climatologies to derive temporal anomalies.
Supplementary Materials: The comparison of GBRT with three further ML algorithms is provided in S1: Algorithm Comparison, available online at: http://doi.org/10.5281/zenodo.4742678; S2, the PYMM source code is available online at http://doi.org/10.5281/zenodo.4552813 The documentation (S3) is available directly here: https://pysmm.readthedocs.io/en/latest/. Two SMC demonstrator data sets (S4) can be viewed and downloaded through a Google Earth Engine App (https://felixgreifeneder.users.earthengine.app/view/sm-explorer). Funding: This work was partially funded by the Horizon 2020 project "Ecopotential-Improving Future Ecosystem Benefits through Earth Observation, which has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement n • 641762) and the European Fund for Regional Development project "DPS4ESLAB".
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data available in a publicly accessible repository that does not issue DOIs. Publicly available datasets were analyzed in this study. This data can be found here: https://ismn.tuwien.ac.at and https://developers.google.com/earth-engine/datasets, accessed on 28 February 2020.