Machine Learning to Estimate Surface Soil Moisture from Remote Sensing Data

: Soil moisture is an integral quantity parameter in hydrology and agriculture practices. Satellite remote sensing has been widely applied to estimate surface soil moisture. However, it is still a challenge to retrieve surface soil moisture content (SMC) data in the heterogeneous catchment at high spatial resolution. Therefore, it is necessary to improve the retrieval of SMC from remote sensing data, which is important in the planning and efficient use of land resources. Many methods based on satellite ‐ derived vegetation indices have already been developed to estimate SMC in various climatic and geographic conditions. Soil moisture retrievals were performed using statistical and machine learning methods as well as physical modeling techniques. In this study, an important experiment of soil moisture retrieval for investigating the capability of the machine learning methods was conducted in the early spring season in a semi ‐ arid region of Iran. We applied random forest (RF), support vector machine (SVM), artificial neural network (ANN), and elastic net regression (EN) algorithms to soil moisture retrieval by optical and thermal sensors of Landsat 8 and knowledge of land ‐ use types on previously untested conditions in a semi ‐ arid region of Iran. The statistical comparisons show that RF method provided the highest Nash–Sutcliffe efficiency value (0.73) for soil moisture retrieval covered by the different land ‐ use types. Combinations of surface reflectance and auxiliary geospatial data can provide more valuable information for SMC estimation, which shows promise for precision agriculture applications.


Introduction
Soil moisture (SM) is a significant component of the hydrological cycle regulating runoff, vegetation production and evapotranspiration [1]. Soil moisture is a major soil indicator to define and identify agricultural drought. Estimation of soil moisture has applications for identifying earlystage water deficit conditions and evolving drought events for crop yield uncertainty and food security conditions, agricultural insurance, policymaking and decision-making, and crop planning [2,3] especially for the arid and semi-arid parts of the globe. Agricultural drought has a catalytic effect that contributes to social and political conflicts in developing countries [4]. Therefore, soil moisture modeling and monitoring are of increasing interest. Monitoring the spatial and temporal variations of SM is a prerequisite both for mitigating and adapting to climate changes for the sustainability of cropping systems as well as for developing precision agriculture and food security [5][6][7][8][9]. Surface soil The agricultural sector plays a strategic role in the economy for arid countries more prone to be sensitive to climate events. In arid regions, droughts are recurring climatic events, often threatening agricultural systems and food security [36,37]. The definition of policies and development of best practices management options for dealing with water stress in such regions require the adoption of scenarios of likely future SM conditions as well as real-time or near real-time monitoring and sharing of SM over extended regions. This can only be achieved by adopting freely available remote sensing data and a model that translates the satellite data to SM estimates with an accuracy that is sufficient to support management. This study attempts to semi operationally assess the spatially and temporally continuous estimates of surface soil moisture content using machine learning models which are formulated for visible, thermal infrared remotely sensed data and easily accessible auxiliary data to support and to improve the framework of agriculture management. The primary goal of this study was to develop an alternative remote-sensed approach with rapid, less expensive, and reliable techniques as opposed to ground-based methods to spatially quantify SM by routinely available satellite imagery. This approach must be repeatable and, ideally, able to provide automated spatiotemporal soil moisture mapping over large areas.

Study Area
The study area represents the semi-arid region of west Khorasan-Razavi province in Iran (Northeastern), located completely on Quaternary alluvial sediments ( Figure 1). The Northeastern semi-arid region also has frequent droughts that can be characterized by the absence of soil moisture information. Droughts are extreme climate events, which often affect SMC in Iran [37]. There were the worst droughts during the period of 1998-2001 and again in 2014 in the Middle-East [38]. The study area includes the Sabzevar county and city, an area of approximately 1100 km 2 intensively used for residential, semi-industrial, and agricultural purposes. The land-use is mainly forested areas (less than 1% of canopy cover) (42%), agricultural (14%), and settlement (1.5%), and the remaining natural and planted forests of Haloxylon spp, that is a commonly used plant as sand-fixing. The study area is erosion-sensitive and with potential environmental limitations on vegetation and agriculture [39][40][41] and it is mountainous towards the North and large plain characterizes the South; the elevation of the study area varies between 891 and 2085 m, with a mean value of 1229 m above sea level ( Figure 1). Although the slope angles vary from 0 to 43 degree, the majority of the terrain is only sloping between 0 and 2 degrees. Long term average annual precipitation at the Sabzevar Weather Station is 188.6 mm. The rainiest month is March with an average of 37 mm while August and September are the driest rainy months (< 1 mm). The annual mean air temperature is about 17.4 °C and the average relative air humidity is 41% [42].

Experimental Design of Soil Samples
Soil moisture in situ measurements were obtained using a portable multi-sensor capacitance probe (PR2, Delta-T Devices Ltd., Cambridge, UK), which can measure volumetric soil moisture (%vol) at six depths: 10,20,30,40,60, and 100 cm. The PR2 probe has a measure range of 0-100% (m 3 /m 3 ) and a precision of ±6%; ±0.06 m 3 /m 3 (range of 0 °C to 40 °C) in generalized soil calibration [43]. This sensor was selected n study because of its profile resolution, easy installation, and favorable soil conditions [44]. An important side of soil moisture variability is its effect on the required number of soil samples at different locations to properly estimate the true value of soil moisture through remote sensing data. The experimental design of soil samples (moisture content) is generally based on land-use or soil types in the experiment area [45,46]. For this research, data on landuse types were obtained at 30 m resolution from previous research ( Figure 2) [47]. Land-use data provided by Adab, Farajzadeh, Filhkash, and Esmaili [47] are more detailed for soil moisture estimation than for others. The landuse map derived from Landsat 7 ETM+ images consists of eight land-use categories obtained by the maximum likelihood classifier method (Kappa statistic = 0.846, see Table 1). The SMC samples were collected from six typical land-use systems described in the study area (Table 1). Gardens were excluded from SMC sampling because the area of this type is negligible ( Figure 2). The study area was divided into three geographical sectors, namely the north sector, west, and east sector. The west sector and east sector close to the plains and alluvial fans as major agro-ecoregions are dominant.
However, the northern sector is mainly located in the rough topographic relief which is mixed rangeland, forested area, and irrigated field crops. Sampling points were also determined based on a map of World Reference Base (2006) Soil Groups produced by SoilGrids project [48]. The major soil reference groups for the study area are Aric Regosols, Calcaric Regosols, Haplic Calcisols, Haplic Fluvisols, and Haplic Fluvisols (Calcaric) (Figure 2). These reference groups are classified according to soil properties, characteristics, horizons, and profiles. Soil moisture samples should measure from different altitudes to get the most soil moisture variation which might be influenced by elevation because elevation reflects microclimate condition.
In this case, after overlapping soil layer, land-use layer, and elevation data, sample locations were distributed randomly due to accessibility to sample throughout the study area of 1135.3 Km 2 . A total of 58 sampling points were identified in the study area which 49 locations used for 10-fold crossvalidation of the model (Figure 1). The remaining sampling points (n = 9) were set aside to be used as an external test data set to estimate the capability of the model on new data (for more details Section 2.6 Model validation and assessment). Soil moisture probe is then placed at the points of interest by augering with minimal soil disturbance. Each measurement represents a point measurement of the moisture in the upper 10 cm of soil. The SMC field measurements were corresponding with the satellite passing date of Landsat-8 schedule as best as logistically feasible. The field measurements included two-time steps of point sampling carried out under wet and dry climate conditions. The first field measurements were performed on three days 6-8 March (days of year calendar (DOY) [65][66][67] and the second one on 26 May (DOY 146) of 2017.

CHIRPS and SMAP Datasets
For further analysis, Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) and Soil Moisture Active Passive (SMAP) satellite have been used to compare their agreement with the soil moisture derived from the present study. The CHIRPS quasi-global rainfall dataset systematically developed by the USGS in collaboration with the EROS center, have been distributed over the past decade by integrating satellite imageries and in situ gauge-collected observations at 0.05° spatial resolution [51]. NASA's SMAP provides information about soil moisture contained in surface layers of the soil with a 9-km spatial resolution at a global extent by combining L-band passive (radiometer, 1.41 GHz) and active (radar, 1.26 GHz) microwave. Time series of precipitation and Level-3 (L3) surface soil moisture were used across the study area in 2017 and were obtained through the Google earth engine (GEE) online platform (https://earthengine.google.com/).

Pre-Processing of Data
Digital numbers (DNs) of the Landsat imagery were converted to surface reflectance using the ATmospheric CORrection (ATCOR3) model of PCI Geomatica 2018 software [52]. ATCOR is an addon module of PCI Geomatica, which applies MODTRAN 4 RTM to generate look-up tables for different atmospheric input parameters. The visibility values for the image acquisition time were obtained from the Sabzevar Weather Station, and as Sabzevar locates in the continental areas with urban and industrial aerosol sources, an urban aerosol atmospheric model with mid-latitude winter and spring standard atmospheres were selected. The calibration files for Landsat 8 were retrieved from the image metadata file for the band-specific gain and bias values. Topographic correction is very important in the study area with rugged terrain. In this study, ATCOR-3 was applied for rugged terrain effects on Landsat 8 data acquired over the study area by incorporating ASTER DEM data and their derivatives such as aspect, slope, sky view factor, and cast shadow. DNs of the thermal infrared sensor were also corrected into surface temperature with ATCOR module. The ATCOR Surface Temperature workflow uses input thermal bands in scaled radiance (raw DN values), a DEM and terrain derivatives, and, optionally, thermal flux settings from Sabzevar Weather Station (e.g., elevation reference field, air temperature field, and water vapor partial pressure field, and air temperature gradient) to generate a surface temperature of the thermal image. The relationship between the fraction of green vegetation cover and SMC was analyzed through the croplands. The fraction of the vegetation classes were classified by SPEAR Tools, ENVI 5.3 on 7 March 2017. Normalized difference vegetation index (NDVI) usually applies for vegetation delineation, with values in the ranges from 0.05 (for sparse vegetation) to 0.7 (for dense vegetation cover) [53].
Geometric correction of satellite imagery was not performed on the data because level 1TP products are geometrically corrected using ground control points (GCPs) with less than a half-pixel circular error [54]. However, the Landsat 8 satellite images were geometrically double-checked by selecting control points (CPs) from well-defined locations such as crossovers of roads and other welldefined features from the Google Earth image. The total root mean square error (RMSE) of the registered Landsat 8 satellite images was 0.52 pixel.

Statistical Evaluation of the Input Data
Predictor variables were first statistically processed and screened to ensure their quality using XLSTAT software, Add in soft, New York, NY, USA before they were used to estimate SMC. The Grubbs test was used to identify the high level of presence of outliers among variables which may have a large impact on the model. Data normalizing has not been used in this study because the value range of predictor variables may not be the same due to intra-month fluctuations of remote sensing data. Normal distribution of the SMC data and predictor variables have been then checked by means of the jarque-bera (JB) test [55]. The JB statistic test compares the discrepancy between the distribution of data and an ideal Gaussian distribution [56]. To understand the statistical relationship of the eight predictor inputs on surface soil moisture, cumulative sums (CUSUM) statistic was used to measure the linearity and nonlinearity fashion. The CUSUM statistic measures the strength of linear association, which is defined as a running sum of the number of observations above and below the fitted regression line between dependent and independent variables. It is expected the points above and below the line are randomly scattered, when the relationship is linear and therefore the CUSUM statistic is small. One-way analysis of covariance (ANCOVA) is also used for assessing the influence of landuse as a categorical variable on SMC.
The presence of outliers data may give rise to a biased model [57] and a less accurate estimate of soil moisture. Therefore normal distribution test and outliers test of the variables have been checked by means of the jarque-bera test and Grubbs test ( Table 2). As it turns out some variables do not conform to normal distribution (p > 0.05) but met the absence of outliers assumption (p > 0.05). This micro-variability would be considered acceptable for the desired level of soil moisture estimation that can be detected at the scale of sampling based on the current study. From a descriptive point of view, it can be stated that the data are sound without major biases and acceptable for machine learning modeling ( Table 2). The range of coefficient of variation (CV) for soil moisture is 0.72, considerably high, representing that soil moisture in the datasets has a great level of dispersion around the average and therefore sampling points relatively covered a variety value of soil moisture content. Despite considerable variation in soil moisture among sites, the relative variability (i.e., CV) of spectral reflectance of bands across sites was relatively similar, with red band and SWIR2 showing the highest CV and LST the lowest. This provided some degree of confidence in the soil moisture estimation.

Database for Machine Learning Regressions
In this paper, we applied the most commonly used machine learning methods, including random forests (RF), artificial neural networks (ANN), support vector machines (SVM), and elastic net regression (EN). The output cell size of SMC predictors was again set to 30 × 30 m and the snap raster was realigned to the resampled to ensure cell alignment. The area grid was 1164 rows by 1523 columns (i.e., the total number is 1,255,859 cells with excluding no-data pixels in the Extent domain of study area). A total of 10,046,872 predictor cells were created and all layers were then subsequently converted to points, from which appropriate CSV format files were created and the spatial databases of each factor used in the Orange 3.24.1 open-source machine learning software (http://www.ailab.si/orange) [58] to run MLs, and later re-conversion to incorporate them into the GIS database. The estimation of soil moisture data for the study area was converted to a raster grid with 30 × 30 m cells for the study area. Due to the huge amount of predictor data that is generated by GIS software, Arc GIS 10.7, there was a need to automate the process. Figure 3 represented the flowchart of the methodological approach used in the present study. The flowchart consists of four main steps: (1) data preparation including input variable selection (VIS/NIR/and SWIR domain, land surface temperature and Landuse) and output variable (surface soil moisture), (2) data splitting in machine learning, (3) machine learning algorithms, and (4) statistical validation of SMC maps produced by the machine learning algorithms (see Figure 3). Details of each step are described in the following sections.

Tuning Machine Learning Models
The next important stage in using MLs is the optimization of model itself. Tuning plays an important stage in the performance of MLs especially when the tuned model is going to use for semi operational mapping of surface soil moisture. Choosing appropriate hyper-parameters will result in high accuracy which may truly be translated as the success of the model. Each ML has a different setting of hyper-parameters that govern the learning model. However, it is unknown beforehand which values of the tuned parameters for the models are suitable for the SMC estimation. For that reason, to achieve a high prediction performance with the models, some parameters need to be tuned in the training process of the established model. Although k-fold cross-validation is the most suitable technique for training a successful ML, but the full advantage of accuracy cannot be taken by ML without tuning the model by specific external testing data. External testing data use in generalization which is the main point in tuning MLs model parameters. Because it tells us how well the learning model applies to specific data not used by the model. In the present study, the best model hyperparameters were chosen after making a judgment about several settings with expert intervention. In order to be able to assess the generalization capability of the MLs, 10-fold crossvalidation of the model (49 samples), and external testing (9 samples) was applied to machine learning regressions (See Figure 1 for more details).
The MLs developed in this study were also evaluated on external testing that does not contain any pattern used in 10-fold cross-validation. The same data were used for 10-fold cross-validation and external testing of the MLs. The proposed model is first trained and validated with data in March 2017 and then tested on data from May 2017 to gain a confident estimate of the models' performance. In this way, the prediction of SMC for unused point samples in May 2017 in the models can be analyzed. In other words, possible external testing helps us to sure the robustness of the final model in intra-and inter-season. Therefore, when modeling soil moisture, it is necessary to evaluate the accuracy of the model by external testing. In this study, 15% of the whole datasets being used for external testing the ML. The training subsets were used to train the machine learning regression (MLR), while the validation subsets to compute the cross-validation of the trained model. The validation subsets served as independent noise data to test the generalization capabilities of the trained MLR, especially when the training data are limited. The MLR was also compared by a k-fold cross-validation framework, sometimes called rotation estimation, as the heuristics technique [59]. The cross-validation estimator delivers nearly unbiased estimate information on model accuracy since the model is tested with k-folds not involved for model development [60]. The performance of the RF, ANN, SVM, and EN models on the full data set has been evaluated in this study by 10-fold cross-validation [61]. This cross-validation was applied to show that the experiment results of the algorithm are repeatable and not dependent on a particular subsample of the database. K-fold cross-validation includes randomly dividing the data into 10 k equal sized subsamples or folds. Iteratively, 8 subsets were applied to train the model, 1 as a validation dataset to stop the training procedure, and 1 as a test set to evaluate the performance of the model. This procedure has been repeated 10 times for each fold sequentially, which means each of the 10 subsets was applied once as a test set [62].

Artificial Neural Networks (ANNs)
A multi-layer perception (MLP) is a class of feedforward neural network (FNN) which has been selected to estimate surface soil moisture content. The configuration of the ANN has to be designed by the trial-and-error procedure because there are no definitive rules to find out the optimum configuration [63]. The following hyperparameter settings have been introduced: 6, 4, or 3 input neurons, 3 hidden layers, a regularization parameter of 8, and a maximum of 200 iterations. In this study, the identity activation function is used for all of the neurons in the hidden layer which is equal to fitting a linear regression model (note that an identity function is always employed as the activation function in the output layer). The backpropagation (BP) algorithm is a common gradient-based algorithm used for training FNNs. In order to solve some of BP problems, metaheuristic algorithms have been represented to train FNNs [64], namely the limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm as the family of quasi-Newton, stochastic gradient descent (SGD), and Adam method [65]. The Broyden-Goldfarb-Shanno (BFGS) is used in this study, which is one of the promising methods for training neural networks and it is the best popular quasi-Newton algorithm [66] for general optimization as well as general machine learning [67]. BFGS method is not dominated by noise and has great advantages over the conventional backpropagation (BP) algorithm, including superlinear convergence, good approximation and generalization abilities, high accuracy, and less prone to overfitting [68][69][70]. Regularization used in this study which can use in filtering out noise from data, and eventually avoid overfitting by adding bias to penalize extreme parameter weights [71].

Support Vector Machines (SVM)
SVMs are a class of supervised learning algorithms, which is derived from statistical learning theory and were first introduced by Vapnik [72]. In some ways, SVMs are closely associated with ANNs [73]. The SVMs have been successfully applied for both classification and regression proposes [73,74]. SVM uses a kernel function for transforming the input data and then apply linear regression to the transformed data [75]. SVM involves two steps, including, (1) selecting an appropriate kernel type and setting kernel parameter (kernel width G) and (2) specifying the penalty parameter C [76]. A set of varying hyperparameters, with the best results obtained with regression cost (C) of 0.5, complexity bound (V) of 1.00, sigmoid kernel with g 0.98, c 0.10, numerical tolerance of 0.03, and a limited value of iterations, equal to 200, have been selected.

Elastic Net regression (EN)
Regularized regression techniques were developed to overcome the weaknesses of ordinary least squares method regression [77]. In such a method, a penalty parameter, representing a bias to be added to the regression coefficients in the equation, is introduced as a regularization parameter [78]. The significance of imposing the penalty is to shrink the coefficient values towards zero or near zero which effectively allows less independent variables to have a coefficient close to zero or equal zero [78]. A linear regression model that applies the L1 regularization technique is called Lasso (least absolute shrinkage and selection operator) and a model that uses L2 is called ridge. Elastic net regression can improve the predictive performance of Lasso regression by combining the grouping effect of ridge regression with the lasso regression. Thus, elastic net has two tuning parameters: α controlling the balance between L1 (lasso) and L2 (ridge) penalties, and λ controlling overall strength of the penalty [78]. The elastic net regression adopted in this study with the hyper parameter settings: α = 0.15 and L1:L2 ratio of 0.71:0.29. These tuning parameters were obtained using both k-fold crossvalidation and external testing data set.

Random Forest (RF)
RF is a nonparametric method that is robust to noise in predictors and thus does not require a preselection of variables [79,80]. RF has several main advantages over another statistical modeling, such as the ability to find high dimensional non-linear relationships, using of categorical and continuous predictors, resistance to overfitting, and relative robustness to noise in predictors and therefore does not need a preselection of variables, and only a few user-defined parameters [80,81]. Numerous trees are composed by the algorithm and the final predictions involve the average of the results from all developed trees in the forest [82]. It is necessary to define a priori three user-defined parameters before running the model, namely the number of trees in the forest, the number of attributes for consideration at each split, and growth control which represents the depth of individual trees and split subsets. Applying RF the following hyperparameters have to be set: number of trees, number of random attributed at each split, the seed for random generator, and depth limit of individual trees. They have been set equal to 50, 5, 5, and 30, respectively; furthermore, limit the depth of individual trees (pruning) has been adopted to avoid over-fitting.

Model Validation and Assessment
Model validation deals in this study with regression metrics and statistical tests. In order to assess the performance of the developed models quantitatively, evaluation metrics involving the mean bias error (MBE), mean absolute error (MAE), root mean square error (RMSE), index of agreement (D-index), and Nash-Sutcliffe model efficiency coefficient (NSE) have been used. MBE is primarily used to indicate the average bias in the model. The positive MBE indicates an overestimation in the predicted SMC while a negative MBE shows an underestimation [83]. The MAE is the simplest measure of estimate accuracy which measures the average magnitude of the errors in prediction data, without considering their direction. It is the absolute value of the difference between the predicted values and the observed ones. MAE may be preferred because it is a simple error metric and is less sensitive to outliers than RMSE [83,84]. RMSE measures the variation between predicted values by a model and the measured values. The lower RMSE indicates good agreement between estimation data and observation data. Both MAE and RMSE represent average model prediction error from 0 to ∞ in the same units as the response variable.
The Nash-Sutcliffe efficiency (NSE) and index of agreement (D-index) coefficients as normalized error-index statistic have been used in this study to seek the model on the basis of its reliability. Dindex was proposed by [85] as a standardized measure of the degree of model prediction error which is dimensionless and in the range of 0-1. The index of agreement does not measure correlation or association between two series of variables. Instead, it provides a measure of the agreement to which the model predictions match exactly the observed data with no proportionality [86]. However, it is very sensitive to extreme values due to the squared differences. Nash and Sutcliffe [87] proposed an alternative goodness-of-fit index to overcome the limitations of the correlation coefficient, which is referred to as the efficiency index (NSE). NSE represents the relative magnitude of the residual variance compared to the measured data variance. The NSE theoretically varies on the range −∞ to 1 and higher values of NSE show a better agreement between predicted values and observations, a model with NSE of 0.0 is no more accurate than predicting the mean value, and negative values represent that the model is worse than the mean value as a predictor. Model performance is considered "acceptable" when NSE ≥ 0.50, and D ≥ 0.70 [88]. Once the highest accuracy model according to evaluation metrics on the validation data has been identified, another field test data has been selected to evaluate the model performance; this procedure has been set up to check the accuracy of SMC prediction with the same model. The model has been then finalized to make SMC predictions for the entire extent of the study area and thus generate a map of the predicted soil moisture.

Linearity Test between Soil Moisture and Optical Domain
The statistical comparison between single bands and SMC shows that the reflectance in visible wavelengths decreases as moisture content in areas (Table 3). There is also a tendency that wetter soils correlate to lower temperatures. The results for the (SWIR) spectral regions show no clear tendencies. The test for non-linear correlations indicates that only the blue band and surface temperature are linearly correlated to SMC. These suggest that spectral surface reflectance of the samples over different wavelength is influenced by surface soil moisture in non-linear form during the wet and cold season. It becomes clear from Table 3 that there is an effective absorption for wet soils in the visible domain. In contrast, it is found that the correlation between NIR and soil moisture is positive (also given in Table 3). The results of the ANCOVA show a significant relationship between landuse and SMC (p < 0.05). Forested areas (5-25%) have the highest SMC (20%) and Irrigated cultivated areas and Forested areas (Less than 1%) have the least SMC (7%). Higher soil moisture in the Forested areas (5-25%) may be due to relatively low potential evaporation from the soil surface. Haplic Fluvisols (Calcaric) exhibits high soil moisture values with an average of 30% moisture content. These soils are characterized by young alluvial soils mainly found along rivers or other low terrain positions, which show stratification or other evidence of recent sedimentation. aplic Fluvisols (Calcaric) are located in more humid areas in the study area because they receive runoff water additional to their portion of precipitation. The lowest value of moisture content (less than 10%) was found in Haplic Calcisols which is characterized by soils of (semi-) arid regions with enrichment of secondary carbonates.

Estimating Soil Moisture by Implementing Machine Learning Models
The soil moisture information provided by the MLs is used to produce a box plot in conjunction with Landuse classification map ( Figure 4). It can be seen that there are some distinct differences between the land-use classes in terms of soil moisture. Visual inspection of box plots showed that there was generally the same SMC prediction between EN and ANN from entire landuse classes. The mean comparison of EN against ANN showed that these techniques gave the same range of soil moisture prediction for the Landuse classes. The box plots also represented RF having a slight edge over SVM. Box plots indicate that forested area (less than 5%) and rangelands (5-25%) have the lowest mean of soil moisture and slightly different from the other land-uses. The entire standard deviation of the mean soil moisture clearly discriminates RF and SVM from the EN and ANN (blue highlighted area in Figure 4). Values of the first (25%) and the third (75%) quantiles of soil moisture for EN and ANN have the same values for the eight land-use classes, while the entire range of SMC values (from the lowest to the highest value in the data set for the soil moisture) is not notable. The boxplot of SMC in Figure 4 shows that the rainfed cultivated areas and forested area (1-5%) tend to be more SMC than the other land-use classes. The standard deviation of the mean of the SMC derived from ANN and Elastic Net regression models are broader, indicating very high variations of SMC in terms of soil moisture estimation (thin blue line in Figure 4). However, the standard deviation of the mean for SVM and RF is distributed narrow, presenting more pixel values in the low range of SMC which indicates a low variation of soil moisture. Overall, the ratio of the variance to the mean represents that both RF and SVM generally estimate SMC with less dispersion with values of 0.25 and 0.32 across all categories. In contrast, the dispersion values for ANN and elastic net regression models are 0.78 and 0.76, respectively. After implementing models, ANN, SVM, RF, and EN regression have predicted soil moisture content of the study area showing the spatial distribution of soil moisture at 30 m resolution. The models have been applied to the landuse data and the Landsat 8 OLI and TIRS data on image dated 7 March 2017, that captured the entire study area to map soil moisture content ( Figure 5). Relatively high SMC were found in slope with a gentle gradient in the lower parts of South of the area and near to the agricultural and urban and suburban areas and also coalescing alluvial fans connecting to the mountain, however, mountain lands in North of the area experienced very low SMC (less than 6%), dominated by less foliage with rock exposure, and steep slope in this area. The results show that the spatial pattern of dry and wet trends is mostly the same but the range of values is slightly different. Almost all the models have totally agreed upon the low SMC to find the high land area and highest levels of SMC on areas of river floodplains and alluvial fans on 7 March 2017; this behavior has been mainly shown by ANN and elastic net regression model. The SMC map of ANN and Elastic Net regression ( Figure 5) is too small to show a few soil moisture saturated individual grid cells. These cells of wet areas show a close correspondence with the local poorly drained areas in the study area. This is due to local control of terrain which has an influence on the wet pattern of soil moisture where areas of high local convergence may be the cause of a temporary increase of water content following rainfall in the soil. The obtained models have been iterated to estimate SMC at the next time step on 26 May 2017, as testing data sets ( Figure 6). Analyzing the results during testing, it can be observed that ANN and Elastic Net regression models have represented very wet conditions compared to SVM and RF which can be seen for training data sets as well. For SMC testing dataset the ANN and Elastic Net regression models have estimated the SMC of 0-100% while the SVM and RF models estimated the SMC of 3-10% and 0-20%, respectively. This spatial variability could be due to both model structure and data noise. Therefore, even for the same data sets, different models may have different implications for the model fit to the data and will perform poorly in generalization.

Validation of Machine Learning Models
A comparison between the results from the testing data set (Table 4) shows that the estimated SMC for RF model has a RMSE of 4.60%, Nash-Sutcliffe efficiency 0.73, an index of agreement 0.91, and a smaller bias (MBE = 2.16) compared to other ML models. The RF model consistently outperforms all other models. Values from the RF model are acceptable comparing to those from the other three models for testing datasets; therefore, it has been chosen as the best method to predict SMC at the unsampled locations in this study. Visual analysis of SMC has shown that RF can be considered a robust spatial predictor model that could estimate SMC in different topographic features and slope and aspect topography ( Figure 5). The results show that there are large differences in error rates among the four data mining techniques. There are also relatively big spatial differences in SMC among the four techniques, specifically, the spatial differences of SVM and RF are thoroughly obvious. Furthermore, the correlation of models increased when landuse included in MLs (ANN = 0.49, SVM = 0.44, RF = 0.56, and Elastic Net regression = 0.55) compared to that obtained without integration landuse data (ANN = 0.48, SVM = 0.4, RF = 0.5, and Elastic Net regression = 0.5), which has suggested that auxiliary geospatial data produced better SMC prediction than was expected using the optical values alone. The scatter plots between observed and retrieved soil moisture covered by six dominant land-use types using ML models are shown in Figure 7. According to the scatter plots, SMC could be predicted with moderate accuracy for the RF (r = 0.56), while the SVM had the lowest accuracy of estimation.

Correlation of Soil Moisture and Predictor Variables
Nonlinearity between soil moisture content and spectral reflectance in topsoil was significantly observed over the Green, Red, NIR, SWIR1, and SWIR2 wavelength, and linearity observed for Blue wavelength and LST. Soil spectral reflectance decreases at 0.4 μm to 1 μm wavelengths when moisture content increases [89]. An earlier study has shown that the decrease of spectral reflectance upon wetting of soil is non-linear because of the hydraulic behavior of water in unsaturated sand [90]. Soil reflectance was observed to be nonlinearly correlated with soil moisture, which was well correlated by a curvilinear exponential model between the 1100 and 2500 nm wavelengths [91]. Even though green and red wavelengths have a nonlinear relationship with SMC, the Pearson correlation coefficient represents a fairly weak negative relationship of visible wavelength with SSM [23,92]. SSM experiences a fairly weak positive relationship with NIR wavelength (Table 3) and this trend is consistent with the results of other studies that used reflectance of MODIS data in a tropical area [92]. This is likely the result of [27] that found the positive correlation between NIR wavelength and SMC for bare soil area and negative correlation with vegetated soil, however, the existing studies showed an exponential negative relationship between soil moisture and NIR reflectance [23]. Estimating surface soil moisture is more complicated over NIR reflectance for a mixture of bare soil and vegetation because the relationship between NIR reflectance and moisture measured in the field for vegetation and bare soil is totally different [27]. The NIR reflectance of vegetation is higher than that of bare soil [93]. These differences results were coming from different examined soil samples which collected from miscellaneous landuse characteristics. Nagy et al. [26] found that the increase of soil moisture content did not result in linear changes in reflectance value at 950 nm and 450 nm and the equations for SMC estimation were set up separately for different soil texture. The linear correlations between spectral features and SMC over 25-30% and less than 5% of moisture content is not significant [26].
In this study, Pearson's coefficient did not capture the linear correlation between SWIR1 and SWIR2 wavelength and SMC. However, the best correlation between reflectance and SMC was found for the reflectance of short-wave infrared bands SWIR1 and SWIR2 for bare soil and vegetated soil [27]. It has been shown that soil spectral reflectivity is recognized as a function of water content but spectral reflectivity can be affected by intrinsic soil factors such as the amount of organic matter, particle size distribution, mineral composition, surface roughness, and color of soil elements [94,95]. Land surface temperature shows to have a week negative influence over soil moisture (Table 3) which highlighted the more complex behavior of the SSM-LST relationship due to the complexity of land surface and for a mixture of bare soil and vegetation [94,96,97].
The results of the study show a significant relationship between landuse and SMC (p < 0.05). Land-use is very significant in determining the spatial variability of SMC because it influences vegetation cover and infiltration and runoff rates, evapotranspiration processes, soil surface characteristics [98,99]. Landuse can even eliminate the effects of related parameters of the topography on SMC [100]. Several studies provide solutions for estimating soil moisture by reflectance images and auxiliary geospatial data [101,102]. Higher soil moisture in the Forests of Haloxylon spp (5-25%) may be due to relatively low potential evaporation from the soil surface. It was observed in a previous study that the amount of clay and silt increases rapidly when sand dunes are stabilized by Haloxylon spp because of suspended particle accumulation and fine particulates which are produced by the weathering mechanism of sand [103]. Increasing clay and silt content also had a slightly positive correlation with soil water content [104]. Topsoil (0-5 cm) moisture content in the youngest Haloxylon ammodendron was significantly less than older plantations [104] because soil surface experiences high evaporation potential due to high solar radiation.

Perspective of Machine Learning Algorithms in Soil Moisture Content
Soil moisture retrievals were computed using four different machine learning algorithms in a semi-arid area, using optical and ancillary data at high spatial resolution. Based on the performance indices reported in Table 4, it is noted that the RF model estimated soil moisture has RMSE less than 4 for all the selected samples during the testing phase. These comparative results are consistent with those of [105] who used VNIR hyperspectral imaging over the plot set up with an area of one square meter and the results indicated that RF outperformed SVR, ANN, and EN in estimating SMC. Although the other studies reported slightly better estimation accuracy [106] of SVM, this may be due to the different scales of the study areas, topography conditions, soil characteristics, sampling densities, incorporating the missing environmental data or quantity and quality of the auxiliary data used to estimate SMC [107,108]. It was indicated that the uncertainty of SMC does not vary with the spacing of point samples in southeastern Australia [109], however, it was represented that uncertainty decreases as the sampling density increased in the Australian National Airborne Field Experiments 2005 (NAFE'05) [107]. From these studies, it can be concluded that the performance of models to estimate SMC is very site-specific and the complex characteristics of uncertainty and the sampling configuration are very dependable on the study scheme. Although the accuracy of the capacitance probe for field soil moisture measurement is still under investigation, the integration of PR2 Profile probe with remote sensing data is capable of monitoring surface soil water content especially when the amount of surface soil water is required on a site-specific [110]. Since a universal relationship between SMC and remote sensing data does not exist, it must be explored empirically by calibration data.
It can be obviously seen in previous researches that ANN model and SVM model have been the most popular method to predict the soil moisture content due to solving non-linear relation between input and output with high accuracy [28]. However, the present study showed that the predictive results of ANN and SVM on the testing dataset are vulnerable which is affected by the uncertainties on nonlinear relation. ANN and SVM structure with the best performance is difficult to be determined [111] however RF is fast to train and tune as compared to other ML methods [112] because RF only has a few tuning parameters. In the present study, we found larger spatial differences in some cases among the machine-learning algorithms. EN offers the most comfortable usage technique because very few parameters have to be optimized and calculation times are marginal for big data such as remote sensing data. However, these advantages are covered by poor prediction performance. Random Forest in contrast is easy to use since only two parameters need to be set by the user. It was noted that RF model is suitable when sample plots and variation are relatively large (i.e., LAI with more than one growth period) [113]. As shown in this study, surface soil moisture was measured in the fields in two periods with high variation. An important characteristic of RF is that it is relatively resistant to overfitting problems [114] and the RF algorithm has well proven to handle high-dimensional data sets and, thus, has a high tolerance for data faults [115]. However, it may be difficult to find an efficient train RF model with a small sample size [116]. The learning of ANN is based on large samples and the performance of the ANN is affected by the complexities of the network structure and the sample. The higher learning accuracy by increasing more neurons may lead to weaker generalization ability. Therefore ANN makes prone to over-learning and reducing the ability for generalization [113] as it was shown for the testing dataset in this study ( Table 4). The ANN model is more suitable when sample plots and variations are relatively low (i.e., LAI for a single growth period) [113]. Based on our results, the SVM model is the second appropriate for soil surface moisture estimations. At some point, a small sample size might be not enough for optimal training of SVM [113].

Soil Moisture Status Over Croplands
SMC is thought to be controlled by soil texture, vegetation type, and microclimate. Figure 8 represents the soil moisture within three vegetation vigor classes (Dense, Moderate, and Sparse). As shown in Figure 8, the soil moisture has not a clear association with three vegetation vigor classes. Because the vegetation status (e.g., type, age, growth stage, and frequencies and durations of irrigation and leaf area index) are not uniform among different fields . Relative high soil moisture was observed for some croplands with sparse vegetation. This soil moisture status is clearest in Figure 8 where the croplands might be under a heavy irrigation event at the time the Landsat data was acquired on 7 March 2017. However, some parts of cropland with moderate to dense vegetation were experienced low soil moisture. In a general concept, high correlations were found in densely vegetated areas (normalized difference vegetation index) and SMC [117], however from Figure 8, this study represented that heavily vegetated areas do not always have a high amount of SMC. This finding is generally consistent with the results of Tianjiao, et al. [118], who noticed that different physiological characteristics of vegetation types have different effects on soil moisture content in a semiarid climatic zone. Dense vegetation also reduced the stability of soil moisture [119] that can be seen in dense vegetation areas in Figure 8. Also, some of the field areas in the study area were not irrigated during the growing cycle and were expected to be less moist. Cropland areas in the study area are dominated by expensive crops (e.g., Pistachios) and the relative soil moisture values of dense vegetation areas were generally less than 10%. Therefore water stress can influence heavily vegetated covers in the area over the study period. Different crop types have different water demands and respond differently to water stress that causes heterogeneity of surface soil moisture after uniform irrigation [29]. This soil moisture spatial heterogeneity for dense vegetation appeared in the form of cropping patterns. Vegetation influences soil moisture and researchers have found that different plant species can influence the spatial and temporal properties of the soil water content [118,120,121]. These studies clarified that plant species can consume amounts of soil water and cause soil drying [119,122,123]. Other factors such as soil texture, surface roughness, and the temperature of the upper soil layer affect soil moisture on the upper 5 cm of the soil [124]. Topographic variability, which influences soil properties and exposure to wind and solar radiation [125]. Therefore the global function of SMC index maps obtained on the different areas is not necessarily useful because they are extracted from a different environment, and calculated by various parameters. This issue complicates the monitoring of the SMC in a specific site.

Further Analysis of SMC with Operational Satellite-Derived Precipitation and Soil Moisture Products
In this study, soil moisture estimates of the Landsat 8 from 7 March 2017 and 26 May 2017 were compared with the SMAP 9-km soil moisture product. Even though the coarser-resolution soil moisture estimates from SMAP cannot be used for pixel to pixel comparison, but the dry-down and wetting trends can be assessed for validating the algorithm. Therefore an average for the entire pixel of soil moisture estimation from landsat was used to compare with SMAP. However, an incorrect consistency of spatial resolution may have been introduced biases into the results. The CHIRPS gridded precipitation dataset with 0.05° spatial resolution were also used in order to facilitate the analysis of the SMC generated maps because precipitation is a proxy for soil moisture. Figure 9 represents daily precipitation data created by CHIRPS for the study area and particularly L3 SMAP products were also added to the time-series for evaluating the ability of the Landsat 8 SMC estimates in capturing the surface soil moisture due to precipitation. Figure 9 clearly indicates that the Landsat 8 SMC estimates at two days (black dot points) follow the precipitation events (e.g., wetting events at the beginning and dry downs at the end of the year).
The beginning of the wet season recorded accumulated precipitation of about 47 mm, since the ending of February, soil moisture trend for both SMAP and Landsat 8 derived soil moisture has started to increase compared to a dry season to less than 5% at the end of May. The effects worsened during the dry season with a rainfall deficit continuing to decrease soil moisture until the end of the season with about 0%. The corresponding L3 SMAP and Landsat 8 soil moisture estimates are slightly different. However, both soil moisture products exhibit a similar trend when the surface soil is mostly wet and dry. It is possible that some agricultural lands become wet from irrigation in the study area, despite there being no significant rainfall event. Using satellite-based soil moisture estimates in arid and semi-arid areas has several advantages: (1) Regional coverage enables soil moisture monitoring of large part of natural ecosystems; (2) 16-day time period improves the ability to monitor the drought-related events in seasonal coverage; (3) precise measurement of water consumption for irrigation needs the soil moisture mapping in irrigated and non-irrigated areas; (4) the application of algorithms enables us to apply it for upscaling SMC to other sensors with high temporal resolution and to capture the irrigation pattern. It must be stressed that this study used only two months (March-May 2017) and soil moisture estimates that can have higher uncertainties over regions of dense vegetation, and therefore validation procedure needs to be performed using in situ observations [126]. Therefore, care should be taken in generalizing results from the present study.

Conclusions
In this study, four different types of rapid nonlinear data-driven models (SVM, ANN, EN, and RF) for predicting the near-surface (5 cm) soil moisture (SMC) were established in practice over different land-use types by combining Landsat-8 surface reflectance data and land-use auxiliary geospatial data on a (30 m resolution) pixel basis. Predictions show good agreement with observed soil moisture measurements. Results from the RF, SVR, ANN, and EN modeling are compared with measurements obtained from PR2 field data and show that RF model performed better for soil moisture forecasting than the other three methods in testing cases. From the result of Nash-Sutcliffe efficiency, the predictive SMC produced from RF has the highest explanatory ability (NS = 0.73). The accuracy of machine learning methods depends on the selection of an appropriate function and parameters. Various techniques are often based on the vegetation index, triangle method, and trapezoid method in the optical domain to estimate SMC rather than on the reflectance of soils. It is represented in this study that deriving surface soil moisture from remote sensing data is still complicated, as the surface soil moisture is not just a function of the reflectance of soil, but is affected by spatial auxiliary data such as land-use types. This hinders solutions for estimating soil moisture using reflectance images. Clearly, the 30 m remotely sensed images show more spatial detail in the surface soil moisture quantitative estimates and also spatial pattern than the sparse ground measurements. This 30 m soil moisture can serve a variety of water resource applications such as watershed modeling over semi-arid climate. More detailed estimations of surface soil moisture will in turn benefit agriculture, with the monitoring of vegetation stress, and provide a valuable dataset for drought monitoring in a semi-arid climate. The limitations faced by this study were time, space, and amount of field soil moisture measurements. However, this study covers several gaps related to the usage of surface reflectance and land-use data, even if the potential of incorporating other variables for estimating soil moisture for future research is still excessive. In conclusion, the high-resolution gridded soil moisture presented in this study can be used in spatial decision support tools for precise irrigation scheduling and to indicate areas with plant potential limited by the lack of soil moisture.