Statistical Modeling of Soil Moisture , Integrating Satellite Remote-Sensing ( SAR ) and Ground-Based Data

We present a flexible, integrated statistical-based modeling approach to improve the robustness of soil moisture data predictions. We apply this approach in exploring the consequence of different choices of leading predictors and covariates. Competing models, predictors, covariates and changing spatial correlation are often ignored in empirical analyses and validation studies. An optimal choice of model and predictors may, however, provide a more consistent and reliable explanation of the high environmental variability and stochasticity of soil moisture observational data. We integrate active polarimetric satellite remote-sensing data (RADARSAT-2, C-band) with ground-based in-situ data across an agricultural monitoring site in Canada. We apply a grouped step-wise algorithm to iteratively select best-performing predictors of soil moisture. Integrated modeling approaches may better account for observed uncertainty and be tuned to different applications that vary in scale and scope, while also providing greater insights into spatial scaling (upscaling and downscaling) of soil moisture variability from the fieldto regional scale. We discuss several methodological extensions and data requirements to enable further statistical modeling and validation for improved agricultural decision-support. Remote Sens. 2015, 7 2753

There are substantial challenges in modeling soil moisture and integrating remote-sensing and ground-based data reliably, given significant spatial and temporal measurement variability and model prediction uncertainty.While soil moisture estimation from Synthetic Aperture Radar (SAR) polarimetry (or scatterometer) data is a topic that has been investigated for over 30 years, with numerous papers having been written and statistical approaches developed, SAR and models using such data are nonetheless continuing to be re-configured, improved and extended given the wider availability of SAR data and to address a rapidly growing demand in its use in a broad set of industrial and environmental applications [1].More reliable predictions of soil moisture are needed when optimizing crop water use and validating satellite remote-sensing/earth observational information [2,3].Agricultural crop irrigation scheduling, disaster response and water management during droughts or flooding extreme events, soil erosion and pollution monitoring making use of hydrological models, all require reliable predictions of daily and field-scale soil moisture.
Soil moisture is a key variable used to calibrate complex agroecosystem models and for forecasting crop yield at the regional scale, and increasingly hydrological and agroecosystem models are being used in environmental decision support and policy-making.Yet, despite its broad importance, field-scale soil moisture data are often not available or closest neighbor values are used when modeling hydrological and biochemical processes or when calibrating regional-scale predictions generated by complex agroecosystem models.This is, in part, due to constraints and limitations in acquiring and assembling such data over large regions and across sufficient time-periods; the acquisition process is not only costly, but labour intensive, and has high variability when upscaled from the field, to landscape, up to the regional-scale [4][5][6].Instead of relying on direct soil moisture information validated against remote-sensing data, auxiliary predictions are often substituted based on indirect, interpolative or extrapolative assumptions that may not be statistical accurate, nor readily verifiable.Coupled with such challenges, there is also a lack of sufficient understanding that is required to optimally: (1) predict soil moisture across sites or regions where data are sparse or not available, and (2) generate predictions that are robust under different environmental and land-management conditions, given high observed variability at the field-scale, as well as, high stochasticity linked with changing weather patterns and the timing and severity of rainfall events.
Soil moisture is a process that is strongly time and space dependent.Nonetheless, there are advantageous properties of soil moisture variability that enable one to use available data, obtained at specific locations, to predict for unobserved times and spatial locations, namely: (1) a deterministic relationship between the high "dielectric constant" of water and variation in horizontal and vertical "backscatter" in remote sensing (hereafter denoted by RS) data; (2) reproducible spatial-temporal patterning and trends that arise, for example, from spatial variation in soil type and characteristics and/or seasonal patterns of stochastic rainfall events, and (3) significant dependence between soil, vegetation, climate/atmospheric, topographic and other environmental variables in time and space.

Research Objectives
In this paper, we present a flexible, integrated statistical-based modeling approach to improve the robustness of soil moisture data predictions.We apply this approach in exploring the consequence of different spatial correlation assumptions and choice of leading predictors and model structures.Previous investigations that have applied statistical models have not included variable (covariate) selection, spatial correlation aspects, and propagation uncertainty [7][8][9][10][11][12].We demonstrate our approach using a multi-site data across an agricultural study area in Canada.This selected data was associated with conditions of high environmental variability and homogeneous terrain and thus provided a strong "stress-test" for predictive-based modeling.Our aim was to generate new findings and insights on the: (1) selection of different predictor variables from a set of competing ones linked with available RS data, expert knowledge and semi-empirical algorithms, and (2) selection of different models with differing spatial correlation assumptions.A statistical modeling approach that integrates variable and model-based selection offers greater flexibility to enable models to be more broadly applied across a wide range of applications.The approach we describe also deals with overfitting in the multivariate context.We utilize a broad set of statistical validation measures (e.g., AIC, BIC and DIC criterion), including cross-validated RMSE (CVE) and correlation (CVR) for evaluating the performance of model soil moisture predictions.
The paper is structured as follows: Section 2 includes a summary of the data collection methods.Section 3 defines our statistical modeling approach and the procedures we applied for selecting, optimizing and evaluating the performance of different sets of predictors, covariates, model structures, and spatial dependence.Section 4 presents results on predictor selection and validity, the relative performance of different statistical model structures and the relative influence of spatial correlation on model performance.In Sections 5 and 6 we summarize our findings, their implications and the importance of applying statistical-based modeling that enables automated selection of predictors, covariates, model structural and spatial correlation for optimizing soil moisture predictions and obtaining robust, cross-validated model performance statistics, integrating SAR and gound-based data.We also outline our future work and goals.

Study Region and Data Sources
The study was conducted in an agricultural area located in the county of Prescott-Russel in eastern Ontario near Ottawa, Canada, centered at 45.37 • N, 75.01 • W. This agricultural research site was established by Agriculture and Agri-Food Canada (AAFC) in 2006, in a region of non-irrigated dryland agriculture and under private land ownership, approximately 50 km east of Ottawa.Field size averages 20 ha (relatively small) with a crop mix of corn, soybean, cereal and pasture-forage.The growing season is May through to September.
RADARSAT-2 (MacDonald, Dettwiler and Associates Ltd., MDA) data supplied to the Government of Canada (GC)/Agriculture and Agri-Food Canada (AAFC) was obtained with images acquired over 25 × 25 km areas during three field campaigns on 5, 16 and 23 May (i.e., early in the growing season) in 2008 .RADARSAT-2 is an Earth observation satellite that was successfully launched in 2007 for the Canadian Space Agency (CSA).It is equipped with a fully polarimetric, synthetic aperture radar (SAR), operating at C-Band (5.3 GHz).Fine-quadpole beam modes (FQ19, FQ11, FQ16) were applied in the 5, 16, 23 May RADARSAT-2 acquisitions, respectively.Hereafter, we refer to each of the three observation days as Time 1, Time 2 and Time 3, respectively.Field measurement campaigns for soil moisture were carried out on SAR data acquisition dates.
Figure 1.The Casselman study region/agricultural area situated in eastern Ontario, outside of Ottawa, Canada.RADARSAT-2 acquisition swaths are outlined, as well as location of large water bodies (The Great Lakes).In the zoomed map, soil moisture sampling locations (red points) are indicated, along with weather stations (green points).These points are super-imposed over contours of slope (digital elevation model, DEM).This map was generated using ArcMap 10.1 (ESRI).
A total of 44 sampling sites (within 42 fields) were used (Figure 1).Each sampling site had a plot area of 120 × 120 m, or roughly 12 × 12 fine quad-mode SAR pixels (i.e., a nominal spatial resolution of ∼8 m).Near-surface volumetric (i.e., in-situ) soil moisture was measured at depths of 6 cm within ±3 h of each RADARSAT-2 acquisition, using a Delta-T Soil Moisture Sensor, hand-held impedance probe, with a non-site specific soil calibration factors used, and an accuracy of ±0.05 cm 3 /cm 3 .For each site, 16 sampling points were selected that were separated 30 m apart.Replicate measurements (3) were obtained within a 1 m radius of each of these sampling points in an attempt to capture moisture variations within the top, middle, and bottom of a soil ridge [13,14].This sampling plan yielded 48 soil moisture measurements per site.These measurements were pooled to provide representative mean estimates of the observed soil moisture variation at each of the 44 sites.Surface roughness measurements were taken at each site using a 1 m needle profiler, consisting of a tripod mounted with a digital camera.These measurements were aligned to the look direction of the radar, and selected to be representative of the entire site area (i.e., field).Ground-based photos were processed using a MATLAB TM application to extract root-mean-square height (h RM S ) and correlation length (CL).Crop residue cover, tillage, soil type, and slope were also measured.Further information on the SAR data acquisition and processing and ground-based sampling are provided in [14].Table 1 provides a summary of the data set and the measurement variables, alongside their mathematical notation for reference purposes.
Table 1.Summary of the relevant SAR and ground-based measurement variables and their mathematical notation for the Casselman agricultural monitoring site.

Variable Description Units
Response variable: m volumetric soil moisture (cm 3 /cm  Estimates of volumetric soil moisture percentage (m), incidence angle (θ), backscatter coefficients (σ vv , σ hh ) and surface roughness parameters (h RM S and CL) at the Casselman site are provided in Table 2. Here, we adopt the notation convention for SAR backscatter coefficients in dB units having subscripts σ vv , σ hh whereby linear values are denoted with superscripts.This is based on the relationship prescribed by, σ dB = 10 • log 10 σ o , where σ o is a linear value having a superscript index, and σ dB is the corresponding log value having a subscript index.95% quantile ranges (i.e., 2.5% and 97.5% quantiles) for each of the continuous variables for each of the time points are included.Incident angle was smallest at Time 2. Mean soil moisture and its variability across the sites was substantially less at Time 2 coinciding with the second repeat SAR acquisition.

Broad Range of Model Assumptions and Predictive Accuracy
There are a wide variety of existing models that can be used to predict soil moisture and integrate satellite, RS imagery data-from simpler deterministic and semi-empirical models to probabilistic optimization methods (e.g., feed-forward neural networks (ANNs), Bayesian, Nelder-Mead gradient-based approaches) [15,16].Theoretical radiation-transfer models, such as the small perturbation model (SPM), the physical optics model (PO) and the geometrical optics model (GO) predict the radar backscatter in response to changes in surface roughness or surface (< 5 cm) soil moisture [17].Because the soil dielectric constant is highly correlated with moisture content (i.e., the dielectric constant of dry soil is about 2-3 and the dielectric constant of water is about 80) one can apply indirect, mathematical inversion/matrix methods to predict soil moisture.However, many of these methods perform poorly when used to predict soil moisture for natural surfaces (i.e., that depart from bare soil) using radar backscatter data due to their restrictive assumptions [17].To circumvent these problems, semi-empirical models were developed to predict soil moisture and surface roughness from radar imagery [17,18].These models use co-polarized back-scatter coefficients, in the horizontal transmit-receive (HH) and/or vertical transmit-receive polarization (VV) to predict soil moisture as they are less sensitive to system noise and cross-interference than the weaker cross-polarized coefficients (i.e., HV and VH).Semi-empirical models assume that the backscatter coefficient is dependent on the soil dielectric constant, and a variable relationship between the dielectric constant and soil moisture.Agricultural sites and their water, soil, weather characteristics are typically very dynamic and heterogeneous.Nonetheless, soil moisture retrieval often employs semi-empirical models-in Canada, they have been also previously applied, their assumptions inter-compared, and combined to extend their range of validity [14].Selecting empirical models in different applications depends both on available data and model-based assumptions and statistical uncertainty.The accuracy of empirical and other models for moisture retrieval changes with sample size/available data as well as site characteristics and conditions-such that they can be limited in their wide application.Models may also ignore the influence of many other relevant sources of variation in agricultural fields, such as the tillage direction, variation in the spatial correlation length of soil moisture variability across different fields, and the influence of landscape topography on the degree and range of spatial dependence in soil moisture variability on a seasonal basis.Model propagation of uncertainty is often not considered.Surface roughness and incident angle are often tuned or adjusted for, but semi-empirical equations, such as the Dubois model (see [17]), may limit the inclusion of additional variables that may lead to more accurate and robust prediction.Bryant et al., (2007) have previously demonstrated how roughness effects on radar backscatter are very complex depending on the configuration of the sensor, and the relationship between root-mean-square-height (h RM S ) and surface correlation length (CL) (i.e., the maximum extent of spatial correlation in surface roughness function in SAR horizontal look-direction), and that the degree of error in soil-moisture measurements can vary tremendously (e.g., < 1% to 82%), depending on whether CL is derived from h RM S or whether it is measured in the field [19].Generally, in experimental studies, there is no relationship between these two independent parameters, however, recent studies have offered empirical, semi-empirical and theoretical approaches for deriving CL directly from a measurement of h RM S and to parameterize radar scattering models like the Integral Equation Model (IEM) for surface roughness requiring only the measurement of h RM S [19][20][21].Rahman et al., (2008) demonstrate regional-scale mapping of surface roughness and soil moisture (using a multi-angle approach and the Integral Equation Model (IEM) retrieval algorithm for sparsely vegetated landscapes), eliminating the need for field measurements [22].A recent review of state-of-the-art with respect to measuring, analysis and modeling spatio-temporal dynamics of soil moisture at the field scale, Vereeeken et al., (2014) finds that ground-based and high-resolution satellite RS data of soil moisture is well suited for near real-time management of agricultural fields and operational, agricultural decision-making, but that more modeling research needs to be placed to understand more complex model-based data collection and adaptive sampling strategies.This is needed, alongside a better understanding scaling (upscaling/downscaling) of soil moisture, to better quantify soil moisture patterns, fluxes and extreme values using statistical models and approaches, while also integrating and optimizing predictors and model performance metrics [23].

An Integrative, Flexible Predictive Modeling Approach
Our statistical modeling approach integrates the RS, ground-based variables and a consideration of the varying influence of hidden or unmeasured variables that mediate spatial dependence in soil moisture prediction.We refer to soil moisture as the response variable of interest at a location s, and denote it as m(s).We combine the RS variables in a row vector, denoted, X r (s), and defined as, The variables, σ vv (s) and σ hh (s), denote vertical and horizontal co-polarized backscatter coefficients, respectively, and θ is incidence angle.Based on physical SAR detection and configuration, the SAR backscatter coefficient can be related to the sine of incidence angle, θ with a proportionality constant that accounts for various physical properties such as brightness, surface roughness and the correlation profile shape.Instead, we specify θ, not sin(θ) in our regression modeling.This does not introduce any physical inconsistencies, arising from the equations not being periodic with respect to θ, because θ only ranges between 0 and π/2.Within this range sin(θ) is a strictly increasing function of θ and maps the interval [0, π/2] to the interval [0,1].Replacing θ by sin(θ) was initially tested as part of our exploratory analysis, but results were very similar and thus θ was selected as the predictor for incidence angle.In the case of a large number of sampling points in time each having different SAR acquisition θ's, one can involve the sinuosoidal (i.e., periodic) function of θ, whereby at each acquisition time (e.g., ±3 h), θ is assumed fixed.Additionally, given the values we utilize here, the small angle approximation applies, whereby θ ∼ sin(θ) ∼ tan(θ) within an error range of 5%-9% (i.e., approximation error for 31-39 • or 0.541-0.681radians).
We define a row vector, X g (s) for the ground-based measurement variables, given by, where h RM S (root-mean-square height) and CL (horizontal correlation length) are measures of surface roughness and ST is the soil type (sand or clay) at the point s.The value of h RM S is the root-mean-square difference of the surface heights compared to its mean in a small area around the point s and CL is the horizontal length of ridges present on the ground [24].Correlation length therefore provides information on how the surface height, at one point on a surface, is related to the surface height at a different point defining a surface-height correlation function.
The statistical modeling equation, integrating both RS data (i.e., X r (s) from Equation (1) above), and X g from Equation ( 2) is then given by, where β 0 is a constant, and β r and β g are column vectors of regression coefficients for X r (s) and X g (s), respectively, and W (s) is the error term reflecting a spatial process over the area of the study.We assume W (s) is normally distributed with mean zero.We further define, W (s) ∼ N (0, σ 2 ) as a spatial correlation function denoted by C(s, s ), which can be assumed isotropic and exponential: C(s, s ) = exp(−||s − s ||) where ||s − s || is the distance between s, s on the ground (in meters).
Available data can be used to estimate the regression coefficients to generate spatial predictions for sites at which we have no observations.However, even such a model may not be sufficient in terms of accurately capturing the key relationships, because the relation between soil moisture and the backscatter coefficients may also require the inclusion of additional interaction terms such as, The amount of available data (i.e., sampling size) typically constrains whether specific or all possible interactions can be added as additional regression terms.Here, the reliance on semi-empirical formulae for prediction is simpler and involves inputting RS variables and the ground-based variables to generate estimates of the dielectric constant (denoted as ε(s)) to track the relative influence of X r (s) on m(s) and to tune and adjust it for any interactions with X g (s).This assumes that ε(s) is positively correlated with m(s) [17].Employing the simplier empirical approach, framing as a statistical regression-based model, gives, There are many other candidate models that could be considered.The Dubois model was used because this is a simpler, semi-empirical model that has been widely applied (bare soil), well-researched and has well-defined validity bounds.Here, uncertainty due to sensitivity and the contribution of variance from interactions between surface roughness (h RM S ), correlation length change (CL) and soil type (ST ) are all included in this equation and can be tuned and adjusted under a prescribed set of assumptions for added flexibility.For example (see [25]) if we consider the log ratio dvh = log(σ vv /σ hh ), the influence of the soil roughness on the dielectric constant may be minimized, given by, No interactions are required between X 1 (s) and X g (s), yielding the modified model, Functional dependence between the variables, σ vv , σ hh and h RM S , in semi-empirical models, is established via the following term, dvh 2 is a derived variable representing a construct (i.e., mathematically defined ratio) of physically-based and physically-interpretable horizontal and vertical co-polarized backscattering and their relative signal contribution.One can integrate such a term into this generalizable statistical modeling approach, and consider models incorporating the following set of covariates, This results in the following multi-scale statistical model,

Predictors and Covariates
The effect of h RM S on the relationship between the backscatter coefficients and the dielectric constant of soil moisture is well known [14,17,25,26].The Dubois model is an example of an empirical model commonly applied when processing and interpreting SAR imagery [17].This empirical backscattering model was derived from L, C and X band scatterometer data, applicable for incidence angles varying from 30 • to 60 • .In the Duboi model, the HH and VV backscatter coefficients are given by, σ hh = 10 −2.75 cos 1.5 (θ) sin 5 (θ) 10 0.028εr tan(θ) (kh RM S sin(θ)) 1.4 λ 0.7 (10) σ vv = 10 −2.35 cos 3 (θ) sin 3 (θ) 10 0.046εr tan(θ) (kh RM S sin(θ)) 1.1 λ 0.7 (11) where σ vv and σ hh denote VV and HH backscatter coefficients respectively; k is the free space wave number given by k = 2π/λ where λ is the free space wavelength (cm).We can omit h RM S from the relationships with ε r by referencing Equations ( 10) and (11).Raising the second equation to the power of 1.27 = 1.4/1.1 and dividing the two equations, kh RM S sin(θ) is canceled to obtain: which is a correction to the equation given in [14] (page 4, Equation ( 13)).
Given that the dielectric constant and soil moisture are positively correlated, Merzouki et al., (2011) provide the following relationship, where they evaluated and inter-compared the Dubois and Oh empirical scattering models [14], Referring to Equation ( 12), dvh 2 = 10 log 10 ((σ vv ) 1.27 /σ hh ) should also be positively correlated with soil moisture.Sanoa et al., (1998) advise instead to use the co-polarized ratio, σ vv /σ hh , so as to minimize the interaction with surface roughness [25].Values of the dielectric constant (ε) that are obtained by solving for ε r in Equations ( 10) to ( 12) yield estimates termed ε r(hh) ,ε r(vv) , and ε r(hh,vv) (i.e., corresponding values of the dielectric constant for co-polarized and cross-polarized alignments).
Figure 2 depicts the steps of our predictor selection procedure, the last row comprising of a total of five possible predictor groups.Dubois et al., (1995) highlight the importance of validity regions for various semi-empirical formulas and that observational parameters must lie within these regions to ensure feasible/optimal values [17] .For example, for the standard Dubois formula, the conditions are that k • h RM S ≤ 2.5, θ ≥ 30 o , m ≤ 35% (recall k = 2π/λ).For the Casselman data-set, λ = 5.6 cm, and θ varied between 35 • and 37 • .Negative values of ε have no meaning.Yet, there is still no general mathematical or theoretical guarantee that ε r is positive when inverting using these formulas, even when the validity constraints or the so-called "Dubois conditions" are satisfied.

Model Structure
A suite of statistical models were constructed by combining different covariates and sources of information (RS data, ground data, spatial correlation) to obtain best-fitting soil moisture predictions at observed and unobserved locations.There are various ways the data can be built in a statistical model.Due to limitations on the data, one may not be able to predict using all possible predictors and interaction terms; nonetheless, such a situation might lead to over-fitting, whereby, a statistical model performs very well for a training data set, but poorly for an independent set of validation data.Under-fitting can also occur when a significant influence on soil moisture is ignored.We consider two classes of models, namely: (1) models with only remotely-sensed covariates; and (2) models with both remotely-sensed and ground-based covariates.We compare results from applying these two class of models to investigate the predictive power and reliability of the remotely-sensed variables alone in predicting the soil moisture, and to investigate the relative improvement, benefit or gain in measuring ground-based variables.
We consider RS covariates, σ vv , σ hh , θ and the interaction terms θ × σ vv , θ × σ hh , which we denote as σ vv θ, σ hh θ.We also consider two other possible covariate forms, defined by, which is based on the recommendation of Sanoa et al., (1998) [25] and, which we have derived in reference to Equation (12).Note that in this Equation, the dielectric constant is only a function of dvh 2 and the incidence angle, Now, referring to Equation ( 13), soil moisture is a function (i.e., h(ε) = √ ε − 1.6) of the dielectric constant only, whereby soil moisture can be expressed as a function of σ vv , σ hh , θ and h RM S , or, alternatively as a function g of dvh 2 and θ, m(σ vv , σ hh , θ, h RM S ) = g(dvh 2 , θ). ( Hereafter, we refer to the variables dvh and dvh 2 as "intermediate" variables.We consider models using the dielectric constants, ε r(hh) , ε r(vv) , ε r(hh,vv) , obtained in Equations ( 10) to (12).For covariate selection, we first use the data at Times 1-3 separately and then consider all the time points combined.We modelled at each of the three acquisition times individually to determine the best models under variation in the ground-based sampling data and SAR configuration (e.g., incident angle), and to obtain independent estimates of model performance or prediction power across this observation time window.In this way, we compute cross-validation model error to isolate the best-fitting or "optimal" models.For the response, we can consider either the raw values of the soil moisture m, as a proportion, or its logit (Z(m) = log(m/(1 − m)).We note that there is very little difference in results obtained from analyses of m versus Z(m) and results presented here are based on m.This procedure that was applied (refer to flow diagram shown in Figure 3) to inter-compare the predictive power of competing statistical models and to select the best-fitting model consisted of several decision steps.At the highest layer, we selected the best model (in terms of prediction error as explained below) for each of the five model families; in the second layer we choose the best model for each of the families in conjunction with ground data; in the third layer, we choose the best of all the models over the families of models; and finally in the last layer we add spatial correlation.
Spatial models without any predictors were also considered (last decision layer).Note that the overall best model may not incorporate some elements, across all layers considered, for example the spatial models may not improve over a non-spatial model, despite involving the same set of predictors of soil moisture.This can be considered as a particular example of over-fitting as spatial models involve more parameters as compared to the corresponding non-spatial models.For each model, we have listed the associated unique family or set of covariates (refer to Table 3).The "Raw" family includes raw remotely-sensed covariates.The "Intermediate" dvh (suggested by [25]) and dvh 2 families utilize transformations on the raw covariates, recall dvh 2 is created by manipulating the Dubois Formulas as described above.Dubois Single-polarized (Dub.Single) and Multi-polarized (Dub.Multi) families utilize the Dubois Formulas and incorporate the Dubois-derived dielectric constants as covariates.Figure 2 summarizes the procedure we performed to obtain the five different families or sets of predictors.

Model for Spatial Dependence
For fitting the spatial models we used maximum likelihood and Bayesian hierarchical methods [27,28].For the maximum likelihood method (fitted using the geoR package) the estimates of the spatial decay parameter (range parameter) were very unstable.This confirms the spatial decay parameters are weakly identifiable, as previously reported by Finely et al., (2008) [29].The Bayesian approach (implemented in R) for implementing the spatial version of our statistical model that we employed circumvented this problem by prescribing informative priors or distributions on the range parameter.

Model Performance Statistics
The cross-validation root-mean-square error (CVE) and cross-validated correlation (CVR) were selected to compare the performance of the different statistical model structures, comprising different predictors, covariates, and spatial correlation assumptions, and were computed as follows.CVR 2 is termed the predictive squared correlation coefficient or leave-one-out cross-validated R 2 and also denoted as Q 2 .A high CVR is a necessary but not a sufficient condition for a model to have a high predictive power (i.e., goodness of fit), because different CVR values may arise from training data sets with different sample size and spatial distributions.Thus, the CVR value should always be accompanied by descriptive statistics of the training data set used to compute it, such as CVE (also denoted RMSE) [30,32].We computed both of these measures.While a high value of this validation statistic (CVR 2 > 0.5) is typically considered sufficient for proof of the high predictive ability of the model from internal cross-validation (i.e., a LOOCV procedure), low values do not necessarily indicate a sufficient reason to question the validity of a model, but relate more to the size and distribution of training data used for prediction.Cross-validation with an external (i.e., independent) set of training data can further improve the reliability of a model.However, while the calculation of CVR 2 by LOOCV validation is based on a well-known and accepted formula, its derivation from an external training or evaluation data set is not trivial and varies with available sample size [30].For each site , a leave-one-out cross-validation (LOOCV) was performed that involved excluding the data/mean values for a given site, s i , and predicting the value at s i , on an iterative basis so that each of the sites have been excluded once.We denote the predicted values as m(s i ), and compute cross-validation statistics (CVE and CVR) according to, . . .
where corr denotes correlation.

Predictor Selection and Validity of Model Predictions
We have presented a set of competing statistical models having different covariates (i.e., the predictor variables).The simplest choice for a group of predictors is to use all the available raw RS variables and their interactions with each other.However, as the size of our data set is small (i.e., containing 44 total sampling points for three days) this choice may not necessarily be optimal due to potential over-fitting.In general, variable interactions may be non-linear and variable distributions, in different SAR soil moisture modeling applications could be directionally biased and/or highly skewed, possibly requiring different parameter and error distribution assumptions if tranformations applied do not approximate a normal or Gaussian statistical distribution (see Vereeken et al., (2014) for a detailed review of statistical features and dynamics of soil moisture patterns [23]).Figure 4 (top panels) show the frequency of the data points when the Dubois Conditions are satisfied (denoted by 1) versus when they are not satisfied (0), showing the Dubois validity conditions are not satisfied for a large proportion of measured values in the case of Time 1, but for the other Times 2 and 3, the conditions are satisified far more frequently.The bottom panels in Figure 4 depict the boxplot summaries of values of ε r obtained from Dubois formulas: ε r(hh) , ε r(vv) , ε r(hh,vv) .We find that even for the data points for which the Dubois Conditions hold, ε hh and ε vv are negative for many of the data points.Contrary to this, for ε r(hh,vv) all the values are positive, regardless of whether the Dubois Conditions are satisfied or not.The relationship between the estimated dielectric constants, ε r(hh) , ε r(vv) , and ε r(hh,vv) , in comparison to estimated soil moisture is shown in Figure 5 (refer to top panels) for Time 1 (light grey), Time 2 (grey) and Time 3 (black).For each time, the corresponding simple regression line (relating m and ε) is provided in the corresponding color.The dotted line shows the vertical line ε = 0.It is clear that at Time 1 and Time 3 there is clear association between any of the estimated dielectric constants and the soil moisture.However at Time 2, when soil moisture estimates are consistently smaller, the relationship is weak and in the wrong direction (i.e., decreasing not increasing).Figure 5 further reveals that various associations of soil moisture with ε r(hh) , ε r(vv) are stronger than ε r(hh,vv) .In the bottom panels, we have repeated the analysis with only the data points for which the Dubois Conditions are satisfied, and still the relationship is decreasing (negative), with no significant change and improvement to an increasing (positive) relationship.
Despite the limitations of using data from only one monitoring site and sampling data available only for three sampling days, a large change in the proportion of sampling data that satisfies validity conditions is evident.This highlights that caution must be taken when applying the Dubois or even other empirical-based formulae with fixed interval validity conditions.Instead of using or extending fixed validity assumptions and constraints imposed by empirical models, our statistical modeling approach offers the key important advantage that it is not constrained to any specific validity region, and avoids  Figure 6 shows scatterplots of the backscatter coefficients (σ hh , σ vv ) and the derived variables (dvh, dvh 2 ) with in-situ soil moisture (%), along with the regression line using the data from all three times.These results show a positive association between soil moisture and the predictors.Computed correlation (%) between each of these variables with both soil moisture (m) and surface roughness (i.e., root-mean-square height, h RM S ), for Times 1-3 and all the Times pooled are summarized in Table 4. Uncertainty in the correlation values are based on standard statistical bootstrapping method based on the 10th, 50th (median) and 90th quantiles and 1000 bootstrap samples.These results indicate that dvh and dvh 2 both have a positive and significant association (i.e., with respect to percent bootstrap confidence interval) with soil moisture at Times 1 and 3 and all three Times pooled, while σ vv and σ hh have higher uncertainty and their confidence intervals include zero.At Time 2 we do not observe significantly positive correlation between soil moisture and any of the predictors.When pooling our data across all three Times, the largest correlation is obtained between soil moisture and dvh 2 .Scatterplots of the backscatter coefficients (σ hh ,σ vv ) and the derived variables (dvh, dvh 2 ) with h RM S are shown in Figure 7, with correlation values summarized in Table 4.The variable h RM S is positively correlated with the two predictors, σ hh and σ vv at Times 1 and 3, with the 80% confidence interval indicating that such association is significant.For the derived variable, dvh 2 , a non-significant correlation is evident at Times 1 and 3, while for both dvh, and dvh2 at Time 2 there is significant negative correlation.A non-significant correlation of a variable with h RM S is desirable for the form of models which use dvh 2 as a predictor, but do not explicitly include h RM S .According to this criterion, dvh 2 is the most desirable predictor for models that do not include h RM S as a variable at Times 1 and 3.

Performance of Different Statistical Model Structures
Model validation/performance measures (i.e., cross-validation root-mean-square-error, CVE, and cross-validated correlation, CVR) for different statistical model structures (i.e., families) are summarized in Table 5. Different model families are identified according to the two groups we considered: (1) models that only include remotely-sensed covariates (remote only) and ( 2) models that include both remotely-sensed and ground-based variables (+ ground).For each model family, models with all the possible combinations of corresponding covariates are fitted and the best model is identified as having the smallest mean squared cross-validation error.At both Times 1 and 3, models involving dvh 2 are among the best models and adding ground covariates has turned out to be useful with CL appearing in the best models at both times.At Time 2 there are no satisfactory models and the best models only include the incidence angle θ (which clearly cannot have any prediction power on its own).Table 6 summarizes our results for the same covariate selection procedure, but now applied to the data pooled together across all the three times.In this case, we forced the categorical time covariate (Time 1, 2 or 3) to be a covariate in the model.This is because soil moisture varies across time and this prevents artificially selecting covariates that are confounded with time such as θ.In this case, models including dvh 2 are again among best models.However, adding the ground covariates did not improve the prediction of the soil moisture.The leave-one-out cross-validation (LOOCV) results are shown in comparison to observed data in Figure 8.For each model and data point, we take the data point out, fit the model and then predict the point which was taken out.In each panel, the LOOCV prediction is plotted against the observed value.The cross-validation correlation includes more than just the correlation between model predictions/fitted values and the full set of observations in evaluating prediction power or model performance, and a relatively high correlation indicates reasonable model performance in relation to observed inter-site variability.The results indicate that the fit at Time 2 is far from satisfactory for multi-site prediction.The top right panel shows that the predicted values at Time 2 fail to capture the increase of soil moisture on the x-axis.Also, in the bottom right plot, we note that the clustering of data along a line segment which sits below the rest of the data can be attributed to the inclusion of data from Time 2. The standard deviations (SD's) in observed soil moisture across all sites for Time 1, 2 and 3 are 5.1, 3.3, 4.1, respectively.Comparing these estimates with the best model CVE's (i.e., 4.2, 3.2, 2.8, respectively), indicates that a significant portion of the observed variation in soil moisture is explained by the models at Times 1 and 3, but less so at Time 2.

Influence of Spatial Correlation
The spatial correlation of soil moisture can potentially help us improve the predictions of soil moisture across space and is considered in this context in [10,11,14].We summarize statistical model predictions obtained from including spatial correlation in the various statistical models developed here.The top panels of Figure 9 depict the semivariogram (created by geoR package, [33]) for the raw soil moisture data, while the bottom panels depict the semivariogram for the residuals after fitting the best models at each time point.In the presence of strong spatial correlation, semivariance increases with the separation distance between the location of pairwise observation measurements.Both the raw data and the remaining noise confirm this increasing trend, but the signal for the spatial correlation is weak.A summary of results obtained from fitting the spatial models to data, both with and without predictors is provided in Table 7.The corresponding non-spatial fits are also included for comparison purposes.We considered isotropic spatial covariance functions of the exponential and Matérn form as discussed in [34].
Given that we have a small sample size and the weakness of the spatial influence detected in the current data-set, non-isotopic spatial functions were not considered.This does not, however, rule out the possibility that the spatial influence might be stronger given more data for the Casselman site, or for data on other agricultural monitoring sites.So, spatial influences can be prominent, even if weak in our current data set, and modeling needs to be flexible in detecting this changing influence across different sampling sites.Only small deviations were detected between specifying an exponential versus a Matérn covariance function (results not shown here), and therefore we only include results for the exponential covariance function that did reveal significant influences in model prediction.Our numerical results also reveal that the spatial models only improved the CVE in the case of Time 2, the same time when predictors also did not show any prediction power in non-spatial models.

Discussion
Our findings demonstrate how semi-empirical models and their assumptions may not be satisfied in a large proportion of data, and furthermore, even when the conditions are satisfied, the dielectric constant using single-polarization method, often can lead to negative i.e., nonsensical soil moisture predictions.Such negative values did not result when employing the multi-polarization method however.Single-polarization values, even when negative, generated predicted patterns of soil moisture having strong correlation with observations (Figure 5).Statistical models do not suffer from these validity constraints and performance statistics that they generate provide a more sound assessment of their reliability to be applied to other regions and application contexts, than deterministic models.Prediction error (root mean square error, RMSE) from previous work that has applied the Dubois multi-polarization method is estimated at 6.2% [14].With our statistical modeling approach, the best-performing model offers a significant improvement (i.e., a significant reduction of prediction error) within the range of 3%-4%.
Data in this modeling study was available at three time points (Time 1-Time 3) during the early weeks of the crop growing season in 2008; 5 May (Time 1), 16 May (Time 2) and 23 May (Time 3).We evaluated and compared a selected set of statistical models that do not include any ground-based covariates that are typically measured (soil type, h RM S , CL).The first three rows of Table 3, correspond to three model families which do not depend on ground variables.In particular models including dvh and dvh 2 are constructed so that the effect of h RM S (a ground variable) is included through other variables and direct values for this observation are not needed.We investigated whether the ground covariates can improve the predictions of these models by adding ground variables to each family.The predictions were improved for models in Time 1 and Time 3, but did not improve for Time 2. Also the best models in terms of prediction for the data combined across the three times included models with no ground predictors.For Time 1 and Time 3, models involving dvh 2 = 10 log 10 ((σ vv ) 1.27 /σ hh ) (Dubois), were among the best models; including ground covariates such as CL improved the prediction accuracy.However for Time 2, the prediction was not satisfactory in any of the non-spatial models.Two differences between Time 2 and Times 1, 3 are the smaller incidence angle and smaller soil moisture values and spatial variability.Rainfall, evapotranspiration would be expected to induce larger differences, so that we infer that the reason why spatial dependence was detected at Time 2 was likely due to sufficiently dry conditions that made it more difficult to discriminate soil moisture variability using SAR.As indicated by Merzouki et al., (2011) in conjunction with processing and analysis of the same SAR acquisition and Casselman ground-based data, a significant accumulation of precipitation preceded the first acquisition, followed by relatively little precipitation between this acquisition and the second acquisition of 16 May.In addition, warm day time temperature aided in the drying of the top soil prior to 16 May [14].A relatively high error in the field measurement of correlation length (CL) was likely the result of its sensitivity to profile length [35].As outlined by Merzouki et al., (2011), relatively short lengths (1 m) were used.A much longer profile length (i.e., >10 m) might have reduced the high nugget variance, but contrasting results are reported in the literature.Also, in obtaining the current data set, shorter length was used, in part, due to practical considerations and constraints of time, labour and cost [14].
Overfitting of statistical predictions can occur when a statistical model is fit to training data but provides poor prediction using an independent data set [36].The solution to this problem is not to include all possible covariates into the model and to detect as much variability and signal information in a given data set.This requires variable and model selection statistical techniques.Existing methods to handle and control overfitting can be organized into three categories [36]: (1) iterative selection methods (such as step-wise regression); (2) regularization methods such as Least Absolute Shrinkage and Selection Operator (Lasso), or, (3) statistical averaging methods (such as Bayesian model averaging) [37].In this paper, we utilized the first of these approaches, devising a grouped, stepwise method that conducts an iterative search of the predictor space corresponding to a group of selected leading predictors.This extends regular stepwise methods to the multivariate case [38][39][40].A widely used measure in validating soil moisture estimation algorithms in the literature is the Root Mean Squared Error (RMSE) [7,14].Despite its popularity, this measure does not deal with over-fitting problems and can lead to eronous conclusions.Instead, alternative validation measures have been developed, namely: Akaike Information Criterion (AIC) [41], the Bayesian Information Criterion (BIC) [42] and Deviance Information Criterion (DIC) [43].These are termed likelihood-based measures and assess overfitting, but cross-validated RMSE (CVE) and correlation (CVR) provide a measure of accuracy of predictions of a model.CVR showed more deviation and was more responsive than CVE.Possible multi-collinearity effects may need to be considered in our modeling arising from sampling points that are sufficiently close together in areas that show reduced soil moisture variability.Depending on the sites selection and their spacing arrangement, spatial correlation may be informative, because very close sites may have a stronger tendency to exhibit similar soil moisture variability.In contrast, sampling a very large area more sparsely may only capture some variations, but not all, within a full sampling extent.Higher deviations in the performance of different models would be expected for other sampling regions under different soil, climate, crop, and landscape variation.A small deviation in CVE and CVR can lead to large spatial uncertainty and error when propagated spatially and temporally (i.e., interpolation and extrapolation).Nonetheless, to capture observed daily, weekly, monthly variability in soil moisture more comprehensively, requires data across a larger time interval and number of acquisition dates.This would enable temporal components of soil moisture variability to be added to the statistical models and involved in the multivariate regressions.
Soil moisture variability at the our study site may, at certain times, be very spatially homogeneous, such that a more heterogeneous region (e.g., in terms of surface roughness, soil variation etc.) would be best for training and validating a statistical modeling approach.Li and Rodell (2013) have recently highlighted how soil moisture is often sampled over a short time period and this results in the observed soil moisture often exhibiting smaller dynamic ranges that prevents unravelling soil moisture spatial variability as a function of mean soil moisture [44].They also provide evidence of power-law scaling in soil moisture variability driven by climate variables such as rainfall.They log-transform soil moisture values, and this might further help to improve the detection of soil moisture variability within our statistical modeling, especially at times when soil moisture variability is reduced.Our analysis identifies that one of these differences may be the reason for the poor prediction power.At Time 2 spatial correlation improved prediction accuracy (i.e., reduced model prediction error), while at Time 1 and Time 3 with weak spatial correlation, including spatial correlation did not improve prediction accuracy.At Time 2, the covariates did not show any prediction power, while the spatial model offered minor improvement and captured a greater portion of the observed variability in soil moisture.As the CVR statistic is sensitive to the sample size of our training set and its spatial distribution, higher predictive power (i.e., higher CVR) could be achieved with training data that has a higher variability in soil moisture than our training data set (e.g., at Time 3 where the value of CVR 2 was 0.35 < 0.50).The standard deviations (SD's) in observed soil moisture across all sites for Times 1, 2 and 3 are 5.1%, 3.3%, 4.1%, respectively.Such low spatial variability of soil moisture makes training statistical models, assessing and interpreting their predictive performance more challenging.Comparing such observed variability with the best-model CVE estimates (i.e., 4.2%, 3.2%, 2.8% for Times 1, 2 and 3, respectively), indicates that the best models at Time 1 and 3 explain a portion of the observed spatial variability of the soil moisture, despite low observed variation in the training data set.The lower predictive performance at Time 2 may be due, in part, to the very low observed SD (i.e., 3.2%), as well as, the small incidence angle.As our results show, the pooling of data/acquisitions across times of high variability may also be necessary to sufficiently increase model predictive power.
Our results show that when integrating ground-based soil moisture data as auxiliary data with SAR remote-sensing data for model prediction (i.e., not just estimation) to achieve high predictive power from statistical models, requires a sufficiently large set of training data and spatially heterogeneous regional variability.Our findings support those of Van der Heijden et al., (2007) who have also previously determined that in remote-sensing across agricultural land, the predictive performance of statistical models is under-estimated with the CVR statistic, given its high sensitivity to the degree of spatial heterogeneity and size of the training data set used in LOOCV cross-validation [45].SAR analyses and modeling studies vary substantially in terms of the quantity and quality of data they rely on -some reported studies utilize data collected during 2-3 months during a growing season while others have monitoring a region for up to 6 years.The number and interval of SAR acquisitions also substantially varies (e.g., 2-11 images), including the number and distribution of sampling sites (e.g., 5-50), often with very limited within-site sampling to enable a reliable determination of intra-site variance.Many SAR analysis and modeling studies have relied on coefficient of determination (i.e., R 2 ) statistics, and in some cases, considering R 2 = 0.30 (i.e., instead of 0.50 or larger values) as the threshold criteria for accepting a given model for reliable estimation and/or prediction.While there is currently no broad consensus on the acceptable threshold for soil moisture prediction, as our findings show, by relying on additional cross-validated statistics, the reliability of a model can be better gauged in terms of its ability to attain prediction-based targets, thresholds and criteria.The inter-comparison of a broader set of such statistics could also help to limit additional bias introduced in the under and over-estimation of soil moisture extremes in SAR analyses, especially when predictions rely on sampling distributions, rather than more complete statistical distribution/moments information.Data availability, costs and coverage are often an area of trade-off that challenges many SAR analyses and modeling studies, so "stress-testing" models and their predictive power, as in this study, under situations of high data sparsity and high variability provides a realistic, operational situation that many practitioners and scientists confront.

Conclusions
In this study, we demonstrated a statistical modeling approach for improving the robustness of soil moisture predictions.We quantifed and inter-compared the predictive power of different models and variables for predicting soil moisture.This approach offers a way to consider a broad set of spatio-temporal assumptions required to identify, select, and validate alternative, competing models, predictors, covariates and spatial correlation assumptions.The approach also does not impose any rigid a priori validity bounds on its inputs, nor overriding fixed constraints in its output predictions, as is the case with many existing soil moisture retrieval methods.Leave-one-out cross-validation (LOOCV) is also integrated.We applied our approach to an agricultural region in Canada with available C-band, multi-polarization SAR data with multi-site ground-based data.Under non-ideal SAR monitoring conditions, employing both model-and predictor-based selection steps, we obtained a best-performing model with a significant reduction of prediction error to within 3%-4%.We found that ground-based data are useful for improving soil moisture prediction, but not in all situations, such as when climate conditions are highly variable, landscape is too homogeneous and/or spatial correlation of soil moisture is low.We further determined that the cross-validated statistic, CVR 2 , was more sensitive than CVE 2 .Our study was limited, however, by available data, namely; one study site, only three SAR acquisitions (i.e., images), and a limited range of surface roughness and soil moisture variability.In addition, high error in correlation length from the use of shorter profile length measurements was also a limitation in the data used to train the models.
The Dubois model was selected in our study as it has a mathematical closed-form solution that enables eliminating the surface roughness parameter (h RM S ) so that a closed-form equation could be derived for the reflectivity, and distinguishing two "model families"-one that includes h RM S as a predictor and another that does not.The use of the Dubois model also enabled highlighting numerical issues with using empirical-based equations having validity constraints when coupling them within a generalized (i.e., broader and integrated) statistical-based approach.Currently, a lower sensitivity and early saturation reported for the IEM model to soil moisture under wet conditions (i.e., extreme soil moisture) indicates that there are significant challenges faced by both simpler and more complex retrieval models in estimating and predicting soil moisture under wet conditions and at the regional-scale of variability [46].Our study utilized predictors that depend on/are linked with the Dubois equations, but also included predictors linked with a "Raw model" and "Sanoa model" branch that do not dependent on the Dubois equations.Each of these model families included many models that were compared with or without ground data and spatial correlation.By including the Oh, or the more complex IEM models into our approach, it may be possible to further reduce prediction error, and to expand its potential application and usefulness.
There are increasing demands for greater predictive power and reliability in model-based predictions.Such information can be used in commodity market forecasting and price adjustments, setting risk insurance coverage and premiums associated with extreme events (e.g., droughts, floods) affecting crops across large agricultural regions, or for geospatial intelligence and planning for early-warning disaster response.For this reason, there is a great need for a consistent methodology, which can be further adapted and tuned to integrate across data sets, models and assumptions, for generating cross-validated soil moisture predictions in an reliable and rapid (automated) way.In the future, statistical-based modeling of very large amounts of RS data on soil moisture will also be increasingly important for integrating data that is multi-scale (i.e., coarse and fine-scale) data and to increase predictive power across a wide range of monitoring conditions and constraints.To help advance soil moisture studies for model-based prediction, NASA's Soil Moisture Active Passive (SMAP) mission was just successfully launched on 31 January 2015 SMAP Mission.SMAP has on-board a synthetic aperture radar (active) instrument operating with multiple polarizations, not in C-band like RADARSAT-2, but in the L-band range (1.20-1.41GHz).It integrates active and passive sensors for coincident fine-scale SAR and coarser-scale measurements (9 km footprints) for producing global soil moisture maps every three days.As a way forward-the approach we have presented in this study, with further enhancement and improvement, could provide the consistent and reliable approach needed to integrate different models, predictors, covariates and spatio-temporal correlation assumptions using SMAP SAR data obtained under a wide range of climate, landscape, soil, crop conditions.In addition, linking our approach across additional agricultural regions with ground-based data remotely-streamed from wireless sensor network-based monitoring technology may provide an efficient and strategic way to obtain internal (i.e., training) and external validation data.Such technology provides semi-continuous soil moisture sampling with automated data processing that can help to further increase the usefulness and reliability of our statistical modeling approach in predicting soil moisture to aid in regional-scale decision-making [47].

1 .
Challenges in Modeling Soil Moisture Using Satellite, Remote-Sensing Data

Figure 2 .
Figure 2. Flow diagram of the variable (i.e., predictor) selection procedure.The last row of the diagram comprising of a total of five possible radar predictor groups to be used in the grouped step-wise algorithm.

Figure 3 .
Figure 3. Flow diagram of the statistical model-selection procedure.The best model is first chosen based on its minimal prediction error, then the best model that includes ground-based data is chosen.In the third selection step, the best model across all five families of possible combinations of predictors is identified.In the final selection step, the influence of spatial correlation is considered and the best performing model is identified.

Figure 4 .
Figure 4. Box-plot summaries of the distribution of ε r when the data satisfy the Dubois Conditions (denoted by 1) or not (denoted by 0).In the bottom panels, we observe that even when the Dubois conditions hold, a changing proportion of data yield negative values of the dielectric constant.
the need to independently discriminate and verify at what locations and at what times such conditions are met.

Figure 5 .
Figure 5.The relationship between ε r obtained from Dubois formulas and the near-surface soil moisture (m(%)).The left and middle panels correspond to the single polarization methods and the right panels correspond to the multi-polarization method.Associated regression lines are indicated: Time 1 (light grey); Time 2 (grey); Time 3 (black).The bottom panels correspond to the points for which the Dubois Conditions are satisfied.

Figure 8 .
Figure 8. Cross-validation predictions of the best-performing model predicted versus observed of soil moisture (m)(%) for Times 1,2, 3, and all times pooled.The y = x line is also shown, whereby better fits have values that lie closer to this line.

Table 3 .
Table shows the covariates corresponding to every model family.

Table 4 .
Correlation (%) between in-situ soil moisture (m)(%) and surface roughness (h RM S ) (cm) with the four leading model predictors (σ vv , σ hh , dvh, dvh 2 ), respectively.Uncertainty in these correlation estimates was estimated from standard statistical bootstrapping based on 1000 bootstrap samples.The 10th, 50th (median) and 90th quantiles are indicated, respectively, with the median values highlighted in bold.

Table 5 .
Model selection for soil moisture estimation on the ground at Time 1 (5 May), Time 2 (16 May) and Time 3(23 May).CVE in the table stands for the mean square cross-validation error and CVR stands for cross-validated correlation.The best model(s) for each time point is denoted by a star.