Statistical Applications to Downscale GRACE ‐ Derived Terrestrial Water Storage Data and to Fill Temporal Gaps

: The Gravity Recovery and Climate Experiment (GRACE) has been successfully used to monitor variations in terrestrial water storage (GRACE TWS ) and groundwater storage (GRACE GWS ) across the globe, yet such applications are hindered on local scales by the limited spatial resolution of GRACE data. Using the Lower Peninsula of Michigan as a test site, we developed optimum procedures to downscale GRACE Release ‐ 06 monthly mascon solutions. A four ‐ fold exercise was conducted. Cluster analysis was performed to identify the optimum number and distribution of clusters (areas) of contiguous pixels of similar geophysical signals (GRACE TWS time series); three clusters were identified (cluster 1: 13,700 km 2 ; cluster 2: 59,200 km 2 ; cluster 3: 33,100 km 2 ; Step I). Variables precipitation, normalized difference vegetation index (NDVI), Huron land


Introduction
The Gravity Recovery and Climate Experiment (GRACE) is a satellite mission that was jointly implemented by the National Aeronautics and Space Administration (NASA) in the United States and the Deutschen Zentrum für Luftund Raumfahrt (DLR) in Germany to map the temporal variations in the global gravity field [1,2]. The GRACE satellites were launched in March 2002, and the GRACE Follow-On (GRACE-FO) mission was launched in May 2018; since then, their applications have resulted in advances in hydrologic sciences (e.g., [3][4][5][6][7]) in the assessment and monitoring of spatial and temporal variations in groundwater storage (GWS) in many parts of the world, including Africa [3][4][5], the Middle East [6,7], China [8], India [9], California [10], and Mexico [11]. However, such applications are hampered by the relatively low horizontal resolution of GRACE data and the fact that GRACE does not have vertical resolution [12,13]. In other words, GRACE cannot determine in which compartment (e.g., surface water, groundwater, or soil moisture) the observed mass variations are occurring.
Many studies utilizing GRACE data for hydrological research and applications (e.g., [14,15]) target large aquifers and watersheds (areas of 450 × 10 3 to 6 × 10 6 km 2 ). However, the majority of the world's aquifers and watersheds are much smaller; even for the larger ones, one often needs to understand the partitioning of water on the sub-basin level. A finer resolution of GRACE solutions would be a useful tool for tracking the changes in GWS (GRACEGWS) on local scales, especially for regions that do not have sufficient in-situ monitoring sites.
Downscaling techniques allow predictions to be made at a finer spatial resolution than that of the original dataset [16]. Downscaling approaches, especially those developed for climate models and later applied to remotely acquired data, can be classified into two main groups: Dynamic downscaling and statistical downscaling [16]. The former approach has been successfully applied to downscale global climate models (GCMs) over regions of interest by integrating GCM outputs with the physical characteristics of Earth's surface in the area of interest. Monthly GRACE terrestrial water storage (GRACETWS) solutions from Center for Space Research (CSR), Jet Propulsion Laboratory (JPL), and Deutsches GeoForschungsZentrum (GFZ) (spatial resolution: 150,000 km 2 ) were assimilated into a land surface model to generate high-resolution water storage changes within the major watersheds of the Mississippi River [17]. GRACETWS derived from spherical harmonic (SH) solutions were assimilated into the Catchment land surface model to extract GRACE-based drought indicators (spatial resolution: 1° × 1.25°) for North America [18]. Gridded (25 km 2 ) Advanced Microwave Scanning Radiometer-Earth Observing System (AMSR-E) data were assimilated into a fine-scale (1 km 2 ) NOAH land surface model using three-dimensional and one-dimensional Kalman filters [19]. The JPL GRACETWS mascon solutions (3.0° × 3.0°) were assimilated into the fine-resolution (0.05° × 0.05°) hydrologic models [20,21]. The scale factor was used to minimize the leakage errors and to improve the spatial resolution of the JPL spherical harmonics solutions [22]. Such applications often require extensive computing time and resources that are not available for many researchers [23]. Also, many of the above-mentioned procedures depend on the selected hydrological model, some of which are lacking surface or groundwater components.
Statistical downscaling, on the other hand, does not require these resources. Statistical downscaling evaluates observed spatial and temporal relationships between inputs (independent variables) and outputs (dependent variables) using coarse-scale datasets (inputs and outputs) and applies the extracted relationships to produce the dependent variables at a spatial resolution matching that of the fine-scale inputs [24]. A variety of statistical methods have been applied to downscale remote sensing or ground-based data, including Markov chains and support vector machines [25], regression kriging [26], neural networks [27], and stochastic models [28]. Stepwise regression was successfully applied to downscale satellite-based precipitation data (TRMM3B43 products) and average daily precipitation and air temperature data from weather stations [29,30].
Artificial neural networks (ANNs) were used to downscale GCM outputs [31] and rainfall [32]. The major limitation of the statistical approaches comes from the assumption of stationarity between the coarse-and fine-scale dynamics and from the uncertainty and probability associated with this assumption [33].
Statistical approaches were successfully used to downscale GRACE data to a high-resolution (~16 km 2 ) dataset of groundwater storage changes over a portion of California's Central Valley using ANN techniques [27]. In this study, temporal GRACETWS and a series of widely available hydrologic variables were used as model inputs and target data were extracted from groundwater storage changes that were estimated from an extensive well network dataset (2189 wells). Similarly, variations in GWS were extracted and tested against temporal variations in groundwater levels in Texas, Nebraska, and Illinois [34] by using coarse-resolution (3° × 3°) GRACETWS JPL Release-05 monthly mass concentration (mascons; JPL RL-05M) and high-resolution hydrologic variables, and applying ANN techniques. Statistical downscaling was also used to successfully downscale GWS anomalies from 110 to 2 km in the North China Plain on both interannual and monthly scales in areas where a strong relationship between GWS and ET was detected and where the relationship was established under different spatial resolutions [35]. The successful applications in the first two examples (in California's Central Valley, Texas, Nebraska, and Illinois) rely heavily on the availability of temporal head data from dense networks of wells, and for the latter (North China Plain) on the presence of strong relationship between GWS and ET-conditions that are not necessarily available in many of the basins worldwide. In a fourth study, temporal GRACETWS solutions and land surface and hydro-climatic variables were used to predict groundwater level anomaly (GWLA). A network of 32 wells (21 wells for training and 11 wells for testing) was used to establish and test the relationship between GRACETWS and hydro-climatic variables as input and GWLA as the response variable using a downscaling algorithm based on machine learning (ML) [36]. In many of the applied statistical downscaling approaches, including the latter study, a dense network of wells is required to establish a relationship between hydrological variables and groundwater anomalies extracted from the well data. Those methods cannot be applied in many parts of the world with limited monitoring well sites.
In this study, we applied statistical techniques to extract relationships between coarse-resolution GRACE solutions (target data) and hydrologic variables (total precipitation, normalized difference vegetation index (NDVI), snow cover, streamflow, Lake Michigan level, Lake Huron level, land surface temperature, soil moisture, air temperature, and evapotranspiration (ET). These variables could potentially correlate with, or contribute to, the temporal variations of GRACETWS. We used those relationships and high-resolution hydrologic variables to generate high-resolution modeled monthly GRACE solutions. The Lower Peninsula (LP) of Michigan was used as a test site. We applied and compared the findings from three statistical methods-stepwise multivariate regression (MR) models, ANN, and extreme gradient boosting (XGBoost)-to downscale GRACE data and to fill the temporal gaps in the time series data over the LP throughout the investigated period (2002-2016).

Overview of the Study Area
The LP (Figure 1) depends heavily on its groundwater resources to support a population of 10 million citizens, its agricultural sector (35% of its area is agricultural land [37]), and its industry [38]. The area in general, and the local communities in particular, can benefit from reliable datasets that show spatial and temporal variations in GWS at the local scale. This is especially true for the southern counties of the LP, where intensive agricultural activities are established and groundwater withdrawal rates (30 to 90 million gallons/day [39]) are the highest in the state.
Groundwater availability differs from one place to another in the LP; it is plentiful in some regions (e.g., the southwest) and less so in other areas (e.g., the southeast) [40]. The major aquifers in Michigan are largely found in (1) glacial deposits, where yields are mostly from outwash and glaciofluvial deposits; and (2) sedimentary bedrock units, where yields come largely from the Mississippian and Pennsylvanian rocks [40]. The surface and groundwater in the area drain into the Great Lakes (Lake Michigan, Lake Huron, and Lake Erie) [41].  (1, 2, and 3), the Holland and Harbor Beach lake-level measuring stations, stream gauges, lakes Austin and Ostego, three arbitrary pixels (Point 1, Point 2, and Point 3) in clusters 1, 2, and 3, respectively where downscaled GRACETWS and GRACEGWS trends, time series, and uncertainties were estimated, and monitoring wells in Kalamazoo (well A: site name 02S 11W 22CDBB 01) and in Lansing (well B: site name 04N 02W 26BBDB 01 and well C: site name: 04N 02W 16DAAA 01). Inset shows the location of the study area in the USA.
Nine hydrologic provinces with varying sedimentary deposits and aquifer thicknesses ( Figure  2) were identified in the LP [40]. Province 1 has a relatively thin (1 to 60 m) glacial-lacustrine sand unit that overlies Silurian and Devonian limestone and dolomite. Aquifers in Silurian and Devonian rocks are the main source of groundwater in the area. Province 2 is largely covered by thick (> 300 m in some areas), coarse-grained, and sandy glacial deposits, whereas province 3 is characterized by variable thicknesses (15 to 75 m) low-yield lacustrine deposits that overlie the Mississippian bedrocks. Province 4 is located in the most southern section of the LP and is characterized by thick (30 to 180 m) coarse-grained glacial deposits that overlie the low-yield Mississippian rocks; it is the main source for groundwater in the area. Province 5 glacial deposits vary in thickness (7 to 150 m) and overlie the high-yield Pennsylvanian aquifers. In this province, the thickness of the glacial aquifer thins to the south and groundwater is largely extracted from the Pennsylvanian sandstone. Province 6 is characterized by thin to moderate glacial drift that overlies high-yield Mississippian bedrock aquifers. The latter is largely composed of the Marshall Sandstone, whereas the glacial deposits are absent in some areas but thicken (up to 120 m) in others. Province 7 is characterized by thin glacial deposits (< 10 m in most areas) that overlie moderate-yield Silurian and Devonian limestone and dolomite intercalated with sand and shale layers. Province 8 is characterized by low-yield, moderate to thick (15 to 120 m), and lacustrine clay deposits that overlie low-yield Devonian and Mississippian sandstone. Lastly, Province 9 consists of featureless lacustrine and low-yield glacial sand and clay deposits of variable thickness (7 to 90 m) that overlie Pennsylvanian and Mississippian sandstone aquifers [40]. In general, the vertical hydraulic conductivity of the glacial aquifer is high (9.64 × 10 −7 to 3.8 × 10 −5 m/day) in the southern sections compared to the northern sections (3.8 × 10 −5 to 0.45 m/day; [42]). Estimated yield in glacial deposits in gallons/minute (gpm) and hydrological provinces (1 through to 9) of the Lower Peninsula modified from [40] and [42].

Methodology
A four-fold exercise was conducted to accomplish the statistical downscaling of the GRACEderived terrestrial water storage (GRACETWS) (Figure 3). First, clustering analysis was conducted on GRACETWS data over the LP to identify clusters of pixels, where pixels within each cluster have similar GRACETWS values yet are statistically different from those of neighboring clusters (Step I). Variables that correlate with and/or control GRACETWS were then identified (Step II). In Step III, for each cluster, relationships between coarse-scale inputs (independent variables) and outputs (GRACETWS; dependent variables or target) were extracted using three statistical approaches-MR, ANN, and XGBoost-to produce the dependent variables at a spatial resolution matching that of the fine-scale inputs (0.125° × 0.125° or 120 km 2 ). In doing so, we assumed that the extracted statistical relationship between the input variables and the target applies at the finer scales as well. In this step, the adopted statistical models were compared and evaluated to select the optimum methodology. Finally, GWS variations were extracted using the outputs of the applied land surface model and outputs of the optimum downscaling method (downscaled GRACETWS) (Step IV). Initially, we selected 11 variables that could correlate with, or contribute to, the temporal variations of GRACETWS. All input variables listed below (Section 3.2) are available at a spatial resolution ranging from 0.05° × 0.05° to 0.125° × 0.125°, and the output target values for each of the identified clusters were calculated from the gridded values within each of the clusters. Input variables were resampled to the size of their corresponding clusters for Step I and resampled to 0.125° × 0.125° (120 km 2 ) for Step III. We adopted three statistical approaches to model the relationships between the 11 variables and GRACETWS, namely, the MR, ANN, and XGboost approaches for each cluster. For each of the three approaches, the data were randomly partitioned into two subsets, training and testing. The training subset comprised 80% of the data points (percentage of the months in the time series data) and was used to construct the model, whereas the remaining 20% were used to evaluate the performance of the model. This approach was applied to each of the identified clusters.

Cluster Analysis
Working with small areas (individual pixels) introduces significant leakages from neighboring pixels. We adopted cluster analysis to identify larger areas (contiguous pixels) with similar geophysical signatures (GRACETWS time series) and hence reduce leakage errors. Clustering is the partitioning of the set of objects into groups in such a way that the objects within a group are more similar to each other than those in other groups. We applied K-means, one of the most popular methods for clustering analysis, to the monthly GRACE Release-06 (CSR RL06) solutions over the study area. In this method (K-means), the dataset is partitioned into K clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster [43]. The optimum number of clusters was estimated, and the area was partitioned into the identified number of clusters. The monthly values of a variable for each cluster represent the average values of all pixels within that cluster in the investigated month.
The optimal number of clusters was determined using the gap statistic implementation [44], where three steps are undertaken: Step 1: Estimate the gap statistic using Equation (1): in which k is the number of clusters, B is the number of reference datasets generated using a uniform prescription, Wkb is the within-dispersion measures, and Wk is the within-cluster dispersion.
Step 2: Compute the standard deviation: where: Step 3: Define 1 and choose the number of clusters via: = smallest k such that Gap k Gap k 1 s .

Identification of Variables that Correlate with and/or Control GRACETWS
In this section, we briefly describe the variables and target data for the MR, ANN, and XGboost models. We selected the input variables (total precipitation, normalized difference vegetation index (NDVI), snow cover, streamflow, Lake Michigan level, Lake Huron level, land surface temperature, soil moisture, air temperature, and evapotranspiration (ET)) based on their probable correlation and/or contribution to the target (GRACETWS). The spatial resolution, format, and sources of these datasets are given in Table 1. A number of these variables (e.g., soil moisture, NDVI, and evapotranspiration) may be related to, or affected by, the soil characteristics and the underlying glacial aquifer parameters. were applied in this study and reported relative to a 2004-2009 mean baseline. The first is the GRACE CSR-RL06M solutions provided by the University of Texas Center for Space Research (UT-CSR); they were derived using Tikhonov regularization [45], and resolved on a geodesic grid (grid size: 12,000 km 2 ) [45,46]. The second is the mascon solutions from the Jet Propulsion Laboratory (JPL-RL06M) [47] and the third is the mascon solution from NASA Goddard Space Flight Center [48].
The CSR-RL06M solutions were selected as the initial mascon solutions for extracting trends and time series over the investigated areas. The uncertainty associated with the calculated trend values were calculated from the differences in trend values extracted from the three solutions (CSR-RL06M, JPL-RL06M, and GSFC-M) [49,50] (Table 2). No post-processing and/or filtering or application of empirical scaling factors were applied [45,47,51]. SH solutions of GRACE data have been successfully applied in many studies to monitor variations in TWS on large scales [3,52]; however, its application on local scales was hindered by its coarse spatial resolution (> 125,000 km 2 ), leakage problems from adjacent pixels, and the required complex post-processing steps [53]. Compared to SH, the GRACE RL06 mascon solutions have a higher signal-to-noise ratio, higher spatial resolution, and reduced leakage from neighboring mascons that are in separate constraint regions [53,54]. The extracted trends and associated uncertainties for each of the investigated clusters are given in Table 2.

NDVI
We used NDVI products derived from the Moderate-resolution Imaging Spectroradiometer (MODIS) as one of the variables. NDVI uses the red and near-infrared bands, which are sensitive to healthy vegetation. The data consists of global monthly NDVI values reported at a 0.05° × 0.05° spatial resolution downloaded from Land Processes Distributed Active Archive Center's website [55]. The uncertainties associated with MODIS NDVI products were estimated by comparing the NDVI products with that extracted from the Advanced Very-High-Resolution Radiometer (AVHRR) and from the Visible Infrared Imaging Radiometer Suite (VIRS) [56]. The reported statistical coefficients between the NDVI products of MODIS and each of the AVHRR (R-square 0.99) and VIRS (R-square 0.99) indicate high consistency of the NDVI values extracted from different sensors [56].

Snow Cover (SC)
The monthly average snow cover (spatial resolution: 0.05° × 0.05°) values were computed from daily snow cover observations extracted from the MODIS/Terra Snow Cover Daily L3 Global 0.05Deg Climate Modeling Grid dataset. The normalized difference snow index-an index that is sensitive to the high reflectance over snow-covered lands in the visible wavelength region and low reflectance in the shortwave infrared regions-was used to identify snow-covered land. The monthly averages are calculated from the corresponding 28 to 31 days of observations in the daily maximum snow cover extent data. The MODIS monthly snow cover data were downloaded from NSIDC's website [57].

Stream Flow (SF)
The stream flow data were obtained from the United States Geological Survey (USGS). A good correspondence between GRACETWS and streamflow was noted in previous studies [58][59][60]. We used monthly mean values, which are average monthly values of average daily streamflow for each of the nine gauge stations. In the selection of the gauge stations, preference was made to the gauges on the main river, as observations from such locations are more likely to represent the overall fluctuations of runoff within the investigated cluster. The selected gauge stations are located in the following rivers: Clinton River in Sterling, Sable River near Red Oak, Pine River near Midland, Grand River in Lansing, St. Joseph River in Niles, Boardman River near Mayfield, Muskegon River near Croton, Kalamazoo River near Battle Creek, and St. Joseph River in Elkhart (Figure 1). The monthly streamflow data were downloaded from the USGS National Water Information System's web interface [61].

Lake Levels (LL)
Average monthly water level data for Lake Michigan and Lake Huron were obtained from the NOAAʹs Center for Operational Oceanographic Products and Services. The Lake Michigan water levels were obtained from the Holland station and the Lake Huron levels from the Harbor Beach station. The former station is located along Lake Michigan's eastern shoreline and the latter along Lake Huron's southwestern shoreline (Figure 1). At both stations, water levels are measured every six minutes. The average monthly water level data were downloaded from NOAA's website [62].

Land Surface Temperature (LST)
The monthly MODIS/Terra land surface temperature (LST) is derived by averaging the daily values of MOD11C3 products. These products have been validated through a series of field studies by the MODIS land team [63]. The spatial resolution of the monthly LST is 0.05° × 0.05° and the data were obtained from the NASAʹs Earth Science Data Systems available at [63]. The performance of the MODIS LST product was evaluated in several locations in the USA; results showed a good correspondence (absolute biases < 0.8 °C and RMSEs < than 1.7 °C) between the in-situ measurements and MODIS LST products [64].
3.2.7. Rainfall, Snow Water Equivalent, Soil Moisture, Air Temperature, and Evapotranspiration The NASA's North American Land Data Assimilation System (NLDAS) NOAH model outputs (total monthly rainfall, snow water equivalent (SWE), soil moisture (SMS), air temperature (AT), and evapotranspiration (ET); spatial resolution: 0.125° × 0.125°) were used as inputs to our models. These data are produced from daily ground-based precipitation analysis, bias-corrected shortwave radiation, and surface meteorology re-analyses to drive land surface models [65]. The average monthly data used in this research were downloaded from NASA Goddard Earth Sciences Data and Information Services Center's website at [66].
The monthly total precipitation was computed from the sum of the monthly SWE and rainfall. The outputs of NLDAS products have been validated and enhanced through several research works (eg., [67][68][69]). The accuracy of the NLDAS products varies from one product to another and from one location to another. For instance, the uncertainty for ET products was reported to be 48-mm per month over the contiguous United States [67]. Comparing the NLDAS products with in-situ measurements showed that the performance of the soil moisture product over the contiguous United States was very high (RMSE = 0.02 to 0.11) [70]. The uncertainty of the NLDAS snow water equivalent is less than 20% based on comparisons with IMS (MultiSensor Snow and Ice Mapping System) observations [71]. The uncertainties in air temperature over Michigan were estimated by comparing the monthly mean values between NLDAS and the National Centers for Environmental Prediction (NCEP) products. The reported difference was < 0.4 °C [72]. The NLDAS precipitation data were compared with five other available datasets over the western United States and the mean relative difference between them ranged from 11% to 18% [73].

Construction, Evaluation, and Selection of an Optimum Model for Downscaling
For each cluster, all datasets (variables and targets) were randomly partitioned into two groups: Training (80% of the time series for each cluster) and testing (20% of the time series for each cluster). We constructed and applied three statistical models (MR, ANN, and XGBoost) to establish the relationships between the variables (predictors) and GRACE (the target) for each of the investigated clusters. The performance of each model was compared and evaluated. The evaluation of the models was carried out by comparison between the observed values (testing subset) and predicted values using the coefficient of determination (R-squared), the normalized root-mean-square error (NRMSE), and the Nash-Sutcliffe model efficiency coefficient (NSE) ( Table 3) [74]. The R-squared values range from 0 to 1; those for the NRMSE and NSE indices range from 0 to 1. The predictive power of the models increases with increasing R-squared and NSE values and with decreasing NRMSE values. The rate of the performance for each approach was based on classifications adopted by [75]. Following the identification of the statistical model that yielded the highest performance, we used the selected model to downscale the GRACE solutions for each of the clusters to 0.125° × 0.125° throughout the investigated period.

MR Models
The MR, or multiple linear regression method, derives patterns in the data and establishes the best fitting multivariate linear relationships between two or more dependent variables and the target (GRACETWS). As described earlier, we applied a stepwise MR method in which the selection of variables is carried out by addition to, or subtraction from, a set of dependent variables using some pre-specified coefficients, such as the F-test, the t-test, and the coefficient of partial determination. In an MR model, every value of the independent variable, X, is associated with a value of the target variable, Y. The regression line for n independent variables X1, X2,…, Xn can be explained as follows: where Y is the predicted value of the target variable, B0 is the value of Y when all of the independent variables are equal to zero, X1 through to Xn are independent variables, and B1 through to Bn are the estimated regression coefficients. In multivariate linear regression, the response variable, Y (GRACETWS in our work), is assumed to be linearly related to a set of n explanatory independent variables, X1, X2,…, Xn, and the independent variables are not highly correlated with each other.
Observations are selected independently and randomly from the population. Also, residuals are assumed to be normally distributed with a mean value of 0.
The parameters are trained in such a way to achieve the highest similarity between the modeled and observed values in the training data set. One optimization model is employed to minimize the sum of the squares of the vertical deviations from each data point to the regression equation. The ideal case is a model in which a data point lies completely on the fitted line (i.e., vertical deviation = zero).
We applied a stepwise multivariate regression approach. Stepwise regression fits the multivariate regression several times, each time removing the least correlated variable until the statistically significant variables are left. For a full description of variable selection in the stepwise method, see [76]. MR models were first developed for each of the identified clusters to establish linear relationships between coarse-resolution inputs (variables) and target (GRACETWS) values for each of the identified clusters.
All input variables were available in both coarse and fine resolutions, whereas GRACETWS values were available in coarse resolution only. The variables were resampled to the size of each cluster using bilinear resampling techniques.

ANN
The ANN method establishes empirical, possibly non-linear, relationships between a set of "input" variables and corresponding "target" variables. An ANN is based on a series of connected units or neurons, which are intended to replicate the functions of neurons in animal or human brains; they pass information between one another, a structure that enables ANNs to be trained and learn. The ANN method used in this study is known as multilayer perceptron (MLP). An MLP consists of units called perceptrons. Perceptrons have one or more inputs, an activation function, and an output. An MLP model is built up by combining perceptrons in structured layers. The perceptrons in a given layer are independent of each other, but each connects to all of the perceptrons in the following layer. Each layer is composed of a set of neurons and is trained with a back-propagation algorithm.
Backpropagation is one of the most extensively used algorithms for supervised training of multilayered neural networks [77][78][79]. Backpropagation works by approximating the non-linear relationship between the input and the target by altering the weight values internally. The processes of the backpropagation can be divided into two stages: Feedforward and backpropagation. In the feedforward step, a pattern is applied to the input layer, and its effect propagates, layer by layer, through the network until the output is generated. The networkʹs sample output value is then compared to the anticipated value for a given input, and an error signal is estimated for each of the output neurons. Since all neurons within the hidden layer contributed to the signal errors in the output layer, the output errors are transmitted backward from the output layer to each neuron within the hidden layer that contributed to the output layer. This process is then reiterated, layer by layer, until each neuron in the network has received an error signal that represents its relative contribution to the total error. Once the error signal for each neuron has been computed, the errors are then applied by the neuron to adjust the values for each connection weight. The goal is to minimize the value of the error function in weight space. The weights with minimum error functions are then considered to be a solution to the learning problem. In an ANN, the hyperparameter is a parameter whose value is set before the learning process begins, and it controls the model structure (eg., number of layers, number of hidden neurons, number of epochs). Additional information about the theory behind ANN applications can be found in [80].
In our study, we applied a trial and error technique to determine the optimal number of hyperparameters, where the numbers were added gradually until the predicted and observed values start to match by evaluating the model performance using the mean squared normalized error (MSE) performance function. In our study, individual ANNs were constructed for each cluster. In all three clusters, the ANNs consist of one input layer, one hidden layer, and one output layer. The number of hidden neurons in our study ranged from 12 to 18. The number of epochs is the number of times the entire training data are used to update the weights. In other words, it is the number of times that the backpropagation algorithm works through the entire training dataset. The number of epochs ranged from 350 to 500.
The final model evaluation was carried out by the comparison between observed and predicted values on testing data (out of sample data set) using the above-mentioned statistical coefficients (NRMSE and NSE).

XGBoost
Gradient boosting was used with decision trees. Decision tree learning is a predictive modeling approach in machine learning that uses a tree-like model to go from observations of predictors (branches of the tree) to the prediction of the target value (leaves of the tree). The goal of our study was to create a model to predict the GRACETWS values from sets of input variables for each month. Using trees has several advantages including, but not limited to, the ability to handle various types of target variables (e.g., numerical, categorical, and multivariate), modeling complex interactions, and managing missing values with minimal loss of information [81]. However, there are two main limitations with trees: Weakness of the prediction and difficulty in the interpretation of large trees [81]. To overcome these limitations, the gradient boosting algorithm was introduced by [82] and developed by many others (e.g., [83,84]).
In gradient boosting, the goal is to use a set of predictors (X1,…, Xn) to predict a set of target data (Y1,…,Yn) by fitting a model → and minimizing the sum of the loss function ∑ , by improving the model F(X) (in our work, the loss function, , ). Then, the following iteration is performed: 1. Calculate the negative gradients of J with respect to F(Xi), which is .

Let our new F(Xi) be F(Xi)+ ℎ,
where is the step size in our algorithm to reach the estimated minimum of .
As a significant improvement over gradient boosting, in XGBoost, we start with a loss function , ℎ and minimize ∑ , ℎ Ω ℎ , where Ω ℎ || || . Here, is the number of leaves in the tree and is the leaf weights. Figure 4 shows a schematic diagram of the gradient boosting method.

Selection of Optimum Statistical Model and Gap Filling
The performance of each of the three models was evaluated by comparing the predicted values with the observed values on the testing subset using R-squared, normalized root mean squared error (NRMSE) (Equations (3) and (4)), and Nash-Sutcliffe efficiency (NSE) (Equation (5)) as follows: , where is the observed value, Ŷ is the predicted value, n is the number of observations, and is the mean of the observed data.
The model that produced the highest R-squared and NSE and the lowest NRMSE value was selected. Using the optimum model, the relationships were established between input variables (total precipitation, NDVI, snow cover, streamflow, Lake Michigan level, Lake Huron level, soil moisture, air temperature, LST, and ET) and GRACETWS as the target variable. These relationships were used to estimate the missing GRACETWS months.

Extraction of Temporal Groundwater Storage Using Outputs of Land Surface Models
We used the downscaled GRACE data to extract for each of the 0.125° × 0.125° pixels the changes in GRACETWS (∆GRACETWS) and the secular trend for each of these pixels ( Figure 5). Then, changes in groundwater storage (∆GWS) were calculated using the downscaled ∆GRACETWS and outputs of land surface models (NLDAS NOAH) and applying the following equation (Equation (6)): where ∆SMSNLDAS, ∆CWSNLDAS, and ∆SWENLDAS are the changes in soil moisture, canopy water storage, and snow water equivalent, respectively, as extracted from the NLDAS model. All these data are provided in a spatial resolution of 0.125° × 0.125° (~120 km 2 ).

Sources and Propagation of Errors
The uncertainties in the GRACETWS trends reflect the variations between trend values that were extracted from three GRACE solutions (CSR-RL06M, JPL-RL06M, and GSFC-M) ( Table 2; [49,50]). The uncertainties in the downscaled GRACETWS are related to: (1) Uncertainties in the variables (remote sensing-based and land surface model-based) that were used as inputs to the statistical models, and (2) errors introduced by the applied statistical models. The statistical coefficients (Rsquare, NSE, and NRMSE) describe the accuracy of the extracted model, namely the degree to which the extracted statistical model can or cannot predict the target (GRACETWS in our case). Statistical models that have R-square, NSE, and NRMSE values of 0.9, 0.9, and 0.1, respectively, can predict the target with an accuracy of 90%. Since the reported accuracy of the models was estimated by comparing the modeled and observed GRACETWS values in the test dataset, the reported model errors incorporate the errors associated with the individual variables as well. In this respect, the coefficients could be used to estimate the errors introduced by both the variables and the statistical models [36]. The combined errors were estimated by averaging the three coefficients and are presented in Table 4 as the percent uncertainty of the output (GRACETWS). Table 4. The coefficient of determination (R-squared), the normalized root-mean-square error (NRMSE), and the Nash-Sutcliffe model efficiency coefficient (NSE) for each of the examined models (extreme gradient boosting, multivariate regression, and artificial neural network) over clusters 1, 2, and 3 and calculated uncertainties. We assumed that the model-based accuracy of GRACETWS in area A applies to all downscaled pixels within this area. Similarly, the GRACETWS of the downscaled pixels within areas B and C will inherit the estimated accuracy for areas B and C, respectively. These are reasonable assumptions given that all of the CSR06M pixels within each of areas A, B, and C have similar geophysical signals.

Cluster
The errors (uncertainties) associated with the estimated downscaled GRACEGWS values were propagated from the estimated errors in the GRACETWS, and in the land surface model outputs (SMSNLDAS, CWSNLDAS, and SWENLDAS) which were used in calculating changes in groundwater storage (Equation (7)). The errors in each of these land surface model outputs were calculated as the standard deviation of the values extracted from the three NLDAS simulations (NOAH, VIC, and mosaic [85,86]. The errors in the estimated GWS (σGWS) were calculated by adding, in quadrature, the uncertainties related to GRACETWS, SMSNLDAS, CWSNLDAS, and SWENLDAS values (Equation (7)). The estimated total error rate for fine-scale GWS in three arbitrary pixels (location given in Figure 1) is presented in Figure 6: (7) Figure 6. The estimated total error rate for three arbitrary pixels (Point 1, Point 2, and Point 3). The location of the pixels is shown in Figure 1.

Cluster Analysis
The optimum number of clusters was estimated at 3. Three clusters were identified (cluster 1 area: 13,700 km 2 , cluster 2 area: 59,200 km 2 , cluster 3 area: 33,100 km 2 ; Figure1). The correlation coefficients of the GRACE time series between clusters were evaluated through the construction of a correlation matrix ( Table 5). The correlation coefficients of the GRACE time series between clusters varies from 0.41 to 0.66, and those between clusters and the Michigan lake level varies from 0.43 to 0.74. In the generation of the correlation matrix, the secular trends and seasonal cycles were removed from the time series. Although we cannot rule out leakage from the adjacent water bodies and areas, we suggest that there are significant geophysical signals in each of the investigated areas, as evidenced by the following observations. First, higher correlation coefficients than those observed (0.41-0.66) are to be expected between GRACETWS over areas 1, 2, and 3 if the leakage was significant. Second, lake levels lag behind GRACETWS by 1 to 2 months for areas 1, 2, and 3 ( Figure 7).

Evaluation and Comparison of the Models
Comparison of the performance of the three models revealed that, in general, the XGBoost models perform better than the other two models (Table 4)  The performance of the XGBoost is high (clusters 1, 2, and 3: Very good), compared to that for MR models (clusters 2 and 3: Very good; cluster 1: Good) and for the unified ANN model (clusters 2 and 3: Very good; cluster 1: Unsatisfactory). One plausible explanation for the enhanced performance of the XGBoost models over the ANN and MR models is that it can better account for the specific characteristics or significant variables that control, or relate to, the observed temporal GRACETWS solutions in each cluster. It is flexible and performs well with categorical and numerical values [87], as is the case with our datasets (Figure 8).

Factors Controlling the TWS and GWS Variations over the Study Area
Seven out of 11 variables showed statistical significance with the GRACETW values in the XGboost models (Table 6). They were used for the downscaling process based on cluster-based XGBoost models. Those variables are ET, air temperature, NDVI, total precipitation, soil moisture, Lake Michigan water level, and streamflow. The significance of the variables is determined by their p-values. The insignificant variables were omitted due to their high p-value (probability value). The p-value represents the probability of the occurrence of a given event and helps determine the significance of the results. The higher p-values (typically > 0.05) indicate weak evidence against the null hypothesis (i.e., there is no significant relationship between the independent variable and the target, and therefore the variable is insignificant [88]). The smaller p-values (typically ≤0.05) indicate the opposite: There is strong evidence in favor of the alternative hypothesis, and there is a significant relationship between the independent variable and the target.  Multicollinearity, a condition in which two or more predictors are highly correlated with one another in linear regression models, was addressed using the variance inflation factor (VIF). Multicollinearity makes it difficult to determine the effect of the individual predictors on the response and to identify the variables to be included in the model. VIF is one of the most widely used diagnostic indices for multicollinearity [89]. It estimates how much the variance of a coefficient is "inflated" because of linear dependences with other predictors [89]. Using a VIF value of 11 in this study, one of two variables (Lake Michigan and Lake Huron water levels) that show multicollinearity was omitted. Lake Huron lake level was automatically removed from the set of individual variables in the stepwise procedure and was not used in our models. Five lag times (months 1 through 5) were assigned to each of the investigated variables to identify the optimum lag time for the individual variables. Four of the examined variables (total precipitation, temperature, ET, and NDVI) were found to have optimum lag times ranging from 1 to 3 months; none exceeded 3 months, and the remaining variables had no lag times ( Table 6). The optimal lag time was found to vary from one cluster to another; for example, the lag in total precipitation varied from 1 month in cluster 1 to 3 months in cluster 2. Again, less significant lag times for an individual variable were automatically removed throughout the application of the stepwise regression.
The significant variables are the ones that correlate well with, respond fairly quickly to, and either drive or are driven by the variations in GRACETWS. An increase in soil moisture and stream flow over a cluster will increase its GRACETWS values, whereas an increase in land surface temperature or ET will probably decrease its TWS values. Interestingly, lake levels correlated well with GRACETWS, which is to be expected given that both the land (clusters 1-3) and surrounding water bodies (Lakes Michigan and Huron) receive added water contributions (precipitation and SWE), which will increase the water levels in the lakes and increase the GRACETWS over the land. However, the lake water levels lagged behind GRACETWS by 1 to 2 months. We suspect that this lag time is related to the time period it takes for runoff to reach the lake. Starting in 2013, there has been an increase in the water level in both Lake Huron and Lake Michigan. A thorough investigation revealed that the recent rise in the water level in Lake Michigan-Huron is due to above-average spring runoff, which drains into the lakes, and excess precipitation over the lake as well [90].

Discussion
The original size of the pixels over the LP (irregular grid, pixel size ~12,000 km 2 ) is coarse for monitoring TWS and GWS on the county scales (size range: 500 (e.g., St. Joseph County) to 850 km 2 (e.g., Kent County)). The adopted downscaling technique addresses this issue through the generation of downscaled GRACETWS and GRACEGWS (spatial resolution: 0.125° × 0.125°; 10 × 14 km = 140 km 2 ).
Inspection of the secular trends in GRACETWS and GRACEGWS revealed two general patterns, a near-steady state in GRACETWS and GRACEGWS (−1 to +1 mm/year) for an earlier period (2002 to 2012), hereafter referred to as period I, followed by an increase in GRACETWS (28 to 120 mm/year) and GRACEGWS (10 to 130 mm/year) for a later period (2013 to 2016), hereafter referred to as period II (Figure 9).
The breakpoints were identified using the regime shift detection method with a 95% confidence [91]. The two above-mentioned general patterns were observed throughout the entire investigated area. For all of the downscaled pixels, no major differences in GRACETWS and GRACEGWS trends were observed during period I, all of which show near-steady trends. However, distinct variations in the GRACETWS and GRACEGWS trend values are observed in period II between the three clusters ( Figure  9). Cluster 1 (represented by point 1; Figure 9) has the highest GRACETWS (103 to 122 mm/year) and GRACEGWS (100 to 130 mm/year) trends, followed by cluster 2 (represented by point 2), with a TWS trend of 50 to 57 mm/year and GWS trend of 45 to 70 mm/year. Clusters 1 and 2 are located within areas characterized by the highest average SWE (60 to 190 mm/year) and the highest average annual rainfall (800 to 1043 mm/year), respectively. Cluster 3 is located in the southern and southwestern parts of the LP, areas that are characterized by high groundwater extraction. This cluster (represented by point 3) shows a TWS trend of 28 to 55 mm/year and a GWS trend of 10 to 55 mm/year.  (Figures 1 and 2) where the glacial deposit is relatively thick, whereas cluster 3 is located in the southwestern section of the LP, an area characterized by high groundwater withdrawal for agricultural activities (Figure 1). The eastern part of cluster 3 is located in an area characterized by low yield (Figure 2) and thin to moderate glacial deposits (refer to Section 2). Also, in general, the northern, but not the southern sections, of the LP have high vertical conductivity and low groundwater extraction rates [42]. The average annual rainfall over the entire LP increased from 774 mm/year (period I) to 783 mm/year (period II) and the average annual SWE increased from 50 (period I) to 55 mm/year (period II) ( Figure 10). For cluster 1, the average annual rainfall increased from 689 (period I) to 723 mm/year (period II) and the average annual SWE increased from 44 to 75 mm/year. Similarly, for cluster 2, the average annual rainfall increased from 761 (period I) to 785 mm/year (period II) and the average annual SWE increased from 52 to 56 mm/year in periods I and II, respectively. For cluster 3, the average annual SWE increased from 36 (period I) to 44 mm/year (period II), but the average annual rainfall decreased from 834 (period I) to 823 mm/year (period II). These collective observations suggest that the observed steep GRACETWS and GRACEGWS trends over the northern sections of the LP during period II could be related to one or more of these factors: (1) Thickened glacial deposit, (2) high precipitation and/or snow fall rates, (3) high vertical conductivity [42], and (4) low extraction rates.
One would expect the above-mentioned temporal variations in precipitation and SWE to be reflected in the downscaled GRACEGWS and groundwater levels. Figure 11 shows an overall correspondence between the downscaled GRACEGWS data and groundwater levels from three monitoring wells (well A: site name 02S 11W 22CDBB 01, location Kalamazoo; well B: site name 04N 02W 26BBDB 01, location Lansing; and well C: site name 04N 02W 16DAAA 01, location Lansing; Figure 1) within each of the downscaled GRACEGWS pixels. One should not expect a one to one correspondence between the two datasets. The groundwater levels, but not the GWS, are affected by groundwater withdrawal and by the lag time (between precipitation and recharge). Unfortunately, only a few of the monitoring wells, all located in Kalamazoo and Lansing, have continuous measurements throughout the investigated period and across the study area, none of which are located in the central or northern LP (Figure 1). The correlation coefficient between the downscaled GWS (0.125° × 0.125°) and the observed groundwater level in wells A, B, and C were calculated at 0.4, 0.55, and 0.32, respectively, higher than those between the original GRACEGWS and the wells (A: 0.14; B: 0.36; and C: 0.05).  . Comparison between the downscaled GRACEGWS data for three pixels and groundwater levels from monitoring wells within each of the three GRACE pixels in Kalamazoo (well A) and Lansing (wells B and C) (see locations in Figure 1). Groundwater-level elevations are given in elevation above mean sea level (cm).
We also compared the time series for surface water levels from two inland lakes (Otsego lake in northern LP and Austin lake in southern LP; Figure 1) to downscaled GRACETWS time series in areas (pixels) proximal to these lakes ( Figure 12). The surface water lake levels approximate the groundwater table in the surrounding areas and thus, the changes in lake level should be indicative of the temporal variations in GRACEGWS [92,93]. Figure 12 shows a good correspondence between the downscaled GRACEGWS and Otsego lake level (correlation coefficient: 0.73) and Austin lake level (correlation coefficient: 0.75), an observation that further validates the adopted downscaling procedures. Figure 12. Comparison between the downscaled GRACEGWS data for two pixels and surface water levels from two inland lakes, namely Otsego lake and Austin lake (see locations in Figure 1). The discontinuities in the lake levels is due to temporal gaps in the collected data.

Conclusion
The GRACE data has been widely used to monitor the temporal and spatial variations in TWS and GWS on large scales. However, such applications remained limited on the local scales due to the poor spatial resolution (irregular grid of 12,000 km 2 ) of the GRACE data. The objective of this study was to address this shortcoming by downscaling the CSR mascon solutions to a finer resolution (0.125° × 0.125°) to enable monitoring of GRACETWS and GRACEGWS on county scales and fill the gaps for missing months in the GRACE time series over the study area. Using cluster analysis, areas of similar GRACETWS patterns within the study area were first identified. For each of the identified clusters, variables (total precipitation, NDVI, snow cover, streamflow, Lake Michigan water level, Lake Huron water level, soil moisture, land surface temperature, and ET) that presumably contributed to, or were correlated with, the GRACE data were identified and collected on a monthly basis over the investigated period (2002 to 2016). The data sets were randomly partitioned into two groups: Training data (80%) and testing data (20%). XGBoost, MR, and ANN methods were applied to extract statistical relationships between the independent variables and the GRACETWS (dependent variable).
The comparisons of the observed GRACETWS (training dataset) versus the modeled GRACETWS values showed that the XGBoost method outperformed the other two methods as indicated by their lower NRMSE and higher NSE values compared to those obtained from the MR and ANN models. The unified approaches have the advantage of providing adequate overall downscaling results over large areas, yet one would expect the performance of the model to vary from one area to another given that the selected variables and/or their significance is likely to vary across the investigated area.
We suggest that if statistical downscaling methods were selected for downscaling GRACETWS values on local scales, preference should be given to cluster-based approaches over the commonly used unified approaches.
The XGBoost model was used to downscale (12,000 to 120 km 2 ) GRACETWS given the high performance of this model over all other models (ANN and MR) and its ability to estimate the contributions of the independent variables towards the response variable and to forecast missing months within the GRACE's time series data. Although the XGBoost model outperformed the ANN method in all three clusters in our study area, that might not necessarily be the case over other locations. We suggest that one should explore the use of multiple statistical approaches and select the one that performs the best over each of the investigated areas (clusters).
Since the individual variables and the degree to which they correlate with GRACETWS vary from one cluster to another, it is recommended to identify the local hydrologic components that are specific to the investigated area and to select the optimum cluster-based model to improve the accuracy of the downscaled GRACETWS values. The accuracy of the derived downscaled GRACEGWS will largely depend on the accuracy of the land surface model outputs that were used in calculating GRACEGWS, namely the SMSNLDAS and SWENLDAS in our case. Unfortunately, the State of Michigan lacks a comprehensive groundwater monitoring program to validate the downscaled GRACEGWS data adequately.
As discussed earlier, we cannot rule out leakage from the adjacent water bodies and/or areas, but we suggest that there are significant geophysical signals from each of the investigated areas (clusters) as evidenced by the modest correlation coefficients between the time series of areas 1, 2, and 3 and by the lag of lake levels by 2-months behind the GRACETWS over the land areas. Currently, efforts are underway to generate GRACETWS of higher spatial resolution (1° x 1°) by NASA JPL, through combining satellite gravimetry and in-situ GNSS measurements [94]. If and when such data become available, we can apply the proposed methodologies on the individual pixels without worrying about the leakage from their surroundings.
We developed a straightforward methodology that could be used to monitor temporal variations in GRACETWS and GRACEGWS on local scales (county levels). The methodology takes advantage of readily available remote sensing datasets and outputs of land surface models that are of global nature, both of which come at no cost to users. These methodologies could be used by local communities and decision makers for water management purposes in the State of Michigan. They can also provide a replicable model for local applications across the continental USA and possibly in similar settings worldwide as well. The performance of the statistical models can be enhanced by identifying and including local variables that control, or correlate with, the GRACETWS solutions over the investigated areas.

Conflicts of Interest:
The authors declare no conflict of interest.