Spatio-Temporal Evaluation of Water Storage Trends from Hydrological Models over Australia Using GRACE Mascon Solutions

: The Gravity Recovery and Climate Experiment (GRACE) data have been extensively used to evaluate the total terrestrial water storage anomalies (TWSA) from hydrological models. However, which individual water storage components (i.e., soil moisture storage anomalies (SMSA) or groundwater water storage anomalies (GWSA)) cause the discrepancies in TWSA between GRACE and hydrological models have not been thoroughly investigated or quantified. In this study, we applied GRACE mass concentration block (mascon) solutions to evaluate the spatiotemporal TWSA trends (2003 – 2014) from seven prevailing hydrological models (i.e., Noah-3.6, Catchment Land Surface Model (CLSM-F2.5), Variable Infiltration Capacity macroscale model (VIC-4.1.2), Water — Global Assessment and Prognosis (WaterGAP-2.2d), PCRaster Global Water Balance (PCR-GLOBWB-2), Community Land Model (CLM-4.5), and Australian Water Resources Assessment Landscape model (AWRA-L v6)) in Australia and, more importantly, identified which individual water storage components lead to the differences in TWSA trends between GRACE and hydrological models. The results showed that all of the hydrological models employed in this study, except for CLM-4.5 model, underestimated the GRACE-derived TWSA trends. These underestimations can be divided into three categories: (1) ignoring GWSA, e.g., Noah-3.6 and VIC-4.1.2 models; (2) underrating both SMSA and GWSA, e.g., CLSM-F2.5, WaterGAP-2.2d, and PCR-GLOBWB-2 models; (3) deficiently modeling GWSA, e.g., AWRA-L v6 model. In comparison, CLM-4.5 model yielded the best agreement with GRACE but overstated the GRACE-derived TWSA trends due to the overestimation of GWSA. Our results underscore that GRACE mascon solutions can be used as a valuable and efficient validation dataset to evaluate the spatio-temporal performance of hydrological models. Confirming which individual water storage components result in the discrepancies in TWSA between GRACE and hydrological models can better assist in further hydrological model development. CSR-M, JPL-M, and GSFC-M) to evaluate the spatio-temporal performance of seven widely used hydrological models (i.e., Noah-3.6, CLSM-F2.5, VIC-4.1.2, WaterGAP-2.2d, PCR-GLOBWB-2, CLM-4.5, and AWRA-L v6 models) over Australia from 2003 to 2014. The spatiotemporal differences in TWSA trends between GRACE and hydrological models were investigated and, more importantly, the corresponding individual water storage components (i.e., SMSA and GWSA) that lead to the differences in TWSA trends were explored.


Introduction
Accurate quantification of total terrestrial water storage anomalies (TWSA) is indispensable for the prediction of regional food supply, energy production, human and ecosystem health, and economic and societal development [1], particularly in arid and semi-arid environments (e.g., Australia). Since the first primitive land surface model introduced by Manabe in 1969 [2], many tens of hydrological models have been employed to analyze TWSA. Due to different model structures and parameterizations utilized in hydrological models, significant discrepancies can be observed among these model-simulated water storage variations in Australia. For instance, although Water-Global Assessment and Prognosis (WaterGAP-2.2) model [3] and Community Land Model version 4.0 (CLM-4.0) [4] include all water storage components, they have different climate forcing and soil thicknesses (more details about these models can be found in Section 2.2.1). Such differences cause a linear trend of TWSA (2002-2014) simulated by these two models in the Murray-Darling river basin to show opposite variation, i.e., 3.78 mm/year for WaterGAP model but −4.58 mm/year for CLM-4.0 model [5]. Hence, a question emerges about the reliability of hydrological models in Australia.
Encouragingly, satellite measurements of the Earth's time-variable gravity from the Gravity Recovery and Climate Experiment (GRACE) mission provide new data for monitoring TWSA at the regional and global scales with a surface spatial resolution of 300 km [6]. In Australia, GRACE data have been used to monitor the Millennium Drought [7][8][9][10][11], groundwater variations [12,13], and the impacts of extreme hydroclimatic events (e.g., El Niño-Southern Oscillation (ENSO), Indian Ocean Dipole (IOD), and Southern Annular Mode (SAM)) on water change [9,[14][15][16][17]. Additionally, due to the uncertainty of monthly GRACE-derived TWSA being at an error level of 2 cm in terms of equivalent water thickness (EWT) for a catchment of the size ≈63,000 km 2 [18], and considering that this uncertainty could be decreased with the increase in catchment size (due to GRACE measurement and leakage uncertainties decreasing with increasing basin size [5]), GRACE data can also be used to evaluate the performance of hydrological models in Australia [19]. van Dijk et al. [20] used the TWSA derived from GRACE spherical harmonic (SH) solutions to test the performance of the TWSA modeled by the Australian Water Resources Assessment (AWRA) system in Australia and indicated that AWRA model tended to have smaller seasonal amplitude and to show less negative trends in northwest Australia, the southern Murray Basin, and southwest Western Australia than GRACE. Swenson et al. [21] used the TWSA derived from GRACE SH solutions to examine the response of the Community Land Model (CLM-4.5) aquifer model to transitions between low and high recharge inputs in a northeastern Australia region and showed that CLM-4.5 model simulates unrealistic longperiod behavior relative to the TWSA derived from GRACE. Tangdamrongsub et al. [22] used the TWSA and groundwater water storage anomalies (GWSA) derived from GRACE data assimilation (GRACE DA; assimilates GRACE SH solutions into the PCRaster Global Water Balance (PCR-GLOBWB) model) to evaluate the performance of four different hydrological models (i.e., WaterGAP, PCR-GLOBWB, Community Atmosphere Biosphere Land Exchange (CABLE), and World-Wide Water (W3)) in Australia and the North China Plain. The results indicated that GRACE DA shows a significant improvement in all measures than those obtained from the model simulations alone. Scanlon et al. [5,23] used the TWSA derived from GRACE mass concentration block (mascon) and SH solutions to evaluate the reliability of two hydrological and water resource models (i.e., WaterGAP and PCR-GLOBWB) and five land surface models (i.e., Noah-3.3, Catchment Land Surface Model (CLSM-F2.5), Variable Infiltration Capacity macroscale model (VIC), Mosaic, and CLM-4.0) in 183 river basins globally (four river basins for Australia). The results showed that compared to GRACE, all models underestimated the large water storage trends and most models underestimated the seasonal water storage amplitudes in tropical and (semi) arid basins, while land surface models generally overestimated the amplitudes in northern basins. These studies focused on using GRACE data to evaluate the reliability of hydrological model-simulated TWSA in Australia, but less attention has been paid to which individual water storage components (i.e., soil moisture storage anomalies (SMSA) or GWSA) lead to the discrepancies in TWSA between GRACE and hydrological models. However, identifying and quantifying the contribution of individual water storage components to the discrepancies in TWSA between GRACE and hydrological models are critical for assessing continental water storage change and further hydrological model development [24]. Therefore, the manner in which individual water storage components affect the performance of hydrological models in Australia needs to be analyzed.
We assume that the total terrestrial water storage (TWS) in Australia is mainly composed of soil moisture storage (SMS) and groundwater storage (GWS), while the contribution of snow water storage (SnWS), canopy water storage (CWS), and surface water storage (SWS) are negligible (more details can be found in Section 2.3.1). The reference value of TWSA is based on the mean of three different GRACE mascon solutions (i.e., Center for Space Research release 6 (RL 06) mascon solutions v01 (CSR-M), Jet Propulsion Laboratory RL06 mascon solutions v01(JPL-M), and Goddard Space Flight Center mascon solutions v02.4 (GSFC-M)), the reference value of SMSA is the mean of four different hydrological models (i.e., Noah-3.6, VIC-4.1.2, CLM-4.5, and Australian Water Resources Assessment Landscape model (AWRA-L v6) models), and the reference value of GWSA is the residual between the reference value of TWSA and the reference value of SMSA. These three different reference values are used to evaluate the performance of hydrological models. This study aimed to use GRACE mascon solutions to evaluate the spatio-temporal performance of TWSA from hydrological models and identify which individual water storage components (i.e., SMSA or GWSA) result in the discrepancies in TWSA between GRACE and hydrological models. The results can provide guidance for choosing the appropriate dataset when investigating different water storage variations and assist in the development of hydrological models in Australia.
The rest of the paper is organized as the follows. Section 2 presents the overview of the study area, hydrological models, and GRACE mason solutions, as well as describes the determination of the reference values (i.e., of the TWSA, SMSA, and GWSA) and the evaluation strategy. Section 3 presents a comparison of different GRACE mascon solutions, an evaluation of hydrological models, and a quality assessment of the contribution of SMSA trends and GWSA trends to the differences in TWSA trends between GRACE mascon solutions and hydrological models, as well as discusses the influence of the driving force from the model structure, climate forcing, human water use, and model calibration on the results. Section 4 concludes the key findings of this study.

Study Area
Australia is the only country in the world that covers an entire continent. This continent can be divided into 13 regions based on topographic drainage divisions [25] (Figure 1). Australia was selected as a study area due to the following reasons: first, only a reduction in leakage errors was required (the signal in a target region may leak into the surrounding areas and reduce signal amplitude in this region (leakage-out), and the signal from the surrounding areas may also leak into this region (leakage-in) and improve the signal amplitude in this region) across coastlines when GRACE data were used to analyze the TWSA over Australia [26], without considering contamination from other continents [27]. GRACE mascon solutions can easily satisfy this requirement [28] and are better for estimating small hydrological signals over Australia. Second, Australia has experienced significant spatio-temporal TWS variabilities under extreme hydroclimatic impacts [16], so measuring the water storage variations over Australia has long been proven to be a challenge for hydrological models [26]. Lastly, SMSA and GWSA are the major components of the TWSA in Australia, while snow water storage anomalies (SnWSA), canopy water storage anomalies (CWSA), and surface water storage anomalies (SWSA) are assumed to be negligible [22] (more details about these components are given in Section 2.3.1). Therefore, this feature can help us to better understand the primary water storage component that leads to the discrepancies in TWSA between GRACE data and a hydrological model. In this study, we focused on analyzing the water storage variations across the whole continent of Australia. Additionally, to help interpretation, we also quantized the water storage variations in four different regions (i.e., the Carpentaria Coast (CC), the North East Coast (NEC), the Murray-Darling Basin (MDB), and the North Western Plateau (NWP)) ( Figure 1), which are typical regions of water storage changes in Australia (more details can be found in Section 3.1).

Hydrological Models
A hydrological model, defined as a simplification of a real-world system, helps us to understand, predict, and manage water resources [29]. Hence, understanding the performance of hydrological models is important. In this study, we assessed the performance of seven state-of-the-art hydrological models that are commonly used in Australia [12,13,20,22]. These hydrological models include three different land surface models (i.e., Noah-3.6, CLSM-F2.5, and VIC-4.1.2) in Global Land Data Assimilation System (GLDAS-2.1) [30], as well as Water-Global Assessment and Prognosis (WaterGAP-2.2d) [3], PCRaster Global Water Balance (PCR-GLOBWB-2) [31], CLM-4.5 [32], and Australian Water Resources Assessment Landscape model (AWRA-L v6) models [33]. Their primary attributes are listed in Table 1. The key differences between hydrological models are model structure, climate forcing input, consideration of human water use or not, and model calibration. For instance, most of the hydrological models used in this study include all individual water storage components, whereas Noah-3.6 and VIC-4.1.2 models do not contain SWS, or GWS. Additionally, the soil thicknesses among hydrological models are different, e.g., the soil depth of VIC-4.1.2 in Australia is from 1.5 to 3.5 m [34], and from 0.5 to 1.5 m for WaterGAP-2.2d model in most parts of Australia [3]. For climate forcing, only Noah-3.6, CLSM-F2.5, and VIC-4.1.2 models use the same parameters, whereas the other hydrological models utilize completely different parameters. Furthermore, only WaterGAP-2.2d model considers human water use and model calibration. In summary, these differences can affect the performance of these models, and are the focus of this study.

GRACE Mascon Solutions
The GRACE twin satellites [6], their science mission launched on 17 March 2002 and ended on 12 October 2017, were a joint space mission of National Aeronautics and Space Administration (NASA) and German Aerospace Center (DLR). Its primary objective was to monitor the temporal changes of the Earth's gravity field at the global scale [19]. The temporal variations of the Earth's gravity field are mainly caused by earthquakes [35], crustal deformations [36], glacial isostatic adjustment [37], and oceanic [38] and hydrologic processes [19]. Across Australia, GRACE-derived gravity changes were generally attributed to large-scale hydrological variations [39,40].
Three newly available GRACE mascon solutions were employed in this study to analyze the TWS changes in Australia, i.e., Center for Space Research release 06 (RL 06) mascon solutions v01 (CSR-M) [41,42], Jet Propulsion Laboratory RL06 mascon solutions v01 (JPL-M) [43,44], and Goddard Space Flight Center mascon solutions v02.4 (GSFC-M) [45]. Their primary processing details are listed in Table 2. Compared to GRACE SH solutions, there are two main advantages of mascon solutions when using them to study the water storage changes in Australia. First, mascon solutions can be applied from the regional to global scales, whereas SH solutions are just for the global scale [28]. Hence, mascon solutions can better resolve ocean leakage error in terrestrial water signals across Australia [26,46]. Second, mascon solutions apply geophysical data constraints during processing with little empirical post-processing requirements, i.e., they do not require de-striping and signal restoration, which were necessary for the SH solutions [28]. Therefore, mascon solutions provide a higher signal-to-noise ratio [1], which is more suitable for the arid Australian environment with relatively small seasonal hydrological variations [26]. Although the spatial resolution of GRACE is only ~100,000 km 2 (~3° × 3° at the equator) [28], it is suitable for measuring the TWS in a large-scale region [5] (e.g., for the whole continent of Australia). It should be noted that the timelines of all datasets were chosen during the intersection of GRACE and hydrological models, that is, from January 2003 to December 2014 (Tables 1 and 2). The missing months of data in GRACE mascon solutions were filled by linear interpolation. Since the monthly water anomalies from GRACE mascon solutions were typically computed relative to a timemean baseline ( Table 2), all water storages anomalies were computed relative to the same time-mean baseline (i.e., 2004.0-2009.999). Additionally, to ensure the consistency of the spatial resolution between GRACE and hydrological models, the spatial resolution of all data was interpolated to 0.5°. We found that the standard deviations between the interpolated data-derived TWSA and that of the corresponding original resolution data in Australia were lower than those between the different original resolution data. Therefore, although interpolation may affect the value of an individual grid cell, it can be ignored for the study of TWSA at the regional scale.

TWS SnWS CWS SWS SMS GWS
Total terrestrial water storage anomaly (TWSA) is relative to a long-term mean TWS (e.g., the average of 2004.0-2009.999 in this study), generally expressed as an equivalent water thickness (EWT; unit: mm) [48]. The same interpretations are for SnWSA, CWSA, SWSA, SMSA, and GWSA. Under the Australian environment, SWS changes are expected to be minor and negligible [12]. Rodell et al. [49] indicated that the seasonal amplitude of the gravity changes caused by biomass variations is as large as 5 mm, but it was still at an order of magnitude smaller than the current GRACE uncertainties (2 cm); hence CWS changes are negligible. Additionally, SWS variations have little impact over most of Australia [50], and such variations are not considered as a major contributor to TWSA [40,50,51]. In summary, based on these assumptions, the water storage equation over Australia can be simplified as follows:

Time Series Decomposition and Trend Estimation
The Seasonal Trend Decomposition using Loess procedure (STL), introduced by Cleveland et al. [52], is a versatile and robust decomposition method [53]. This method consists of inner and outer loops. In the inner loop, the seasonal and long-term components are estimated from a given time series under several passes of smoothing filters. In the outer loop, robustness weights, which are utilized in the next iteration of the inner loop, are adjusted to reduce the impact of outliers in the time series (more details can be found in the work of [53]). Compared to the outputs from harmonic analysis, which is a common approach used to remove seasonal variations from given time series data, the STL method shows similar results but isolates long-term components without additional smoothing [5]. Therefore, in the current study, the STL method was utilized to decompose the monthly time series of water storage anomalies into seasonal ( seasonal X ), long-term ( − long term X ), and residual ( residual X ) components as follows: The long-term component was further decomposed into the linear X and − inter annual X components by fitting a trend applying least squares linear regression and assigning the remaining long-term signal to inter-annual variability. The residual component included the unmodeled seasonal and long-term components and can reflect the sub-seasonal signal and noise [5].
The seasonal variability and linear trend (hereinafter called trend) in TWSA are important for understanding the related hydrological processes over specified regions [54]. They are usually selected as evaluation metrics to assess the performance of hydrological models [5,23]. However, for the TWSA over Australia, only some parts of the northern continent are dominated by seasonal variability [53]. Therefore, in this study, we focused on evaluating the performance of hydrological models based on the trends in water storage changes. Additionally, in order to assess the significance of the trends, the Mann-Kendall test (M-K test) [55,56] with a significance level of 0.05 was performed on long-term components data based on STL decomposition (Equation (3)). The M-K test, which is a non-parametric test for identifying whether or not there is a linear monotonic trend in a time series [57], has been widely utilized to detect trends in hydrologic data [58]. Additionally, the uncertainties in trends are mathematically derived from least squares linear regression when decomposing longterm components [28].

Determination of Reference Values
In this study, we not only assessed the TWSA trends derived from hydrological models, but also evaluated the trends of the individual water storage components of said hydrological models. To understand how different water storage components (i.e., soil moisture and groundwater) contribute to the differences in TWSA trends between GRACE and hydrological models, it is essential to determine the corresponding reference values of the different components.
Since GRACE-derived TWSA is an integrated change and cannot be decomposed into individual components without ancillary data, it has been widely used to evaluate modeled TWSA simulations. Additionally, we used the mean of three GRACE mascon solutions-based TWSA as a reference value for TWSA, which is more effective in reducing the noise in the gravity field solutions than an independent mascon solution [59]. The TWSA uncertainty was calculated based on the standard deviation among these GRACE mascon solutions [59].
Soil moisture storage changes are generally from one of the following three broad sources, namely, in situ observations, satellite remotely sensed estimates, and models simulations [60]. However, in situ measurements are at point-scale and are sparsely distributed across the continent of Australia [61]. Although remotely sensed soil moisture estimates can satisfy the spatial requirements, they usually represent the soil moisture in the top ~5 cm [62], which is much lower than the soil depth of hydrological models (Table 1). Therefore, in this study, we utilized the modeled SMSA as a reference value, which has been widely implemented in Australia [7,12,40,63]. Additionally, there is no definite way to identify which model demonstrates the highest accuracy; therefore, in order to reduce the influence of model uncertainty on the reliability of the reference valued, we adopted a multi-model (i.e., Noah-3.6, VIC-4.1.2, CLM-4.5, and AWRA-L v6 models; more details can be found in Section 3.2.2) ensemble average as a reference value of SMSA. The SMSA uncertainty was equal to the standard deviation among the applied models [51].
The variations in groundwater storage can be measured by in situ borehole observations; however, the spatial coverage of monitoring boreholes in Australia is inadequate and uneven [12,22]. These boreholes are mainly biased toward locations in the southern and southwestern parts and the eastern coast [64], which cannot satisfy the spatial requirement of this study on a continental scale. Moreover, the groundwater level from the monitoring of boreholes needs to be multiplied by a specific yield to convert it into groundwater storage, but the uncertainty in the special yield is very difficult to quantify [40]. Although most of the hydrological models used in this study include groundwater storage, large differences in GWSA trends between them were found (more details can be found in Section 3.2.3). Thus, selecting one model or a multi-model ensemble mean as a reference value was not reasonable. In comparison, GRACE-inferred groundwater storage, which was separated as the residual between the GRACE-based TWSA and the modeled SMSA [51,65,66], can provide a reasonable agreement with borehole-derived water storage for the regions with sufficient boreholes over Australia [7,12,22,40,67]. This suggests that GRACE-inferred GWSA can reliably be used as a reference value of GWSA, and the GWSA uncertainty was calculated by the root-sum-ofsquares of two uncertainties from the TWSA and SMSA reference values [40].

Evaluation Strategy
In this study, we concentrated on using the differences (Equation (4)) in different water storage trends between hydrological models and the reference values as evaluation metrics to assess the performance of said hydrological models.
where trend is the trend difference, M trend is the water storage trend simulated by a hydrological model, and O trend is the water storage trend derived from the reference value. If the value of trend is negative, it implies that the hydrological model underestimates the reference value of trend. It should be noted that the released spatial resolution of GRACE mascon solutions has the same or a higher resolution than that of hydrological models (Tables 1 and 2) (e.g., the released spatial resolution of GSFC-M is 1° × 1°, the same as that of GLDAS-2.1 model). However, these grid cells in GRACE mascon solutions are highly correlated with their neighbors due to the native spatial resolution of GRACE being approximately 3° × 3° [1]. Therefore, any analysis of the result of an individual grid cell should be avoided [20,22]. In other words, the spatial pattern of different water storage trends was only used for visualization purposes, and the quantified metrics for evaluating the performance of hydrological models were based on the differences between the temporal trends. Additionally, to aid interpretation, we estimated the temporal agreements in different water storage anomalies between hydrological models and the reference values using the following statistics: Pearson's linear correlation coefficient (r: at the significance level p < 0.05), root-meansquare error (RMSE), and the Nash-Sutcliffe efficiency (NSE) coefficient, respectively, as follows: where i M and i O are the modeled and reference values of the same variable, respectively; M and O are the mean of the modeled and reference values, respectively, i is the number of months, and n is the total number of months (i.e., 144 months).

Comparison of Total Terrestrial Water Storage Trends Derived from GRACE Mascon Solutions
The spatial distributions of the TWSA trends derived from three different GRACE mascon solutions are shown in Figure 2, while the temporal variations in the TWSA derived from three different GRACE mascon solutions based at the regional and continental scales are shown in Figure  3. As displayed in Figure 2a Figure 3d). This pattern was also reported by Humphrey et al. [53] and Rodell et al. [1]. Meanwhile, the TWSA trends from JPL-M and GSFC-M experienced a similar distribution to that of CSR-M (Figures 2b,c and 3a-d, and Table S1), but CSR-M displayed higher apparent resolution and gradual changes. This can be attributed to the fact that the regularization of the CSR mascon solutions is dependent on a 200-km Gaussian filter and its grid files are released as 0.25° apparent resolution [42]. Furthermore, the TWSA trends simulated by JPL-M indicated larger increasing and decreasing trends than those of CSR-M and GSFC-M (Table S1). A similar result was also found by Rodell et al. [1], who indicated that the 3° mascons (native resolution) of JPL-M "focus" more signals than the 1° mascons (native resolution) of CSR-M and GSFC-M over India. Additionally, the TWSA trends derived from CSR-M and JPL-M showed more spatial consistency than those from GSFC-M, especially in CC, Lake Eyre Basin (LKE), and South West Coast (SWC) (Figure 2a-c). These differences could be due to the new RL06 mascon solutions, which were applied in CSR-M and JPL-M, with a higher signal-to-noise ratio than the v02.4 mason solutions used in GSFC-M [68].  Across the entire continent of Australia, the TWSA time series derived from three different GRACE mascon solutions were highly consistent ( Figure 3e); all correlation and NSE coefficients were larger than 0.90 and 0.85, respectively, and all RMSEs were lower than 1.4 cm (Table 3).
Additionally, the agreement between CSR-M and JPL-M (i.e., r = 0.98, RMSE = 0.66 (cm), and NSE = 0.97) was higher than between the others (Table 3). Under extreme hydroclimatic impact, the TWSA over the continent indicated three different distinct temporal segments [16], i.e., two dry periods (i.e., 2003-2009 and 2011-2014) and one wet period (i.e., 2010-2011). Overall, the TWSA trends derived from three different GRACE mascon solutions depicted a long-period significant increasing trend in Australia throughout the full investigated period, but the TWSA trend from GSFC-M (2.38 ± 0.50 mm/year) was lower than that from CSR-M (3.69 ± 0.57 mm/year) and JPL-M (3.38 ± 0.62 mm/year). The spatial and temporal differences in the TWSA trends among GRACE mascon solutions over Australia could be related to the different background models and data processing strategies used in said GRACE mascon solutions ( Table 2). In order to reduce the influence of noise in the gravity field solutions on the evaluation of hydrological models, we used the ensemble mean of all GRACE mascon solutions-based TWSA as a reference value of TWSA [59].  Figure 4 maps the spatial distribution of seven different hydrological models-simulated TWSA trends (Figure 4a-g) and the reference values of TWSA trends (Figure 4h). Figure 5 shows the time series of the regional and the continental average TWSA simulations from seven different hydrological models and the corresponding reference values (i.e., the ensemble means of CSR-M, JPL-M, and GSFC-M). As shown in Figures 4 and 5, all models underestimated the increasing or decreasing trends in Australia compared to GRACE, except for CLM-4.5 model. For instance, Noah-3.6 and VIC-4.1.2 models indicated significant wetting TWSA trends in MDB (Figures 4a,c and 5c) with a value of 5.37 ± 0.79 mm/year and 4.04 ± 0.68 mm/year, respectively. However, these values are lower than the reference value of 7.67 ± 0.83 mm/year. Additionally, these models showed no obvious wetting TWSA trends in CC and NEC (Figure 5a,b and Tables S2-S3) and drying trends in the NWP (Figure 5d and Table S5). Furthermore, despite CLSM-F2.5, WaterGAP-2.2d, PCR-GLOBWB-2, and AWRA-L v6 models containing all water storage compartments, these models also showed an underestimation of the TWSA trends compared to GRACE, in particular, the increasing trends in CC (Figure 5a and Table S2) and the decreasing trends in NWP (Figure 5d and Table S5). On the contrary, CLM-4.5 model overestimated the increasing trends in NEC (28.62 ± 1.28 mm/year) compared to the reference value of 7.66 ± 0.81 mm/year (Figure 5b). Additionally, we found that although WaterGAP-2.2d was the only model used in this study that considers human water use and calibration (Table 1), it showed lower TWSA trends and more non-significant trends than the other models (Figure 4d and Tables S2-S5). Across the entire continent of Australia, all of the hydrological models indicated similar TWSA temporal changes to those of reference value (Table 4). CLM-4.5 model presented the highest agreement (r = 0.95, RMSE = 1.05 cm, and NSE = 0.91) in terms of TWSA to the reference value, whereas WaterGAP-2.2d model expressed the lowest agreement (r = 0.77, RMSE = 3.02 cm, and NSE = 0.20). Additionally, except for CLM-4.5 model, the TWSA trends from all models (ranging from 0.56 ± 0.13 to 2.02 ± 0.45 mm/year) ( Table 5) were lower than that of the reference value, with a rate of 3.15 ± 0.56 mm/year. Furthermore, although CLM-4.5 model demonstrated the closest agreement with GRACE, this model overestimates the GRACE-derived rising TWSA trend, with a rate of 3.79 ± 0.52 mm/year. In addition, we observed that there were large discrepancies in TWSA between GRACE and hydrological models over the period of 2011-2013; these discrepancies may be due to the hydrological models underestimating the process of severe flooding caused by the strongest La Niña event on record in Australia [15].    1 The reference value of TWSA trend is based on the ensemble average of CSR-M, JPL-M, and GSFC-M-born TWSA trends; the reference value of SMSA trend is based on the average of all hydrological models-derived SWSA trends, except for CLSM-F2.5, WaterGAP-2.2d, and PCR-GLOBWB-2 models; the reference value of GWSA trend is separated as the residual between the reference value of TWSA trend and the reference value of SMSA trend. The uncertainties in trends were mathematically derived from least squares linear regression when decomposing the long-term component [28].

Total Terrestrial Water Storage Trend
The differences in TWSA trends may be due to the uncertainties of GRACE and hydrological models [5]. However, we found that the standard deviations in TWSA among GRACE mascon solutions are much lower than those from hydrological models ( Figure 5, Tables 3-4, and S6-S13). Therefore, this suggests that the uncertainties in hydrological models are the main reason for the differences in TWSA trends between GRACE and hydrological models over Australia. The main factors associated with hydrological model uncertainties include model structure, climate forcing input, consideration of human water use or not, and model calibration [3]. By comparison, the water storage components and the capacity related to model structure have the most substantial impact on a hydrological model's uncertainties [3,5], so we focused on studying this factor.
It can be clearly seen that one of the main reasons for Noah-3.6 and VIC-4.1.2 models underestimating the GRACE-based trend is because these models do not include groundwater in their simulations (Table 1). For other models, however, we could not directly demonstrate the differences in TWSA trends between GRACE and hydrological models being caused by the uncertainties in soil moisture and/or groundwater. Therefore, it was necessary to assess the individual water storage trends derived from hydrological models. Figure 6 shows the spatial distributions of SMSA trends simulated by seven different hydrological models (Figure 6a-g) and the reference values (Figure 6h), while Figure 7 shows the regional and continental average temporal changes in SMSA derived from these seven different hydrological models and the corresponding reference values. As shown in Figure 6a Figure 7d). Similar patterns in these trends were observed in VIC-4.1.2, CLM-4.5, and AWRA-L v6 models (Figures 6c,f,g and 7a-d, and Tables S2-S5). However, CLM-4.5 model showed higher increasing SMSA trends in NEC (9.34 ± 0.68 mm/year; Figure 7b and Table S3) and larger decreasing SMSA trends in NWP (−5.12 ± 0.43 mm/year; Figure 7d and Table S5). This could be related to the fact that the CLM-4.5 model includes a dry surface layer, which can significantly reduce the evaporation in NEC [69]; however, this dry surface layer may overestimate evaporation in NWP. In comparison, CLSM-F2.5, WaterGAP-2.2d, and PCR-GLOBWB-2 models underestimated the SMSA trends in Australia, especially the wetting SMSA trends in NEC and MDB (Figures 6b,d, e and 7b-c and Tables S3 and S4). For the whole continent of Australia, the SMS variations estimated by Noah-3.6, VIC-4.1.2, CLM-4.5, and AWRA-L v6 models were highly consistent with all of the correlation coefficients, and NSE coefficients were larger than 0.90, whereas all of the RMSEs were lower than 1.0 cm (Table 4). On the contrary, CLSM-F2.5, WaterGAP-2.2d, and PCR-GLOBWB-2 models showed similar SMS changes to the correlation coefficients, ranging from 0.66 to 0.95 (Table 4), whereas their changing magnitudes were lower than those from the other models with the NSE coefficients ranged from 0.17 to 0.60 and the RMSEs ranging from 1.76 to 2.54 cm (Table 4)  The above results reveal that CLSM-F2.5, WaterGAP-2.2d, and PCR-GLOBWB-2 models underestimated the spatio-temporal pattern of SMSA trends. This underestimation can be explained by the fact that the soil profiles of these three models are too thin to sufficiently accommodate the soil moisture storage, e.g., most of the soil thicknesses of WaterGAP-2.2d model in Australia are approximately 0.5-1.5 m [3]. Additionally, these three models had worse performance in simulating extreme hydrological changes. To avoid possible biases in trend calculations, we applied the ensemble average of all of the hydrological models-derived SMSA as the SMSA reference values in Australia, with the exception of CLSM-F2.5, WaterGAP-2.2d, and PCR-GLOBWB-2 models.

Groundwater Storage Trend
The spatial patterns of GWSA trends derived from five different models and the corresponding reference values are displayed in Figure 8, while the temporal variations of the regional and the continental average GWSA simulations from five different hydrological models and the corresponding references value are shown in Figure 9. Distinct spatial differences in the GWSA trends among five different models can be observed (Figure 8a-e). These differences imply that it may be unreasonable to selected one model or a multi-model ensemble mean as a reference value. Therefore, the residuals between the reference values of TWSA trends (the ensemble means of CSR-M, JPL-M, and GSFC-M) and the reference values of SMSA trends (the ensemble mean of Noah-3.6, VIC-4.1.2, CLM-4.5, and AWRA-L v6 models) were applied as the reference values of GWSA trends. This hypothesis is consistent with the previous studies [40,51]. As shown in Figure 8f Figure 9d). Swenson and Lawrence [21] attributed this overestimation to CLM-4.5 model lacking a finite lower boundary in the bulk aquifer model. It should be pointed out that poor modeling of the influence of ocean tides and non-tide ocean mass movement can affect the GRACE water storage estimates in the nearby Gulf of Carpentaria [20,70]. However, these errors introduce a cyclical error without affecting long-term trends [40]. Additionally, a lower aquifer indicates an obvious rising groundwater level trend during 2004-2014 in the northern part of Australia from the Australian Groundwater Insight [71]. Hence, the increasing GWSA trends from the reference values in the northern part of Australia are reliable (Figure 8f). However, all of the hydrological models underestimated these increasing trends (Figure 8a-e), which could be attributed to the fact that these increasing trends are mostly expressed as GWSA trends of the lower aquifer, and all of the hydrological models may lack sufficient storage capacity to accommodate variations in deep groundwater [21]. In general, under the impact of "big wet" driven by one of the strongest La Niña events in 2011 [16], all of the hydrological models indicated a significantly increasing GWSA trend for the whole continent of Australia during the study period (Table 5). However, the GWSA trends derived from CLSM-F2.5, WaterGAP-2.2d, PCR-GLOBWB-2, and AWRA-L v6 models (ranging from 0.28 ± 0.08 to 1.04 ± 0.44 mm/year) were lower than that of the reference value (with a trend of 1.57 ± 0.32 mm/year).  Table 4). This could be related to the soil thickness of CLSM-F2.5 model being only 1 m, thus resulting in larger changes in GWSA [24]. Therefore, the depth of the soil layer has a significant influence on groundwater storage variations [21]; however, the manner in which soil thickness affects groundwater storage was beyond the scope of this paper.

Influence of Model Structure on Model Uncertainty
Based on the aforementioned results presented in Sections 3.1 and 3.2, different water storage changes in Australia can be calculated through a comprehensive synergy of GRACE mascon solutions and hydrological model simulations. The pronounced decreasing TWSA trends were located in NWP and mainly expressed as a decline in GWSA trends (Figures 4h and 8f), which may be associated with the influence of Pilbara's mining industry [1]. On the contrary, significant increasing TWSA trends were found in CC, NEC, and MDB (Figure 4h), which are attributed to the impact of a heavy rain event in 2011 [16]. The trends in NEC and MDB were expressed as wetting SMSA and GWSA trends (Figures 4h, 6h, and 8f), whereas the trends in CC were mostly indicated as increasing GWSA trends (Figures 4h, 6h, and 8f). In summary, SMSA trends demonstrated similar contributions (50.16 ± 15.24%) to TWSA trends than those from GWSA trends (49.84% ± 10.16%) in Australia over the period of 2003-2014 ( Table 5).
As shown in Figure 10 and Table 5, all of the hydrological models underestimated the GRACEderived increasing TWSA trends in Australia, with the exception of CLM-4.5 model. The causes for the underestimation can be divided into three categories: (1) the model ignores the groundwater water component, i.e., Noah-3.6 and VIC-4.12 models [5]; (2) the model's soil depth is too small to accommodate the range in soil moisture change and lacks sufficient groundwater storage capacity, i.e., CLSM-F2.5, WaterGAP-2.2d, and PCR-GLOBWB-2 models [5]; (3) the model does not have enough ability to catch groundwater storage changes, i.e., AWRA-L v6 model [20]. On the contrary, CLM-4.5 model overestimated the GRACE-based TWSA trends because of the overestimation of GWSA trends, which was induced by the lack of a finite lower boundary in the bulk aquifer, with poor responses to large wetting events [21]. Besides the trend in TWSA, the seasonal and inter-annual variability in TWSA can also be selected as evaluation metrics to assess the performance of hydrological models [23,54]. These two metrics will be studied in the future.

Impact of Other Factors on Model Uncertainty
In addition to the model structure, the differences in TWSA trends between GRACE and hydrological models were also affected by climate forcing, consideration of human water use or not, and model calibration [3,5].

Climate Forcing
Using different climate forcing in seven different models may be a disadvantage in this study. However, it is very difficult to apply the same climate forcing for all hydrological models in order to isolate the influence of climate forcing since some models are not in operational mode (e.g., WaterGAP-2.2d) [3]. To easily evaluate the impact of different forcing data on the conclusion of this study, we made a comparison among three common GLDAS models, i.e., As can be seen in Figure 11a,b, two different Noah models showed similar spatial distribution, i.e., decreasing SMSA trends in NWP and increasing SMSA trends in MDB. However, Noah model in GLDAS-2.0 states larger drying SMSA trends in NWP (−2.36 ± 0.44 mm/year) and lower wetting SMSA trends in MDB (3.27 ± 0.73 mm/year) than those of Noah model in GLDAS-2.1 (−0.13 ± 0.51 mm/year and 5.37 ± 0.79 mm/year in NWP and MDB, respectively), which could be related to the different climate forcing used in these two models. In comparison, CLSM model indicated no obvious decreasing SMSA trends in NWP (−0.04 ± 0.13 mm/year) and lower increasing SMSA trends in MDB (0.85±0.15 mm/year, Figure 11c). Across the whole continent of Australia, two different Noah models presented generally good agreement (r = 0.98, RMSE = 0.73 cm, and NSE = 0.94) in SMS variations during the entire period ( Figure 12), except that large discrepancies in SMSA were found from 2011 to 2012, which were caused by a strong precipitation event in 2011. On the contrary, the magnitude of the SMS changes from CLSM model were always lower than those of the two Noah models, and the agreement metrics between CLSM model and Noah-3.6 model (r = 0.94, RMSE = 2.44 cm, and NSE = 0.37) were lower than those of between two different Noah models. Furthermore, these three GLDAS models indicate increasing SMSA trends over the whole study period, but the increasing magnitude revealed by CLSM model (with a slope of 0.25 ± 0.12 mm/year) was much lower than that of the two different Noah models (with rates of 1.35 ± 0.36 and 1.74 ± 0.46 mm/year in GLDAS-2.0 and GLDAS-2.1, respectively). In summary, model structure and climate forcing can result in differences in TWSA trends between GRACE and hydrological models, but the impact of the model structure on these differences is much greater than that of climate forcing. These results are supported by the findings of Scanlon et al. [5], who also identified that the influence of model structure on TWSA trends is stronger than climate forcing by using different climate forcing into WaterGAP model and applying the same climate forcing to WaterGAP and PCR-GLOBWB models. Therefore, although different climate forcing in models leads to the differences in TWSA trends between GRACE and hydrological models, the conclusion in this study based on model structure is reasonable.

Human Water Use and Model Calibration
It is interesting to note that although WaterGAP-2.2d model was the only model applied in the current study that considers human water use and model calibration (Table 1), it showed the worst agreement in terms of TWSA with GRACE relative to the other hydrological models (Table 4). On the contrary, CLM-4.5 model does not consider human water use and model calibration, yet showed the best agreement in terms of TWSA with GRACE (Table 4). Therefore, this implies that the factors of human water use and model calibration may have uncertainties or are unlikely to be the main cause of the differences in TWSA trends between GRACE and hydrological models in Australia for the period of 2003-2014 [5]. It should be pointed out that human water use can be a primary reason for groundwater depletion in some Australian regions (e.g., in Victoria [12] and south-west western Australia [72]); however, in the current study, we focused on evaluating the performance of hydrological models over the whole continent of Australia; hence, human water use and model calibration did not influence our conclusions.
Overall, we emphasize that model structure is the most important cause for discrepancies in TWSA between GRACE and hydrological models, and do not deny that other factors also have an impact; however, these influences on the conclusions of this study can be ignored.

Conclusions
Assessment of a hydrological model's performance is important for managing water resources and for guiding its future improvement. In this study, we applied three newly available GRACE mascon solutions (i.e., CSR-M, JPL-M, and GSFC-M) to evaluate the spatio-temporal performance of seven widely used hydrological models (i.e., Noah-3.6, CLSM-F2.5, VIC-4.1.2, WaterGAP-2.2d, PCR-GLOBWB-2, CLM-4.5, and AWRA-L v6 models) over Australia from 2003 to 2014. The spatiotemporal differences in TWSA trends between GRACE and hydrological models were investigated and, more importantly, the corresponding individual water storage components (i.e., SMSA and GWSA) that lead to the differences in TWSA trends were explored.
Although different background models and data processing strategies were used in CSR-M, JPL-M and GSFC-M, high spatio-temporal agreement in TWSA was found between them. Additionally, the standard deviations among three GRACE mascon solutions-derived TWSA were lower than that among hydrological models. Therefore, GRACE mascon solutions can be used as valuable data to assist in the improvement of hydrological models.
The Australian TWSA was mainly composed of SMSA and GWSA. The TWSA trends showed a bipolar pattern, i.e., increasing trends in CC, NEC, and MDB, and decreasing TWSA trends in NWP. However, the increasing TWSA trends in CC and the decreasing TWSA trends in NWP were mostly due to groundwater storage variations, while the increasing TWSA trends in NEC and MDB were attributed to changes in both soil moisture and groundwater storage. Across the whole continent of Australia, the TWSA trend indicated a pronounced increasing trend with a value of 3.15 ± 0.56 mm/year throughout the full investigated period. The contributions of SMSA trend (1.58 ± 0.48 mm/year) and GWSA trend (1.57 ± 0.32 mm/year) to TWSA trend were similar, 50.16 ± 15.24% and 49.84% ± 10.16%, respectively.
The differences in TWSA trends between GRACE and hydrological models were mainly caused by uncertainties in these hydrological models. The influences of four different driving forces (i.e., model structure, climate forcing input, consideration of human water use or not, and model calibration) on hydrological models' uncertainties were analyzed. Model structure was the most important reason; climate forcing can result in differences in TWSA trends between GRACE and hydrological models, but the impact of climate forcing on these differences was much lower than that of model structure. Additionally, the factors of human water use and model calibration may have uncertainties or were unlikely to be the main cause of the differences in TWSA trends between GRACE and hydrological models in Australia for the period of 2003-2014.
All of the hydrological models applied in this study, except for CLM-4.5 model, underestimated the GRACE-derived TWSA trends over the whole continent of Australia. Different hydrological models showed different causes for the differences in TWSA trends between them and GRACE. We can divide the causes of underestimation into three categories: missing groundwater storage component (for Noah-3.6 and VIC-4.1.2 models), lack of sufficient capacity to accommodate the range in soil moisture and groundwater storage trends (for CLSM-F2.5, WaterGAP-2.2d, and PCR-GLOBWB-2 models), and inadequate ability to catch groundwater storage changes (for AWRA-L v6 model). In contrast, CLM-4.5 model indicated the best agreement with GRACE TWSA, but overestimated the GRACE-derived TWSA trends, mostly due to the overestimation of GWSA trends.
Supplementary Materials: The following are available online at www.mdpi.com/2072-4292/12/21/3578/s1, Table  S1: Statistical summary of TWSA trends (mm/year) derived from three different GRACE mascon solutions in regions (Carpentaria Coast (CC), North East Coast (NEC), Murray-Darling Basin (MDB), and North Western Plateau (NWP)), and the whole continent of Australia; Table S2: Statistical summary of different water storage trends (mm/year) derived from hydrological models and reference values in CC from 2003 to 2014; Table S3: Statistical summary of different water storage trends (mm/year) derived from hydrological models and reference values in NEC from 2003 to 2014; Table S4: Statistical summary of different water storage trends (mm/year) derived from hydrological models and reference values in MDB from 2003 to 2014; Table S5: Statistical summary of different water storage trends (mm/year) derived from hydrological models and reference values in NWP from 2003 to 2014; Table S6: Statistical summary of temporal agreement metrics between GRACE mason solutions in CC; Table S7: Statistical summary of temporal agreement metrics between GRACE mason solutions in NEC; Table S8: Statistical summary of temporal agreement metrics between GRACE mason solutions in MDB; Table S9: Statistical summary of temporal agreement metrics between GRACE mason solutions in NWP; Table  S10: Statistical summary of agreement metrics between reference values and hydrological models in CC; Table  S11: Statistical summary of agreement metrics between reference values and hydrological models in NEC; Table  S12: Statistical summary of agreement metrics between reference values and hydrological models in MDB; Table  S13: Statistical summary of agreement metrics between reference values and hydrological models in NWP.