Improving the Accuracy of Water Storage Anomaly Trends Based on a New Statistical Correction Hydrological Model Weighting Method

: The Gravity Recovery and Climate Experiment (GRACE) satellite solutions have been considerably applied to assess the reliability of hydrological models on a global scale. However, no single hydrological model can be suitable for all regions. Here, a New Statistical Correction Hydrological Model Weighting (NSCHMW) method is developed based on the root mean square error and correlation coefﬁcient between hydrological models and GRACE mass concentration (mascon) data. The NSCHMW method can highlight the advantages of good models compared with the previous average method. Additionally, to verify the effect of the NSCHMW method, taking the Haihe River Basin (HRB) as an example, the spatiotemporal patterns of Terrestrial Water Storage Anomalies (TWSA) in HRB are analyzed through a comprehensive comparison of decadal trends (2003–2014) from GRACE and different hydrological models (Noah from GLDAS-2.1, VIC from GLDAS-2.1, CLSM from GLDAS-2.1, CLSM from GLDAS-2.0, WGHM, PCR-GLOBWB, and CLM-4.5). Besides, the NSCHMW method is applied to estimate TWSA trends in the HRB. Results demonstrate that (1) the NSCHMW method can improve the accuracy of TWSA estimation by hydrological models; (2) the TWSA trends continue to decrease through the study period at a rate of 15.7 mm/year; (3) the WGHM and PCR-GLOBWB have positive reliability with respect to GRACE with r > 0.9, while all the other models underestimate TWSA trends; (4) the NSCHMW method can effectively improve RMSE, NES , and r with 3–96%, 35–282%, 1–255%, respectively, by weighting the WGHM and PCR-GLOBWB. Indeed, groundwater depletion in HRB also proves the necessity of the South-North Water Diversion Project, which has already contributed to groundwater recovery.


Introduction
The Haihe River Basin (HRB), including mountains and plains, is one of the largest agricultural production bases of China [1]. The main crops grown on HRB are corn and wheat, but there is no rice [2]. In the past two decades, the HRB experienced a rapid land type change. During the process, a large amount of land unsuitable for farming has been reclaimed for agriculture. Land-use changes have impacts on the soil-vegetationatmosphere circulation system. The Terrestrial Water Storage (TWS) anomalies, caused by unsustainable groundwater over-extraction and climate changes, are particularly severe in Note: If the grid area contained in the sub-regions is greater than a quarter of the area of a single grid, then this grid will be included in the number of grids. The grid at the boundary is repeated. The Basin lies in a temperate East Asian monsoon climate zone and belongs to a semi-humid zone, with an average annual temperature of 1~15 • C, average annual precipitation of about 400~600 mm, and annual evaporation of 900~1500 mm [26]. HRB is the most important food production base in northern China [24]. In recent years, the overexploitation of water resources results in negative changes in the ecological environment, such as groundwater depletion, drought, land subsidence, and deterioration of freshwater quality. Five of China's eight super-large groundwater overexploitation areas are in the HRB [30]. The geographic location of the study area in this case is shown in Figure 1. In this work, two newly available GRACE mascon solutions, the monthly Jet Propulsion Laboratory RL06 v02 (JPL) [31,32] and the Center for Space Research (CSR) [33], are used to research the TWSA in HRB. There are currently three institutions processing GRACE mascon data, CSR, JPL, and CSFC, with resolutions of 0.25 • × 0.25 • , 0.5 • × 0.5 • , and 1 • × 1 • , respectively. In this paper, data from two of the most widely used institutions with higher resolution are selected (CSR and JPL). As for the main discrepancy of JPL and CSR, firstly, the native resolution of JPL is 3 • × 3 • , while it is 1 • × 1 • in CSR. Secondly, the tide model used in CSR is GOT4. 8  In GRACE data exists leakage error and the limited intrinsic resolution, which affects the reliability of GRACE to a certain extent. Fortunately, mascon solutions can better resolve ocean leakage error and improve the signal-to-noise ratio in terrestrial water signals over the HRB region. GRACE mascon solutions can be applied from the regional to global scales, whereas the spherical harmonics (SH) solutions are just for the global scale. A basic difference between SH and mascon is that SH solutions need to be processed to reduce noise. This study concerns the reliability of GRACE summing up four aspects: (1) GRACE mascon solutions provide higher precision for monitoring TWSA on the regional scale. (2) We assume that only reduction in leakage errors across coastlines is required when GRACE data are used to analyze the TWSA over HRB. (3) The estimates for all parameters improve in the mascon approach, including the signal-to-noise ratio. (4) Although different background models and data processing strategies are used in CSR and JPL, high spatiotemporal agreement in TWSA is found between them. Thus, GRACE mascon solutions can be used as valuable data to assist in the improvement of hydrological models over HRB.
We believe that the discrepancies concerning reference are the consequence of a random fluctuation and not of a systematic error. To improve the signal-to-noise [34] the two GRACE mascon products are averaged to reduce the uncertainty. The period of all datasets is from 2003 to 2014, as part of the hydrological models are only updated to 2014 (such as PCR-GLOBWB). A total of 11 months of GRACE data are missing between 2003 and 2014, as shown in Table S18. The cubic-spline interpolation method is used to fill the missing month of GRACE. All the monthly water storage anomalies are typically computed relative to a mean-time baseline, known as an average from 2004 to 2009. To ensure the consistency of the spatial resolution among GRACE and hydrological models, all datasets are interpolated to 0.5 • × 0.5 • in this work.

Hydrological Models
Hydrological models are physical and data structures established by simulating hydrological phenomena, which are generalizations of actual hydrological conditions, and they can help relevant departments understand, manage and predict water resources. For this reason, investigating the performance and reliability of hydrological models is of great significance. Current hydrological models include LSMs and GHWRMs. First of all, GLDAS models have been updated with multiple versions so far, including two versions of datasets (GLDAS-1 and GLDAS-2), which are the most common LSMs. Among them, GLDAS-1 includes four models (Noah, VIC, CLM, and Mosaic), while GLDAS-2 includes three models (Noah, VIC, and CLSM [35]). GLDAS-2 provides global land surface data from 2000 to the present. The spatial resolutions are 0.25 • × 0.25 • and 1 • × 1 • , and the time resolutions are 1 day and 1 month. GLDAS-2 is improved and solved the discontinuity from GLDAS-1 data. It is worth noting that the CLM version is constantly being updated such as CLM-4.0 and CLM-4.5 [36].
Then, WGHM and PCR-GLOBWB are common GHWRMs used in the HRB region, which are described in detail in the introduction part. The spatial resolutions of WGHM and PCR-GLOBWB are 0.5 • × 0.5 • and 0.08 • × 0.08 • , respectively. The WGHM not only simulates the daily groundwater component but also considers human-induced water Remote Sens. 2021, 13, 3583 5 of 26 consumption. GHWRMs have a higher resolution than LSMs used in this paper, which allows the latter to perform better at the grid-scale. This paper assesses the performance of hydrological models widely used in HRB, including Noah (from GLDAS-2.1), VIC (from GLDAS-2.1), CLSM2.1 (from GLDAS-2.1), CLSM2.0 (from GLDAS-2.0), Water-Global Assessment and Prognosis Hydrological Model (WGHM), PCRaster Global Water Balance (PCR-GLOBWB), and CLM-4.5. In the following text, Noah, VIC, CLSM2.1, CLSM2.0, WGHM, PCR-GLOBWB, and CLM are used to simplify the description of the above seven hydrological models respectively. Their primary attributes are listed in Table 2. The reason why these models are used as research data in this paper is that there are differences among them that cause uncertainty. The uncertainty among hydrological models is model structure, climate forcing, human water use, and climate changes. For instance, the number of soil layers and total soil thickness of hydrological models are different ( Table 2). For climate forcing, only Noah, VIC, CLSM2.1, and CLSM2.0 have the same input parameters, whereas the other hydrological models apply completely different parameters. For model structure, only CLSM2.1 has the same model structure as CLSM2.0. Furthermore, only WGHM and PCR-GLOBWB consider human water use in simulating TWSA, and just WGHM is calibrated. In summary, these differences are related to the model uncertainty, which can affect the performance of hydrological models.

Establishment of the NSCHMW Method
Traditional methods for estimating TWSA usually use the outputs of a single model. However, hydrological models are usually developed on a global scale, which poses challenges for regional applications. Due to the differences in uncertainties such as model drive, the data results are also different. For the sake of optimizing the TWSA of HRB, this paper sets weights according to the consistency and accuracy between the hydrological models and GRACE and integrates the selected multi-source hydrological models to improve the accuracy of TWSA trends. The concrete steps of the NSCHMW method include three points. First, the major components of TWSA in HRB need to be judged, and the TWSA is estimated by hydrological models according to the major components. Secondly, the statistical indicators and the weights of hydrological models with correlation coefficients greater than 0.9 are calculated successively. The selected hydrological models are weighted to correct the TWSA trends simulated by a single model, and the merged TWSA is estimated at the grid-scale, which is represented by "Merge" hereafter. The complete steps are as follows.

1.
Estimation of TWSA based on hydrological models; Soil moisture storage (SMS), groundwater storage (GWS), snow water storage (SnWS), canopy water storage (CWS), and surface water storage (SWS) are the five components of TWS, as the following formula [20,37]: where TWS i denotes TWS in the month i, TWS is the average of TWS from 2004 to 2009, and TWSA stands for the change value during the study period. In like manner, SMSA, GWSA, SnWSA, CWSA, and SWSA also subtract the corresponding average value. For HRB, the SnWSA and SWSA are relatively small, so they can be ignored [26]. Research [38] implied that changes in water resources caused by plants were about 5 mm, far less than GRACE uncertainties (2 cm). Hence, CWSA is negligible too [26]. Under these hypotheses, Equation (1) can be transferred as follows [1].
After the original TWSA data are calculated by Formula (3), the seasonal and trend components of GRACE are wiped out by the Seasonal Trend Decomposition using Loess procedure (STL) mothed [39]. The trend component is concerned to assess the performance of hydrological models in this work. Then, the Mann Kendall (MK) test is used to evaluate the significance level of TWSA trends [40]. The variance and statistic (Z) are calculated in the MK test. When Z is greater than 0, the time-series data shows an increasing trend, while when Z is less than 0, it shows a decreasing trend. The absolute value of Z given in this paper is greater than 1.96, indicating a significance level with 95% confidence. The uncertainty in trends is derived from the least-squares linear regression.

2.
Assignment of weights based on statistical indicators.
As the most widely used evaluation metrics of the hydrological changes, root mean square error (RMSE), the Nash Sutcliffe efficiency (NSE) coefficient, and Pearson's linear correlation coefficient (r, at 95% significance level), are used to analyze correlation. If r or NSE are near 1 and RMSE is relatively small, it implies that the model is reliable. Three indicators are calculated by the following Equations (Eqs.) [41][42][43]: where w is the weight,

3.
Weighting TWSA based on weights. In this paper, the operation from Equation (1) to Equation (7) is performed for each grid. Based on the weight of hydrological models, the formula of the NSCHMW method can be obtained as follows: where Merge represents the TWSA estimated by NSCHMW method; TWSA0 represents the TWSA calculated by hydrological models. The m represents the number of hydrological models. Furthermore, k represents the months of different periods. Based on climate factors, this paper analyzes the correlation between the model and GRACE in different periods. The first period is from 2003 to 2009, covering a total of 84 months, and the second period is from 2010 to 2014, covering a total of 60 months. Figure 2 spatially illustrates monthly TWSA trends derived from JPL CSR, and the average value during the studied decade. It is worth noting that JPL, CSR, and Average have all carried out MK test, and the TWSA trend of all grids is within the 95% confidence interval, indicating the significant trends. Therefore, there is no grey dot representing no significant trend in the spatial distribution. The results of JPL ( Figure 2a) and CSR ( Figure 2b) are slightly different in that JPL displays two severe depletion regions, one in Henan (such as Jiaozuo, Xinxiang, Hebi, Puyang, and Handan) and one in the east of Hebei (such as Hengshui and Cangzhou), while CSR has only one depletion region in the border area between Hebei and Henan. As shown in Figure 2c, TWSA shows obvious annual changes, at a rate of −15.7 ± 0.7 mm/year in HRB. In addition, the spatial resolution of CSR displays higher and clearer gradual changes. Such difference is because GRACE data appear leakage errors in small regional examine [20]. However, in the long-term time series, the trends from JPL and CSR are the same. Up to now, GRACE is accepted to detect TWSA to a large extent. To further reduce errors of GRACE-derived TWSA, the average value of JPL and CSR is utilized as the reference value of TWSA.

TWSA Derived from GRACE Mascon
Remote Sens. 2021, 13, x FOR PEER REVIEW 8 3a indicates that precipitation is basically stable during 2003 and 2009; however, i creases in the following years [48]. Owing to the alternation of El Nino and La Nina, t effects on climate are usually opposite, there is more precipitation in the second year o Nino, which results in a significant rise in 2010 and 2012, respectively.  In this paper, the GRACE data are decomposed by STL in 12 months. Figure Figure 3 is an average of all grids over HRB. Monthly precipitation is accumulated over the entire HRB. Through seasonal and residual components are removed, the trend component is linear. As seen in trend, TWSA based on the two institutions has similar temporal variations, with long-term decreasing trends, especially fiercer since 2005. TWSA continues to decline and gradually intensifies from north to south, with changes in the southern HRB ranging from −30 mm/year to −35 mm/year. However, TWSA trends derived by JPL (−17.4 ± 0.7 mm/year) indicates a larger increasing and decreasing amplitude than CSR (−14 ± 1 mm/year). The TWSA of HRB's sub-regions is shown in Table S1.   Figure 4 maps the diagram of TWSA trends simulated by hydrological models and GRACE. Black dots indicate non-significant trend regions. Table 3 shows the different water storage trends (mm/year) derived from hydrological models and the reference (p < The GRACE-derived TWSA also reflects the influence of two extreme climate events, which occurred in 2009 [44,45] and 2011 [46,47], respectively, mainly triggered by the Arctic Oscillation and the El Niño events. The monthly distribution of precipitation in Figure 3a indicates that precipitation is basically stable during 2003 and 2009; however, it increases in the following years [48]. Owing to the alternation of El Nino and La Nina, their effects on climate are usually opposite, there is more precipitation in the second year of El Nino, which results in a significant rise in 2010 and 2012, respectively.  Table 3 shows the different water storage trends (mm/year) derived from hydrological models and the reference (p < 0.05). What needs to be explained is that the hydrological model's outputs do not include the Remote Sens. 2021, 13, 3583 9 of 26 TWSA component, and generally, the sum of the other components is used as the TWSA component (e.g., GWSA and SMSA). As stated in Equation (3), TWSA in this paper only contains two variables, GWSA and SMSA. Since NOAH and VIC do not contain the GWSA component, the value of TWSA is equal to that of SMSA. WGHM (−17.1 ± 0.4 mm/year) and PCR-GLOBWB (−23.0 ± 0.7 mm/year) are the only two models that overestimate TWSA trends (Figure 4e,f). Since two GHWRMs (WGHM and PCR-GLOBWB) have all the water storage components, and human water use is considered. The reason why PCR-GLOBWB has a higher slope than WGHM is that WGHM is calibrated after WGHM 2.2a. The updated WGHM 2.2d in this study enhances the simulation performance of TWSA in HRB significantly, although the overestimation still exists. Four GLDAS models, including Noah, VIC, CLSM2.1, and CLSM2.0 (Figure 4a-d), show no obvious TWSA trends with values of −2.6 ± 0.7 mm/year, −1.2 ± 0.5 mm/year, −1.3 ± 0.7 mm/year, and −1.1 ± 0.9 mm/year, respectively, and GLDAS models have many non-significant trends points than GHWRMs. On the contrary, there are no black dots in GRACE results. Additionally, CLM ( Figure 4g) has an opposite trend compared to GRACE in the northern HRB, although the spatial distribution of them is consistent in the southern HRB. Hence, spatiotemporal patterns of TWSA simulated by different hydrological models are quite different. The differences among GRACE and hydrological models in TWSA trends come from uncertainties of components.

TWSA Trends
Remote Sens. 2021, 13, x FOR PEER REVIEW 9 of 25 0.05). What needs to be explained is that the hydrological model's outputs do not include the TWSA component, and generally, the sum of the other components is used as the TWSA component (e.g., GWSA and SMSA). As stated in Equation (3) Figure 4g) has an opposite trend compared to GRACE in the northern HRB, although the spatial distribution of them is consistent in the southern HRB. Hence, spatiotemporal patterns of TWSA simulated by different hydrological models are quite different. The differences among GRACE and hydrological models in TWSA trends come from uncertainties of components.  Figure 5 shows the time-series TWSA of HRB and six sub-regions. TWSA errors are estimated by the standard deviation between CSR and JPL in this work [34], which is the grey-shaded area. As seen, the TWSA trends have discrepancies from different hydrological models, with higher trends from GHWRMs, and lower trends from LSMs. For the whole HRB (Figure 5a), the decreased reaches −15.7 mm/year, and the values vary from (h) Reference. The color bar represents the TWSA trends amplitude, and the black dots indicate regions with non-significant trends. The reference of TWSA is the average of JPL and CSR. Table 3. Trends of TWSA, SMSA, and GWSA (mm/year) derived from hydrological models and the reference over HRB. The non-significant values with the MK test at the 95% confidence level are shown in bold.

Models
TWSA Trends SMSA Trends GWSA Trends Figure 5 shows the time-series TWSA of HRB and six sub-regions. TWSA errors are estimated by the standard deviation between CSR and JPL in this work [34], which is the grey-shaded area. As seen, the TWSA trends have discrepancies from different hydrological models, with higher trends from GHWRMs, and lower trends from LSMs.  Table S7). However, CLSM2.1 and CLSM2.0 indicate slight TWSA trends in Henan with a value of −7.3 ± 1.2 mm/year, −11.5 ± 1.5 mm/year, respectively, which are lower than the corresponding reference value of −34.6 ± 2.1 mm/year. Furthermore, the two CLSM models show no obvious TWSA trends in Beijing and Tianjin. Table 4 presents the agreement metrics between hydrological models and corresponding reference values. WGHM presents the highest consistency (r = 0.91, RMSE = 26 mm, and NSE = 0.86) in terms of TWSA by comparison, whereas CLSM2.0 expresses the lowest consistency (r = 0.49, RMSE = 67 mm, and NSE = −0.11).
For TWSA trends in HRB's sub-regions, GHWRMs overestimate the TWSA trends in HRB's sub-regions apart from the Shanxi. The following three points can be summarized, (1) LSMs (Noah, VIC, CLSM2.1, CLSM2.0, CLM) underestimate TWSA trends in Beijing and Hebei compared with GRACE. (2) LSMs and WGHM underestimate that in Tianjin, Shandong, and Henan regions. (3) Both LSMs and GHWRMs underestimate that in Shanxi. The model structure may be the main factor that leads to the discrepancies between LSMs and GHWRMs. The two GHWRMs both show better consistency with GRACE in Beijing and Hebei (r > 0.84, Tables S8 and S10). Among them, PCR-GLOBWB shows a good agreement with GRACE in Tianjin, Shanxi, and Shandong, with the correlation coefficient of 0.79, 0.75, and 0.85, respectively (Tables S9, S11 and S12). By combining RMSE, NSE, and r for joint analysis, WGHM best matches with GRACE in Henan, with the correlation coefficient of 0.91 (Table S13).
PCR-GLOBWB and WGHM are the best consistent with GRACE in the HRB region (r > 0.9), thus, they are used as input models for the NSCHMW mothed. Figure 6 shows the spatial distribution of TWSA from Merge in different periods. In this paper, the GRACE inversion results are taken as the true values of TWSA (Figure 4h). It can be seen that the spatial distribution of the Merge result estimated by the NSCHMW method is more consistent with GRACE. However, PCR-GLOBWB significantly overestimates it in the middle part of the HRB (Figure 4f), as well as the TWSA signal simulated by WGHM is insufficient in the eastern part of HRB (Figure 4e). The average trend of the Merge is better than the results of the individual models. drological models and corresponding reference values. WGHM presents the highest consistency (r = 0.91, RMSE = 26 mm, and NSE = 0.86) in terms of TWSA by comparison, whereas CLSM2.0 expresses the lowest consistency (r = 0.49, RMSE = 67 mm, and NSE = −0.11).

SMSA Trends
SMSA can be measured by in-situ wells, remote sensing estimates, and hydrologica models. However, in-situ observations are not only time-consuming and laborious, but also the points are rare and unevenly distributed across the HRB [49]. As for the remote sensing monitoring methods, the thickness of soil can observe generally in the top layer ~5 cm [50]; however, now the soil depth of hydrological models reaches up to 200 cm Consequently, the SMSA simulated by hydrological models is considered to be reliable in this study. For the sake of enhancing signal-to-noise on the reference value, the average of three model outputs (Noah, VIC, and CLM) is applied as the reference of SMSA [51] because the soil layers of these three models are more than others. The SMSA error is equa to the standard deviation among the three hydrological models. Figure 7 represents distributions of SMSA trends simulated by hydrological models and corresponding SMSA reference. For the whole HRB, SMSA trends estimated by Noah VIC, and CLM are highly consistent with the reference, of which the correlation coefficient is over 0.9, NSE is over 0.5, and RMSE is below 18 mm (Table 4). However, WGHM hardly shows any trend in HRB, and CLM obviously overestimates decreasing trends in northern HRB compared with the SMSA reference. In addition, four models (CLSM2.0, CLSM2.1 WGHM, and PCR-GLOBWB) match poorly with reference values of SMSA trends. Such results can be more accurately analyzed from statistics metrics. The four models show lower statistics values than the other three models so that the average of Noah, VIC, and CLM to be the reference value is appropriate.

SMSA Trends
SMSA can be measured by in-situ wells, remote sensing estimates, and hydrological models. However, in-situ observations are not only time-consuming and laborious, but also the points are rare and unevenly distributed across the HRB [49]. As for the remote sensing monitoring methods, the thickness of soil can observe generally in the top layer 5 cm [50]; however, now the soil depth of hydrological models reaches up to 200 cm. Consequently, the SMSA simulated by hydrological models is considered to be reliable in this study. For the sake of enhancing signal-to-noise on the reference value, the average of three model outputs (Noah, VIC, and CLM) is applied as the reference of SMSA [51], because the soil layers of these three models are more than others. The SMSA error is equal to the standard deviation among the three hydrological models. Figure 7 represents distributions of SMSA trends simulated by hydrological models and corresponding SMSA reference. For the whole HRB, SMSA trends estimated by Noah, VIC, and CLM are highly consistent with the reference, of which the correlation coefficient is over 0.9, NSE is over 0.5, and RMSE is below 18 mm (Table 4). However, WGHM hardly shows any trend in HRB, and CLM obviously overestimates decreasing trends in northern HRB compared with the SMSA reference. In addition, four models (CLSM2.0, CLSM2.1, WGHM, and PCR-GLOBWB) match poorly with reference values of SMSA trends. Such results can be more accurately analyzed from statistics metrics. The four models show lower statistics values than the other three models so that the average of Noah, VIC, and CLM to be the reference value is appropriate.
HRB compared with the SMSA reference. In addition, four models (CLSM2.0, CLSM2.1, WGHM, and PCR-GLOBWB) match poorly with reference values of SMSA trends. Such results can be more accurately analyzed from statistics metrics. The four models show lower statistics values than the other three models so that the average of Noah, VIC, and CLM to be the reference value is appropriate.  For SMSA trends in HRB's sub-regions, the Noah is the most reliable in most sub-regions (r > 0.9), largely due to the well simulating in soil moisture. Noah shows a significant decreasing trend of SMSA in Shandong (−5.2 ± 1.4 mm/year), Henan (−7.7 ± 1.2 mm/year), southern Hebei, and southern Shanxi, while no significant trend is observed in Beijing, Tianjin, northern Hebei, and northern Shanxi. Similar patterns are observed in corresponding reference values. In addition, we can get the following points, (1) CLM and VIC overestimate the SMSA in Beijing by comparison, while Noah has the best correlation coefficient of 0.9. (2) Noah, VIC, and PCR-GLOBWB overestimate the SMSA in Hebei, Shanxi, and Shandong. (3) Only CLM overestimates SMSA in Tianjin, and only Noah overestimates SMSA in Henan. Conversely, the undervaluation is likely interpreted by the soil layers of these hydrological models are too small to insufficiently capture SMSA.

GWSA Trends
Similar to the SMSA, the GWSA can also be measured by in situ observations; nonetheless, there is a limitation of inadequate and uneven distribution of points in HRB [28]. According to the principle of land water balance (Equation (3)), GWSA reference values could be calculated by the residual between the GRACE-derived TWSA and the model-simulated SMSA [51,52]. The GWSA uncertainties are estimated by the mean square error of the uncertainties from two other references in this study. Among the seven models used in this paper, only five include the groundwater component. Noah and VIC do not include the groundwater components, so only five models participate in this section. the SMSA in Beijing by comparison, while Noah has the best correlation coefficient of 0.9.
(2) Noah, VIC, and PCR-GLOBWB overestimate the SMSA in Hebei, Shanxi, and Shandong. (3) Only CLM overestimates SMSA in Tianjin, and only Noah overestimates SMSA in Henan. Conversely, the undervaluation is likely interpreted by the soil layers of these hydrological models are too small to insufficiently capture SMSA.   (Table 4). Because WGHM accurately considers the abstractions and recharge of groundwater from soil and surface water, and it considers the domestic water consumption [53]. Figure 9 illustrates the spatial distribution of GWSA trends in HRB. The reference values display strong decreasing trends in the south and weak in the north. However, CLSM2.0, CLSM2.1, and CLM underestimate the changes of groundwater storage, even with no trends in annual scale at all. In contrast, the two GHWRMs overestimate the GWSA by comparison. In addition, WGHM displays the largest amplitude variation ranging from −152 mm to 54 mm and good statistics (RMSE = 21 mm, NSE = 0.87, r = 0.94) ( Table 4). Because WGHM accurately considers the abstractions and recharge of groundwater from soil and surface water, and it considers the domestic water consumption [53].     (Tables S6 and S7). A relatively slighter trend can be detected by both two GHWRMs in Shanxi (Table S5). The grey shaded areas of reference values imply highly uneven changes of water storage within sub-regions. By comparison, in the shaded areas in three water storage changes, the error of SMSA is relatively stable and seasonal, while errors of GWSA are about two times larger than those of TWSA, and GWSA in Tianjin has the highest error. and S7). A relatively slighter trend can be detected by both two GHWRMs in Shanxi (Table  S5). The grey shaded areas of reference values imply highly uneven changes of water storage within sub-regions. By comparison, in the shaded areas in three water storage changes, the error of SMSA is relatively stable and seasonal, while errors of GWSA are about two times larger than those of TWSA, and GWSA in Tianjin has the highest error.

Analyzing Water Storage Changes in Plains and Mountains
According to the topography features of the HRB, it can be divided into two parts: mountains and plains (Figure 1). The trends of three hydrological variables (TWSA, SMSA, and GWSA) have discrepancies in different topography. Figure 11 displays the time-series changes of three hydrological variables in mountains and plains, respectively.

Analyzing Water Storage Changes in Plains and Mountains
According to the topography features of the HRB, it can be divided into two parts: mountains and plains (Figure 1). The trends of three hydrological variables (TWSA, SMSA, and GWSA) have discrepancies in different topography. Figure 11 displays the time-series changes of three hydrological variables in mountains and plains, respectively. In general, the amplitudes of hydrological variables in plains are larger than that in mountainous. For example, the mean TWSA trend in plains is −17.2 ± 1.1 mm/year, and that in mountainous is −14.2 ± 0.7 mm/year, which indicates the overexploitation of the deep groundwater storage in plains (Tables S14 and S15). As seen in Figure 11, there is clear variations in the seasonal change in TWSA and SMSA, while it is disorganized in GWSA. Moreover, the model reliabilities of three hydrological varies are also analyzed. In the first place, GHWRMs overestimate the TWSA trends, while LSMs underestimate it (Figure 11a,b). In the second place, Noah has the best correlation coefficient (r of plains: 0.89, r of mountains: 0.96) and WGHM has the worst (r of plains: 0.6, r of mountains: 0.5) in SMSA simulation results (Tables S16 and S17, Figure 11c,d). Finally, the results of WGHM are higher than the inversion results of reference value in the plains area, while the other six hydrological models underestimate GWSA trends (Figure 11e,f). In general, the amplitudes of hydrological variables in plains are larger than that in moun tainous. For example, the mean TWSA trend in plains is −17.2 ± 1.1 mm/year, and that in mountainous is −14.2 ± 0.7 mm/year, which indicates the overexploitation of the deep groundwater storage in plains (Tables S14 and S15). As seen in Figure 11, there is clea variations in the seasonal change in TWSA and SMSA, while it is disorganized in GWSA Moreover, the model reliabilities of three hydrological varies are also analyzed. In the firs place, GHWRMs overestimate the TWSA trends, while LSMs underestimate it ( Figure  11a,b). In the second place, Noah has the best correlation coefficient (r of plains: 0.89, r o mountains: 0.96) and WGHM has the worst (r of plains: 0.6, r of mountains: 0.5) in SMSA simulation results (Tables S16 and S17, Figure 11c,d). Finally, the results of WGHM are higher than the inversion results of reference value in the plains area, while the other six hydrological models underestimate GWSA trends (Figure 11e,f).

Model Uncertainty
The differences in TWSA trends among GRACE and hydrological models are mainly caused by model uncertainty. Uncertainties include climate forcing, model structure, hu

Model Uncertainty
The differences in TWSA trends among GRACE and hydrological models are mainly caused by model uncertainty. Uncertainties include climate forcing, model structure, human activities, and climate changes [16]. Figure 12 shows the difference between hydrological models and corresponding references. As seen, LSMs underestimate the TWSA compared with GRACE in HRB, whereas GHWRMs overestimate it. A positive number means underestimating the TWSA trend, and a negative number means overestimating it. The trend differences of WGHM and PCR-GLOBWB are −1 mm/year and −7 mm/year, respectively. The trend differences of Noah, VIC, CLSM2.1, CLSM2.0, and CLM are 13 mm/year, 14 mm/year, 14 mm/year, 15 mm/year, and 13 mm/year, respectively. man activities, and climate changes [16]. Figure 12 shows the difference between hydrological models and corresponding references. As seen, LSMs underestimate the TWSA compared with GRACE in HRB, whereas GHWRMs overestimate it. A positive number means underestimating the TWSA trend, and a negative number means overestimating it. The trend differences of WGHM and PCR-GLOBWB are −1 mm/year and −7 mm/year, respectively. The trend differences of Noah, VIC, CLSM2.1, CLSM2.0, and CLM are 13 mm/year, 14 mm/year, 14 mm/year, 15 mm/year, and 13 mm/year, respectively.
As stated in the introduction, different hydrological models have completely different model structures. The two GHWRMs overestimate the TWSA trends due to the overvaluation of groundwater changes, and groundwater over-extraction resulted in a decline in GWSA. Diversely, the reasons for the undervaluation of LSMs could be made into two cases. On one hand, Noah and VIC ignore the groundwater storage component; on the other hand, CLSM2.1, CLSM2.0, and CLM have insufficient capture capabilities of groundwater storage. Consequently, the model structure has a significant impact on the differences of TWSA trends among GRACE and hydrological models. Then, we should touch on human water use and climate changes. Table 1 presents that WGHM and PCR-GLOBWB are the only two models that simulate the human activities. In terms of TWSA, the two GHWRMs show better agreements with GRACE than the other hydrological models (Table 4), and the correlation coefficients of them during this period are larger than 0.98. It implies that human water use is one main factor leading to uncertainty.
At last, TWSA can also be driven by climate variability. As seen from Figure 5, the TWSA trends of these models declined rapidly in the dry years (such as 2005-2009) and rebounded in the wet years (such as 2010-2012). As shown in Figure 3a, we speculate that the increase of precipitation after 2010 is related to two La Nina events. Correspondingly, Figure 12. Trend differences in TWSA, SMSA, and GWSA between hydrological models and corresponding reference.
As stated in the introduction, different hydrological models have completely different model structures. The two GHWRMs overestimate the TWSA trends due to the overvaluation of groundwater changes, and groundwater over-extraction resulted in a decline in GWSA. Diversely, the reasons for the undervaluation of LSMs could be made into two cases. On one hand, Noah and VIC ignore the groundwater storage component; on the other hand, CLSM2.1, CLSM2.0, and CLM have insufficient capture capabilities of groundwater storage. Consequently, the model structure has a significant impact on the differences of TWSA trends among GRACE and hydrological models.
Afterward, CLSM2.0 and CLSM2.1 are taken as examples to verify the influence of climate forcing on uncertainty in this paper. Although the amplitude of CLSM2.0 is a little larger than CLSM2.1 in time-series changes, the two models have similar spatial patterns, with high correlation coefficients both from 2003 to 2009 (r = 0.97) and 2010 to 2014 (r = 0.92). Thus, climate forcing is not the key influence of uncertainties.
Then, we should touch on human water use and climate changes. Table 1 presents that WGHM and PCR-GLOBWB are the only two models that simulate the human activities. In terms of TWSA, the two GHWRMs show better agreements with GRACE than the other hydrological models (Table 4), and the correlation coefficients of them during this period are larger than 0.98. It implies that human water use is one main factor leading to uncertainty.
At last, TWSA can also be driven by climate variability. As seen from Figure 5, the TWSA trends of these models declined rapidly in the dry years (such as [2005][2006][2007][2008][2009]) and rebounded in the wet years (such as 2010-2012). As shown in Figure 3a, we speculate that the increase of precipitation after 2010 is related to two La Nina events. Correspondingly, the TWSA trends simulated by LSMs are inconsistent with GRACE after 2010. It is suggested that climate changes (such as precipitation) are one other reason for uncertainty.

Correlation Coefficient of TWSA in Different Periods
As a whole consideration, to investigate the TWSA trend calculated by the new method under different terrains, any six grid cells are selected in the southern and northern HRB, respectively, to demonstrate the applicability of the NSCHMW method. In the six grids, the first and the second grids belong to the mountain region, while the remaining four grids belong to the plain region. Figure 13 shows the time series of WGHM, PCR-GLOBWB, Merge, and reference values in selected grids. It can be seen that all the single models show downward trends just like the GRACE trends. However, the decline amplitudes of them exist differences. As for the plain region, WGHM overestimates the TWSA trend in the third and fourth grids (Figure 13c,d), while it underestimates the TWSA trend in the fifth and sixth grids (Figure 13e,f). On the contrary, PCR-GLOBWB overestimates it in the third and fourth grids, as well underestimates it in the fifth and sixth grids. As for the mountain region (Figure 13a,b), the trends have similar results with that in the plain region. After correction using the NSCHMW method, the time series of Merge become more agree with GRACE than single models. Table 5 shows the TWSA trends estimated by WGHM, PCR-GLOBWB, Merge, and reference in different periods. It is evident that Merge is more similar to GRACE than WGHM and PCR-GLOBWB. As seen from the reference, the TWSA trend from 2003 to 2009 is larger than that from 2010 to 2014 at grids of 2, 4, and 6. The TWSA trends from 2003 to 2009 are smaller than those from 2010 to 2014 at grids of 1, 3, and 5.
Tables 6 and 7 summarize the indices of TWSA in six grids, including RMSE, NSE, and r during 2003-2009 and 2010-2014, respectively. As seen, Merge represents lower RMSE, higher NSE, and higher r in all grids than single models (WGHM and PCR-GLOBWB). By weighting WGHM and PCR-GLOBWB, the NSCHMW method can improve RMSE, NES, and r with 3-96%, 35-282%, and 1-255% respectively. It implies that the statistical correction method is effective in optimizing TWSA for hydrological model simulation. in the fifth and sixth grids (Figure 13e,f). On the contrary, PCR-GLOBWB overestimates it in the third and fourth grids, as well underestimates it in the fifth and sixth grids. As for the mountain region (Figure 13a,b), the trends have similar results with that in the plain region. After correction using the NSCHMW method, the time series of Merge become more agree with GRACE than single models.    Figure 14 indicates the correlation coefficients of different stages. It can be clearly found that the time is divided into two sections by taking 2009 as the boundary in the time-series TWSA ( Figure 5). As seen in Figure 14, it displays a good fitting between hy drological models and GRACE during 2003-2009, while that is poor during 2010-2014 except for WGHM and PCR-GLOBWB. Taking CLM as an example, the correlation coef ficient between CLM and GRACE from 2003 to 2009 is 0.90, the largest one among al models; however, that is only 0.34 from 2010 to 2014, the smallest one among all models Because LSMs are more sensitive to changes in soil moisture caused by precipitation. The indices of HRB's sub-regions are shown in Tables S8-S13. Different from the results of Scanlon's study, most of the hydrological models under estimate the change of water storage compared with GRACE [20]. We found that some models overestimated the changes of water storage in HRB in this paper. Therefore, i

Reasons for Over/Underestimated the TWSA from GRACE and Models
Different from the results of Scanlon's study, most of the hydrological models underestimate the change of water storage compared with GRACE [20]. We found that some models overestimated the changes of water storage in HRB in this paper. Therefore, it needs to be a targeted analysis on a specific regional basin. On the whole HRB, GRACE observations show a highly negative trend in TWSA and no trendless points, which gradually intensified from north to south, with an average slope of −15.7 ± 0.7 mm/year. Correspondingly, GHWRMs overestimate it, while the LSMs underestimated it. Indeed, LSMs are more effective when simulating SMSA trends, while GHWRMs can simulate GWSA better. In HRB exists large groundwater depletion, which is the reason why GHWRMs with stronger groundwater capture ability are more consistent with GRACE in TWSA. In addition, the spatial distribution from seven hydrological models displays non-significant points more or less in the sub-region, which is related to the model uncertainties.

Drivers of Groundwater Depletion
The groundwater over-extraction and the reduced precipitation account for the decreasing GWSA trends, and agricultural irrigation leads to the groundwater over-extraction in HRB's plains. In HRB's plains exist a large amount of food planting area. Irrigation water capacity accounts for two-thirds of the total water use in HRB, of which one third comes from groundwater withdrawal. Intense human water use leads to groundwater depletion in the HRB's plains [22]. In mountain regions, the annual precipitation in this area is light, and the evaporation increases year by year. It suggests that climate changes are also responsible for the water storage shortage in the HRB [54]. These factors over the HRB have rendered the water storage issue more complex.
Two aspects can contribute to the recoverability of groundwater in HRB. On one hand, the early recovery in groundwater is attributed to elevated precipitation (rainfall and snowfall) that increased from 2010 to 2012. On the other hand, it may be partly due to the construction of the south-north water diversion project which dates back to 2003, and the project went into operation in December 2014. The published statistics of water resources in HRB reported consistent decreases in groundwater consumptions after 2014, which reflects the impact of the south-to-north water diversion project on groundwater recovery. Some previous works [25,55] also indicate that the construction of the south-to-north water diversion project has positive influences on groundwater storage recovery. However, it is a long-term process to raise groundwater levels. Although the water transferred into HRB through the south-to-north water diversion project has increased during the last ten years [56], groundwater is still the paramount water resource because of its universality, flexibility, and lower costs. Therefore, the water crisis is still severe in HRB, which deserves high attention.

Conclusions
Assessing the performance of hydrological models is necessary for the sustainable development of water resources. To this end, this paper applied JPL mascon and CSR mascon data to evaluate spatiotemporal patterns of seven hydrological models (Noah, VIC, CLSM2.1, CLSM2.0, WGHM, PCR-GLOBWB, and CLM) in HRB and HRB's sub-regions from 2003 to 2014. The results are as follows.

1.
This paper develops the NSCHMW method to solve the uncertainty between the hydrological models and GRACE, which gives weight to the hydrological models by combining multi-source models (WGHM and PCR-GLOBWB) simulation results.
Compared with a single model, the NSCHMW method fully considers the accuracy and consistency of hydrological models and GRACE data, and it can effectively improve the accuracy of TWSA trends calculated by hydrological models.

2.
The NSCHMW method is verified in HRB. In terms of spatial distribution, the Merge result calculated by the NSCHMW method is more consistent with GRACE-derived TWSA. As for both WGHM and PCR-GLOBWB, there are insufficient and excessive