Evaluation of the Water Cycle in the European COSMO-REA 6 Reanalysis Using GRACE

Anne Springer 1,*, Annette Eicker 1,2, Anika Bettge 1, Jürgen Kusche 1 and Andreas Hense 3 1 Institute of Geodesy and Geoinformation, Bonn University, 53115 Bonn, Germany; annette.eicker@hcu-hamburg.de (A.E.); s7anbett@uni-bonn.de (A.B.); kusche@geod.uni-bonn.de (J.K.) 2 Hafen-City University, 20457 Hamburg, Germany 3 Meteorological Institute, Bonn University, 52121 Bonn, Germany; ahense@uni-bonn.de * Correspondence: springer@geod.uni-bonn.de; Tel.: +49-228-73-6149


Introduction
Precipitation (P) minus evapotranspiration (E) represents the net flux of water between the atmosphere and the Earth's surface.Globally, P − E is close to zero since land and ocean areas balance each other nearly, but not perfectly, as can be deduced from the present gain of water stored on continents [1,2].Over land areas, in the temporal mean, P − E is slightly positive since average evapotranspiration should not be greater than precipitation.Here, precipitation minus evapotranspiration links the terrestrial water budget to atmospheric moisture transports and, through latent heat flux, to the Earth's surface energy budget.The precipitation versus evapotranspiration deficit thus represents an important boundary condition for climate modeling and hydrological studies.Its temporal evolution can be traced to changes in climate forcing (temperature, precipitation, wind, CO 2 levels, etc.), the direct and indirect impacts of anthropogenic activities such as groundwater abstraction and land use change and the hydrological response of the system.Simulations have shown that while P − E over oceans follows thermodynamical scaling with global warming, the "wet-get-wetter, dry-get-drier" scaling does not seem to apply over land [3].Thus, investigating P − E is very important to develop more elaborate scaling laws.
Various measurement techniques and observation datasets for precipitation and evapotranspiration exist [4,5].In reanalyses, precipitation and evapotranspiration are generated based on numerical weather prediction (NWP) modeling and the assimilation of many datasets, with the result that P − E is usually better simulated compared to P and E individually.It is known that P, E and P − E have biases in reanalyses and do not close budgets [6][7][8].Regional reanalyses seek to improve over global reanalyses through improved process representation and high-resolution modeling, but it is difficult to quantify improvements with independent datasets.
Reanalyses usually do not close the water budgets within error bars due to assimilation increments and since data errors and model inconsistencies necessarily propagate into the estimates.Assessing the closure of water and energy cycles enables diagnosing the error level and also understanding how well numerical weather prediction modeling can derive unobserved fields as residuals.The objective of this study is to validate the net flux of water between the atmosphere and the land surface in global and regional reanalyses including the regional European Consortium for Small-Scale Modelling 6-km Reanalysis (COSMO-REA6, [9]), differentiating per river basin and utilizing total water storage (TWS) measurements obtained from the Gravity Recovery and Climate Experiment (GRACE) satellites and discharge observations.To this end, we equate areal averages of P − E for 26 river catchments grouped into 17 target areas through the terrestrial water budget equation to the observations of TWS change ∆S and discharge R.
GRACE data, commonly expressed as monthly gridded fields of total vertically-integrated water storage change, provide an opportunity to close the terrestrial water balance and thus replace the assumption that storage does not change over long time intervals.The GRACE mission consists of two satellites in tandem formation, chasing each other at about a 450-km altitude and equipped with a highly precise K-band ranging system.Variations of the inter-satellite distance can then be converted to gravitational potential change and further to mass change that is typically expressed as total water storage change per area.Several studies have combined GRACE-derived river basin averaged water storage change with discharge measurements [10,11] for either constraining P − E [12] or combining with observed precipitation data for area-wide estimates of evapotranspiration [13][14][15].However, challenges related to the use of GRACE data are the need to average over large regions, signal attenuation related to the peculiar filtering required to smooth out spatially-correlated data noise and signal leakage from neighboring regions.
Our study follows essentially the approach outlined in [8]; yet the availability of 20 years of COSMO-REA6 reanalyses enables us to assess trends over the 11-year period (2003-2013) common with the GRACE mission data.Moreover, the present study now covers all major European river basins with 26 rivers aggregated to 17 catchment areas.For this, we had to reconstruct discharge data missing in the Global Runoff Data Centre (GRDC) database using a simple model-based approach.This allows evaluating, for the first time, regional patterns of consistency over the whole of Europe for P − E datasets from the regional reanalysis COSMO-REA6, the global Interim ECMWF Reanalysis (ERA-Interim [16]) and the global Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2, [17]).For comparison, observation-based precipitation from the Global Precipitation Climatology Centre (GPCC) Version 7 dataset [18] and evapotranspiration from the Global Land Evaporation Amsterdam Model (GLEAM [19,20]) are evaluated.We anticipate that this study will be useful for (1) COSMO modelers, to diagnose model deficiencies and eventually bring forecast and reanalysis closer to observations and (2) scientists that study the evolution of the water cycle over Europe, using reanalysis data including (but not limited) to COSMO-REA6.
We found that for all but one of the investigated catchments, discharge time series could not be retrieved complete from the Global Runoff Data Centre (GRDC).While several methods exist to fill or continue discharge time series based on satellite altimetry and rating curves [21,22], upon remotely-sensed water surface or inundation and water area-discharge rating curves [23], and upon hydraulic equations and using remote sensing measurement of other hydraulic variables such as river width or flow velocities (e.g., [24]), here, we resort to simulating discharge with the monthly Génie Rural rainfall-runoff model (GR2M, [25]) while calibrating it against discharge observations from GRDC using the most recently available data.We test this approach, and all results are accompanied by a thorough error assessment.
Our results suggest that the high-resolution COSMO-REA6 reanalysis indeed improves the closure of the water budget compared to the global reanalyses.Due to the large study area, regional patterns were visible, e.g., positive trends of P − E over western and central Europe and negative trends of P − E over eastern Europe.
The paper is organized as follows.Section 2 describes the datasets used in this study and introduces the study area.Section 3 outlines our method for generating consistent time series of P, E, R and ∆S.In Section 4, first the individual components of the water budget equation are evaluated for selected river basins.Then, the closure of the water budget equation is assessed, followed by more detailed analyses of the two sides of the water budget equation, by contrasting mean values, amplitudes and trends.Additional discussions follow in Section 5.

Study Area
Twenty-six major European river basins were considered in this study (Figure 1).The border of each basin is defined by the catchment associated with the most downstream gauging station.Small neighboring catchments are merged together for water budget analyses taking into account the main European watersheds (marked by colors in Figure 1).This is necessary because of the limited spatial resolution of GRACE-derived TWS.The 17 combined catchments are listed in Table 1 together with their size ranging from ~70,000 km 2 for the Po to ~800,000 km 2 for the Danube.

GRACE
The twin satellite mission GRACE (Gravity Field and Climate Experiment [26]) has been measuring spatial and temporal variations of the Earth's gravity field since 2002.GRACE consists of two satellites following each other on the same orbit.An inter-satellite K-band microwave link observes orbit variations caused by an inhomogeneous mass distribution on Earth.Temporal variations in gravity can then be converted to mass changes in terms of equivalent water height according to [27] taking into account the elastic loading effect.
In this study, we use GRACE release 05 (RL05) time series [28] provided by the German Research Center of Geosciences (GeoForschungsZentrum, GFZ).The monthly GRACE solutions were smoothed using a DDK4 filter [29] to account for the typical anisotropic error striping patterns.To consider the effect of seasonal and secular geo-center variations, we added the degree-one spherical harmonic coefficients provided by [30,31] to the GRACE solutions.The C20 coefficient, which cannot be well determined by GRACE, was replaced by a result obtained from satellite laser ranging [32].Already during the GRACE processing chain, temporal gravity field variations caused by tides (ocean, Earth and pole tides), as well as by atmospheric and non-tidal ocean mass variations were subtracted from the observations prior to the gravity field estimation step.Additionally, the mass trend caused by glacial isostatic adjustment (GIA) [33] was removed in post-processing.Therefore, the resulting mass variations primarily reflect hydrological water storage changes.For a realistic estimation of the GRACE measurement accuracy, we used the calibrated errors provided by GFZ, as well as the standard deviations provided together with the degree-one and of the C20 coefficients in our variance propagation procedure (see Section 3.4).

COSMO-REA6
COSMO-REA6 is among the first reanalysis results covering only a part of the globe, but at a much larger spatial and temporal resolution than all available global reanalyses (see the next subsections).Other regional reanalyses are the North American Regional Reanalysis (NARR) [34], the arctic system reanalysis [35], or the recent European reanalyses from the U.K. MetOffice [36], or the Swedish Meteorological and Hydrological Institute (SMHI) [37].
The general goal of these regional reanalyses is to provide a physically consistent and homogeneous climate dataset of the atmosphere and land surface that resolves atmospheric patterns on spatial scales from 500 km down to about 10 km (also known as atmospheric mesoscale).Important weather and climate phenomena are situated in the mesoscale range e.g., frontal passages, land sea circulations or deep convection events.Furthermore, the effects of orography and land sea distribution (especially with respect to precipitation) are expected to be much better represented than in models with grid sizes of 50-100 km, which are common in global models due to the computational burden.Global reanalyses still play an important role because they provide the atmospheric boundary conditions to the regional high-resolution models.
The COSMO-REA6 uses the regional assimilation and forecasting system of the German National Meteorological Service (Deutscher Wetterdienst, DWD) [38].It combines the non-hydrostatic forecast model COSMO with the continuous data assimilation method nudging [39].The region covered by COSMO-REA6 is identical to the European Coordinated Regional Climate Downscaling Experiment at 0.11 • resolution (CORDEX-EU11) domain, but with a grid size of 0.055 • (~6 km), on a rotated latitude-longitude grid with 40 vertical layers in the atmosphere, between the surface and about a 20-km height.Prognostic variables are the three-component wind velocity vector, temperature, pressure and various water substance concentrations to account for the generation of clouds and precipitation in liquid and solid phases.Observations come from different observing systems like radiosondes, air planes, vertically pointing wind profilers, surface stations, buoys and ships.Depending on the observing system and the observed variable, windows of influence are defined that spread the actual observations to the neighboring grid cells in space and time.Nudging means that each equation for the prognostic variables temperature, pressure, wind velocity and moisture concentration is supplemented by an additional forcing term.This forcing term drives the model solution at any time and at all grid points to the respectively spread observations using a prescribed time scale.Note that this procedure will provide consistent prognostic variables through their physical connection dictated by the equations, but it cannot provide consistent budgets of energy and water because the additional forcing terms always add energy, water and momentum to the system proportional to the difference between the simulated and the observed values.Fluxes like precipitation and evapotranspiration are derived internally from the necessary state variables, but are in general not assimilated.Therefore, an evaluation of precipitation, evapotranspiration and other energy or water fluxes with independent data as is presented here is an important contribution to the overall quality assessment of the COSMO-REA6 or any other regional reanalysis.
The actual setup of COSMO-REA6, including the necessary preparation of land surface variables, like soil moisture, and oceanic variables, like sea surface temperatures, can be found in [9].Here also, a first quality assessment of COSMO-REA6 is presented.Precipitation as one of the most interesting climate variables is further evaluated in [40], where also a comparison over Germany of the global reanalysis ERA-Interim, the U.K. Met Office and SMHI European reanalyses and an even higher resolution COSMO reanalysis at a 2-km grid size is presented, showing among other features the importance of resolution in representing precipitation.

ERA-Interim
The global Interim ECMWF Reanalysis (ERA-Interim) provides gridded fields of atmospheric and land surface variables with approximately 79-km grid spacing considering 60 vertical levels.The model output covers the time span from 1979 onwards at three-hourly resolution.The main objectives of the ERA-Interim reanalysis are a realistic representation of the hydrological cycle and temporal consistency in order to estimate reliable trends [16].The sequential data assimilation system is based on the 4D-Var method [41].In a 12-hourly cycle, the current state of the atmosphere is estimated from a forecast model combined with an important number of observations originating in the majority from satellites.
In this study, synoptic monthly means of precipitation and latent heat flux are evaluated, which are computed by the model from temperature, humidity, wind speed and radiometer observations.Compared to [8], here, we use a finer grid of 0.75 • × 0.75 • .

MERRA-2
While in [8], we assessed the performance of the Modern-Era Retrospective analysis for Research and Applications (MERRA) model, here we moved on to its successor MERRA-2.For data assimilation, an incremental analysis update scheme is applied by the Goddard Earth Observing System Version 5 (GEOS-5).MERRA-2 assimilates new observation types such as hyperspectral radiance, microwave, GPS radio occultation, ozone and aerosol datasets [17].One particular improvement of MERRA-2 is the correction of precipitation biases using observation-based precipitation from different sources [42,43].However, in this study, we assess the uncorrected precipitation in order to compare with the other two reanalyses.
MERRA-2 is provided by the Goddard Earth Sciences Data and Information Service Center (DISC) from 1980 onwards with one-hourly temporal resolution on a 0.625 • × 0.5 • grid.Here, monthly outputs of modeled precipitation and surface latent heat flux are assessed.

Observational Datasets
In this study, the ability of P and E from NWP models to close the water budget equation is compared to the performance observation-based datasets, i.e., the Global Precipitation Climatology Centre (GPCC) dataset for P and the Global Land Evaporation Amsterdam Model (GLEAM) for E.

GPCC
We used the recent version of the GPCC Full Data Reanalysis (V7) that covers the period 1901-2013 and is available at 0.5 • resolution from the GPCC web site [18].The precipitation dataset is based on 75,000 gauging stations world-wide, which are subject to strict quality control.The monthly product is optimized for water budget studies [44].GPCC data were also used as a reference by [9] for assessing the quality of P generated by COSMO-REA6.

GLEAM
GLEAM estimates the individual components of evapotranspiration using satellite observations together with model data and a set of algorithms [19,20].First, the Priestley and Taylor equation [45] is used to calculate potential evaporation from net radiation and air temperature observations.Then, actual evaporation is derived by applying a multiplicative stress factor based on microwave vegetation optical depth and soil moisture observations.Three datasets exist with different forcings and spatio-temporal coverages.Of these, two datasets of limited geographical extent are exclusively based on satellite data.Here, we use the only global dataset, GLEAM_v3.0a.For GLEAM_v3.0a,net radiation and air temperature are obtained from ERA-Interim, but all other required input datasets (precipitation, soil moisture, vegetation optical depth, snow water equivalent) are observation based.GLEAM_v3.0ahas a spatial resolution of 0.25 • and is a daily dataset, which spans the period 1980-2014.
In [8], we used a gridded E dataset based on the current global network of eddy covariance towers (FLUXNET) provided by the Max Planck Institute (MPI) Jena.The MPI dataset is available only until 2011 and was therefore replaced by GLEAM.However, the authors of [20] found good agreement of GLEAM and FLUXNET data.

Discharge
Monthly discharge data from the Global Runoff Data Centre (GRDC) are available for limited periods of time only (Figure 2).Merely 10 out of 26 stations used in this study provide data covering parts of the GRACE time span.Approaches for extending discharge data in time became ever more relevant during the last few years [46], e.g., satellite altimetry, runoff-precipitation ratio and runoff-storage relationships.Here, we created a consistent discharge time series by calibrating the lumped rainfall-runoff model GR2M [25].GR2M (a) is a simple empirical monthly model, (b) achieves good results for river basins of different sizes and with different hydrological conditions [47] and (c) allows for easy inclusion of error estimation.Further model improvement was achieved by extending GR2M using a distributed Hydrologiska Byrans Vattenavdelning (HBV) type snow model.
For each of the 26 river basins, the GR2M-snow model was calibrated against GRDC data using the 10 most recent continuous years available (marked in light blue in Figure 2).Then, the model was run for the GRACE time span.In addition, error estimates were obtained by running GR2M-snow several times with disturbed forcings and disturbed parameters.Finally, discharge from GR2M-snow was merged for the 17 aggregated catchments including error propagation.The resulting discharge time series were used in the further course of this study.While time series of P, E and ∆S are averaged over each basin, discharge is available at the most downstream gauging station, which also defines the border of the catchment.

Methodology
Water storage changes ∆S and discharge R are linked to net atmospheric-terrestrial flux P − E via the water budget equation: Consistent time series of water flux and storage are required for assessing the closure of the water budget equation.In this section, we first describe the derivation of the time series of P − E and ∆S.Afterwards, the set up of the rainfall-runoff model GR2M is presented aiming at the computation of discharge time series R covering the whole GRACE period.Finally, evaluation measures and the error assessment strategy are described.

Consistent Time Series of Atmospheric-Terrestrial Flux P − E and Storage Change ∆S
First, GRACE-derived gridded total water storage (TWS) anomalies S were centered to the 11-year study period by removing the mean static field computed from 2003-2013.Next, monthly time series of P, E and S were computed by spatially averaging over all grid cells of the target area.Finally, storage changes ∆S were obtained for each month τ as central differences from the GRACE storage time series according to: where S τ−1 is the previous month and S τ+1 the next month.Central differences in contrast to backward or forward difference operators avoid introducing a phase shift in the TWS change time series.Spatial averaging of TWS involves attenuation of the signal due to the spectral characteristics of GRACE data and further distortion due to the filtering procedure, known as the leakage effect [48].Depending on the mass distribution, filter properties, and target area, mass is transported either into the basin (leakage in) or out of the basin (leakage-out).This effect becomes particularly large if the basin is very small or if the mass distribution outside and inside of the basin differs significantly [49].Consequently, GRACE-derived time series of TWS change need to be rescaled.To this end, time-variable rescaling factors were derived for each target area separately.Assuming that the spatial distribution of P − E approximately corresponds to the spatial distribution of ∆S, a number of different P and E datasets were used to simulate the effects from spectral resolution, filtering and leakage.A robust estimate for the multiplicative rescaling factor of each month was obtained from the median of the analyzed P − E datasets.More detailed information on the derivation of consistent time series are provided by [8].

Using GR2M-Snow for Generating Modeled Discharge Time Series
The 2-parameter rainfall-runoff model GR2M requires as input monthly precipitation and potential evapotranspiration and includes two stores, a production store of variable size and a routing store with a fixed capacity of 60 mm.Monthly discharge is simulated by calibrating the capacity of the production store and an exchange coefficient, which accounts for the exchange of water with the outside of the basin.GR2M was extended by a distributed HBV-type snow model, which requires gridded temperature as input and adds three more calibration parameters (melting temperature, melting coefficient and temperature separating rain and snow).If the temperature is smaller than a defined value, snow is added to the corresponding grid cell, and if the temperature exceeds the melting temperature, the melting coefficient defines the amount of snow that is subtracted from the grid cell.Then, snow loss is accumulated over all grid cells and considered as additional precipitation.The precipitation and temperature datasets were obtained from the European daily high-resolution gridded dataset (E-OBS [50]) and potential evapotranspiration from the Climate Research Unit (CRU) data set TS3.22 [51]).GR2M-snow was calibrated for each of the basins against observed discharge for the 10 most recent continuous years available from GRDC (Figure 2) minimizing the squared difference of observed and simulated discharge.As GRDC data are erroneous for Neman and Vistula watersheds, those basins were calibrated against data from E-RUN (observational gridded runoff estimates for Europe) [52].Then, the model was run for the time span 1950-2013.As the validation period for each basin, the next 10 years available before the calibration period were chosen.In addition, error estimates were derived by generating an ensemble of simulated discharge by disturbing both the parameters and the input datasets.The input datasets were disturbed with noise following a Gaussian distribution, applying a standard deviation of 1 • C for temperature, 20% for precipitation and 5% for potential evapotranspiration.The variabilities of the five calibration parameters follow uniform distributions with appropriate uncertainty assumptions chosen for each parameter individually.For each river basin, 1000 model runs were performed, and from the ensemble of modeled discharge time series, the standard deviation was computed.The resulting error band fits well to the difference between GRDC and GR2M-snow (see Section 4.1).Furthermore, we were able to reproduce the results from [8] using discharge simulated by GR2M-snow within the error estimates.However, we are aware of limitations of this approach, e.g., when river dynamics changed over time due to river management and/or construction.
The skill of GR2M-snow in simulating discharge is evaluated by assessing different measures for the validation period, i.e., the mean bias and the root mean squared error (RMSE) between observed and simulated discharge, and Nash-Sutcliffe (NS) coefficients [53] for the time series with seasonal cycle and for de-trended and de-seasoned time series.NS coefficients are computed according to: where R obs is observed discharge from GRDC, R obs the temporal mean of observed discharge and R sim simulated discharge from GR2M-snow.Here, NS coefficients are computed from the root of discharge in order to avoid too much weight on high discharge values.An NS coefficient of one means perfect agreement between observed and modeled discharge; values between zero and one imply that the model better simulates discharge than the mean of observed discharge.In the case of NS coefficients smaller than zero, the mean of observed discharge better represents actual discharge than the model.

Evaluation of the Water Budget Equation
The performance of the individual P − E datasets is assessed with respect to ∆S + R using the following measures computed for the whole study period (2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013).A four-parameter model was fitted to the time series of basin averages for both sides of the water budget equation: where T is the annual period and the residuals.A difference between the mean a of the time series of ∆S + R and P − E is defined as the bias of atmospheric-terrestrial flux.Parameter b corresponds to the trend, and amplitude and phase shift are derived from c and d.Additionally, correlations of the time series and of de-seasoned and de-trended time series ˆ were investigated.

Error Assessment
For assessing the significance of our results, we performed a thorough error assessment for ∆S + R. Calibrated errors given for monthly GRACE Stokes coefficients were propagated to the basin-averaged TWS changes ∆S while taking into account errors from degree-1 and -2 coefficients.However, the resulting error estimates might be too optimistic, as we neglect errors from the GIA model and from the rescaling procedure.The uncertainty of simulated discharge was obtained from ensemble runs with the GR2M-snow model by disturbing model parameters and input datasets.Then, errors were propagated further to ∆S + R for each of the aggregated catchments.

Results
Overall, the results presented below show the skill of the individual P − E datasets in closing the water budget equation over Europe.First, the individual components of the water budget equation are analyzed.Then, both sides of the water budget equation are contrasted for selected target areas.Finally, biases, amplitudes and trends are illustrated and discussed for the whole study area.

Modeled Discharge from GR2M-Snow
Discharge simulated from GR2M-snow generally fits well to available discharge from GRDC within the error bounds.Figure 3 shows discharge for the Rhine basin from GRDC (blue) and GR2M-snow (red).Only a few peaks from GRDC exceed the ensemble-based standard deviation (grey).The influence of discharge on the closure of the water budget equation depends not only on its magnitude, but also on the magnitude of TWS change.The magnitude of discharge and its standard deviation vary for the individual river basins with temporal means between 3 and 61 mm/month (Table 2).The largest values are found in central Europe where mean discharge rates are larger than 40 mm/month and corresponding standard deviations larger than 7 mm/month (Rhine, Rhone and Po).In contrast, many basins in eastern Europe, in France and on the Iberian peninsula have mean discharge rates of about 10-20 mm/month and standard deviations of only 2-5 mm/month.In general, the standard deviation derived from ensemble runs amounts to 20%-30% of simulated mean monthly discharge (Table 2).An exception is the Daugava, where the standard deviation is about 40% of simulated discharge.Root mean squared errors (RMSE) between observed and simulated discharge have in most basins approximately the size of the simulated standard deviations.
The bias of mean discharge for the validation period is smaller than 2 mm/month (Table 2) except for Po (4.0 mm/month) and Rhone (2.5 mm/month), and thus, we neglect its contribution to the closure of the water budget equation.Nash-Sutcliffe coefficients (NS) were computed for the validation phase and confirm that the calibration of GR2M-snow was successful for most basins.In western and central Europe, the NS coefficients reach values between 0.6 and 0.9.Only some basins in eastern Europe (Narva, Dnepr, Southern Bug and Don) have NS coefficients smaller than 0.3.Discharge from Don contains some extreme peaks during the validation time span that are not simulated by GR2M-snow, resulting in a negative NS coefficient.NS coefficients for de-seasoned and de-trended time series (computed according to Equation (4), extended by semiannual signals) are mostly between 0.2 and 0.6.Again, the largest values are obtained for western and central Europe with NS coefficients between 0.35 and 0.65.Negative NS coefficients of de-seasoned and de-trended time series are found for Danube, Don, Neva and Po indicating changes in the short-term behavior of discharge between the calibration and the validation time span.However, correlations are between 0.8 and 0.9 for the original time series and between 0.7 and 0.8 for de-seasoned and de-trended time series.Exceptions are the rivers Don and Neva, where correlations of de-seasoned and de-trended time series amount only to 0.3 and 0.4.Table 2. Evaluation of simulated discharge from GR2M-snow for the validation period of each river basin: mean values, mean standard deviation (Std.), root mean squared errors (RMSE), bias of the mean and Nash-Sutcliffe (NS) coefficients for time series with seasonal cycle and de-seasoned and de-trended (des., det.) time series are computed as defined in Section 3.2.Due to missing observations, no validation could be performed for Seine and Tagus (* Neman and Vistula are calibrated and validated using E-RUN (observational gridded runoff estimates for Europe) because of erroneous GRDC data).

Time Series of Fluxes and Storage Change
The noise of ∆S depends mainly on the size and the shape of the target areas [49].For example, the Danube has a smooth time series with standard deviations of about 10 mm/month, while the time series of the Po is very noisy, and the standard deviations reach up to 30 mm/month, which is half of the annual amplitude (Figure 4).In fact, with a size of only 70,000 km 2 , the Po basin is likely too small for obtaining reliable TWS information from GRACE.
In the following, time series of fluxes for five catchments from different regions in Europe and of different sizes are discussed in more detail: (a) Guadiana-Guadalquivir on the Iberian peninsula, which is a small region and the most southern target area with very dry climatic conditions, (b) Rhine-Meuse in central Europe with a medium size and large precipitation events in the Alpine region, (c) Neva, the most northern basin, (d) Dnepr, a catchment with particularly low precipitation rates, and (e) Danube, the largest river basin in Europe.Some, but not all findings from these regions can be transferred to basins with similar climatic conditions (see Section 4.3).
The magnitude of river discharge varies in the individual basins depending on the climatic conditions (Figure 5, black dashed line).In Guadiana-Guadalquivir and Dnepr, discharge is small with mean annual values of 7-9 mm/month and 0-14 mm/month.In contrast, discharge plays an important role in central Europe (Rhine-Meuse) where it amounts to mean annual values of 30 mm/month, which is one third of precipitation.While discharge in the Neva basin is relatively constant with a weak annual cycle, in the Rhine-Meuse and Danube basins, it is clearly related to individual precipitation events.For most basins, discharge contributes about 10%-20% to the error of the left-hand side of the water budget equation.The highest error contribution from discharge is found in the Danube basin with 40%, which is mainly due to the small standard deviation of GRACE data caused by the large basin size.Monthly total precipitation is highly variable for European river basins (Figure 5).The chosen catchments represent different climatic conditions over Europe, e.g., there is nearly no precipitation in summer in Guadiana-Guadalquivir; Neva has a very pronounced annual cycle; and in Rhine-Meuse, individual precipitation events are visible.All model-derived time series show similar features as the observation-based dataset GPCC.However, while in the Danube basin, peaks from all models have about the same size, we find in the Dnepr and Neva basins smaller peaks for the COSMO-REA6 model compared to the other datasets.The authors of [54] showed that COSMO-REA6 has an enhanced skill in representing individual precipitation events compared to ERA-Interim.Moreover, the authors of [9] found that COSMO-REA6 and ERA-Interim overestimate P over parts of Russia and parts of the Alps compared to GPCC.In our study, this is confirmed by particularly high P rates in the Rhone basin for COSMO-REA6 with a temporal mean of 90 mm/month and peaks of up to 180 mm/month.Furthermore, in line with [9], we found that all reanalyses underestimate P in the catchment of Garonne-Loire-Seine compared to GPCC.Besides, MERRA-2 simulates exceptional high P rates in the Neva basin where its temporal mean is more than 10 mm/month higher than the temporal mean from the other datasets.
Evapotranspiration is characterized by a distinct annual cycle over Europe with different amplitudes depending on the regional climate (Figure 6).Large differences are found between the individual E datasets with a mean spread of 17 mm/month (maximum 43 mm/month) in the Danube basin, up to a mean spread of 28 mm/month (maximum 96 mm/month) in the Neva basin, and a mean spread of 29 mm/month (maximum 68 mm/month) in the catchment of Guadiana-Guadalquivir.Interestingly, E for the catchments on the Iberian peninsula does not only show a large spread between the models, but also patterns beyond the annual cycle.Moreover, all models except ERA-Interim suggest a small, but not regularly occurring secondary peak of evapotranspiration in winter in Guadiana-Guadalquivir.One might speculate that this sub-annual variability is due to the interaction of large-scale variability with regional characteristics, e.g., effects of the North Atlantic Oscillation (NAO).In summer, evapotranspiration from ERA-Interim is twice as large as E from COSMO-REA6 in Guadiana-Guadalquivir.Furthermore, in the Guadiana-Guadalquivir basin, MERRA-2 fits best to the observation based dataset GLEAM.
In general, MERRA-2 overestimates E compared to the observation-based dataset GLEAM with the most extreme values in the Neva catchment, where the temporal mean is 20 mm/month higher than the temporal mean from the other models.In contrast, outputs from COSMO-REA6 show the smallest E in all considered basins.It is striking that for most basins, the maximum value in summer is quite the same for COSMO every year, whereas for the global models, it varies from year to year.The year 2010 stands out with high evapotranspiration values for some models, which could be explained by the heat waves over Russia and the Iberian peninsula.
The previous analyses already indicated that the performance of the models shows regional differences.This also impacts the closure of the water budget equation (Figures 7 and 8).Modeled P − E is mostly within the error bars of ∆S + R even in smaller basins (Figure 7).However, MERRA-2 tends to underestimate P − E especially in summer due to large E estimates.Furthermore, ERA-Interim cannot close the water budget equation on the Iberian peninsula as E is overestimated in this region.It is worth noticing that P − E from COSMO-REA6 matches ∆S + R well for all selected catchments.The other models mainly differ from ∆S + R by a constant shift.Correlations between ∆S + R and P − E are between 0.5 and 0.9 and mostly about 0.7 for all models (Figure 9a), which is mainly attributed to the annual cycle.
A closer look at short-term variability is taken in Figure 8.For this purpose the mean, trend and annual signals according to Equation (4) (extended by semiannual signals) were subtracted from the flux time series.De-seasoned and de-trended time series show similar patterns for both sides of the water budget equation in all basins.Short-term variability also depends on the climatic conditions of the target catchment, e.g., in the Dnepr basin de-trended and de-seasoned P − E and ∆S + R have maximum values of 50 mm/month, whereas in Guadiana-Guadalquivir catchment, maximum values of 100 mm/month exist.Best agreement between de-trended and de-seasoned ∆S + R and P − E is achieved for the Danube with correlations of about 0.7 (Figure 9b).In most basins, the correlation is about 0.5, with a few exceptions like the Ebro, Neman, Don and Neva, where correlation is smaller than 0.3.Especially in smaller basins, outliers in the GRACE time series have a huge impact on correlation.Furthermore, in some basins, the agreement between ∆S + R and P − E changes with time, e.g., in the Guadiana-Guadalquivir and Neva basins, models and ∆S + R disagree during the first three years of the study period, but then again match very well (Figure 8).

Statistics of the River Basins
After integration, biases in fluxes produce incorrect trends in storage, i.e., underestimating P − E leads to a negative trend in storage, and overestimating P − E leads to a positive trend in storage.Figure 10a-d shows the bias of the water budget equation for the individual P − E datasets.The global models ERA-Interim and MERRA-2 tend to underestimate P − E (indicated by the red color).For ERA-Interim, the largest biases (about 20 mm/month) are obtained on the Iberian peninsula, which we attribute to deficiencies in the E estimate as discussed before.In contrast, in eastern Europe, both global models perform quite well with biases smaller than 10 mm/month.Although MERRA-2 shows the largest biases in central Europe (15 mm/month for Rhine-Meuse), it improved with respect to MERRA, which was assessed in [8], and had biases of about 25 mm/month in some central European basins.Results for the basins of Po and Rhone should be interpreted with caution as the basins are too small for deriving reasonable storage changes from GRACE data, and moreover, simulated runoff is rather uncertain in these basins (Table 2).
Biases of COSMO-REA6 are smaller than 10 mm/month in all basins, and in many basins, even smaller than 5 mm/month.In central Europe, COSMO-REA6 tends to underestimate P − E, whereas in eastern and western Europe, it overestimates P − E. The relevance of the bias is illustrated by Figure 10e, which provides the standard deviation of the temporal mean of ∆S + R.This value is mostly between 2 mm/month and 3 mm/month; it is smaller for large basins like Danube and Dnepr and higher for the very small basins like Po, Rhone and Ebro.We can conclude that the biases found in COSMO-REA6 are relevant only for a few basins.The observation-based datasets GPCC and GLEAM also tend to overestimate P − E with large values in the catchments of Garonne-Loire-Seine (−9 mm/month), Dnepr (−7 mm/month) and Dniester-Southern Bug (−8 mm/month).Interestingly, no obvious biases are found in the eastern basins for all model-based time series of P − E.
A more complete picture of the closure of the water budget is obtained by analyzing the amplitude, phase and trend of ∆S + R and P − E derived from the parameters estimated from Equation (4).The amplitude of P − E is predominantly affected by modeled evapotranspiration, which differs largely for the individual models (Figure 6).The regional patterns and the magnitudes of the amplitudes of COSMO-REA6 fit well to the amplitudes of ∆S + R with exceptions only in a few basins, e.g., on the Iberian peninsula (Figure 11).In contrast, GPCC in combination with GLEAM and ERA-Interim overestimate the amplitude of P − E with respect to GRACE by 15-20 mm/month in all basins west of the Rhine.The amplitude of P − E from MERRA-2 is particularly large in the French basins, the Rhine and the eastern basins with values of about 50 mm/month.The Danube basin stands out as all P − E time series and also ∆S + R agree very well regarding the amplitude with values between 31 and 35 mm/month.While COSMO-REA6 and ERA-Interim have nearly no phase shift, MERRA-2 and GPCC+GLEAM are contaminated with phase shifts of up to 10 days.
Negative trends in fluxes imply increasingly fast drying; positive trends mean that a region experiences increasingly fast wetting.Within the time span considered here, all models agree that eastern Europe tends to become more and more dry and central and northwestern Europe more and more wet (Figure 12).This conclusion is confirmed by ∆S + R; however, GRACE and discharge time series indicate larger negative trends in eastern Europe than the models.The authors of [55] found similar trend patterns investigating runoff over Europe, with negative trends over eastern Europe and positive trends over northwestern Europe for the time period 1963-2000.In contrast, the authors of [3]

Discussion and Outlook
Our results suggest that the high resolution COSMO-REA6 reanalysis better closes the water equation than global reanalyses.This means COSMO-REA6 can be seen as an important step forward in the consistent representation of the water cycle over Europe and, thus, advances climate monitoring on regional scales.In fact, in comparison to global NWP models, COSMO-REA6 improved the representation of small-scale variability, especially with respect to the accuracy of precipitation events [9].
We investigated the closure of the water budget equation for all major European river basins, aggregated to 17 catchment areas.This allowed us to distinguish regional patterns for the performance of the individual models over Europe.Previously, most water budget studies focused on large river basins worldwide (e.g., [14,57,58]).Only a few studies investigated the closure of the water budget over different river basins in Europe [8,10].However, [10] assessed moisture flux convergence instead of P − E. Due to the limited resolution of the early GRACE releases, they found very poor agreement between GRACE and ECMWF data.In [8], we focused on only four catchment areas in central Europe, which were too few for determining regional patterns.We summarize that the current study provides unique insights into the performance of different P − E datasets over Europe due to (i) the extension of the study area, (ii) the long study period of 11 years and (iii) the usage of the latest GRACE release.
Discharge was modeled applying a simple calibrated rainfall-runoff model instead of using gauge observations like in most water budget studies (e.g., [8,14]).In doing so, we circumvented the problem of lacking recent discharge data; yet, this approach may have limitations, and further research is required.Modeling introduces additional uncertainty to the left-hand side of the water budget equation, which was taken into account by modeling the error of simulated discharge via ensemble runs.We obtained errors of 20-30% of the magnitude of discharge, which we believe is a rather conservative estimate.Furthermore, the NS coefficients indicate a successful calibration of the model in most river basins.However, it should be kept in mind that the computed errors do not take into account changes due to human activities (e.g., dams, redirection) between the calibration time span and the GRACE time span.Nevertheless, the largest contribution to the error of ∆S + R (about 60-80%) originates from GRACE as discharge is much smaller in most basins.Especially in the very small river basins (e.g., Rhone, Po), the GRACE data should be interpreted with caution due to their excessive noise on such small spatial scales.
For most catchments, modeled P − E from COSMO-REA6 lies within the error ranges derived for ∆S + R. Biases are mostly smaller than 5 mm/month.In contrast, ERA-Interim and MERRA-2 underestimate P − E with particularly large biases over the Iberian peninsula for ERA-Interim and over Rhine-Meuse for MERRA-2.This confirms the results obtained in [8] for central Europe.However, it is worth noticing that the bias of MERRA-2 became smaller compared to the bias obtained for MERRA in [8].We attribute this to changes in the data assimilation system.For hydrological studies and climate modeling, it is particularly to reduce the bias of P − E in order to avoid introducing unrealistic trends in storage.
Evapotranspiration was found to be the most uncertain component of the water budget equation over Europe.In accordance with [14], who compared E from various models to GRACE-based estimates and detected significant differences in the mean annual cycles, we found differences in the annual amplitude of up to 50 mm/month between the individual E datasets.In particular, on the Iberian peninsula and for the northeastern catchments, large uncertainties for E were determined.It is likely that the spread of the models on the Iberian peninsula arises from the differences of potential evaporation and actual evaporation in this region.
The annual amplitude of P − E is mainly affected by the annual cycle of E. Compared to ∆S + R, the global datasets overestimate the annual amplitude especially in western Europe.In contrast to this, the amplitude of COSMO-REA6 agrees well with the amplitude of ∆S + R.
Finally, trends from P − E and ∆S + R indicate that northwestern and central Europe becomes increasingly wetter, whereas eastern Europe becomes increasingly drier.However, the trend estimates are only representative for the study period of 11 years, and no reliable conclusions can be drawn for longer time spans.
Our investigations of the closure of the water budget over Europe show regional patterns that can be associated with different regional climatic conditions.Therefore, the strengths and weaknesses of the individual datasets were analyzed for regions representing these varying characteristics.All in all, the regional COSMO-REA6 allows a better modeling of the water cycle than the global reanalyses, which we attribute to its higher spatial resolution.
For future studies, the assessment of the closure of the water budget equation on a grid instead of on catchment scale may provide a more detailed picture of regional differences.In this scope, the availability of gridded runoff is critical.Besides, for actually closing the water budget equation, contributions from pumping, aquifer systems and runoff into the ocean need to be investigated.

Figure 1 .
Figure 1.Study area: 26 European river basins, which are aggregated to 17 catchments as indicated by the different colors.

Figure 2 .
Figure 2. Discharge available from the Global Runoff Data Centre (GRDC) for the most downstream gauging stations of the rivers in our study region.The monthly Génie Rural rainfall-runoff model with snow extension (GR2M-snow) is calibrated against the 10 most recent continuous years of each basin (marked by light blue).In red, the Gravity Recovery and Climate Experiment (GRACE) time span is indicated.

Figure 3 .
Figure 3. Discharge R for the Rhine from Global Runoff Data Centre (GRDC) (blue) and simulated from Génie Rural rainfall-runoff model (GR2M)-snow (red) including the standard deviation (grey).

Figure 4 .Figure 5 .
Figure 4. Total water storage (TWS) change ∆S (red) and its standard deviation (grey) for two selected river basins.

Figure 6 .Figure 7 .
Figure 6.Monthly total evapotranspiration E from different models for selected catchments of the study area.

Figure 8 .
Figure 8. De-seasoned and de-trended time series of P − E and ∆S + R (red) for selected river basins.

Figure 9 .
Figure 9. Correlation of ∆S + R and P − E from COSMO-REA6 Reanalysis for (a) the original time series, as shown in Figure 7, and (b) de-trended and de-seasoned time series, as shown in Figure 8.

1 Figure 10 .
Figure 10.(a-d) Bias of the water budget equation given in equivalent water height (EWH): red color means that P − E is underestimated, and blue color means that P − E is overestimated.(e) provides as a reference the propagated error of the left side of the water budget equation.

1 Figure 11 .
Figure 11.Amplitudes of P − and ∆S + R given in equivalent water height (EWH).

1 Figure 12 .
Figure 12.Trend of P E and ∆S + R given in equivalent water height (EWH).

Table 1 .
The target catchments and their size.