Additional Value of Using Satellite-Based Soil Moisture and Two Sources of Groundwater Data for Hydrological Model Calibration

Although the complexity of physically-based models continues to increase, they still need to be calibrated. In recent years, there has been an increasing interest in using new satellite technologies and products with high resolution in model evaluations and decision-making. The aim of this study is to investigate the value of different remote sensing products and groundwater level measurements in the temporal calibration of a well-known hydrologic model i.e., Hydrologiska Bryåns Vattenbalansavdelning (HBV). This has rarely been done for conceptual models, as satellite data are often used in the spatial calibration of the distributed models. Three different soil moisture products from the European Space Agency Climate Change Initiative Soil Measure (ESA CCI SM v04.4), The Advanced Microwave Scanning Radiometer on the Earth Observing System (EOS) Aqua satellite (AMSR-E), soil moisture active passive (SMAP), and total water storage anomalies from Gravity Recovery and Climate Experiment (GRACE) are collected and spatially averaged over the Moselle River Basin in Germany and France. Different combinations of objective functions and search algorithms, all targeting a good fit between observed and simulated streamflow, groundwater and Water 2019, 11, 2083; doi:10.3390/w11102083 www.mdpi.com/journal/water Water 2019, 11, 2083 2 of 21 soil moisture, are used to analyze the contribution of each individual source of information. Firstly, the most important parameters are selected using sensitivity analysis, and then these parameters are included in a subsequent model calibration. The results of our multi-objective calibration reveal a substantial contribution of remote sensing products to the lumped model calibration, even if their spatially-distributed information is lost during the spatial aggregation. Inclusion of new observations, such as groundwater levels from wells and remotely sensed soil moisture to the calibration improves the model’s physical behavior, while it keeps a reasonable water balance that is the key objective of every hydrologic model.


Introduction
Hydrologic models are indispensable tools for predicting the rainfall-runoff response of catchments in order to anticipate hydrological droughts and (flash) flood events, causing loss of lives and economic damage. Hydrological processes transforming rainfall to runoff are represented by mathematical equations in different types of model structures. Based on their spatial discretization levels, the models are grouped as lumped, semi-distributed and fully distributed models. While the most sophisticated models fall in the latter group, lumped models are the most commonly used models, as they require less inputs and computational sources. Further, if lumped models are calibrated with appropriate observations and objective functions, they are skillful in obtaining reasonable discharge prediction performance. In this study, we are interested in examining the potential usage of remote sensing (RS) data to inform the parameter optimization, and evaluate different outputs of a bucket-type lumped hydrologic model.
Earth observations using satellite-based RS gained popularity with its easy-to-apply usage and large spatial coverage. RS products are recognized as an important source of data, and used directly as model input e.g., leaf area index, or used to assess model performance on simulated states and fluxes [1]. Both the availability and spatial resolution of remotely-sensed data continuously increase the provision of opportunities to constrain hydrological models [2,3]. Actual evapotranspiration (AET) [4,5] and soil moisture data [6] from the Moderate Resolution Imaging Spectroradiometer (MODIS) satellites, and groundwater storage-related observations via the Gravity Recovery and Climate Experiment (GRACE; [7]) satellites are just some of the examples from the numerous satellites. To benefit from spatial observations, spatially-distributed models are utilized [8][9][10][11].
Demirel et al. [3] used MODIS-based AET to calibrate a spatially-distributed model, and revealed that there is little trade-off between either the streamflow performance or the spatial performance of the model. Appropriate objective functions were key to constrain the model and reduce the number of runs to search for an optimal solution, meeting the water balance and spatial AET patterns. Rakovec et al. [9] used GRACE-based total water storage data and evapotranspiration (ET) from FLUXNET data to identify model parameters and improve the streamflow simulation performance in more than 80 European catchments. Zink et al. [10] used remotely-sensed temperature data to constrain a spatially-distributed model, and could succeed in reducing the errors in predicting actual evapotranspiration by 8%, while the model streamflow performance slightly decreased by 6%, showing the trade-off between different objectives when applied in a multi-objective calibration framework. Xiong et al. [12] assessed the impact of the remotely-sensed soil moisture active passive level 3 product (SMAP L3) in a multi-objective calibration of a distributed hydrological model in the Qujiang and Ganjiang catchments in China. They documented minor improvements in the simulation performance of soil moisture and streamflow when adding the SMAP L3 product, which could be due to the short length of this high-resolution soil moisture product. Li et al. [6] used lumped and semi-distributed hydrological models with different RS products and different calibration approaches in two Australian catchments.
Near-surface soil moisture retrievals from MODIS (500 × 500 m) and the Soil Moisture and Ocean Salinity level 3 product (SMOS L3, 25 × 25 km), together with remotely-sensed fractional vegetation cover data from MODIS (500 × 500 m), were incorporated with hydrologic models to test their contribution to the model simulation. They showed that the effect of RS soil moisture data for model calibration is usually substantial for the upstream sub-catchments.
There have been studies solely focused upon the effect of performance metrics (objective functions) on the model calibration. For instance, Kamamia et al. [13] calibrated a semi-distributed hydrologic model (Soil and Water Assessment Tool-SWAT) using a multi-metric framework including a Pareto front and five segment flow duration curve, and compared it with a standard calibration using Nash Sutcliffe Efficiency (NSE; [14]). Tobin et al. [15] calibrated the same model for a mountainous watershed in Idaho, US using ET estimates from MODIS, Simplified Surface Energy Balance and Global Land Evaporation: the Amsterdam Model (GLEAM). Their study showed that the MODIS ET product especially yielded the most to identify the model parameters, and improved streamflow simulations in the recession phase and summertime. They also used NSE as streamflow objective function that is criticized to be biased for high flows [16]. Krause et al. [17] studied nine different objective functions, including the index of agreement [18], the coefficient of determination (R 2 ), NSE and their revised versions, so as to adopt these metrics to high flows and low flows. They highlighted the importance of using multiple metrics depending on the modeling aim. All studies above and numerous other studies [19][20][21][22] have utilized mostly spatially-distributed hydrologic models with RS products and different calibration metrics [23,24].
However, to our knowledge, there have been few studies focusing on lumped models and RS products to improve model calibration, and none of them used groundwater level measurements from drilled wells in model evaluation and calibration. In one of the lumped model studies, Nijzink et al. [2] used soil moisture, evaporation, total water storage and snow state from relevant satellites to calibrate 27 catchments across Europe using the GLUE methodology [25]. They tested the applicability of satellite-based RS data in an evaluation of five different lumped models through a model calibration framework. Their study showed that using multiple information from satellites could be fruitful for estimating model parameters in data scarce regions without streamflow time series.
The objective of this study is to assess the value of remote sensing products, as well as groundwater level measurements in a multi-objective calibration of the Hydrologiska Bryåns Vattenbalansavdelning (HBV) lumped hydrologic model [26,27]. For that, we collected relevant satellite-based hydrologic data for the Moselle River Basin, and spatially averaged them first. Then different calibration scenarios are tested to quantify the individual effect of each RS data source on model calibration results. Two essential features of our study are: (1) The important parameters for the calibration are selected based on a multi-objective sensitivity analysis and, (2) three search algorithms i.e., one gradient descend and two global metaheuristic methods, are used to assess their effects on the calibration results. Further, the important parameters for the calibration are selected based on a multi-objective sensitivity analysis.

Study Area
The study area, the Moselle River, is a tributary of the Rhine River. The Rhine River Basin covers an area of 165,000 km 2 in eight countries, that is, the Netherlands, Germany, Switzerland, Luxemburg, Belgium, France, Austria and Liechtenstein [28,29]. The Moselle River Basin covers the mountains of the middle reaches in Germany, including the Black Forest (Schwarzwald) and the Vosges, reaching elevations over 1000 m (3280 feet) in the south, to around 600 m (1968 feet) towards the north. Land use in these areas include vineyards on sunny valley slopes, and agriculture and forest in the mountains and hillslopes. In the Moselle River Basin, winter is the season where the maximum discharge occurs. [28]. The mean discharge at Cochem is about 314 m 3 s −1 (11,088.8 cu. ft. per sec.), and its discharge varies from 14 m 3 s −1 (494 cu. ft. per sec.), in dry summers to a maximum of 4000 m 3 s −1 (141,259 cu. ft. per sec.), during winter floods [30].
The Moselle River is 313 km (195 mi.) long and has a catchment area of 27,262 km 2 , with an annual precipitation of~800 mm. The altitude ranges from 59 m to 1326 m (Figure 1), with a mean altitude of around 340 m. The Moselle Basin has 26 sub-basins, and the surface areas of these sub-basins vary between 102 and 3353 km 2 [31,32]. The Moselle River is 313 km (195 mi.) long and has a catchment area of 27,262 km 2 , with an annual precipitation of ~800 mm. The altitude ranges from 59 m to 1,326 m (Figure 1), with a mean altitude of around 340 m. The Moselle Basin has 26 sub-basins, and the surface areas of these subbasins vary between 102 and 3353 km 2 [31,32].

Data
The basic inputs of a lumped hydrological model are daily precipitation (P) and potential evapotranspiration (PET). The P and PET time series for the Moselle Basin were obtained from the German Federal Institute of Hydrology (BfG) in Koblenz (Germany). Further, the outlet discharge (Q) measurements at Cochem (gauge #6336050) were provided by the Global Runoff Data Centre (GRDC) in Koblenz (Germany) [33]. Although Rhine has mostly alluvial aquifer, Moselle uplands comprised of Jurassic Limestone aquifers [2]. Also within the Moselle there is contrasting lithology and topography: granitic formations upstream in the Vosges mountains and carbonate platform downstream on the Lorraine plateau [2]. The groundwater well measurements (GWY) from 80 stations within the Moselle Basin were also obtained from the German Federal Institute of Hydrology (BfG) in Koblenz (Germany). GWY data from the stations falling on the same sub-catchment of Moselle are averaged using arithmetic mean. These sub-catchment averaged data are then aggregated to the Moselle Basin Level using areal weights of 26 sub-catchments. In addition to these groundbased data we also included remote sensing soil moisture data from different sources ( Table 1). All remotely-sensed soil moisture (RS-SM) data are upscaled to catchment scale using an arithmetic average of all cells (Figure 2).

European Space Agency (ESA CCI SM V04.4)
Soil moisture (SM) is an important variable for many purposes, e.g., to anticipate possible floods using antecedent SM conditions, and to manage irrigation water. In 2010, the Global Climate Observing System (GCOS) panel recognized soil moisture as one of the 50 Essential Climate Variables (ECVs) [34].

Data
The basic inputs of a lumped hydrological model are daily precipitation (P) and potential evapotranspiration (PET). The P and PET time series for the Moselle Basin were obtained from the German Federal Institute of Hydrology (BfG) in Koblenz (Germany). Further, the outlet discharge (Q) measurements at Cochem (gauge #6336050) were provided by the Global Runoff Data Centre (GRDC) in Koblenz (Germany) [33]. Although Rhine has mostly alluvial aquifer, Moselle uplands comprised of Jurassic Limestone aquifers [2]. Also within the Moselle there is contrasting lithology and topography: granitic formations upstream in the Vosges mountains and carbonate platform downstream on the Lorraine plateau [2]. The groundwater well measurements (GWY) from 80 stations within the Moselle Basin were also obtained from the German Federal Institute of Hydrology (BfG) in Koblenz (Germany). GWY data from the stations falling on the same sub-catchment of Moselle are averaged using arithmetic mean. These sub-catchment averaged data are then aggregated to the Moselle Basin Level using areal weights of 26 sub-catchments. In addition to these ground-based data we also included remote sensing soil moisture data from different sources ( Table 1). All remotely-sensed soil moisture (RS-SM) data are upscaled to catchment scale using an arithmetic average of all cells ( Figure 2).

European Space Agency (ESA CCI SM V04.4)
Soil moisture (SM) is an important variable for many purposes, e.g., to anticipate possible floods using antecedent SM conditions, and to manage irrigation water. In 2010, the Global Climate Observing System (GCOS) panel recognized soil moisture as one of the 50 Essential Climate Variables (ECVs) [34].
The European Space Agency (ESA) started the CCI SM project, known as the Climate Change Initiative, in 2012, to contribute to the monitoring of the ECVs [35]. The purpose of the ESA CCI SM project is to merge active and passive microwave sensors and generate a new global SM data record. The latest-released ESA CCI SM v04.4 product is used in this study (http://www.esa-soilmoisture-cci.org/, last access: 4 September 2019; [36]). The last algorithm of this ESA CCI SM v4 project, which is also being used in the current Copernicus Climate Change Service (C3S) climate data records (CDRs) production system version v201812, uses an approach to better parameterize the least-squares merging [37]. The ESA CCI soil moisture v04.4 product with daily temporal and 0.25 • spatial resolution, spans for the period from November 1978 and June 2018, and consists of three different datasets, the so-called active, passive and combined [38].
The active data set is generated by merging active microwave satellite data by using the TU Wien Water Retrieval Package (WARP) algorithm [39,40] [46,47]; and the Global Change Observation Mission 1st-Water-AMSR-2 (GCOM-W1, AMSR2, 6.9/10.6 GHz, July 2012-June 2018) [48]. The combined data set is obtained by merging the active and passive products. The method for merging based on the weighted averaging and merging scheme is parameterized using triple collocation analysis (TCA) [38]. In this paper, the ESA CCI SM v04.4 product combined data set is used. Studies of [36,38] are recommended to reader for more details about the evolution of the ESA CCI SM data record, details of sensor properties, and product merging methodology.

The Advanced Microwave Scanning Radiometer for the Earth Observing System (AMSR-E)
Microwave remote sensing provides a solution to directly observe the soil moisture. Especially active sensors are generally not affected by cloud cover, although accurate soil moisture estimates are limited to regions that have either bare soil or low to moderate amounts of vegetation cover, since in the absence of significant vegetation cover, soil moisture is the dominant effect on the received signal [49]. Besides, one of the issues about using passive microwaves is that there are contrasting effects of soil moisture and vegetation water content on microwave emission. In other words, an increase of soil moisture and a decrease in vegetation water content have the same effect. Effects of vegetation and roughness become more evident as the frequency increases; that is why low frequencies in the L-band range (1-2 GHz) are better suited for soil moisture sensing [50]. Another issue is about the effect of the daytime temperature measurements affecting the top soil layers, making it difficult for soil moisture inversion. The Advanced Microwave Scanning Radiometer (AMSR-E) on the Earth Observing System (EOS) Aqua satellite was launched on 4 May 2002. The AMSR-E instrument was developed by the National Space Development Agency of Japan (NASDA), and provided to the U.S. National Aeronautics and Space Administration (NASA). AMSR-E records the brightness temperature at frequencies of 6.9, 10.7, 18.7, 23.8, 36.5 and 89 GHz, at horizontal (H) and vertical (V) polarizations. The mean spatial resolution at 6.9 GHz is about 56 km with a swath width of 1445 km. The AMSR-E soil moisture has a strong association to ground-based soil moisture data, with typical correlations of greater than 0.8, and typical RMSD less than 0.03 vol/vol (for a normalized and filtered AMSR-E time series) [50,51].
For AMSR, various soil moisture algorithms that differ mainly in their approaches to the vegetation and surface temperature corrections have been considered. There are four groups of correction approaches, with the first one being sequential corrections using external ancillary data; another approach is the iterative parameter fitting to a forward multichannel brightness temperature model, an yet another approach consists of the use of brightness temperature indices and regression fits, and lastly combinations of the above approaches [50].

The Soil Moisture Active Passive (SMAP)
The Soil Moisture Active Passive (SMAP) mission has been developed by NASA as one of the first Earth observation satellites in return for the National Research Council's Decadal Survey. SMAP analyzes global measurements of soil moisture present at the Earth's surface, thus allowing us to make indirect observations of soil moisture and the thaw/freeze state from space to formulate considerably improved estimates of energy, water and carbon transfers between the atmosphere and the land. Correct characterization of these transfers is highly dependent on the accuracy of numerical atmosphere models that are used in weather prediction and climate projections. Measurement of soil moisture can be applied directly to flood assessment and drought monitoring. It is useful in estimating global water and energy fluxes at the land surface. Since April 2015, the Soil Moisture Active Passive (SMAP) mission of NASA has been successfully monitoring near-surface soil moisture, by mapping the globe between the latitude bands of 85.044 • N/S in 2-3 days, depending upon the location [52][53][54][55]. In this research, daily data of soil moisture with a resolution of 36 × 36 km is downloaded and processed. However, soil moisture retrievals only from only ESA and AQUA are incorporated into the model calibration.

Gravity Recovery and Climate Experiment (GRACE)
The main purpose of GRACE (Gravity Recovery and Climate Experiment) is in the surveying of the Earth's gravity field anomalies of lands and oceans with a spatial resolution of 400 × 400 km [7]. Water storage in hydrologic reservoirs, the movements of oceans, atmospheric and land ice masses and mass swaps of Earth compartments cause gravity changes, and for GRACE monthly changes are considered. The gravity changes are measured in centimeters of equivalent water thickness every month. The GRACE mission consists of twin satellites, and the creators of the mission are the United States National Aeronautics and Space Administration (NASA) and the German Aerospace Centre (DLR).

Methods
Lumped hydrologic models are usually calibrated based solely on discharge data measured at the catchment outlet. In this study, we used well measurements (water levels), as well as temporal information from several remote sensing products, in order to evaluate outputs from different compartments (fast runoff and base flow) of the Hydrologiska Bryåns Vattenbalansavdelning (HBV) model. This is done to simultaneously improve the performance and physical meaning of the results. First, we applied a systematic sensitivity analysis using appropriate objective functions to identify the most important parameters. We then included the selected parameters in the calibration.

Hydrologic Model
In this study we used the Hydrologiska Bryåns Vattenbalansavdelning (HBV) lumped model that includes the basic rainfall-runoff processes, such as snow accumulation and melt, soil moisture, quick runoff and base flow (Figure 3). The model was developed by the Swedish Meteorological and Hydrological Institute [26,27], and applied in numerous projects [30,32,[56][57][58][59]. The Moselle River is a rainfed river, therefore, we only used P and PET as model inputs. The parameter ranges for model calibration ( Table 2) were taken from previous studies [32,58].

Objective Functions
There are various objective functions (i.e., performance metrics) to evaluate the hydrologic model. Some of them may focus on low flows [16], while others focus on high flows [14]. These

Objective Functions
There are various objective functions (i.e., performance metrics) to evaluate the hydrologic model. Some of them may focus on low flows [16], while others focus on high flows [14]. These objective functions (OFs) can use the crude difference between observed and simulated values, like the root mean square error, or the relative difference (i.e., normalized), such as the Nash Sutcliffe Efficiency (NSE) [14,16]. As highlighted in different studies [60][61][62][63], choosing an appropriate OF is very crucial for calibrating hydrologic models. In this paper, we carefully selected three OFs, namely NSE-Q, NSE-LNQ and the Pearson Correlation Coefficient (CORR) to assess the efficiency of the HBV model. The details of these OFs are summarized in Table 3.
The Pearson Correlation Coefficient (shown as CORR in Table 3) was used to determine the relationship between simulated variables from the HBV model (soil moisture and groundwater) and observed sensing data. Observed data contains groundwater well measurements (GWY) and remote sensing products (GRACE, ESA, AQUA and soil moisture active passive (SMAP)). Since soil moisture is highly relevant for fast runoff and high flows, NSE-Q was used as well. This metric uses squared differences which can be dominated by high values in the time series. To avoid this and to reduce the sensitivity of NSE to extreme high values, we used the logarithmic discharge values in NSE, i.e., NSE-LNQ, as well. Herein, peak values in the hydrograph are flattened, leading to a fair treatment of low flows in the model performance [17]. Table 3. Description of objective functions. In these formulas x, y, x and y indicate observation data, simulation data and their means, respectively.

Latin Hypercube Sampling One-Factor-At-A-Time Sensitivity Analysis
Following the study of Demirel et al. [64], we used the Latin Hypercube Sampling One-at-a-Time (LHS-OAT) approach for sensitivity analysis with 20 different initial parameter sets. This number is assumed to be sufficient, as Demirel et al. [64] showed the stability of the average accumulated relative parameter sensitivities after nearly 20 initial parameter sets. Obviously the main advantage of using LHS is that it is a stratified sampling method improving the coverage of the parameter space, using a smaller number of random sets, as compared to other methods like crude Monte Carlo Sampling [65].

Model Calibration and Validation
In this study we tested three different optimization algorithms: (1) Gradient-based Levenberg Marquardt [66], (2) Shuffled Complex Evolutionary algorithm-University of Arizona (SCE-UA) [67] and 3) the Covariance Matrix Adaptation Evolution Strategy (CMAES) [68]. While the first method uses a local search algorithm, the other two use metaheuristic global search algorithms. SCE-UA is one of the most popular algorithms in hydrology, and CMAES emerges as a very efficient global algorithm that can converge quickly. The details of the model calibration and validation periods are presented in Table 4. The discharge data from the period of 2002-2006, GW well measurements and the satellite data (ESA and AQUA) from 2002-2004 are used to calibrate the model. The remaining part of the data is used to validate the HBV model (see Table 4).
It should be noted that the SMAP data have been available since 2015, and therefore they not included in the model calibration. A three year warm-up period is incorporated, and the evaluation of the discharge results started in 1954, instead of 1951. The objective functions are used individually or in different combinations based on our four cases: (1) Only discharge (only-Q), (2) only groundwater and soil moisture (GW-only and SM-only), (3) discharge and groundwater/soil moisture (Q + GW and Q + SM) and (4) all together (Q + GW + SM), i.e., discharge and remote sensing products (GRACE, ESA and AQUA), as well as groundwater measurements (GWY) are used together in the calibration with all objective functions. As the well observations (GWY) are a more reliable source of information about groundwater, as compared to the GRACE-based storage estimations, we gave a higher weight of 0.7 to GWY than GWG (weight of 0.3). Similarly, we gave higher weight to ESA (0.8) as compared to AQUA (0.2) since ESA is a blended product that contains AQUA (AMSR-E) as well. For the combined objective functions, we gave equal weights to Q, SM and GW.
In PEST toolbox, we set the max number of runs to 3000 in all three methods, and used two complexes for SCE-UA and default population values (i.e., lambda = 11, number of parents = 5, seed number = 1111) for CMAES. Also as a stopping rule, the maximum relative objective function change was set to 1% over five iterations in all three methods. We evaluated the differences between model-simulated and observed variables at their available temporal scales i.e., daily, except for the monthly GRACE data.  Table 5 shows the LHS-OAT results based on the correlation coefficient and NSE, respectively. It should be noted that the sensitivity analysis is done separately for each objective function and RS product. The results are illustrated as descending in the third column i.e., SM for comparison and ease of following. The most influential parameters for SM are FC, LP, BETA and CFLUX, whereas the remaining four parameters (i.e., ALFA, KF, KS and PERC) have no effect upon the results. Interestingly for the two groundwater data sources (well and GRACE) the most influential parameters are different; i.e., KS for GWY and LP for GWG. As expected, the ALFA and KF parameters have no effect on the groundwater. Based on the results in the Table 5 columns Q and LNQ, ALFA is the most influential parameter for high flows (NSE), whereas BETA is the most important parameter for low flows (NSE-LNQ).  Figure 4 shows the evolution of parameter rankings based on the average accumulated sensitivities. Interestingly, the rankings are unstable in the very first initial runs and stable after ten or twenty runs from different initial parameter sets. This is consistent with the results of a recent study by Demirel et al. [64], stating that it is key to repeat local sensitivity analysis at different locations in the parameter space for fairly reaching global sensitivity results. In short, the most influential parameters from Table 5 can also be identified in Figure 4, and all parameters that have sensitivity above zero in this table are included in the model calibration.   Table 6 shows different dimensions in our results (different cases, different objective functions, different optimization algorithms and calibration versus validation). Herein the case numbers from 1 to 4 indicate different hydrologic processes included in the model calibration. Appropriate objective functions are then included in the multi-objective calibration based on the considered processes (shown as bold in the Table 6). Three search algorithms, PEST-LM, SCE-UA and CMAES, are particularly selected, as they are widely used in hydrology. Model validation results are also presented next to the calibration in the same table.

Model Calibration and Validation
Obviously, the validation results were mostly poorer than the calibration results. In addition, the two global optimization methods tested in this study (SCE-UA and CMAES) predominantly yielded better results than the gradient-based Levenberg Marquardt method. The best NSE-Q value obtained by the PEST-LM was 0.77 for case 1 (high flows), however with SCE-UA and CMAES, values of 0.90 and 0.89 could be obtained, respectively. The best NSE-LNQ value obtained by PEST-LM was 0.81 for case 1 (low flows), and SCE-UA and CMAES resulted in 0.81 and 0.82, respectively. It is worth pointing out that there is a trade-off between single and multi-objective calibrations. For instance, single objective calibration improved the SM-Only simulations from 0.83 to 0.85 using the SCE-UA method (case 2, CORR_SM). Multi-objective calibration (Q + SM) ensured a reasonable hydrograph match (NSE-Q 0.88), whereas SM-Only calibration failed for both SCE-UA and CMAES, i.e., NSE-Q~−26.7 and 0.2, respectively.
Based on the validation results, the GW-Only calibration algorithm resulted in a value of 0.75 for CORR_GWY using the SCE-UA. Interestingly, the performance did not deteriorate, i.e., 0.72 for CORR_GWY, when we added NSE-LNQ to the set of objective functions. The case 3 (Q + GW) calibration using SCE-UA substantially improved the NSE-LNQ metric from −1.18 to 0.78 in the validation period; i.e., the main motivation of our multi-objective calibration framework proposed in this study (Table 6). With this in mind, the GRACE data from the Rhine basin apparently caused a conflict with the Moselle well measurements, like in the last case (Q + GW + SM); i.e., they showed negative CORR_GWG results. Out of curiosity, we tested GWG-Only and GWY-Only calibration scenarios using all three methods and obtained NSE of 0.18 and 0.76, respectively using the SCE-UA method. This confirms the low reliability of the coarse GRACE (GWG) data, as compared to the groundwater well measurements (GWY) in the Moselle River Basin. Figure 5 shows the results of the model calibration framework comprising seven cases listed in the legend. For brevity, only the CMAES results are presented, as Table 6 gives the overall calibration and validation results. Here one can clearly see the trade-off between single and multiple objective functions in the calibration. Obviously, Only Q and Only Q-LN perform best in both calibration and validation periods, spring and fall seasons in particular. During low flows from June to September, the HBV model captures the hydrological behavior well. Only-GW calibration fails to capture the streamflow dynamics in the Moselle River, and Only-SM calibration fails to simulate low flow periods, since these constraints are not given to the model via the objective functions. In other words, if the objective function does not incorporate appropriate process information, the hydrologic model usually fails/ignores to simulate the underlying physics.  Figure 5 shows the results of the model calibration framework comprising seven cases listed in the legend. For brevity, only the CMAES results are presented, as Table 6 gives the overall calibration and validation results. Here one can clearly see the trade-off between single and multiple objective functions in the calibration. Obviously, Only Q and Only Q-LN perform best in both calibration and validation periods, spring and fall seasons in particular. During low flows from June to September, the HBV model captures the hydrological behavior well. Only-GW calibration fails to capture the streamflow dynamics in the Moselle River, and Only-SM calibration fails to simulate low flow periods, since these constraints are not given to the model via the objective functions. In other words, if the objective function does not incorporate appropriate process information, the hydrologic model usually fails/ignores to simulate the underlying physics.

Discussion
Satellite-based observations have been incorporated into different model calibrations to constrain the spatially-distributed hydrologic model behavior towards reliable physics [3,6,10,12]. In this study, we sought an answer to the question whether satellite-based products can also be helpful for guiding lumped models throughout the calibration, an issue which has been rarely investigated [2]. Obviously spatial information is lost during the averaging (lumping) process, but the temporal information from the satellite data are shown to be helpful for the model evaluations. SM-only and GW-only calibrations fail on discharge, and in combination with Q they add little to the performance.

Discussion
Satellite-based observations have been incorporated into different model calibrations to constrain the spatially-distributed hydrologic model behavior towards reliable physics [3,6,10,12]. In this study, we sought an answer to the question whether satellite-based products can also be helpful for guiding lumped models throughout the calibration, an issue which has been rarely investigated [2]. Obviously spatial information is lost during the averaging (lumping) process, but the temporal information from the satellite data are shown to be helpful for the model evaluations. SM-only and GW-only calibrations fail on discharge, and in combination with Q they add little to the performance.
Also the combination of Q, SM and G data gives good results for Q simulation for the right reasons (SM and G simulation is reasonable, in particular, compared to Q only simulations).
We received P and PET data from two sources: BfG-Koblenz (up to 2006) and van Osnabrugge et al. [29] (up to 2015). We noticed wetter characteristics in the latter data, and decided to stick to the first data from BfG-Koblenz for the model calibration, as the initial calibration results were promising. Instead, the more recent data were used only for the validation. We assume that three to five year daily data (2002-2004 and 2002-2006) are sufficient for model calibration. Besides, the long validation period is probably the main reason for good validation performances. In contrast to the conventional model calibration approaches, we used more recent data for calibration and older data for validation to benefit from the opportunities of new instrumentation.
We limited the PEST-LM calibration with 375 iterations corresponding to nearly 3,000 model runs, which is comparable with the maximum number of runs of SCE-UA and CMAES. However, PEST-LM usually ended much earlier than the maximum number of runs. This is from the fact that the initial point in the model calibration is key to determine a local minimum, or to coincidentally find a global optimum. Apparently, PEST-LM suffered from the local minimum issue. Comparing the NSE-LNQ values in the first column (PEST-LM) for low flows and GW-only, and also the NSE values in the first column of high flows and SM-only, clearly show the importance of the initial parameter set (start point in the solution space) and the local minimum problem of PEST-LM.
We only used the HBV model, which has GW and SM storages, whereas other similar models with three buckets can be also tested to assess the uncertainties from model selection. The effects of model structure and different remote sensing products on the calibration of five different rainfall-runoff models with different spatial discretization have been discussed in the previous studies [2,69].
Moreover, we only used NSE and CORR as objective functions, whereas there are many more objective functions, e.g., Kling and Gupta Efficiency [70], Spatial Efficiency (SPAEF) [3], Spearman and Kendall Rank Correlation Coefficients [71], than these two metrics tested in the spatial and temporal domain. Although SPAEF has been designed to evaluate spatial performance of distributed models, it can be applied to the time-series comparison in this study as well, since the three components in the SPAEF metric, i.e., the correlation coefficient, coefficient of variation and histogram match, are important characteristics of temporal data too. In a subsequent study, we plan to apply SPAEF in a temporal model calibration. Moreover, Spearman's rho and Kendall's tau coefficients can capture non-linear relations between two variables, while Pearson's metric (CORR) only shows the linear correlations.
We expected to see similar parameters showing a high sensitivity for fitting HBV model simulations to observed and remotely-sensed groundwater data (well and GRACE). However, the KS parameter showed the highest sensitivity for GWY, and the LP parameter showed the highest sensitivity for GWG. This can be an initial alert/indicator to the poor GRACE performance, since the LP parameter is relevant for the soil moisture reservoir of the model, and it hardly influences simulated groundwater behavior. We used GRACE-Rhine data directly instead of estimating or finding GRACE-Moselle data. This must have substantial effects on the results as the Rhine Basin size is large, and there can be different aquifer dynamics within the basin.

Conclusions
At the beginning of this study, we applied sensitivity analysis to determine the most important parameters for the calibration using the groundwater well measurements (GWY) and remotely-sensed products (GRACE, ESA, AQUA and SMAP) data. We selected three different objective functions, namely NSE-Q, NSE-LNQ and CORR. We can draw the following conclusions from the results of this study:

•
Based on the sensitivity analysis results, we find that different hydrologic processes are sensitive for different parameters. The HBV model's groundwater performance (GWY) was most sensitive to the KS parameter, whereas the model's soil moisture performance was most sensitive to the FC parameter. Also we confirm that 20 different initial parameter sets [64] using Latin Hypercube sampling are sufficient for globally encapsulating the most sensitive parameters. • Based on the calibration and validation results, we show that the two global methods perform better than the local Levenberg Marquardt method. An upper limit of 3,000 model runs appeared to be plausible for both the local and global optimization of the HBV model. Also including groundwater and remotely-sensed soil moisture information slightly (up to~10%) improved not only the GW and SM simulation performances of the model, but also the simulation of the observed discharge behavior of the HBV model. This is only exploiting the temporal information of the satellite-based data, since we spatially averaged the distributed data to get the time series of SM and GW. This was also confirmed by a recent study by Nijzink et al. [2].
Further work on the combination of lumped hydrological models with newly-available high resolution remote sensing products, e.g. SMAP, is necessary. Also it is recommended to test the value of spatially-distributed well measurements in addition to the GRACE on the performance of a fully distributed version of HBV (i.e., HYPE) or another hydrologic model like mHM [3]. Since ESA is a combined product including AQUA, only ESA can be used to avoid redundancy in the future studies. Spatial variation of droughts and high flows should also be assessed with different performance metrics to exploit both the temporal and spatial value of the remote sensing products. Promising deep learning methods such as Long Short-Term Memory Network (LSTM) [72,73] and Tensorflow can also be incorporated to simulate spatially-distributed runoff based on spatially-distributed layers of precipitation and potential evapotranspiration maps. Funding: This research received funding from the TÜBİTAK grant 118C020.