Multi-Objective Validation of SWAT for Sparsely-Gauged West African River Basins — A Remote Sensing Approach

Predicting freshwater resources is a major concern in West Africa, where large parts of the population depend on rain-fed subsistence agriculture. However, a steady decline in the availability of in-situ measurements of climatic and hydrologic variables makes it difficult to simulate water resource availability with hydrological models. In this study, a modeling framework was set up for sparsely-gauged catchments in West Africa using the Soil and Water Assessment Tool (SWAT), whilst largely relying on remote sensing and reanalysis inputs. The model was calibrated using two different strategies and validated using discharge measurements. New in this study is the use of a multi-objective validation conducted to further investigate the performance of the model, where simulated actual evapotranspiration, soil moisture, and total water storage were evaluated using remote sensing data. Results show that the model performs well (R2 calibration: 0.52 and 0.51; R2 validation: 0.63 and 0.61) and the multi-objective validation reveals good agreement between predictions and observations. The study reveals the potential of using remote sensing data in sparsely-gauged catchments, resulting in good performance and providing data for evaluating water balance components that are not usually validated. The modeling framework presented in this study is the basis for future studies, which will address model response to extreme drought and flood events and further examine the coincidence with Gravity Recovery and Climate Experiment (GRACE) total water storage retrievals.


Introduction
The availability of freshwater is a major concern in West Africa, directly influencing food security, human health, and economic development [1]. In the region, approximately 60% of the active labor force is employed in agriculture. However, this only contributes 35% to the gross domestic product [2,3]. Many West African farmers are poor and only able to produce close to subsistence levels, rendering them especially vulnerable to water stress [3]. Therefore, knowledge of the available water resources is essential, and modeling the water balance to estimate available resources can be an important tool in this respect. Several meso-scale models have been applied to the area, among others by Andersen et al., who used the physically-based MIKE SHE model to model the Senegal river basin in 2001 [4]. In 2005, [...]
River basins were selected based on the availability of discharge data for calibration purposes. The total area of the basins selected for the model is 3.4 million km². Due to computational constraints, three different models were built: South (Volta, Ouémé, Comoé, Mono, Pra, Ankobra and Ayensu river basins, 633,000 km², 41 stream gauges), West (Senegal and Gambia river basins, 558,600 km², 9 stream gauges) and Niger (Niger river basin, 2,250,000 km², 12 stream gauges).

The SWAT Hydrological Model
The Soil and Water Assessment Tool (SWAT) is a continuous-time, semi-distributed, process-based river basin model. SWAT runs at a daily time step but may be calibrated using monthly or yearly observed data [34,35]. The model comprises eight major components: hydrology, weather, sedimentation, soil temperature and properties, crop growth, nutrients, pesticides and agricultural management. The hydrological component of SWAT is based on the water balance equation [14]. SWAT has proven competitive at a number of scales from local to continental, having been employed for the modeling of water resources in Africa and Europe [11,36], among others.
In this study, SWAT 2012 was used. The major model inputs and data preparation are described in detail below.

• Digital Elevation Model (DEM): The hydrologically conditioned HydroSHEDS (Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales) digital elevation model (DEM), developed by the World Wildlife Fund (WWF) and the United States Geological Survey (USGS) based on the NASA SRTM (Shuttle Radar Topographic Mission), was used for streamflow delineation. HydroSHEDS is available in 3 and 15 arc-second resolutions (approximately 90 and 500 m) [37,38]. In this study, sub-basins were generated using the 500 m version.

• Land use and land cover: The Comité permanent Inter-Etats de Lutte contre la Sécheresse dans le Sahel (CILSS) Landscapes of West Africa land use and land cover raster dataset of the year 2013 was used as a basis for developing the land use layer required by SWAT. Maps are also available for the years 1975 and 2000 at a resolution of 2 km. The maps were created using local information and remote sensing data in cooperation with USAID and USGS [30]. Since no data are included for Cameroon, for the areas north of the 18th parallel in Mauritania and Mali, or for the areas north of the 15.5th parallel in Niger, missing data were replaced using the European Space Agency (ESA) Globcover 2.3 dataset, which depicts the land use of the year 2009 at a 300 m resolution [39]. Land use classes were converted to default SWAT classes. It is unclear whether SWAT can realistically simulate plant growth under tropical conditions due to its implemented heat unit growth model [40][41][42]. In our study, the management database was adapted by setting fixed plant and harvest dates corresponding to the onset and end of the rainy season. When compared to MODerate-resolution Imaging Spectroradiometer (MODIS) MOD 15A2 leaf area index (LAI) estimates produced by NASA [43], SWAT LAI reaches a Pearson's r of 0.62, whereas without management modifications this value drops to −0.47. [...]
The Harmonized World Soil Database (HWSD) supplies a raster map and database containing several soil physical and chemical parameters for a topsoil and a subsoil layer [44]. Missing parameters were estimated from soil texture using pedotransfer functions [45]. The HWSD and its predecessors have been used for SWAT simulations in Africa, the Middle East, and Europe, among others [1,11,12,36,[46][47][48].

• Climate: In a previous study, ten precipitation datasets were analyzed for six sub-basins in the study area [23]. [...] [53]. Missing storage volume information was approximated as proposed by Schuol et al. [1]. Lake Volta was not modeled due to insufficient data being available.

Multi-Objective Validation Datasets
We decided to validate model simulations using actual evapotranspiration, soil moisture, and total water storage, in order to evaluate the model performance for processes not reflected in streamflow. This section gives an overview of the remote sensing data used in the multi-objective validation.

• Actual evapotranspiration (ETa): Data was extracted from the MODIS MOD 16 dataset supplied by NASA, available at a 1 km [...] The dual one-way K-band microwave ranging system observes the distance between the two satellites. Changes in the distance, in conjunction with complementary tracking data, are used to derive monthly gravity fields, which are subsequently converted to mass changes in terms of equivalent water height according to Wahr et al. [59].
In this study, we used the ITSG-Grace2016 time series provided by the Institute of Geodesy (IfG) at Technical University (TU) Graz as sets of spherical harmonic coefficients up to degree and order 90.
As GRACE does not measure geocenter variations, degree 1 coefficients were replaced by the time series provided by Rietbroek et al. [60,61]. The c20 coefficient, which is corrupted by aliasing effects, was replaced by results from satellite laser ranging [62]. GRACE observes the integral sum of all mass variations in the hydrosphere, atmosphere, biosphere and oceans, as well as mass variations inside the earth. Gravity field solutions from ITSG-Grace2016 are already corrected for tides (ocean, earth and pole tides) and non-tidal atmospheric and oceanic effects. Trends from glacial isostatic adjustment are about zero in the study region. Therefore, the spherical harmonic coefficients from ITSG-Grace2016 primarily reflect variations in the terrestrial water storage. As GRACE-derived gravity solutions are contaminated with correlated noise, leading to the characteristic striping patterns in the north-south direction, the monthly fields were smoothed using the anisotropic DDK3 filter [63]. Filtering implies attenuation of the signal and further distortion, known as the leakage effect. Therefore, TWS time series derived for the three target areas via spatial averaging were rescaled using the scaling factor approach [64]. Here, scaling factors were derived from five global hydrological models for each target area separately [65]. All computations are accompanied by a thorough error propagation, which starts from the full error covariance matrices of the spherical harmonic coefficients and results in errors for the rescaled TWS time series. Since Lake Volta was not modeled in SWAT, the lake signal was computed using lake height variations and an area varying between 4450 km² and 9970 km² [66,67] and subsequently subtracted from the GRACE estimates.

Model Setup and Calibration/Validation
The model parametrization was conducted using the ArcSWAT 2012 interface [68]. The research areas were divided into sub-basins based on the DEM and the derived stream network. We used a streamflow delineation threshold of at least 500 km² for the southern and western models (1500 km² for the Niger model) and manually added outlets where data from gauging stations was available, generating 2153 sub-basins (South: 712; West: 630; Niger: 811). Next, the sub-basins were overlaid with land use and soil maps to derive Hydrological Response Units (HRUs), units with the same land use, soil and slope characteristics [40]. In view of computational efficiency, we opted to derive one HRU per sub-basin by considering dominant land use, soil and slope [11] (divided into 0-1%, >1-5% and >5% slope).
The dominant land use distribution for each model is displayed in Figure 2. In the South model, range-brush is the dominant land use type (53.6%), followed by agriculture (31.4%) and forest (7.8%). Both forest and agricultural areas are mostly located in the more humid south, while rangeland dominates in the arid north. In the western model, rangeland (brush and grasses) dominates with 40.4 and 47.8%, respectively; 8.6% of the area is barren and 2.9% is under agricultural use. Land use in the Niger model is split almost equally between range grasses, barren, agriculture and range brush (26.4, 24.6, 23.3 and 22.2%). The high prevalence of barren areas can be explained by the hydrologically inactive part of the basin, located in the north-east [69]. Only 2.6% of the area is predominantly forested.
After generating the HRUs, reservoirs were included as described in Section 2.3. Due to uncertainties in the climate data, the potential evapotranspiration was calculated using the 1985 Hargreaves equation, which requires only temperature and extraterrestrial radiation inputs [40,70]. In SWAT, extraterrestrial radiation is calculated as a function of location and time of year [40]. The Hargreaves method has been suggested if the input data quality is in doubt [71]. Since the evapotranspiration processes in the study region are water-limited, more emphasis should be placed on the actual evapotranspiration, as it directly influences runoff generation [72]. SWAT calculates the Hargreaves (1985) equation as follows (Equation (1)):
$$\lambda E_o = 0.0023 \cdot H_0 \cdot (T_{mx} - T_{mn})^{0.5} \cdot (T_{av} + 17.8) \qquad (1)$$
where λ is the latent heat of vaporization in MJ/kg, E_o is the potential evapotranspiration in mm, H_0 is the extraterrestrial radiation in MJ/m², T_mx is the maximum air temperature in °C, T_mn is the minimum air temperature in °C, and T_av is the average air temperature in °C [40].
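To make the Hargreaves calculation concrete, the following Python sketch evaluates the 1985 equation for a single day. It is an illustration only and is not taken from the SWAT source code: the function names are hypothetical, the input values are invented, and the linear temperature approximation for the latent heat of vaporization is an assumption of this example.

```python
import math


def latent_heat_of_vaporization(t_av):
    """Latent heat of vaporization in MJ/kg for a mean air temperature in deg C.

    Assumption of this sketch: a simple linear temperature dependence.
    """
    return 2.501 - 2.361e-3 * t_av


def hargreaves_pet(h0, t_mx, t_mn, t_av=None):
    """Potential evapotranspiration E_o in mm for one day.

    h0   : extraterrestrial radiation in MJ/m^2 for that day
    t_mx : maximum air temperature in deg C
    t_mn : minimum air temperature in deg C
    t_av : average air temperature in deg C (defaults to the min/max midpoint)
    """
    if t_mx < t_mn:
        raise ValueError("t_mx must not be smaller than t_mn")
    if t_av is None:
        t_av = 0.5 * (t_mx + t_mn)
    lam = latent_heat_of_vaporization(t_av)
    # lambda * E_o = 0.0023 * H_0 * (T_mx - T_mn)^0.5 * (T_av + 17.8)
    return 0.0023 * h0 * math.sqrt(t_mx - t_mn) * (t_av + 17.8) / lam


# Invented example values for a hot day with a large diurnal range.
pet = hargreaves_pet(h0=36.0, t_mx=35.0, t_mn=22.0)
print(f"E_o = {pet:.2f} mm")
```

Because the diurnal temperature range enters through a square root, a day with identical minimum and maximum temperature yields zero potential evapotranspiration in this formulation.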
The simulation covered the period of 1998 to 2013 with a warm-up period from 1996 to 1997. Since CMORPH precipitation data was only available from 1998 onwards, data from 1998 to 1999 was used as fictitious data for the warm-up period in order to maximize the simulation period.
The calibration of a semi-distributed watershed model such as SWAT is challenging due to input uncertainties, model uncertainties and parameter non-uniqueness [10,73]. For the calibration of our models, the Sequential Uncertainty Fitting version 2 (SUFI-2) procedure of SWAT-CUP (Calibration and Uncertainty Programs, developed by Karim Abbaspour of the Swiss Federal Institute of Aquatic Science and Technology (EAWAG), Dübendorf, Switzerland) [73] was used. In SUFI-2, all uncertainty (parameter, model and input uncertainty) is mapped onto the parameter ranges. Uncertainties are quantified by the p-factor, which measures the percentage of the observed data falling into the 95% prediction uncertainty (95PPU) band. A further measure, the r-factor, describes the average width of the 95PPU band relative to the standard deviation of the observed data. Ideally, one wants the p-factor to be as large as possible and the r-factor to be as small as possible [73].
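The two uncertainty measures can be sketched in a few lines of Python. This is a minimal illustration, not the SWAT-CUP implementation: the function names and the example numbers are hypothetical, the p-factor is returned as a fraction rather than a percentage, and the use of the sample standard deviation is an assumption of this sketch.

```python
import statistics


def p_factor(observed, lower, upper):
    """Fraction of observations bracketed by the 95PPU band (0 to 1)."""
    inside = sum(1 for o, lo, up in zip(observed, lower, upper) if lo <= o <= up)
    return inside / len(observed)


def r_factor(observed, lower, upper):
    """Mean width of the 95PPU band divided by the std. dev. of the observations."""
    mean_width = statistics.fmean(up - lo for lo, up in zip(lower, upper))
    return mean_width / statistics.stdev(observed)


# Invented monthly discharge values and uncertainty band limits.
observed = [10.0, 20.0, 30.0, 40.0]
lower = [8.0, 15.0, 32.0, 35.0]
upper = [12.0, 25.0, 40.0, 45.0]

p = p_factor(observed, lower, upper)
r = r_factor(observed, lower, upper)
print(f"p-factor = {p:.2f}, r-factor = {r:.2f}")
```

The example shows the trade-off described above: widening the band would bracket the third observation and raise the p-factor, but only at the cost of a larger r-factor.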
The model was calibrated using discharge data from 62 gauging stations. Available daily data was aggregated to monthly data: gaps of up to seven consecutive days were filled by interpolation, while months containing longer gaps were deleted. Due to large data gaps and different lengths of the discharge time series, which did not allow for fixed calibration and validation periods, the first two thirds of the discharge data were used for calibration and the last third for validation [36].
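The aggregation and splitting rules can be sketched as follows. This is one possible reading of the procedure, not the code used in the study: the function names are hypothetical, linear interpolation is assumed as the gap-filling method, gaps at the very start or end of a series are left unfilled, and a month is dropped as soon as a single day in it remains missing.

```python
from collections import defaultdict
from datetime import date, timedelta

MAX_GAP_DAYS = 7  # longest run of missing days that is still interpolated


def fill_short_gaps(values, max_gap=MAX_GAP_DAYS):
    """Linearly interpolate runs of None of at most max_gap days."""
    filled = list(values)
    i, n = 0, len(values)
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        start = i
        while i < n and values[i] is None:
            i += 1
        gap = i - start
        # Only interior gaps with an observation on both sides are filled.
        if gap <= max_gap and start > 0 and i < n:
            left, right = values[start - 1], values[i]
            for k in range(gap):
                filled[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return filled


def monthly_means(start_day, values):
    """Mean per (year, month); months with any remaining gap are dropped."""
    by_month = defaultdict(list)
    for offset, value in enumerate(values):
        day = start_day + timedelta(days=offset)
        by_month[(day.year, day.month)].append(value)
    return {
        month: sum(vals) / len(vals)
        for month, vals in sorted(by_month.items())
        if all(v is not None for v in vals)
    }


def split_two_thirds(items):
    """First two thirds for calibration, last third for validation."""
    cut = (2 * len(items)) // 3
    return items[:cut], items[cut:]


# Invented daily series for January to March 2000 with one short and one long gap.
daily = [float(i) for i in range(91)]
daily[10:13] = [None] * 3    # 3 missing days in January: interpolated
daily[40:50] = [None] * 10   # 10 missing days in February: month dropped
monthly = monthly_means(date(2000, 1, 1), fill_short_gaps(daily))
calibration, validation = split_two_thirds(list(range(9)))
```

In this example January is retained after interpolation, February is discarded, and March is unaffected.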
Two different calibration approaches were used. In the first approach (v1), the model parameters were globally calibrated, while in the second approach (v2), upstream sub-basins were calibrated separately from downstream sub-basins so that an uneven distribution of discharge gauges does not influence the results [1,11,36].
A wide variety of potential parameters and ranges for calibration were identified using available literature and the SWAT manuals [74,75]. In a second step, the effects of the parameter ranges on model results were identified through a one-at-a-time sensitivity analysis coupled with a custom R script graphically representing the reaction of SWAT storages and flows. This way, realistic parameter ranges were defined for the research area. SWAT-CUP allows certain parameters to be calibrated separately by soil texture or land use type, which further increases the number of parameters. In our approach, we included all potential parameters in an initial iteration with 500 (v1) and 1000 (v2) model runs and used the SWAT-CUP sensitivity analysis tool to assess the global sensitivity of each parameter [76]. SWAT-CUP determines the parameter sensitivity by multiple regression of the Latin Hypercube generated parameter values against the objective function and by performing a t-test. Parameters with a p-value of <0.05 are assumed to be sensitive [76]. To reach an acceptable calibration, three iterations with 500 model runs each were performed with the sensitive parameters. Parameter ranges are updated automatically after each iteration. If an acceptable calibration is reached, the validation is performed using the same parameter ranges and number of simulations. An overview of the included parameters is given in Table 2.
The Kling-Gupta Efficiency (KGE) was chosen as the objective function, as it can be decomposed into correlation, bias and relative variability between simulated and observed variables [77]. SWAT-CUP implements the 2009 equation [76,77]. KGE can take values from −∞ to 1 and is calculated as follows (Equation (2)) [77]:
$$KGE = 1 - \sqrt{(r - 1)^2 + \left(\frac{\sigma_s}{\sigma_m} - 1\right)^2 + \left(\frac{\mu_s}{\mu_m} - 1\right)^2} \qquad (2)$$
where KGE is the Kling-Gupta Efficiency, r is the linear correlation coefficient between simulated and measured variables, σ is the standard deviation, µ is the mean value and s and m denote simulated and measured values, respectively. In this study, we consider KGE values of ≥0.5 to be good and values ≥0.7 to be very good.
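The decomposition into correlation, variability and bias terms can be made explicit in a short Python sketch of the 2009 formulation. It is an illustration under stated assumptions rather than the SWAT-CUP code: the function name is hypothetical, population standard deviations are used, and the example series are invented.

```python
import math
import statistics


def kge(simulated, measured):
    """Kling-Gupta Efficiency (2009 formulation) of two equally long series."""
    mu_s, mu_m = statistics.fmean(simulated), statistics.fmean(measured)
    sigma_s, sigma_m = statistics.pstdev(simulated), statistics.pstdev(measured)
    covariance = sum(
        (s - mu_s) * (m - mu_m) for s, m in zip(simulated, measured)
    ) / len(measured)
    r = covariance / (sigma_s * sigma_m)   # linear correlation
    alpha = sigma_s / sigma_m              # relative variability
    beta = mu_s / mu_m                     # bias ratio
    return 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


measured = [1.0, 2.0, 3.0, 4.0]
perfect = kge(measured, measured)
doubled = kge([2.0 * m for m in measured], measured)
print(f"perfect fit: {perfect:.3f}, doubled flow: {doubled:.3f}")
```

The second case illustrates why a high correlation alone is not sufficient: a simulation that doubles every observation is perfectly correlated, yet the variability and bias terms pull the KGE well below zero. Note that the sketch is undefined for a constant series, because its standard deviation is zero.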
A further efficiency criterion used in this study is the Nash-Sutcliffe Efficiency (NSE) (Equation (3)) [78,79]:
$$NSE = 1 - \frac{\sum_{i=1}^{n} \left(Y_i^{obs} - Y_i^{sim}\right)^2}{\sum_{i=1}^{n} \left(Y_i^{obs} - Y^{mean}\right)^2} \qquad (3)$$
where Y_i^obs is the i-th observation of the variable to be evaluated, Y_i^sim is the i-th simulation of the variable to be evaluated, Y^mean is the mean of the observed variables and n is the number of observations. Similar to KGE, NSE can range from −∞ to 1, where values ≥0.5 are acceptable and values ≥0.7 very good [79]. Finally, the percent bias of the model (PBIAS) is calculated as follows (Equation (4)) [79]:
$$PBIAS = \frac{\sum_{i=1}^{n} \left(Y_i^{obs} - Y_i^{sim}\right) \cdot 100}{\sum_{i=1}^{n} Y_i^{obs}} \qquad (4)$$
where Y_i^obs is the i-th observation of the variable to be evaluated and Y_i^sim is the i-th simulation of the variable to be evaluated. Positive values represent an underestimation and negative values an overestimation by the model.
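Both criteria are straightforward to compute. The Python sketch below follows the definitions given above; the function names and the example discharge values are hypothetical and serve only to illustrate the sign convention of PBIAS.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 equals the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def pbias(observed, simulated):
    """Percent bias: positive means the model underestimates the observations."""
    return 100.0 * sum(o - s for o, s in zip(observed, simulated)) / sum(observed)


# Invented example in which the model is consistently slightly too low.
observed = [10.0, 20.0, 30.0, 40.0]
simulated = [8.0, 18.0, 27.0, 37.0]

print(f"NSE = {nse(observed, simulated):.3f}")
print(f"PBIAS = {pbias(observed, simulated):.1f}%")
```

In this example the simulation is too low at every time step, so PBIAS is positive (underestimation), while the NSE stays close to 1 because the errors are small compared with the variance of the observations.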

Multi-Objective Validation
Calibration and validation of hydrological models is often done using observed discharge alone, whereby other aspects of the water balance are neglected [80]. In this study, we perform an additional validation of the model results by comparing actual evapotranspiration (ETa), soil moisture (SM) and total water storage (TWS) to remote sensing data. ETa was evaluated using the MODIS MOD16 satellite product [54,55]. We chose ETa to evaluate the model performance under uncertain precipitation and land use inputs, as well as to validate the Hargreaves evapotranspiration calculations. The modeled soil moisture was validated against the ESA CCI SM product [57]. We chose to validate the soil moisture, as its inter-annual variability is very high in West Africa and it is an important factor for crop production. The CCI product was used, as it optimally fits our period of interest. The evaluation of the soil moisture performance of SWAT proved problematic, as the model provides soil moisture in mm for the whole profile or for individual soil layers, while the CCI SM is given in percent over the upper few centimeters of the soil profile. Furthermore, SWAT calculates plant-available soil moisture rather than the absolute soil moisture given by the observation [81][82][83]. Therefore, we decided to focus on comparing the dynamics of simulations and observations instead of absolute values. Finally, the calculated total water storage was validated using GRACE TWS data. The SWAT total water storage change was estimated from the water storages by calculating the deviation from the mean water storage during the period of GRACE data availability (2003-2013) according to the following formula (Equation (5)):
$$\Delta TWS_t = (SW_t + SA_t + DA_t) - \frac{1}{n} \sum_{i=1}^{n} (SW_i + SA_i + DA_i) \qquad (5)$$
where ∆TWS_t is the total water storage change at time step t, SW_t is the soil water storage, SA_t is the shallow aquifer storage, DA_t is the deep aquifer storage and n is the number of time steps in the period of GRACE data availability. All units are in mm.
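The storage anomaly used for the comparison with GRACE can be sketched in Python as the deviation of the summed storages from their mean over the reference period. The function name and the storage values below are hypothetical; the sketch assumes that all three series are already basin averages in mm and restricted to the reference period.

```python
def tws_anomaly(soil_water, shallow_aquifer, deep_aquifer):
    """Total water storage change in mm relative to the period mean."""
    total = [
        sw + sa + da
        for sw, sa, da in zip(soil_water, shallow_aquifer, deep_aquifer)
    ]
    mean_total = sum(total) / len(total)
    return [t - mean_total for t in total]


# Invented monthly storages in mm for three time steps.
soil_water = [100.0, 150.0, 120.0]
shallow_aquifer = [300.0, 310.0, 305.0]
deep_aquifer = [1000.0, 1000.0, 1000.0]

anomalies = tws_anomaly(soil_water, shallow_aquifer, deep_aquifer)
print([round(a, 2) for a in anomalies])
```

Because the mean is removed, a storage that never changes (here the deep aquifer) has no influence on the anomaly, and the anomalies sum to zero over the reference period.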

Calibration and Validation Results
Results for the three models and two calibration approaches are listed in Table 3 and will be described in detail. For the v1 calibration (Figure 3), 78% of gauging stations reach a KGE higher than zero, meaning that the model performs better than if using observed mean values as predictors [77]. On average, 40% of the gauging stations reach a KGE of more than 0.5, while 13% are above 0.7, with the highest average KGE of 0.40 in the West (the overall best result) and the lowest average KGE of 0.14 found in the Niger region. The average bias is 14.36% and R² amounts to 0.52, with the highest R² of 0.[...]
Concerning validation, 33% of v1 stations reach a KGE above 0.5 and 12% above 0.7. However, the average KGE is 0.07 due to some poorly-performing stations. If we removed these stations, KGE would increase to 0.42. R² is the only criterion performing better in the validation than in the calibration (0.63 as opposed to 0.52). While the p-factor is similar to the calibration, the r-factor is influenced by the large uncertainty band of the West model and reaches 4.41. PBIAS has also increased to 17.59%. Best performances are reached in the West model.
While for the v2 approach (Figure 4) about the same number of stations score a KGE higher than zero (73%), it generally delivers less robust solutions, with only 32% of discharge stations reaching a KGE of 0.5 or higher and 10% reaching above 0.7, as opposed to 13% in v1. The average KGE value of simulation v2 is worse in the South (0.15) and Niger (0.08) models and decidedly worse in the West (−0.13). While the p-factor and r-factor perform slightly worse in the v2 approach, R² remains almost constant at 0.51. PBIAS is worse in v2, changing from a 14.36% underestimation to a 23.74% overestimation of streamflow (PBIAS: −23.74%). The Comoé, as well as certain upstream Niger basins, perform better. The v2 validation performs worse than the v1 validation, with 25% of stations reaching KGE values above 0.5 and 1% above 0.7. The average KGE is low at −2.94, due to bad performance in the West model. When only taking stations performing above zero into account, KGE is higher at 0.45 than in the v1 validation (0.42). The highest KGE is reached in the Niger (0.10). While the r-factor for v2 is decidedly better (0.78 as opposed to 4.41), the p-factor is slightly worse (0.32 vs. 0.46). v2 strongly overestimates streamflow (PBIAS: −285.95%), again mainly due to the performance of the West model. An example of monthly calibration and validation results for four selected discharge stations is given in Figure 5.
Displayed are the 95PPU ranges of both v1 and v2 calibration and validation, the observed data, as well as the key efficiency criteria p-factor, R² and KGE. The stations are located in the Ouémé (1 Ahlan), White Volta (2 Daboya), Gambia (3 Gouloumbo) and Niger (4 Lokoja) river basins. For v1, KGE values for all stations are between 0.65 and 0.75. On average, validations perform less well than calibrations except in Daboya (validation: 0.94). While performance increases in v2 in Lokoja, decreases can be observed for the other stations. Only during validation do Ahlan and Lokoja perform better in v2 than in v1. At this point, we conclude that for the calibration and validation of the model with discharge alone, the global calibration (v1) performed slightly better than the local calibration (v2).

Multi-Objective Validation Results
During the multi-objective validation, several output variables which were not used for calibration were kept for further validation by comparing them to MODIS ETa, ESA CCI SM and GRACE data.
Concerning actual evapotranspiration, validation results reach good scores, as shown in Table 4 and Figure 6. [...]
Finally, total water storage was calculated from SWAT outputs and validated using GRACE data (see Table 6 and Figure 8). Again, results show a good fit. Nonetheless, an overestimation of TWS during the dry seasons is apparent in all models, as well as a slight underestimation during the wet seasons in the Niger and South models. Also apparent is a phase shift in the model results by approximately half a month. Some fast changes, e.g., the sharp drop and rise in TWS during the wet season of 2012, are not visible in the simulation results at all. Performances vary, and a very high uncertainty in the West v2 model is immediately apparent. Otherwise, the dynamics of both calibrations perform similarly, with the best R² and NSE results reached in the globally calibrated models (0.82 and 0.79 as opposed to 0.61 and 0.56 in the locally calibrated models, respectively). All models except West v2 reach between acceptable and very good R² and NSE values, with the West v1 model performing best and the West v2 model performing worst.

Model Calibration/Validation Discussion
Results show that satellite and remote sensing data can be used to substitute missing observations and boundary conditions in a SWAT simulation. Results are promising, with especially successful calibrations and validations generated for the Ouémé, Gambia and lower Niger basins. It may be argued that during global calibration (v1), a prevalence of stations in a certain region may unduly influence the model if the weights of the stations remain the same [36]. In our case, this can be observed in the Niger basin, where two of the most upstream stations perform poorly because the calibration is influenced by the downstream gauging stations, but perform better when separately calibrated in v2. However, this effect does not explain the poor performance along the Black Volta river, as similarly poor results are observed in the v2 simulation. This was also reported by Schuol et al. [1]. Some of the discharge stations are highly influenced by upstream reservoirs for which no outflow data is available. Even when including reservoirs in the SWAT model, we noticed that downstream stations often performed poorly due to the limited amount of data available for proper reservoir setup. Also problematic is the decline in the availability of discharge measurements and uncertainty as to their quality, coupled with data gaps. In contrast to Schuol et al. [1], we did not include the Inner Niger Delta in the model. While they set up the delta as an artificial reservoir and defined the outflow according to a close downstream station, the closest station for our timeframe is located almost 500 km downstream.
Also, the Akosombo dam in southern Ghana, which creates Lake Volta, could not be included due to missing information about in- and outflows. While the lake was removed from GRACE-derived water storage change (by deriving mass variations using altimeter measurements and information on the lake area) to correspond to the simulations, the missing lake might lead to lower actual evapotranspiration simulations in this area.

If comparing the amount of discharge data available for the period modeled by Schuol et al. (calibration from 1970 to 1995) and this study (1998-2013), the decline in available discharge measurements becomes apparent, with the exception of the Ouémé basin, where we were able to secure additional stations. Interestingly, the distribution of well-performing stations is very similar in results from both studies, except for the upstream Niger stations, which performed less well in our approach. We observed that v1 performed better in the calibration and validation periods. We attribute this to the global sensitivity evaluation used in this study. While the 500 runs used to evaluate the sensitivity of the global calibration seem appropriate, we believe 1000 runs for the local sensitivity analysis might have been too low, especially considering the large number of parameters used, which influences the relative sensitivity of each parameter [84]. While [84] suggests that between 500 and 1000 runs suffice, we believe the effects of more runs, especially when using many parameters, should be studied. Opening the parameter ranges further might lead to increased p-factors and better calibrations/validations. However, effects of the parameters on hydrological processes not represented in the streamflow must be carefully assessed. We encountered several difficulties with unrealistic soil moisture and aquifer behavior using less restricted ranges, which led to poor multi-objective validation results. Furthermore, it can be assumed that if more stations with longer and more complete time series are available, better and more accurate results can be generated.

Multi-Objective Validation Discussion
It seems unrealistic to expect more discharge observations to become available in the near future. So far, discharge measurements based on satellite-derived water levels have been limited to rivers wider than about 100 m, their spatial coverage is limited by orbit patterns, and they rely on assumptions inherent to rating-curve approaches or river hydraulic modeling which are difficult to verify. Therefore, alternative methods for verifying the accuracy of hydrological model outputs must be explored [82]. The multi-objective validation allows us to assess the performance of the model for multiple aspects of the water balance. In terms of actual evapotranspiration, the remote sensing and reanalysis climate forcings allowed for a very good performance at the basin scale.
When looking at single sub-basins, however, the model tends to overpredict ETa in extremely arid areas. In some very humid sub-basins, ETa may likewise be underpredicted. When validating MOD 16 ETa for South Africa, underpredictions of between 13% and 35% have been found [85,86], leading us to assume that the apparent overestimation in arid areas in our model may in part be due to inaccuracies of the MODIS validation data, while the underestimation of ETa during the dry seasons in the southern model could be explained by Lake Volta not being simulated.
Dynamics of the modeled soil moisture fit the observations very well. The SWAT uncertainties increase markedly during the wet seasons due to a higher availability of water and thus greater influence of the governing parameters. SWAT SM outputs do not allow for a direct comparison, due to the lack of residual water content included in the results.
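Because the absolute values cannot be compared directly, one possible way to compare only the dynamics of two soil moisture series is to standardize each series before computing an error measure. The sketch below illustrates this idea; it is not necessarily the procedure used in this study, and the function names and the sample values are hypothetical.

```python
import math


def standardize(series):
    """Return the series as standardized anomalies (zero mean, unit variance)."""
    mean = sum(series) / len(series)
    sd = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / sd for x in series]


def rmse(a, b):
    """Root mean square error between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical monthly values, for illustration only (not data from the study).
esa_cci = [0.10, 0.15, 0.25, 0.20]   # satellite soil moisture, m3/m3
swat_sm = [40.0, 60.0, 100.0, 80.0]  # simulated soil water, mm

# After standardization, units and constant offsets no longer matter,
# so only differences in the temporal dynamics contribute to the score.
score = rmse(standardize(esa_cci), standardize(swat_sm))
print(f"RMSE of standardized anomalies: {score:.3f}")
```

A constant offset such as a missing residual water content is removed by the standardization, which is why this kind of comparison addresses dynamics rather than absolute soil water amounts.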
The validation of the simulated total water storage with GRACE showed good agreement with some peculiarities. The phase shift of one-half month that we identify, especially in the South and West models, has also been observed by Grippa et al. [87] and Ndehedehe et al. [88] when comparing multi-model results with GRACE solutions for West Africa. The most noticeable difference between model and GRACE solutions is the very pronounced decline in TWS during the dry season retrieved by GRACE, which is not always captured by SWAT. This discrepancy is very strong in the West and Niger models, and while it may in part be due to our calculation of the TWS change in SWAT, similar observations have been reported in other studies. Grippa et al. compared water storage anomalies derived from nine land surface models to GRACE, both for the Sahel and West Africa [87]. Their findings are very similar to ours, with SWAT TWS change estimations of our West model comparing well to the Sahel zone and the South model to the West Africa zone, while the Niger model lies in between the two. They assume that incorrectly modeled evapotranspiration during the dry season led to these results. Boone et al. [89] also compared land-surface models (LSMs) with GRACE over West Africa and came to the conclusion that the difference in amplitudes might be due to deficits in the precipitation forcing of the LSMs, their insufficient soil depth (where water percolating past a certain depth is lost, similar to SWAT) or the overestimation of the storage anomalies by GRACE during the dry season. Similar observations were made by Ndehedehe et al. (2016), who speculate that differences might be due to anthropogenic influences intensifying land surface processes which the models cannot capture, or the lack of observed data for model calibration leading to improper soil moisture outputs and thus wrong TWS solutions [88]. Werth et al. [90] observed an increase in total water storage over the Niger river basin of 7 mm/year and conclude this to be mainly due to an accumulation of groundwater in the Sahel Zone. While we observe positive trends of the total water storage for all models except West v2, which is influenced by high uncertainties, trends in SWAT are generally lower than the GRACE solutions. Furthermore, several studies [20,90–92] report a clear positive trend toward a higher total water storage over the Volta basin since 2007 due to increased precipitation. We have seen a similar effect before removing the Lake Volta signal from the GRACE solution, where we observed a trend of 25 mm/year from January 2007 to December 2010. Afterwards, a positive trend is much less evident, and we conclude that their results were masked by the strong signal of the lake.
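Trends such as those quoted above (in mm/year) can be obtained from a monthly series with an ordinary least-squares line. The sketch below shows this calculation under simple assumptions: the series is evenly spaced and gap-free, and no seasonal cycle is removed beforehand. How the trends in this study were actually derived is not stated in this section, and the function name and sample series are hypothetical.

```python
def linear_trend_per_year(monthly_values):
    """Least-squares slope of an evenly spaced monthly series, per year."""
    n = len(monthly_values)
    years = [i / 12.0 for i in range(n)]  # time axis in years
    t_mean = sum(years) / n
    y_mean = sum(monthly_values) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(years, monthly_values))
    sxx = sum((t - t_mean) ** 2 for t in years)
    return sxy / sxx


# Hypothetical TWS anomalies (mm) rising by 2 mm each month over four years.
tws_anomaly = [2.0 * month for month in range(48)]
trend = linear_trend_per_year(tws_anomaly)
print(f"Trend: {trend:.1f} mm/year")
```

For real TWS series with a strong annual cycle, the slope depends on the start and end months chosen, so the seasonal signal is often removed or fitted jointly before estimating a trend.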

Conclusions
To the authors' knowledge, this is the first time that a SWAT model has been calibrated using remote sensing and reanalysis inputs and validated for streamflow, actual evapotranspiration, soil moisture dynamics and total water storage simultaneously, demonstrating its robustness and predictive capability. Results show that SWAT simulations for different sparsely-gauged regions of West Africa using freely available remote sensing and reanalysis datasets as input perform surprisingly well. This framework significantly eases the modeler's task of acquiring the necessary climatological, land use and soil data to parameterize a physically-based model. Especially considering the lack of measurements conducted in-situ, the use of remote sensing is essential to produce meaningful estimates of the water resources in West Africa. While the models perform well using two different calibration and validation schemes, it is necessary to further validate parameters apart from streamflow, otherwise errors in other parts of the water balance might be overlooked. Worqlul et al., for example, have shown that streamflow may be well simulated even if the input precipitation data have large errors [26]. We therefore chose to additionally validate actual evapotranspiration, soil moisture and total water storage outputs. The multi-objective validation produced very good results and confirmed that the model performs well in the study area. While our approach delivers good results at the regional, sub-continental scale, we realize that it might not be appropriate to model smaller catchments. The model framework could be further improved if data become available to accurately model the Niger Inland Delta and Lake Volta. Also, the sensitivity analysis procedure should be improved if using a large number of potential parameters, as in our v2 approach. Furthermore, parameters such as actual evapotranspiration or leaf area index could be included in a multi-objective calibration using SWAT-CUP. Our framework offers possibilities for further evaluation of the water cycle in West Africa. In ongoing work, we plan to evaluate the model performance against global hydrological models to investigate capabilities and limitations of these models and investigate the model response to extreme drought and flood events. Also, the performance of SWAT with different remote sensing inputs can be evaluated for the region. Nonetheless, it is the authors' opinion that remote sensing data should only be used to complement and not replace discharge and other in-situ measurements for model calibration and validation. Despite the availability of satellite measurements, we believe countries should still invest in in-situ measurement networks.

Figure 1. Research Area, Soil and Water Assessment Tool (SWAT) Models and Available Discharge Stations.

Water 2018, 10, x FOR PEER REVIEW 10 of 22

highest average KGE of 0.40 in the West (the overall best result) and the lowest average KGE of 0.14 found in the Niger region. The average bias is 14.36% and R2 amounts to 0.52, with the highest R2 of 0.57 reached in the West model and the lowest value of 0.46 in the Niger model. While the range of the model uncertainty (r-Factor) is 0.82, the percentage of data bracketed by the 95PPU (p-Factor) is 47%. Calibrations of the southern model perform best in the Ouémé and White Volta basins and worst in the Black Volta basin. For the western model, best performances can be observed for the downstream Gambia tributary rivers, while some upstream stations perform less well. For the Niger model, the best performance is reached downstream of the confluence of the Benue and the Niger in Lokoja, while it performs worst in most of the most upstream sub-basins.
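For readers unfamiliar with the efficiency criteria reported here, the following sketch computes KGE, PBIAS and R2 for a pair of observed and simulated discharge series. It assumes the original Kling-Gupta formulation (Gupta et al., 2009), R2 as the squared Pearson correlation, and the PBIAS sign convention in which positive values indicate underestimation, which matches how PBIAS is interpreted in this text. The exact formulations used by the calibration software may differ, and the function names and sample data are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated values."""
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def r_squared(obs, sim):
    """Coefficient of determination as the squared Pearson correlation."""
    return _pearson_r(obs, sim) ** 2


def pbias(obs, sim):
    """Percent bias; positive values mean the model underestimates."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability and bias ratios."""
    mo, ms = _mean(obs), _mean(sim)
    r = _pearson_r(obs, sim)
    sd_o = math.sqrt(sum((o - mo) ** 2 for o in obs) / len(obs))
    sd_s = math.sqrt(sum((s - ms) ** 2 for s in sim) / len(sim))
    alpha = sd_s / sd_o  # variability ratio
    beta = ms / mo       # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical monthly discharge (m3/s); the simulation is half the observation.
observed = [10.0, 20.0, 40.0, 30.0]
simulated = [0.5 * q for q in observed]
print(kge(observed, simulated), pbias(observed, simulated), r_squared(observed, simulated))
```

The example illustrates why R2 alone can be misleading: a simulation that is perfectly correlated with the observations but only half as large still reaches an R2 of 1, while KGE and PBIAS both penalize the bias.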

Figure 3. Calibration and Validation Results of the v1 (Global Calibration) Models.

a KGE of 0.5 or higher and 10% reaching above 0.7, as opposed to 13% in v1. The average KGE value of simulation v2 is worse in the South (0.15) and Niger (0.08) models and decidedly worse in the West (−0.13). While p and r perform slightly worse in the v2 approach, R2 remains almost constant at 0.51. PBIAS is worse in v2, dropping from 14.36% underestimation to −23.74% overestimation of streamflow. The Comoé, as well as certain upstream Niger basins, perform better. The v2 validation performs worse than the v1 validation, with 25% of stations reaching KGE values above 0.5 and 1% above 0.7. The average KGE is low at −2.94, due to bad performance in the West model. When only taking stations performing above zero into account, KGE is higher (0.45) than in the v1 validation (0.42). The highest KGE is reached in the Niger (0.10). While the r value for v2 is decidedly better (0.78 as opposed to 4.41), p is slightly worse (0.32 vs. 0.46). v2 strongly overestimates streamflow (PBIAS: −285.95), again mainly due to the performance of the West model. R2 performs similarly to v1. Best validation results are reached in the White Volta, Oti and Ouémé.
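The p and r values discussed above describe the 95PPU uncertainty band: p is the share of observations falling inside the band, and r is the average band width relative to the spread of the observations. The sketch below follows these commonly used definitions. Whether the calibration software uses the population or the sample standard deviation is an assumption here (the population form is used), and the function names and sample data are hypothetical.

```python
import math


def p_factor(obs, lower, upper):
    """Fraction of observations bracketed by the lower and upper 95PPU bounds."""
    inside = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return inside / len(obs)


def r_factor(obs, lower, upper):
    """Mean width of the 95PPU band divided by the standard deviation of obs."""
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(obs)
    mean_obs = sum(obs) / len(obs)
    # Assumption: population standard deviation of the observations.
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / len(obs))
    return mean_width / sd_obs


# Hypothetical observations and uncertainty bounds, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
ppu_lower = [0.0, 1.5, 3.5, 3.0]
ppu_upper = [2.0, 2.5, 4.5, 5.0]

p = p_factor(observed, ppu_lower, ppu_upper)
r = r_factor(observed, ppu_lower, ppu_upper)
print(f"p-factor: {p:.2f}, r-factor: {r:.2f}")
```

The two values have to be read together: a wide band brackets more observations (higher p) but at the price of a larger r, so a high r such as the 4.41 mentioned above indicates a very wide uncertainty band.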

Figure 4. Calibration and Validation Results of the v2 (Local Calibration) Models.

An example of monthly calibration and validation results for four selected discharge stations is given in Figure 5. Displayed are the 95PPU ranges of both v1 and v2 calibration and validation, the observed data, as well as the key efficiency criteria p, R2 and KGE. The stations are located in the Ouémé (1 Ahlan), White Volta (2 Daboya), Gambia (3 Gouloumbo) and Niger (4 Lokoja) river basins.

For v1, KGE values for all stations are between 0.65 and 0.75. On average, validations perform less well than calibrations except in Daboya (Validation: 0.94). While performances increase during v2 in Lokoja, decreases can be observed for the other stations. Only during validation do Ahlan and Lokoja perform better than v1. At this point, we conclude that for the calibration and validation of the model with discharge alone, the global calibration (v1) performed slightly better than the local calibration (v2).

Figure 7. Monthly Simulated Soil Moisture Validation against ESA CCI Data, where SWAT v1 is the global and v2 the local calibration.


Figure 8. Monthly Simulated Total Water Storage Validation against GRACE Data, where SWAT v1 is the global and v2 the local calibration.
[49,50] concluded that the Climate Prediction Center Morphing Technique (CMORPH) version 1 CRT produced by the National Oceanic and Atmospheric Administration Climate Prediction Center (NOAA-CPC) performed best. CMORPHv1 CRT is a global precipitation analysis algorithm, including satellite infrared and microwave precipitation estimates as well as rain gauge information for bias correction. Precipitation estimates are available from 1998 onwards at a resolution of 0.25° [49,50]. Minimum and maximum 2 m daily temperature data were compiled from the NASA MERRA 2 reanalysis dataset. Inputs from both satellite and ground data are included at a resolution of 0.625° × 0.5° [23]. While SWAT-ready climate input files based on the National Centers for Environmental Prediction (NCEP) climate forecast system reanalysis data (CFSR) [52] are readily available, as discovered in Poméon et al. [23], CFSR precipitation information compares worse to other products in the region. No other climate data were necessary as the authors selected Hargreaves as the potential evapotranspiration method.

• Discharge and reservoirs: Discharge data used in this study were obtained from the German Global Runoff Data Center (GRDC) in Koblenz, the French AMMA-CATCH regional observing system, as well as through personal communication with local agencies. Discharge stations and their temporal coverage (without gaps) are depicted in Figure 1 and summarized in Table 1. The 12 largest reservoirs in the study area where downstream discharge observations are available were included in the model. Reservoir information was provided by the Global Water System Project (GWSP) Global Reservoir and Dam (GRanD) database version 1.1 created by Lehner et al.

Table 1. Selected River Basins and Discharge Gauges in the Study Area.

Table 2. Parameters Included in SWAT Model and Initial Ranges.

Table 3. Calibration and Validation Results for v1 and v2 Models.

Table 4. Actual Evapotranspiration Validation against MODIS MOD 16 Data.

Table 6. Total Water Storage Validation against GRACE Data.