General Assessment of the Operational Utility of National Water Model Reservoir Inflows for the Bureau of Reclamation Facilities

This work investigates the utility of the National Oceanic and Atmospheric Administration’s National Water Model (NWM) for water management operations by assessing the total inflow into a select number of reservoirs across the Central and Western U.S. Total inflow is generally an unmeasured quantity, though critically important for anticipating both floods and shortages in supply over a short-term (hourly) to sub-seasonal (monthly) time horizon. The NWM offers such information at over 5000 reservoirs across the U.S., however, its skill at representing inflow processes is largely unknown. The goal of this work is to understand the drivers for both well performing and poor performing NWM inflows such that managers can get a sense of the capability of NWM to capture natural hydrologic processes and in some cases, the effects of upstream management. We analyzed the inflows for a subset of Bureau of Reclamation (BoR) reservoirs within the NWM over the long-term simulations (retrospectively, seven years) and for short, medium and long-range operational forecast cycles over a one-year period. We utilize ancillary reservoir characteristics (e.g., physical and operational) to explain variation in inflow performance across the selected reservoirs. In general, we find that NWM inflows in snow-driven basins outperform those in rain-driven, and that assimilated basin area, upstream management, and calibrated basin area all influence the NWM’s ability to reproduce daily reservoir inflows. The final outcome of this work proposes a framework for how the NWM reservoir inflows can be useful for reservoir management, linking reservoir purposes with the forecast cycles and retrospective simulations.


Introduction
Management decisions regarding reservoir storage and releases require accurate and timely knowledge of the contributing inflows from the basin [1][2][3][4][5]. Most global watersheds still remain completely ungauged, partially gauged or poorly gauged [6][7][8]. This lack of hydrological information also limits the possibility to measure reservoir inflows, largely due to high uncertainties in over-lake precipitation estimates and unmeasured streamflow from minor tributaries [9,10]. Even if some efforts have been made in using satellite data and radar data to implement reservoir inflow estimates [11][12][13][14] Water 2020, 12, 2897 3 of 30 the usage of continental scale hydrologic modeling (such as NWM) in order to expand geographic coverage of forecast products to ungauged areas and (vi) provide predictive capabilities on key water budgeting components such as snow states, soil moisture, evapotranspiration, runoff, and streamflow. While recent publications on NWM and operational interest concentrate streamflow evaluation [34], this paper focuses on understanding the ability of the NWM to capture total reservoir inflow, defined as the total contribution of tributaries (gauged and ungauged) into reservoirs, and the conditions for which the NWM can be useful for water use planning. The intent of this paper is to examine and explain the behavior of NWM reservoir inflows to inform BoR planning and operations, especially in reservoirs and watersheds that do not have total inflow forecast information available and for which managers are charged with making short to mid-order decisions (0−30 day) based on inflow information.
In a two-part approach, we first conduct a quantitative evaluation of NWM forecasts and multi-year retrospective simulations. Second, we use the assessment results to investigate the physical mechanisms that drive differences in performance across reservoirs and contextualize this performance within the BoR's reservoir operations objectives. As a final outcome of this work, we develop a set of recommendations for integrating the NWM into BoR reservoir operation decisions. This work is the first of its kind to evaluate a large set of NWM reservoir inflows for the specific purpose of exploring pathways for use in BoR operations and management. This paper is organized as follows: Section 2 describes the materials and methods for the inflow analyses, Section 3 illustrates the results, Section 4 presents a discussion of the results, and Section 5 offers concluding remarks.

NWM Background
Since inception into operations on the National Centers for Environmental Prediction (NCEP) high-performance computing system in August 2016, the NOAA−NWM has been providing real-time distributed continuous hydrologic forecasts across the Continental United States (CONUS), its contributing watersheds, and Hawaii, in four temporal cycles. The NWM development cycle is such that it is difficult to complete a detailed analysis of one version before the next version is updated and pushed for operations. During this study the NWMv1.2 upgraded to v2.0, so for completeness (availability of a full year of forecasts), we focus this study on v1.2. As of NWMv1.2, the cycles consist of an analysis and assimilation (hourly), short range (hourly from 1−18 h), medium range (six hourly, 1−10 days), and long range (daily, four-member ensemble, 1−30 days); for details on the cycles and configurations, see the service agreement [35], the NOAA official website [36] and references therein. The model base for NWM is the Weather Research Forecast (WRF)−Hydro system, which is described by [37]. The NWM configuration is an instance of the WRF−Hydro modeling framework that couples the Noah−MP land surface model at a one-kilometer grid spacing to an overland routing diffusive wave scheme at 250 m; the resulting infiltration excess is mapped from its catchments to a channel network based on the National Hydrography Dataset (NHD)Plus v2.1 [38].
The outputs of the NWM include streamflow and other hydrologic variables that describe hydrological states on the land surface (moisture and heat fluxes, snowpack characteristics, soil moisture and infiltration variables, etc.), at~2.7 million river reaches and 5459 waterbodies (lakes and reservoirs as of NWM v2.0).
The physics of the WRF−Hydro channel and waterbody routing and representation are fully described in [37]. Briefly, the NWM collects streamflow from contributing streams and aggregates it within a waterbody to calculate the total inflow per time step. The outflow is then routed using a simple level-pool scheme, whereby outflow is controlled by static physical parameters (surface area and maximum depth) and parameterized outflow mechanisms (orifice and weir). Waterbodies in the NWM are required to have a minimum surface area of one km 2 , contain at least one "on-channel" inflow tributary reach, and have exactly one discharge point. Note that due to data limitations, the NWM does not currently distinguish between a managed reservoir and a natural lake, though as data services Water 2020, 12, 2897 4 of 30 like the BoR Reclamation Information Sharing Environment (RISE) data service [39], the US Army Corps Water Management System (CWMS) and Access2Water (AWS) become more streamlined, it may be possible to ingest classification information to inform such parameters and discharge mechanisms. Development of data services to ingest real-time observations and forecasts of reservoir outflows is on-going and planned for implementation in a future NWM version. Kim et al. (2020) [40] explored the option to couple the NWM with a more sophisticated reservoir management model: the Hydrologic Evaluation Center's ResSim. In this configuration, NWM inflows were used as inputs into ResSim, which computed outflows based on rule curves and operating plans; results for the Russian River case showed significant improvement in representing managed outflows over the default NWM level pool scheme. This type of advancement provides a pathway for using NWM inflows to couple with more advanced outflow models, strengthening the motivation for understanding and improving NWM reservoir inflow processes.

Reconstructed Total Inflow Observations
The inflow observations used for these comparisons are not direct observations, but are reconstructed inflow estimates based on reservoir daily release and change in reservoir storage. Reservoir evaporation and change in bank storage may be included in the calculations, depending on the availability of the required parameters per reservoir [41].
The BoR distributes this information through the Reclamation Information Sharing Environment (RISE) data portal, a publically available data service to view, access, and download historic and near real-time data [39].
Generally, the reconstructed regulated total inflow calculation attempts to reproduce the true total inflow of a reservoir, taking into account both natural and regulated components upstream. A second product, unregulated inflow, removes the contributions from upstream reservoir operations but still includes diversions/depletions to provide a naturalized inflow that is more independent of management [42]. In the present paper, we use the BoR regulated total inflow to compare with NWM total inflow variable, since our main target is to assess the NWM's ability to capture the observed states. Additionally, unregulated inflows were only available for three of the reservoirs in this selection, and after a preliminary analysis, these reconstructions were not directly applicable for the purposes of this project.
Finally, although the regulated total inflow is the best available observation for this analysis, it is a derived product and carries intrinsic uncertainties. These uncertainties and observation errors vary site by site, mainly impacted by the quality of elevation, evaporation and (in some cases) wind and bank storage measurements. For this reason, the quality of the reconstructed reservoir inflows in this study has been assessed through qualitative inspection of the hydrographs and through input from BoR. As a result, each reservoir in the analysis has been assigned a code to denote quality from red (noted data defects) to green (high fidelity), where orange is moderate fidelity.
Additionally, during periods of ice coverage or near-surface freezing, the BoR observations are not available, thus some time gaps exist in the observed inflows; reservoirs with more than 30% missing data over the total period were discarded.

Reservoir Selection
The reservoirs analyzed in this study were selected to include as many reservoirs as possible in multiple regions of interest to the BoR. The reservoirs selected met two primary criteria: their presence in both NWMv1.2 and NWMv2.0, and the availability and quality of observations from the RISE data service. The intersection of NWM and RISE reservoirs resulted in 42 reservoirs, which serve as the basis for this analysis. From these initial 42 reservoirs, we then discarded reservoirs with more than 30% missing data in either the retrospective or forecast period of interest, respectively, resulting in a final count of 33 reservoirs for both retrospective and forecast evaluation, eight reservoirs only for the forecast analyses (since they had 30% missing data for the retrospective time-period but they matched the criteria for the forecast time-period) and one reservoir (Lake Berriessa and Monticello Dam, ber) that matched the data availability criteria for the retrospective, but not for the forecast evaluation. Figure 1 shows the locations of the reservoirs and associated watersheds colored by dominant Ecoregion (Level 3), which are groupings of geographic regions by similar physical attributes (see [43]).
Water 2020, 12, x FOR PEER REVIEW 5 of 32 time-period but they matched the criteria for the forecast time-period) and one reservoir (Lake Berriessa and Monticello Dam, ber) that matched the data availability criteria for the retrospective, but not for the forecast evaluation. Figure 1 shows the locations of the reservoirs and associated watersheds colored by dominant Ecoregion (Level 3), which are groupings of geographic regions by similar physical attributes (see [43]).

Reservoir Characterization
One goal of this effort is to identify the driving mechanisms for NWM inflow performance so that the results can be generalized and the model may be improved. Thus in the evaluation, we classified reservoirs according to characteristics that may influence the model performance: Ecoregion, percent of watershed area calibrated, drainage area of the contributing watershed, percent assimilated area upstream (described in this section and in Section 3.2 below), number of lakes upstream and percentage of snow contribution over total precipitation. Some of the characteristics mentioned are referring to physical characteristics of the watershed itself (e.g., Ecoregion, drainage area, percentage of snow contribution), while others are particularly linked to the NWM architecture (percent of watershed area calibration, percent of assimilated area upstream). Others, like Ecoregion, link both physical characteristics of the watersheds and the NWM architecture, since the NWM parameters are calibrated in representative watersheds and then regionalized as a function of Ecoregion in v1.2. For further information about NWM calibration, the reader is referred to Gochis et al., 2019 [44].
The characteristics were attributed as follows: Ecoregions-Level 3; the drainage area of the reservoir's contributing watershed was calculated using the NWM flow network; the percentage of snow contribution was attributed using the GAGES2 dataset [45] where gauges from this dataset were associated with the watersheds of each reservoir. The variable: "snow percent of total precipitation estimate, mean for period 1901−2000" was used to define this characteristic, following

Reservoir Characterization
One goal of this effort is to identify the driving mechanisms for NWM inflow performance so that the results can be generalized and the model may be improved. Thus in the evaluation, we classified reservoirs according to characteristics that may influence the model performance: Ecoregion, percent of watershed area calibrated, drainage area of the contributing watershed, percent assimilated area upstream (described in this section and in Section 3.2 below), number of lakes upstream and percentage of snow contribution over total precipitation. Some of the characteristics mentioned are referring to physical characteristics of the watershed itself (e.g., Ecoregion, drainage area, percentage of snow contribution), while others are particularly linked to the NWM architecture (percent of watershed area calibration, percent of assimilated area upstream). Others, like Ecoregion, link both physical characteristics of the watersheds and the NWM architecture, since the NWM parameters are calibrated in representative watersheds and then regionalized as a function of Ecoregion in v1.2. For further information about NWM calibration, the reader is referred to Gochis et al., 2019 [44].
The characteristics were attributed as follows: Ecoregions-Level 3; the drainage area of the reservoir's contributing watershed was calculated using the NWM flow network; the percentage of snow contribution was attributed using the GAGES2 dataset [45] where gauges from this dataset were associated with the watersheds of each reservoir. The variable: "snow percent of total precipitation estimate, mean for period 1901−2000" was used to define this characteristic, following McCabe and Wolock (2010) [46]. The percent calibrated area is the NWMv1.2 calibrated area within the reservoir watershed divided by the total contributing drainage area. The percent upstream assimilated area (or, percent gauged area) is calculated by summing the contributing areas that belong to assimilation gauges in the NWM and dividing by the total reservoir drainage area. This characteristic indicates both how much assimilation correction has been performed upstream of the reservoir for the NWM forecasts, and how many streamflow stations are present upstream of each reservoir. If the percent is equal to zero, it means not only that the NWM forecast is unadjusted by assimilation, but it also means that the area upstream of the reservoir is ungauged by United States Geological Survey (USGS). Finally, the number of lakes upstream is defined as the number of NWM lakes or reservoirs contributing to the inflow of a given reservoir, identified by tracing upstream on the flow network. A summary of the reservoirs and characterizations are given in Table 1.
In addition to these physical and model-based classifications, we also used information from the National Inventory of Dams (NID) database [47] to assign a primary purpose to each reservoir. Using expert knowledge from BoR partners and local offices, the primary purpose was matched with the NWM cycles that are most applicable for each reservoir purpose ( Table 2). This association was intended to emphasize the relevance of the NWM for specific reservoir purposes. The results from the responses highlighted that, for example, the NWM long range forecast is more relevant for irrigation, while the short and medium range forecast is more appropriate for flood control operations. A summary of the stated primary purpose for each reservoir according to NID is also given in Table 1, last column.

NWM Inflow Assessment
The inflow evaluation was conducted on the 34 selected reservoirs through two sets of model simulations: (1) "retrospective": a seven-year, hourly, open-loop (no assimilation) retrospective simulation using the NWMv1.2 and NWMv2.0 forced by the National Land Data Assimilation System (NLDAS−2) [48] downscaled and regridded on the 1-km NWM land-surface grid, and (2) "forecast": a one-year simulation of the operational NWMv1.2 short-(0−18h), medium-(0−10 days) and long-range (0−30 days) forecast configurations, forced with NWM operational forcings, and with data assimilation active. Model configuration details, including sources for meteorological forcings, and information on downloading model outputs are available at Office of Water Prediction (2016) and references therein [36].
Assessing both the forecast and retrospective simulations is important for obtaining a robust evaluation and providing a more complete picture of overall NWM performance. The performance of the NWM in forecast mode incorporates meteorological uncertainties coming from the weather forecast forcings, and indicates the model's general ability to capture shorter events across lead times and to provide utility for operational decisions. In comparison, the retrospective is forced by weather variables coming from analyses (NLDAS−2), and provides insight on overall model physics and representation of seasonal processes that can be informative for longer-term planning. In dissecting where and when the model shows congruence between the forecast and retrospective performance, we can draw conclusions about its utility for reservoir management.  Using expert knowledge from BoR partners and local offices, the primary purpose was matched with the NWM cycles that are most applicable for each reservoir purpose ( Table 2). This association was intended to emphasize the relevance of the NWM for specific reservoir purposes. The results from the responses highlighted that, for example, the NWM long range forecast is more relevant for irrigation, while the short and medium range forecast is more appropriate for flood control operations. A summary of the stated primary purpose for each reservoir according to NID is also given in Table 1, last column.

Purpose 1 Short Range Medium Range Long Range
Water Supply x Ecosystem (1) x x 1 numbers in the parentheses denotes the reservoir count in this study; "x" indicates that the cycle is relevant for the project purpose; dark grey = highly relevant, light grey = potentially relevant.

NWM Inflow Assessment
The inflow evaluation was conducted on the 34 selected reservoirs through two sets of model simulations: (1) "retrospective": a seven-year, hourly, open-loop (no assimilation) retrospective simulation using the NWMv1.2 and NWMv2.0 forced by the National Land Data Assimilation System (NLDAS−2) [48] downscaled and regridded on the 1-km NWM land-surface grid, and (2) "forecast": a one-year simulation of the operational NWMv1.2 short-(0−18h), medium-(0−10 days) and long-range (0−30 days) forecast configurations, forced with NWM operational forcings, and with data assimilation active. Model configuration details, including sources for meteorological forcings, and information on downloading model outputs are available at Office of Water Prediction (2016) and references therein [36].
The total reservoir hourly inflow from the NWM was aggregated to a daily timestep for comparison with the BoR reconstructed daily inflow observations as follows: hours with same lead times (e.g., six-hour lead time) with valid times on each day were averaged for each day to create aggregated daily forecasts. This was done for each forecast cycle (i.e., short-, medium-, and long-range) at their respective output frequencies (i.e., hourly, three-hour, and six-hour, respectively).
NWM skill was assessed using three common goodness of fit statistics widely utilized in hydrology: Pearson's correlation coefficient (r), Nash-Sutcliffe Efficiency (NSE), and relative bias (relBias) [49][50][51]. We use relBias, which is a bias normalized by average inflow (Equation (1)), where O i is the observation value and P i is the forecast value, in order to compare reservoirs of different sizes with one another. These skill scores were selected to give an indication of response congruence (r), volumes (relBias), and an overall measure that considers both timing and volume (NSE).

Retrospective Analyses
Retrospective NWM total inflow was analyzed in order to provide insight on overall model physics and representation of seasonal processes. Since NWM was in the process of transitioning from v1.2 to v2.0, we did an initial check on retrospective performance differences between the two versions to ensure analysis from v1.2 is still relevant. Goodness of fit metrics were computed on daily streamflow using the retrospective model simulations from NWMv1.2 and v2.0 against BoR observations. Based on skill scores for r, relBias, and NSE, we note that inflows were generally improved from v1.2 to v2.0 across most reservoirs, and that reservoirs with low skill in v1.2 also generally had low skill in v2.0, indicating a persistent issue at these locations. Because a sufficient time period of forecasts were only obtainable from NWMv1.2, and the retrospective evaluation confirms a similar pattern between v2.0 and v1.2, NWMv1.2 is the focus of this analysis. Figure 2 shows the goodness of fit scores for r, relBias, and NSE (top to bottom) for the retrospective evaluation in v1.2 (dark) and v2.0 (light), where the x-axis is sorted from left to right by an increasing number of lakes upstream (with the count in brackets). The colors from green to red indicate the quality of the observations according to the description in Section 2.2.  Since the retrospective runs in 'open-loop', it does not account for managed upstream flows or active data assimilation. While we would expect the NWM to be able to simulate inflows into reservoirs with no management upstream better than those with multiple upstream lakes (Figure 2), the results are mixed. Nevertheless, the high positive relBias in managed watersheds indicates that NWM is simulating the unmanaged hydrology, which is not able to account for engineered inflow reductions from upstream reservoir operations.
In parsing the NWM inflow performance by the full set of reservoir characteristics (as listed in Table 1), expected patterns with percent calibrated area ( Figure A1) and drainage basin size ( Figure A2) emerged. In reservoirs with a higher percent of calibrated area, NWM inflows show better performance, especially for lakes with calibrated areas greater than~10%. The relationship between modeled inflow performance and lake drainage basin size is inverse, such that the model is able to simulate inflows in lakes with smaller basins better than larger ones, likely because larger basins contain more upstream management (e.g., diversions, lakes, etc.). Finally, snow percentage contribution over total precipitation revealed better retrospective statistical scores for reservoirs with higher snow percentage contribution ( Figure A3). See additional material in Appendix A, Figures A1-A3 for visual reference to these findings.

Forecast Assessment
This section presents the analysis of the NWM v1.2 daily inflow forecasts from September 2017 to August 2018 for the short, medium and long range forecast cycles. The results are sorted by characteristic (number of lakes upstream, percentage of assimilated area upstream, solid precipitation contribution and retrospective ranking) in order to investigate the influence of upstream management and assimilation on the forecast performance, focusing also on the differences between snow-driven and rain-driven total inflow contributions and how these different hydrological regimes are handled by the NWM forecast. Sorting the forecast results by different characteristics helped to infer the impact of each of these different elements on model performance.
As in the retrospective evaluation, the forecast analysis indicates that upstream management is a factor in inflow performance (Figure 3). We note that the r in reservoirs with upstream management is generally good (>0.7), but the relBias is high, whereas in the lakes without upstream management (24 of the 36), the relBias indicates underestimation. This pattern is true across all forecast cycles. Figure 4 illustrates how the percentage of upstream assimilated area, a proxy for gauged area, influences each forecast cycle and lead times within. To explain the mechanisms governing this behavior, a brief description of the NWM streamflow data assimilation follows. In the AnA cycle, which provides the forecast initial conditions, nudging data assimilation is applied to the streamflow state at~7000 USGS streamflow gauge locations through a real-time data feed on the operational system. Nudging is also applied in the forecast but only observations near the initialization time are used and the effect of the nudge is relaxed (with two hour e-folding) as the forecast advances into the future, beyond observations available near the initial time. Thus, we expect to see evidence of this relaxation across the first 0−3 h in the short-range at many gages (top row, Figure 4). Lakes that have a small percent of assimilation upstream exhibit two behaviors: low skill from early lead times, or good performance through the entire cycle. In highly assimilated areas, on the contrary, the short-range in some cases manages to keep memory of assimilation, improving the results even past the e-fold time (e.g.,~5-h lead time or more for Pactola reservoir (ptr), Echo reservoir (echoreservoir), Elephant Butte Dam (elephantbuttedam), Lovewell Dam (lvks), etc.). In contrast, the effect of large assimilation on medium and long range tends to dissipate relatively quickly over the longer lead times. In general, the medium and long-range inflow forecasts are less reliable at longer lead times than the short-range.  Figure 4 illustrates how the percentage of upstream assimilated area, a proxy for gauged area, influences each forecast cycle and lead times within. To explain the mechanisms governing this behavior, a brief description of the NWM streamflow data assimilation follows. In the AnA cycle, which provides the forecast initial conditions, nudging data assimilation is applied to the streamflow state at ~7000 USGS streamflow gauge locations through a real-time data feed on the operational system. Nudging is also applied in the forecast but only observations near the initialization time are used and the effect of the nudge is relaxed (with two hour e-folding) as the forecast advances into the future, beyond observations available near the initial time. Thus, we expect to see evidence of this relaxation across the first 0−3 h in the short-range at many gages (top . The black bold line separates reservoirs with more than zero lakes upstreams to the ones with zero lakes upstream.Each column from left to right columns illustrate r, NSE, relBias, where the metric is scaled by color density and grey is out of valid range (relBias > +1 or relBias < −1 and NSE < −1). For r and NSE, green values are higher performance and red is worse; and for relBias, blue is high-biased and red is low-biased. The subplot rows are forecast cycles: short (a), medium (b), long (c).
areas, on the contrary, the short-range in some cases manages to keep memory of assimilation, improving the results even past the e-fold time (e.g., ~5-h lead time or more for Pactola reservoir (ptr), Echo reservoir (echoreservoir), Elephant Butte Dam (elephantbuttedam), Lovewell Dam (lvks), etc.). In contrast, the effect of large assimilation on medium and long range tends to dissipate relatively quickly over the longer lead times. In general, the medium and long-range inflow forecasts are less reliable at longer lead times than the short-range. . Performance of NWM reservoir inflows across lead times for one year, where reservoirs in each subplot are ordered from bottom to top by the percent of upstream assimilated area (percentages are listed in parenthesis on the y-axis with reservoir name). The black bold line separates reservoirs with no assimilation upstream to the ones that have assimilation upstream. Each column from left to right columns illustrate r, NSE, relBias, where the metric is scaled by color density and grey is out of valid range (relBias > +1 or relBias < −1 and NSE < −1). For r and NSE, green values are higher performance and red is worse; and for relBias, blue is high-biased and red is low-biased. The subplot rows are forecast cycles: short (a), medium (b), long (c).

Figure 4.
Performance of NWM reservoir inflows across lead times for one year, where reservoirs in each subplot are ordered from bottom to top by the percent of upstream assimilated area (percentages are listed in parenthesis on the y-axis with reservoir name). The black bold line separates reservoirs with no assimilation upstream to the ones that have assimilation upstream. Each column from left to right columns illustrate r, NSE, relBias, where the metric is scaled by color density and grey is out of valid range (relBias > +1 or relBias < −1 and NSE < −1). For r and NSE, green values are higher performance and red is worse; and for relBias, blue is high-biased and red is low-biased. The subplot rows are forecast cycles: short (a), medium (b), long (c).
In watersheds that are dominated by snow-driven processes, the r and NSE skill scores over the forecast cycles and across lead times tend to outperform those in rain-driven watersheds ( Figure 5). In addition, snow-driven watersheds have also less degradation of the performance as a function of lead time in every forecast range (short, medium and long), even if some exceptions can still be found in the pattern. Rain-driven watersheds have the tendency to overestimate the volumes, according to relBias results, while reservoirs with greater snow contributions show more of a tendency to underestimate. These findings are encouraging and indicate that the NWM can reasonably reproduce snow-melt runoff, an essential process for water resources management.
In watersheds that are dominated by snow-driven processes, the r and NSE skill scores over the forecast cycles and across lead times tend to outperform those in rain-driven watersheds ( Figure 5). In addition, snow-driven watersheds have also less degradation of the performance as a function of lead time in every forecast range (short, medium and long), even if some exceptions can still be found in the pattern. Rain-driven watersheds have the tendency to overestimate the volumes, according to relBias results, while reservoirs with greater snow contributions show more of a tendency to underestimate. These findings are encouraging and indicate that the NWM can reasonably reproduce snow-melt runoff, an essential process for water resources management. Figure 5. Performance of NWM reservoir inflows across lead times for one year, where reservoirs in each subplot are ordered from bottom to top by the percent contribution of snow in total precipitation (percentages are listed in parenthesis on the y-axis with reservoir name). Each column from left to right columns illustrate r, NSE, relBias, where the metric is scaled by color density and grey is out of valid range (relBias > +1 or relBias < −1 and NSE < −1). For r and NSE, green values are higher performance and red is worse; and for relBias, blue is high-biased and red is low-biased. The subplot rows are forecast cycles: short (a), medium (b), long (c).

Figure 5.
Performance of NWM reservoir inflows across lead times for one year, where reservoirs in each subplot are ordered from bottom to top by the percent contribution of snow in total precipitation (percentages are listed in parenthesis on the y-axis with reservoir name). Each column from left to right columns illustrate r, NSE, relBias, where the metric is scaled by color density and grey is out of valid range (relBias > +1 or relBias < −1 and NSE < −1). For r and NSE, green values are higher performance and red is worse; and for relBias, blue is high-biased and red is low-biased. The subplot rows are forecast cycles: short (a), medium (b), long (c).
From the retrospective assessment, we can confirm that snow processes are a significant driver in understanding inflow performance. For example, if we compare two reservoirs of similar size, gauged area, but in different regions, Lake Sherburne (sher, 67.7% snow-driven, Canadian Rockies ecoregion, 23% calibrated) and Lovewell Dam (lvks, 9.2% snow-driven, Central Great Plains ecoregion, 65% calibrated), the impact on the seasonal cycle is evident ( Figure 6 and Table 3). The NWM is able to replicate the seasonal and annual cycles in the snow-driven watershed contributing to Lake Sherburne reservoir (sher), while the peaks at Lovewell Dam (lvks) were not well aligned with the observations. From the retrospective assessment, we can confirm that snow processes are a significant driver in understanding inflow performance. For example, if we compare two reservoirs of similar size, gauged area, but in different regions, Lake Sherburne (sher, 67.7% snow-driven, Canadian Rockies ecoregion, 23% calibrated) and Lovewell Dam (lvks, 9.2% snow-driven, Central Great Plains ecoregion, 65% calibrated), the impact on the seasonal cycle is evident ( Figure 6 and Table 3). The NWM is able to replicate the seasonal and annual cycles in the snow-driven watershed contributing to Lake Sherburne reservoir (sher), while the peaks at Lovewell Dam (lvks) were not well aligned with the observations. Figure 6. Retrospective comparison of inflows from two sites, Lake Sherburne (a) (sher, 67.7% snow-driven, Canadian Rockies, 23% calibrated) and Lovewell Dam (b) (lvks, 9.2% snow-driven, Central Great Plains, 65% calibrated). Observed is in black, model simulation in gray. Figure 6. Retrospective comparison of inflows from two sites, Lake Sherburne (a) (sher, 67.7% snow-driven, Canadian Rockies, 23% calibrated) and Lovewell Dam (b) (lvks, 9.2% snow-driven, Central Great Plains, 65% calibrated). Observed is in black, model simulation in gray. Table 3. Seasonal statistics for Lake Sherburne (sher) (panel a) and Lovewell Dam (lvks) (panel b). Rows represent statistical scores (from top to bottom): r, relBias and NSE. The columns represent the scores over each season: "total" refers to the total length of the retrospective time series, "ONDJFM" refers to the time-period from October to March, "AMJ" refers to the time-period from April to June, "JAS" refers to the time-period from July to September. The percentages in brackets next to each column represent the percentage observational data availability over the time period of interest.

Combined Assessment
In an effort to synthesize the factors that influence inflow performance, Table 4 summarizes the reservoirs, sorted by Ecoregion, with the characteristics presented in Tables 1 and 2, and with the retrospective NSE and the forecast-cycle NSE scores at three lead times. Table 4. Summary of reservoir characteristics and NSE skill scores for retrospective and forecasts (following page). The color densities indicate higher values for the characteristics (e.g., larger lake surface area) while the red to green densities for the NSE scores indicate the relative performance (red is worse and green is better). Ber does not show scores for the forecast since it had more than 30% missing data for the forecast period, as explained in Section 2.3.1. NA values for snow % are associated with places in which the nearest USGS was not included in the GAGES2 database.
Water 2020, 12, x FOR PEER REVIEW 16 of 31 Table 4. Summary of reservoir characteristics and NSE skill scores for retrospective and forecasts (following page). The color densities indicate higher values for the characteristics (e.g., larger lake surface area) while the red to green densities for the NSE scores indicate the relative performance (red is worse and green is better). Ber does not show scores for the forecast since it had more than 30% missing data for the forecast period, as explained in Section 2.3.1. NA values for snow % are associated with places in which the nearest USGS was not included in the GAGES2 database. To summarize the results from the retrospective and forecast evaluation, Table 4 and Figure 7 offer visualizations of the combined analysis. Figure 7 provides a visual for assessing congruence between the retrospective and forecast by comparing the NWM inflow performance between the two configurations (ordered by NSE from bottom to top (best to worst) according to their retrospective ranking).

Reservoir
Water 2020, 12, x FOR PEER REVIEW 18 of 32 To summarize the results from the retrospective and forecast evaluation, Table 4 and Figure 7 offer visualizations of the combined analysis. Figure 7 provides a visual for assessing congruence between the retrospective and forecast by comparing the NWM inflow performance between the two configurations (ordered by NSE from bottom to top (best to worst) according to their retrospective ranking). Taken together, the retrospective and forecast evaluation provide a comprehensive understanding of the overall reservoir performance: the retrospective evaluation consists of a multi-year time series and is forced by NLDAS analyses, that, even if have their own uncertainties, have assimilated observations in the meteorological forcings. On the contrary, the forecast runs span shorter time periods (with varying lead times for short, medium long range) and are forced by meteorological products that are derived from forecast models (High-Resolution Rapid Refresh Taken together, the retrospective and forecast evaluation provide a comprehensive understanding of the overall reservoir performance: the retrospective evaluation consists of a multi-year time series and is forced by NLDAS analyses, that, even if have their own uncertainties, have assimilated observations in the meteorological forcings. On the contrary, the forecast runs span shorter time periods (with varying lead times for short, medium long range) and are forced by meteorological products that are derived from forecast models (High-Resolution Rapid Refresh (HRRR) for the short range, Global Forecast System (GFS) for the medium range, Climate Forecast System (CFS) for the long range). When the retrospective shows favorable performance, it is plausible to assume that in that case the NWM is able to sufficiently reproduce the physical processes over multiple years and seasons. The value of this analysis is in understanding variance in model performance by season and by different model version (since the meteorological forcings are the same) and using this knowledge to guide future development. In the cases with poor retrospective performance, we expect similar behavior in the forecast, unless a significant contribution from assimilation occurs at that site. Finally, when the retrospective skill score are unsatisfactory, but the forecast performance is good, it is difficult to infer conclusions but further investigations can be done in assessing the quality of the meteorological forcings (since, it is well known that the hydrological model cannot correct the errors coming from the forcings). This is the general set of guidelines to follow when synthesizing the two different evaluations (retrospective and forecast). Nevertheless, due to the NWM complexity and the propagation of uncertainties from the atmospheric forcings to the hydrology, the real answers are much more complicated and would require a detailed study for each single reservoir case. This focus is outside the scope of the present study, but hopefully this broader analysis will encourage researchers and stakeholders to deepen the analyses on the local level.
Generally, NWM reservoir inflow performance between the forecast and retrospective agree, with some exceptions (Figure 7). One example of these exceptions is Navajo reservoir (navajo), which ranks high in the retrospective evaluation but has poor goodness of fit metrics across forecast cycles. This contrast between forecast and retrospective suggests issues with the meteorological forcings used to drive the forecasts. In Navajo, only 2% of the contributing watershed is calibrated, and 76% of the basin is gauged (high assimilation). The NWM simulates flows with high fidelity in the retrospective (Figure 8, panel a), but some loss in performance in the forecast for some possible forcing overestimation. The short range hydrograph is illustrated in Figure 8 panel b, while medium and long range are available in the additional material in Figure A4. The converse is true for Elephant Butte (elephantfbutte), which has a low retrospective performance (see ranking in Figure 7 and hydrograph in Figure 9, panel a), but a relatively high goodness of fit metrics in the forecast short lead times (Table 3 and hydrograph at Figure 9, panel b (short range, while medium and long range are available in additional material at Figure A5). In this case, we note that elephantbutte has 15 lakes upstream and significant diversions in the basin; and 81% of the basin is gauged, which means assimilation is a key factor in achieving an NSE = 0.75 in early lead times across all forecast cycles in this basin (Table 4). For longer lead times, conversely, the effects of assimilation vanish, and we assist to a rapid decrease in performance at longer lead times (Table 4 and medium-and long-range hydrographs for elephantbutte in AX3). Table 4 indicates that more often, simulated reservoir inflows perform worse in the retrospective and better in the forecast than the opposite case. These reservoirs also have a considerable amount of assimilation upstream, proving the effectiveness of nudging to improve performance, at least for some of the cycles.  This analysis suggests that certain reservoir inflows are reliable for the retrospective and forecast periods, such as lemon, blr, sher, taylorpark, gbr, while others may be more suitable to provide information only for a particular forecast cycle. Table 4 also highlights that NWM performs relatively well in snow-driven watersheds, in both retrospective and forecast simulations. In this analysis, the snow-driven watersheds tend to be small, headwaters basins, with minimal upstream management, and well calibrated with a high percentage of gauged (assimilated) area (e.g., taylorpark, blr, sher). In such basins, where the hydrology is tied to seasonal accumulation and snow melt periods rather than convective storms, the NWM forecasts and retrospectives can be useful for planning purposes such as those described in Table 2. This analysis suggests that certain reservoir inflows are reliable for the retrospective and forecast periods, such as lemon, blr, sher, taylorpark, gbr, while others may be more suitable to provide information only for a particular forecast cycle. Table 4 also highlights that NWM performs relatively well in snow-driven watersheds, in both retrospective and forecast simulations. In this analysis, the snow-driven watersheds tend to be small, headwaters basins, with minimal upstream management, and well calibrated with a high percentage

Discussion
In general, we found that simulated inflows into reservoirs with upstream management have more bias and lower NSE and r skill scores in the retrospective and forecast evaluation compared to those in unregulated watersheds. Furthermore, as the number of upstream lakes increase, performance in the forecast decays rapidly as a function of lead time. As explained in Section 2, the NWM currently employs a simplified representation of management from upstream reservoirs and this is, as expected, reflected in poorer performances both in the retrospective and in the forecast cycles for watersheds that include a high number of regulated reservoirs and lakes upstream.
The performance of reservoirs by watershed size is highly correlated with the number of lakes: most of the time a bigger water projects include multiple upstream projects/facilities and, consequently, the NWM's simulation of natural processes cannot capture these features.
As with the retrospective evaluation, we find that basins with partially calibrated drainage areas generally have worse goodness of fit scores than fully calibrated across all forecast cycles, an indication that regionalization within a large watershed may adversely impact performance. For example, if a large watershed has multiple regionalization "clusters" within it, the parameters may be distributed in a non-consistent manner across the different contributing basins to the total inflow.
In addition to that, the joined forecast and retrospective evaluation of reservoirs in Table 4 highlights some important considerations regarding the role of assimilation and reinforces the discussion of Figure 4: some reservoirs that have very poor performance in the retrospective and large assimilated area are able to have much better performance in the short lead times (e.g., Pactola reservoir (ptr), Echo reservoir (echoreservoir), Elephant Butte Dam (elephantbuttedam), Lovewell Dam (lvks)) presumably thanks to correction provided by assimilation.
Finally, another important surprising result is the watersheds with a higher percentage of snow precipitation performed better than the ones fed mostly by rain on both retrospective and forecast evaluation. Although it is beyond the scope of this study to conduct specific experiments related to snow-driven processes, it was still possible to state the fact that the current NWM has been better optimized to reproduce the runoff characteristics in these snow-driven watersheds. Even if this can also sum to the ability of the forcings to correctly depict solid vs. liquid precipitation, these results are very consistent between retrospective and forecast, suggesting that rain-fed watersheds are more challenging for the NWM to simulate, probably due to flashy behavior in the Midwestern U.S. plains compared to many snow-driven watersheds in the western U.S. Further research into better meteorological forcings and physics improvements are needed to address this issue.

Conclusions
In the absence of an observed total inflow into critical reservoirs, the NWM can provide an understanding of the hydrological processes and patterns of the contributing watershed hydrology from the retrospective simulations. The NWM also provides access to a CONUS-wide set of continuous forecasts which use consistent model physics and meteorological forcings at near real-time, which can serve as an additional set of information for water managers. As NWM evolves from its nascent stages, there is a great need and effort to encourage participation in model improvements and evaluating sources for data assimilation and parameter estimation. However, the ability to directly use the forecasted inflows depends on several factors. The main factors driving inflow performance are the precipitation dominance (snow vs. rain fed) and the influence of upstream management (number of lakes upstream). Other physical characteristics were influential but not in a consistent and systematic way. Thus, in headwater basins that are snow-driven, the NWM can provide a realistic depiction of the flow profile.
This analysis also exposed short-comings with regard to the mixed performance in calibration basins, which were expected to show improvements especially, in the retrospective. Given the variability in observation quality as denoted in Section 2.2, it is difficult to attribute poor performance strictly to the model; thus, future work will expand the number of reservoirs to include those from other agencies that have inflow observations. Overall, even if local engineers will still use their methods for accounting total inflows in reservoirs of interest, in ungauged headwater basins, NWM provides information that can supplement these calculations. Moreover, the hourly time resolution offered by the NWM may be useful to operators and forecasters who otherwise might only have access to daily information. With a growing community of researchers and local offices that are getting involved in exploring the potential usage of NWM for water management, the hope is that this study will offer good guidance for future work and applications.    Figure A3. Goodness of fit scores from the NWM v1.2 (dark grey) and NWM v2.0 (light grey) retrospective total inflow evaluation, ordered by increasing snow percentage over total precipitation.from top to bottom: r (a), relBias (b), NSE (c). When the snow percentages are replaced by NAs means that that particular reservoirs did not have any GAGES2 snow percentage information. Figure A3. Goodness of fit scores from the NWM v1.2 (dark grey) and NWM v2.0 (light grey) retrospective total inflow evaluation, ordered by increasing snow percentage over total precipitation.from top to bottom: r (a), relBias (b), NSE (c). When the snow percentages are replaced by NAs means that that particular reservoirs did not have any GAGES2 snow percentage information. Water 2020, 12, x FOR PEER REVIEW 28 of 32 Figure A4. NWM simulated reservoir inflows into Navajo Dam for a one−year forecast period for medium (a) and long range (b). Blue lines are the modeled flows for all lead times and black is the BoR observations. Figure A4. NWM simulated reservoir inflows into Navajo Dam for a one−year forecast period for medium (a) and long range (b). Blue lines are the modeled flows for all lead times and black is the BoR observations. Water 2020, 12, x FOR PEER REVIEW 29 of 32 Figure A5. NWM simulated reservoir inflows into Elephant Butte Reservoir for a one−year forecast period for medium (a) and long range (b). Blue lines are the modeled flows for all lead times and black is the BoR observations.