OpenForecast: An Assessment of the Operational Run in 2020–2021

Abstract: OpenForecast is the first openly available national-scale operational runoff forecasting system in Russia. Launched in March 2020, it routinely provides 7-day ahead predictions for 843 gauges across the country. Here, we provide an assessment of the OpenForecast performance on the long-term evaluation period from 14 March 2020 to 31 October 2021 (597 days) for 252 gauges for which operational data are available and quality-controlled. Results show that OpenForecast is a robust system based on reliable data and solid computational routines that secures efficient runoff forecasts for a diverse set of gauges.


Introduction
Floods remain the primary source of economic and human losses among all natural disasters [1][2][3]. During the past few decades, floods caused thousands of deaths and billions of dollars of material damage [4]. The modern trend towards warming and hence more extreme climate [5][6][7], as well as the growing anthropogenic load on the environment [8,9], keeps floods in strong research focus as dominant impact-relevant events [5,6,10].
Continuous development and benchmarking of operational flood forecasting services are among the most dynamic research areas, owing to the high relevance of early warnings for the prevention of disastrous flood events and the reduction of their impacts [11,12]. Today, many forecasting services, from global [13][14][15] to continental [16][17][18][19], national [20][21][22], and regional scales [23][24][25], are in operational use, producing timely and reliable runoff forecasts. All these services are complex structures based on many individual components that provide, e.g., operational data assimilation, forecast computation, and dissemination functionality [26][27][28]. Hence, it is important to guide the development of operational runoff forecasting services and their components towards more skillful predictions by continuously evaluating their performance [29].
OpenForecast is the first national-scale operational runoff forecasting system in Russia [20,24]. It has been developed since 2018 by a consortium of researchers from the State Hydrological Institute (Saint-Petersburg, Russia), the Water Problems Institute (Moscow, Russia), the Central Administration for Hydrometeorology and Ecology Monitoring (Moscow, Russia), and the Lomonosov Moscow State University (Moscow, Russia) with funds provided by the Russian Foundation for Basic Research. The system was launched on 14 March 2020. Since then, OpenForecast has operationally provided one week ahead runoff forecasts for 843 gauges across Russia. In our previous research studies, we presented: (1) the proof of concept of the runoff forecasting service design (OpenForecast v1 [24]), which was evaluated for three pilot river basins in the European part of Russia for the period from 1 March 2019 to 30 April 2019 (61 days), and (2) the second version of the OpenForecast system (OpenForecast v2 [20]), which was scaled to 843 gauges across Russia and then evaluated for the period from 14 March 2020 to 6 July 2020 (115 days). Both studies confirmed OpenForecast as a successful example of a state-of-the-art national-scale runoff forecasting system. However, short evaluation periods leave room for speculation about the system's consistency and robustness.
Here, in this Short Communication, we aim to provide a long-term assessment of the OpenForecast performance metrics based on the results obtained for the period from 14 March 2020 to 31 October 2021 (597 days). There are five central research questions:
1. Is there a consistency between the performance on calibration and evaluation periods?
2. Is there a consistency between the performance of computed runoff hindcasts and forecasts? What are the differences in performance between distinct hydrological models?
3. Is communicating the ensemble mean a good strategy for forecast dissemination?
4. What is the role of meteorological forecast efficiency in runoff forecasting?
5. How many people use OpenForecast?
In our opinion, even a brief investigation of the research questions mentioned above would increase confidence in OpenForecast's reliability among the general public as well as academic and government institutions.

Data and Methods
The comprehensive description of data and methods used for the development of the OpenForecast system is provided in an open-access paper [20]. Here, we briefly introduce the system's main underlying data sources and computational components required to support the presentation and analysis of results.

Runoff Data
Streamflow and water level observations for the historical period (2008-2017) are available at the website of the Automated Information System for State Monitoring of Water Bodies (AIS; https://gmvo.skniivh.ru, accessed on 10 December 2021). Streamflow (m³/s) observations have been used for hydrological model calibration. In addition to streamflow, water level (cm above the "gauge null") observations have been used to calculate rating curves for the transformation of streamflow values to water level (and vice versa) for the corresponding gauges.
Operationally, only water level observations are available, and only for a limited number of gauges, at the website of the Unified State System of Information regarding the Situation in the World Ocean (ESIMO; http://esimo.ru/dataview/viewresource?resourceId=RU_RIHMI-WDC_1325_1, accessed on 10 December 2021).

Meteorological Data
The ERA5 global meteorological reanalysis [30] and its pre-operational product ERA5T (5-day delay from real time) serve as a source of historical meteorological forcing of air temperature (T, °C) and precipitation (P, mm). The outputs from the global numerical weather prediction model ICON [31] serve as a source of deterministic 7 day-ahead meteorological forecasts for air temperature and precipitation. Here, meteorological data have been aggregated to the daily time step and then averaged at the basin scale for each available river basin. In addition to air temperature and precipitation, potential evaporation (PE, mm) is calculated using the temperature-based equation proposed in [32].

Hydrological Models
We use two conceptual lumped hydrological models: HBV [33] and GR4J [34]. While HBV has an internal snow module, the GR4J model has been complemented with the Cema-Neige snow accumulation routine [35,36]. Both models require only daily precipitation, air temperature, and potential evaporation as inputs (see Section 2.2). HBV and GR4J models have 14 and six free parameters, respectively (Tables 1 and 2). For each available river basin, model parameters have been automatically calibrated against observed runoff using two loss functions: (1) the Nash-Sutcliffe efficiency coefficient (NSE; Equation (1); [37]) and (2) the Kling-Gupta efficiency coefficient (KGE; Equation (2); [38]). Here, utilization of different loss functions is the simplest way to introduce an ensemble approach for runoff forecasting [39]. Thus, in the OpenForecast system, there are four models used to calculate runoff: HBV NSE , HBV KGE , GR4J NSE , and GR4J KGE .
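$$\mathrm{NSE} = 1 - \frac{\sum_{\Omega} (Q_{sim} - Q_{obs})^2}{\sum_{\Omega} (Q_{obs} - \overline{Q_{obs}})^2} \qquad (1)$$

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + \left(\frac{\sigma_{sim}}{\sigma_{obs}} - 1\right)^2 + \left(\frac{\overline{Q_{sim}}}{\overline{Q_{obs}}} - 1\right)^2} \qquad (2)$$

where $\Omega$ is the period of evaluation, $Q_{sim}$ and $Q_{obs}$ are the simulated and observed runoff, $\overline{Q_{sim}}$ and $\overline{Q_{obs}}$ are the mean simulated and observed runoff, $r$ is the correlation component represented by Pearson's correlation coefficient, and $\sigma_{sim}$ and $\sigma_{obs}$ are the standard deviations of simulations and observations. NSE and KGE are positively oriented and have no lower bound: a value of 1 represents a perfect correspondence between simulations and observations. NSE > 0 and KGE > −0.41 can be considered to be showing skill against the mean flow benchmark [40].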
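The two efficiency metrics can be illustrated with a minimal Python sketch. It assumes paired lists of simulated and observed daily runoff without missing values; the function names `nse` and `kge` are illustrative and are not taken from the OpenForecast code base.

```python
import math


def nse(q_sim, q_obs):
    """Nash-Sutcliffe efficiency for paired simulated/observed runoff."""
    mean_obs = sum(q_obs) / len(q_obs)
    squared_error = sum((s - o) ** 2 for s, o in zip(q_sim, q_obs))
    obs_variability = sum((o - mean_obs) ** 2 for o in q_obs)
    return 1.0 - squared_error / obs_variability


def kge(q_sim, q_obs):
    """Kling-Gupta efficiency built from correlation, variability and bias terms."""
    n = len(q_obs)
    mean_sim = sum(q_sim) / n
    mean_obs = sum(q_obs) / n
    # Population standard deviations; their ratio equals the sample-based ratio.
    sd_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in q_sim) / n)
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in q_obs) / n)
    cov = sum((s - mean_sim) * (o - mean_obs) for s, o in zip(q_sim, q_obs)) / n
    r = cov / (sd_sim * sd_obs)   # Pearson's correlation coefficient
    alpha = sd_sim / sd_obs       # variability ratio
    beta = mean_sim / mean_obs    # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

A simulation equal to the observed mean at every step gives NSE = 0, which is why NSE > 0 is read as skill against the mean flow benchmark. Note that `kge` is undefined for a constant simulation because the correlation term cannot be computed.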

OpenForecast Runoff Forecasting System
The OpenForecast system provides one week ahead runoff forecasts for 843 gauges that have been selected based on calibration results and data availability [20]. The illustration of the OpenForecast computational workflow is presented in Figure 1. First, for each gauge of interest, meteorological forcing is updated based on the latest ERA5T and ICON data (see Section 2.2). Then, the updated meteorological forcing data are used as input to four hydrological models (Section 2.3) to obtain recent runoff forecasts. Finally, the calculated forecasts are communicated on the project's website (https://openforecast.github.io, accessed on 14 December 2021).
Figure 2 illustrates OpenForecast's modeling phases and the corresponding input data in more detail. There are three general phases (periods): (1) hindcast, (2) pre-operational hindcast, and (3) forecast. For the hindcast phase, ERA5 and ERA5T meteorological data are utilized. Hindcasts are therefore model predictions (similar to those on the calibration period) made during the run time of the forecasting system. Because of the delay of ERA5T from real time, a scheme for filling missing data between the most recent ERA5T update and the ICON forecast is needed. For that purpose, we use ICON hindcasts, i.e., the past 1 day-ahead ICON forecasts [20,24]. To distinguish this phase from the hindcast phase, which utilizes ERA5-based data instead of ICON-based data, we call it the pre-operational hindcast (Figure 2). Finally, we use deterministic 7 day-ahead ICON forecasts to force hydrological models to provide the corresponding predictions for the forecast phase. The results of the OpenForecast operational run have been obtained for the period from 14 March 2020 to 31 October 2021 (597 days). While we compute four realizations of runoff predictions (by the number of utilized hydrological models; Section 2.3), we communicate only the ensemble mean (hereafter ENS) and the ensemble spread on the OpenForecast website (https://openforecast.github.io, accessed on 14 December 2021).
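The sequence of modeling phases can be summarized in a small Python sketch that maps a day offset relative to the forecast run time t to the forcing source. The seven-day windows follow the lead times reported in this paper (t − 7, ..., t − 1 and t + 1, ..., t + 7); the function name `forcing_source` and the returned labels are hypothetical, and the handling of offset 0 is not specified in the text, so the sketch rejects it rather than guessing.

```python
def forcing_source(offset_days: int) -> str:
    """Return the meteorological forcing used for day t + offset_days.

    Assumption: the pre-operational hindcast covers exactly t-7..t-1 and
    the forecast covers t+1..t+7, as in the evaluation lead times.
    """
    if offset_days <= -8:
        # Reanalysis-based forcing (ERA5 and its delayed ERA5T extension).
        return "hindcast: ERA5/ERA5T"
    if -7 <= offset_days <= -1:
        # Gap between the latest ERA5T update and the forecast run time,
        # filled with past 1 day-ahead ICON forecasts.
        return "pre-operational hindcast: ICON 1 day-ahead"
    if 1 <= offset_days <= 7:
        # Deterministic 7 day-ahead ICON forecast.
        return "forecast: ICON"
    raise ValueError(f"offset {offset_days} is outside the phases described here")
```

For example, `forcing_source(-3)` falls into the pre-operational hindcast, while `forcing_source(5)` falls into the forecast phase.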

Reference Gauges
Unfortunately, it is impossible to provide an efficiency assessment of OpenForecast for all 843 gauges. There are two main reasons: (1) the operational information provided by the ESIMO system (Section 2.1) does not cover all OpenForecast gauges, and (2) historical water level observations from the AIS system are not always consistent with operational information provided by ESIMO. Thus, after semi-automatic checking of operational data consistency (e.g., detection of outliers, sudden changes in flow dynamics, and visual inspection), 252 gauges have been selected for further performance assessment (Figure 3). This number represents 30% of OpenForecast's operational gauges and could be considered representative because the sample keeps the distribution of small, medium, and large basins similar to that of the general population (843 gauges). The shares of small/medium/large basins are 13/41/46% in the sample and 20/45/35% in the general population.

Consistency between Calibration and Evaluation Periods
Temporal consistency of model efficiency ensures computational system robustness and confidence in the underlying routines and models [44][45][46][47]. There are many cases in which that consistency could be disrupted: inconsistency of meteorological data sources between the periods under consideration, a significant change in runoff formation or pathways (e.g., reservoir construction), and instability of model parameters, to name a few. As a result, a well-calibrated model may fail to provide reliable predictions under new conditions.
Here, we provide results of model efficiency assessment for two periods: (1) calibration and (2) hindcast (evaluation). Figure 4 illustrates the differences between efficiencies of the individual models in terms of NSE (see Figure A1 for the KGE metric) for 843 OpenForecast gauges. The obtained results are similar for all models and illustrate visually distinct differences between model performance on two independent periods. As expected, individual model efficiencies decrease on the hindcast period compared to the calibration period. The median NSE drops from 0.81 to 0.71, 0.78 to 0.66, 0.82 to 0.64, and 0.8 to 0.63 for GR4J NSE , GR4J KGE , HBV NSE , and HBV KGE , respectively. The major quantiles, 25th and 75th, follow the same pattern. It is also visually clear that the bottom "tail" of lower values is bigger on the hindcast period than on the calibration period. The obtained results also show that the HBV-based models with 14 calibrated parameters, HBV NSE and HBV KGE , lose more efficiency than the GR4J-based models with six calibrated parameters, GR4J NSE and GR4J KGE . That provides an interesting insight into the higher reliability of simpler models for runoff forecasting even if they had comparable efficiency during the calibration period. However, that distinct decrease in performance from the calibration to the hindcast period is not critical for the forecasting system's reliability. Only a small number of gauges show unskillful results in terms of NSE (NSE ≤ 0): four, nine, seven, and 16 for GR4J NSE , GR4J KGE , HBV NSE , and HBV KGE , respectively. The sixteen gauges with unskillful NSE for the HBV KGE model include those detected for the other individual models. Most of them (12 out of 16) are located in the European part of Russia and represent small and medium-size basins (under 10,000 km²). There are 58, 78, 78, and 87 gauges that show unsatisfactory (after [48]) yet skillful results in terms of NSE (NSE ≤ 0.5). However, there is no distinct pattern in their spatial or basin area distribution. In terms of KGE, unskillful results (KGE ≤ −0.41) are shown only for a single gauge by the GR4J NSE and GR4J KGE models. Therefore, we argue that OpenForecast's underlying hydrological models are robust and provide a solid basis for reliable runoff predictions.
The decrease of model efficiency on the evaluation period compared to the calibration period is commonplace in hydrological modeling studies and well reported in the literature [20,44-47]. There are several significant reasons for that behavior, e.g., changing meteorological or landscape conditions between the considered periods and/or instability of model parameters. However, for the present case of OpenForecast efficiency assessment, the factor of observational data inconsistency takes the lead. First, for the performance assessment, we use operational water level data from the ESIMO system that could be inconsistent with the historical runoff data from the AIS system that we use for model calibration (Section 2.1). ESIMO's data do not undergo correction routines, so they could be misleading for some gauges. Additionally, for some gauges, processes of river channel transformation may play a major role, so a correction of the rating curve is needed for reliable conversion of operational water levels to runoff. Unfortunately, the AIS database is updated with a time lag of approximately two years. Thus, we could provide an OpenForecast performance assessment based on consistent runoff data no earlier than the end of 2022.
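The NSE thresholds used in this assessment can be expressed as a short Python sketch that counts gauges per class. The class labels and the function names `classify_nse` and `summarize` are illustrative; the label "satisfactory" for NSE > 0.5 is an assumption made here for completeness.

```python
from collections import Counter


def classify_nse(nse_value: float) -> str:
    """Assign a gauge to a skill class based on its NSE value."""
    if nse_value <= 0.0:
        return "unskillful"          # no skill against the mean flow benchmark
    if nse_value <= 0.5:
        return "unsatisfactory"      # skillful, yet below the 0.5 threshold
    return "satisfactory"            # assumed label for NSE > 0.5


def summarize(nse_by_gauge: dict) -> Counter:
    """Count gauges per skill class from a mapping gauge id -> NSE."""
    return Counter(classify_nse(value) for value in nse_by_gauge.values())
```

Applied to the hindcast-period NSE values of one model, `summarize` would return counts of the kind reported in the text for each class.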
Figure 5 shows the spatial distribution of differences in NSE between the calibration and hindcast periods (NSE hindcast − NSE calibration ) for the HBV KGE model, for which the corresponding differences are the most pronounced. First, we should mention that for 34 gauges (13.5%), NSE on the hindcast period is higher than on the calibration period. No distinct spatial clusters of the plotted differences could be attributed to geographical or hydrological factors. It was also expected that models calibrated using a particular metric (either NSE or KGE) would have better results in terms of that metric on the evaluation (hindcast) period. Thus, GR4J NSE and HBV NSE have higher median NSE efficiencies (0.71 and 0.64, respectively) than GR4J KGE and HBV KGE (0.66 and 0.63, respectively) on the evaluation (hindcast) period. The obtained results raise the question of the best metric that could serve the needs of all interested parties: the professional community, government agencies, and the general public [49]. Currently, NSE and KGE metrics are popular only within the hydrological community. Hence, we need a targeted effort to make them (or more successful analogs) familiar to the general public.

Consistency between Hindcasts and Forecasts
In contrast to the comparison of model efficiencies between the calibration and evaluation (hindcast) periods, where the general idea was to validate overall model reliability and robustness on contrasting periods (Section 3.1), here we aim to evaluate the consistency and skill of model predictions under inconsistent input data for the hindcast, pre-operational hindcast, and forecast modeling phases (Figure 2). The difference in prediction efficiency between the hindcast and pre-operational hindcast periods highlights the trade-off of the transition from the ERA5 reanalysis to ICON hindcasts, which fill in meteorological forcing data for the seven days before the forecast run time (Figure 2). The difference in prediction efficiency between the pre-operational hindcast and forecast periods describes forecasting efficiency and highlights the cumulative role of initial conditions and meteorological forecasts in the efficiency decrease over lead time.
Figure 6 illustrates the distribution of individual and ensemble mean (ENS) model performances in terms of NSE for the hindcast period, as well as for seven pre-operational hindcast (t − 7, ..., t − 1 days) and seven forecast (t + 1, ..., t + 7 days) lead times for 252 OpenForecast gauges (see Figure A2 for the KGE metric). First, it is visually apparent that all models follow the same pattern of efficiency change: the efficiency slowly decreases with increasing lead time, and there are no significant drops between the hindcast, pre-operational hindcast, and forecast periods. Thus, all models demonstrate persistent and robust behavior when assessed on a long-term period of almost two years (March 2020-October 2021). Both mean NSE and KGE (Figures 4 and A1) are higher than the behavioral values for all considered periods and lead times: 0.5 for NSE and 0.3 for KGE (after Knoben et al. [40]). The obtained results are in line with the previous large-scale assessments of the OpenForecast performance [20,50], which supports the robustness and reliability of the developed forecasting system.
Similar to the results obtained in Section 3.1, GR4J-based models (GR4J NSE and GR4J KGE ) generally show higher efficiency than HBV-based models (HBV NSE and HBV KGE ) in terms of the NSE metric. The differences are less pronounced in terms of the KGE metric. Thus, higher model complexity (of HBV-based models) does not ensure higher efficiency of runoff predictions and forecasts in the case of the OpenForecast system. While the difference in mean NSE between GR4J and HBV-based models is significant (around 10% for each lead time), they show a similar rate of around 10% for efficiency decrease with lead time. While the two hydrological models differ in how well they capture the complexity of runoff formation processes, they respond similarly to changes in meteorological input forcing (from hindcasts to forecasts).

Communication of Ensemble Mean
Despite a large variety of options for communicating ensemble runoff forecasts, there is as yet no consensus on which practice best fits the differing requirements of the many parties involved [51]. From the beginning, communication of the ensemble mean and spread has been the only option for the dissemination of runoff forecasts in the OpenForecast system [24]. That choice was driven by two main factors: (1) the ensemble mean could provide more skillful and less biased results than each of its members [52,53], and (2) visualization of a single line is perceptually clear and easier to understand [51]. Previous assessment studies confirmed this approach as reliable and skillful [20,50]. Figure 7 illustrates time series of simulated ensemble mean streamflow compared to observations. Figure 8 shows the development of mean efficiencies with lead time for individual models, as well as their ensemble mean, for the long-term evaluation period in terms of NSE and KGE. Results show that the communication of the ensemble mean is the best strategy so far: ENS demonstrates higher efficiency than all individual models for all lead times in terms of both performance metrics. The ensemble mean thus benefits end-users the most because it combines perceptual clarity with the highest prediction efficiency. The latter is probably a result of combining structurally different yet efficient models [53][54][55]. We see ample potential to further increase the number of hydrological models within the computational core of OpenForecast. The recent advances in the distribution of hydrological models as open-source software packages [56][57][58] make this particularly easy to implement and further capitalize on the open nature of the OpenForecast system.
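The computation of the communicated ensemble products can be illustrated with a minimal Python sketch. It assumes that each of the four models provides a runoff series of equal length for the forecast lead times; the function name `ensemble_summary` is hypothetical, and the spread is taken here as the member minimum and maximum, which is an assumption because the exact spread definition is not given in this paper.

```python
def ensemble_summary(members: dict):
    """Return (mean, lower, upper) series from model name -> runoff series."""
    series = list(members.values())
    n_steps = len(series[0])
    if any(len(s) != n_steps for s in series):
        raise ValueError("all ensemble members must have the same length")
    # Ensemble mean (ENS): arithmetic mean of the members at each lead time.
    mean = [sum(s[i] for s in series) / len(series) for i in range(n_steps)]
    # Assumed spread: envelope between the lowest and highest member.
    lower = [min(s[i] for s in series) for i in range(n_steps)]
    upper = [max(s[i] for s in series) for i in range(n_steps)]
    return mean, lower, upper
```

With members such as HBV NSE, HBV KGE, GR4J NSE, and GR4J KGE, the first returned series corresponds to the single ENS line, and the other two bound the shaded spread around it.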

Role of Meteorological Forecast Efficiency
Discrepancies between observed and predicted runoff may have many sources (uncertainties), e.g., the inability of the hydrological model to capture the entire diversity of runoff formation processes on the considered watershed, and/or systematic biases in meteorological input data that could trigger errors in initial conditions and following predictions [59,60].
In the presented study, we provide information on cumulative model-related and data-induced errors that could be represented as the difference between the reference efficiency (1 for both NSE and KGE) and the efficiency on the calibration/evaluation (Section 3.1) and forecast (Section 3.2) periods. Results showed that, on average, runoff forecast efficiency decreases by 10% in terms of NSE between the t − 1 and t + 7 days lead times (Section 3.2, Figure 8). While it is almost impossible to distinguish the different sources of errors in runoff prediction without a controlled environment, here we provide a brief quality assessment of ICON forecasts by comparing them with ERA5 data, which is considered as ground truth (Figure 9). Results show that while numerical weather prediction has made a huge step towards increasing the efficiency of weather forecasts in recent decades [61], the chaotic nature of precipitation-related processes remains the main (unsolved) problem. It is clear that the efficiency of air temperature forecasting is solid and highly reliable: the lowest correlation coefficient is around 0.93, with the lowest mean value of 0.96 for a lead time of one week (Figure 9, top panel). In contrast, the mean correlation coefficient for precipitation decreases from 0.9 to 0.29 for the lead times of t + 1 and t + 7 days, respectively (Figure 9, bottom panel). However, due to the crucial role of transformation processes of water flow on a watershed (e.g., water travel time, basin memory), that distinct decrease in precipitation forecast efficiency does not directly transfer to a similar decrease in runoff forecast efficiency. Recent studies [62,63] show that modern deep learning techniques have ample potential to set new state-of-the-art results in the field of precipitation forecasting. Until then, OpenForecast may increase the number of meteorological forecasting products (apart from ICON alone) used in the system's computational core to provide a wider range of ensemble forecasts.
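The comparison of ICON forecasts with ERA5 data can be sketched as a Pearson correlation computed separately for each lead time. The sketch below assumes that, for every lead time, the basin-averaged forecast values and the ERA5 values valid on the same dates are already aligned in two lists of equal length; the names `pearson` and `correlation_by_lead_time` are illustrative.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient for two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_lead_time(forecasts: dict, reference: dict) -> dict:
    """Map each lead time (in days) to the forecast/reference correlation.

    forecasts[lead] and reference[lead] hold values valid on the same dates.
    """
    return {lead: pearson(forecasts[lead], reference[lead]) for lead in forecasts}
```

Running this for one basin and one variable gives a single correlation value per lead time; repeating it over all basins would produce the kind of distribution summarized in Figure 9.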

OpenForecast Users
The description of the users of runoff forecasting systems is usually ignored in the scientific literature. The high importance of any developed forecasting system is unquestionable as long as it helps mitigate the effects of extreme floods. However, high importance does not assure a high number of users, and we argue that this topic has high relevance for the hydrological community, which develops forecasting services.
In contrast to weather forecasts, runoff forecasts have limited temporal and spatial demand. Floods typically occur in known flood-rich periods and should be impact-relevant, i.e., affect population and material property in river valleys, to be a problem for local communities. Thus, many people do not even require any flood forecasts, which cannot be said about weather forecasts. Figure 10 illustrates OpenForecast daily users and the devices they use to access the forecasting system website (https://openforecast.github.io, accessed on 22 December 2021). Figure 10 shows that the daily number of OpenForecast users highly correlates with the flood-rich periods: spring, snowmelt-driven floods (March-June) and summer, rainfall-driven floods (June-July). Thus, of the 12 months of the year, only four are of interest to the public. Because of its continuous operational run on an everyday basis, OpenForecast therefore has very high idle costs, which could be negligible for government-driven agencies or big tech companies but are considerable for small independent research groups or startups. Also, most OpenForecast users access it from their desktops, which are obsolete devices in the new mobile era. There are obvious reasons for that, e.g., the absence of a mobile version or application and the (comparatively) slow evolution of flood events, which do not require frequent on-the-go updates. The absolute number of users is also meager and could be considered negligible compared to the daily users of weather forecasts (millions of people). However, we know that OpenForecast is routinely utilized as an additional information source by different government authorities; thus, it indirectly delivers reliable 7-day ahead runoff forecasts to a wider audience.

Conclusions
The main aim of the presented Short Communication is to provide an up-to-date performance assessment of the long-term operational run of the OpenForecast system, the first national-scale service that delivers 7-day ahead runoff forecasts for 843 gauges across Russia. To that end, we assess the efficiency of OpenForecast on the evaluation period from 14 March 2020 to 31 October 2021 (597 days) for 252 gauges that have been supported by reliable operational runoff observations (Figure 3). The results could be summarized following the related research questions as follows:
1. All hydrological models under the hood of the OpenForecast computational workflow (Figure 1) demonstrate robust and reliable results of runoff prediction on both calibration and evaluation (hindcast) periods (Figure 4). We argue that the selected hydrological models form a solid basis for operational forecasting systems, allowing consistent and skillful runoff predictions.
2. While the OpenForecast system utilizes different sources of meteorological data for different modeling phases (Figures 1 and 2), there are no distinct gaps in model performance between them (Figure 6). An additional insight is that simpler models have comparable or even higher reliability on the evaluation period than more complex models, even while demonstrating similar results on the calibration period.
3. The ensemble mean of individual model forecast realizations outperforms each model in terms of NSE and KGE for all considered evaluation periods and lead times (Figure 8). That underlines that the communication of the ensemble mean to the end-users is the best dissemination strategy so far.
4. Despite the recent advances in numerical weather prediction, the skill of one-week-ahead precipitation forecasting remains the main (unsolved) problem in the forecasting chain (Figure 9). However, due to the comparatively high inertia of runoff formation processes on a watershed, uncertainties of the precipitation forecast do not entirely transfer to the runoff predictions.
5. User engagement in accessing runoff forecasting systems is low and mostly limited to flood-rich periods (March-July) (Figure 10). That makes the costs of idle systems high and requires new, mobile-first approaches to deliver runoff forecasts to the general public efficiently.
In summary, OpenForecast could be considered a successful national-scale forecasting service that delivers timely and reliable runoff predictions for hundreds of gauges across Russia. In further studies, we will continue to increase the diversity of issued runoff ensembles by increasing the number of utilized hydrological models and sources of meteorological forecast data. In addition, we see ample potential for deep learning techniques to be utilized at different stages of the forecasting chain to increase its efficiency.

Figure 2. Illustration of the temporal sequence of modeling phases and the corresponding input data.

Figure 3. The spatial location of OpenForecast gauges (n = 843) and those from the ESIMO database that were selected for the verification procedure (n = 252).

Figure 4. Differences between individual hydrological model performances in terms of NSE for calibration and hindcast (evaluation) periods.Violin plots represent the distribution of estimated values within the full range of variation.Dashed lines show the quantiles of 25, 50, and 75%, respectively.

Figure 5. Spatial distribution of differences between performances of calibration and hindcast periods for the HBV KGE model.

Figure 6. Differences between performances of individual hydrological models and their ensemble mean for the hindcast, pre-operational hindcast, and forecast modeling phases in terms of NSE. Violin plots represent the distribution of estimated values within the full range of variation. Dashed lines show the quantiles of 25, 50, and 75%, respectively.

Figure 7. Time series of observed and predicted (7-day ahead ensemble mean) streamflow for gauges with high (top panel), medium (middle panel), and low (bottom panel) NSE.

Figure 8. Mean values of individual model efficiencies and their ensemble mean for the evaluation period (hindcast, pre-operational hindcast, and forecast) in terms of NSE (top panel) and KGE (bottom panel) metrics.

Figure 9. Evaluation-period correlation coefficients between ERA5 reanalysis and ICON forecast data with increasing lead times for air temperature (T, top panel) and precipitation (P, bottom panel).

Figure 10. OpenForecast daily users and the distribution of their device types.

Figure A2. Differences between performances of individual hydrological models and their ensemble mean for the hindcast, pre-operational hindcast, and forecast modeling phases in terms of KGE. Violin plots represent the distribution of estimated values within the full range of variation. Dashed lines show the quantiles of 25, 50, and 75%, respectively.