Multi-Step Ahead Probabilistic Forecasting of Daily Streamflow Using Bayesian Deep Learning: A Multiple Case Study

Abstract: In recent decades, natural calamities such as droughts and floods have caused widespread economic and social damage. Climate change and rapid urbanization contribute to the occurrence of natural disasters, and their destructive impact has intensified, posing significant challenges to the efficiency, equity, and sustainability of water resources allocation and management. Uncertainty estimation in hydrology is essential for water resources management: by quantifying the uncertainty associated with reliable hydrological forecasts, an efficient water resources management plan can be obtained. Moreover, reliable forecasting provides significant information about the future to assist risk assessment. Currently, the majority of hydrological forecasts use deterministic approaches. Nevertheless, deterministic forecasting models cannot account for the intrinsic uncertainty of forecasted values. Using a Bayesian deep learning approach, this study developed a probabilistic forecasting model, covering the pertinent subproblem of univariate time series models for multi-step ahead daily streamflow forecasting, to quantify epistemic and aleatory uncertainty. The new model implements Bayesian sampling in a long short-term memory (LSTM) neural network by using variational inference to approximate the posterior distribution. The proposed method is verified with three case studies in the USA and three forecasting horizons. LSTM as a point-forecasting neural network model and three probabilistic forecasting models, namely LSTM-BNN, BNN, and LSTM with Monte Carlo (MC) dropout (LSTM-MC), were applied for comparison with the proposed model. The results show that the proposed Bayesian long short-term memory (BLSTM) outperforms the other models in terms of forecasting reliability, sharpness, and overall performance, and that all probabilistic forecasting models outperformed the deterministic model with a lower RMSE value.
Furthermore, the uncertainty estimation results show that BLSTM handles data with higher variation and peaks, particularly for long-term multi-step ahead streamflow forecasting, better than the other models.


Introduction
Sustainable water resource management is an essential requirement worldwide, and streamflow forecasting is an essential component of an effective water resource management plan [1]. Accurate streamflow forecasting plays a critical role in many decision-making scenarios related to water resource management, such as flood/drought control and mitigation, reservoir management, hydropower generation, sediment transport, and irrigation management [2]. Owing to the complex and nonlinear characteristics associated with streamflow [3], forecasting is challenging. Sustainable water resource management plans are used to meet the requirements of people today and in the future. To support risk-aware decision-making in water resource management, current streamflow forecasting approaches should be improved to estimate forecasting uncertainties and to leverage large volumes of data with complex dependencies [4].
A limited number of river basins have been studied from the Bayesian perspective to date. Furthermore, it remains to be determined whether Bayesian approaches are suitable for watersheds of different sizes and with different physical and climatic characteristics. Costa and Fernandes [34] developed a Bayesian framework to estimate the extreme flood quantile from a rainfall-runoff model of a dam in California. Xu et al. [35] developed a real-time probabilistic channel flood forecasting model by combining a hydraulic model with the Bayesian approach in the upstream reaches of the Three Gorges Dam on the Yangtze River, China. A state-of-the-art review was provided by Huang et al. [36] summarizing the application of Bayesian inference in system identification and damage assessment for civil infrastructure. Goodarzi et al. [37] proposed a decision-making model using BN to predict heavy precipitation in the Kan Basin. As noted above, Bayesian neural networks have yet to be applied to probabilistic streamflow forecasting.
Recent studies on probabilistic prediction in the field of hydrology are summarized in Table 1. As shown in Table 1, some researchers trained a deterministic model and used the deterministic result to derive a probabilistic forecast and estimate the uncertainty [38]. In other studies, a deterministic layer has been coupled with a probabilistic layer to obtain forecasting uncertainty [39]. Conversely, a few studies have developed fully probabilistic models by introducing stochastic components into the network, giving it either stochastic activations or stochastic weights [40-42].
The application of probabilistic DL has shown superior performance in various fields, including residential net load forecasting [45,46], short-term scheduling in power markets [47], photovoltaic power [48], load forecasting for buildings [49], and electricity consumption [50]. This indicates the wide range of potential applicability of probabilistic DL approaches. Univariate streamflow forecasting using conventional data-driven models has been investigated in previous studies [51-54]. To the best of the authors' knowledge, BLSTM has not yet been applied to multi-step ahead probabilistic streamflow prediction using a retrospective univariate time series. To address the aforementioned research gaps, this study proposes a framework for transforming a deterministic model into a probabilistic model with improved performance.
This study developed a Bayesian deep neural network framework to characterize the prognostic uncertainties for probabilistic streamflow forecasting, investigating both epistemic and aleatoric uncertainties. The motivation of the framework was to transform existing deterministic prediction models into their probabilistic counterparts for better performance in water resources management and decision-making, and to address newly emerging challenges that humankind encounters primarily due to climate change.
The primary contributions of the study are as follows. For the first time in streamflow prediction, we introduce the Bayesian LSTM network for multi-step ahead probabilistic forecasting in water resource management: Bayesian theory and LSTM networks are combined to generate probabilistic streamflow forecasts that capture both epistemic and aleatoric uncertainties. Moreover, a comprehensive comparison with a series of state-of-the-art probabilistic prediction methods is conducted, and the superior performance of the proposed scheme is demonstrated with respect to both deterministic and probabilistic forecasting results. Finally, to demonstrate the benefit of probabilistic forecasting, particularly for water resource management, a comparative analysis was conducted for three case studies with different forecasting horizons and timescales.
The paper is organized as follows. In Section 2, the materials and methods are presented in subsections on Bayesian long short-term memory (BLSTM) (2.1), experimental setup (2.2), and performance evaluation (2.3). In Section 3, the case study is introduced, with the study area (3.1) and experimental setup (3.2) detailed. The results are presented in Section 4, with two subsections focusing on the probabilistic forecasting performance assessment (4.1) and the impact of the forecast horizon on probabilistic forecasting performance (4.2). Finally, the concluding remarks of this study, with directions for future research, are given in Section 5.

Materials and Methods
The proposed Bayesian deep-learning approach for probabilistic streamflow forecasting is presented in detail in the following sections.

Bayesian Long Short-Term Memory (BLSTM)
In this study, the Bayesian approach is employed, which is a well-established and thorough approach to fitting probabilistic models that capture and distinguish different sources of uncertainty [55]. A BNN is a stochastic artificial neural network (ANN) trained using the Bayesian approach. In the Bayesian approach, probability is defined in terms of a degree of belief: the more likely an outcome is, the higher its degree of belief. The primary idea of the Bayesian approach in deep learning is to replace each weight with a distribution [56]. An LSTM network overcomes the long-term dependency issue of conventional RNNs through additional interactions in its unit cells. LSTM cells (memory cells) are composed of three gates (input, forget, and output) for short-term memory selection and a state-vector transmission responsible for long-term memory. Information can be selectively passed during the learning procedure by manipulating the gate settings. The LSTM network is mathematically represented as follows [57]:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh(C_t)

where at time step t, x_t is the input vector; h_t is the LSTM output and hidden state (short-term memory); and i_t, f_t, and o_t are the input, forget, and output gates, respectively. W and b denote the weight matrices and biases; C_t is the current cell state (long-term memory), and C̃_t is the candidate cell state; ⊙ denotes element-wise multiplication; and σ is the sigmoid activation function that uses h_{t−1} and x_t to make decisions regarding the input, forget, and output gates [57].
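As a minimal numerical sketch of the gate equations above (an illustrative NumPy implementation, not the network used in this study; the function name `lstm_step` and the stacked-weight layout are the authors' assumptions for exposition), one LSTM time step can be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the gate equations above.

    W maps the concatenated [h_{t-1}, x_t] to the stacked
    pre-activations of the input, forget, output, and candidate
    gates; b is the corresponding bias vector."""
    n_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i_t = sigmoid(z[:n_h])              # input gate
    f_t = sigmoid(z[n_h:2 * n_h])       # forget gate
    o_t = sigmoid(z[2 * n_h:3 * n_h])   # output gate
    c_tilde = np.tanh(z[3 * n_h:])      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # long-term memory update
    h_t = o_t * np.tanh(c_t)            # short-term memory output
    return h_t, c_t
```

In a full network this step is applied recurrently over the 30-day input window, with the four gate blocks usually fused into one matrix multiply as above for efficiency.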
Given the input data X_Train = [x_1, ..., x_Train] and their corresponding output labels Y_Train = [y_1, ..., y_Train], the primary goal of the Bayesian approach is to identify the parameters W of a function y = f^W(x) that probably generated the desired output [58,59]. In this approach, a prior distribution, representing the prior belief about the neural network parameters before observing the inputs, is placed over W to capture epistemic uncertainty. The Bayesian neural network structure is illustrated in Figure 1. Rather than a single network, this method trains a set of networks in which the weights of each network are drawn from a shared, learned probability distribution [59]. Setting a standard normal prior with zero mean, which brings the benefit of regularization, has been demonstrated to be one of the most effective choices when the prior distribution is difficult to identify. After training the Bayesian deep neural network and observing the data, the model likelihood p(Y_Train | f^W) is defined as a normal distribution N(f^W(X_Train), σ²) with observation noise σ to capture roughly suitable parameters. Based on Bayes' rule, the posterior p(W | X_Train, Y_Train) is placed over the weights to generate samples of predictions rather than the prior distribution. The posterior is calculated as follows [59]:

p(W | X_Train, Y_Train) = p(Y_Train | X_Train, W) p(W) / p(Y_Train | X_Train)

where p(Y_Train | X_Train) is the marginal likelihood, which cannot be estimated; thereby, the posterior is not tractable without variational inference to approximate it. With this distribution, suitable parameters given by the input data can be captured, and the output y can be predicted for a new input x by integration [58]:

p(y | x, X_Train, Y_Train) = ∫ p(y | x, W) p(W | X_Train, Y_Train) dW

To evaluate the true posterior p(W | X_Train, Y_Train), an approximating variational distribution q_θ(W), parameterized by θ, is required, such that the optimal q_θ(W) can represent the posterior by minimizing the
Kullback-Leibler (KL) divergence [60] between the variational and posterior distributions [61]:

θ* = argmin_θ KL( q_θ(W) || p(W | X_Train, Y_Train) )

Generally, two methods are available to approximate the posterior distribution: variational inference (VI) and Monte Carlo (MC) dropout [56,61,62]. This study employed VI to solve the optimization problem analytically. Interested readers can refer to [62] for detailed information on the approximation method. The predictive distribution can then be approximated by

p(y | x, X_Train, Y_Train) ≈ ∫ p(y | x, W) q_θ(W) dW = q_θ(y | x).    (11)
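As a hedged sketch of the variational machinery (illustrative NumPy code, not the Keras/TensorFlow implementation used in the study; `sample_weight` and `kl_gaussians` are hypothetical helper names), a mean-field Gaussian q_θ(W) with the standard normal prior mentioned above can be sampled via the reparameterization trick and penalized with a closed-form KL term:

```python
import numpy as np

def sample_weight(mu, rho, rng):
    """Reparameterized draw W = mu + softplus(rho) * eps from the
    variational distribution q_theta(W) = N(mu, softplus(rho)^2),
    so gradients can flow through mu and rho during training."""
    sigma = np.log1p(np.exp(rho))  # softplus keeps sigma positive
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_gaussians(mu, rho, prior_sigma=1.0):
    """Closed-form KL( q_theta(W) || N(0, prior_sigma^2) ), summed
    over all weights; added to the data-fit loss during VI."""
    sigma = np.log1p(np.exp(rho))
    return np.sum(np.log(prior_sigma / sigma)
                  + (sigma**2 + mu**2) / (2.0 * prior_sigma**2) - 0.5)
```

During training, each forward pass draws fresh weights with `sample_weight`, and the loss is the negative log-likelihood plus `kl_gaussians`, which is the evidence lower bound (ELBO) objective minimized in VI.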

Epistemic Uncertainty
Mathematically, by simulating the model for input x, the predictive mean can be estimated with an unbiased estimator as follows [63]:

E[y] ≈ (1/T) Σ_{t=1}^{T} f^{Ŵ_t}(x)

where E[y] is the predictive mean, f^{Ŵ_t} is the stochastic output of the prediction model, Ŵ_t denotes the sampled weights, and T denotes the number of samples. Similar to the estimation of the predictive mean, given that Ŵ_t ∼ q_θ(W) and p(y | f^W(x)) = N(y; f^W(x), σ²) for σ > 0, the second moment can be estimated by an unbiased estimator as follows [63]:

E[yᵀy] ≈ σ² + (1/T) Σ_{t=1}^{T} f^{Ŵ_t}(x)ᵀ f^{Ŵ_t}(x)

The term σ² corresponds to the inherent noise in the input data. The epistemic uncertainty, which represents the uncertainty of the model about its prediction outputs, is then captured by the predictive variance, which can be approximated as [63]:

Var[y] ≈ σ² + (1/T) Σ_{t=1}^{T} f^{Ŵ_t}(x)ᵀ f^{Ŵ_t}(x) − E[y]ᵀ E[y]
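The Monte Carlo estimators above can be sketched in a few lines of NumPy (an illustration under the assumption of scalar or vector outputs stacked row-wise; `predictive_moments` is a hypothetical helper name, not part of the paper's code):

```python
import numpy as np

def predictive_moments(samples, noise_var):
    """Monte Carlo estimates of the predictive mean and variance from
    T stochastic forward passes f^{W_t}(x), stacked as rows of
    `samples`. The spread of the sampled outputs is the epistemic
    part; `noise_var` (sigma^2) is the inherent data noise."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)       # E[y]
    epistemic = samples.var(axis=0)   # model uncertainty about y
    variance = noise_var + epistemic  # total predictive variance
    return mean, variance
```

For example, two sampled forecasts of 1 and 3 m³/s with σ² = 0.5 give a predictive mean of 2 and a predictive variance of 1.5.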

Aleatoric Uncertainty
Aleatoric uncertainty can be divided into homoscedastic and heteroscedastic uncertainty. To capture aleatoric uncertainty, the parameter σ must be tuned. In the homoscedastic case, the observation noise σ is assumed to be constant for every input x; in contrast, heteroscedastic uncertainty assumes that the observation noise varies with the input. Heteroscedastic models are data-dependent and can be expressed by the loss

L(θ) = (1/N) Σ_{i=1}^{N} ||y_i − f(x_i)||² / (2 σ(x_i)²) + (1/2) log σ(x_i)²

Because the maximum a posteriori estimate finds a single value for the parameter θ, this approach does not capture epistemic uncertainty, since that is a property of the model, not of the input data.
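The heteroscedastic loss above can be sketched as follows (an illustrative NumPy version, assuming the network predicts log σ²(x) for numerical stability, a common convention rather than something stated in the text; `heteroscedastic_nll` is a hypothetical name):

```python
import numpy as np

def heteroscedastic_nll(y, y_hat, log_var):
    """Gaussian negative log-likelihood with input-dependent noise.

    The network predicts log sigma^2(x) alongside y_hat(x):
    observations the model deems noisy are down-weighted by the
    exp(-log_var) factor, while the 0.5*log_var term stops the
    model from declaring every point noisy."""
    return np.mean(0.5 * np.exp(-log_var) * (y - y_hat) ** 2
                   + 0.5 * log_var)
```

With a residual of 2 and log σ² = 0 the per-sample loss is 0.5 · 4 = 2, and it vanishes when the prediction is exact with unit noise.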

Combining Aleatoric and Epistemic Uncertainty
Abdar et al. [64] explained that an effective way to combine both uncertainties in a single model is to transform the heteroscedastic model into a Bayesian model by placing a distribution over its weight and bias parameters. Thus, both the predictive mean and the variance are derived from the developed prediction model:

[ŷ, σ̂²] = f^{Ŵ}_M(x)
where f^{Ŵ}_M is the prediction model (BLSTM) used in this study, parameterized by the model weights Ŵ [55,56]. A Gaussian likelihood is used to model the aleatoric uncertainty, and the final loss function of the prediction model can be expressed as [58]:

L(θ) = (1/T_sample) Σ_{i=1}^{T_sample} (1/(2 σ̂_i²)) ||y_i − ŷ_i||² + (1/2) log σ̂_i²

Finally, the predictive uncertainty of the prediction model, consisting of both aleatoric and epistemic uncertainty, can be approximated as

Var(y) ≈ (1/T) Σ_{t=1}^{T} ŷ_t² − ( (1/T) Σ_{t=1}^{T} ŷ_t )² + (1/T) Σ_{t=1}^{T} σ̂_t²

where T_sample denotes the number of training samples. An example of the Bayesian LSTM cell of the proposed BLSTM network is shown in Figure 2, with a zoomed-in plot of the forget gate at time step t in the first layer.
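The combined-uncertainty approximation above can be sketched numerically (an illustration, not the paper's implementation; `total_predictive_variance` is a hypothetical helper name) by drawing T forward passes of the Bayesian heteroscedastic model, each of which returns a mean ŷ_t and a noise variance σ̂_t²:

```python
import numpy as np

def total_predictive_variance(mu_samples, var_samples):
    """Combine epistemic and aleatoric uncertainty from T samples of a
    Bayesian heteroscedastic model.

    total = Var_t[mu_t]   (epistemic: disagreement between sampled nets)
          + E_t[sigma_t^2] (aleatoric: average predicted data noise)"""
    mu_samples = np.asarray(mu_samples, dtype=float)
    var_samples = np.asarray(var_samples, dtype=float)
    epistemic = mu_samples.var(axis=0)
    aleatoric = var_samples.mean(axis=0)
    return epistemic + aleatoric
```

For instance, sampled means of 1 and 3 with predicted noise variances of 0.5 each give a total predictive variance of 1 + 0.5 = 1.5.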

Performance Evaluation Metrics
To assess the performance of the prediction models, this study adopted the root mean square error (RMSE) metric for deterministic prediction and three metrics for probabilistic prediction: continuous ranked probability score (CRPS), prediction interval coverage probability (PICP), and mean prediction interval width (MPIW), which are formulated as follows.

Metric for Deterministic Forecasting
To evaluate the accuracy of the deterministic forecasting results, the root-mean-square error (RMSE), a commonly used hydrological evaluation indicator, was selected. The RMSE is defined as

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (Ŷ_i − Y_i)² )

where Ŷ_i is the predicted value, Y_i is the observed value, and n is the number of samples.
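In code, the RMSE definition above is a one-liner (a NumPy sketch; the function name is illustrative):

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root-mean-square error between observed and predicted values."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_obs) ** 2))
```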

Metrics for Probabilistic Forecasting
A useful metric for assessing the accuracy of probabilistic prediction models is the CRPS. The CRPS expresses the distance between the probabilistic forecast and the observed value Y_i and is defined as

CRPS = ∫ ( P_{Ŷ_i}(x) − H(x − Y_i) )² dx

where P_{Ŷ_i} is the prediction cumulative distribution function (CDF) obtained from the probability density function (PDF) p(x), and H is the Heaviside step function, which equals 0 if x < Y_i and 1 otherwise. The mean prediction interval width (MPIW) is an effective representation of sharpness in probabilistic predictions. This metric is defined as

MPIW = (1/n) Σ_{i=1}^{n} ( Ŷ_i^u − Ŷ_i^l )

where n is the size of the test set, and Ŷ_i^u and Ŷ_i^l denote the upper and lower bounds of the 95% prediction interval, respectively.
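Both metrics are easy to compute from forecast samples. The sketch below uses the standard sample-based CRPS estimator CRPS ≈ E|X − y| − ½ E|X − X′| (equivalent to the CDF-distance definition above for an empirical forecast distribution); the function names are illustrative, not from the paper's code:

```python
import numpy as np

def crps_from_samples(forecast_samples, y_obs):
    """Sample-based CRPS for one observation:
    E|X - y| - 0.5 * E|X - X'|, with X, X' drawn from the forecast."""
    x = np.asarray(forecast_samples, dtype=float)
    term1 = np.mean(np.abs(x - y_obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

def mpiw(lower, upper):
    """Mean width of the prediction intervals (sharpness)."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))
```

A degenerate (deterministic) forecast reduces the CRPS to the absolute error, which is why CRPS is often read as a probabilistic generalization of the MAE.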
The prediction interval coverage probability (PICP) is the probability that the target lies within the interval provided by the prediction model. PICP is defined as

PICP = (1/n) Σ_{i=1}^{n} c_i,   where c_i = 1 if Y_i ∈ [Ŷ_i^l, Ŷ_i^u] and c_i = 0 otherwise

Thus, PICP indicates the frequency with which the prediction interval (PI) captures the observed value, ranging from 0 if all observed values lie outside the PI to 1 if all lie inside it.
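The PICP indicator above translates directly into code (a NumPy sketch with an illustrative function name):

```python
import numpy as np

def picp(y_obs, lower, upper):
    """Fraction of observations falling inside their prediction
    interval [lower_i, upper_i]; 1.0 means full coverage."""
    y = np.asarray(y_obs, dtype=float)
    inside = (y >= np.asarray(lower)) & (y <= np.asarray(upper))
    return float(inside.mean())
```

For a nominal 95% interval, a well-calibrated model should score a PICP close to 0.95; values far below indicate overconfident (too narrow) intervals.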

Case Study
To evaluate the performance of the probabilistic data-driven models under different conditions, three basins in the United States with different hydroclimatic conditions and drainage areas were selected as study areas, as described in the following section.

Study Area
The study basins are located in different climate regions of three states across the United States: Indiana (IN), Minnesota (MN), and California (CA). Figure 3 shows the locations of the three basins. The first case study was conducted in Bartholomew County, IN; the second in Koochiching County, MN; and the third in Shasta County, CA. The drainage areas of the river basins are approximately 1560-4420 km². Based on the USGS statewide streamflow water-year 2021 report, the annual mean streamflow was ranked by state from 1 to 92, indicating the maximum and minimum annual flow for all years analyzed. Streamflow rankings were grouped into categories of much below normal, below normal, normal, above normal, and much above normal based on percentiles of flow (<10%, 10-24%, 25-75%, 76-90%, and >90%, respectively) [65]. Much-below-normal streamflow, with a rank of 84-91, was reported in CA, and below-normal streamflow, with a rank of 70-83, was reported in MN. The annual mean streamflow for IN was reported to be normal, with a rank of 24-50. Daily historical streamflow data for the three selected case studies were obtained from the United States Geological Survey (USGS) website (https://waterdata.usgs.gov/nwis) (accessed on 1 February 2022).
The descriptive information of the daily streamflow in the three case studies is presented in Table 2. Details of the case studies, including gauge ID, gauge name, and drainage area, are presented in Table 3.For all catchments, streamflow with a 30-day lag was considered owing to the cross-correlation results.

Experiment Setup
Before using the data to train the models, preprocessing began with log transformation and min-max normalization as the initial phase of model development. The input time step was then derived from an autocorrelation analysis of the transformed streamflow time series: using a threshold of 0.5, which represents a moderate relationship, the past 30 days were selected as input. The autocorrelation analysis results for the three case studies are given in the Supplementary File. The datasets for the three case studies were split into three sets: the first set, accounting for 60% of the data, was used for model training; the second set (20%) was used for model validation; and the remaining 20% was used for testing. Subsequently, the sliding-window technique with a window size of 30 days was used. To demonstrate the superior performance of the Bayesian forecasting approach, probabilistic methods that have been widely used in the literature were employed for comparison. More specifically, LSTM-BNN [66], LSTM with Monte Carlo dropout (LSTM-MC) [62,67], BNN [68], and a deterministic LSTM were implemented in this study. Monte Carlo dropout is a straightforward epistemic-uncertainty extension of a neural network. In general, dropout is a technique used to avoid overfitting by randomly dropping units during training, which can be considered the application of random noise during training. When dropout is also applied over multiple forward passes at inference time, multiple results are obtained, and the distribution of these samples represents the uncertainty of the prediction model. The structures of the prediction models, along with their graphical schemes, are given in the Supplementary File.
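The preprocessing pipeline described above (log transform, min-max normalization, 30-day sliding window, 60/20/20 chronological split) can be sketched as follows; this is an illustrative NumPy version under the stated assumptions, with hypothetical function names, and a small `eps` offset is the authors' assumption to guard against zero flows:

```python
import numpy as np

def minmax_log(q, eps=1e-6):
    """Log-transform then min-max normalize a streamflow series."""
    q = np.log(np.asarray(q, dtype=float) + eps)
    return (q - q.min()) / (q.max() - q.min())

def sliding_windows(series, window=30, horizon=1):
    """Build (input, target) pairs: the past `window` days predict
    the value `horizon` days ahead."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(X), np.array(y)

def split_60_20_20(n):
    """Chronological train/validation/test boundary indices."""
    return int(0.6 * n), int(0.8 * n)
```

Keeping the split chronological (rather than shuffled) avoids leaking future streamflow into the training set, which matters for forecasting evaluation.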
The prediction models were developed in Python 3.6.9 with the Keras [69], TensorFlow [70], and PyTorch [71] libraries. The prediction models were run on an NVIDIA® GeForce® RTX 2070 SUPER graphics card and an Intel® Core i9-10920X central processing unit at 3.5 GHz with 128 GB of random-access memory. For a fair comparison among the prediction models, a grid search for hyperparameter tuning was used to ensure identical evaluation.

Results and Discussion
To better clarify the forecasting performance of the BLSTM method, this study compares BLSTM with LSTM-BNN, BNN, and LSTM-MC in terms of prediction interval uncertainty, sharpness, prediction reliability, and multi-step ahead probabilistic prediction performance. Moreover, LSTM is used as a deterministic model to evaluate the performance of all probabilistic prediction models against a deterministic baseline. In this section, the predictive ability of the four probabilistic models for 1-day (Scenario I), 7-day (Scenario II), and 30-day (Scenario III) ahead streamflow prediction is investigated.

Probabilistic Prediction Performance Assessment
The PICP, MPIW, and CRPS values for the four models in the three case studies during the test period are listed in Table 4. The length of each bar represents the value of the evaluation metric: for PICP, the higher the value, the better (and the longer the bar), and vice versa for the other measures. Three major aspects must be considered simultaneously to evaluate probabilistic forecasting performance: PICP reflects the reliability of a model, MPIW its sharpness, and CRPS its overall performance. Note that the larger the PICP and the smaller the MPIW and CRPS, the better the model performance. In Scenario I, case study I was considered as an example; because the models use the same mechanism to quantify the forecast uncertainty, the PICP values of the four models were relatively close. We observed that BLSTM performed better in handling datasets with a high standard deviation and peak streamflow. Case study III had 22,645 samples, which was ~17% and ~34% fewer than case studies I and II, with 27,146 and 34,205 samples, respectively. This difference did not lead to a particular change in the prediction performance of any of the models in the first scenario. In case study III, the PICP of the BLSTM model decreased by ~2% and 4% from Scenario I to Scenarios II and III, respectively; for case study I it decreased by ~25% and 50%, and for case study II by ~1% and 2%, respectively. Therefore, from the obtained results, we infer that for single-step ahead prediction the results were promising for all case studies, and the number of samples and peak streamflow did not affect the prediction performance. This makes BNNs extremely data-efficient, because they can learn even from a small dataset without overfitting. Furthermore, more uncertainty is associated with the results, particularly for the case study with a higher streamflow peak, leading to wider prediction intervals and lower coverage. As expected, all models exhibited
better predictive performance during shorter lead times (1-7 days) than over the longer horizon (30 days). Therefore, from the obtained results, we can infer that probabilistic forecasting models exhibit higher uncertainty and lower accuracy over longer forecasting horizons. To further evaluate the results of the four probabilistic models for all scenarios in the three case studies, LSTM, as a well-known deterministic model, was used for comparison in terms of RMSE, as shown in Figure 4. All probabilistic models in all horizons performed well and provided more accurate predictions in terms of RMSE than the deterministic LSTM, indicating the superiority of the probabilistic models over the conventional deterministic model. The range of RMSE values indicated that all models were fairly trained and showed promising predictive performance. As shown in Figure 5, the probability that the prediction lies within the prediction interval is highest for the LSTM-MC model, followed by LSTM-BNN, BNN, and BLSTM in the first scenario. The PICP values indicate the percentage of the observed streamflow data lying within the 95% predictive intervals. In Scenario I, LSTM-MC outperformed BNN in terms of PICP. Moreover, for the longer horizons (7 days and 30 days ahead), owing to the vanishing gradients of LSTM and to BNNs being non-sequential models, both showed the lowest coverage, and the number of points falling within the interval decreased as uncertainty increased, in comparison with BLSTM and LSTM-BNN. The MPIW values of the four models during the test period are shown in Figure 6a. The MPIW is considered an effective representation of sharpness in probabilistic predictions and refers to the concentration of the predictive distributions: the more concentrated the predictive distributions, the lower the MPIW, the sharper the prediction, and consequently the better the predictive performance. As shown in Figure 6a, BLSTM has the lowest MPIW,
which indicates that it is the sharpest predictive model among all models in all scenarios. The stand-alone BNN and LSTM-MC differed only slightly, whereas LSTM-MC obtained the highest MPIW value among the models, particularly as the horizon increased. Compared with the BLSTM model with the narrowest MPIW, LSTM-MC had the worst prediction sharpness over all horizons. The fact that BLSTM presents the best predictive capability indicates the significance of capturing both epistemic and aleatoric uncertainties.
To comprehensively evaluate probabilistic prediction accuracy and reliability, a comparison among all prediction models in terms of the CRPS is shown in Figure 6b. Overall, BLSTM and LSTM-BNN competed with each other in all case studies and scenarios. However, over the longer horizon, BLSTM outperformed the other models and proved its superiority by keeping more points within its forecasting interval while keeping the interval as narrow as possible, even as uncertainty increased with the horizon. Therefore, the proposed BLSTM model outperformed the other models in terms of RMSE, MPIW, and CRPS, demonstrating its forecasting accuracy, sharpness, and overall performance.

Impact of the Forecast Horizon on Probabilistic Prediction Performance
The prediction results of all models are compared graphically in Figures 7-10 in the form of time series over the entire test set of case study II for all scenarios (forecasting horizons). Considering space limitations, we avoided adding the results of all case studies and scenarios to the main text, and only the results of case study II are presented. The actual streamflow and the forecast values for the test period are represented by black and red curves, respectively, and the red band represents the prediction interval at a 95% confidence level. The probabilistic forecasts generated with the BLSTM model presented the benefits of high prediction coverage of the observed streamflow data (PICP) with a tighter prediction width (MPIW) and better overall performance (CRPS), corresponding to reliability, sharpness, and resolution. Furthermore, peak streamflow, which is a crucial factor for disaster prevention and water resources management, can be predicted with reasonable magnitudes by the proposed BLSTM. Additionally, as the forecast horizon increased, BLSTM still showed reliable performance, while the other models were incapable of handling this situation. Over the 30-day forecast horizon, massive fluctuations in the prediction results occurred for all models and case studies. However, most of the prediction results were covered by the 95% interval in Scenario I, followed by Scenario II.
As shown in Figure 11, as the forecasting horizon increases from 1 to 30 days, the MPIW and CRPS values increase and the PICP decreases, indicating that the accuracy of the prediction models declines with longer horizons. This decline results mainly from the accumulative error issue in multi-step ahead recursive models and the vanishing gradient issue in long-sequence time-series forecasting. Nevertheless, the predictive mean values of probabilistic streamflow from the BLSTM model matched the observations better than those of the other three models. The performance of all models gradually worsened with increasing lead time in all three case studies. As shown in Figure 11, when the overall prediction accuracy was low, the MPIW was smaller. The interval width of the LSTM-MC forecasts increased rapidly with the prediction horizon, whereas the average interval width of the proposed BLSTM was much smaller than that of the other models. Simultaneously, the overall performance in terms of CRPS was higher, proving the superiority of the proposed method for probabilistic forecasting of daily streamflow, particularly over longer prediction horizons. For a more vivid comparison, the time series of all models for all case studies in the three scenarios are shown in Figure 12a-c for the first year of the test period (365 days). We observed that BNN and LSTM-MC underestimated the peak flows with a misleading trend in the first 365 days. In the first scenario, the 95% PI was relatively narrow and constant for all models, indicating that the models captured both low and high flows appropriately with low uncertainty. However, the longer horizons in Scenarios II and III were associated with a wider 95% PI, indicating greater model uncertainty. Furthermore, all case studies were effectively covered by the PI. For case study II, which had the lowest peak, BLSTM achieved the best results in all three scenarios. In contrast, case study I, with the highest peak of 1654 m³/s and a standard deviation of 90 m³/s, yielded the worst prediction results for all scenarios. From Scenario I to II in case study I, the PICP of BLSTM, LSTM-BNN, BNN, and LSTM-MC decreased by approximately 25, 38, 48, and 53%, respectively, indicating that BLSTM best maintained its predictability in case study I, with the highest peak and standard deviation, over the extended horizon. Moreover, with increasing horizon, the prediction performance for case study I decreased dramatically, whereas the PICP of BLSTM for case studies II and III decreased by only approximately 1-2% and 2-4%, respectively, from Scenario I to III. Furthermore, LSTM-MC and BNN achieved the worst overall prediction performance for all scenarios. The catchment area of case study I is relatively large, and heavy rain is the primary source of streamflow; these two characteristics make the seasonal and annual variations in streamflow greater than in the other two case studies. In this catchment, the streamflow is very stable and small during the dry season, whereas in the rainy season, it increases steeply and then recedes. This made forecasting challenging and resulted in higher uncertainty than in the other case studies. Therefore, for this type of catchment, using more in-situ meteorological predictors, such as precipitation and temperature, along with available high-resolution large-scale hydroclimate data, could improve forecasting accuracy.
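The accumulative error issue arises because a recursive multi-step forecaster feeds each one-step prediction back into its own input window, so early errors propagate through every later step. The following minimal sketch illustrates the recursion for any one-step model (the function name and list-based window are illustrative; the models in this study operate on lag windows of scaled streamflow in the same spirit):

```python
import numpy as np

def recursive_forecast(step_model, history, horizon):
    """Recursive multi-step forecasting: each 1-step prediction is fed
    back as input for the next step, so errors compound over the horizon.
    `step_model` maps a lag window (1-D array) to a single prediction."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        yhat = step_model(np.array(window))
        preds.append(yhat)
        window = window[1:] + [yhat]   # slide the window, feeding back yhat
    return preds
```

After a few steps the window contains only model outputs, no observations, which is why both the point error and the predictive uncertainty grow with the horizon, as seen in Figure 11.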
The kernel density estimation plots of the daily streamflow predictions of all models for case study II are displayed in Figure 13a-c. The kernel density estimates are shown on top, with boxplots and the individual prediction points underneath. The boxes represent the inner quartiles, the vertical line within each box indicates the median, and the diamonds represent the outliers for each model. As shown in Figure 13, the prediction variance of BLSTM is lower than that of the other models, particularly for Scenario III (forecast horizon 30). Moreover, the interquartile range of BLSTM is smaller, indicating that its predictions have less dispersion, while LSTM-MC has the highest dispersion. The results of this study indicate that BLSTM shows the best overall probabilistic prediction performance.
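For reference, the boxplot summaries described above follow the conventional Tukey definitions; a minimal sketch of computing them from a set of predictions (function name and the 1.5 × IQR outlier rule are the standard convention assumed here, not stated explicitly in the study):

```python
import numpy as np

def boxplot_stats(preds):
    """Tukey boxplot summary: median, inner quartiles (Q1, Q3),
    and outliers beyond 1.5 * IQR from the quartiles."""
    preds = np.asarray(preds, dtype=float)
    q1, med, q3 = np.percentile(preds, [25, 50, 75])
    iqr = q3 - q1
    outliers = preds[(preds < q1 - 1.5 * iqr) | (preds > q3 + 1.5 * iqr)]
    return med, (q1, q3), outliers
```

A narrower Q1-Q3 box, as reported for BLSTM, directly corresponds to lower dispersion of the predictive draws.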

Conclusions
This study proposes BLSTM as a probabilistic prediction model to estimate streamflow uncertainty. For comparison, three probabilistic models and one deterministic model, namely LSTM-BNN, BNN, LSTM-MC, and LSTM, were developed under three scenarios: 1-day, 7-day, and 30-day ahead daily streamflow forecasting. The results were compared in terms of reliability, sharpness, and overall performance for three different case studies in the USA. Reliability was measured by the PICP, sharpness by the MPIW, and overall performance by the CRPS. The results show that all probabilistic models outperformed the deterministic model (LSTM). Moreover, among the probabilistic models, BLSTM was superior: it achieved better results with less computing time and was easier to train than LSTM-BNN and BNN. The results reveal that the BLSTM network with variational inference achieved the highest accuracy. The fact that BLSTM shows the best predictive performance indicates the importance of capturing temporal dependencies while accounting for both epistemic and aleatory uncertainty. Moreover, exploiting long- and short-term dependencies and capturing the uncertainty that is inherent in hydrology provides better prediction results.
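To make the variational-inference mechanism referenced above concrete, the sketch below shows the reparameterised sampling of a single Gaussian weight, as in Bayes-by-Backprop-style training of Bayesian layers; the function name and parameter values are illustrative, not the study's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weight(mu, rho):
    """Reparameterised draw of a variational Gaussian weight:
    w = mu + softplus(rho) * eps, eps ~ N(0, 1).
    mu and rho are the learned variational parameters."""
    sigma = np.log1p(np.exp(rho))          # softplus keeps sigma positive
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps
```

Because every forward pass draws fresh weights, repeated passes over the same input yield a distribution of outputs, which is the source of the epistemic uncertainty quantified by BLSTM.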

Figure 1. Structure of the Bayesian neural network.

Figure 2. Example of the proposed BLSTM network with a zoomed-in plot of the forget gate at time step t in the first layer.

Figure 3. Location of 3 study basins in different climate regions across the United States.

Figure 4. Comparison among prediction models in terms of RMSE.

Figure 5. Comparison among prediction models in terms of PICP.

Figure 6. Comparison among prediction models in terms of (a) MPIW and (b) CRPS.

Figure 7. Probabilistic streamflow forecasting results obtained by BLSTM for case study II for (a) forecast horizon 1, (b) forecast horizon 7, and (c) forecast horizon 30.

Figure 9. Probabilistic streamflow forecasting results obtained by BNN for case study II for (a) forecast horizon 1, (b) forecast horizon 7, and (c) forecast horizon 30.

Figure 11. Change in probabilistic streamflow forecasting results with increasing horizon for (a) case study I, (b) case study II, and (c) case study III.

Figure 12. Prediction results of all models with 1, 7, and 30 days ahead forecasting for (a) case study I, (b) case study II, and (c) case study III.

Figure 13. Kernel density estimation plots of daily streamflow prediction of all models in case study II for (a) forecast horizon 1, (b) forecast horizon 7, and (c) forecast horizon 30.

Table 1. Overview of recent probabilistic prediction studies in the field of hydrology.


Table 2. Descriptive information of daily historical streamflow data for three case studies.

Table 3. Details of the selected case studies.

Table 4. Summary of prediction performance results for three case studies and three scenarios.