Uncertainty Quantification in Machine Learning Modeling for Multi-Step Time Series Forecasting: Example of Recurrent Neural Networks in Discharge Simulations

As a revolutionary tool leading to substantial changes across many areas, Machine Learning (ML) techniques have obtained growing attention in the field of hydrology due to their potential to forecast time series. Deep Learning (DL), a subfield of ML, is more concerned with datasets, algorithms and layered structures. Despite numerous applications of novel ML/DL techniques in discharge simulation, the uncertainty involved in ML/DL modeling has not drawn much attention, although it is an important issue. In this study, a framework is proposed to quantify the uncertainty contributions of the sample set, ML approach, ML architecture and their interactions to multi-step time-series forecasting, based on the analysis of variance (ANOVA) theory. A discharge simulation using Recurrent Neural Networks (RNNs) is then taken as an example. The Long Short-Term Memory (LSTM) network, a state-of-the-art DL approach, was selected due to its outstanding performance in time-series forecasting and compared with a simple RNN. In addition, a novel discharge-forecasting architecture is designed by combining the expertise of hydrology with a stacked DL structure, and compared with a conventional design. Taking hourly discharge simulations of the Anhe (China) catchment as a case study, we constructed five sample sets, chose two RNN approaches and designed two ML architectures. The results indicate that none of the investigated uncertainty sources are negligible and that the influence of the uncertainty sources varies with lead-time and discharge. LSTM demonstrates its superiority in discharge simulations, and the ML architecture is as important as the ML approach. In addition, some of the uncertainty is attributable to interactions rather than individual modeling components. The proposed framework can both reveal uncertainty quantification in ML/DL modeling and provide references for ML approach evaluation and architecture design in discharge simulations. It indicates that uncertainty quantification is an indispensable task for a successful application of ML/DL.


Introduction
Recently, Machine Learning (ML) techniques have obtained growing attention, and Deep Learning (DL) techniques have led to substantial changes across many areas of study [1]. As a subfield of ML, DL is more concerned with datasets, algorithms and layered structures. Owing to their enhanced capability to characterize system complexity and their flexible structure, research and applications of ML/DL for time-series prediction are proliferating and promising. In hydrology, the advent of ML/DL techniques has encouraged novel applications or substantially improved old ones [2,3]. Though ML/DL techniques have been widely applied in discharge simulation, the uncertainty involved in ML/DL modeling has not drawn much attention.

Water 2020, 12, 912

The Proposed Framework
Figure 1 shows the diagrammatic flowchart of the proposed framework for uncertainty quantification in ML modeling for multi-step time-series forecasting. The framework is composed of eight steps. Steps 1-3 cover the three parts of ML modeling: partitioning the sample sets, selecting the ML approaches and designing the ML architectures. In Step 4, the ML modeling combination scheme is established. Step 5 is to train all the combined schemes and simulate discharge. In Step 6, the simulated results are compared so that the sample sets, ML approaches and ML architectures are evaluated and analyzed, respectively. The purpose of Step 7 is to quantify the individual and interactive contributions of the different uncertainty sources using ANOVA. Finally, Step 8 is to evaluate the multi-step time-series forecasting models: we estimate the total ensemble uncertainty of the ML modeling and quantify the contributions of the different uncertainty sources in discharge simulations. In this paper, a discharge simulation using RNNs is taken as an example.

RNN Approaches
Two RNN approaches were selected for discharge simulations in this paper. Figures 2 and 3 show the internals of a Simple RNN and LSTM memory cell, respectively.


Simple RNN
Simple RNN has the most basic structure of all RNNs [8]. In Figure 2, X_t is the input at time-step t; h_(t−1) is the hidden state and output at time-step t−1; and h_t is the hidden state and output at time-step t. Moreover, h_t is calculated from the previous hidden state h_(t−1), the current input X_t and the activation function tanh().
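The recurrence above can be sketched in a few lines of NumPy (a minimal illustration; the weight names and toy dimensions are ours, not the paper's):

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, W_x, W_h, b):
    """One Simple RNN step: h_t = tanh(W_x @ x_t + W_h @ h_prev + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions: 9 input features (e.g., 8 rainfall + 1 discharge), 5 hidden units.
rng = np.random.default_rng(0)
x_t = rng.normal(size=9)
h_prev = np.zeros(5)
W_x = rng.normal(size=(5, 9))
W_h = rng.normal(size=(5, 5))
b = np.zeros(5)

h_t = simple_rnn_step(x_t, h_prev, W_x, W_h, b)
print(h_t.shape)  # (5,)
```

Unrolling this single step along the time dimension gives the Recurrent layers drawn in Figures 4 and 5.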

LSTM Network
LSTM is an artificial RNN architecture which is suitable for processing and predicting relatively long events in time series [9]. Its particularity lies in that an LSTM memory cell is composed of four main elements: a forget gate, an input gate, a neuron with a self-recurrent connection and an output gate. In Figure 3, X_t is the input to the memory cell at time-step t; C_(t−1), C_t and h_(t−1), h_t are the cell states and hidden states at time-steps t−1 and t. Moreover, f_t, i_t and o_t are the states of the forget gate, input gate and output gate; C̃_t is the candidate state of C_t; and σ represents the sigmoid() activation function.
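A minimal NumPy sketch of one LSTM memory-cell step, following the four elements listed above (the parameter packing and toy dimensions are our own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the parameters of the forget (f),
    input (i), candidate (g) and output (o) transforms."""
    n = h_prev.size
    z = W @ x_t + U @ h_prev + b          # (4*n,) pre-activations
    f = sigmoid(z[0 * n:1 * n])           # forget gate f_t
    i = sigmoid(z[1 * n:2 * n])           # input gate i_t
    g = np.tanh(z[2 * n:3 * n])           # candidate state C~_t
    o = sigmoid(z[3 * n:4 * n])           # output gate o_t
    c = f * c_prev + i * g                # cell state C_t
    h = o * np.tanh(c)                    # hidden state h_t
    return h, c

rng = np.random.default_rng(1)
n, m = 5, 9                               # 5 units, 9 input features
h, c = np.zeros(n), np.zeros(n)
W = rng.normal(size=(4 * n, m))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=m), h, c, W, U, b)
print(h.shape, c.shape)
```

The cell state C_t carries information across many time-steps, which is what gives LSTM its advantage over Simple RNN at long lead-times.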

Model Architecture Design
Two model architectures are proposed based on a stacked DL structure for discharge simulations in this paper. Architecture 1 adopts the most common modeling strategy of time-series prediction, and architecture 2 is designed based on the expertise of hydrology. Figures 4 and 5 show the data flow schematic diagram of model architectures. To make the data flow clearer, we only draw the Recurrent layers and omit other layers (i.e., Dense layers, TimeDistributed layers, Flatten layers and Concatenate layers). Recurrent layers are unrolled along the time dimension in Figures 4 and 5. Additionally, the inputs of these two model architectures have not taken evaporation, infiltration and storage into account, since their variations give no relevant contribution to the river flow rate in the short period of heavy rain [19]. To achieve a longer discharge lead-time, potential future rainfall during lead-time periods is considered.
Model Architecture 1
As seen in Figure 4, two Recurrent layers are completely stacked. In layer 1, each time-step is a multivariate series comprised of the discharge feature (q) and rainfall features (P). All features take the observed values during the observed period, and P takes predicted values during the lead-time periods. Moreover, t represents the current time; H represents the maximum time lag in the basin; and T represents the maximum lead-time. The outputs q^sim_(t+1), ..., q^sim_(t+T−1), q^sim_(t+T) are the forecast discharges.

Model Architecture 2
As seen in Figure 5, two Recurrent layers are stacked in the lead-time periods; t, H, T and q^sim_(t+1), ..., q^sim_(t+T−1), q^sim_(t+T) are the same as in Figure 4. According to the expertise of hydrology, two crucial improvements are made in architecture 2. Firstly, rainfall and discharge features are no longer indiscriminately inputted into architecture 2, because the relationship between the rainfall features, P, and the output is totally different from the relationship between the discharge feature, q, and the output. These two features play their respective roles in discharge simulation, so the ML model's exploration of the different relationships should not be restricted by the same input form. Secondly, the discharge feature is no longer inputted into the model at each time-step; only the discharge features q_(t−H) and q_t are inputted to layer 1 and layer 2 as the initial cell state, c, and hidden state, h, respectively. As seen in Figure 4, the input form at each time-step of architecture 2 is then the same as the rainfall input of hydrological models. In addition, architecture 2 weakens the effect of the observed discharge trend on the forecast discharges by discarding the inputs q_(t−H+1), ..., q_(t−2), q_(t−1); it emphasizes the direct driving effect of the rainfall time series on the forecast discharges. In short, compared with architecture 1, architecture 2 is more in line with the hydrological model architecture.
In addition, we tried a variety of architecture designs for discharge simulations, including different stacked architectures, different input modes and different numbers of Recurrent layers. The results of the case study indicate that architectures 1 and 2 are the most representative, because (1) architecture 1 adopts the simplest design, in which both rainfall and discharge features are inputted at each time-step; and (2) architecture 2 combines the expertise of hydrology and outperforms the other architectures, with discharge features inputted as initial states. In other words, the other architectures can be regarded as variants of architecture 1 or architecture 2, so only architectures 1 and 2 are chosen for variance decomposition in this study.
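The input arrangement of architecture 2 can be illustrated at the level of array shapes (a sketch with H = 5 and T = 10 as in the case study; the array names and unit count are our own illustrative choices):

```python
import numpy as np

H, T, n_rain, units = 5, 10, 8, 5           # time lag, lead-time, rain stations, units

# Rainfall drives every time-step: observed for t-H..t, forecast for t+1..t+T.
P = np.random.rand(H + 1 + T, n_rain)       # (16, 8): one rainfall vector per step

# Discharge enters only as initial states, broadcast to the layer width.
q_t_minus_H, q_t = 0.31, 0.47               # normalized discharges (illustrative)
init_state_layer1 = np.full(units, q_t_minus_H)  # seeds layer 1 (cell/hidden state)
init_state_layer2 = np.full(units, q_t)          # seeds layer 2 (cell/hidden state)

print(P.shape, init_state_layer1.shape)     # (16, 8) (5,)
```

Architecture 1, by contrast, would concatenate q onto P at each of the 16 time-steps, giving 9 features per step.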


Criteria for Accuracy Assessment
Uncertainties of multi-step time-series forecasting models are evaluated based on discharge observations. The following evaluation criteria are used in discharge error assessment: Relative Peak-Discharge Error (RPE), Nash-Sutcliffe coefficient of efficiency (NSE), Mean Absolute Error (MAE) and Mean Square Error (MSE) [20]. In addition, NSE and RPE are suggested by the Chinese flood-forecasting guidelines and often used in flood-forecasting studies [21,22].
where q^obs_peak and q^sim_peak are the observed and simulated peak discharges; q^obs_t and q^sim_t are the observed and simulated discharges; q̄^obs is the average of the observed discharges; and N is the total number of observations.

Figure 6 depicts the ML time-series forecasting modeling combination scheme employed in this study. It has three parts: sample set (SS), RNN approach (RA) and model architecture (MA). In the variance decomposition, we aimed to decompose the total ensemble uncertainty into contributions from the different elements of the ML modeling and the interactions among them. The ML modeling ensemble consists of 10 combinations. For each of them, we estimated MAE, MSE, NSE and RPE. In the following, we describe these evaluation criteria as the general variable Y. To relate Y to the uncertainty sources, we use superscripts in Y^(j,k,l), with j, k and l representing the different options of SS, RA and MA, respectively. The subsampling of the training sets and the ANOVA approach are explained in detail, as follows.
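The four criteria follow directly from these definitions; a compact implementation (our own, using the standard formulas):

```python
import numpy as np

def criteria(q_obs, q_sim):
    """MAE, MSE, NSE and RPE for one simulated discharge series."""
    q_obs = np.asarray(q_obs, float)
    q_sim = np.asarray(q_sim, float)
    mae = np.mean(np.abs(q_sim - q_obs))
    mse = np.mean((q_sim - q_obs) ** 2)
    nse = 1.0 - np.sum((q_sim - q_obs) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)
    rpe = (q_sim.max() - q_obs.max()) / q_obs.max()  # relative peak-discharge error
    return mae, mse, nse, rpe

# A perfect simulation gives MAE = MSE = 0, NSE = 1 and RPE = 0.
mae, mse, nse, rpe = criteria([1, 3, 8, 4, 2], [1, 3, 8, 4, 2])
print(nse, rpe)
```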


Subsampling Approach
The ANOVA approach can underestimate the variance for small sample sizes [23]. To diminish the effect of sample size on the quantification of the variance contributions, Bosshard et al. [24] added a subsampling method to the ANOVA approach. The purpose of the subsampling is to keep the number of samples the same for SS, RA and MA in each variance decomposition. In this paper, we subsample the five different SSs. In each subsampling iteration i, we select two SSs out of the five, which results in a total of 10 possible SS pairs in Equation (5). To prevent confusion between the full set of five SSs and the subsampled SS pair, we replace the superscript j with g(h, i). As seen in Equation (5), the superscript g is a 2 × 10 matrix that contains the selected SSs for each subsampling iteration i (i = 1, 2, . . . , 10). In each iteration i, there are two SSs (h = 1, 2).
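The 10 SS pairs of Equation (5) can be enumerated directly (a sketch using Python's standard library):

```python
from itertools import combinations

# Five sample sets; each subsampling iteration i uses one pair g(., i).
SS = [1, 2, 3, 4, 5]
g = list(combinations(SS, 2))  # columns of the 2 x 10 selection matrix
print(len(g))                  # 10 iterations, as in Equation (5)
```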

ANOVA Approach
In this study, we referred to an application of ANOVA in hydrological climate-impact projections [24]. According to the ANOVA theory, in each iteration i, the total sum of squares (SST_i) can be split into sums of squares due to the individual effects (SSA_i, SSB_i and SSC_i) and their interactions (SSI_i), as follows:

SST_i = SSA_i + SSB_i + SSC_i + SSI_i, (6)

where SST_i represents the total variance from the two SSs, two RAs and two MAs in iteration i; SSA_i represents the variance from the two SSs in iteration i; SSB_i represents the variance from the two RAs in iteration i; SSC_i represents the variance from the two MAs in iteration i; and SSI_i represents the variance from the interactions among the two SSs, two RAs and two MAs in iteration i. The terms in Equation (6) are estimated by the subsampling procedure, where the variable Y represents the evaluation criteria or simulated discharges, and H, K and L represent the numbers of SSs, RAs and MAs in iteration i (H, K and L are all equal to 2 in each iteration). The symbol • indicates averaging over the particular index. Each variance fraction, η², is then derived by dividing the corresponding sum of squares by the total, where η² has a range of 0 to 1 and describes a 0-100% contribution to the total uncertainty of the simulated discharges. According to the matrix in Equation (5), I is equal to 10 in this study. The variance fractions sum to 1, and the larger the value of η², the greater the uncertainty contribution of the corresponding component of the ML modeling.
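The decomposition in Equation (6) and the variance fractions can be checked numerically on one iteration's 2 × 2 × 2 array of criterion values Y^(j,k,l) (a sketch of the standard balanced three-way ANOVA identity; the values are illustrative, not results from the paper):

```python
import numpy as np

def anova_2x2x2(Y):
    """Split SST into main effects (SSA, SSB, SSC) and the interaction
    remainder (SSI) for one iteration of a balanced 2 x 2 x 2 design."""
    gm = Y.mean()
    sst = np.sum((Y - gm) ** 2)
    # Main effects: squared deviations of the factor-level means, times
    # the 4 cells per level.
    ssa = 4 * np.sum((Y.mean(axis=(1, 2)) - gm) ** 2)  # sample set (SS)
    ssb = 4 * np.sum((Y.mean(axis=(0, 2)) - gm) ** 2)  # RNN approach (RA)
    ssc = 4 * np.sum((Y.mean(axis=(0, 1)) - gm) ** 2)  # model architecture (MA)
    ssi = sst - ssa - ssb - ssc                        # all interactions
    return sst, ssa, ssb, ssc, ssi

Y = np.array([[[0.90, 0.85], [0.80, 0.88]],
              [[0.92, 0.83], [0.78, 0.86]]])           # illustrative NSE values
sst, ssa, ssb, ssc, ssi = anova_2x2x2(Y)
eta2 = np.array([ssa, ssb, ssc, ssi]) / sst            # variance fractions
print(np.isclose(eta2.sum(), 1.0))                     # fractions sum to 1
```

In the full procedure, the η² values are averaged over the I = 10 subsampling iterations.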

Study Area
The Anhe catchment, located in the province of Jiangxi, in humid Southeastern China, with a drainage area of 251 km² and a mean annual rainfall of 1426 mm, is taken as the case study. The rainfall data from eight rain stations and discharge data from one hydrology station were used in this study, as shown in Figure 7. According to statistical analysis, the maximum basin time lag of Anhe is 6 h. In this paper, observed rainfall measurements are taken as the perfect forecast for the 1-10-h lead-times. Thus, we take H as 5 h and T as 10 h in the two model architectures (see Figures 4 and 5). The hourly time-series data for twenty-nine years (from 1984 to 2012; 75 flood events) are included in the modeling process. The beginning and the end of a flood event are taken as the rising point from base flow before the main rainfall and the recession point to base flow, respectively. In light of the time sequence of flood events and the representativeness of peak discharge, the 75 flood events are divided into five parts, each with 15 flood events (see Figure 8). Then, five sample sets are constructed, and each sample set is composed of a training set (four parts) and a validation set (one part). The training set is used to fit the model parameters, and the validation set is used to tune the parameters to prevent overfitting or underfitting.

Model Training
(1) Data normalization: Observed rainfall, p_1, p_2, ..., p_8, from the eight rainfall stations, and the observed discharge, q, are normalized by the min-max method:

x_norm = (x − x_min) / (x_max − x_min),

where x_norm is the normalized value; x is the observed value; and x_max and x_min are the maximum and minimum observed values.
Rainfall and discharge data are normalized separately due to their different physical significance and dimensions.
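A minimal sketch of the separate min-max scaling (the values are illustrative):

```python
import numpy as np

def minmax(x):
    """Min-max normalization to [0, 1]; rainfall and discharge are
    scaled separately, each with its own min and max."""
    x = np.asarray(x, float)
    return (x - x.min()) / (x.max() - x.min())

q = np.array([12.0, 30.0, 84.0, 48.0])  # discharge sample (illustrative)
print(minmax(q))                        # 12 -> 0, 30 -> 0.25, 84 -> 1, 48 -> 0.5
```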
(2) Establish a dataset: For each sample set, the training set (60 flood events) and validation set (15 flood events) are restructured into supervised learning datasets by the sliding window method [25]. The sample size of the supervised learning training set is about 9000, and the sample size of the supervised learning validation set is about 2000. The sample sizes differ slightly among the supervised learning datasets for sample sets 1-5 because of the different durations of the 75 flood events.
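The sliding-window restructuring can be sketched as follows (a minimal univariate illustration; the window lengths follow the case study's 6 observed steps and 10 lead-time steps):

```python
import numpy as np

def sliding_windows(series, n_in, n_out):
    """Restructure a time series into supervised (input, target) windows."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])                 # observed window
        y.append(series[i + n_in:i + n_in + n_out])  # lead-time targets
    return np.array(X), np.array(y)

# 6 observed steps in, 10 lead-time steps out (H = 5, T = 10).
X, y = sliding_windows(np.arange(100.0), n_in=6, n_out=10)
print(X.shape, y.shape)  # (85, 6) (85, 10)
```

Applying this to every flood event and stacking the windows yields the supervised training and validation samples described above.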
Figure 9 shows the visualization of the tensor flow in the four ML models by Keras. In Figure 9a,b, the numbers 16, 9, 10 and 5 represent the numbers of total time-steps, features, forecasting time-steps and units, respectively. Moreover, 6, 8 and 1 represent the numbers of observed time-steps, rainfall features and discharge features in Figure 9c,d. In the red part, q_(t−5) is inputted to simple_rnn_1 (or lstm_1) and turned into an initial hidden state (or both cell state and hidden state); in the green part, q_t is inputted to simple_rnn_2 (or lstm_2) and turned into an initial hidden state (or both cell state and hidden state); and the yellow part presents the inputs of the eight rainfall features over 16 time-steps.

Sample Set Evaluations
The NSE and RPE for the different sample sets at the 1-10-h lead-times are summarized in box-and-whisker plots in Figures 10 and 11; each boxplot includes 300 simulated results from 75 flood events by four combination schemes, which consist of two RNN approaches and two model architectures. The minimum, maximum, 25th percentile (Q1), 75th percentile (Q3) and median (Q2) are shown by the lower and upper whiskers, the lower and upper bounds of the box, and the line within the box, respectively. The lower and upper whiskers indicate the values (Q1 − 1.5 × IQR) and (Q3 + 1.5 × IQR), where IQR is the Inter-Quartile Range (Q3 − Q1).
As shown in Figure 10, at least 75% of the NSE values are greater than 0.8 at the 1-7-h lead-times. The different SSs have almost the same upper whisker, Q2 and Q3 at each lead-time. For Q1 and the lower whisker, the differences between the SSs grow with increasing lead-time. The NSE of SS2 is slightly better than the others. Figure 11 indicates that the RPE distributions of SS1-SS4 are almost the same; compared with SS1-SS4, SS5 shows a slightly different distribution range of RPE. Figures 10 and 11 illustrate that no SS obviously outperforms the others in the statistical distributions of NSE and RPE, so it is feasible to divide the sample set by the time sequence of flood events and the representativeness of peak discharge.

RNN Approach Evaluations
Figure 12 illustrates that LSTM has higher Q1, Q2, Q3 and whiskers of NSE than Simple RNN. Except for the 1- and 2-h lead-times, LSTM shows a more compact NSE distribution (a shorter distance between Q1 and Q3). This indicates that LSTM's advantage becomes more obvious with increasing lead-time. Its advantage is mainly reflected in the flood events that are simulated relatively well at short lead-times and those that are simulated relatively poorly at long lead-times. Figure 13 illustrates that the simulated peak discharges of LSTM have a more compact RPE distribution at each lead-time. Both Figures 12 and 13 indicate that LSTM outperforms Simple RNN in discharge simulations, especially in the simulation of the flood process at relatively long lead-times. LSTM indeed demonstrates its powerful ability for multi-step time-series discharge simulation due to its special gate structure.
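The whisker convention used in these box-and-whisker plots can be reproduced as follows (a sketch; the NSE values are illustrative):

```python
import numpy as np

def box_stats(values):
    """Quartiles and 1.5 x IQR whiskers, as drawn in the boxplots."""
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q1, q2, q3, q3 + 1.5 * iqr

nse = [0.95, 0.92, 0.88, 0.90, 0.85, 0.93, 0.80]  # illustrative NSE values
low, q1, q2, q3, high = box_stats(nse)
print(q1 <= q2 <= q3)  # True
```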

Model Architecture Evaluations
Figure 14 illustrates that architecture 2 has higher Q1, Q2 and Q3 and a higher lower whisker of NSE than architecture 1. At the 1-3-h lead-times, the advantages of architecture 2 are especially obvious. For the RPE distribution in Figure 15, architecture 2 also has a distinct advantage at the 1- and 2-h lead-times; as the lead-time increases, the difference in RPE distribution between architectures 1 and 2 gets smaller. Both Figures 14 and 15 indicate that architecture 2 has absolute superiority in simulating the peak discharge at the 1-h lead-time and the flood process at all lead-times, because a shorter distance between Q1 and Q3 means the quartile data are bunched together.
The results reveal that the strategy in which q_t is inputted as an initial cell state, c, and hidden state, h, to the Recurrent layers improves the simulation accuracy of both the flood hydrograph and the peak. As the lead-time increases, the difference between the discharges simulated by the two architectures decreases, owing to their similar architecture at long lead-times (see Figures 4 and 5). In brief, multi-step time-series forecasting models can obtain better performance by inputting the discharge feature and rainfall features separately. This suggests that ML models might be improved by combining the expertise of hydrology.


Uncertainty Source Quantification
As shown in Figures 16 and 17, stacked area charts are used to show the ratios of the uncertainty contributions, calculated by Equations (12)-(15), at each lead-time or in each discharge quantile interval. Figure 16a-c shows the ratios of the contributions of the uncertainty sources to discharge simulations by MAE, MSE and NSE at 1-10-h lead-time. The MA is the dominant source at 1-3-h lead-time, and its uncertainty contribution ratio reaches 0.8 at 1-h lead-time. This is attributed to the strategy of inputting q_t as an initial state of the Recurrent layers in architecture 2, because of the strong correlation between q_t and q_t+1. The separately inputted discharge feature plays a major role at short lead-times, and the strong correlation between q_t and q_t+1, q_t+2, q_t+3 weakens gradually with the increase of lead-time.
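Equations (12)-(15) are not reproduced here, but the contribution ratios can be sketched with a standard fixed-effects sum-of-squares partition over the 5 × 2 × 2 factorial design. This is a simplified stand-in for the paper's exact formulation, with all interactions pooled into a single term, and the error values are synthetic:

```python
import numpy as np

def anova_contributions(err):
    """err[i, j, k]: error metric for sample set i, RNN approach j, architecture k.
    Returns fractional contributions of the SS, RA and MA main effects and a
    pooled interaction term, via a balanced fixed-effects sum-of-squares split."""
    grand = err.mean()
    ss_tot = np.sum((err - grand) ** 2)
    ni, nj, nk = err.shape
    # main-effect sums of squares (balanced design)
    ss_ss = nj * nk * np.sum((err.mean(axis=(1, 2)) - grand) ** 2)
    ss_ra = ni * nk * np.sum((err.mean(axis=(0, 2)) - grand) ** 2)
    ss_ma = ni * nj * np.sum((err.mean(axis=(0, 1)) - grand) ** 2)
    ss_int = ss_tot - ss_ss - ss_ra - ss_ma   # all interactions pooled
    return {name: val / ss_tot for name, val in
            [("SS", ss_ss), ("RA", ss_ra), ("MA", ss_ma), ("interaction", ss_int)]}

# 5 sample sets x 2 RNN approaches x 2 architectures, as in the study
rng = np.random.default_rng(1)
mae = rng.normal(loc=10.0, scale=1.0, size=(5, 2, 2))  # synthetic MAE values
ratios = anova_contributions(mae)
```

By construction the four ratios sum to one, which is what allows them to be drawn as a stacked area chart over lead-time or quantile interval.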
By comparing Figure 16a and Figure 16b, we see that the uncertainty contribution ratio of the SS by MSE is greater than that by MAE. We conclude that the SS contributes more uncertainty to simulated outliers, because MAE is more robust to outliers than MSE. On the contrary, the RA contributes less uncertainty to simulated outliers. Therefore, we suspect that the simulated results of LSTM and Simple RNN might be similar for some outliers, which are caused by the partitioning of the SS.
Although the contributions of the uncertainty sources fluctuate slightly in Figure 16c, the overall trend is consistent with Figure 16a,b. Figure 16d shows the ratios of the contributions of uncertainty sources to peak-discharge simulations by MAE at 1-10-h lead-time.
Combining this with the previous analysis of Section 4.3 indicates that (1) LSTM and architecture 2 show their advantages in peak-discharge simulations mainly at 1-4-h lead-time; and (2) the SS is the least important individual source of uncertainty for peak-discharge simulations at 1-4-h lead-time. Compared with Figure 16a-c, the uncertainty contribution of the interactions to peak-discharge simulations is much greater than for the other flood processes. This indicates that the peak discharge is more difficult to simulate, because a rainfall process with variable space-time distribution always occurs before the flood peak. Figure 17 depicts the ratios of the contributions of uncertainty sources in different discharge quantiles. We take the mean of the variance decomposition for each quantile interval and divide the quantile bins into three parts: a low range (5-75%), an intermediate range (75-99%) and a high range (99-99.9%). In the low quantile range, the SS is the dominant source of uncertainty at 2-10-h lead-time. This suggests that the different combinations of RA and MA make little difference for low discharge simulations. Small discharges mainly come from base flows or recession flows, which are relatively stable, barely affected by rainfall and much easier to simulate, whereas at 1-h lead-time, the MA and interactions contribute the majority of the uncertainty due to the strong correlation between q_t and q_t+1.
In the intermediate range, the contribution of the MA gradually decreases with the increase of lead-time, and the contributions of the SS and RA are relatively stable. Combined with the analysis of Figures 14 and 15, this indicates that architecture 2 is more helpful in improving the discharge simulations in the intermediate and high quantile ranges. LSTM has more advantages in the intermediate and high quantile ranges, in which the contribution ratio of the RA is stable at 0.2. However, LSTM does not show any advantage over Simple RNN in the low quantile range.
Compared with the intermediate range, the contribution of the SS increases in the high range, because the training and validation sets are divided according to the value of the peak discharge. Moreover, the relatively small sample size in the 99-99.9% quantile also tends to increase the uncertainty contribution of the SS. The contribution of the interaction term to the total uncertainty varies between 0.4 and 0.5 throughout the different discharge quantiles and is more stable at long lead-times.
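The per-quantile-range evaluation used throughout this discussion can be sketched as follows; the bin edges are our reading of the low, intermediate and high ranges named in the text, and the synthetic discharge and error data are purely illustrative:

```python
import numpy as np

def quantile_range_means(obs, err, edges=(0.05, 0.75, 0.99, 0.999)):
    """Average an error metric within discharge quantile ranges.
    The default edges give the low (5-75%), intermediate (75-99%) and
    high (99-99.9%) ranges discussed in the text."""
    obs, err = np.asarray(obs, float), np.asarray(err, float)
    bounds = np.quantile(obs, edges)
    means = {}
    for name, lo, hi in [("low", bounds[0], bounds[1]),
                         ("intermediate", bounds[1], bounds[2]),
                         ("high", bounds[2], bounds[3])]:
        mask = (obs >= lo) & (obs < hi)
        means[name] = err[mask].mean() if mask.any() else np.nan
    return means

rng = np.random.default_rng(2)
discharge = rng.gamma(2.0, 50.0, size=10_000)        # synthetic hourly discharges
abs_err = np.abs(rng.normal(0.0, 5.0, size=10_000))  # synthetic absolute errors
by_range = quantile_range_means(discharge, abs_err)
```

Applying the variance decomposition within each such range, rather than over the whole record, is what yields the per-quantile contribution ratios of Figure 17.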
The results of the variance decomposition indicate that the contribution of each individual uncertainty source varies with lead-time and discharge quantile. In addition, the contribution of interactions should not be ignored. Although the proposed framework cannot take all the uncertainties into account and gives no further explanation of the interactions, it is valuable for evaluating multi-step time-series forecasting models. The results suggest that the combination of LSTM and model architecture 2 is superior to the others in discharge simulations.
Finally, through the analysis of the model evaluation and uncertainty source quantification results, several suggestions are put forward for ML modeling of multi-step discharge simulations: (1) the LSTM network demonstrates its superiority in discharge simulations, and ML architecture design is as important as ML approach selection; (2) it is essential to input the discharge feature properly for short-term discharge forecasting; (3) for low discharge simulations, different combinations of RA and MA make little difference, and the sample set partitioning dominates the uncertainty in modeling.

Conclusions
A framework is proposed for uncertainty quantification in ML modeling and for the evaluation of multi-step time-series forecasting models. Discharge simulation using RNNs and a stacked DL structure in the Anhe catchment of Southeastern China is taken as an example to quantify the uncertainty contributions of each component (five different SSs, two RAs and two MAs). To evaluate the interactions between LSTM networks and the ML architecture, a novel ML architecture is designed by combining the expertise of hydrology with a stacked DL structure, and compared with a conventional architecture. We estimated the total ensemble uncertainty of the ML modeling and quantified the contributions of the different uncertainty sources in discharge simulations. The proposed framework can reveal discharge-simulation uncertainties in ML multi-step time-series forecasting models based on ANOVA theory. Moreover, it also provides references for ML technology selection and architecture design. The results indicate that uncertainty quantification is an indispensable task in multi-step time-series discharge simulations for a successful application of ML. Conclusions from this study include the following:
(1) The results of the variance decomposition indicate that the contribution of each individual uncertainty source varies at different lead-times and different discharge quantiles, and the contribution of interactions should not be ignored. With the increase of lead-time, the contributions of the SS and RA gradually increase, while the contribution of the MA gradually decreases. The SS is the dominant source of uncertainty in the low quantile range. In the intermediate and high quantile ranges, the contribution of the MA gradually decreases with the increase of lead-time, and the contributions of the SS and RA are relatively stable.
Interactions contribute considerably to the total uncertainty in discharge simulations, and the interactive impacts are influenced by the discharge magnitude and lead-time. The evaluation results of the proposed framework demonstrate that the model combining LSTM and architecture 2 is superior to the others in the statistical distributions of NSE and RPE.
(2) As expected, the LSTM network outperforms Simple RNN in both flood processes and peak discharge. According to the analysis results of the proposed framework, the advantages of the LSTM network are mainly embodied in the intermediate and high quantile ranges, due to the dominant uncertainty contribution of the SS in the low quantile range. Compared with the ML architecture, the LSTM network shows limited advantage at short lead-times. In short, the LSTM network demonstrates its powerful ability to process and predict relatively long events in multi-step time-series discharge simulations. It is of more practical value to evaluate the LSTM network based on the interactions between LSTM networks and the other components of ML modeling.
(3) ML architecture design is as necessary as ML approach selection for multi-step time-series forecasting models. Architecture 2 has absolute superiority in simulating the peak discharge at 1-h lead-time and the flood process at all lead-times, especially in the intermediate and high quantile ranges. This is mainly due to the strategy of incorporating hydrological expertise into the stacked DL architecture.
(4) The SS is the dominant source of uncertainty in the low quantile range (except at 1-h lead-time). In other words, different combinations of RA and MA make little difference for low discharge simulations. No single SS is better than the others in the statistical distributions of NSE and RPE. Therefore, it is feasible to divide the sample set by the time sequence of flood events and the representativeness of the peak discharge.
Although this study only investigated a discharge simulation with two ML approaches and two ML architectures, the proposed framework is versatile and adaptable to research and applications of ML techniques in other areas of study. In addition, further research should also focus on the hydrological interpretation of ML models, such as the relationship between the dynamical changes of the LSTM cell state and the soil moisture content.