Improving the Accuracy of Dam Inflow Predictions Using a Long Short-Term Memory Network Coupled with Wavelet Transform and Predictor Selection

Abstract: Accurate and reliable dam inflow prediction models are essential for effective reservoir operation and management. This study presents a data-driven model that couples a long short-term memory (LSTM) network with robust input predictor selection, input reconstruction by wavelet transformation, and efficient hyper-parameter optimization by K-fold cross-validation and random search. First, a robust analysis using a "correlation threshold" for the partial autocorrelation and cross-correlation functions is proposed, and only variables exceeding this threshold are selected as input predictors with their time lags. This analysis indicates that a model trained on a threshold of 0.4 returns the highest Nash-Sutcliffe efficiency value; as a result, six principal inputs are selected. Second, using additional subseries reconstructed by the wavelet transform improves predictability, particularly for the flow peak. The peak error values of the LSTMs with the transform are approximately one-half to one-quarter the size of those without it. Third, for a K of 5, as determined by the Silhouette coefficients and the distortion score, the wavelet-transformed LSTMs require a larger number of hidden units, more epochs, higher dropout, and a larger batch size. This complex configuration is needed because the number of inputs used by these LSTMs is five times greater than that of the other models. Last, an evaluation of accuracy performance reveals that the model proposed in this study, called SWLSTM, provides predictions of the daily inflow of the Hwacheon dam in South Korea that are superior to those of three other LSTM models by 84%, 78%, and 65%, respectively. These results strengthen the potential of data-driven models for efficient and effective reservoir inflow predictions, and should help policy-makers and operators better manage their reservoir operations.


Introduction
Reservoirs and dams serve a variety of critical purposes, including flood mitigation, freshwater storage, irrigation, hydroelectric power, and ecological conservation. Substantial efforts have been made over the past century to develop optimal reservoir operating strategies. Proposing an optimal operating solution for a multipurpose reservoir is not straightforward because it can be affected by various factors, the most important of which is the reservoir inflow estimate [1][2][3]. Accurate and reliable inflow forecasts are essential to effective reservoir operation [4,5]. Predictive models can be divided into process-based and data-driven varieties [6,7]. Because process-based models use mathematical formulations based on physical principles, embracing state variables and fluxes that are theoretically observable and scalable [8], they provide a superior understanding of physical processes [9][10][11][12]. However, these models also require large volumes of detailed data and high computational costs for a study basin [13][14][15][16], and when simplifying assumptions related to scaling problems are applied, their predictions can carry considerable uncertainty [17][18][19][20][21]. Empirical data-driven models, by contrast, are based on past observations.

Table 1. Streamflow prediction studies using a data-driven model.
The main objective of this study is to investigate the potential of the combined methods of LSTM, predictor selection, data processing, and hyper-parameter optimization, thereby developing a unified data-driven modeling framework that can produce accurate dam inflow predictions. Of specific interest are (1) the robust selection of principal inputs and their sequence lengths, (2) the transformation of input time series to better capture extremes, and (3) the efficient optimization of LSTM hyper-parameters. The rest of this study is organized as follows. Section 2 describes the methodologies of LSTM structure, input selection, wavelet transform, and hyper-parameter optimization. Section 3 provides the study area, dataset, and performance measures. Section 4 presents the experimental results and discussion, and a conclusion follows in Section 5.

Long Short-Term Memory Network
Long short-term memory is a special kind of RNN that includes memory cells analogous to the states of physically based models [28]. An advantage of LSTM over a plain RNN is that LSTM can learn long-term dependencies between input and output features by resolving exploding or vanishing gradients [37]. The main structural difference is that LSTM adds a cell state and requires roughly four times as many trainable parameters, because three gate functions are employed to compute the cell and hidden states. The internal structure of LSTM is sketched in Figure 1a.
An LSTM-based data-driven model is composed of repeating LSTM blocks, each of which contains three gates (forget gate f_t, input gate i_t, and output gate o_t) that determine which information is renewed, discarded, and output from the memory cell. Given the inputs x_t = (x_{1,t}, x_{2,t}, ..., x_{N_in,t}) at time t, with N_in the number of inputs, the cell state c_t (a long-term memory) and hidden state h_t (a short-term memory) at time t are computed using the three gates and the cell state at the previous time step. A new state c_t can be controlled through a forget gate that can discard information from the past state c_{t-1} and an input gate that can accept new information from the cell update c̃_t. The output gate determines how much information from the cell state c_t flows into the new hidden state h_t. Mathematically,

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,
h_t = o_t ⊙ tanh(c_t),

where ⊙ denotes element-wise multiplication, and the intermediate cell update c̃_t and the three gates are calculated from x_t and h_{t-1}:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f),
i_t = σ(W_i x_t + U_i h_{t-1} + b_i),
o_t = σ(W_o x_t + U_o h_{t-1} + b_o),
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c),

where W, U, and b are learnable parameters specific to each gate and determined through the training process. The activation functions σ and tanh are the sigmoid and the hyperbolic tangent, respectively. At t = 1, the hidden and cell states are initialized as zero vectors [37]. LSTM models can also be built with more than one layer by stacking multiple layers on top of each other. The output of a stacked LSTM connects to a final "dense layer." A target output y_t can be computed from h_t in the dense layer:

y_t = W_d h_t + b_d,

where W_d and b_d are learnable parameters known as the weight matrix and the bias term, respectively, of the dense layer. The total number of learnable parameter arrays is therefore 12 for each layer plus 2 for the dense layer. The shapes of these parameters can be expressed in matrix form (Table 2). At the beginning of training, the learnable parameters are initialized using Xavier initialization and later optimized by the Adam algorithm, which is preferred in many studies [42].
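To make the mapping from these equations to an implementation concrete, the following minimal sketch builds a two-layer stacked LSTM with a final dense layer in Keras, the library used in this study. The layer sizes, dropout rate, and input dimensions are illustrative placeholders rather than the tuned values reported later.

import tensorflow as tf

n_inputs = 6   # number of input predictors N_in (placeholder)
seq_len = 1    # sequence length fed per sample (placeholder)

model = tf.keras.Sequential([
    # First LSTM layer returns the full hidden-state sequence so that a
    # second LSTM layer can be stacked on top of it.
    tf.keras.layers.LSTM(20, return_sequences=True, input_shape=(seq_len, n_inputs)),
    tf.keras.layers.Dropout(0.1),
    # Top LSTM layer returns only the final hidden state h_t.
    tf.keras.layers.LSTM(20),
    # Final "dense layer": y_t = W_d h_t + b_d.
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")  # Adam as in the paper; mse as a placeholder loss
model.summary()                              # lists learnable-parameter counts per layer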
Figure 1. (a) The internal structure of long short-term memory (LSTM), where x_t and y_t denote input predictors and a target output; f, i, and o stand for the forget, input, and output gates, respectively; c_t, c̃_t, and h_t denote the cell state, cell update, and hidden state at time t, respectively. (b) Diagram of the wavelet transform; the input time series x_t can be subdivided into multiple subseries down to the j-th level (i.e., D^j_t and A^j_t); D^j_t and A^j_t denote the 'Detail' and 'Approximation' time series of the original x_t at level j; the input predictors and their subseries (shown as gray circles) are all used as inputs of the LSTM models. (c) Schematic overview of K-fold cross-validation. The training and validation dataset (90% of the total dataset in this study) is randomly split into K folds. At each iteration, one fold (gray) is used to validate the LSTM trained on the other folds (white).

Table 2. Learnable parameters of each layer of LSTM and their shapes.


Input Predictor Selection
A high correlation between inputs and outputs helps ensure the predictability of data-driven models. Therefore, to arrive at an optimal combination of inputs that correlate closely with the output, the statistical properties of the respective time series can be used. Specifically, the cross-correlation function (CCF) and the partial autocorrelation function (PACF) are used to determine the appropriate predictors and the number of lagged values.
The CCF measures the similarity between a target time series (e.g., dam inflow, y) and the lagged versions of a candidate input variable v:

CCF_k = c_k^{vy} / (SD_v · SD_y),

where k is the lag time, and SD_v and SD_y are the standard deviations of v and y, respectively; c_k^{vy}, the cross-covariance function of v and y, is defined as:

c_k^{vy} = (1/N_t) Σ_{t=1}^{N_t-k} (v_t - v̄)(y_{t+k} - ȳ),

where t is the time step; ȳ is the average of y; v̄ is the average of v; and N_t denotes the number of data points in the time series. The PACF measures the linear correlation between a time series (y_t) and a lagged version of itself (y_{t+k}), after removing the influence of intermediate lags, and can be defined recursively as:

PACF_{k,k} = (ρ_k - Σ_{j=1}^{k-1} PACF_{k-1,j} ρ_{k-j}) / (1 - Σ_{j=1}^{k-1} PACF_{k-1,j} ρ_j),

where j denotes an index for lag k, and ρ_k is the autocorrelation coefficient at lag k between y_t and y_{t+k}. At k = 1, PACF_{1,1} is equal to ρ_1.
Additionally, an approximate 95% confidence interval (CI) on the CCF and PACF [43] can be estimated by:

CI = ±1.96 / √N_t. (12)

Based on the calculated CCF and PACF, modelers can determine the number of antecedent values that should be included in the input vector. Other variables that do not have a significant effect on model performance can be cut from the input vector. Generally, the CCF can be used as a reference when selecting highly correlated input predictors, while the PACF can indicate an appropriate lag for the selected variables.
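As an illustration of this screening, the sketch below computes the lagged cross-correlations and the PACF for synthetic stand-in data and flags the lags whose correlation exceeds the 95% CI bound of Equation (12); the helper ccf_at_lag and the series are hypothetical, and statsmodels is assumed to be available.

import numpy as np
from statsmodels.tsa.stattools import pacf

def ccf_at_lag(v, y, k):
    # CCF_k = corr(v_t, y_{t+k}), following the definition above.
    n = len(y)
    c_vy = np.mean((v[: n - k] - v.mean()) * (y[k:] - y.mean()))
    return c_vy / (v.std() * y.std())

rng = np.random.default_rng(0)
y = rng.gamma(2.0, 5.0, size=5844)            # stand-in for daily inflow Qin
v = np.roll(y, 1) + rng.normal(0, 1.0, 5844)  # stand-in candidate predictor

ci95 = 1.96 / np.sqrt(len(y))                 # ≈ 0.026 for N_t = 5844
pacf_vals = pacf(y, nlags=10)                 # lags 0..10 of the target itself
ccf_vals = [ccf_at_lag(v, y, k) for k in range(11)]
significant = [k for k, c in enumerate(ccf_vals) if abs(c) > ci95]
print(ci95, significant)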

Wavelet Transform
A WT is a mathematical tool that decomposes one signal into several signals with lower resolution levels by controlling the scaling and shifting factors of a single wavelet: the mother wavelet, a function that exists locally as a pattern. It offers time-frequency localization of a given time series and can analyze nonstationary features such as breakdown points, discontinuities, and local minima and maxima [35]. Because streamflow is represented by discrete signals, a discrete wavelet transform (DWT) is usually preferred in hydrological applications [33]. A DWT is easier to implement than a continuous wavelet transform and has a shorter computational time [44]. However, a DWT is not inherently shift-invariant: if any new values are added to the end of a time series, certain values of the wavelet components can change. This means it cannot be applied directly to problems involving singularity detection, forecasting, and nonparametric regression [45]. To overcome this "boundary" problem, an à trous algorithm that uses redundant information obtained from the observation data has been suggested [46]. The decomposition formulas of the à trous algorithm are defined as [47]:

A_t^j = Σ_{l=0}^{L-1} g_l · A_{(t + 2^{j-1} l) mod N_t}^{j-1},
D_t^j = A_t^{j-1} - A_t^j,

where D_t^j and A_t^j represent the jth-level wavelet (detail) and scaling (approximation) coefficients of the original time series at time t; g_l is a scaling filter with g_l = g_l^DWT / √2, where g_l^DWT is the scaling filter for the DWT; L is the length of the scaling filter; l denotes an index for L; and mod refers to the modulo operator. At j = 0, A_t^0 is equal to the original time series x_t. The latter can be obtained by the additive reconstruction:

x_t = A_t^J + Σ_{j=1}^{J} D_t^j.

The approximation A^j becomes increasingly coarse as j increases. As data processing using a Daubechies 5 wavelet at level 3 has been preferred in studies of flow predictions [33,48-51], the discrete wavelet at level 3 (J = 3) was used in this study.
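The sketch below implements the additive à trous scheme as reconstructed above: the rescaled scaling filter g_l is derived from PyWavelets' db5 coefficients, the "mod" boundary is circular, and each detail is D^j = A^{j-1} - A^j, so the additive reconstruction holds by construction. It is a minimal interpretation of the published formulas; the authors' exact filter normalization and boundary handling may differ.

import numpy as np
import pywt

def atrous_decompose(x, wavelet="db5", level=3):
    """A trous additive decomposition: A^j is a smoothed version of A^{j-1}
    (upsampled scaling filter, circular boundary) and D^j = A^{j-1} - A^j."""
    g = np.asarray(pywt.Wavelet(wavelet).dec_lo) / np.sqrt(2.0)  # g_l = g_l^DWT / sqrt(2)
    a_prev = np.asarray(x, dtype=float)
    n = len(a_prev)
    details = []
    for j in range(1, level + 1):
        step = 2 ** (j - 1)                    # filter "holes" widen with level
        a_next = np.zeros(n)
        for l, g_l in enumerate(g):
            a_next += g_l * a_prev[(np.arange(n) + step * l) % n]
        details.append(a_prev - a_next)        # detail D^j
        a_prev = a_next                        # approximation A^j
    return details, a_prev                     # [D^1..D^J], A^J

x = np.random.default_rng(0).gamma(2.0, 5.0, size=512)  # stand-in series
D, A3 = atrous_decompose(x, "db5", level=3)
print(np.allclose(x, A3 + sum(D)))             # True: x = A^3 + D^3 + D^2 + D^1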

LSTM Hyper-Parameters
Configuring an LSTM network by adjusting hyper-parameters is a difficult task, but it can have a significant impact on the performance of data-driven models [37]. Additionally, the shape of the learnable parameters depends heavily on the number of inputs (N_in), the number of hidden units (N_hU), and the number of layers (N_l), as shown in Table 2. As inappropriate values of N_hU and N_l can lead to unreliable LSTM models, close attention should be paid to their selection. If these values are too large, the number of learnable parameters to be trained will increase, a larger training dataset will be needed, and considerable training time will be required. Complicating matters further, too many hidden units can cause overfitting in data-driven models [52]. To compensate for this issue, a dropout technique is often used: randomly dropping cells from the network during training can prevent overfitting. The fraction of dropped cells is set by the dropout rate (N_D), which ranges from 0 to 1.
To train an LSTM network by estimating the learnable parameters W, U, and b, an objective function (or loss function) must be evaluated for a given set of hyper-parameters. Here, the value of the loss function was computed from a subset (i.e., a mini-batch) of LSTM predictions and their corresponding observations; the learnable parameters were updated according to the loss function at each iteration step during training. The number of iterations (N_it) was determined by N_t, the mini-batch size (N_b), and the number of epochs (N_e) (i.e., N_it = N_t / N_b × N_e). Neural networks using small batch sizes can achieve convergence with fewer epochs [53]. However, an N_b that is too small leads to a large number of iterations, which takes excessive time to compute. For this study, the Nash-Sutcliffe efficiency (NSE) was chosen as the loss function, as it can build LSTMs with greater prediction accuracy than other metrics such as the mean square error [54].
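Because Keras minimizes a loss, the NSE must be negated to serve as a training objective. A minimal sketch, assuming the NSE is evaluated per mini-batch as described above (the paper states the choice of NSE but not this exact implementation):

import tensorflow as tf

def nse_loss(y_true, y_pred):
    # NSE = 1 - sum((y_pred - y_true)^2) / sum((y_true - mean(y_true))^2),
    # computed over the current mini-batch; negated so minimizing maximizes NSE.
    num = tf.reduce_sum(tf.square(y_pred - y_true))
    den = tf.reduce_sum(tf.square(y_true - tf.reduce_mean(y_true)))
    return -(1.0 - num / (den + 1e-8))  # small epsilon avoids division by zero

# model.compile(optimizer="adam", loss=nse_loss)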
The hyper-parameters associated with the configuration of this study consisted of the mini-batch size (N_b), dropout rate (N_D), number of hidden units (N_hU), number of layers (N_l), and number of epochs (N_e). When tuning hyper-parameters, two popular approaches, grid search and random search, are often used [55]. The grid search can be considered exhaustive, as it defines the search space as a grid of hyper-parameter values and evaluates every grid position, i.e., all combinations of all hyper-parameter values. A random search defines the search space as a bounded domain of hyper-parameter values and evaluates random combinations drawn from that domain. The latter approach can create a more reliable model by covering more distinct combinations of hyper-parameters, particularly when large amounts of training data are used [55][56][57]. A random search was therefore used to determine an appropriate set of hyper-parameters to optimize the LSTM networks in this study.
We chose a special form of resampling procedure, K-fold cross-validation, to evaluate the LSTM models' performance with a limited dataset. First, the dataset was partitioned into K equally (or nearly equally) sized folds. Subsequently, K iterations of training and validation were performed such that, within each iteration, a different fold of the dataset was held out for validation (gray cells in Figure 1c) while the remaining K-1 folds were used for training (white cells in Figure 1c) [58]. A useful set of hyper-parameters should provide nearly equally good validation values of the objective function at each iteration.
To determine an appropriate number of clusters (K), the average of the Silhouette coefficients (s̄) is commonly used [59]. For a given data point i in a cluster, the Silhouette coefficient s(i) is defined as:

s(i) = (b(i) - a(i)) / max{a(i), b(i)},

where a(i) is the average distance between point i and all other points in the same cluster, and b(i) is the average distance between point i and all points in the nearest neighboring cluster. It is advisable to choose a K that provides a high value of s̄.
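A sketch of this K selection, using scikit-learn's KMeans as the clustering step and reporting both the average Silhouette coefficient and the distortion score (inertia) used by the elbow method; the input matrix here is a random stand-in:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(500, 6))  # stand-in training inputs

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    s_bar = silhouette_score(X, km.labels_)  # average of s(i) over all points
    distortion = km.inertia_                 # within-cluster sum of squares
    print(f"K={k}: mean silhouette={s_bar:.3f}, distortion={distortion:.1f}")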

Summary of Modeling Framework
A brief overview of the methodology adopted for dam inflow prediction, hereafter called SWLSTM, follows the schematic in Figure 2: (1) Collect the time series of both the target output y_t (i.e., dam inflow) and the candidate input predictors. Any inappropriate or missing values in the collected data should be reviewed carefully.
(2) Determine the explanatory "principal" variables x_t = (x_{1,t}, x_{2,t}, ..., x_{N_in,t}), t = 1, ..., N_t, among the candidate predictors, with appropriate lag times, using the CCF and PACF. (3) Decompose and reconstruct the selected input predictors into the wavelet-transformed subseries (D^1, D^2, D^3, and A^3 for each predictor). These reconstructed data are normalized to values between 0 and 1 and then split into one set for training and validation and another for testing. In this study, 90% of the total data length was used for training and validation and 10% for testing. (4) Determine the number of clusters for K-fold cross-validation and then optimize the five hyper-parameters by random search over the training and validation set. (5) Train and build the LSTM models using the optimal hyper-parameter values over the training and validation set. (6) Assess and compare the performance of the LSTM models in predicting dam inflow on the test dataset. The LSTM models chosen to demonstrate the effectiveness of SWLSTM are (1) a regular LSTM without either the determination of principal lags and variables or the WT, (2) "WLSTM," a regular LSTM coupled with a WT, and (3) "SLSTM," which is similar to a regular LSTM but performs the input specification of Step 2.


Evaluation Metrics
Three evaluation metrics, NSE, mean absolute error (MAE), and peak error (PE), were used to quantitatively measure model accuracy. Each was computed over the test period as:

NSE = 1 - [Σ_{t=1}^{N_t} (y_t - y_t^obs)²] / [Σ_{t=1}^{N_t} (y_t^obs - ȳ^obs)²],

MAE = (1/N_t) Σ_{t=1}^{N_t} |y_t - y_t^obs|,

PE = |y_max - y_max^obs| / y_max^obs × 100%,

where y_t^obs denotes the observation of dam inflow at time t; ȳ^obs is the average of y^obs; y_max and y_max^obs are the predicted and observed peak dam inflow, respectively.
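The three metrics translate directly into a few lines of NumPy; a sketch consistent with the definitions above:

import numpy as np

def nse(y_pred, y_obs):
    return 1.0 - np.sum((y_pred - y_obs) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)

def mae(y_pred, y_obs):
    return np.mean(np.abs(y_pred - y_obs))

def peak_error(y_pred, y_obs):
    # Relative error of the predicted peak inflow, in percent.
    return abs(y_pred.max() - y_obs.max()) / y_obs.max() * 100.0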

Open Source Software
Our research relies on open-source software written in Python 3.7 [60]. The libraries NumPy [61], Pandas [62], and Scikit-learn [63] were used for managing and preprocessing the data. TensorFlow [64] and Keras [65] were used to implement the LSTMs. The hardware environment was configured with an Intel(R) Xeon(R) Gold 6242 CPU at 2.80 GHz × 32 processors and 376 GB of RAM.

Study Area and Dataset
The Hwacheon dam watershed in the central part of the Korean Peninsula (127°47′ E, 38°7′ N) was chosen as a case study (Figure 3). The watershed created by the dam covers 3901 km², approximately 80% of which is forest, and its elevation varies from close to 120 m at the dam site to 1600 m. The Hwacheon dam was designed as a multipurpose dam to generate electricity, prevent floods, and store water. Its power generation capacity is 326 GWh and its total storage capacity is approximately 1018 Mm³, making it a relatively large dam for South Korea. The Peace dam, located upstream of the Hwacheon dam, was built to prevent flooding and to prepare for North Korean military (flood) attacks, and is normally operated as a dry dam. A daily dataset covering the 5844 days from 1 January 2004 to 31 December 2019 was collected and is described in Table 3. The first 5260 days (90% of the total data) were used for training and validation, while the rest were reserved for testing the trained models. The collected data include inflow to the Hwacheon dam (Qin), outflow from the Peace dam (Qo), and meteorological data such as precipitation (Pr), temperature (Ta), humidity (H), wind speed (Ws), and pressure (Pre). Any inappropriate (negative) or missing values in the collected data were replaced by linear interpolation; such data, however, amounted to less than 0.01% of the total. Spatially averaged precipitation was computed using Thiessen polygons for the six rain gauges listed in Table 3 and shown in Figure 3.

Determining Principal Input Predictors and Their Sequence Lengths
To construct an appropriate input combination for the LSTM model, the principal variables and sequence lengths were determined using the statistical properties of the candidate variables. Figure 4 provides the statistical correlations between the seven candidate input variables (i.e., Qin, Qo, Pr, Ta, H, Ws, and Pre) and the target variable (Qin). Figure 4a shows the CCF between the seven candidate variables with a time lag of zero, indicating that Qo and Pr were strongly correlated with Qin (their CCFs were approximately 0.63 and 0.48, respectively); Ta and H had a relatively weaker correlation with Qin; and Ws and Pre had a negative correlation with Qin. The correlations for other combinations of the remaining variables were all less than 0.2, with the exception of the correlation between H and Ta. Regarding the correlations between Qin and the seven candidate variables at different time lags from 0 to 10, Figure 4b shows that Qin had a strong autocorrelation up to a 1-day lag (the PACF is approximately 0.8), and the correlation became significantly lower when the time lag was greater than 1 day. The CCF values for Qo and Pr, which had a strong correlation with Qin, became smaller as the time lag increased, while the values for the remaining four variables were not influenced by the magnitude of the time delay (see Figure 4c-h).
Based on the PACF and CCF analyses, a final set of input variables and sequence lengths (time lags) that had a high correlation with the target output Qin was selected. However, choosing only the input variables and lags that correlate closely with the target variable posed some challenges. In previous research, such a selection was typically based on user decisions made through trial and error, and no specific rules or criteria were used to determine which key inputs were optimal [5,37,66]. We proposed a robust analysis using a "correlation threshold" for the PACF and CCF values, and only variables exceeding this threshold were used as input predictors with their time lags to construct and train a model. If the correlation threshold was small (e.g., 0.026, the upper bound of the 95% CI from Equation (12)), most of the variables and sequence lengths would be adopted to predict the target variable. Conversely, if the threshold was large (e.g., 0.8), the SLSTM model was constructed using only a limited number of predictors that were highly correlated with the target variable. Figure 5 depicts the performance of the SLSTM against the correlation threshold in terms of the loss function (NSE). A model trained on a threshold of 0.4 produced the highest NSE value of 0.66, which was considered optimal. The model performed poorest with a small threshold of 0.026, which means that a large number of inputs (up to 44 in this study) that were not highly correlated with the target variable led to overfitting and less-accurate outcomes. However, with a high threshold (greater than 0.6), the number of input predictors may be too limited (only 1 in this study) to describe the target variable. By selecting the principal variables and their corresponding sequence lengths based on the optimal threshold of 0.4, Qin_{t-1}, Qo_{t-1}, Qo_{t-2}, Qo_{t-3}, Pr_{t-1}, and Pr_{t-2} became the inputs used to predict the inflow of the Hwacheon dam.
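A sketch of this threshold-based selection: every (variable, lag) pair whose correlation with the target exceeds the threshold is retained. The data and the compact correlation helper are hypothetical; on the study data with a threshold of 0.4, this procedure would return the six pairs listed above.

import numpy as np
from statsmodels.tsa.stattools import pacf

def ccf_at_lag(v, y, k):
    # Pearson correlation between v_{t-k} and y_t (a compact CCF stand-in).
    return np.corrcoef(v[: len(y) - k], y[k:])[0, 1]

def select_predictors(y, candidates, threshold=0.4, max_lag=10):
    # Lags of the target itself are screened with the PACF ...
    selected = [("Qin", k) for k, p in enumerate(pacf(y, nlags=max_lag))
                if k > 0 and abs(p) > threshold]
    # ... and candidate variables at each lag with the CCF.
    for name, v in candidates.items():
        selected += [(name, k) for k in range(1, max_lag + 1)
                     if abs(ccf_at_lag(v, y, k)) > threshold]
    return selected  # e.g., [("Qin", 1), ("Qo", 1), ("Qo", 2), ("Qo", 3), ("Pr", 1), ("Pr", 2)]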

Decomposing Input Time Series by a Wavelet Transform
Three levels of DWT decomposition were performed on the seven candidate input variables for WLSTM and on the six selected inputs defined above for SWLSTM, extracting four subseries from each. Figure 6 shows the original and transformed time series of the three principal variables Qin, Qo, and Pr. The fluctuations in the "Detail" time series are smoother and of lower frequency at higher decomposition levels. The "Approximation" A^3 exhibits a coarser, more slowly varying trend than the original time series x. Data processing by the three levels of decomposition produced a set of time series five times the number of the original time series (i.e., one original time series plus four decomposed time series each). Eventually, the inputs used to train the LSTM models presented in this study and to predict dam inflow were: (1) for LSTM, all seven candidate variables at the t-1 time step (i.e., Qin_{t-1}, Qo_{t-1}, Pr_{t-1}, Ta_{t-1}, H_{t-1}, Ws_{t-1}, and Pre_{t-1}); (2) for SLSTM, the six variables selected from the "correlation threshold" analysis above (i.e., Qin_{t-1}, Qo_{t-1}, Qo_{t-2}, Qo_{t-3}, Pr_{t-1}, and Pr_{t-2}); (3) for WLSTM, the seven candidate variables plus the wavelet subseries of each (for a total of 35 inputs); and (4) for SWLSTM, a total of 30 variables produced by the WT on the six selected variables. For more details, see Table 4.

Table 4. Summary of input and output variables used for training the four data-driven models (LSTM, SLSTM, WLSTM, and SWLSTM).
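A sketch of assembling the SWLSTM input matrix: each selected (variable, lag) pair contributes its original series plus the four wavelet subseries, giving 6 × 5 = 30 columns. It reuses the hypothetical atrous_decompose helper from the wavelet sketch, and the min-max normalization follows Step (3) of the framework.

import numpy as np

def build_inputs(series, selected, level=3):
    # series: dict mapping variable name to its numpy array (hypothetical data)
    # selected: list of (name, lag) pairs, e.g., [("Qin", 1), ("Qo", 2), ...]
    cols = []
    for name, lag in selected:
        x = np.roll(series[name], lag)  # align x_{t-lag} with y_t (wrap-around
                                        # at the start would be trimmed in practice)
        D, A = atrous_decompose(x, "db5", level)  # helper from the wavelet sketch
        cols.extend([x, *D, A])                   # original + D^1..D^3 + A^3
    X = np.column_stack(cols)                     # shape (N_t, 5 * len(selected))
    # Normalize each column to [0, 1], as in Step (3) of the framework.
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))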


Optimizing the Hyper-Parameters
Choosing an appropriate set of hyper-parameters significantly affected model performance. To investigate the effects of the five hyper-parameters on the value of the loss function (i.e., NSE), their configurations were set as listed in Table 5, with controlled values suggested by Kratzert, Klotz [37]. As shown in Figure 7, the change in NSE was negligible (neither systematically increasing nor decreasing) as the value of each hyper-parameter increased. SWLSTM consistently provided the largest NSE, with values in the range of 0.7-0.8, while LSTM provided the smallest NSE range of 0.5-0.6. These results indicate in part that SWLSTM outperformed the other models.

Table 5. The configurations of the hyper-parameters used to investigate the effect of each hyper-parameter on model performance in Figure 7. The controlled value for each hyper-parameter was borrowed from Kratzert, Klotz [37].

In this study, an optimal hyper-parameter set for the four data-driven models was determined from K-fold cross-validation and the random search. First, an appropriate number of clusters for K-fold cross-validation was chosen based on the average Silhouette coefficient, s̄, and the "distortion score" of the elbow method (Figure 8). In general, K values can be selected that provide a high value of s̄ and that correspond to the "elbow" of the distortion score curve. As K increased, the s̄ value tended to decrease overall, so K values from 2 to 5 could be chosen. The gray dashed line in Figure 8 shows an elbow near K = 5-7. A K of 5 was therefore selected to optimize the hyper-parameters of the four data-driven models in this study.

For a K of 5 and the confined ranges specified in the second column of Table 6, a set of hyper-parameters for the four models was found using the random search method (see Table 6 for the optimal hyper-parameter sets). As a result, two LSTM layers were required for all four models, while the values of the other hyper-parameters varied. Specifically, WLSTM and SWLSTM required a larger number of hidden units, more epochs, higher dropout, and a larger batch size compared with the other two models. Such a complex configuration was required because the number of inputs used by WLSTM and SWLSTM was five times larger than in the other models. The CPU time required for this process was 2-6 min, much longer than LSTM training, which takes about 50 s, and other processes (LSTM prediction, wavelet transformation, and normalization), which take only a few seconds.

Table 6. The prior ranges of the hyper-parameters used in the random search (second column) and the optimal hyper-parameter values determined for the four data-driven models by random search and K-fold cross-validation with a K of 5 (remaining columns).
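A sketch of this search loop: hyper-parameter sets are drawn at random from illustrative ranges (stand-ins for the second column of Table 6) and scored by the mean validation NSE over the five folds. Here build_model is a hypothetical helper that assembles the stacked LSTM for a given setting, and nse is the metric sketched in the evaluation section.

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

def sample_hparams():
    # Illustrative ranges only; the study's prior ranges are in Table 6.
    return {
        "n_layers": int(rng.choice([1, 2])),
        "n_hidden": int(rng.integers(5, 101)),
        "dropout": float(rng.uniform(0.0, 0.5)),
        "batch_size": int(rng.choice([32, 64, 128, 256])),
        "n_epochs": int(rng.integers(20, 201)),
    }

def cross_validated_nse(X, y, hp, k=5):
    scores = []
    for tr, va in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = build_model(hp)  # hypothetical: stacked LSTM per the earlier sketch
        model.fit(X[tr], y[tr], epochs=hp["n_epochs"],
                  batch_size=hp["batch_size"], verbose=0)
        scores.append(nse(model.predict(X[va]).ravel(), y[va]))
    return float(np.mean(scores))

# best = max((sample_hparams() for _ in range(50)),
#            key=lambda hp: cross_validated_nse(X, y, hp))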


Predicting Dam Inflow with Trained LSTMs
LSTM models trained using the optimal hyper-parameters were applied to the test dataset to predict inflow at the Hwacheon dam. Comparing the hydrographs with observations, the overall variation and magnitude of predicted inflow using SWLSTM agreed more closely with observations than did the results produced by the other models (Figure 9). Quantitatively, SWLSTM had an R² of 0.96 in a 1:1 comparison between prediction and observation, which outperformed the 0.77, 0.92, and 0.92 values produced by LSTM, SLSTM, and WLSTM, respectively. The results produced by NSE, MAE, and PE confirm that the predictions from SWLSTM were closest to observations. In particular, compared with LSTM, which has an NSE of 0.65 and a PE of 29.1%, both metrics were significantly improved to an NSE of 0.89 and a PE of 7.7% (see Figure 9 for specific values).

Table 7. Accuracy improvements of SWLSTM relative to the three other models (LSTM, SLSTM, and WLSTM) for dam inflow predictions over the test dataset. A relative "difference" metric (∆) in Equation (20) is computed for four evaluation metrics (R², NSE, MAE, and PE). Positive (negative) values of ∆ indicate that the prediction results of SWLSTM are more (less) accurate than those computed by the other models. The optimal hyper-parameters specified in Table 6 and 5-fold cross-validation (K = 5) are employed.
To examine the superiority of the SWLSTM proposed in this study over the other models, a relative "difference" metric (∆) was introduced in Equation (20):

∆ = [ |Metric_A - Metric_ideal| - |Metric_B - Metric_ideal| ] / |Metric_A - Metric_ideal| × 100%, (20)

where Metric_A and Metric_B are the values of a metric for two models A and B, and Metric_ideal represents the ideal (perfect) value of the metrics R², NSE, MAE, and PE (1, 1, 0, and 0, respectively). Positive (negative) values of ∆ indicate that the prediction results of model B are more (less) accurate than those computed by model A. Table 7 shows that the accuracy performance of SWLSTM was superior to that of LSTM, SLSTM, and WLSTM. Based on these results, we concluded that the selection of appropriate input predictors and time lags helped create a more reliable data-driven model (see the comparisons of SLSTM vs. LSTM and SWLSTM vs. WLSTM). In general, all factors related to weather and hydrology, as well as the past history of each variable, can affect dam inflow predictions, but using too much information that is not highly correlated (Figure 4) produces an overfitted model. As the number of inputs increases, the noise associated with those inputs also increases, and it becomes difficult to accurately estimate the weights (learnable parameters) for each input. Additionally, the historical period that affects future inflow varies by predictor. For example, Qin is closely related to itself a day earlier, while for Qo and Pr, data from 3 and 2 days earlier, respectively, are also important.
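The reconstructed Equation (20) reduces to a one-line function; as a usage example, applying it to the NSE values of LSTM and SWLSTM reported in Figure 9 gives the improvement of SWLSTM on that single metric:

def delta(metric_a, metric_b, ideal):
    # Improvement of model B over model A: the relative reduction in
    # distance from the ideal metric value, in percent.
    return (abs(metric_a - ideal) - abs(metric_b - ideal)) / abs(metric_a - ideal) * 100.0

print(delta(0.65, 0.89, ideal=1.0))  # ≈ 68.6% improvement in NSE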
It is interesting to note that the use of the WT improved the flood peak predictability of the data-driven models. That is, the PE values of WLSTM and SWLSTM were approximately one-half and one-quarter the size of those from LSTM and SLSTM, respectively. Additionally, near the flood peak in the test dataset, the bias values for both models using the WT were much smaller than those for the non-WT models (Figure 10). With the aid of the WT, five times as much data were used, reconstructed at various levels of decomposition. For training on extreme data such as flood peaks, including the separate time series produced by the WT appeared to be effective. If only the original series is used for training, abnormally extreme high-frequency data may be treated as noise rather than as critical information to be learned, and the model may fail to recognize, learn, and predict these events.

Figure 10. Bias between observation and prediction of dam inflow for the four models over the test dataset; Std is the standard deviation. The optimal hyper-parameters specified in Table 6 and 5-fold cross-validation (K = 5) are employed.


Feasibility to Multimodal, Multitask, and Bidirectional Learning
Recently, multimodal, multitask, and bidirectional learning have received great interest [33,34,66,67], and it is worth discussing their feasibility for hydrological time series predictions (e.g., dam inflow, runoff, or flood predictions). First, both multimodal learning with multiple inputs and multitask learning with multiple outputs are related to high-dimensional problems. In hydrological time series predictions, the dimension grows in three cases: when the number of input or output variables is large, when the sequence length or lead time of each variable is long, and when the values of the variables vary spatially. In the last case, the number of dimensions can increase significantly, up to O(10^2) to O(10^6), whereas it remains modest, O(10^0) to O(10^1), in the first two cases. In this study, multimodal learning for the first two cases was performed with a maximum of 30 dimensions. As a future study, it will be necessary to examine the learning ability for ultra-high-dimensional problems that account for the spatial heterogeneity of variables.
The second is to examine the applicability of bidirectional learning in predicting hydrologic time series. Several studies have mentioned the superiority of bidirectional learning [68], which combines information from both the past and the future at the same time. However, this is limited to predictions of language models or hindcasting problems in which future data exist. It is challenging to apply it to hydrological forecasting cases because future information about weather variables (e.g., precipitation) and human (e.g., dam) operation is unknown at the present time.

Conclusions
In this study, data-driven models based on an LSTM network were built to predict daily inflow at the Hwacheon dam in South Korea. Three important aspects were considered to improve the accuracy of dam inflow predictions in an integrated fashion: (1) principal input predictors and their time lags were determined from a robust analysis of the statistical properties of the data series; (2) the original time series was converted to multiscale subseries by a WT; (3) hyper-parameters of all models were efficiently optimized through K-fold cross-validation and the random search. The effectiveness of SWLSTM, a model trained to consider these aspects, was compared with LSTM (trained without input selection and data transformation), SLSTM (trained with input selection only), and WLSTM (trained with data transformation only). The primary findings of this study are presented in the following paragraphs.
First, seven candidate input variables (i.e., inflow to the Hwacheon dam, outflow from the Peace dam, precipitation, temperature, humidity, wind speed, and pressure) were initially chosen to investigate their correlation with the target output, Qin. Based on the PACF and CCF analyses, we selected a final set of input variables and their sequence lengths (time lags). However, how to choose only the input variables and lags that are closely correlated with the target variable remains an open question. In this study, a robust analysis using a correlation threshold for the PACF and CCF values was proposed, and only variables exceeding this threshold were selected as input predictors with their time lags. As shown in Figure 5, a model trained on a threshold of 0.4 produced the highest NSE value. Eliminating variables that have a low correlation with Qin helped prevent divergence and restrict overfitting in the learning model. Conclusively, Qin_{t-1}, Qo_{t-1}, Qo_{t-2}, Qo_{t-3}, Pr_{t-1}, and Pr_{t-2} became the principal inputs for predicting the inflow of the dam. The effectiveness of this input specification was validated because the models using it (SLSTM and SWLSTM) provided exceptionally accurate predictions compared with the models that did not (i.e., LSTM and WLSTM).
Second, using the additional data series reconstructed by the WT improved predictability, particularly for the flow peak (see the comparisons of WLSTM vs. LSTM and SWLSTM vs. SLSTM). The PE values of WLSTM and SWLSTM were approximately one-half and one-quarter the size of those produced by LSTM and SLSTM, respectively. For training on extreme data such as flow peaks, including the separate time series produced by the WT can be effective. If only the original series is used for training, abnormally extreme high-frequency data may be treated as noise rather than as critical information to be learned, and the model may fail to recognize, learn, and predict these events.
Third, for a K of 5, as determined by the Silhouette coefficients and the distortion score (Figure 8), a set of hyper-parameters for the four models was found using a random search (Table 6). Both WLSTM and SWLSTM required a larger number of hidden units, more epochs, higher dropout, and a larger batch size compared with the other two models. The need for this complex configuration is clear because the number of inputs used by WLSTM and SWLSTM was five times greater than that of the other models.
Last, accuracy performance investigated by various evaluation metrics revealed that SWLSTM is superior to LSTM, SLSTM, and WLSTM by 84%, 78%, and 65%, respectively. Coupling the LSTM framework with the WT and the input-specification procedure improved both the overall and peak accuracy of time-dependent flow prediction. Ultimately, accurate inflow forecasts will help policy-makers and operators better manage their reservoir operations and tasks.