Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inflow Forecasting

Abstract: Due to the stochastic nature and complexity of flow, as well as the existence of hydrological uncertainties, predicting streamflow in dam reservoirs, especially in semi-arid and arid areas, is essential for the optimal and timely use of surface water resources. In this research, daily streamflow to the Ermenek hydroelectric dam reservoir located in Turkey is simulated using deep recurrent neural network (RNN) architectures, including bidirectional long short-term memory (Bi-LSTM), gated recurrent unit (GRU), long short-term memory (LSTM), and simple recurrent neural networks (simple RNN). For this purpose, daily observed flow data for the period 2012–2018 are used, and all models are coded in the Python programming language. Only delays of the streamflow time series are used as model inputs. Then, based on the correlation coefficient (CC), mean absolute error (MAE), root mean square error (RMSE), and Nash–Sutcliffe efficiency coefficient (NS), the results of the deep-learning architectures are compared with one another and with an artificial neural network (ANN) with two hidden layers. Results indicate that the deep-learning RNN methods are more accurate than the ANN. Among the deep-learning methods, LSTM has the best accuracy, simulating streamflow to the dam reservoir with 90% accuracy in the training stage and 87% accuracy in the testing stage. By comparison, the accuracies of the ANN in the training and testing stages are 86% and 85%, respectively. Considering that the Ermenek Dam is used for hydroelectric purposes and energy production, modeling inflow in the most realistic way may lead to an increase in energy production and income by optimizing water management. Hence, improvements of even a few percentage points can be extremely useful. According to the results, deep-learning RNN methods can be used for estimating streamflow to the Ermenek Dam reservoir due to their accuracy.


Introduction
Large dam structures are built to reserve water for different supply objectives, such as drinking water, irrigation, hydroelectric generation, and flood control. Generating clean energy, especially in developing countries, is a major focus of managers of hydroelectric dams. Accurate predictions of daily several clusters, and used the deep network of SAE-BP (integrating stacked autoencoders with BP neural networks) for modeling. Then, they compared the results with other methods, such as SVM. According to the results, the deep-learning algorithm presented a relatively better performance in flood estimation. Esmaeilzadeh et al. [14] estimated daily flow to the Sattarkhan Dam in Iran by using data-mining methods, including ANNs, the M5 tree method, support vector regression, and hybrid Wavelet-ANN methods. Results indicated that the wavelet artificial neural network (WANN) method estimated flow better than the other methods. Amnatsan et al. [15] used the variation analog method (VAM), WANN, and weighted-mean analog methods to estimate input flow to the Sirikit Dam in Thailand. Results revealed that the VAM approach was better than the other methods for high-intensity flows. Chiang et al. [16] integrated ensemble techniques into artificial neural networks to reduce model uncertainty in hourly streamflow predictions in the Longquan Creek and Jinhua River watersheds in China. Results demonstrated that the ensemble neural networks improved the accuracy of streamflow predictions by about 19–37% compared to a single neural network. Zhang et al. [17] used ANN, SVM, and long short-term memory (LSTM) methods to operate the Gezhouba Dam reservoir in China at hourly, daily, and monthly time scales. Results showed that the LSTM method, with less computational time, was more accurate in peak periods, and that the maximum number of iterations affected model performance.
Zhou et al. [18] used three ANN architectures (a radial basis function network, an extreme learning machine, and the Elman network) for monthly streamflow forecasting in the Jinsha River Basin in China. The best estimate was made by the Elman architecture, with an R of 0.906. Kratzert et al. [19] used the LSTM method with a freely available data set for rainfall-runoff modeling. Results showed that this method works well and could be used in hydrological modeling. Zhang et al. [20] used three deep learning algorithms (recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU)) to predict outflows in the Xiluodu reservoir. They determined that all three models can predict reservoir outflows accurately and efficiently and could be used to control floods and generate power; the number of iterations and hidden nodes mainly influenced the model precision. Kao et al. [21] used the LSTM-based Encoder-Decoder (LSTM-ED) model for multi-step-ahead flood forecasting in the Shihmen Reservoir catchment in Taiwan. Results showed that the proposed model, which translated and linked the rainfall sequence with the runoff sequence, could improve the reliability of flood forecasting and increase the interpretability of model internals. Zhou et al. [22] adopted an unscented Kalman filter (UKF) post-processing technique for point flood forecasting by RNN in the Three Gorges Reservoir in China. Results indicated that the proposed method extracted the complex non-linear dependence structure between the model's outputs and observed inflows and overcame the systematic error.
Fossil fuel resources are inadequate worldwide as well as in Turkey, and electricity production from fossil fuels is expensive. Hydroelectric power plants are therefore potentially important for electricity production, and the country relies on them for stabilizing economic conditions. Inflow to the reservoir of a hydroelectric power plant is converted into much cheaper electricity. Increasing prediction accuracy can lead to better planning for timely production and to higher generation, which in turn improves the profits and the efficiency of the dam. However, streamflows have a random and stochastic structure. Deep learning can be utilized to evaluate the complex and nonlinear association between streamflows, basin properties, and meteorological variables. Streamflow in a stream is a time series with a seasonal recurrence, and the RNN is a technique that works efficiently under recurrent conditions. RNN architectures have not been sufficiently examined or compared for streamflow. Given the significance of flow prediction for hydroelectric power production, this study attempts to predict daily inflow to the Ermenek Dam reservoir located in Turkey by utilizing deep learning techniques. Because streamflow is a time series and RNNs are well suited to sequential data, this research uses four RNN architectures: Bi-LSTM, gated recurrent unit (GRU), LSTM, and a simple recurrent neural network (simple RNN). The efficiency of the deep learning techniques in flow prediction is assessed, and the results acquired from these techniques are compared with those of the ANN technique. The deep-learning method suggested in this study can help specialist engineers make more accurate and effective predictions. This research may be considered a part of creative prediction approaches and stochastic views in hydrology.

Study Area and Data
In the study, data measured at the Ermenek Dam, which is within the boundaries of the Ermenek District of Karaman Province, were used. The district is located at 36°58′N and 32°53′E. The average height of the district above sea level is 1250 m. The town of Ermenek contains many river sources, plateaus, promenades, as well as sources of historical and natural beauty. The Ermenek River, one of the crucial streams in the region, collects almost all waters of the region. Ermenek District, with an area of 122,297 ha, covers 13.86% of Karaman and 0.16% of Turkey. Of the total area, 21% of the district is farmland, 10% is meadow-pasture, and 38% is forest area [23][24][25][26].
Data used in the research belong to the Ermenek Dam, which was opened in 2009 with an installed capacity of 308.88 MW and an annual electricity generation capacity of 1187 GWh. The body of the dam is a concrete arch, and the crest length is 123 m. The filling volume is 305,000 m³, and the storage volume is 4582 billion m³ [27].
To simulate streamflow to the Ermenek Dam (Figure 1), daily mean inflow data for the 2012–2018 period are used. The basic characteristics of the data are provided in Table 1, and the time-series plots are displayed in Figure 2.

For modeling, an ANN and four different RNN deep-network methods were employed. In total, 70% and 30% of the data were used in the training and testing stages, respectively, as indicated in Figure 2. Several delayed daily mean streamflow values are used as input data. For this purpose, the correlation between the streamflow time series and its delays is computed, and seven delays ($SF_{t-1}$, $SF_{t-2}$, $SF_{t-3}$, $SF_{t-4}$, $SF_{t-5}$, $SF_{t-6}$, and $SF_{t-7}$) are selected as inputs due to their high correlation (Figure 3). The correlation coefficients vary between 0.63 and 0.85 over the seven lags. Similar studies [7,8] have also used seven delays of the flow time series as inputs.
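The input preparation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (`make_lagged`, `chronological_split`, `lag_correlation`) and the toy series are hypothetical, and the split is assumed to be chronological (first 70% for training), which the text suggests but does not state explicitly.

```python
def make_lagged(series, n_lags=7):
    """Build rows [SF(t-1), ..., SF(t-n_lags)] with target SF(t)."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        # Most recent delay first: series[t-1], series[t-2], ..., series[t-n_lags]
        inputs.append([series[t - k] for k in range(1, n_lags + 1)])
        targets.append(series[t])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.7):
    """Split without shuffling, so the test period follows the training period."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


def lag_correlation(series, lag):
    """Pearson correlation between SF(t) and SF(t-lag)."""
    x, y = series[lag:], series[:-lag]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy series standing in for the daily inflow record.
flow = [5.0, 6.0, 9.0, 14.0, 11.0, 8.0, 7.0, 6.5, 6.0, 10.0, 15.0, 12.0]
X, y = make_lagged(flow, n_lags=7)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
```

Calling `lag_correlation(flow, k)` for k = 1 to 7 gives the kind of values summarized in a lag-correlation plot; keeping the split unshuffled avoids leaking future flows into the training set.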

Water 2020, 12, x FOR PEER REVIEW
Figure 3. Correlation plot between streamflow and its delays.

Artificial Neural Network (ANN)
An ANN is a distributed information-processing system whose operating principles resemble those of the human brain, and it is based on a simulated biological neural network [28]. Each neural network has three types of layers, namely, input, hidden, and output. The input layer supplies the input data to the network. The output layer contains the values predicted by the network. The hidden layer is where the data are analyzed. Usually, the number of neurons in the layers is obtained by trial and error. The general architecture of the ANN is displayed in Figure 4, where $X = (x_1, x_2, \ldots, x_n)$ is the input vector, $W$ denotes the connecting weights to the next layer, $b_k$ is the bias, and $y_k$ is the final output of the ANN. The activation function converts input signals into output signals. In Figure 4, the $n$ inputs $x_1$ to $x_n$ are paired with the corresponding weights $W_{k1}$ to $W_{kn}$. First, the inputs are multiplied by their weights, and the products are summed together with the bias to obtain $u$ (Equation (1)):
$$u = \sum_{i=1}^{n} W_{ki}\, x_i + b_k \tag{1}$$
Then, the activation function $f$ is applied to $u$, and the final output of the neuron is obtained as $y_k = f(u)$. The most popular activation functions are Sigmoid, ReLU, and Softmax. In this study, feed-forward neural networks were used.
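As a small worked example of the weighted sum and activation described above, the following sketch evaluates one neuron in plain Python. The input values, weights, and bias are made up for illustration, and ReLU is chosen here only because it is the hidden-layer activation reported later in the study.

```python
def relu(u):
    """ReLU activation: passes positive values, clips negatives to zero."""
    return max(0.0, u)


def neuron_output(x, w, b, activation=relu):
    """Compute y_k = f(u) with u = sum_i W_ki * x_i + b_k."""
    u = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(u)


# Hypothetical inputs, weights, and bias for a single neuron.
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, 0.1]
b = 0.05
y_k = neuron_output(x, w, b)  # u = 0.2 - 0.3 + 0.2 + 0.05 = 0.15
```

A full feed-forward layer simply repeats this computation once per neuron, each with its own weight row and bias.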

Recurrent Neural Networks (RNN)
Deep-learning algorithms are a class of machine-learning algorithms whose purpose is to discover multiple levels of representation of the input data. Developed in the 1980s, multilayer RNNs are among the most commonly used models for deep learning [30]. These networks have a memory that records the information they have seen so far, and many variants exist. Moreover, RNNs are powerful models for sequential data (time series) [31], and they use the previous output to predict the next output. To do so, the networks contain recurrent loops. These loops, which are in the hidden neurons, allow information from previous inputs to be stored for a while so that the system can predict future outputs. The hidden layer output is fed back to the hidden layer t times. The output of a recurrent neuron is sent to the next layer only when this number of iterations is completed. As a result, the output is more comprehensive, and the previous information is kept for longer. Finally, the errors are propagated backward to update the weights. In this study, four available RNN architectures are used. To improve the readability of this study, a research flow chart is given in Figure 5.

Simple Recurrent Neural Network (Simple RNN)
A simple RNN is essentially a chain of ordinary neural networks arranged together, each of them transmitting a message to the next. In other words, these networks have a memory that stores knowledge about the data seen, but their memory is short term and cannot retain long-term time series [32]. Figure 6a displays a simple RNN. A simple recurrent network has only one internal memory, $h_t$, which is computed from Equation (2):
$$h_t = g\left(W x_t + U h_{t-1} + b\right) \tag{2}$$
where $g(\cdot)$ denotes an activation function, $U$ and $W$ are adjustable weight matrices of the $h$ layer, $b$ is a bias, and $x_t$ is the input vector at time $t$ [19].
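The recurrence can be illustrated with a short plain-Python sketch. It assumes the common form $h_t = g(W x_t + U h_{t-1} + b)$ with $g = \tanh$ and a scalar input per time step (one streamflow value); the weights below are arbitrary illustrative numbers, not trained values.

```python
import math


def rnn_step(x_t, h_prev, W, U, b):
    """One step of h_t = tanh(W x_t + U h_{t-1} + b) for a scalar input x_t."""
    h_t = []
    for j in range(len(h_prev)):
        # Contribution of the previous hidden state through row j of U
        recurrent = sum(U[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(W[j] * x_t + recurrent + b[j]))
    return h_t


def rnn_forward(sequence, W, U, b):
    """Run the recurrence over a sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    for x_t in sequence:
        h = rnn_step(x_t, h, W, U, b)
    return h


# Hypothetical weights for a hidden state of size 2.
W = [0.5, -0.3]
U = [[0.1, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]
h_final = rnn_forward([0.2, 0.4, 0.1], W, U, b)
```

Because the same `W`, `U`, and `b` are reused at every step, the only thing carrying information forward is `h`, which is why the memory of a simple RNN fades over long sequences.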

Long Short-Term Memory (LSTM)
LSTM is a structure for sequential data introduced in [33] as an advancement of the RNN. It uses a special combination of hidden units, elementwise products, and sums between units to implement gates that control "memory cells." These cells are designed to retain information without modification for long periods [34]. The greatest feature of LSTM is its capability to learn long-term dependencies, which is not possible with simple RNNs. To predict the next step, the weight values of the network have to be updated, which requires information from the initial steps to be maintained. A simple RNN can only learn a limited number of short-term relationships and cannot learn long-term series. LSTM, however, can learn these long-term dependencies properly. It has three gates: input, forget, and output (Figure 6b). The forget gate indicates how much of the previous memory is remembered and how much is forgotten. For LSTM, the hidden state $h_t$ is computed as follows:
$$i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \tag{3}$$
$$f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \tag{4}$$
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \tag{5}$$
$$\tilde{C}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \tag{6}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{7}$$
$$h_t = o_t \odot \tanh\left(C_t\right) \tag{8}$$
where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates at time $t$, respectively; $W_i$, $W_f$, $W_o$, and $W_c$ are weights that map the hidden layer input to the three gates and the cell candidate, while the weight matrices $U_i$, $U_f$, $U_o$, and $U_c$ map the hidden layer output to them; $b_i$, $b_f$, $b_o$, and $b_c$ are bias vectors; $\sigma$ here denotes the sigmoid function and $\odot$ the elementwise product. Moreover, $C_t$ and $h_t$ are the outcome of the cell and the outcome of the layer, respectively [35].
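To make the role of the gates concrete, here is a minimal single-unit LSTM step in plain Python, following the standard gate equations (sigmoid gates, tanh candidate). All weights are scalars and the numbers are hypothetical; the biases are deliberately extreme so that the forget gate is almost fully open and the input gate almost fully closed.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a single unit; p maps names such as 'W_i' to scalars."""
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])  # output gate
    c_candidate = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * c_candidate  # keep part of the old cell, add new
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


# Hypothetical parameters: forget gate ~1 (b_f = 20), input gate ~0 (b_i = -20).
params = {name: 0.5 for name in
          ("W_i", "U_i", "W_f", "U_f", "W_o", "U_o", "W_c", "U_c")}
params.update({"b_i": -20.0, "b_f": 20.0, "b_o": 0.0, "b_c": 0.0})

h, c = 0.0, 0.8
for x_t in [0.3, -0.1, 0.6]:
    h, c = lstm_step(x_t, h, c, params)
# With these gates the cell state stays very close to its initial value 0.8.
```

Flipping `b_f` to a large negative value instead makes the cell forget its previous content in a single step, which is the behavior the forget gate is meant to control.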

Gated Recurrent Unit (GRU)
GRU is a simplified type of LSTM suggested by Cho et al. [36]. The difference from LSTM is that GRU merges the input and forget gates and replaces them with an update gate. Therefore, GRU has fewer parameters than LSTM, which makes training easier. For GRU, the output value $h_t$ is computed as follows:
$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \tag{9}$$
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \tag{10}$$
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \tag{11}$$
$$h_t = z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t \tag{12}$$
where $r$ is a reset gate, and $z$ denotes an update gate. The reset gate indicates how the new input is combined with earlier memory. The update gate indicates how much of the previous memory is kept. If the update gate is 1, the previous memory is fully preserved, and if it is 0, the previous memory is completely forgotten. There is a forget gate in the LSTM that automatically determines how much of the previous memory is maintained whereas, in GRU, all previous memories are maintained or completely forgotten (Figure 6c). For some problems, GRUs can provide a performance comparable with LSTMs, but with a lower memory requirement [34,36].
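The two limiting cases of the update gate can be checked with a single-unit GRU step in plain Python. The sketch assumes the convention stated in the text (update gate of 1 keeps the previous memory, 0 discards it); parameter names and values are illustrative only.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t, h_prev, p):
    """One GRU step for a single unit; p maps names such as 'W_z' to scalars."""
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])  # reset gate
    h_candidate = math.tanh(p["W_h"] * x_t + p["U_h"] * (r_t * h_prev) + p["b_h"])
    # z_t = 1 keeps the previous memory; z_t = 0 replaces it with the candidate
    return z_t * h_prev + (1.0 - z_t) * h_candidate


# Hypothetical base parameters with all weights 0.5 and all biases 0.
base = {name: 0.5 for name in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")}
base.update({"b_z": 0.0, "b_r": 0.0, "b_h": 0.0})

keep = dict(base, b_z=30.0)      # update gate saturated near 1
replace = dict(base, b_z=-30.0)  # update gate saturated near 0

h_kept = gru_step(0.4, 0.7, keep)
h_replaced = gru_step(0.4, 0.7, replace)
```

With `keep`, the output is essentially the previous state 0.7; with `replace`, it is essentially the new candidate computed from the input and the reset-gated memory.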

Bidirectional LSTM (Bi-LSTM)
In a Bi-LSTM, the sequence is processed in two directions instead of one, backward and forward, with two separate hidden layers. Graves and Schmidhuber [37] demonstrated that bidirectional networks responded better than unidirectional networks in cases such as phonemic clustering. The structure of a bidirectional network is displayed in Figure 6d. As the figure shows, these networks consist of a forward and a backward LSTM layer. The forward layer output sequence, $\overrightarrow{h}_t$, is computed iteratively from the inputs in positive order from time $T - n$ to time $T - 1$, while the backward layer output sequence, $\overleftarrow{h}_t$, is computed using the inverted inputs from time $T - n$ to $T - 1$ [35]. The outputs of both the forward and backward layers are computed in the same manner as for the unidirectional LSTM. In the Bi-LSTM layer, $Y_t$ is computed from Equation (13):
$$Y_t = \sigma\left(\overrightarrow{h}_t, \overleftarrow{h}_t\right) \tag{13}$$
where $\sigma$ is a function used to merge the two outputs $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ [35].
In all networks, dropout is used between layers, which is a technique to prevent network overfitting. This means that learning takes place on different architectures, each with a different set of active neurons. Figure 7a,b illustrate networks before and after dropout, respectively.
The learning rate is one of the hyperparameters whose optimal value must be found, and it usually takes values between 1 and $1 \times 10^{-7}$. The learning rate expresses the size of the steps taken by the network. Figure 8 compares the changes in the loss function versus the epoch for different learning rates. With a low learning rate, it takes a long time to find the best solution, whereas a learning rate that is too high causes the network to overshoot the optimum. Since a high learning rate is advantageous in early iterations and a low one is advantageous in later iterations, a learning rate that decreases as the algorithm progresses is preferred [39].
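One simple way to obtain a learning rate that shrinks as training progresses is time-based decay. The sketch below is an assumption for illustration: the formula `initial_lr / (1 + decay * step)` is a common schedule, but the source does not state which schedule was actually used, and the numeric values are hypothetical.

```python
def decayed_learning_rate(initial_lr, decay, step):
    """Time-based decay: the rate shrinks as the step counter grows."""
    return initial_lr / (1.0 + decay * step)


# Hypothetical starting rate and decay factor, chosen only for illustration.
initial_lr, decay = 1e-3, 1e-2
schedule = [decayed_learning_rate(initial_lr, decay, s) for s in range(0, 501, 100)]
```

Here the rate starts at its initial value and is halved by step 100, so early updates take large steps while later updates refine the solution more cautiously.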

Open-Source Software and Codes
This research relies heavily on open-source software. Python 3.6 is used as the programming language. The libraries used for data preprocessing and data management are Pandas, NumPy, and Scikit-Learn. The deep-learning frameworks employed are TensorFlow and Keras. Separate code is used for each RNN architecture and for the ANN. All figures are created using Matplotlib.

Evaluation Criteria
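In order to evaluate the models, the correlation coefficient (CC), Nash–Sutcliffe coefficient (NS), root mean square error (RMSE), and mean absolute error (MAE) are used, as presented in Equations (14) to (17), respectively:
$$CC = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2 \sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}} \tag{14}$$
$$NS = 1 - \frac{\sum_{i=1}^{N}\left(X_i - Y_i\right)^2}{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2} \tag{15}$$
$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_i - Y_i\right)^2} \tag{16}$$
$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|X_i - Y_i\right| \tag{17}$$
where $X_i$ is the observed value, with mean $\bar{X}$; $Y_i$ is the predicted value, with mean $\bar{Y}$; and $N$ is the number of instances. The closer the first two criteria are to 1 and the closer the error measures are to 0, the better the performance of the model. According to Chiew et al. [41], if NS > 0.90, the simulation is very acceptable; if 0.60 < NS < 0.90, the simulation is acceptable; and if NS < 0.60, the simulation is unacceptable.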
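The four criteria can be computed with a few lines of plain Python. The sketch below uses the standard definitions of these statistics; the function names and the small observed/predicted lists are hypothetical and serve only as a worked example.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cc(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    mo, mp = mean(obs), mean(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)


def ns(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mo = mean(obs)
    sq_err = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - sq_err / sum((o - mo) ** 2 for o in obs)


def rmse(obs, pred):
    return math.sqrt(mean([(o - p) ** 2 for o, p in zip(obs, pred)]))


def mae(obs, pred):
    return mean([abs(o - p) for o, p in zip(obs, pred)])


# Hypothetical observed and predicted flows; every prediction is off by 0.5.
obs = [2.0, 4.0, 6.0, 8.0]
pred = [2.5, 3.5, 6.5, 7.5]
scores = {"CC": cc(obs, pred), "NS": ns(obs, pred),
          "RMSE": rmse(obs, pred), "MAE": mae(obs, pred)}
```

For this toy case RMSE and MAE are both 0.5 and NS is 0.95; a perfect prediction would give CC = NS = 1 and zero error.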

Results and Discussion
In this research, daily streamflow to the Ermenek Dam reservoir in Turkey was simulated using various deep-learning models. In this section, the observed historical streamflow data are compared with the streamflow computed by the artificial neural network and by the RNN architectures (Bi-LSTM, GRU, LSTM, and simple RNN) using seven lag days.
A network attempts to predict outcomes as accurately as possible. This precision is measured by the cost function, which penalizes the network when it fails; the optimal output is the one with the lowest cost. In this study, the mean squared error (MSE) is used as the cost function for all networks. Each training iteration works on a subset of the training data called a batch. The number of samples per batch (the batch size) is a hyperparameter that is normally obtained by trial and error; in this study, the best value for all models is 512. In each iteration, the cost function is computed as the MSE over these 512 samples of observed and predicted streamflow. One pass over the training data is called an epoch; in each epoch, the streamflow time series is simulated by the network once.
As in other networks, the number of neurons and layers in recurrent networks can be chosen freely. So that the models can be compared with one another, all recurrent network models are given identical structures. Each network uses two hidden layers, with 200 units in the first layer and 150 units in the second. The output of the last recurrent layer at the final time step is connected to a dense layer with a single output neuron. A dropout of 10% is used between the layers. The ANN likewise has two hidden layers, with 200 and 150 neurons, respectively. In all networks, the ReLU activation function is applied in the hidden layers. The main advantage of ReLU is that its derivative is constant for all inputs greater than 0, which speeds up network learning.
Each method is run with 100, 300, and 500 epochs. The results of the different methods in terms of the evaluation criteria, together with their running times, are presented in Table 2 (computations are performed on a laptop computer with a Pentium B960 2.20 GHz processor, 4 GB RAM, and Windows 8). Among the methods, ANN is the fastest; among the RNN methods, the order from fastest to slowest is simple RNN, GRU, LSTM, and Bi-LSTM. As seen in Table 2, the number of epochs is one of the influential parameters: with fewer iterations, the accuracy of the model is low and the error is greater. As the number of iterations increases, the model gradually converges, so that there is no significant difference between the results of 300 and 500 epochs. For this reason, a maximum of 500 epochs is considered sufficient. Another parameter that influences the accuracy of the models is the number of neurons in the hidden layers. If it is too low, the model cannot simulate the flow correctly, and if it is too high, there is a risk of overfitting. This risk is mitigated by dropout, which deactivates a fraction of the neurons during training. The values of the learning rate (LR) and decay are also presented in Table 2; decay specifies how much the learning rate is reduced at each step.
As seen in Table 2, five different evaluation criteria are used to compare five different prediction methods. Each method has its best result with 300 or 500 epochs. The performance in the testing stage is generally 1–5% lower than that in the training stage. According to Table 2, among the methods used, LSTM performs best, at 500 epochs, with an accuracy of CC = 87% in the testing stage. This network has long-term memory, and its forget gate specifies how much of the previous memory is kept. In the first step, the input data are multiplied by the weights and summed with the bias to produce an output. At this step, the output is likely to be very different from the actual output. Therefore, the errors are propagated backward to update the weights and biases.
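The reported speed ordering of the recurrent models is consistent with how many trainable weights each cell type carries. The rough count below is a sketch under stated assumptions: it uses the textbook cell equations (1, 3, and 4 weight blocks for simple RNN, GRU, and LSTM), one input feature per time step, and the 200/150-unit layout followed by a single-output dense layer. Library implementations may add extra bias terms, so actual counts can differ slightly.

```python
def recurrent_layer_params(units, input_dim, n_blocks):
    """Weights per block: input matrix, recurrent matrix, and one bias vector."""
    return n_blocks * (units * (units + input_dim) + units)


def stacked_model_params(n_blocks, input_dim=1, units=(200, 150)):
    """Total parameters of stacked recurrent layers plus a 1-neuron dense layer."""
    total, in_dim = 0, input_dim
    for u in units:
        total += recurrent_layer_params(u, in_dim, n_blocks)
        in_dim = u  # the next layer receives this layer's hidden state
    return total + in_dim + 1  # dense layer: one weight per unit plus a bias


# Number of weight blocks per cell type under the textbook equations.
counts = {name: stacked_model_params(k)
          for name, k in (("simple RNN", 1), ("GRU", 3), ("LSTM", 4))}
```

Under these assumptions the simple RNN has about 93 thousand parameters, the GRU about 279 thousand, and the LSTM about 372 thousand, which matches the order simple RNN, GRU, LSTM from fastest to slowest; a bidirectional model roughly doubles the recurrent part again.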
A comparison of some statistics for all methods used in the study is given in Table 3. In this table, the measured streamflow values (those set aside for testing and the whole streamflow data set) can be compared with the values predicted by the methods. The GRU and LSTM models gave the results closest to the measured values in terms of the maximum, mean, and standard deviation (StDev). The simple RNN, ANN, and Bi-LSTM methods gave the worst results, in that order.
The loss function, observational streamflow values and predicted values for training and testing stages computed by various methods are displayed in Figure 9. When the loss function and time-series graphs are examined, the most successful methods are LSTM and GRU. The loss charts of testing and training stages of both methods overlap at the highest epoch value. According to the loss function graphs (Figure 9), with the reduction of the modeling error for the training stage, the error of the testing stage decreases, and the distance between the two graph lines reduces. Therefore, it can be said that the dropout function helps to prevent network overfitting. Besides, considering the changes in the loss function value versus epochs (and according to Figure 8) can ensure that the learning rate is selected correctly.  Figure 10 shows the scatter of observed values versus the predicted values for both training and testing stages. Regarding the streamflow graphs (Figures 9 and 10), the networks yield poor results in peak periods and cannot simulate these periods well, except LSTM. One major reason of this is that streamflow abundance has very few high values, and the network cannot properly learn them. This happens even though the streamflow abundance has several low values, so the model can correctly and accurately be learned in the training stage. According to the data, the difference between the minimum and the maximum flow values is high, and the validity of these values is investigated during the study period based on precipitation in the region. In periods where peak streamflow is observed, the maximum rainfall is observed with a significant amount that indicates the correctness of peak flow values. However, among the methods applied, the LSTM network is better than other methods, and is able to simulate peak flow periods fairly. 
As noted above, the most important feature of the LSTM network is its ability to learn long-term dependencies; the forget gate lets the network keep or discard the desired amount of previous memory, which helps to improve the modeling. The GRU method is similar to the LSTM method, and its results are close to those of LSTM. In contrast, the other methods failed to simulate peak flow values. The scatter plots in Figure 10 likewise show relatively little dispersion between observed and predicted data for these networks in the training and testing stages compared to the other methods. Kratzert et al. [19] used deep LSTM networks in similar research and demonstrated that these networks perform better than other networks and can be applied in hydrological simulations.
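One way to quantify the peak-flow behaviour discussed above is to score high-flow and low-flow days separately. The sketch below is illustrative only and is not taken from the study: the threshold, the function names, and the flow values are hypothetical, and RMSE is used simply because it is one of the metrics reported in the paper.

```python
import math

def rmse(pairs):
    """Root mean square error over (observed, predicted) pairs; None if empty."""
    if not pairs:
        return None
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))

def peak_and_base_rmse(observed, predicted, threshold):
    """Score days with observed flow >= threshold separately from the rest."""
    pairs = list(zip(observed, predicted))
    peak = [(o, p) for o, p in pairs if o >= threshold]
    base = [(o, p) for o, p in pairs if o < threshold]
    return {
        "peak_fraction": len(peak) / len(pairs),  # how rare the high flows are
        "peak_rmse": rmse(peak),
        "base_rmse": rmse(base),
    }

# Hypothetical series with a single peak day (not data from the study).
observed = [10.0, 12.0, 11.0, 100.0, 9.0, 13.0]
predicted = [11.0, 11.0, 12.0, 70.0, 10.0, 12.0]

result = peak_and_base_rmse(observed, predicted, threshold=50.0)
print(result)
```

In this made-up example the single peak day accounts for one sixth of the record yet carries an error thirty times larger than the low-flow days, which mirrors the argument that rare high values are hard for a network to learn.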

Conclusions
In this paper, a novel approach to streamflow simulation based on recurrent neural networks is presented. For this purpose, 70% of the inflow data were used to train an ANN and several RNNs, including Bi-LSTM, LSTM, GRU, and simple RNN; the remaining 30% of the data were then predicted. Each network used two hidden layers. The number of neurons in the hidden layers, the learning rate, and the number of iterations play a fundamental role in modeling accuracy; their optimal values were obtained by trial and error. Using the dropout function in the network structure helps to prevent network overfitting. All models were coded in the Python programming language. The efficiency of the proposed approach was evaluated for the simulation of daily inflow to the Ermenek Dam reservoir in Turkey. The results point to better performance of the recurrent networks (NS: 0.74, RMSE: 17.92) compared to the ANN (NS: 0.72, RMSE: 18.71). Among the RNN architectures, the LSTM network performed best and can estimate streamflow with fairly good accuracy (CC: 0.87), which indicates that these networks can be used as an effective method of streamflow modeling and simulation.
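The evaluation workflow summarized here (a 70/30 split, lagged streamflow as the only input, and scoring with CC, MAE, RMSE, and NS) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's code: the split is assumed to be chronological, CC is assumed to be the Pearson correlation coefficient, the number of lags is arbitrary, and all function names and values are hypothetical.

```python
import math

def chronological_split(series, train_fraction=0.7):
    """Assumed chronological split: leading part for training, rest for testing."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]

def make_lagged(series, n_lags):
    """Pair each value with the n_lags values that precede it."""
    inputs = [series[i - n_lags:i] for i in range(n_lags, len(series))]
    targets = [series[i] for i in range(n_lags, len(series))]
    return inputs, targets

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def correlation(observed, predicted):
    """Pearson correlation coefficient (assumed definition of CC)."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over observed variance."""
    mean_o = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - residual / spread

# Hypothetical ten-day series: 7 days for training, 3 for testing.
flows = [float(v) for v in range(1, 11)]
train, test = chronological_split(flows)
lag_inputs, lag_targets = make_lagged(train, n_lags=2)  # n_lags is illustrative

# Hypothetical observed and predicted values used only to exercise the metrics.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
scores = {
    "CC": correlation(observed, predicted),
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "NS": nash_sutcliffe(observed, predicted),
}
print(scores)
```

An NS of 1 indicates a perfect match, while an NS of 0 means the model is no better than predicting the observed mean, so the reported gain from 0.72 to 0.74 is a modest but real reduction in residual variance.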
Thermal power plants are not environmentally friendly and contribute to air pollution and global warming. Because fossil resources in Turkey are limited, the importance of hydroelectric power plants has increased. Whether a reservoir was built for hydroelectric power production or for agricultural irrigation, streamflow prediction can help to increase efficiency and achieve that purpose. In this study, deep learning methods are used for the first time to model inflows to the Ermenek Dam reservoir located in Turkey. The results of this study are consistent with those of Nagesh Kumar et al. [5] and Zhang et al. [17]. Given that water resource management is essential, the use of these methods can help managers and officials of the Ermenek Dam in future planning. Accordingly, increasing the accuracy of the model can help researchers and experts to achieve this goal.