Development and Evaluation of Combined Machine Learning Models for the Prediction of Dam Inflow

Abstract: Predicting dam inflow is necessary for effective water management. This study created machine learning algorithms to predict the amount of inflow into the Soyang River Dam in South Korea, using weather and dam inflow data for 40 years. A total of six algorithms were used, as follows: decision tree (DT), multilayer perceptron (MLP), random forest (RF), gradient boosting (GB), recurrent neural network–long short-term memory (RNN–LSTM), and convolutional neural network–long short-term memory (CNN–LSTM).


Introduction
Global warming has led to concerns over climate change and increased the complexity of the hydrologic cycle, resulting in greater uncertainty in the management of water resources [1]. In Korea especially, it is important to establish a plan for water resource management through efficient water management, because of the high coefficient of flow fluctuation and the steep geographical characteristics [2]. Korea has continued to develop programs for watershed planning and sustainable water resource management by improving the operational efficiency of hydraulic structures. However, as the weather and flow regimes vary greatly in Korea, no single algorithm can be perfect for streamflow estimation, and new algorithms considering these various characteristics need to be selected and developed.
To reflect the characteristics of the catchments in South Korea, the dynamic flow regimes due to the temporal distribution of rainfall should be considered. Therefore, this study aims to (1) evaluate the performance of the algorithm to predict the amount of inflow of the Soyang River Dam, and (2) develop and evaluate the combined machine learning algorithms, with the consideration of flow duration.

Study Area
The Soyang River Dam is located in the Han River Basin in South Korea (Figure 1). It is a multipurpose dam for water supply and power generation, with an effective storage capacity of 2,900,000,000 tons. The hydrologic catchment area of the dam is 2637 km², predominantly forest (89.5%), with the remaining land uses being agricultural land (5.7%), water (2.4%), and other uses (2.4%); the catchment covers several administrative districts (Chuncheon, Inje, Yanggu, and Goseong) and part of North Korean territory. According to the precipitation data collected at the Chuncheon weather station, annual rainfall from 1980 to 2019 averaged 1321 mm and varied between 677 mm and 2069 mm with an increasing pattern, as shown in Figure 2. The Soyang River Dam is the main water source for the metropolitan area. The inflows from Inbuk Stream and Naerin Stream are the main tributaries to the Soyang River Dam, and the flow rate at their confluence accounts for more than 90% of the total inflow of the dam [24]. The discharge of the Soyang River Dam, which directly affects the water environment downstream, is carried out during water level control before floods, water supply in the drought season, and power generation at the dam.
Between 1980 and 2019, the average flow and the peak flow recorded at the Soyang River Dam were 68.6 m³/s and 7405.6 m³/s, respectively, with a decreasing pattern, as shown in Figure 2. The increasing tendency of precipitation and the decreasing tendency of inflow are attributed to an increase in evapotranspiration driven by the increasing annual average of the maximum temperature. South Korea suffered its worst droughts from 2014 to 2015, receiving less than 43% of the annual precipitation average of the past 30 years [25]. Predicting floods and droughts can reduce damage caused by disasters and enable efficient water management.

Data Descriptions
The machine learning prediction models were built to estimate the inflow into the Soyang River Dam, using inflow data and weather data over a learning period of 40 years (1980-2019). The time series weather data (precipitation, maximum temperature, minimum temperature, humidity, wind speed, and solar radiation) were extracted from the Korea Meteorological Administration [26] database for the Chuncheon observation station, the station nearest to the Soyang River Dam. The inflow data of the Soyang River Dam were obtained from the Korea Water Resources Corporation (K-water) [27]. The time series data used for prediction of the dam inflow are shown in Figure 3, which presents 40 years of variation in dam inflow (m³/s), precipitation (mm), maximum temperature (°C), minimum temperature (°C), humidity (%), wind speed (m/s), and solar radiation (MJ/m²). Figure 2 also shows that the patterns of precipitation and dam inflow are similar.
In this study, the machine learning prediction model for the dam inflow of a given day used the following data: weather data of the day (forecasted), weather data of one day ago, weather data of two days ago, the inflow of one day ago, and the inflow of two days ago, for the study period of 40 years (1980-2019) (Table 1). The weather data include precipitation, maximum temperature, minimum temperature, relative humidity, wind speed, and solar radiation. In addition, prior weather conditions in the basin have a significant impact on soil moisture [28]; therefore, weather and flow data of one and two days ago were used for machine learning, to take prior weather conditions into account.
Table 1. The input data for machine learning models.
Data preprocessing, which improves the quality of data and generates comprehensible information, needs to be done for effective machine learning [29]. Data preprocessing includes the selection of input variables, standardization, noise instance removal, data dimension reduction, and the treatment of multicollinearity. Several studies revealed that more thorough data preprocessing yields better predictive performance [30,31]. In this study, data preprocessing was performed with scaling and standardization methods, using the 'StandardScaler' function, one of the most commonly used scalers of the 'sklearn.preprocessing' library [32,33]. This step applies a linear transformation to all data so that each feature has a mean of 0 and a variance of 1.
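As a minimal sketch of what the 'StandardScaler' step computes (zero mean, unit variance per feature), the transformation can be written in plain Python; the precipitation and inflow values below are made-up, not taken from the study data:

```python
import math

def standard_scale(columns):
    """Scale each feature column to zero mean and unit variance,
    mirroring what sklearn.preprocessing.StandardScaler computes."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        # population variance (ddof = 0), as StandardScaler uses
        var = sum((x - mean) ** 2 for x in col) / len(col)
        std = math.sqrt(var) or 1.0  # guard against constant columns
        scaled.append([(x - mean) / std for x in col])
    return scaled

# Example: a precipitation (mm) column and an inflow (m³/s) column
precip = [0.0, 12.5, 3.1, 45.0, 0.0]
inflow = [20.4, 85.0, 33.2, 410.7, 18.9]
scaled = standard_scale([precip, inflow])
```

In practice, fitting the scaler on the training period only and reusing the fitted mean and variance for the test period avoids leaking test-period statistics into training.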

Machine Learning Algorithms
Machine learning is an algorithm that learns from data and improves its performance as it learns. Machine learning is classified into three branches: supervised learning, unsupervised learning, and reinforcement learning [29,34]. In this study, supervised learning was used to predict the inflow of the Soyang River Dam. Supervised learning uses labeled data (e.g., precipitation, temperature, inflow, outflow, etc.) for training, and infers a function from the data. A total of six methods, including RNN-LSTM and CNN-LSTM, were used to build models to estimate the amount of dam inflow. Model information is given in Table 2; the deep learning models were built with functions of Keras modules in the TensorFlow library. "Notation" refers to the abbreviated name used in the figures.

Decision Tree
A decision tree is a widely used model for classification and regression; it learns by repeatedly asking yes-or-no questions to reach a decision [35]. In the decision tree, the hyperparameter that controls model complexity is a prepruning parameter that stops the tree before it is fully grown. In general, setting any one of "max_depth", "max_leaf_nodes", or "min_samples_leaf" is sufficient to prevent overfitting. The decision tree in scikit-learn is intended to provide adequate prepruning through "min_samples_leaf". The critical hyperparameters in the decision tree (DT) regressor are the following: entropy for criterion, 1 for min_samples_leaf, 0 for min_impurity_decrease, best for splitter, 2 for min_samples_split, and 0 for random_state (Table 3).
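A toy sketch of the regressor with the Table 3 settings, assuming scikit-learn, is below. Note that the "entropy" criterion applies to classifiers; the regressor sketch therefore keeps scikit-learn's default squared-error criterion. The feature/target values are made-up, not the study data:

```python
from sklearn.tree import DecisionTreeRegressor

# Hyperparameters as listed in Table 3; "entropy" is a classifier
# criterion, so the default squared-error criterion is kept here.
dt = DecisionTreeRegressor(
    splitter="best",
    min_samples_leaf=1,
    min_samples_split=2,
    min_impurity_decrease=0.0,
    random_state=0,
)

# Toy fit: two lagged features (precipitation, prior inflow) -> inflow
X = [[0.0, 20.4], [12.5, 85.0], [3.1, 33.2], [45.0, 410.7]]
y = [18.9, 120.3, 25.7, 650.2]
dt.fit(X, y)
pred = dt.predict([[10.0, 80.0]])
```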

Multilayer Perceptron
A multilayer perceptron (MLP) is one of the feed-forward neural network (FFNN) structures, consisting of three layers: an input layer, hidden layers, and an output layer [36]. The input data enter at the input layer, are weighted according to the configured hidden layer structure, and the results are produced at the output layer [36]. Recently, MLPs configured with more than one hidden layer have produced more accurate predictions than other machine learning techniques [37]. Table 3 provides the hyperparameter settings for the MLPRegressor function. The critical hyperparameters in the MLP regressor are the following: 50 nodes for each of the three hidden layers for hidden_layer_sizes, adam for solver, 0.001 for learning_rate_init, 200 for max_iter, 0.9 for momentum, 0.9 for beta_1, 1e−8 for epsilon, and relu for activation.
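The MLPRegressor settings above can be sketched as follows, assuming scikit-learn; the training values are made-up placeholders for the lagged weather/inflow features:

```python
from sklearn.neural_network import MLPRegressor

# Hyperparameters as listed for the MLPRegressor in Table 3:
# three hidden layers of 50 nodes each, Adam optimiser, ReLU activation.
mlp = MLPRegressor(
    hidden_layer_sizes=(50, 50, 50),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    max_iter=200,
    momentum=0.9,   # only used by the SGD solver; listed for completeness
    beta_1=0.9,
    epsilon=1e-8,
    random_state=0,
)

X = [[0.0, 20.4], [12.5, 85.0], [3.1, 33.2], [45.0, 410.7]]
y = [18.9, 120.3, 25.7, 650.2]
mlp.fit(X, y)  # may warn about non-convergence on this tiny toy set
pred = mlp.predict([[10.0, 80.0]])
```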

Random Forest
Random forest is an ensemble technique, developed by Breiman [38], that combines the bagging algorithm, the ensemble learning method, and the classification and regression tree (CART) algorithm. Random forest is also executable on large-scale data and provides high accuracy because it runs using many variables without removing them [37,39]. In addition, compared to the artificial neural network and support vector regression, its hyperparameters are simpler to tune in detail. Of the hyperparameters set in the "RandomForestRegressor" function used in this study, the most sensitive is the primary parameter. Since the random forest model is an ensemble of decision trees, the primary parameter is the number of trees, "n_estimators", and the value is set to 50. The other variables are as follows: 2 for min_samples_split, 0 for min_weight_fraction_leaf, 0 for min_impurity_decrease, 0 for verbose, mse for criterion, 1 for min_samples_leaf, auto for max_features, and true for bootstrap (Table 3).

Gradient Boosting
Gradient boosting is an ensemble model that applies the boosting algorithm of ensemble learning to decision trees. In gradient boosting, the gradient reveals the weaknesses of the models learned so far, and subsequent models focus on those weaknesses to boost performance. The parameters that minimize the loss function, which quantifies errors in the predictive model, should be found for better prediction. An advantage of gradient boosting is that various loss functions can be used, since the character of the loss function is automatically reflected in learning through the gradient [40].

LSTM
LSTM is a type of RNN, and an RNN is a deep learning algorithm that learns time series data repeatedly [41]. An RNN is a structure in which the output of the previous step, in the course of data learning, affects the output of the current step. This connects current and past learning and is useful for continuous and repetitive learning; however, predictive performance is compromised when data from too far in the past are used. LSTM is an RNN-based deep learning algorithm that makes it easier to predict time series data by taking into account the order or timing of learning, and by mitigating the chronic vanishing-gradient problem of weight updates in RNNs [42,43]. Several recent studies have shown that LSTM can transform its structure to improve its predictive performance [15]. In this study, the RNN-LSTM was built from "LSTM" and "dense" layers, with a "dropout" layer in the middle to prevent overfitting (Figure 4).
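The gating that distinguishes an LSTM cell from a plain RNN can be sketched in plain Python for a single scalar time step; the weights below are made-up, and real layers vectorize this over many units:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for scalar input and state.

    The gates decide how much old cell state to keep (forget), how much
    new candidate information to admit (input), and what to expose
    (output); the additive cell-state update is what eases gradient
    flow across long time lags.
    """
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate
    c = f * c_prev + i * g   # cell state: additive memory update
    h = o * math.tanh(c)     # hidden state passed to the next step
    return h, c

# Made-up weights; run the cell over a short inflow-like sequence
w = dict(wf=0.5, uf=0.1, bf=0.0, wi=0.4, ui=0.2, bi=0.0,
         wo=0.3, uo=0.1, bo=0.0, wg=0.6, ug=0.2, bg=0.0)
h, c = 0.0, 0.0
for x in [0.2, 0.8, 0.5]:
    h, c = lstm_step(x, h, c, w)
```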

CNN-LSTM
As the performance of deep learning has recently been verified throughout data science and technology, it is believed that the deep neural network (DNN), a deep learning technique for numerical prediction problems, can also contribute to improving the accuracy of dam inflow calculation. CNN-LSTM, which places a CNN front end before the LSTM, is one example of such an LSTM transformation [22,44,45].
Information on the layers added to the sequential function for CNN-LSTM in this study is shown in Figure 4, consisting of seven layers. Although a CNN mainly uses two-dimensional data, it can be applied to one-dimensional time series data to extract data characteristics for prediction. Additional "Conv1D" and "MaxPooling1D" layers were used to construct the CNN-LSTM.
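What the "Conv1D" and "MaxPooling1D" front end does to a time series can be sketched in plain Python (convolution layers as typically implemented actually compute cross-correlation); the series and kernel are made-up:

```python
def conv1d(seq, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation, as Conv1D computes)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(seq) - k + 1)]

def max_pool1d(seq, size=2):
    """Non-overlapping max pooling, like MaxPooling1D with pool_size=2."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]

# A short inflow-like series: the kernel acts as a local feature
# detector; pooling keeps the strongest response per window, shortening
# the sequence the LSTM then has to learn.
series = [1.0, 3.0, 2.0, 5.0, 4.0, 1.0]
features = conv1d(series, kernel=[0.5, 0.5])  # 2-point moving average
pooled = max_pool1d(features)
```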

Development of Combined Machine Learning Algorithms (CombML)
As indicated in the objective of this study, we aimed to predict the inflow of the Soyang River Dam accurately, using past weather data, past inflow, and the weather forecast. The results showed that the ensemble models (random forest and gradient boosting) performed well for inflows under 100 m³/s, while MLP had merits when predicting inflows over 100 m³/s. Therefore, a new model combining MLP and the ensemble models was created to predict the dam inflow; however, the next day's inflow, which would determine which sub-model to apply, cannot be known in advance. Therefore, the forecasted precipitation, which the heatmap analysis showed to have the highest correlation with the dam inflow, was used as the switching criterion for the new combined model. The reference point was set by averaging rainfall on days with dam inflow greater than 100 m³/s; this average precipitation was 16 mm.
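The switching logic can be sketched as a small dispatcher; the stand-in model functions and their outputs below are hypothetical placeholders, not fitted models:

```python
def comb_ml_predict(forecast_precip_mm, mlp_model, ensemble_model,
                    features, threshold_mm=16.0):
    """CombML switch: MLP for likely high-flow days, ensemble otherwise.

    The 16 mm threshold is the average forecast precipitation on days
    with observed inflow above 100 m³/s, as derived in the text.
    """
    if forecast_precip_mm >= threshold_mm:
        return mlp_model(features)    # high-flow regime
    return ensemble_model(features)   # normal/low-flow regime

# Hypothetical stand-in models for illustration only:
mlp = lambda f: 250.0
rf = lambda f: 42.0
high = comb_ml_predict(30.0, mlp, rf, features=None)  # rainy day -> MLP
low = comb_ml_predict(2.0, mlp, rf, features=None)    # dry day -> ensemble
```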

Model Training Test
In this study, the inflow of the Soyang River Dam was predicted using weather data and inflow data for the period 1980 to 2019. The model training period was from 1980 to 2016, and the test period was from 2017 to 2019.
To assess the performance of each machine learning model, the Nash-Sutcliffe efficiency (NSE), root mean squared error (RMSE), mean absolute error (MAE), correlation coefficient (R), and coefficient of determination (R²) were used. Numerous studies have indicated the appropriateness of these measures for assessing the accuracy of hydrological models [46-48]. NSE, RMSE, MAE, R, and R² can be calculated from Equations (1)-(5).
where O_t is the observed value at time t, Ō is the mean of the observed values, M_t is the estimated value at time t, M̄ is the mean of the estimated values, and n is the total number of time steps. RMSE is the standard deviation of the residuals, and MAE is the mean of the absolute values of the errors; therefore, the closer these values are to zero, the more similar the modeled values are to the observations.
R, the correlation coefficient, represents the magnitude of the correlation; R is +1 if the observed and simulated values agree perfectly, 0 if they are uncorrelated, and −1 if they are perfectly correlated in the opposite direction. R² indicates how well the simulated values reproduce the variability of the observed values.
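Equations (1)-(5) are not reproduced in this excerpt, but the standard definitions consistent with the symbols described above can be written directly (R² here is the square of R); the sample observed/simulated arrays are made-up:

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 − SSE / variance of observations."""
    mean_o = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - num / den

def rmse(obs, sim):
    """Root mean squared error of the residuals."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))

def mae(obs, sim):
    """Mean absolute error."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)

def pearson_r(obs, sim):
    """Pearson correlation coefficient R; R² is pearson_r(...) ** 2."""
    mo = sum(obs) / len(obs)
    ms = sum(sim) / len(sim)
    num = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    den = math.sqrt(sum((o - mo) ** 2 for o in obs)
                    * sum((s - ms) ** 2 for s in sim))
    return num / den

obs = [10.0, 50.0, 120.0, 30.0]
sim = [12.0, 45.0, 110.0, 35.0]
scores = (nse(obs, sim), rmse(obs, sim), mae(obs, sim), pearson_r(obs, sim))
```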

Impact Factor Analysis
In this study, heatmap analysis was used to evaluate the correlation of the data used (Figure 5). The evaluation showed the highest correlation coefficient, 0.59, between precipitation and dam inflow, followed by minimum temperature, humidity, maximum temperature, wind speed, and solar radiation, which suggests that precipitation has the most robust effect on the dam inflow.
Figure 6 shows the feature importance of each input variable used in the decision tree model. The feature importances of precipitation, maximum temperature, minimum temperature, humidity, wind speed, and solar radiation, the weather data used to predict the dam inflow, are 0.549, 0.097, 0.150, 0.073, 0.078, and 0.054, respectively. Regarding feature importance for the Soyang River Dam inflow prediction, precipitation has the highest importance, whereas the other inputs are considerably less important.
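Impurity-based feature importances like those in Figure 6 can be read directly from a fitted scikit-learn tree; in this contrived sketch the second feature is held constant so all importance must fall on the first (the data are made-up, not the study inputs):

```python
from sklearn.tree import DecisionTreeRegressor

# Toy data: columns are precipitation (mm) and a constant maximum
# temperature; with no variance, temperature cannot explain the target,
# so precipitation should receive all of the importance.
X = [[0.0, 25.0], [12.5, 25.0], [3.1, 25.0], [45.0, 25.0], [7.7, 25.0]]
y = [5.0, 130.0, 30.0, 460.0, 80.0]

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
importances = tree.feature_importances_  # sums to 1 across features
```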

Prediction Results Using Machine Learning Algorithms
Table 4 shows the prediction accuracy results (NSE, RMSE, MAE, R, and R²) of the six machine learning models, comparing the predicted dam inflow to the observed dam inflow. The MLP model showed high prediction accuracy, with an NSE of 0.812, RMSE of 77.218 m³/s, MAE of 29.034 m³/s, R of 0.924, and R² of 0.817. MLP is among the most widely used and powerful supervised learning models for prediction [29,49]. The prediction of the dam inflow using MLP was more accurate than using deep learning models with more layers. Some previous studies have shown that a simpler neural network model, such as MLP, can perform much better than more complex models, such as the deep-learning stacked autoencoder (SAE) [50]. Given the characteristics of the data, forming deep layers leads to overfitting, which in turn prevents the model from reducing the loss in actual prediction. The second-best models are the ensemble models, random forest and gradient boosting, with NSEs of 0.745 and 0.718 and R² values of 0.753 and 0.718, respectively, which indicates that these models are the most effective, after MLP, in predicting dam inflow. On the other hand, CNN-LSTM (LSTM combined with CNN) has an NSE of 0.455, slightly better than RNN-LSTM with an NSE of 0.429, but less predictive than the other machine learning models. Figure 7 shows the loss, calculated as the mean squared error of the training and validation data, over 100 epochs for the deep learning RNN-LSTM and CNN-LSTM models, with the validation loss falling to 0.0001. Figure 7 indicates that about 60 epochs are sufficient to train the data.
Due to the difficulty of long-term simultaneous analysis in one graph, the results from the six models are analyzed separately in Figure 8. On 5 July 2016, the 187th Julian day, the peak inflow of 3918.5 m³/s was recorded, and on 29 August 2018, the 972nd day, the second-highest inflow of 2380.8 m³/s was recorded, between 2016 and 2019. The prediction models failed to predict inflow accurately when an excessive flow entered the dam during the evaluation period, owing to the scarcity of training data for intensive floods. On the other hand, for dam inflows of 500 m³/s or less, the trends are generally well matched.
Figure 8 shows that the predictions by (b) MLP among the nonlinear regression models and (e) LSTM among the deep learning models best represent the first peak flow in the time series of observed values, while the predictions by the (a) decision tree and (d) gradient boosting models are overestimated relative to the observed values. Among all the techniques, the MLP model performed best, representing the time series variation of the observed values most comprehensively.
Due to the difficulty of analyzing residuals in a time series plot of dam inflows, Figure 9 presents X-Y plots of predicted versus observed values for flow rates below 100 m³/s. Of the daily inflows of the Soyang River Dam, 87.15% are below 100 m³/s, and most of the observed inflow data fall in this range except under heavy rainfall. Figure 9 shows that DT, RF, and GB predict the dam inflow appropriately; however, MLP and CNN-LSTM tend to overestimate it, and LSTM performed the least accurately among the algorithms. The predicted results by (c) random forest and (d) gradient boosting in Figure 9 show the smallest residual differences from the observed values; in other words, dam inflows below 100 m³/s are well predicted by the ensemble models.
To predict the amount of dam inflow during heavy rain, the significant changes in inflow at heavy rainfall events must be captured. Figure 10 compares the simulated and observed dam inflows when the observed inflow is greater than 100 m³/s. Compared to the model performances when the dam inflow is under 100 m³/s, all six models had worse results.
Among the models, MLP showed the prediction best matching the observed dam inflow; meanwhile, the CNN-LSTM model captured the magnitude of peaks under 700 m³/s well, but failed to capture the pattern of peaks over 700 m³/s. The poor accuracy of the peak predictions can be attributed to the scarcity of training data, since heavy rainfall events occur rarely.
Although their frequency is low, flood prediction is necessary to prepare for floods. Table 5 shows the dam inflow observations over 1000 m³/s together with the machine-learning forecasts. The analysis reveals that the decision tree tends to over- or underestimate the peak flow, and the ensemble models underestimated the peak flows during 2017. Although MLP failed to predict the exact values, its predictions appear close to the observations on average (Table 5).
Figure 10. Comparison of dam inflow predicted by machine learning models and observed dam inflow, when the observed dam inflow is over 100 m³/s.

Prediction Results Using CombML
As indicated in the objective of this study, we predicted the inflow of Soyang River Dam using the past weather data, inflow, and the weather forecast accurately. The results of the study showed that the ensemble models (the random forest and gradient boosting) performed well in under 100 m 3 /s of the inflow. On the other hand, MLP has merits when predicting the inflow of over 100 m 3 /s.
Therefore, a new model combining MLP and the ensemble models was created to predict the dam inflow. However, the next day's inflow is not known in advance, so it cannot serve as the switching criterion. Instead, the forecasted precipitation, which the heatmap analysis showed to have the highest correlation with the dam inflow, was used as the criterion for the new combined model. The reference point was set by averaging the rainfall on days with dam inflow greater than 100 m³/s; the average precipitation over these filtered days was 16 mm. Hence, MLP was used to predict the dam inflow when the daily precipitation is 16 mm or more, and the ensemble models (random forest and gradient boosting) were used when the daily precipitation is less than 16 mm.
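The switching rule described above can be sketched as follows. The 16 mm threshold and the division of labor between the models come from the text; the model objects themselves are placeholders for any fitted regressors with a scikit-learn-style `predict` method, and the function name is ours.

```python
import numpy as np

RAIN_THRESHOLD_MM = 16.0  # mean rainfall on days with inflow > 100 m^3/s

def combml_predict(forecast_precip_mm, features, high_flow_model, low_flow_model):
    """Route each day to MLP (heavy rain) or an ensemble model (otherwise).

    forecast_precip_mm : (n,) forecasted daily precipitation in mm
    features           : (n, k) predictor variables for the same days
    high_flow_model    : model used when precipitation >= 16 mm (MLP)
    low_flow_model     : model used otherwise (RF or GB)
    """
    forecast_precip_mm = np.asarray(forecast_precip_mm, dtype=float)
    preds = np.empty(len(forecast_precip_mm), dtype=float)
    heavy = forecast_precip_mm >= RAIN_THRESHOLD_MM
    if heavy.any():
        preds[heavy] = high_flow_model.predict(features[heavy])
    if (~heavy).any():
        preds[~heavy] = low_flow_model.predict(features[~heavy])
    return preds
```

Because the rule depends only on forecasted precipitation, the combined model can be run one day ahead without knowing the inflow it is trying to predict.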
The results are shown in Figure 11 and Table 6. The random forest–MLP combined model (RF_MLP) achieved the best results, with an NSE of 0.857, RMSE of 68.417 m³/s, MAE of 18.063 m³/s, R of 0.927, and R² of 0.859. The gradient boosting–MLP combined model (GB_MLP) achieved an NSE of 0.829, RMSE of 73.918 m³/s, MAE of 18.093 m³/s, R of 0.912, and R² of 0.831. In previous streamflow estimations using single machine learning algorithms [22,23], as well as in the single-algorithm predictions conducted in this study, there were comparatively high uncertainties in certain situations. For the inflow of the Soyang River Dam, since flow duration affected the prediction accuracy of the algorithms, the CombML algorithms improved the prediction accuracy across both the peak flow and the normal and low flows. This accuracy improvement illustrates that a dam inflow prediction system should be constructed with interval-divided prediction using the combined model.
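The goodness-of-fit statistics reported in Table 6 (NSE, RMSE, MAE, Pearson R, and R²) follow standard definitions and can be computed as below; the function names are ours, chosen for illustration.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over obs variance."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def rmse(obs, sim):
    """Root mean square error, in the units of the flow (m^3/s)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def mae(obs, sim):
    """Mean absolute error, in the units of the flow (m^3/s)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.mean(np.abs(obs - sim)))

def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated series."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.corrcoef(obs, sim)[0, 1])
```

A perfect prediction yields NSE = 1, RMSE = MAE = 0, and R = R² = 1; the RF_MLP values above (NSE 0.857, R 0.927) are therefore close to the upper bound on this scale.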

Conclusions
This study evaluated the performance of algorithms for predicting the inflow of the Soyang River Dam, and developed and evaluated combined machine learning algorithms that take flow duration into account. The comparative analysis of inflow prediction across the algorithms showed that MLP was the best single algorithm for flow prediction. However, even the best single algorithm was limited in predicting flow over the entire flow duration, which means that a single algorithm cannot fully account for the flow regimes. To overcome this, the CombML algorithms were developed, and the results show that inflow can be predicted by training on inflow while considering flow characteristics, such as the flow regimes in Korea.
The CombML was developed and evaluated to account for the flow regimes that the single algorithms could not. The CombML models, the random forest–multilayer perceptron model (RF_MLP) and the gradient boosting–multilayer perceptron model (GB_MLP), increased the accuracy of the dam inflow prediction, whereas each of the single models performed only partially satisfactorily. RF_MLP achieved an NSE of 0.857, RMSE of 68.417 m³/s, MAE of 18.063 m³/s, R of 0.927, and R² of 0.859; GB_MLP achieved an NSE of 0.829, RMSE of 73.918 m³/s, MAE of 18.093 m³/s, R of 0.912, and R² of 0.831. The weakness of the MLP model in predicting dam inflow was mitigated by combining MLP with the ensemble models, which improved the prediction of flood runoff. The experimental results clearly indicate that CombML overcomes the limitations that flow regime and rainfall impose on inflow prediction with a single algorithm.
Although this research focused on the Soyang River Dam watershed, broader application of the algorithm can be expected, because most dam watersheds in Korea are covered by forest, like the Soyang River Dam, and the reference point is based on precipitation rather than flow rate. Therefore, CombML, which takes flow regimes into account, is expected to provide basic data for water supply planning, such as controlling the dam water level during flooding and securing dam storage capacity during drought. Moreover, the method could be applied globally to other water sources (e.g., dams, rivers, and streams) by analyzing the correlation of the data and combining the models, since it requires only weather and flow data.
For further research, dam inflow prediction technology should be developed that considers not only natural factors, but also artificial factors in the upstream watershed, such as water use, land use changes, and hydraulic structures.

Conflicts of Interest:
The authors declare no conflict of interest.