Estimation of Low-Flow in South Korean River Basins Using a Canonical Correlation Analysis and Neural Network (CCA-NN) Based Regional Frequency Analysis

Abstract: Low-flow quantiles at ungauged locations are generally estimated with hydrological methods such as the drainage area ratio and frequency analysis methods. In practice, the drainage area ratio approach is popular, but it is a simple linear model. When hydrologically nonlinear characteristics govern the runoff process, the linear approach leads to significant bias. This study was conducted to develop an improved nonlinear approach using a canonical correlation analysis and neural network (CCA-NN)-based regional frequency analysis (RFA) for low-flow estimation. The jackknife technique was utilized to validate the two methods. The approaches were applied to 33 river basins in South Korea. In this work, we focused on two-year and five-year return periods. For the two-year return period, the BIAS, RMSE, and R² were 0.013, 0.511, and 0.408 with RFA and −0.042, 1.042, and 0.114 with the drainage area ratio method, respectively; for the five-year return period, the respective indices were −0.018, 0.316, and 0.573 with RFA and 0.166, 0.536, and 0.044 with the drainage area ratio method. RFA outperformed the drainage area ratio method based on its high prediction accuracy and ability to avoid the bias problem. This study indicates that machine learning-based nonlinear techniques have the potential for use in estimating reliable low-flows at ungauged sites.


Introduction
Reliable low-flow estimates are necessary to provide information for water supply planning, reservoir storage design, water quantity and quality preservation, irrigation, hydropower production, and pollution load dispersion [1][2][3][4]. In the case of insufficient or no streamflow records, several approaches can be used to obtain low-flow estimates. For example, regression models, including linear methods, can be applied with explanatory variables that are determined by physiographical and meteorological characteristics. Additionally, several studies with nonlinear models have been conducted to provide more reliable low-flow estimates [5][6][7].
The drainage area ratio method is a linear model between the drainage area and discharge and has been popular for the estimation of low-flow with a 10/365 non-exceedance probability [8]. A number of studies have applied the drainage area ratio method. Wiche et al. [9] examined historic streamflow data by focusing on the James River in North Dakota and South Dakota, USA and performed record extension based on different techniques, such as the drainage area ratio method. Guenthner et al. [10] and Emerson and Dressler [11] studied monthly gauged and estimated streamflow for the Red River, USA and used the drainage area ratio approach to develop streamflow records. Cho et al. [12] investigated low-flows during a dry season in South Korea to obtain low-flow estimates at ungauged sites based on three approaches, including the drainage area ratio method.
The regional frequency analysis (RFA) method has been widely used to assess hydrological characteristics at locations with little or no data available. The hydrological estimations that are derived from RFA are of prime significance in the design of hydraulic structures such as dams and reservoirs. In RFA, two principal steps are required: (a) the identification of groups of basins (homogenous regions) that are hydrologically similar to a target basin and (b) model application for regional estimation within the homogenous regions. These regions have been traditionally defined using geographical and administrative boundaries considering hydrological features [13,14]. The region of influence approach, which pools a certain number of river basins based on proximity in a catchment feature space, has also been utilized to define homogenous regions with objective functions [15][16][17]. In recent studies, the canonical correlation analysis (CCA) was recommended and used to determine hydrologically similar regions by creating a canonical space and providing the optimal number of stations in the regions [18].
As a common nonlinear regression approach, artificial neural networks (ANNs) have been broadly adopted for a wide range of hydrological problems. Luk et al. [19] performed rainfall forecasting using an ANN over an urban catchment in Australia. Shu and Burn [20] and Dawson et al. [21] used an ANN for indexing floods and flood quantile estimation based on catchments in the United Kingdom (UK) by improving a hydrological prediction model. Seidou et al. [22] also applied an ANN for the regional estimation of lake ice thickness at ungauged locations in Canada. Shu and Ouarda [23] used regional frequency analysis based on ANN models to obtain flood quantile estimations for 151 river networks in the province of Quebec, Canada. Ouarda and Shu [3] conducted a regional low-flow frequency analysis using an ANN model with low-flow quantiles of the summer and winter seasons based on selected river basins in Canada.
The main objectives of the present study are to develop an advanced method of obtaining low-flow estimates in ungauged basins and to identify the relationships between physiographical/meteorological variables and hydrological variables in South Korea. A regional low-flow estimation approach based on CCA and ANNs for RFA is proposed and compared with the drainage area ratio method. CCA is used to identify the canonical space that is the transformed space to obtain continuous hydrologic variables. In this space, the prediction performance of the original data is preserved and redundant information is excluded to improve the estimates. CCA also identifies projections of high correlations between the two sets of multivariate variables by providing canonical variables that are linear combinations of the variables. ANNs are then used to establish the nonlinear relationships between the canonical variables and hydrological variables to be estimated. We will provide more details about these processes in the methodology section.
The remainder of the paper is organized as follows. In Section 2, the data set that was used in the present study is described. Section 3 presents the methodologies that were used in the analysis to obtain low-flow estimates with assessments based on the proposed models. The results and discussion are given in Section 4. Finally, conclusions are summarized in Section 5.

Data Set
A data set of 33 river basins in South Korea was created to estimate the low-flow values in ungauged basins. The variables of the river basins that were used for this study are shown in Table 1. Figure 1 shows the outlet of each basin selected in the present study based on the following criteria.

1. A historical flow record of 10 years or longer is available for the analysis.
2. The gauged catchment has a flow regime with minimal human intervention.
3. The historical data pass the stationarity [24] and independence [25] tests.
To conduct RFA, several variables representing the physiographical and meteorological features were obtained for the river networks in South Korea. In this study, we selected seven variables that have generally been used in previous studies with RFA [3,18,23,26]. These variables include the drainage area (AREA), mean basin slope (MBS), annual mean precipitation (AMP), annual mean temperature (AMT), length of the main channel (MCL), slope of the main channel (MCS), and curve number (CN). A brief description with statistics for all the variables that were used in the analysis is given in Table 2.
For the hydrological variables that are related to low-flows in the present work, the specific quantiles, such as the two-year and five-year quantiles, are calculated based on the flow records from all the gauged sites in the study area. Cho et al. [12] investigated the low-flows in South Korea to select an appropriate statistical distribution and found that the Gamma distribution was the most feasible for the analysis of the low-flows. Thus, we used the Gamma distribution to estimate the two-year and five-year quantiles.
Additionally, to compare the results based on different statistical distributions, we investigated several distributions, including the generalized extreme value (GEV), two-parameter lognormal (LN2), and Weibull (W2) distributions, which are commonly applied for hydrological analysis [3,23,27,28].

Methodology
After all the variables that were used to estimate the low-flows were obtained, two methodologies, the drainage area ratio method and RFA, were applied to enhance the low-flow estimations at ungauged sites in South Korea. The drainage area ratio method uses the drainage area and low-flow quantiles. We describe this method in detail in Section 3.1. In RFA, appropriate data preprocessing steps are required before performing the analysis. In the preprocessing stage, the physiographical, meteorological, and hydrological variables were standardized. We thus obtained a standardized database with a mean of zero and a standard deviation of one, and asymmetry in the variables could then be assessed. With this database, RFA using CCA and ANNs was performed to estimate the low-flow quantiles in ungauged basins. We specifically describe the RFA application process and provide a diagram of the procedures that were used to obtain the low-flow estimates in Section 3.2. The overall processes that were applied in the present study are shown with a simple diagram in Figure 2.
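The standardization step described above can be illustrated with a minimal sketch. The text does not say whether the population or the sample standard deviation was used; the sample form (n − 1 denominator) is assumed here, and the drainage areas are hypothetical values chosen only for illustration.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]


# Hypothetical drainage areas (km^2) for five basins, for illustration only.
area = [120.0, 450.0, 980.0, 2700.0, 60.0]
z_area = standardize(area)
```

After this transformation each variable has a mean of zero and a standard deviation of one, so variables measured in different units (km², mm, °C) enter the CCA on a comparable scale.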



Drainage Area Ratio Method
The drainage area ratio approach is based on the assumption that the streamflow at an ungauged location of interest can be estimated by multiplying the streamflow at a gauged station by the ratio of the drainage area of the ungauged location to the drainage area of the gauged station. The drainage area ratio approach is commonly used to estimate low-flows at ungauged locations because of its simplicity [12,29-31]. This method is relatively effective if the streams have similar hydrological features [32]. The method that was used in the present study is given as follows:
$$Q_y = \left(\frac{A_y}{A_x}\right)^{m} Q_x \qquad (1)$$
where $Q_y$ denotes the estimated low-flows in the river basin of interest, $Q_x$ is the low-flow in the river basin with the streamflow records, $A_y$ is the basin area of the river basin of interest, $A_x$ is the basin area of the river basin with the streamflow records, and $m$ is the exponent of $A_y/A_x$. In the simplest drainage area ratio method, it is assumed that $m$ equals 1 and the equation is unbiased, indicating that the expected value of the estimated low-flows tends to equal the value of the observed low-flows.
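As a small worked illustration of the drainage area ratio relation, the sketch below scales a gauged low-flow by the area ratio raised to the exponent m. The function name, argument names, and numerical values are hypothetical and are not taken from the study data.

```python
def drainage_area_ratio(q_gauged, area_gauged, area_ungauged, m=1.0):
    """Estimate low-flow at an ungauged basin from a gauged basin.

    Implements Q_y = (A_y / A_x) ** m * Q_x, where x is the gauged
    basin and y is the ungauged basin of interest.
    """
    if area_gauged <= 0 or area_ungauged <= 0:
        raise ValueError("drainage areas must be positive")
    return (area_ungauged / area_gauged) ** m * q_gauged


# Hypothetical example: gauged basin of 500 km^2 with a low-flow of 2.0,
# ungauged basin of 250 km^2, simplest case m = 1.
q_y = drainage_area_ratio(q_gauged=2.0, area_gauged=500.0, area_ungauged=250.0)
```

With m = 1 the estimate is simply proportional to area, which is why the choice of reference basin matters so much when the gauged and ungauged basins differ greatly in size.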

Ensemble ANN for RFA
The ANNs that were used to conduct the RFA have been applied to estimate extreme events in several studies. For example, Shu and Ouarda [23] used an ensemble ANN with a CCA to improve the flood quantile estimation for extremely high flow events based on 151 catchments with ungauged sites and Ouarda and Shu [3] analyzed the low-flow quantiles of extreme events using an ensemble ANN based on more than 100 river basins in Canada. In the present study, the RFA method that was implemented to estimate the low-flows at ungauged sites in South Korea was based on the CCA and ANN methods. Using the CCA, we could construct the physiographical space as a canonical space to interpolate the hydrological variables of interest in the space, estimate the hydrological variables at ungauged sites, and create canonical variables. The canonical variables that were obtained from CCA were then fed to the ANN models to generate hydrological variable estimates in the physiographical domain. In Figure 3, a simple diagram shows the processes that were used to estimate hydrological variables such as low-flow quantiles using the CCA and ANN models.
The CCA method is a multivariate statistical method that characterizes the relationship between two sets of random variables by omitting nonessential data while preserving the original characteristics of the variables [33,34]. Given a set of physiographical and meteorological variables, X, and a set of hydrological variables, Y, CCA was used to link the two sets through vectors of canonical variables. If W and V are linear combinations of X and Y, we have
$$W = \alpha^{\top} X, \qquad V = \beta^{\top} Y,$$
where W represents the canonical physiographical and meteorological variables and V represents the canonical hydrological variables. The correlation between W and V is estimated as follows:
$$\rho = \mathrm{corr}(W, V) = \frac{\mathrm{cov}(W, V)}{\sqrt{\mathrm{var}(W)\,\mathrm{var}(V)}}.$$
In the CCA processes, we identified vectors α and β by maximizing the correlation ρ as discussed in previous studies [18,23]. After the first pair of canonical variables were obtained, other pairs of canonical variables were calculated based on the correlation subject to the constraint of the unit variance for normalization.
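To make the maximization of ρ concrete, the following sketch computes the first canonical pair for the special case of a single hydrological variable. In that case β is a trivial scaling, α is proportional to Σ_XX⁻¹ σ_XY, and ρ equals the multiple correlation coefficient. This is an illustrative simplification, not the procedure used in the study; the helper names and the data are hypothetical, and a full CCA with several hydrological variables would require an eigen-decomposition instead.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def first_canonical_pair(X, y):
    """Return (alpha, rho, W) for predictors X (rows = sites) and one target y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Sample covariance blocks: Sxx (p x p), sxy (p), syy (scalar).
    Sxx = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
           for i in range(p)]
    sxy = [sum(r[j] * v for r, v in zip(Xc, yc)) / (n - 1) for j in range(p)]
    syy = sum(v * v for v in yc) / (n - 1)
    a = solve(Sxx, sxy)                         # a = Sxx^-1 sxy
    quad = sum(ai * si for ai, si in zip(a, sxy))  # equals a' Sxx a
    rho = math.sqrt(quad / syy)
    alpha = [ai / math.sqrt(quad) for ai in a]  # scale so that var(W) = 1
    W = [sum(ai * xi for ai, xi in zip(alpha, row)) for row in Xc]
    return alpha, rho, W


# Hypothetical standardized descriptors (two per site) and one low-flow quantile.
X_demo = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y_demo = [1.1, 0.9, 3.2, 2.4, 3.9]
alpha_demo, rho_demo, W_demo = first_canonical_pair(X_demo, y_demo)
```

The returned W is the canonical variable for each site, normalized to unit variance, which mirrors the unit-variance constraint mentioned in the text.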
The CCA in the RFA was used to construct a transformed space (canonical space), determined by the physiographical and meteorological characteristics, in which continuous hydrological variables can be obtained [35]. The hydrological variables can be indirectly estimated in this space by establishing a functional relationship between the physiographical and meteorological variables and the hydrological variables. The physiographical and meteorological variables that are generally available at ungauged locations can provide information to calculate the hydrological variables at the ungauged sites. Thus, we can estimate hydrological variables by locating an ungauged site of interest in the canonical space defined by the variables. Additional detailed theoretical information about the application of CCA for RFA can be found in the study of Ouarda et al. [18], which proposed a theoretical framework for this application. In the present study, the CCA was used to estimate low-flow quantiles for ungauged locations based on the ANN-based model. Also, the drainage area ratio (DAR) method was applied to the study region and the results of the DAR were compared with the results of the ANN model to determine a better approach for low-flow estimation in South Korea.
Atmosphere 2019, 10, 695
Figure 3. Diagram of the processes that were used to obtain the low-flow estimates at ungauged locations based on regional frequency analysis.
In this study, ANN models were applied in canonical space to estimate the hydrological variables, such as low-flow quantiles, for ungauged basins in South Korea. Based on the CCA, the canonical variables W and V can be obtained as the linear combination of the set of physiographical and meteorological variables and the set of hydrological variables. After we obtained the canonical variable W, the ANN models were used to approximate the functional relationship between W and the hydrological variable Y. With these variables, multilayer perceptrons (MLPs), which are also known as multilayer feed-forward networks, were used to train the hydrological variables in the ANN procedure. The MLPs consist of an input layer, with one or more hidden layers, and an output layer that are interconnected. The MLP input layer receives values of the input variables and the hidden layers between the input and output layers play significant roles in transferring information between these layers. The transfer functions of the hidden layers affect the behavior of the ANN model. The output layer then provides an ANN prediction and represents the model output, which is the low-flow estimate in the present study.
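The layer structure described above can be sketched as a single forward pass through an MLP with one hidden layer. The hyperbolic tangent transfer function, the linear output neuron, the random weights, and the input values are all assumptions made for illustration; the text does not specify the transfer functions that were used.

```python
import math
import random


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One-hidden-layer MLP: tanh hidden neurons, linear output neuron."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(ws, inputs)) + b)
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias


# Hypothetical network: two canonical inputs, five hidden neurons.
rng = random.Random(42)
n_inputs, n_hidden = 2, 5
hidden_w = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_w = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_b = rng.uniform(-1, 1)

# Untrained prediction for one site located at W = (0.3, -1.2) in canonical space.
estimate = mlp_forward([0.3, -1.2], hidden_w, hidden_b, output_w, output_b)
```

Training then consists of adjusting the weights and biases so that the outputs match the at-site quantiles of the gauged basins.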
The ANNs should be trained in the estimation phase using the samples from the gauged locations. During the training process of ANNs, network parameters such as the number of neurons in the hidden layer and the learning rate must be optimized until the estimation error of the network is minimized and the network reaches the specified level of accuracy for the ANN model. After a network is trained and tested, new input information can be provided to produce the model output. The training algorithm that was used in this study is the Levenberg-Marquardt (LM) algorithm. This algorithm is faster than other algorithms, such as the gradient descent method, in finding optimal solutions [36][37][38]. In the LM algorithm, an appropriate value of the scalar parameter µ should be selected [39]. A large µ value forces the LM algorithm to follow the gradient descent method with a small step size, whereas a small µ value leads to the Gauss-Newton method, which is accurate near a minimum error solution.
The initial value of µ was set to 0.005, and the value of µ changed during the ANN training process until the performance of the ANN was satisfactory. In the process, when a training epoch decreases the performance function, the µ value is multiplied by 0.1, whereas when a training epoch increases the performance function, the µ value is multiplied by 10. The maximum µ value is 10^6, at which point the training algorithm stops. In the analysis of the ANN with the scalar parameter, an early stopping criterion was used to avoid overfitting (overtraining) during ANN training as described by Bishop [40].
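The µ schedule described above can be illustrated on a toy problem. The sketch below fits a single parameter a in the hypothetical model y = exp(a·x) by Levenberg-Marquardt, using the values from the text (initial µ of 0.005, factors 0.1 and 10, maximum µ of 10^6). It assumes that a step that increases the error is rejected and retried with a larger µ, which is a common convention but is not stated in the text; the function name and data are illustrative, and a real ANN would apply the same rule to a full weight vector.

```python
import math


def lm_fit_exponent(xs, ys, a0=0.0, mu=0.005, mu_dec=0.1, mu_inc=10.0,
                    mu_max=1e6, max_epochs=100):
    """Fit y = exp(a * x) with a one-parameter Levenberg-Marquardt loop."""
    def sse(a):
        return sum((y - math.exp(a * x)) ** 2 for x, y in zip(xs, ys))

    a, err = a0, sse(a0)
    for _ in range(max_epochs):
        jac = [x * math.exp(a * x) for x in xs]           # d(model)/da
        res = [y - math.exp(a * x) for x, y in zip(xs, ys)]
        jtj = sum(j * j for j in jac)
        jtr = sum(j * r for j, r in zip(jac, res))
        accepted = False
        while mu <= mu_max:
            step = jtr / (jtj + mu)   # large mu: short gradient-like step
            new_err = sse(a + step)   # small mu: Gauss-Newton step
            if new_err < err:
                a, err = a + step, new_err
                mu *= mu_dec          # error decreased: relax damping
                accepted = True
                break
            mu *= mu_inc              # error increased: damp more and retry
        if not accepted:
            break                     # mu exceeded mu_max: stop training
    return a, err, mu


# Hypothetical noise-free data generated with a = 0.7.
xs_demo = [0.0, 0.5, 1.0, 1.5, 2.0]
ys_demo = [math.exp(0.7 * x) for x in xs_demo]
a_hat, final_err, final_mu = lm_fit_exponent(xs_demo, ys_demo)
```

Early stopping on a validation set, as mentioned in the text, would add a second exit condition to the outer loop; it is omitted here to keep the sketch short.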
To improve the generalizability and stability of the ANN, an ensemble ANN model was used in the present study. The ensemble ANN model consisted of a set of ANNs that were trained for the same task and jointly produced the output of the model. The bagging method was applied to generate the component networks, and the ensemble output was obtained by averaging the outputs of the resulting networks. In the bagging method, each member ANN of the ensemble was trained with a subset of the training set and the subset was drawn from the original training set with replacement. This approach assists in enhancing the accuracy of the predictions and the model generalization ability in regression and classification problems [41][42][43]. Selecting the size of an ensemble plays an important role in obtaining satisfactory output from ANN models. The improvement in generalizability is not apparent if the ensemble size is too small, whereas training and ensemble creation become time-consuming if the size is too large. Different ensemble sizes ranging from two to 20 were considered for the study area to determine the ideal number, as demonstrated in a previous study [23]. An ensemble size of 14 was chosen in this study based on the characteristics of the ensemble and hydrological variables.
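The bagging procedure described above can be sketched as follows. To keep the example short and dependency-free, a least-squares line stands in for each member ANN; this substitution, the bootstrap sample size (equal to the training set size), the function names, and the data are all assumptions made for illustration. Only the ensemble size of 14 comes from the text.

```python
import random


def fit_line(xs, ys):
    """Least-squares line, used here as a stand-in for one member ANN."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return lambda x: my  # degenerate resample: fall back to the mean
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def bagging_ensemble(xs, ys, fit, n_members=14, seed=0):
    """Train n_members models on bootstrap resamples and average their outputs."""
    rng = random.Random(seed)
    n = len(xs)
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        members.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in members) / len(members)


# Hypothetical training data: canonical variable W versus a low-flow quantile.
w_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
q_train = [2.0 * w + 1.0 for w in w_train]
predict = bagging_ensemble(w_train, q_train, fit_line)
q_new = predict(10.0)
```

Because each member sees a slightly different resample, averaging reduces the variance of the prediction, which is the generalization benefit the text refers to.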

Evaluation Criteria
To assess the proposed methods in the present work, we used the following indices: the R-squared (R²), mean bias (BIAS), and root mean squared error (RMSE). These indices were calculated based on the following equations:
$$R^2 = 1 - \frac{RSS}{TSS}$$
$$BIAS = \frac{1}{n}\sum_{i=1}^{n}\left(q_i - \hat{q}_i\right)$$
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(q_i - \hat{q}_i\right)^2}$$
where RSS is the residual sum of squares, TSS is the total sum of squares, $n$ is the total number of sites, $q_i$ is the at-site estimate for site $i$, and $\hat{q}_i$ is the estimate that was derived from the models for site $i$.
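The three indices can be computed with a few lines of code. The sketch below assumes the common definitions R² = 1 − RSS/TSS, BIAS as the mean of (at-site estimate − model estimate), and RMSE as the square root of the mean squared difference; the sign convention for BIAS is an assumption, and the numbers are hypothetical.

```python
import math


def evaluate(at_site, estimated):
    """Return R2, BIAS, and RMSE for model estimates against at-site estimates."""
    n = len(at_site)
    mean_q = sum(at_site) / n
    rss = sum((q - qh) ** 2 for q, qh in zip(at_site, estimated))
    tss = sum((q - mean_q) ** 2 for q in at_site)
    return {
        "R2": 1.0 - rss / tss,
        # Assumed sign convention: at-site minus model estimate.
        "BIAS": sum(q - qh for q, qh in zip(at_site, estimated)) / n,
        "RMSE": math.sqrt(rss / n),
    }


# Hypothetical at-site and regional estimates for four basins.
scores = evaluate([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

Note that a BIAS near zero does not imply a good model on its own, since positive and negative errors cancel; this is why RMSE and R² are reported alongside it.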
In the evaluation procedure, we use the jackknife resampling technique to compare the relative performance of each model that was used to estimate the low-flow quantiles at ungauged locations. In this procedure, the low-flow records in each drainage basin were temporarily removed from the database to assume that the site represented an ungauged location. Each model was calibrated using the data from the remaining sites. Then, regional estimates could be obtained for the ungauged river basins based on the calibrated models that were proposed in the study, and these estimates were compared to the at-site estimates, which were also called local estimates.
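The jackknife procedure amounts to a leave-one-out loop over the gauged sites. In the sketch below, `calibrate` is a hypothetical callback that builds a model from the remaining sites; the example model (a regional mean of low-flow per unit area) and the data are placeholders for illustration and are not the models used in the study.

```python
def jackknife(sites, calibrate):
    """Leave-one-out regional estimates.

    sites: list of (features, at_site_value) pairs.
    calibrate: function taking the training sites and returning a
               predictor that maps features to an estimate.
    """
    estimates = []
    for i, (features, _) in enumerate(sites):
        training = sites[:i] + sites[i + 1:]   # treat site i as ungauged
        model = calibrate(training)
        estimates.append(model(features))
    return estimates


def calibrate_unit_discharge(training):
    """Placeholder model: regional mean of low-flow per unit drainage area."""
    ratio = sum(q / area for area, q in training) / len(training)
    return lambda area: ratio * area


# Hypothetical sites given as (drainage area, at-site low-flow quantile).
demo_sites = [(100.0, 1.0), (200.0, 2.0), (400.0, 4.0), (800.0, 8.0)]
regional = jackknife(demo_sites, calibrate_unit_discharge)
```

The resulting regional estimates can then be compared with the at-site (local) estimates using the evaluation indices.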

Analysis of the Correlation Between Variables
Two approaches, the drainage area ratio and RFA, were applied to obtain low-flow estimates at gauged sites in South Korea. In the first method, we used AREA with low-flow quantiles of a two-year return period and a five-year return period, and in the second method, we used AREA, MBS, AMP, AMT, MCL, MCS, and CN with these low-flow quantiles.
The physiographical and meteorological variables that were considered in this study were investigated to identify correlations with the hydrological variables, the two-year low-flow quantile, and the five-year low-flow quantile, as defined by statistical distributions. The scatterplots between (a) the two-year quantile and (b) the five-year quantile that were estimated from the Gamma distribution and the physiographical and climatic variables are presented in Figure 4.
The two-year low-flow quantile plot shows that the river basin descriptors AREA, MBS, MCL, CN, and AMP are positively correlated with low-flows, whereas the other river basin descriptors, namely MCS and AMT, are negatively correlated with low-flows. For the five-year low-flow quantile plot, the observations are similar, except for the MBS. The MCS exhibits a positive correlation with low-flows when divided by the basin area to offset the area effect. The Pearson correlation coefficients of the variables range from 0.063 to 0.491 for the two-year quantile and from 0.137 to 0.498 for the five-year quantile. In the RFA process, we used the canonical correlation coefficients, as described in the methodology section. Thus, we also examined the canonical correlation coefficients for the variables to determine the improvements in the correlations between the variables when CCA was applied. The canonical correlation coefficients range from 0.161 to 0.889 for the two-year quantile and from 0.124 to 0.883 for the five-year quantile, as shown in Table 3. In particular, the canonical correlation coefficients for the two-year quantile are high for AREA, MCL, AMP, and AMT compared to the corresponding Pearson correlation coefficients.
We also tested the hypothesis that there is no relationship between the physiographical and meteorological variables and the hydrological variables. Based on the test, the corresponding correlations between AREA, AMP, MCL, and MCS and low-flows were considered significant at the 95% confidence level. The corresponding correlations between MBS, AMT, and CN and low-flows were not considered significant at the 95% confidence level. However, MBS and AMT positively affected the model's performance in estimating the low-flows at ungauged sites based on RFA, potentially because AMT is highly related to AMP, and the corresponding correlation between MBS and AMT was considered significant at the 95% confidence level. The combinations of the variables that were used in the RFA processes may have also influenced the model's performance.
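The text does not state which test was used for the hypothesis of no relationship. As one plausible illustration, the sketch below applies a two-sided Student t-test to a Pearson correlation coefficient with n = 33 sites (31 degrees of freedom). The choice of test and the hard-coded critical value of approximately 2.040 are assumptions; the two coefficients checked are the extremes of the Pearson range reported for the two-year quantile.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: no linear relationship (requires |r| < 1)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Assumed two-sided 5% critical value of Student's t with 31 degrees of freedom.
T_CRIT_DF31 = 2.040


def is_significant(r, n=33, t_crit=T_CRIT_DF31):
    return abs(t_statistic(r, n)) > t_crit


strongest = is_significant(0.491)  # largest reported Pearson coefficient
weakest = is_significant(0.063)    # smallest reported Pearson coefficient
```

Under these assumptions, a coefficient needs an absolute value of roughly 0.34 or more to be significant with 33 sites, which is consistent with only some of the descriptors passing the test.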

Analysis of the Drainage Area Ratio Method
To estimate the low-flows at ungauged sites in South Korea, the drainage area ratio method has generally been used based on the river basin area. In this study, one river basin was considered a gauged site and the other river basins were considered ungauged sites. For example, if the Imokjeonggyo station was used as the gauged location, the low-flow estimates in other river basins were obtained using Equation (1). Table 4 shows the assessment of the drainage area ratio method based on the BIAS, RMSE, and R² values using the two-year quantile and the five-year quantile derived from the Gamma distribution. For the two-year quantile, BIAS values range from −6.171 to 0.787, RMSE values range from 0.802 to 9.933, and R² values range from 0.071 to 0.169. For the five-year quantile, BIAS values range from −3.110 to 0.421, RMSE values range from 0.532 to 5.039, and R² values range from 0.019 to 0.113.
Figure 5 plots the relationship between the estimated low-flow using the drainage area ratio method and the measured low-flow based on the two-year quantile of the Gamma distribution for the river basins that were used in this study. We only show the results in Figure 5 based on the two-year quantile derived from the Gamma distribution because other distributions with these quantiles displayed similar results. The subfigures in Figure 5 [...] Banglimgyo. A limitation of linear methods is that they provide biased estimates in the flood flow domain [44], and this phenomenon was observed in the present work, as shown in Figure 5. Also, Pandey and Nguyen [45] and Grover et al. [46] stated that nonlinear approaches can produce more precise estimates than linear regression methods in RFA procedures. If the area of the river basin with streamflow records is too large in the drainage area ratio method, the low-flows are underestimated (e.g., Soyanggang Dam, Andong Dam, and Imha Dam). Conversely, when the area of the basin with the streamflow records is too small, the low-flows are overestimated (e.g., Imokjeonggyo, Baekokpogyo, and Epyunggyo). Based on the BIAS and RMSE results that were obtained from the drainage area ratio approach, we chose Misung station and compared the results with the indices that were calculated by RFA to estimate the low-flow quantiles at ungauged locations in South Korea.


RFA with CCA and ANNs
In estimating the low-flow quantiles based on RFA, identifying the number of hidden neurons in the hidden layers is an important step in improving model performance. In general, if too many hidden neurons are used, overfitting can occur because there are not enough training cases in the ensemble ANN process. If too few neurons are used, underfitting can occur because the model lacks sufficient complexity to represent the functional relationship between the inputs and outputs [3,23]. For comparison with the drainage area ratio method that was used to calculate the low-flow quantiles at ungauged sites in South Korea, we performed RFA with CCA-based ANNs using the low-flow quantiles defined by the Gamma distribution. We varied the number of hidden neurons from one to 20 and used the RMSE obtained from a jackknife procedure as the performance measure to identify the optimal number. As shown in Figure 6, the ensemble ANN models for the two-year and five-year quantiles suffer from overfitting when the number of hidden neurons exceeds five. The ensemble ANN models with five hidden neurons tend to provide the most reliable estimates of low-flows at ungauged sites; therefore, we selected five hidden neurons for low-flow estimation in this study. We also compared the ensemble ANN-based models with and without the CCA to identify the impact of the CCA on the proposed model. The model without the CCA was built using AMP, which is the variable most highly correlated with the low-flow in this study. The results are presented in Table 5, which indicates that the CCA-based model (the ensemble ANN model with CCA) seems to perform better based on the RMSE, BIAS, and R² indices.
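The selection procedure described above can be sketched as a leave-one-out (jackknife) loop over candidate model settings. In the sketch below, which uses only the Python standard library, a k-nearest-neighbour average stands in for the ensemble ANN and k stands in for the tuning parameter (the number of hidden neurons in the study). The function names and the data are hypothetical, and the study's actual network training is not reproduced here.

```python
import math


def jackknife_rmse(xs, ys, fit):
    """Leave each site out once, fit on the rest, and score the held-out prediction."""
    sq_err = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = fit(train_x, train_y)
        sq_err += (predict(xs[i]) - ys[i]) ** 2
    return math.sqrt(sq_err / len(xs))


def make_knn_fit(k):
    """Stand-in model (NOT an ANN): average the targets of the k nearest training sites."""
    def fit(train_x, train_y):
        def predict(x):
            pairs = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))
            nearest = pairs[:k]
            return sum(y for _, y in nearest) / len(nearest)
        return predict
    return fit


def select_setting(xs, ys, candidates, make_fit):
    """Return the candidate with the lowest jackknife RMSE, plus all scores."""
    scores = {c: jackknife_rmse(xs, ys, make_fit(c)) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical one-dimensional predictor and response at eight sites.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]
best_k, scores = select_setting(xs, ys, candidates=[1, 2, 3, 4], make_fit=make_knn_fit)
print(best_k, {k: round(v, 3) for k, v in scores.items()})
```

With an ANN in place of the stand-in, `candidates` would be the hidden-neuron counts from one to 20 and `make_fit` would train the network for a given count; the loop itself would be unchanged.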
Using the optimal number of hidden neurons for the CCA-based ensemble ANN models, we obtained RMSE, BIAS, and R² values for the RFA results for the two-year and five-year quantiles and compared these values with the results of the drainage area ratio method. Table 6 presents the results obtained from the drainage area ratio method and the ensemble ANN model using the jackknife validation procedure. In each cell of this table, a bold font denotes the best-performing method for low-flow estimation. Table 6a indicates that RFA with the ensemble ANN model performs better than the drainage area ratio approach based on the indices for the two-year low-flow quantile. In Table 6b, RFA with the ensemble ANN model is compared with the drainage area ratio method for the five-year low-flow quantile. Based on the statistical assessments, the model performance is relatively high when RFA is used to obtain the low-flow estimates at ungauged sites. Thus, the ensemble ANN model, which is a nonlinear method, yields a performance enhancement. In particular, the ensemble ANN model reduces the bias problem, which is highly important in the design of hydrological structures.
Table 6. Comparison of the validation results based on the drainage area ratio method for the Misung station and regional frequency analysis (ensemble ANN with CCA) for the (a) two-year quantile and (b) five-year quantile derived from the Gamma distribution.
The regional estimates of low-flow using the jackknife validation procedure for the drainage area ratio method and RFA based on the ensemble ANN model are shown in Figure 7a for the two-year quantile and Figure 7b for the five-year quantile derived from the Gamma distribution. This figure shows that the ensemble ANN model typically exhibits better performance than the drainage area ratio approach at Misung station. This result indicates that the ensemble ANN, as a nonlinear model, outperforms the drainage area ratio method, as a linear model, for the two quantiles. A similar study compared ANN-based and multiple regression models for estimating low-flow quantiles in river basins in Canada [3]; in that analysis, the ANN-based model also performed better than the regression models. The ensemble ANN model provides less biased estimates and better prediction accuracy than the drainage area ratio method. For the two-year quantile, the BIAS, RMSE, and R² values are −0.042, 1.042, and 0.114, respectively, for the drainage area ratio method and 0.013, 0.511, and 0.408, respectively, for RFA. For the five-year quantile, these indices are 0.166, 0.536, and 0.044 for the drainage area ratio method and −0.018, 0.316, and 0.573 for RFA. Figure 7 shows that for the drainage area ratio approach, most of the low-flows are underestimated. However, using RFA, the bias problem is reduced and the accuracy is improved for both quantiles.
To evaluate the low-flow estimates based on the Gamma distribution, the two-year and five-year quantiles were also derived from different statistical distributions. The regional estimates of the low-flows based on GEV, L2, and W2 using the jackknife validation procedure are plotted against the local estimates in Figure 8. The figure indicates that the regional estimates of the low-flows derived from GEV, L2, and W2 are similar to the regional estimates of the low-flows using the Gamma distribution. As with the Gamma distribution results, RFA based on the three distributions exhibits better performance than the drainage area ratio method for the two-year and five-year low-flow quantiles shown in Figure 8.
The indices, including BIAS, RMSE, and R², for the Gamma, GEV, L2, and W2 distributions are shown in Table 7. The Gamma distribution exhibits the highest accuracy in estimating the low-flow quantiles based on the RMSE and R² values. The W2 distribution displays the best bias values for the two-year and five-year quantiles. However, all the distributions that were used in this study yielded good bias values, and they reduced the bias problem compared to the drainage area ratio method.
Table 7. Comparison of the validation results based on the different statistical distributions for the regional frequency analysis for the (a) two-year quantile and (b) five-year quantile.
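To make the link between return period and quantile concrete, the sketch below evaluates two-year and five-year low-flow quantiles for a two-parameter Weibull (W2) distribution, chosen here only because its quantile function has a closed form. It assumes the usual low-flow convention that the T-year event has non-exceedance probability 1/T. The shape and scale values are hypothetical and are not fitted to the study's data.

```python
import math


def non_exceedance_probability(return_period):
    """Assumed low-flow convention: the T-year low flow is not exceeded with probability 1/T."""
    return 1.0 / return_period


def weibull2_cdf(x, shape, scale):
    """CDF of the two-parameter Weibull distribution, F(x) = 1 - exp(-(x/scale)^shape)."""
    return 1.0 - math.exp(-((x / scale) ** shape))


def weibull2_quantile(p, shape, scale):
    """Inverse of the CDF above: the flow value with non-exceedance probability p."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


shape, scale = 1.5, 2.0  # hypothetical parameters, for illustration only
q2 = weibull2_quantile(non_exceedance_probability(2), shape, scale)
q5 = weibull2_quantile(non_exceedance_probability(5), shape, scale)
print(f"two-year low flow: {q2:.3f}, five-year low flow: {q5:.3f}")
```

Under this convention the five-year quantile is smaller than the two-year quantile, because rarer low-flow events correspond to lower flows.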

Conclusions
We examined the correlations between the physiographical/meteorological variables and the hydrological variables to better understand the characteristics of low-flows. Among the variables representing the physiographical and climatic features, we found that AREA, AMP, MCL, and MCS are positively correlated with low-flows based on a significance test. Additionally, MBS, AMT, and CN are not significantly correlated with low-flows, but are correlated with other variables that may influence the RFA results. In addition, when we used CCA to generate the canonical correlation coefficients of the variables, the correlations between the physiographical/meteorological variables and low-flows were improved.
The drainage area ratio method was used to estimate the low-flows at ungauged sites. With this method, one river basin was considered a gauged basin and the other basins were considered ungauged basins. Using the basin area ratio, the method was assessed based on the BIAS, RMSE, and R². The average values of BIAS, RMSE, and R² were −0.703, 2.097, and 0.121, respectively, for the two-year quantile and −0.297, 1.111, and 0.050, respectively, for the five-year quantile. To compare the results of the drainage area ratio method and RFA, several basins that exhibited good performance were selected. The ranges of the BIAS, RMSE, and R² values of the Misung, Tanbugyo, Chungmi, and Hwachon basins were −0.042 to 0.095, 0.924 to 1.042, and 0.109 to 0.120, respectively, for the two-year quantile and −0.011 to 0.166, 0.536 to 0.631, and 0.042 to 0.049, respectively, for the five-year quantile. Additionally, if the selected basin was too small or too large, the estimated quantiles were biased. Based on the results of this study, the low-flows seem to be overestimated when the basin area is smaller than 150 km² and underestimated when the basin area is larger than 1000 km².
For comparison with the drainage area ratio approach, RFA using CCA-based ANNs was applied to the 33 river basins. In this assessment, we used jackknife validation with statistical indices, such as BIAS, RMSE, and R². The indices based on the RFA were 0.013, 0.511, and 0.408 for the two-year quantile and −0.018, 0.316, and 0.573 for the five-year quantile. Based on these indices, the RFA method proposed in this paper performs better than the drainage area ratio method for the 33 river basins. We conclude that the ensemble ANN method, as a nonlinear model, seems to outperform the drainage area ratio approach, a linear model, in obtaining low-flow estimates at ungauged sites in South Korea. Although the ensemble ANN did not show such improvements, we found that the nonlinear model has the potential to enhance low-flow estimations in the study region.
Other statistical distributions, such as GEV, LN2, and W2, were used to obtain the two-year and five-year low-flow quantiles and to assess the low-flow estimates at ungauged basins. When these distributions are used in RFA and the drainage area ratio method, RFA based on CCA and ANNs outperforms the drainage area ratio approach, and the Gamma distribution provides the best results. The BIAS, RMSE, and R² values were also used to assess the models based on these distributions. W2 exhibits the best performance for BIAS, and the Gamma distribution displays the best performance for RMSE and R². In this paper, we found that the machine learning-based nonlinear model provides relatively reliable estimates of low-flow quantiles for ungauged basins in South Korea compared to the estimates of the linear model. The results point to the use of the machine learning model to enhance estimates of low-flow quantiles in areas characterized by nonlinearity. Additionally, the explicit correlations between the quantiles and the sets of physiographical and meteorological covariates can be determined to improve the quality of regional quantile estimates in ungauged basins.

Conflicts of Interest:
The authors declare no conflicts of interest.