Evaluation of Nitrate Load Estimations Using Neural Networks and Canonical Correlation Analysis with K-Fold Cross-Validation

: The present work aimed to examine the feasibility of using artificial neural network (ANN) based models to obtain accurate estimates of nitrate loads in river basins, which is an important parameter for water quality management. Both Single ANN (SANN) and Ensemble ANN (EANN) models were used to obtain the load estimations for five river basins in the Midwest United States. These basins included the Cuyahoga, Raisin, Sandusky, Muskingum, and Vermilion basins in Michigan and Ohio. Further, canonical correlation analysis (CCA) was applied to the ANN models to improve the performance. The k-fold cross-validation method was then utilized to evaluate the proposed models based on two statistical indices, namely, the rRMSE and rBAIS , and the estimates were compared for four different k values (k = 3, 5, 7, and 10). According to the results, the EANN model seemed to produce better load estimations than the SANN model, and the CCA based EANN model tended to produce the best estimates among all of the proposed models in this study. The box plot data for the rRMSE index were also investigated, and the plot results indicated that increasing values of k tended to generate better estimates. Thus, the use of k = 10 is recommended for load estimations since this value was associated with better performances and less biased estimates.


Introduction
Nutrient enrichment is a growing problem in rivers and streams, where excessive nutrients can cause degradation in water quality. Nutrients represent one of the most problematic water quality constituents in rivers in the Midwestern United States [1][2][3]. In particular, nutrient levels in streams and rivers in the state of Ohio are a critical issue and have been monitored for many years to obtain accurate nitrate load estimations. For designing suitable conservation measures or reduction strategies, it is necessary to accurately calculate the nutrient loads on a monthly, seasonal, and yearly basis at each monitoring station. However, evaluation results for nutrient loads typically contain many potential sources of uncertainties including those related to the models used and the data sets.
Various type of methods have been developed to resolve the data gaps encountered when estimating nutrient loads [2,4]. One well-known approach for estimating water quality constituents is a regression based method, in which water data are correlated with the constituents of concern such as the nutrient load. The United States Geological Survey (USGS) has developed several methods to calculate various water quality constituents. The most well-known USGS methods are based on multiple regression techniques that relate observed concentrations with the daily discharge, time, and season. In obtaining water quality constituent data, the estimation of daily nutrient concentrations is often a critical issue.
Annual, seasonal, and monthly load estimations are important because these loads are the summation of the daily load multiplied by the daily discharge and daily nutrient concentration [4][5][6][7]. Nutrient concentrations are often not monitored every day and over long periods of time. Hence, many researchers have suggested different approaches for estimating the missing nutrient concentrations. The most representative methods for load estimations are based on regression analyses between the streamflow and nutrient concentrations. Cohn et al. [8] and Cohn [4] suggested the use of a regression model with seven parameters to estimate daily concentrations. This model estimates the logarithm-transformed concentrations through use of a second-order polynomial regression equation with data on logarithm-transformed daily flows and decimals of time. The algorithm of the regression method can be obtained by downloading the LOADEST or FLUXMASTER load-estimation software package from USGS. Hirsch et al. [9] also developed another load estimation method, which is referred to as the Weighted Regressions on Time Discharge and Season (WRTDS). This method estimates the logarithm of daily concentrations by using the sine and cosine transformations of decimal time, and the logarithm of daily discharge. The method is implemented with five or seven parameter equations. The main input parameters in WRTDS are decimal time and streamflow discharge. One of the important processes in WRTDS is the estimation of weights for each day in the sample depending on the differences in the values of the variables between the prediction and sample day [10]. In a more recent study, the algorithm of WRTDS was applied to Exploration and Graphics for RivEr Trend (EGRET) to enhance load estimations [11].
The prediction of nitrate concentrations in streamflow by using artificial intelligence algorithms has been studied. Markus et al. [12] and Markus et al. [13] used artificial neural network (ANN) based models to forecast weekly nitrate concentrations. They also compared those ANN model results with results from evolutionary polynomial regressions and naïve Bayes models for several watersheds in Illinois, USA. The authors demonstrated that the most outstanding models differed depending on the error evaluation method used, and they proposed a multi-tool approach for analysis. Besides, many other studies have applied ANN models to predict the monthly biological oxygen demand (BOD) [14,15], monthly total nitrogen content, total phosphorus content, and dissolved oxygen level [16,17] in various types of rivers located in different countries.
Several research projects have been conducted to estimate various hydrological variables based on an Ensemble ANN (EANN) approach. These studies used hydrological variables that have been spatially and temporally monitored for long periods of time, and the results seem to suggest that an EANN approach is appropriate for application in artificial intelligence algorithms. For example, the EANN approach has been applied to simulations and forecasts of the rainfall-runoff process [18], flood frequency [19,20], peak discharge [21], and monthly potential evapotranspiration [22]. These studies proposed that the EANN was more effective than a Single ANN (SANN) or other existing physical approaches. Moreover, cross-validation techniques have been widely used for different hydrologic variables to assess the estimates obtained from hydrological models [23][24][25][26].
Recently, the EANN approach has been applied for forecasting and simulations of water quality constituents to improve the estimation modeling. Kan et al. [27] used an EANN based on a hybrid function approximator, named the PEK model, to simulate runoff in three different catchments in China. They investigated the results for the runoff hydrograph and peak flow derived with the EANN by comparing the model performance with the performances of two physical runoff models, namely, the Xinanjiang model and the IHACRES (identification of unit hydrographs and component flows from rainfall, evaporation, and streamflow) model. The authors demonstrated that the performances of the EANN model for runoff and peak flow results were better than those of the other two physical watershed models tested. In addition, Huang and Gao [28] applied an EANN to simulate chlorophyll with other seven water quality parameters. They reported that the ensemble simulations were affected by the ensemble size and that determination of the appropriate ensemble size was significant for ensemble simulations. However, little research has been carried out to investigate EANN applications with multivariate statistics for enhancing load estimations and to evaluate the load estimation performances based on both SANN and EANN approaches.
In the present study, we aimed to identify a better estimation model for obtaining load estimations, which can be used for nutrient concentration simulations in Midwest streams or rivers. The SANN and EANN models were applied to determine a proper model by investigating daily load estimations and by comparing the performances. Further, multivariate statistics, namely, canonical correlation analysis (CCA) results, were used to improve the load estimations by establishing a correlation structure between the data sets of two variables that were strongly related to nutrient concentrations. For the model validation, this study utilized the k-fold cross-validation technique, which involved identifying the appropriate value of folds and evaluating the model performance.

Data Sets
For the analysis of load estimations, we focused on Midwest river basins in the USA and used data from five stations located in the Cuyahoga, Raisin, Sandusky, Muskingum, and Vermilion basins. The five stations cover river basins characterized by various areas ranging from 697 km 2 to 19,208 km 2 . The average for the nitrate concentration ranged from 1.389 to 3.957 mg/L, and the average for the discharge ranged from 491.013 to 8642.170 m 3 /s. The monitoring durations were 35, 27, 35, 22, and 7 years for the Cuyahoga, Raisin, Sandusky, Muskingum, and Vermilion basins, respectively. The land in the river basins is basically agricultural, urban, and wooded land. The agricultural areas represent a large portion of the overall land use, and agricultural activities here have led to high nitrate concentrations downstream in Lake Erie and the Mississippi River basin. Around the Cuyahoga, Raisin, Sandusky, Muskingum, and Vermilion stations, the proportion of land used for agriculture amounts to 17%, 72%, 83%, 71%, and 52%, respectively. The drainage basins also contain many wetlands, lakes, and floodplain forests where ecological resources are plentiful. Figure  1 shows the five river basins studied in the present study. To conduct the analysis of load estimations, we used daily discharge data, nitrate concentrations, and the day of year at the five stations. The daily discharge was denoted as Q, and the daily nitrite plus nitrate (NO2-N+NO3-N) data were represented as nitrate (NO3) in this study.  Figure 2 shows Q and nitrate concentration for the five stations. (a) (b) Figure 2. Plots of (a) daily discharge (m 3 /s) and (b) daily nitrate concentrations (mg/L) for the five river basins.

ANN Models for Load Estimations
In the present study, we constructed several ANN models to evaluate the optimal ANN model for estimating daily loads at the sites of interest. Variables for discharge, nitrate concentration, and time were used to obtain the load estimations. An ANN is an information processing model designed for reproducing certain structures and for identifying the interconnections among a group of nodes. Such a model can be utilized to solve complex problems related to classification, pattern recognition, and estimation as a nonlinear mathematical model. With an ANN model, multilayer perceptrons (MLPs) have been commonly used to provide the ANN predictions, which represent the model output [29]. The MLPs consist of three types of layers including input, hidden, and output layers, and they can be characterized as a feed forward supervised model. During this process, the input layer receives information consisting of the input variables, and then, the hidden layer connects the input layer with the output layer based on weighted acyclic arcs. The number of hidden neurons of the hidden layer plays a crucial role in predictions of the model output with the ANN model. The model can be overfitted when many hidden neurons are applied due to the use of an insufficient training sample, while the model can underfitted when few neurons are utilized due to the difficulty of determining functional relationships among the variables. Here, five hidden neurons were selected for the hidden layer to build the ANN model by considering the model performance. This number was also used in previous studies to estimate hydrological variables [20,23].
For the training algorithm in the ANN process, the Levenberg-Marquardt (LM) algorithm was applied for the determination of optimal solutions to decrease errors of the model. This algorithm is relatively faster and more accurate than other algorithms including the gradient descent algorithm. The scalar parameter of the LM algorithm was selected based on the analysis of Demuth et al. [30]. If the value of the scalar parameter is large, the algorithm follows the features of the gradient descent method, whereas if the value of the parameter is small, it follows the properties of the Gauss-Newton method. Here, the error of a specific configuration was identified and compared with the target output by running the training samples. Then, early stopping criteria were applied in this study to find the optimal network parameters that could minimize the estimation error.

Ensemble ANN Models
Once we set up the SANN model for load estimations, the ensemble technique was used for creating the EANN model, which was applied to improve the ability of generalization and stability of model performance. Based on the purposes of this study, we compared the results obtained from the SANN model and the EANN model. The EANN model was based on a number of ANNs that were trained and generated by individual networks [31,32]. As a result, the number of ANN models is 14 in this study. In the EANN processes, the bagging approach was used to produce unique predictions of the model [33]. With the bagging method, a number of ANNs were trained based on a subset of the training set by solving a given problem. Then, the results generated by the individual networks were combined to produce the unique output, which is the output of the ensemble.
The size of an ensemble plays a significant role in the design of the EANN model as it defines the amount of information and determines the degree of homogeneity within the training subsets. If the size for the ensemble is relatively large, the time of training will be increased because of the many sub-models. This affects the amount of information assigned in each ensemble, and it decreases the model performance. If the size for the ensemble is relatively small, the ability for generalization and stability will be not improved. Different ensemble sizes were examined for this study, and an ensemble size of 14 was selected. The estimation error was gradually reduced by the size of 11, while subsequent increases in the size seemed to result in very little change in the error. Notably, Shu and Ouarda [20] investigated the size of the ensemble and chose a size of 14 for the EANN model with the bagging method.

Integration of ANN Models and CCA
The CCA technique is basically used to establish a linear relationship between two groups of random variables. This multivariate approach provides a general theoretical framework for factorial discriminant analysis and multivariate regression. Previously, CCA has been applied for estimations of hydrological variables and recommended for CCA based ANN models to enhance the generalization and performance [20,23]. If X and Y are two random variables, CCA computes two sets for basis vectors that are canonical variables. Given that W and V are a linear combination of X and Y, respectively, we have where ′ denotes the transpose of the vector and ′ denotes the transpose of the vector . The correlation between W and V is calculated as Based on the above equation, the vectors of and are determined by maximizing the correlation, . If the first pair of canonical variables is calculated, other pairs of the canonical variables are estimated based on the correlation subject to the constraint of unit variance for normalization. Note that X implies a set of discharge and time variables and Y implies a set of load variables in this study. A time variable was used in this study because the water quality and discharge can be influenced by seasonal trends. The CCA constructs a transformed space called a canonical space, and then, a calibration with data is conducted by establishing the functional relationship between the two sets of variables in the space. Detailed information on the process of CCA is available in the literature [34].
Once the discharge and time variables were projected for the use of the ANN model in the canonical space, the projected variables were fed to the model for estimations of load variables. The ANN model approximates the functional relationships among the canonical variables as the input variables and load variables as the output variables. Through the MLP process, the output layers generate the ANN predictions. The present study used an integration of the SANN and CCA (SANN-CCA) and an integration of the EANN and CCA (EANN-CCA) to achieve improvements in the load estimations. Figure 3 presents a diagram of the processes to estimate loads using the ANN based models.

Evaluation Approaches and Criteria
To assess the performance of models, cross-validation techniques have been commonly used as resampling methods [19,35,36]. This study used the k-fold cross-validation method to evaluate the relative performance of several ANN models during load estimations.
In the k-fold cross-validation procedure, the original sample is randomly grouped into k subsamples based on the same size. The subsamples are classified for a testing member and training members. The testing set as the validation data set represents an unknown data set, and the training sets represent known data sets. Then, the model conducts the analysis on the training sets and validates the analysis on the testing set. The cross-validation process is repeated k times to obtain a single estimate of the model output from the average value of the results for the different sets.
The ANN models can be assessed on the basis of two measures, namely, the relative root mean squared error (rRMSE) and the relative mean bias (rBIAS), which were used for flood quantile estimations [20]. The two measures can be computed as follows: where n is the total number of data points used for the analysis, is the measured data for day i, and indicates the load estimation derived from the ANN models for day i. The rRMSE ranges from zero to large positive numbers. The rBIAS ranges from large negative numbers to large positive numbers. The optimal value of both rRMSE and rBIAS is zero.

Single ANN and Ensemble ANN
In the analysis, the SANN and EANN models were structured for load estimations, which could be used to investigate nutrient concentrations and manage water quality. The k-fold cross-validation procedure was applied to the study areas, and 3-fold, 5-fold, 7-fold, and 10-fold cross-validation techniques were examined during the estimations of loads for five river basins. Rodriguez et al. [37] used various values of the folds including 2, 5, and 10 for the identification of optimal values of the folds. We also selected fold values less than 10 for the analysis of load estimations in this study. A correlation analysis for discharge and nitrate concentrations and for discharge and loads was conducted as shown in Figures 4 and 5. Figure 4a shows that correlation coefficients between the daily discharge and daily nitrate concentration ranged from −0.584 to 0.519. Figure 4b shows that correlation coefficients between the annual discharge and annual nitrate concentration ranged from −0.805 to 0.283. Additionally, Figure 5 presents that correlation coefficients between the daily discharge and daily load and between the annual discharge and annual load ranged from 0.777 to 0.919 and from 0.675 to 0.918, respectively.   With both the SANN and EANN models, we obtained load estimations for different crossvalidations, and Table 2 presents the corresponding rRMSE and rBIAS indices for the 3-fold, 5-fold, 7-fold, and 10-fold cross-validations. The rRMSE index basically provides an assessment of accuracy for the ANN predictions, and the rBIAS index generally provides an indication of whether the proposed model seems to overestimate or underestimate. The results in Table 2 show that the rRMSE and rBIAS indices tended to improve for the five river basins when the number of folds in the crossvalidation increased. The analysis using the rBIAS index shows that the models produce underestimated loads. Further, Figure 6 presents the rRMSE index derived from the SANN and EANN models with the different cross-validations. This figure indicates that the EANN model seemed to have a better performance than the SANN model according to the rRMSE criterion. The 10-fold cross-validation for each basin provided enhanced performances among the four different kfold cross-validations tested.  Furthermore, we identified how the number of folds affects the model performance for load estimations based on the rRMSE of the SANN and EANN models. Figure 7 shows the results for the four types of fold cross-validation in the five river basins. In the figure, box plots are presented for each model. The center line in the box plot represents the median value of the load estimation, and the top and bottom of the plot show the 75th and 25th percentiles of the rRMSE of the estimation, respectively. The left box plots with the blue color indicate the results of the SANN model, while the right box plots with the red color indicate the results of the EANN model. The increasing number of k-folds showed a decreasing trend in both models. However, the model with the 10-fold cross validation for the five river basins presented a slightly decreasing tend or no noticeable trend compared to the model with the 7-fold cross validation. This phenomenon was observed in both the SANN and EANN models. To assess the sensitivity of the predictions based on k-fold crossvalidation, Rodriguez et al. [37] conducted an experimental study in which they changed the training set with various values of k. In their analysis, the use of the low value of k = 2 seemed to produce the most biased result, and the use of k = 5 or 10 was recommended to obtain a less biased result on the basis of the experimental results. We also observed that a k value of 5 seemed to produce a better estimation than a k value of 7 in the Cuyahoga and Sandusky basins. Based on the load estimations examined in this study, we propose the application of a k value of 10, which tends to provide a less biased error estimator for the loads.

Single ANN-CCA and Ensemble ANN-CCA
In order to identify whether the variable of time affects the ANN predictions, the CCA based SANN and EANN model results were examined by changing the number of folds in the crossvalidation. In conducting the CCA, the variables of discharge and time were used to estimate loads for the five river basins. The combination with the variables can help the performance of ANNs. Table  3 shows the rRMSE and rBIAS indices obtained from the SANN-CCA and EANN-CCA models for assessment of the model performance. This table also includes the results for 3-fold, 5-fold, 7-fold, and 10-fold cross validations. The model performance based on the two indices seemed to be enhanced when the number of folds increased. Overall, the CCA based models outperformed the SANN and EANN models. The EANN-CCA model showed a better performance according to the rRMSE and rBIAS indices than the SANN-CCA model for the 10-fold cross-validation. These results indicate that the use of the ANN models based on the combination of the variables in the CCA space can improve the performance relative to the ANN models with one variable. Figure 8 presents the results derived by using different folds of cross-validations for the SANN-CCA and EANN-CCA models in the five river basins. From this figure, we can observe that the performance of ANNs seemed to be improved for all sites except for the Sandusky basin.  The model performance based on the rRMSE index for load estimations was also analyzed with box plots for various folds in the cross-validation. Figure 9 shows the box plots according to the rRMSE index of the five river basins. As in Figure 7, the left box plots with the blue color represent the results of the SANN-CCA model, whereas the right box plots with the red color represent the results of the EANN-CCA model in Figure 9. This figure shows that there was a decreasing trend in the index for the ANN models when the number of folds increased. The results indicated that the variables that were significantly related to nutrient concentrations could improve the ANN predictions, and the model with the 10-fold cross-validation seemed to provide a better performance. This examination shows that the proposed model can produce good estimates by adding in the important variables correlated with water quality, and it can be used for improved load estimation applications.

Conclusions
A methodology based on ANN models to achieve nitrate load estimations was examined for river basin nutrient concentration assessments in this study. The SANN and EANN models were built for five river basins in the Midwest US, and these basins included the Cuyahoga, Raisin, Sandusky, Muskingum, and Vermilion in the states of Michigan and Ohio. To improve the model performance, CCA was also used with a combination of variables such as the discharge and time for load estimations. The proposed models were assessed by using 3-fold, 5-fold, 7-fold, and 10-fold cross validations. Two statistical indices, namely, the rRMSE and rBIAS were applied for the validation of the proposed models.
Application of the ANN based models in the study region showed that better performances can be obtained when the number of folds in the cross-validation is high. The best rRMSE and rBIAS indices seemed to be produced with the 10-fold validation for the five river basins and for the three river basins, respectively. Moreover, the EANN model tended to produce better estimations for loads than the SANN model. The box plots for the rRMSE index based on the two models were also analyzed, and the plots indicated that an increasing number of folds seemed to provide a better performance. However, the station data for the Raisin and Vermilion basins only tended to show improvements up to the 7-fold validation and results were nearly constant at the 10-fold validation. The station data for Cuyahoga and Sandusky basins seemed to show that the results of the 5-fold validation were better than the results of the 7-fold validation in the ensemble model. Overall, the use of a k value of 10 can be recommended to estimate loads when other basins in a different region are investigated for load estimations. This is because the k value of 10 steadily provides good estimations derived from the single and ensemble models proposed in the present study within the study regions.
Moreover, the CCA based ANN models were proposed for the achievement of better estimations. The SANN-CCA and EANN-CCA models were applied to obtain load estimations, and the corresponding statistical indices were compared for different folds of cross-validation. Compared to the SANN and EANN models, the CCA based ANN models led to a better performance in the measures including the rRMSE and rBIAS. The 10-fold cross-validation tended to provide a better estimation than the other fold cross-validations for the five river basins. The EANN-CCA model improved the performance in terms of both the rRMSE and rBIAS indices compared to the SANN-CCA model. The box plots for the rRMSE index were also examined, and the index seemed to show a decreasing trend when the number of folds increased for the studied regions.
The ANN based models were analyzed in this study for load estimations to determine a better estimation model. Ultimately, the CCA based ANN models with significant variables related to nitrate loads were proposed to improve the model performance. Based on this work, the models can be applied for load estimations, especially to deal with missing monitoring data of interest when investigating nitrate loads in a river basin. Additionally, other common methods such as EGRET can be applied to achieve a better load estimation technique.