Prediction of Surface Water Quality by Artificial Neural Network Model Using Probabilistic Weather Forecasting

Abstract: We developed an artificial neural network (ANN)-based water quality prediction model and evaluated the applicability of the model using regional probability forecasts provided by the Korea Meteorological Administration as the input data of the model. The ANN-based water quality prediction model was constructed from the actual meteorological observation data and the water quality factors classified using an exploratory factor analysis (EFA) for each unit watershed in the Nam River. To apply spatial refinement of meteorological factors for each unit watershed, we used the data of the Sancheong meteorological station for Namgang A and B, and the data of the Jinju meteorological station for Namgang C, D, and E. The predicted water quality variables were dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), total organic carbon (TOC), total phosphorus (T-P), and suspended solids (SS). The ANN evaluation results reveal that the Namgang E unit watershed has a higher model accuracy than the other unit watersheds. Furthermore, compared with Namgang C and D, the water quality of Namgang E is more strongly correlated with meteorological effects. The results of this study will help establish a water quality forecasting system based on probabilistic weather forecasting in the long term.


Introduction
Water supply demands are increasing with environmental changes in river watersheds and developments due to urbanization. As a result, the effective environmental management of watersheds has become a necessity. Lee et al. [1] and Freeman et al. [2] reported that water pollution due to rainfall runoff resulting from land use changes by urbanization is serious and methods to evaluate these environmental effects are required. As a result of these environmental changes, water quality prediction for maintaining and managing rivers is directly related to ecology and the environment, and improvement directions and analysis of long-term water quality such as maintenance of the water supply are imperative.
Rainfall is a basic element required to maintain water resources. Rainfall causes runoff in watersheds, which directly affects the environmental changes in water quality. The runoff that flows into the watershed affects the water quality as well. Furthermore, river surface water is highly sensitive to climate change because it is exposed to sunlight and is directly affected by temperature. Because these water quality factors have nonlinear relationships with meteorological factors such as rainwater and temperature, it is difficult to define the correlations between them. The weather and water quality variations in a watershed have large spatiotemporal variability. In particular, water quality data are generated by very complex physical, chemical, and biological reaction mechanisms of the ecosystem, and have strong nonlinear characteristics. Therefore, various water quality models are being applied for the prediction, analytical study, and management of water quality. Wellen et al. [3] evaluated the latest status of watershed models based on a spatially distributed process and reported that 257 papers on watershed models had been published between 1992 and 2010. Ji [4] explained that great developments have been made in mathematical modeling for numerical simulation of water quality and that modeling is a powerful decision-making tool. However, it takes a considerable amount of time and effort to develop a water quality prediction model that considers the complex environments of watersheds, including artificial factors in natural rivers and the physical characteristics of water quality factors. For this reason, active research on prediction models has been conducted using the data-based ANN model as well as a physics-based model to predict water quality variations.
Wu et al. [5] reported that the applications of ANNs have become popular since the early 1990s in the environment and water resource modeling fields. Kim et al. [6] explained that an ANN is a powerful data-based model that can consider and express the linear and nonlinear relationships between input and output data. Furthermore, ANNs have been widely used for predicting water quality variables and for handling the uncertainty of pollutants and the nonlinearity of water quality data. They developed an ANN ensemble model to predict the water quality at the Sangdong point in the Nakdong River, South Korea. Palani [7] proposed a method of applying an ANN modeling technique for dynamic prediction of seawater quality and explained that the ANN model exhibited enormous potential as a prediction tool for seawater quality variables with low cost and acceptable accuracy by optimizing the water-quality monitoring network. Patki et al. [8] revealed that the ANN model outperforms the multiple regression technique for the prediction of water quality in the distribution system and that it is a robust tool for understanding the poorly defined relations between water quality variables and the Water Quality Index (WQI) in a municipal distribution system. Even though many studies have been conducted on water quality prediction models, no study, to the best of the authors' knowledge, has considered the meteorological factors that have significant effects on water quality together with water quality factors. Chang et al. [9] proposed a promising approach for reliable modeling of spatial NH3-N concentrations based only on hydrologic data but did not consider water quality variation characteristics through the reaction mechanism in water bodies. Water quality is sensitive to runoff owing to rainfall flowing into the watershed as well as to changes in the water environment. Dunn et al. [10] stated that rainfall runoff increases the concentration of heavy metals in water bodies and can occur on both pervious and impervious surfaces in urban areas. Jeong et al. [11] analyzed the correlations between phytoplankton biomass (chlorophyll a concentration) and rainfall and explained that dam operation management must be performed effectively according to the rainfall received for water quality management. Therefore, meteorological factors as well as water quality factors should be considered when developing a water quality prediction model.
In this study, we developed an ANN-based water quality prediction model that considers meteorological factors that affect water quality as well as various water quality factors. Kim et al. [12] analyzed water quality variation characteristics using exploratory factor analysis (EFA) and proposed a systematic evaluation method. The present study also developed a prediction model for water quality factors with high prediction accuracy using EFA and by considering the water quality variation characteristics. The developed model was verified by comparing its predictions with actual measurements. In addition, the applicability of the ANN-based water quality prediction model was evaluated using the probability forecasts of temperature and precipitation from 2014 provided by the Korea Meteorological Administration as input data. This study attempted to provide the foundation for a river water quality forecast system that considers the meteorological factors of water quality according to weather forecasts.

Study Area and Data Description
The Nam River is the first tributary of the Nakdong River in South Korea. The Nam River watershed consists of five unit watersheds, whose water quality is managed through the total maximum daily load (TMDL). The unit watersheds upstream of the Namgang Dam are Namgang A, Namgang B, and Namgang C. They are characterized by a high proportion of mountainous areas and a steep river slope. The unit watersheds downstream of the Namgang Dam are Namgang D and Namgang E. Here, non-point pollution sources are scattered across the surrounding small and medium-sized cities and industrial areas, and the slope of the riverbed becomes very gradual downstream. As shown in Figure 1, there is a water quality monitoring station sampled at 8-day intervals at the end of each unit watershed, and there are two meteorological stations in the Namgang Watershed: the Sancheong meteorological station and the Jinju meteorological station. Tables 1 and 2 list the data collection variables at each monitoring point and the collection period.
The Korea Meteorological Administration provides two types of weather forecasts: probabilistic long-term forecasts and quantitative weather forecasts. Long-term forecasts cover a period of 11 days or more and include weekly and monthly barometric trends and outlooks, and temperature and precipitation forecasts. The forecast area comprises 12 regions of the Korean Peninsula. In this study, we used the probability forecasts of the Busan, Ulsan, and Gyeongsangnam-do regions as the input variables of the ANN model. For the probability forecast, the simulation results of a climate model for various conditions are statistically analyzed, and the precipitation during the forecast period is classified as lower than, similar to, or higher than that of an average year, with a probability issued for each class. Probability forecasts thus express the uncertainty about the future as quantitative probability values. They have the advantage of allowing for various decisions regarding the establishment of response policies for abnormal weather and long-term plans for water resources. With the rising frequency and intensity of unprecedented extreme events (floods and droughts) due to climate change, higher accuracy and practicality of long-term forecasts are required. In this situation, probability forecasts, unlike the deterministic forecasts of the past, will allow for more flexible responses.
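As an illustration only, one simple way to turn a three-category probability forecast into a single numeric model input is a probability-weighted average of a representative value for each category. The source does not state how the forecast probabilities were converted into ANN inputs, so the function name `expected_from_terciles` and the representative precipitation values used here are hypothetical.

```python
def expected_from_terciles(probs, representatives):
    """Probability-weighted average over (low, similar, high) categories.

    probs           -- forecast probabilities for the three categories
    representatives -- one representative value per category
                       (hypothetical, e.g. taken from the climatological record)
    """
    if len(probs) != 3 or len(representatives) != 3:
        raise ValueError("expected exactly three categories")
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must be non-negative and sum to 1")
    return sum(p * v for p, v in zip(probs, representatives))


# Made-up example: 20 % low, 30 % similar, 50 % high, with representative
# monthly precipitation values of 50, 100, and 180 mm.
expected_precip = expected_from_terciles((0.2, 0.3, 0.5), (50.0, 100.0, 180.0))
```

With these made-up numbers the weighted average is about 130 mm. Other choices, such as sampling from the categories to propagate the forecast uncertainty, would be equally defensible.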

Exploratory Factor Analysis (EFA)
EFA is a technique that uses the covariances and correlations among many variables to identify the correlation structure between items and variables and to group the information in many variables into a small number of factors. EFA condenses the information about many variables into a few key intrinsic factors, making the information easier to understand and easier to use in additional analyses. However, interpretation can become difficult if the extracted factors are affected by randomness. Hence, attention should be paid to the validity and reliability tests of the condensed analysis results. Figure 2 shows a flowchart of EFA, in which the number of factors and the common factors are decided through the eigenvalues and eigenvectors derived from the factor matrix. Based on the EFA results, the input variables that have a significant effect on the variability of the prediction factors in the ANN model were identified. The variables classified as the same factor exhibited the same variation trend. The eigenvalue is the total variance of the variables that can be explained by each factor and is calculated by summing the squares of the factor loadings of every variable on that factor. In other words, it indicates how much of the information contained in the variables can be expressed by a factor. The eigenvalue of a previously extracted factor is always larger than the eigenvalue of the factor that is extracted next. In this study, the input data of the model were constructed using factors whose eigenvalues were larger than 1. The cumulative value is the cumulative proportion of variance accounted for by the extracted factors and indicates their explanatory power. Factors are added one by one, and the addition is terminated when the cumulative variance ratio reaches a sufficiently high value. That is, if there are N index variables, the cumulative variance ratio after the Nth factor is 1.0.
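The eigenvalue rule and the cumulative variance ratio described above can be sketched in a few lines. This is a minimal illustration that assumes principal-component extraction from the correlation matrix, which is one common extraction step in EFA; the source does not specify the extraction or rotation method, and the function name `factor_summary` and the synthetic data are hypothetical.

```python
import numpy as np


def factor_summary(data):
    """Eigenvalues, unrotated loadings, cumulative variance ratio, and the
    number of factors with eigenvalue > 1 for data of shape (samples, variables)."""
    corr = np.corrcoef(data, rowvar=False)        # variables are columns
    eigvals, eigvecs = np.linalg.eigh(corr)       # returned in ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loading of variable i on factor j; each column's squared sum is eigvals[j].
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    cum_ratio = np.cumsum(eigvals) / eigvals.sum()
    n_keep = int(np.sum(eigvals > 1.0))           # eigenvalue-greater-than-1 rule
    return eigvals, loadings, cum_ratio, n_keep


# Synthetic example: six variables driven by two independent latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
noise = 0.3 * rng.normal(size=(200, 6))
demo = np.repeat(latent, 3, axis=1) + noise       # columns 0-2 and 3-5 share a factor
eigvals, loadings, cum_ratio, n_keep = factor_summary(demo)
```

In this synthetic case two eigenvalues exceed 1, the squared loadings in each column sum to the corresponding eigenvalue, and the last entry of `cum_ratio` is 1.0, matching the properties stated in the text.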
Water quality is affected by the characteristics of the watershed; therefore, the characteristics of each unit watershed that affect water quality need to be examined. Customized water quality prediction for watershed management requires first determining which factors cause variations in water quality characteristics. Accordingly, this study aimed to analyze the water quality characteristics of each unit watershed through EFA and to build input data that improve the prediction accuracy of the water quality prediction model. EFA was performed for each location using the water quality variables, flow variables, and meteorological variables of the five water quality stations located in the Namgang unit watersheds.

Artificial Neural Network (ANN)
An ANN is a parallel information processing system developed to generalize the perception process of neurons, the basic units of the human brain, into a mathematical model as a statistical refinement technique. ANNs can be broadly categorized by their hierarchical structure into single-layer neural networks, which have only input and output layers, and multilayer neural networks, which have an input layer, one or more intermediate (hidden) layers, and an output layer. Multilayer neural networks are used most often. Figure 3 shows the general structure of the multilayer neural network. The neurons are interconnected, and the connections play the role of synapses in biological neurons; in the ANN they are called connection strengths or weight vectors. In an ANN model, the numbers of input and hidden neurons and the number of cases to be learned have a critical effect on learning performance. In this study, the backpropagation algorithm, which calculates weights using the differences between output and target values, was used as the ANN learning method. The backpropagation algorithm determines the weights by finding the minimum of the error function through gradient descent using a differentiable activation function [13].
In this study, the water quality factors selected through EFA, a multivariate statistical method, and the precipitation and average temperature from meteorological observations were used as input data. The model performance, excluding the effect of initial weights, was evaluated using the ensemble modeling technique, which statistically evaluates the results of multiple ANN models with different initial weights. Figure 4 shows the structure of an ANN model that uses the ensemble modeling technique. Because the ANN results vary with the initial weights, an optimal model was derived through ensemble modeling over the initial weights.
Table 3 shows the model evaluation methods used in this study. The coefficient of determination (R²), which is widely used in various fields, including water quality modeling, is a quantitative measure of the linear relationship between measurements and simulation values. The coefficient ranges between 0 and 1; the more linear the relationship, the closer the coefficient is to 1. The Nash-Sutcliffe efficiency (NSE) is the statistical measure most widely used in the water quality modeling field. It is recommended by the ASCE [14], Legates and McCabe [15], and Moriasi et al. [16], and it is still used by researchers who perform water quality modeling. A value closer to 1.0 means that the simulation values reflect the tendency of the measurements more accurately. The root mean square error (RMSE) is a statistical measure that carries the unit of the simulated item and can quantitatively indicate errors. However, it is difficult for non-experts to interpret because it only represents the absolute degree of error. Care should also be taken because the equation squares the errors and is therefore strongly affected by high values or outliers.
Table 3. Model performance functions for evaluating the ANN-based water quality prediction models.
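The three measures can be sketched with their standard textbook definitions. The contents of Table 3 are not reproduced in this text, so it is an assumption that the table uses these same forms: R² as the squared Pearson correlation, NSE as one minus the ratio of the squared error to the variance of the observations, and RMSE as the square root of the mean squared error. The function names are illustrative.

```python
import math


def rmse(obs, sim):
    """Root mean square error, in the unit of the simulated variable."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    obs_variation = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_variation


def r_squared(obs, sim):
    """Squared Pearson correlation between observations and simulations."""
    mean_obs = sum(obs) / len(obs)
    mean_sim = sum(sim) / len(sim)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov ** 2 / (var_obs * var_sim)


# A simulation with a constant bias of +1: perfectly linear, but offset.
observed = [1.0, 2.0, 3.0, 4.0]
biased = [2.0, 3.0, 4.0, 5.0]
scores = (r_squared(observed, biased), nse(observed, biased), rmse(observed, biased))
```

For the biased series, R² stays at 1 while NSE drops to about 0.2 and RMSE equals 1, which is why R² alone can overstate model skill and the measures are reported together.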

Exploratory Factor Analysis (EFA) Results
The EFA results are outlined in Table 4. For Namgang A, which is located upstream of the Namgang Dam, water temperature (W.T), air temperature (T), T-N, and DO were classified as Factor 1 (F1), and discharge (Q) and SS as Factor 2 (F2). Thus, SS was found to be significantly affected by discharge. For Namgang B, W.T, T, T-N, and DO were also classified as F1, and COD, BOD, TOC, SS, and T-P as F2. In the case of Namgang D, which is located downstream of the Namgang Dam, BOD, COD, TOC, and T-P were classified as the same factor. The EFA results revealed that DO was negatively correlated with W.T and T at most locations, which reflects the decrease in the solubility of gas (oxygen) with increasing water temperature. For Namgang D and E, which are immediately downstream of the Namgang Dam, water quality variables such as COD and nutrients were classified as the same factor. For Namgang E, BOD and Chl-a were classified as the same factor. Where the Nam River joins the main stream of the Nakdong River, a hydraulically stagnant flow occurs at the measurement point. Therefore, the effect of internally generated BOD due to an increase in Chl-a in the stagnant reach can be considered to have appeared in the same factor.
Most of the meteorological variables were not classified with water quality variables. The duration of sunshine (Sun) and solar radiation quantity (Rad) exhibited negative correlations with relative humidity (R.H.). This suggests that the meteorological variables are not direct influencing factors for the water quality variables but act indirectly through W.T or saturation, so they did not show large shared variability. The variables grouped together in the same factor change simultaneously under a common influence. Variables that belong to different factors also vary together, but their shared variation is small. Thus, only variables with large shared variability were classified as the same factor. In this study, the water quality variation characteristics of each watershed were examined through EFA, and the classified factors were used as the input variables for training the ANN-based water quality prediction model for each water quality variable of the unit watershed. Furthermore, even though the correlation between meteorological factors and water quality factors could not be revealed statistically through EFA, we tried to capture the nonlinear correlations between meteorological factors and water quality factors through ANN model learning. Because the weather already includes characteristics that determine water quality, future water quality was predicted from weather forecasts.

ANN Learning System
ANN model learning for water quality prediction of each unit watershed was performed using the meteorological observation data as input variables in order to reflect actual meteorological phenomena. To apply the spatial refinement of meteorological factors for each unit watershed, the data of the Sancheong meteorological station were used for Namgang A and B, and the data of the Jinju meteorological station for Namgang C, D, and E. The six water quality variables of the model were DO, BOD, COD, TOC, T-P, and SS. A total of 30 data sets were collected from the five unit watersheds to build the ANN-based water quality prediction model.
The factors that have dominant influence on the variability of the prediction factors were selected as input variables of the model by using EFA. For the ANN model, the ensemble modeling technique was applied, which statistically evaluates the results of multiple ANN models with different initial weights to evaluate the model's performance, excluding the effects of initial weights. Figure 5 shows the input variables used to construct the ANN model and the total flowchart.
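The ensemble idea described above can be sketched as follows: the same network is trained several times from different random initial weights with backpropagation, and the member predictions are summarized statistically (here, by their mean). This is a minimal illustration under stated assumptions, not the configuration used in the study. The network size, learning rate, number of epochs, number of members, and the synthetic data are all hypothetical.

```python
import numpy as np


def train_mlp(X, y, n_hidden=8, lr=0.05, epochs=3000, seed=0):
    """One-hidden-layer network trained by full-batch gradient descent.
    The seed controls only the random initial weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                  # differentiable activation
        err = (h @ W2 + b2) - y                   # output minus target
        grad_h = (err @ W2.T) * (1.0 - h ** 2)    # backpropagated error
        W2 -= lr * (h.T @ err) / n
        b2 -= lr * err.mean(axis=0)
        W1 -= lr * (X.T @ grad_h) / n
        b1 -= lr * grad_h.mean(axis=0)
    return W1, b1, W2, b2


def predict(params, X):
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()


def ensemble_predict(X_train, y_train, X_new, seeds):
    """Train one member per seed; return all member predictions and their mean."""
    members = np.array([predict(train_mlp(X_train, y_train, seed=s), X_new)
                        for s in seeds])
    return members, members.mean(axis=0)


# Synthetic stand-in for standardized inputs (e.g. precipitation, temperature).
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, (64, 2))
y = (0.5 * X[:, 0] + 0.3 * X[:, 1] ** 2).reshape(-1, 1)
members, ensemble_mean = ensemble_predict(X, y, X, seeds=range(5))
```

The spread of `members` around `ensemble_mean` gives a rough picture of how much the result depends on the initial weights, which is the effect the ensemble technique is meant to average out.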
The input variables for ANN model learning of each unit watershed were selected using the EFA results. The water quality variables grouped under the same common factor were selected as input variables. The past measurements (t-1 and t-2) were not considered for water quality variables grouped as the same factor. This is because, given the temporal relationship of the water quality variable to be predicted (t+1), the effect of the present (t) water quality must be considered rather than that of the past (t-1 and t-2) water quality. Table 5 lists the input variables for ANN model learning for each unit watershed.
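The temporal pairing described above, in which inputs observed at time t are matched with the target value at t+1, can be sketched as a simple shift of the two series. The function name `make_one_step_pairs` and the example values are hypothetical, and the sketch assumes the records are already ordered by sampling date.

```python
def make_one_step_pairs(inputs, target):
    """Pair each input row observed at time t with the target value at t+1.

    inputs -- list of rows, one per sampling date, in chronological order
    target -- list of observed values of the variable to be predicted
    """
    if len(inputs) != len(target):
        raise ValueError("inputs and target must cover the same sampling dates")
    X = inputs[:-1]      # rows at t (the last row has no t+1 target yet)
    y = target[1:]       # values at t+1
    return X, y


# Made-up example with three sampling dates and two input variables.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
values = [5.0, 6.0, 7.0]
X_pairs, y_pairs = make_one_step_pairs(rows, values)
# X_pairs == [[1.0, 10.0], [2.0, 20.0]] and y_pairs == [6.0, 7.0]
```

No t-1 or t-2 columns are added to the input rows, in line with the choice stated in the text.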

ANN Learning Results
Figures 6-10 show the learning results of the water quality prediction model for each unit watershed. Table 6 shows the coefficients of determination and the model evaluation method.
Based on the ANN learning results, R 2 was 0.

Evaluation of the ANN Model That Utilizes Probability Forecasts
The trained model was evaluated by feeding it the weather forecasts as input data and comparing the resulting water quality predictions with the actual measurements. For the weather probability forecasting, the forecasts from July 2014 to June 2016 were used as input data. Table 6 shows the evaluation results of the five unit watersheds. R² was 0.673-0.866 for DO, 0.315-0.673 for BOD5
In general, the Namgang E unit watershed showed higher model accuracy than the other unit watersheds. This is because the Namgang E unit watershed has a larger number of accumulated water quality measurement samples. Moreover, its water quality varies more closely with meteorological effects than that of the Namgang C and D watersheds, which are affected by artificial flow from the discharge of the Namgang Dam. As a result, the watershed characteristics were reflected well in the ANN learning. A meaningful quantitative model evaluation is difficult owing to the insufficient data of the probability forecasts, which started in 2014, and the irregular water quality measurement dates. As Palani [7] found, a lack of consistency between the observed and estimated data indicates that new patterns should be incorporated into the model; hence, the model needs to be readjusted and reconfirmed when more data are collected. Even though the amount of available data was small, reasonable results were obtained for water quality predictions on the validation data set, which the model had not seen during training. Palani [7] reported that better predictions can be provided if more data are available. Accordingly, additional data would improve the accuracy of the ANN-based water quality prediction model.

Conclusions
Many studies have been conducted on water quality prediction models that use ANNs. However, to the best of the authors' knowledge, there has been no study on a water quality prediction model that considers meteorological factors that have significant effects on water quality. Water quality is sensitive to rainfall runoff into watersheds and to the changing water environment. Moreover, the temperature of surface water is directly affected because the water is exposed to sunlight. Therefore, research on the development of a water quality prediction model needs to consider meteorological factors as well as water quality factors. The water quality variation characteristics of each watershed were examined through EFA, and the classified factors were used as the input variables for training the ANN-based water quality prediction model for each water quality variable of the unit watershed. In the present study, we developed and evaluated an ANN-based water quality prediction model considering various water quality variation characteristics. This study can serve as a reference for selecting input data when constructing an ANN and provides information on meteorological correlations for water quality prediction.

1. Based on the EFA results, water temperature (W.T), air temperature (T), and dissolved oxygen (DO) were classified as the same factor, with DO negatively correlated with W.T and T at most locations. This reflects the decrease in the solubility of gas (oxygen) with increasing W.T. Immediately downstream of the Namgang Dam, water quality variables such as COD and nutrients were classified as the same factor. In Namgang E, BOD and Chl-a were classified as the same factor. This suggests that internally generated Chl-a and BOD are highly correlated owing to the hydraulically stagnant flow at the junction of the main stream and tributary.

2. Most of the meteorological variables were not classified together with the water quality variables. This is because the meteorological variables are not direct influencing factors for the water quality variables but indirect factors related to W.T or saturation, and therefore did not exhibit large shared variability. In other words, the nonlinear relationship between meteorological variables and water quality variables could not be statistically examined through EFA. However, we attempted to build a model that embodies the nonlinear correlation between the meteorological factors and water quality factors through ANN model learning.

3. A water quality prediction model was built for each unit watershed and evaluated using the coefficient of determination and the other performance measures; the results were good for all water quality variables except SS. The poorer SS results seem to be attributable to the large changes in observed values caused by rainfall-driven changes in the watershed runoff characteristics; moreover, the number of observations is too small to reflect these variation characteristics. An enhanced model could be constructed if detailed ANN learning were performed through continuous accumulation of the water quality data of the existing water quality monitoring network. A meaningful quantitative model evaluation is difficult owing to the insufficient data of probabilistic weather forecasting, which started in 2014, and the irregular water quality measurement dates. However, accuracy can be expected to improve as data accumulate in the future.

4. The meteorological and water quality changes in the watershed have large spatiotemporal variability. Water quality data have strong nonlinear characteristics owing to the very complex reaction mechanisms of the ecosystem. Because the meteorological effects already contain some of the characteristics of water quality, probabilistic forecasting of water quality will be possible through the ANN-based water quality forecast model in the future.