Prediction of Algal Chlorophyll-a and Water Clarity in Monsoon-Region Reservoir using Machine Learning Approaches

The prediction of algal chlorophyll-a and water clarity in lentic ecosystems is an important issue because of the rapid deterioration of drinking water quality and ongoing eutrophication. The key objectives of this study were to predict long-term algal chlorophyll-a and transparency (water clarity), measured as Secchi depth, in a spatially heterogeneous and temporally dynamic reservoir largely influenced by the Asian monsoon during 2000–2017, and then to determine the reservoir trophic state, using multiple linear regression (MLR), a support vector machine (SVM) and an artificial neural network (ANN). We tested the models on the spatial patterns of the riverine zone (Rz), transitional zone (Tz) and lacustrine zone (Lz) and on the temporal variations among the premonsoon, monsoon and postmonsoon periods. Monthly physicochemical parameters and precipitation data (2000–2017) were used to build the MLR, SVM and ANN models, which were then checked by cross-validation. The SVM model showed better predictive performance than the MLR and ANN models both before and after validation: its root mean square error (RMSE) and mean absolute error (MAE) were lower, and its coefficient of determination was higher. The mean and maximum values of total suspended solids (TSS), nutrients (total nitrogen (TN) and total phosphorus (TP)), water temperature (WT), conductivity and algal chlorophyll-a (CHL-a) were higher in the riverine zone than in the transitional and lacustrine zones due to surface run-off from the watershed. During the premonsoon and postmonsoon, the average annual rainfall was 59.50 mm and 54.73 mm, respectively, whereas it was 236.66 mm during the monsoon period. From 2013 to 2017, the trophic state of the reservoir based on CHL-a and Secchi depth (SD) ranged from mesotrophic to oligotrophic. Analysis of the importance of input variables indicated that WT, TP, TSS, TN, NP ratios and rainfall directly influenced chlorophyll-a and transparency in the reservoir. These predictions of algal chlorophyll-a and Secchi depth may provide key clues for a better management strategy for the reservoir.

These natural hazards had adverse effects on the ecosystems, which impeded the use of water as a drinking and industrial source. In addition, citizens distrust the stability and safety of the water quality and quantity. The objectives of the present study are (1) to predict chlorophyll-a and transparency (Secchi depth) using MLR, ANN and SVM by optimizing key model parameters, (2) to evaluate the predictive performance of MLR, ANN and SVM through model accuracy metrics, (3) to assess the relative importance of the input variables in MLR, ANN and SVM and (4) to determine the long-term trophic state of the reservoir.

Study Area
The Imha reservoir is an embankment reservoir located along the upstream section of the Nakdong River, South Korea (near the Tae-Baek mountains). The study sites and land-use patterns of the reservoir are shown in Figure 1. The catchment area of the reservoir is 1361 km² and its water holding capacity is 595,000,000 m³ [24]. The reservoir has a surface area of 26.4 km² and the water level is at an elevation of 163 m [24]. The mean depth of the Imha reservoir is 16 m. The surface water quality data (2000–2017) of the Imha reservoir were collected from three different zones: the riverine zone (Rz, latitude: 36.513058, longitude: 128.965529), the transitional zone (Tz, latitude: 36.513058, longitude: 128.965529) and the lacustrine zone (Lz, latitude: 36.513058, longitude: 128.965529). Land in the upstream part of the reservoir (riverine zone) is very scarce, and the bed of the reservoir is composed of rock and hard homogeneous granite [25]. The annual domestic and industrial water supply from the Imha reservoir is 363.6 × 10⁶ m³ [24], and the annual irrigation water supply is 13 × 10⁶ m³ [24]. This artificial reservoir also provides flood control and helps to maintain water quality and to develop water resources efficiently.
Figure 1. Map showing the sampling sites of the Imha reservoir, which is located in the eastern part of South Korea (Rz: riverine zone; Tz: transitional zone; intake tower for the drinking water supply for citizens; Lz: lacustrine zone).

Analysis of Water Quality Parameters and Rainfall Data
The monthly surface water quality data were obtained from the Korean Ministry of Environment. A portable multi-parameter analyzer (YSI Sonde Model 6600) was used to measure electrical conductivity (EC), dissolved oxygen (DO) and water temperature (WT). The concentrations of total nitrogen (TN), biological oxygen demand (BOD) and chemical oxygen demand (COD) were determined by the chemical testing methods standardized by the Ministry of the Environment, Korea [26]. Total phosphorus (TP) was measured by the ascorbic acid method, which is also standardized by the Ministry of the Environment, Korea [26]. Total suspended solids (TSS) were determined with preweighed Whatman GF/C filters and algal chlorophyll-a (CHL-a) with a spectrophotometer (Beckman Model DU-65), according to the US EPA guidance (US Environmental Protection Agency, US EPA 2007) [27]. A metal Secchi disk was used to measure transparency (Secchi depth, SD). The monthly precipitation data for 2000–2017 were collected from the Korean Meteorological Administration at a local weather station (Gyeongsangbuk-do, Andong, Angi-dong, latitude: 36.624913, longitude: 128.715379).

Multiple Linear Regression (MLR)
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several input (explanatory) variables to predict the outcome of a single output (response) variable. The goal of MLR is to model the linear relationship between the input variables and the output variable. The MLR equation is $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon$ for $i = 1, \dots, n$ observations, where $y_i$ is the output variable, $x_{i1}, \dots, x_{ip}$ are the input variables, $\beta_0$ is the y-intercept (constant term), $\beta_1, \dots, \beta_p$ are the slope coefficients for each input variable and $\epsilon$ is the model's error term (also known as the residual).
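As a minimal sketch of how the MLR coefficients can be estimated, the following Python code fits the equation by ordinary least squares through the normal equations. The function names (`solve_linear_system`, `fit_mlr`, `predict_mlr`) and the toy data are illustrative assumptions; this is not the R code or the data used in this study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular; no check is made for zero pivots.
    """
    n = len(b)
    # Augmented matrix [a | b] so that rows are reduced together.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(rows, y):
    """Return [beta_0, beta_1, ..., beta_p] minimising the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(beta, row):
    """Evaluate beta_0 + beta_1 * x_1 + ... + beta_p * x_p for one observation."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Hypothetical, noise-free toy data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
beta = fit_mlr(rows, y)
```

Because the toy data contain no noise, the fitted coefficients recover the intercept and the two slopes up to floating-point rounding; with real monitoring data the residual term would be non-zero.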

Support Vector Machine (SVM)
Recently, the support vector machine (SVM) has become more popular because of its attractive features and good practical performance [28,29]. In this study, we applied support vector regression (SVR) to predict chlorophyll-a and Secchi depth. SVR finds a function that approximates the relationship between the input variables and the output variable. The SVM estimates this function from the following equation:
$S_i = w_i \, \Phi(Z_i) + b$
where $S_i$ is the network output, $Z_i$ is the input data, which are mapped into a higher-dimensional feature space via a nonlinear mapping function $\Phi(Z_i)$, and $w_i$ and $b$ are the coefficients determined by minimizing the regularized risk function based on the network output and the real value [5]. Using the kernel trick, a radial basis function (RBF) kernel was applied to predict chlorophyll-a and Secchi depth. The RBF kernel is defined as $K_{RBF}(Z, Z_i) = \exp\left(-\gamma \lVert Z - Z_i \rVert^2\right)$, where $\gamma$ is a parameter that sets the "spread" of the kernel.
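The RBF kernel can be illustrated with a short Python sketch. The function name `rbf_kernel`, the example vectors and the value of gamma are assumptions chosen for illustration; they are not parameters reported in this study.

```python
import math


def rbf_kernel(z, z_i, gamma):
    """K_RBF(Z, Z_i) = exp(-gamma * ||Z - Z_i||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(z, z_i))
    return math.exp(-gamma * squared_distance)


# Hypothetical two-dimensional input vectors.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)  # identical points
near = rbf_kernel([1.0, 2.0], [1.5, 2.0], gamma=0.5)  # squared distance 0.25
far = rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma=0.5)   # squared distance 25
```

The kernel equals 1 for identical inputs and decays toward 0 as the distance grows; a larger gamma makes this decay faster, so each training point influences a narrower neighbourhood.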

Artificial Neural Network (ANN)
In recent times, the artificial neural network (ANN) has received a lot of attention for classifying patterns in multi-variable datasets and for modeling complex environmental variables [5,30]. Generally, an ANN consists of an input layer, one or more hidden layers and an output layer. The basic formula of an ANN is $Y = f(X, W) + \varepsilon$, where $Y$ is the vector of model outputs, $X$ is the vector of model inputs, $W$ is the vector of model weights, $f$ is the chosen functional relationship between the outputs, inputs and parameters of the model, and $\varepsilon$ is the error term [31]. The sigmoid function was chosen as the activation function of the network. In the present study, we used nine input variables and three hidden layers for algal chlorophyll-a prediction, and ten input variables and three hidden layers for Secchi depth prediction (supplementary material).
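A forward pass through a small feed-forward network with sigmoid hidden units can be sketched as follows. The layer sizes, the weights and the linear output unit are assumptions made for illustration only; the fitted network of this study is described in the supplementary material and is not reproduced here.

```python
import math


def sigmoid(x):
    """Logistic sigmoid activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_layers, output_layer):
    """Propagate inputs through sigmoid hidden layers and a linear output layer."""
    activations = inputs
    for weights, biases in hidden_layers:
        activations = dense(activations, weights, biases, sigmoid)
    weights, biases = output_layer
    # Assumption: a linear output unit, a common choice for regression targets.
    return dense(activations, weights, biases, lambda v: v)


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
hidden_layers = [([[0.4, -0.2], [0.1, 0.3]], [0.0, -0.1])]
output_layer = ([[1.5, -0.5]], [0.2])
prediction = forward([0.8, 1.2], hidden_layers, output_layer)
```

Training would adjust the weights and biases to minimize the prediction error; only the evaluation of $f(X, W)$ for fixed weights is shown here.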

K-Fold Cross-Validation and Model Accuracy Metrics
After fitting a regression or building a model, its accuracy has to be determined. K-fold cross-validation (CV) is a robust method for estimating the accuracy of a model. It randomly splits the dataset into K subsets, reserves one subset, trains the model on all the other subsets, tests the model on the reserved subset and records the prediction error. This process is repeated until each of the K subsets has served as the test set, and the average of the K recorded errors is then computed. In our study, we used K = 5.
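The K-fold procedure can be sketched in Python as follows. The helper names (`k_fold_indices`, `cross_validate`), the fixed random seed and the trivial mean-predicting model are assumptions for illustration; any model with a fit and a predict step could be substituted.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split the indices 0..n_samples-1 into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, predict, error):
    """Average the held-out error over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        predictions = [predict(model, xs[i]) for i in held_out]
        fold_errors.append(error([ys[i] for i in held_out], predictions))
    return sum(fold_errors) / k


def fit_mean(xs, ys):
    """Hypothetical baseline model: always predict the training mean."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def mean_absolute_error(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Toy data with a constant response, so the baseline predicts it exactly.
xs = list(range(10))
ys = [2.0] * 10
folds = k_fold_indices(len(xs), 5)
cv_error = cross_validate(xs, ys, 5, fit_mean, predict_mean, mean_absolute_error)
```

Each observation is used for testing exactly once and for training K − 1 times, which is why the averaged error is a less optimistic estimate than the error measured on the training data.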

Mean Absolute Error
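Mean absolute error (MAE) is the average of the absolute differences between the observed values and the predicted values [5,20]. It measures how far the predictions are from the actual output. The lower the MAE, the better the model. The MAE can be calculated by the following equation:
$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value and $n$ is the number of observations.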

Root Mean Squared Error
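Root mean squared error (RMSE) also measures the average magnitude of the error [5,20]. It is the square root of the average of the squared differences between the predictions and the real observations. The lower the RMSE, the better the model. Mathematically, the RMSE can be presented as the following equation:
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value and $n$ is the number of observations.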

Coefficient of Determination
The coefficient of determination (R²) is the proportion of variation in the outcome that is explained by the predictor variables [5,20]. It represents the squared correlation between the observed outcome values and the outcome values predicted by the model. The higher the R², the better the model.
$R^2 = 1 - \frac{SSR}{SST}$
where SSR is the sum of squared residuals and SST is the total sum of squares. The sum of squared residuals, $SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, is the sum of the squared deviations of the predicted values $\hat{y}_i$ from the observed values $y_i$. It is a measure of the discrepancy between the data and an estimation model. A small SSR indicates a tight fit of the model to the data. It is used as an optimality criterion in parameter selection and model selection.
SST is the sum of the squares of the differences between the dependent variable and its mean, $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$.
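The three accuracy metrics described in this section (MAE, RMSE and R²) can be computed with a few lines of Python. The function names and the observed and predicted values below are hypothetical and serve only to illustrate the arithmetic; they are not results of this study.

```python
import math


def mae(observed, predicted):
    """Mean absolute error: average of |y_i - yhat_i|."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSR / SST."""
    mean_observed = sum(observed) / len(observed)
    ssr = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ssr / sst


# Hypothetical observed and predicted values.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]

mae_value = mae(observed, predicted)       # errors are 1, 0, 1, 0
rmse_value = rmse(observed, predicted)     # SSR = 2 over n = 4
r2_value = r_squared(observed, predicted)  # SSR = 2, SST = 20
```

For these toy values the MAE is 0.5, the RMSE is the square root of 0.5 (about 0.71) and R² is 0.9. The RMSE exceeds the MAE because squaring gives larger errors more weight.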

Data Analysis
All data analyses, including MLR, SVM and ANN, were done in R (version 3.5.2). For the prediction of CHL-a using MLR, SVM and ANN, nine input variables were used (precipitation, DO, BOD, COD, TSS, TN, TP, NP ratios, WT and Cond), whereas ten input variables were used for the prediction of transparency (SD) (precipitation, DO, BOD, COD, TSS, TN, TP, NP ratios, WT, Cond and CHL-a) in the riverine zone, transitional zone and lacustrine zone and in the premonsoon, monsoon and postmonsoon.

Water Quality Summary
The physicochemical parameters of the Imha reservoir showed spatial and temporal heterogeneity (Table 1). The maximum dissolved oxygen level was higher in the riverine zone (17.6 mg L⁻¹) than in the transitional (13.2 mg L⁻¹) and lacustrine (15.4 mg L⁻¹) zones. The mean BOD level was higher in the lacustrine zone (2.12 mg L⁻¹) than in the riverine (2.05 mg L⁻¹) and transitional (2.04 mg L⁻¹) zones. The mean and maximum values of TSS, TN, TP, WT, Cond and CHL-a were higher in the riverine zone than in the transitional and lacustrine zones due to surface run-off from the watershed. Notably, the highest maximum and mean TP concentrations were found during the monsoon period because heavy rainfall brings nutrients into the Imha reservoir. The highest water temperature in the Imha reservoir was also observed during the monsoon.
Table 1. Summary of the Imha reservoir water quality parameters for the riverine zone (Rz, n = 216), transitional zone (Tz, n = 216) and lacustrine zone (Lz, n = 216), and for the premonsoon (January–June, n = 319), monsoon (July–August, n = 108) and postmonsoon (September–December, n = 216). SD = standard deviation, CV = coefficient of variation, min = minimum and max = maximum.

Analysis of Hydrology Pattern
During the study period (2000–2017), the annual and seasonal hydrology showed distinct differences. In Korea, half of the annual rainfall occurs during the monsoon period (July–August). The monsoon rainy season on the Korean peninsula is divided into two phases. The first phase occurs in July, when the rains are frequent but less intense than in the second phase. The second phase lasts from August to early September, when occasional typhoons pass over the Korean peninsula. These typhoons are accompanied by heavier rainfall and can significantly affect the hydrological, physical, chemical and biological conditions of the ecosystem. During the premonsoon and postmonsoon, the average annual rainfall was 59.50 mm and 54.73 mm, respectively, whereas it was 236.66 mm during the monsoon period.
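The seasonal grouping of monthly rainfall can be sketched as follows, using the month windows given in the Table 1 caption (premonsoon: January–June, monsoon: July–August, postmonsoon: September–December). The function names and the rainfall values are hypothetical; they are not the precipitation records of this study.

```python
def season_of(month):
    """Map a calendar month (1-12) to the season windows of the Table 1 caption."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if month <= 6:
        return "premonsoon"
    if month <= 8:
        return "monsoon"
    return "postmonsoon"


def seasonal_means(records):
    """Average rainfall per season from (month, rainfall_mm) pairs."""
    totals = {}
    counts = {}
    for month, rainfall_mm in records:
        season = season_of(month)
        totals[season] = totals.get(season, 0.0) + rainfall_mm
        counts[season] = counts.get(season, 0) + 1
    return {season: totals[season] / counts[season] for season in totals}


# Hypothetical monthly rainfall values in mm.
monthly_rain = [(1, 20.0), (6, 100.0), (7, 250.0), (8, 230.0), (9, 90.0), (12, 10.0)]
means = seasonal_means(monthly_rain)
```

With these toy values the seasonal means are 60 mm (premonsoon), 240 mm (monsoon) and 50 mm (postmonsoon), which mirrors the pattern of a short but much wetter monsoon window.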

Chlorophyll-a Prediction, Cross-Validation and Trophic State in Different Zones
The chlorophyll-a concentration fluctuated from the riverine zone to the lacustrine zone and was higher in the riverine zone than in the transitional and lacustrine zones (Figure 3). The time series plots of CHL-a in Rz, Tz and Lz revealed that the values predicted by SVM were closer to the observed values than those predicted by ANN and MLR (Figure 3). After the regression by MLR, SVM and ANN, the model accuracy was evaluated by cross-validation (CV, K = 5). The modeling accuracy was quantitatively compared by the root mean square error (RMSE), coefficient of determination (R²) and mean absolute error (MAE) between the predicted and the observed CHL-a concentrations (Table 2). Before validation, the RMSE was lower in Tz than in Rz and Lz for MLR, SVM and ANN. The relative importance of the input variables is shown in Figure 4. Based on MLR, water temperature (WT) was identified as the most important input variable in Rz, followed by total nitrogen (TN) and the total nitrogen:total phosphorus (NP) ratio (Figure 3). In contrast, the SVM and ANN models identified total suspended solids (TSS) as the leading driver in Rz. At Tz, WT was the most important variable based on MLR and SVM, whereas it was TSS based on ANN. The results of ANN in Lz showed that BOD was the most important variable.
The trophic state based on chlorophyll-a fluctuated more in the riverine zone than in the transitional and lacustrine zones. A severe oligotrophic condition based on chlorophyll-a was observed in the reservoir during the intensely flooded years (2002, 2003 and 2004). In the transitional zone, the trophic state of Imha ranged from oligotrophic to mesotrophic, and the lacustrine zone followed the same pattern. In the riverine zone, the concentration of chlorophyll-a was higher and the trophic state ranged from oligotrophic to eutrophic over the study period.
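The text above does not state which criteria were used to assign trophic states. Purely as an illustrative assumption, the sketch below uses Carlson's trophic state index (TSI), one widely used scheme, with chlorophyll-a in µg/L and Secchi depth in m. The class boundaries (40, 50 and 70) follow a common convention and may differ from the criteria applied in this study; the function names are hypothetical.

```python
import math


def tsi_from_chlorophyll(chl_a_ug_per_l):
    """Carlson TSI from chlorophyll-a (ug/L): 9.81 * ln(CHL) + 30.6."""
    return 9.81 * math.log(chl_a_ug_per_l) + 30.6


def tsi_from_secchi(secchi_depth_m):
    """Carlson TSI from Secchi depth (m): 60 - 14.41 * ln(SD)."""
    return 60.0 - 14.41 * math.log(secchi_depth_m)


def trophic_class(tsi):
    """Assumed class boundaries; not necessarily those used in the study."""
    if tsi < 40.0:
        return "oligotrophic"
    if tsi < 50.0:
        return "mesotrophic"
    if tsi < 70.0:
        return "eutrophic"
    return "hypereutrophic"


# Hypothetical measurements, not data from the Imha reservoir.
clear_water = trophic_class(tsi_from_secchi(4.5))
green_water = trophic_class(tsi_from_chlorophyll(20.0))
```

Under this scheme, deeper Secchi readings lower the index and higher chlorophyll-a raises it, so the two indicators can be placed on one comparable scale.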

Chlorophyll-a Prediction, Cross-Validation and Trophic State in Different Season
The concentration of chlorophyll-a varied seasonally and was highest in the monsoon season compared with the premonsoon and postmonsoon (Figure 5). The time series plots of observed chlorophyll-a and chlorophyll-a predicted by MLR, SVM and ANN are shown in Figure 5. Before validation, MLR gave the highest root mean square error (RMSE) in the premonsoon, monsoon and postmonsoon seasons in comparison to SVM and ANN (Table 3). After validation, the SVM results showed the lowest RMSE in the premonsoon (RMSE = 1.04) compared with the monsoon and postmonsoon (RMSE = 1.51 and 1.80). The R² value of SVM was higher than those of MLR and ANN during the premonsoon, monsoon and postmonsoon, both before and after validation. SVM also gave a lower MAE than MLR and ANN in all three seasons before and after validation. In the premonsoon, MLR and SVM indicated that TP was the most important predictor of chlorophyll-a fluctuations, whereas ANN indicated that it was TSS (Figure 6). In the monsoon, TP was the most important input driver according to the ANN model, whereas the NP ratio was the most important according to SVM and MLR. In the postmonsoon, WT was the most important variable in MLR, SVM and ANN.

Transparency (Secchi Depth) Prediction, Cross-Validation and Trophic State in Different Zones
The transparency of the Imha reservoir fluctuated during the study period (Figure 7). The time series plots of observed transparency (Secchi depth) and transparency predicted by MLR, SVM and ANN are shown in Figure 6. SVM gave a lower RMSE than MLR and ANN in all three zones, both before and after validation (Rz, Tz and Lz; Table 4). After validation, the RMSE of the ANN model was higher in Tz (RMSE = 1.49) than in Rz (RMSE = 1.10) and Lz (RMSE = 1.17). The highest R² value was found for the SVM model in Rz, Tz and Lz compared with MLR and ANN, both before and after validation. SVM also gave a lower MAE than the other models (MLR and ANN) in Rz, Tz and Lz. Notably, TSS was the most important input driver in MLR, SVM and ANN influencing the Secchi depth of the Imha reservoir (Figure 8).

Transparency (Secchi Depth) Prediction, Cross-Validation and Trophic State in Different Season
The transparency of the Imha reservoir showed seasonal variation (Figure 9). Figure 9 shows the time series plots of observed transparency (Secchi depth) and transparency predicted by MLR, SVM and ANN during the premonsoon, monsoon and postmonsoon seasons. After validation, the highest RMSE was observed during the postmonsoon for the ANN model (RMSE = 1.27; Table 5). In contrast, the lowest RMSE values after validation were found for SVM in the premonsoon (RMSE = 0.41), monsoon (RMSE = 0.39) and postmonsoon (RMSE = 0.31) (Table 5). Before validation, the R² value of the ANN model was highest during the monsoon (R² = 0.75), whereas that of the SVM model was highest during the postmonsoon (R² = 0.92). The SVM model gave a lower MAE than the MLR and ANN models before and after validation. The results of MLR, SVM and ANN showed that TSS was the most important input variable influencing Secchi depth in the reservoir during the premonsoon (Figure 10).

Discussion
Our study showed that SVM had the highest prediction accuracy for chlorophyll-a and transparency (Secchi depth) compared with MLR and ANN in Rz, Tz and Lz during the premonsoon, monsoon and postmonsoon. This result agrees with previous studies of the Juam and Yeongsan reservoirs in South Korea [5]. The higher performance of SVM in comparison to MLR indicates that MLR has difficulty with highly complex nonlinear problems [20]. SVM gave better results than ANN for several reasons. First, SVM is better able than ANN to capture a nonlinear relationship, and ANN therefore gave relatively poor accuracy [33]. Sebald and Bucklew (2000) showed that SVM has equalization performance superior to that of ANN [34]. Second, in terms of minimizing error, SVM is more effective than ANN because SVM is based on the structural risk minimization principle, whereas ANN is based on the empirical risk minimization principle [5,35]. Third, due to the inherent algorithm design of ANN, it is harder to determine optimized model parameters for ANN than for SVM [36]. Fourth, due to the complex structure of ANN, the optimization of its model parameters is not stable even when the data set is the same, whereas SVM has no such problem [37].
In Rz, Tz and Lz, the results of MLR, SVM and ANN showed that WT was a more important predictor of chlorophyll-a than TP, TN, NP ratios and TSS. This result agrees with a previous study of a lake in China [38]. Chlorophyll-a was also influenced by nutrient availability (TP and TN) in Rz, Tz and Lz, which is similar to some previous studies [39,40]. The results of the ANN model showed that TP was the most important variable for chlorophyll-a in the reservoir during the monsoon. These findings are similar to those of some recent studies [5,40]. During the premonsoon, precipitation influenced chlorophyll-a in the reservoir [39].
In general, transparency (Secchi depth) is influenced by water color, turbidity and nutrients [41–43]. Based on our MLR, ANN and SVM modeling approach in different zones and seasons, three factors were the most important influences on Secchi depth in the Imha reservoir: total suspended solids (TSS), total phosphorus (TP) and the total nitrogen:total phosphorus ratio (NP ratio). The time series of transparency clearly showed that Secchi depth had been increasing while algal chlorophyll-a had been decreasing, which means that algal chlorophyll-a and transparency were inversely related in the reservoir system. These findings are in line with a previous study carried out in the Te-Chi reservoir in Taiwan [41].

Summary
This research applied different data-driven models to predict chlorophyll-a and transparency (Secchi depth) in different zones and seasons in the reservoir. The SVM model performed better than MLR and ANN in the present study. The most important input variables influencing chlorophyll-a and Secchi depth were WT, TSS, TP, NP ratios and precipitation. The trophic state of the reservoir based on chlorophyll-a and Secchi depth ranged from mesotrophic to oligotrophic from 2013 to 2017. Moreover, our study suggests that different types of machine learning approaches should be used for further prediction of chlorophyll-a and Secchi depth in reservoir studies.