# Prediction of Chlorophyll-a Concentrations in the Nakdong River Using Machine Learning Methods

## Abstract

## 1. Introduction

## 2. Material and Methods

#### 2.1. Study Area and Data

#### Forward Selection

#### 2.2. Machine Learning Methods

#### 2.2.1. Support Vector Regression (SVR)

#### 2.2.2. Ensemble Learning

#### Bagging

#### Random Forest (RF)

#### Extreme Gradient Boosting (XGBoost)

#### 2.2.3. Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)

#### 2.3. Workflow for Predicting Chlorophyll-a Concentration

#### 2.4. Cross-Validation and Model Accuracy Metrics

#### 2.5. Model Validation

## 3. Results

## 4. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest


**Figure 1.** (**a**) Monitoring sites along the Nakdong River. Data from Dasa weir and the weirs upstream of Dasa weir (seven weirs in total) are used for the analysis in this study. (**b**) Concentration of chlorophyll-a at each monitoring site. The blue arrow represents the direction of the river flow.

**Figure 2.** Response variable at time step $t$ at Dasa ($\widehat{y}_{t}^{S1}$) and explanatory variables with time-lagged variables by site for predicting the response variable. The dotted arrows denote that the explanatory and time-lagged variables of the identified sites affect the response variable.
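
The time-lagged inputs described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes a single, regularly sampled series; the function name `make_lagged`, the lag depth, and the sample values are hypothetical and are not taken from the study.

```python
import numpy as np

def make_lagged(series, max_lag):
    """Build lagged predictors for a 1-D series.

    Row i of X holds series[t-1], ..., series[t-max_lag] for the target
    y[i] = series[t], where t = i + max_lag.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    if max_lag < 1 or max_lag >= n:
        raise ValueError("max_lag must be between 1 and len(series) - 1")
    # Column k-1 is the series shifted back by k time steps.
    X = np.column_stack([series[max_lag - k : n - k] for k in range(1, max_lag + 1)])
    y = series[max_lag:]
    return X, y

# Made-up chlorophyll-a values, for illustration only.
chla = [10.0, 12.0, 15.0, 11.0, 9.0, 14.0]
X, y = make_lagged(chla, max_lag=2)
print(X)
print(y)
```

For several sites, the same function could be applied per site and variable and the resulting columns concatenated, provided all series share the same time index.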

**Figure 7.** Schematics of the (**a**) Recurrent Neural Network (RNN) and (**b**) Long Short-Term Memory (LSTM) models.

**Figure 8.** Flow chart of the four-step process used to find the optimal chlorophyll-a concentration prediction model.

**Figure 9.** 1-step-ahead recursive prediction; the input data were added to the model step by step during model construction for future prediction. (**a**) Cumulative learning and (**b**) rolling window learning.
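
The difference between the two learning schemes in the Figure 9 caption can be sketched as index bookkeeping. This is a minimal illustration, assuming the model is refit before every 1-step-ahead prediction and that the rolling window keeps a fixed length equal to the initial training size; the function name `one_step_splits` and both assumptions are hypothetical, since the caption does not state the window length or refit schedule.

```python
def one_step_splits(n_samples, initial_train, mode="cumulative"):
    """Yield (train_indices, test_index) pairs for 1-step-ahead prediction.

    cumulative: the training set grows by one sample at every step.
    rolling:    the training set keeps a fixed length and slides forward.
    """
    if mode not in ("cumulative", "rolling"):
        raise ValueError("mode must be 'cumulative' or 'rolling'")
    if not 0 < initial_train < n_samples:
        raise ValueError("initial_train must be between 1 and n_samples - 1")
    for t in range(initial_train, n_samples):
        start = 0 if mode == "cumulative" else t - initial_train
        yield list(range(start, t)), t

cumulative = list(one_step_splits(6, 3, mode="cumulative"))
rolling = list(one_step_splits(6, 3, mode="rolling"))

for train, test in cumulative:
    print("cumulative", train, "->", test)
for train, test in rolling:
    print("rolling   ", train, "->", test)
```

In both schemes the test index is always one step past the last training index, so no future observation leaks into the fit.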

**Figure 10.** Prediction results for riverine chlorophyll-a concentrations of the final model using only the variables selected via forward selection; 1-step-ahead recursive predictions with (**a**) cumulative learning and (**b**) rolling window learning; (1) Support Vector Regression (SVR), (2) Bagging, (3) Random Forest (RF), (4) Extreme Gradient Boosting (XGBoost), (5) Recurrent Neural Network (RNN), and (6) Long Short-Term Memory (LSTM).

Variable | Description | Unit | Mean | SD | Min | Max |
---|---|---|---|---|---|---|
Chla | Chlorophyll-a | $\mathrm{mg}/\mathrm{m}^{3}$ | 17.5 | 10.2 | 0.1 | 61.9 |
AvgTemp | Average temperature | °C | 15.1 | 9.5 | −10.9 | 32.1 |
Sunshine | Sunshine hours | h | 6.4 | 3.9 | 0.0 | 13.5 |
WaterTemp | Water temperature | °C | 17.6 | 8.7 | 2.0 | 33.5 |
pH | pH value | | 8.1 | 0.5 | 6.9 | 9.2 |
EC | Electrical conductivity | $\mu\mathrm{mhos}/\mathrm{cm}$ | 313.2 | 73.4 | 131.0 | 561.0 |
DO | Dissolved oxygen | $\mathrm{mg}/\mathrm{L}$ | 10.1 | 1.9 | 5.0 | 14.7 |
TOC | Total organic carbon | $\mathrm{mg}/\mathrm{L}$ | 3.3 | 0.6 | 2.0 | 7.4 |
RainFall | Amount of rainfall | mm | 2.5 | 7.5 | 0.0 | 85.3 |
InFlow | Total inflow | $\mathrm{m}^{3}/\mathrm{s}$ | 98.9 | 131.0 | 0.0 | 103.3 |
OutFlow | Total outflow | $\mathrm{m}^{3}/\mathrm{s}$ | 98.7 | 129.2 | 3.3 | 984.9 |

Variable | Estimate | Standard Error | t Value | p-Value |
---|---|---|---|---|
Intercept | 1.110 | 0.258 | 4.304 | <0.001 |
$\mathrm{Chla}_{t-1}^{\mathrm{S1}}$ | 0.862 | 0.014 | 60.009 | <0.001 |
$\mathrm{RainFall}_{t-1}^{\mathrm{S1}}$ | −0.011 | 0.003 | −3.444 | <0.001 |
$\mathrm{pH}_{t-5}^{\mathrm{S5}}$ | −0.102 | 0.025 | −4.100 | <0.001 |
$\mathrm{Sunshine}_{t-1}^{\mathrm{S2}}$ | 0.018 | 0.004 | 4.238 | <0.001 |
$\mathrm{RainFall}_{t-2}^{\mathrm{S3}}$ | −0.006 | 0.002 | −3.989 | <0.001 |
$\mathrm{OutFlow}_{t-1}^{\mathrm{S4}}$ | 0.063 | 0.021 | 2.959 | 0.003 |
$\mathrm{TOC}_{t-1}^{\mathrm{S2}}$ | −0.380 | 0.074 | −5.158 | <0.001 |
$\mathrm{TOC}_{t-1}^{\mathrm{S4}}$ | 0.049 | 0.014 | 3.574 | <0.001 |
$\mathrm{TOC}_{t-3}^{\mathrm{S7}}$ | 0.199 | 0.054 | 3.665 | <0.001 |
$\mathrm{OutFlow}_{t-4}^{\mathrm{S4}}$ | −0.054 | 0.020 | −2.684 | 0.007 |
$\mathrm{Chla}_{t-4}^{\mathrm{S7}}$ | −0.010 | 0.003 | −3.034 | 0.002 |
$\mathrm{Sunshine}_{t-4}^{\mathrm{S3}}$ | −0.008 | 0.003 | −3.066 | 0.002 |
$\mathrm{TOC}_{t-4}^{\mathrm{S7}}$ | −0.137 | 0.053 | −2.587 | 0.01 |
$\mathrm{TOC}_{t-4}^{\mathrm{S2}}$ | 0.189 | 0.073 | 2.589 | 0.01 |
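
Assuming the coefficient table above reports the linear model obtained in the forward selection step, the general idea of that step can be sketched as a greedy search. This is a generic illustration, not the authors' procedure: the helper names `rss` and `forward_select` are hypothetical, the data are synthetic, and the stopping rule (a fixed number of features) is an assumption standing in for whatever criterion the study used.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_select(X, y, max_features):
    """Greedily add the column that lowers the RSS the most at each step."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score every candidate column together with those already chosen.
        scores = [(rss(X[:, selected + [j]], y), j) for j in remaining]
        _, best_j = min(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic example: only columns 2 and 4 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)
chosen = forward_select(X, y, max_features=2)
print(chosen)
```

In practice the loop would usually stop on a statistical criterion, such as a p-value threshold or an information criterion, rather than a fixed feature count.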

1-step-ahead recursive prediction:

Input variables | Method | Cumulative learning: MAPE (%) | Cumulative learning: RMSE | Cumulative learning: NSE | Rolling window learning: MAPE (%) | Rolling window learning: RMSE | Rolling window learning: NSE |
---|---|---|---|---|---|---|---|
All variables | SVR | 11.02 | 3.864 | 0.5308 | 11.12 | 3.8950 | 0.5233 |
All variables | Bagging | 28.58 | 10.41 | −2.4049 | 33.88 | 11.6356 | −3.2545 |
All variables | RF | 16.53 | 5.760 | −0.0426 | 16.23 | 5.5398 | 0.0356 |
All variables | XGBoost | 8.69 | 3.5854 | 0.5960 | 9.16 | 3.6702 | 0.5767 |
All variables | RNN | 11.28 | 4.2872 | 0.1765 | 15.88 | 5.4196 | −0.1164 |
All variables | LSTM | 16.38 | 5.8634 | 0.2136 | 14.15 | 5.2502 | −0.1653 |
Selected variables based on forward selection | SVR | 9.85 | 3.1717 | 0.6838 | 9.85 | 3.1750 | 0.6832 |
Selected variables based on forward selection | Bagging | 16.33 | 6.3604 | −0.2712 | 13.37 | 4.9668 | 0.2247 |
Selected variables based on forward selection | RF | 8.50 | 3.1213 | 0.6939 | 8.98 | 3.1920 | 0.6798 |
Selected variables based on forward selection | XGBoost | 10.21 | 3.9305 | 0.5145 | 9.53 | 3.7712 | 0.5531 |
Selected variables based on forward selection | RNN | 7.54 | 2.6843 | 0.7601 | 7.27 | 2.6453 | 0.7516 |
Selected variables based on forward selection | LSTM | 14.40 | 4.6984 | 0.3783 | 17.25 | 5.7119 | 0.1077 |
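
The three accuracy metrics reported in the table above (MAPE, RMSE, and NSE) follow standard textbook definitions, which can be sketched as below. The function names and the sample observed and predicted values are illustrative assumptions, not data from the study.

```python
import numpy as np

def mape(obs, pred):
    """Mean absolute percentage error in percent (undefined if any observation is 0)."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(100.0 * np.mean(np.abs((obs - pred) / obs)))

def rmse(obs, pred):
    """Root mean squared error, in the units of the observations."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of squared prediction
    errors to squared deviations of the observations from their mean."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2))

# Made-up observed and predicted values, for illustration only.
obs = [10.0, 20.0, 30.0]
pred = [12.0, 18.0, 33.0]
print(mape(obs, pred), rmse(obs, pred), nse(obs, pred))
```

An NSE of 1 indicates a perfect fit, 0 means the model is no better than predicting the mean of the observations, and negative values, as in several rows of the table, mean it performs worse than that mean.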

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Shin, Y.; Kim, T.; Hong, S.; Lee, S.; Lee, E.; Hong, S.; Lee, C.; Kim, T.; Park, M.S.; Park, J.;
et al. Prediction of Chlorophyll-*a* Concentrations in the Nakdong River Using Machine Learning Methods. *Water* **2020**, *12*, 1822.
https://doi.org/10.3390/w12061822
