# Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern European Lakes


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Dataset

^{2}), mean lake depth (MeanDep, m) and maximum lake depth (MaxDep, m). Table 1 summarizes the number of lakes per country included in the dataset, the median and mean number of observations per lake (variable), the sampling months and the corresponding sampling years. As stated above, the dataset is temporally unbalanced, reflecting the different monitoring schemes of the countries that provided data to WISER [39]. For some lakes the record is complete, covering all sampling months with multiple measurements, while for others there is only a single measurement for the whole time period. Several options were available for handling this temporal heterogeneity, such as averaging all values to one per lake, selecting only the latest observation per lake, or using a mixed-effects hierarchical model. In this research, no temporal aggregation was applied, and all values per lake were used without weighting or selection. This choice was made because any aggregation would shrink the dataset, and dataset size strongly affects the performance of machine learning algorithms, which generally perform poorly on small datasets [40].

#### 2.2. Explorative Analysis

$$Z_{ij} = \frac{X_{ij} - X_{j\min}}{X_{j\max} - X_{j\min}},$$

where $Z_{ij}$ is the normalized value for the ith observed value of the jth variable (feature), $X_{ij}$ is the ith observed value of the jth variable, and $X_{j\max}$ and $X_{j\min}$ are the maximum and minimum observed values of the jth variable, respectively.
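The min-max normalization defined above can be illustrated with a short sketch. It is written in Python with the standard library only, purely for illustration; it is not the R workflow used in the study, and the function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature has no spread, so map it to 0.0.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical total-phosphorus readings (ug/L), for illustration only.
tp = [0.5, 12.0, 28.9, 1270.0]
tp_scaled = min_max_normalize(tp)
```

After scaling, the smallest observation of each variable maps to 0 and the largest to 1, so variables measured in very different units become comparable.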

#### 2.3. Stepwise Multiple Linear Regression

#### 2.4. Path Analysis

#### 2.5. Machine Learning Methods

^{−7} mg, we converted cyanobacterial cell density to biomass [43]. Table 3 presents the three risk categories for recreational use, defined according to cyanobacteria biomass.
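As a minimal sketch of how a CBB value maps to a risk category, the following Python function applies the limits listed in Table 3 (Low: CBB ≤ 2 mg/L; Medium: 2 mg/L < CBB ≤ 10 mg/L; High: CBB > 10 mg/L). The function name is hypothetical, and Python is used only for illustration rather than as the study's implementation.

```python
def risk_category(cbb_mg_per_l):
    """Return the recreational-use risk category for a CBB value in mg/L."""
    if cbb_mg_per_l < 0:
        raise ValueError("biomass cannot be negative")
    if cbb_mg_per_l <= 2.0:
        return "Low"
    if cbb_mg_per_l <= 10.0:
        return "Medium"
    return "High"


# Hypothetical CBB observations (mg/L), for illustration only.
categories = [risk_category(v) for v in [0.0, 2.0, 2.5, 10.0, 71.5]]
```

The boundary values 2 mg/L and 10 mg/L fall in the lower of the two adjacent categories, following the inequalities in Table 3.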

- DT is a supervised machine learning technique for inducing a decision tree from training data. A decision tree, also referred to as a classification tree, is a flowchart-like diagram that shows the various outcomes from a series of decisions. Practically, it is the mapping of observations about an item to conclusions about its target value [51].
- k-NN is a relatively simple approach to classification that is completely nonparametric. Given a point x_{0} that one wishes to classify into one of the K groups, the algorithm finds the k observed data points that are nearest to x_{0}. The classification rule is to assign x_{0} to the population that has the most observed data points among the k nearest neighbors. Points for which there is no majority are either classified to one of the majority populations at random or left unclassified [52].
- SVM is an algorithm that classifies data by determining the optimal hyperplane that separates observations according to their class labels. The central concept of this method is to accommodate classes that are separable by linear and non-linear class boundaries [53].
- RF is a classifier algorithm that evolved from decision trees. It collects the classifications of the individual trees and chooses the most-voted prediction as the result. RFs sample data from the original dataset, and at each node a subset of features is randomly selected from the available features to grow the tree. The strength of RFs lies in their ability to combine a large number of weak or weakly correlated classifiers into a strong classifier [54].
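The k-NN rule described in the list above can be sketched in a few lines. This is an illustrative Python version using Euclidean distance and the standard library only; it is not the R implementation used in the study. The function name, the choice of distance and the sample points are assumptions, and ties are left unclassified (returned as `None`), which is one of the two options mentioned in the description.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, x0, k=3):
    """Assign x0 to the majority class among its k nearest training points."""
    # Rank training indices by Euclidean distance to x0 (assumed metric).
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], x0))
    votes = Counter(train_labels[i] for i in order[:k]).most_common()
    # No majority: leave the point unclassified.
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None
    return votes[0][0]


# Hypothetical normalized (Chl-a, TN) pairs with risk labels, illustration only.
points = [(0.05, 0.10), (0.10, 0.15), (0.08, 0.12),
          (0.60, 0.55), (0.70, 0.65), (0.90, 0.80)]
labels = ["Low", "Low", "Low", "Medium", "Medium", "High"]

low_case = knn_classify(points, labels, (0.07, 0.11), k=3)
medium_case = knn_classify(points, labels, (0.65, 0.60), k=2)
tied_case = knn_classify(points, labels, (0.80, 0.725), k=2)
```

Because the rule depends on distances, features on very different scales would dominate the vote, which is one reason to normalize the variables before applying it.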

## 3. Results and Discussion

#### 3.1. Identifying Variables Explaining CBB Variation

^{2} = 0.33). Although Chl-a, as a single predictor, showed a comparable ability to capture the variation in CBB (R^{2} = 0.33), combining Chl-a with TN gives a model whose overall performance exceeds that of Chl-a alone, as indicated by the performance criteria summarized in Table 4 (BIC, AIC and RMSE). In the subgroup of shallow lakes, stepwise regression resulted in a four-variable linear model, including Chl-a, TN, MeanDep and TN/TP as the best predictors according to the AIC criterion (R^{2} = 0.27). As with the all-lakes dataset, although the best single-variable linear model (Chl-a) and the best two-variable (Chl-a and TN) and three-variable (Chl-a, TN and MeanDep) models had similar R^{2} values, their overall performance was weaker than that of the four-variable model. Furthermore, the predictive power of the best linear model for shallow lakes was weaker than that of the best model for the all-lakes dataset. Finally, in the subgroup of deep lakes, the model that best predicted the variation in CBB included Chl-a, TN and TN/TP (R^{2} = 0.44). The best two-variable (Chl-a and TN/TP) and single-variable (Chl-a) models in this case showed lower R^{2} values and weaker predictive performance criteria. In general, the linear models predicting CBB in deep lakes exhibited a clear improvement over the all-lakes and shallow-lakes models. Although this is somewhat surprising, since cyanobacterial blooms are a more common problem in shallow lakes, the finding can be explained by the fact that prediction is more difficult in shallow lakes because of their more fluctuating conditions. Similar results, where modelling efficacy improves in deep lakes, can be found in other studies [17,28]. Table 4 summarizes the best univariate and multivariate models together with their performance indicators. Our analysis shows that using linear models to predict CBB concentration from biological and physical–chemical lake variables has relatively low reliability, as reflected by the low R^{2} values. To address this, we explored the possibility of producing more reliable results by predicting health risk levels instead of concentrations. This is presented in Section 3.3.
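The four criteria used to compare the linear models (R^{2}, AIC, BIC and RMSE) can be computed from observed and fitted values as in the sketch below. It is an illustrative Python version, not the study's R code. It assumes the Gaussian log-likelihood form of AIC and BIC, with the error variance counted as one estimated parameter; other software may differ by an additive constant. The function name and the sample numbers are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred, n_coefficients):
    """R^2, RMSE, AIC and BIC for a fitted linear model.

    n_coefficients counts the regression coefficients including the intercept.
    """
    n = len(y_true)
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    mean_y = sum(y_true) / n
    tss = sum((y - mean_y) ** 2 for y in y_true)
    # Gaussian log-likelihood at the maximum-likelihood variance rss / n.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = n_coefficients + 1  # coefficients plus the error variance
    return {
        "r2": 1 - rss / tss,
        "rmse": math.sqrt(rss / n),
        "aic": 2 * k - 2 * log_lik,
        "bic": k * math.log(n) - 2 * log_lik,
    }


# Hypothetical observed and fitted CBB values (mg/L), illustration only.
observed = [0.1, 0.5, 1.2, 3.0, 6.5]
fitted = [0.3, 0.4, 1.0, 3.4, 6.0]
metrics = regression_metrics(observed, fitted, n_coefficients=2)
```

For models fitted to the same observations, lower AIC, BIC and RMSE indicate a better trade-off between fit and complexity, which is how two models with the same R^{2} can still be ranked.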

#### 3.2. Describing Dependent Relationships among Variables

#### 3.3. Evaluating the Performance of the Machine Learning Methods

## 4. Conclusions

^{2} values that are in line with values reported by previous studies, even though they are relatively low (R^{2} = 0.33 for all lakes, R^{2} = 0.27 for shallow lakes and R^{2} = 0.44 for deep lakes). When the prediction approach was changed and CBB was translated into risk categories associated with public health impacts from the recreational use of lakes, predictive ability improved substantially. Machine learning methods, and especially Random Forest, proved to be a reliable and highly accurate tool for categorizing lakes into risk levels, achieving model accuracy as high as 95.81% after the parameters of the algorithm were optimized. Confusion matrix analysis quantified the probability of false alarms for the three risk levels. Applying current machine learning techniques to assess the risk that cyanobacteria pose to human health can give crucial insights to water managers and consequently raise public awareness in a timely manner as to when lake water is unsuitable for recreational use.

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Carmichael, W. A world overview—One-hundred-twenty-seven years of research on toxic cyanobacteria—Where do we go from here? In Cyanobacterial Harmful Algal Blooms: State of the Science and Research Needs; Hudnell, H.K., Ed.; Springer: New York, NY, USA, 2008; Volume 619, pp. 105–125.
2. Paerl, H.W.; Huisman, J. Blooms like it hot. Science **2008**, 320, 57–58.
3. O’Neil, J.M.; Davis, T.W.; Burford, M.A.; Gobler, C.J. The rise of harmful cyanobacteria blooms: The potential roles of eutrophication and climate change. Harmful Algae **2012**, 14, 313–334.
4. Carmichael, W.W.; Boyer, G.L. Health impacts from cyanobacteria harmful algae blooms: Implications for the North American Great Lakes. Harmful Algae **2016**, 54, 194–212.
5. Mellios, N.; Papadimitriou, T.; Laspidou, C. Predictive modeling of microcystin concentrations in a hypertrophic lake by means of Adaptive Neuro Fuzzy Inference System (ANFIS). Eur. Water **2016**, 55, 91–103.
6. Lévesque, B.; Gervais, M.-C.; Chevalier, P.; Gauvin, D.; Anassour-Laouan-Sidi, E.; Gingras, S.; Fortin, N.; Brisson, G.; Greer, C.; Bird, D. Prospective study of acute health effects in relation to exposure to cyanobacteria. Sci. Total Environ. **2014**, 466, 397–403.
7. Hamilton, D.P.; Wood, S.A.; Dietrich, D.R.; Puddick, J. Costs of harmful blooms of freshwater cyanobacteria. In Cyanobacteria: An Economic Perspective; Sharma, N.K., Rai, A.K., Stal, L.J., Eds.; John Wiley & Sons: Chichester, UK, 2013; Volume 1, pp. 245–256.
8. Solheim, A.L.; Rekolainen, S.; Moe, S.J.; Carvalho, L.; Philips, G.; Ptacnik, R.; Penning, W.E.; Tóth, L.G.; O’Toole, C.; Schartau, A.K.; et al. Ecological threshold responses in European lakes and their applicability for the Water Framework Directive (WFD) implementation: Synthesis of lakes results from the REBECCA project. Aquat. Ecol. **2008**, 42, 317–334.
9. Francis, G. Poisonous Australian Lake. Nature **1878**, 18, 11–12.
10. Carpenter, S.R.; Stanley, E.H.; Vander Zanden, M.J. State of the world’s freshwater ecosystems: Physical, chemical, and biological changes. Annu. Rev. Environ. Resour. **2011**, 36, 75–99.
11. Elliott, J.A. The seasonal sensitivity of cyanobacteria and other phytoplankton to changes in flushing rate and water temperature. Glob. Chang. Biol. **2010**, 16, 864–876.
12. Paerl, H.W.; Otten, T.G. Harmful cyanobacterial blooms: Causes, consequences, and controls. Microb. Ecol. **2013**, 65, 995–1010.
13. Wells, M.L.; Trainer, V.L.; Smayda, T.J.; Karlson, B.S.O.; Trick, C.G.; Kudela, R.M.; Ishikawa, A.; Bernard, S.; Wulff, A.; Anderson, D.M.; et al. Harmful algal blooms and climate change: Learning from the past and present to forecast the future. Harmful Algae **2015**, 49, 68–93.
14. Laspidou, C.; Kofinas, D.; Mellios, N.; Latinopoulos, D.; Papadimitriou, T. Investigation of factors affecting the trophic state of a shallow Mediterranean reconstructed lake. Ecol. Eng. **2017**, 103, 154–163.
15. Mellios, N.; Kofinas, D.; Laspidou, C.; Papadimitriou, T. Mathematical modeling of trophic state and nutrient flows of Lake Karla using the PCLake model. Environ. Process. **2015**, 2, 85–100.
16. Richardson, J.; Feuchtmayr, H.; Miller, C.; Hunter, P.D.; Maberly, S.C.; Carvalho, L. Response of cyanobacteria and phytoplankton abundance to warming, extreme rainfall events and nutrient enrichment. Glob. Chang. Biol. **2019**, 25, 3365–3380.
17. Beaulieu, M.; Pick, F.; Gregory-Eaves, I. Nutrients and water temperature are significant predictors of cyanobacterial biomass in a 1147 lakes data set. Limnol. Oceanogr. **2013**, 58, 1736–1746.
18. Moe, S.J.; Couture, R.M.; Haande, S.; Lyche Solheim, A.; Jackson-Blake, L. Predicting lake quality for the next generation: Impacts of catchment management and climatic factors in a probabilistic model framework. Water **2019**, 11, 1767.
19. Romo, S.; Soria, J.; Fernandez, F.; Ouahid, Y.; Baron-Sola, A. Water residence time and the dynamics of toxic cyanobacteria. Freshw. Biol. **2013**, 58, 513–522.
20. Paerl, H.W.; Fulton, R.S.; Moisander, P.H.; Dyble, J. Harmful freshwater algal blooms, with an emphasis on cyanobacteria. Sci. World J. **2001**, 1, 76–113.
21. Wood, S.A.; Prentice, M.J.; Smith, K.; Hamilton, D.P. Low dissolved inorganic nitrogen and increased heterocyte frequency: Precursors to Anabaena planktonica blooms in a temperate, eutrophic reservoir. J. Plankton Res. **2010**, 32, 1315–1325.
22. Noges, T.; Laugaste, R.; Noges, P.; Tonno, I. Critical N:P ratio for cyanobacteria and N2-fixing species in the large shallow temperate lakes Peipsi and Võrtsjärv, North-East Europe. Hydrobiologia **2008**, 599, 77–86.
23. Havens, K.E.; Phlips, E.J.; Cichra, M.F.; Li, B.L. Light availability as a possible regulator of cyanobacteria species composition in a shallow subtropical lake. Freshw. Biol. **1998**, 39, 547–556.
24. Scheffer, M.; Rinaldi, S.; Gragnani, A.; Mur, L.R.; Van Nes, E.H. On the dominance of filamentous cyanobacteria in shallow, turbid lakes. Ecology **1997**, 78, 272–282.
25. Carey, C.C.; Ibelings, B.W.; Hoffmann, E.P.; Hamilton, D.P.; Brookes, J.D. Eco-physiological adaptations that favour freshwater cyanobacteria in a changing climate. Water Res. **2012**, 46, 1394–1407.
26. Brookes, J.D.; Carey, C.C. Resilience to blooms. Science **2011**, 334, 46–47.
27. Kosten, S.; Huszar, V.L.; Bécares, E.; Costa, L.S.; Van Donk, E.; Hansson, L.A.; Jeppesen, E.; Kruk, C.; Lacerot, G.; Mazzeo, N.; et al. Warmer climates boost cyanobacterial dominance in shallow lakes. Glob. Chang. Biol. **2012**, 18, 118–126.
28. Richardson, J.; Miller, C.; Maberly, S.C.; Taylor, P.; Globevnik, L.; Hunter, P.; Jeppesen, E.; Mischke, U.; Moe, S.J.; Pasztaleniec, A.; et al. Effects of multiple stressors on cyanobacteria abundance vary with lake type. Glob. Chang. Biol. **2018**, 24, 5044–5055.
29. Psilovikos, A. Water Resources; Tziolas: Thessaloniki, Greece, 2020; ISBN 978-960-602-0. (In Greek)
30. Karamoutsou, L.; Psilovikos, A. The use of Artificial Neural Network in Water Quality Prediction in Lake Kastoria, Greece. In Proceedings of the 14th Conference of the Hellenic Hydrotechnical Association (HHA), Volos, Greece, 16–17 May 2019; pp. 882–889.
31. Rigosi, A.; Carey, C.C.; Ibelings, B.W.; Brookes, J.D. The interaction between climate warming and eutrophication to promote cyanobacteria is dependent on trophic state and varies among taxa. Limnol. Oceanogr. **2014**, 59, 99–114.
32. Taranu, Z.E.; Zurawell, R.W.; Pick, F.; Gregory-Eaves, I. Predicting cyanobacterial dynamics in the face of global change: The importance of scale and environmental context. Glob. Chang. Biol. **2012**, 18, 3477–3490.
33. Wei, B.; Sugiura, N.; Maekawa, T. Use of artificial neural network in the prediction of algal blooms. Water Res. **2001**, 35, 2022–2028.
34. Recknagel, F.; French, M.; Harkonen, P.; Yabunaka, K.I. Artificial neural network approach for modelling and prediction of algal blooms. Ecol. Model. **1997**, 96, 11–28.
35. World Health Organization. Guidelines for Safe Recreational Waters: Coastal and Fresh Waters; Chapter 8; WHO Publishing: Geneva, Switzerland, 2003; Volume 1, pp. 136–158.
36. Bláha, L.; Babica, P.; Maršálek, B. Toxins produced in cyanobacterial water blooms-toxicity and risks. Interdiscip. Toxicol. **2009**, 2, 36–41.
37. Moe, S.J.; Schmidt-Kloiber, A.; Dudley, B.J.; Hering, D. The WISER way of organising ecological data from European rivers, lakes, transitional and coastal waters. Hydrobiologia **2013**, 704, 11–28.
38. Hering, D.; Borja, A.; Carvalho, L.; Feld, C.K. Assessment and recovery of European water bodies: Key messages from the WISER project. Hydrobiologia **2013**, 704, 1–9.
39. Schmidt-Kloiber, A.; Moe, S.J.; Dudley, B.; Strackbein, J.; Vogl, R. The WISER metadatabase: The key to more than 100 ecological datasets from European rivers, lakes and coastal waters. Hydrobiologia **2013**, 704, 29–38.
40. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science **2015**, 349, 255–260.
41. Revelle, W. psych: Procedures for Personality and Psychological Research, Northwestern University, Evanston, Illinois, USA. 2017. Available online: https://CRAN.R-project.org/package=psych/ (accessed on 25 November 2019).
42. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available online: https://www.R-project.org/ (accessed on 20 November 2019).
43. Carvalho, L.; McDonald, C.; De Hoyos, C.; Mischke, U.; Phillips, G.; Borics, G.; Poikane, S.; Skjelbred, B.; Solheim, A.L.; Van Wichelen, J.; et al. Sustaining recreational quality of European lakes: Minimizing the health risks from algal blooms through phosphorus control. J. Appl. Ecol. **2013**, 50, 315–323.
44. Ghaffar, S.; Stevenson, R.J.; Khan, Z. Cyanobacteria Dominance in Lakes and Evaluation of Its Predictors: A Study of Southern Appalachians Ecoregion, USA. In MATEC Web of Conferences. EDP Sci. **2016**, 60, 02001.
45. Wickham, H.; Averick, M.; Bryan, J.; Chang, W.; D’Agostino McGowan, L.; François, R.; Grolemund, G.; Hayes, A.; Henry, L.; Hester, J.; et al. Welcome to the Tidyverse. J. Open Source Softw. **2019**, 4, 1686.
46. Kuhn, M.; Wing, J.; Wenston, S.; Williams, A.; Keefer, C.; Engelhardt, A.; Cooper, T.; Mayer, Z.; Kenkel, B.; R Core Team; et al. Caret: Classification and regression training. R Package Version **2016**, 6, 78.
47. Lumley, T.; Miller, A. Leaps: Regression subset selection. R Package Version **2009**, 2, 2366.
48. Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2008; pp. 1–496.
49. Rosseel, Y. Lavaan: An R package for structural equation modeling and more. Version 0.5–12 (BETA). J. Stat. Softw. **2012**, 48, 1–36.
50. Rokach, L. Ensemble-based classifiers. Artif. Intell. Rev. **2010**, 33, 1–39.
51. Rokach, L.; Maimon, O. Top-down induction of decision trees classifiers-a survey. IEEE Trans. Syst. Man Cybern. Part C **2005**, 35, 476–487.
52. Neath, R.C.; Johnson, M.S. Discrimination and Classification. In International Encyclopedia of Education, 3rd ed.; Baker, E., McGaw, B., Peterson, P., Eds.; Elsevier Ltd.: London, UK, 2010; Volume 1, pp. 135–141.
53. Hsu, C.W.; Lin, C.J. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. **2002**, 13, 415–425.
54. Mao, W.; Wang, F.Y. Cultural Modeling for Behavior Analysis and Prediction. In New Advances in Intelligence and Security Informatics, 1st ed.; Academic Press: Waltham, MA, USA, 2012; pp. 91–102.
55. Wickham, H. ggplot2: Elegant Graphics for Data Analysis, 1st ed.; Springer: New York, NY, USA, 2016.
56. Therneau, T.; Atkinson, B.; Ripley, B. Rpart: Recursive Partitioning and Regression Trees, R Package Version 4.1-13. 2018. Available online: https://CRAN.R-project.org/package=rpart/ (accessed on 10 January 2020).
57. Meyer, D.; Dimitriadou, E.; Hornik, K.; Weingessel, A.; Leisch, F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R Package Version 1.7-3. 2019. Available online: https://CRAN.R-project.org/package=e1071 (accessed on 10 January 2020).
58. Auguie, B. gridExtra: Miscellaneous Functions for "Grid" Graphics. R Package Version 2.3. 2017. Available online: https://CRAN.R-project.org/package=gridExtra (accessed on 10 January 2020).
59. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News **2002**, 2, 18–22.
60. Dolman, A.M.; Rücker, J.; Pick, F.R.; Fastner, J.; Rohrlack, T.; Mischke, U.; Wiedner, C. Cyanobacteria and cyanotoxins: The influence of nitrogen versus phosphorus. PLoS ONE **2012**, 7, e38757.
61. Shakhari, S.; Banerjee, I. A multi-class classification system for continuous water quality monitoring. Heliyon **2019**, 5, e01822.
62. Leisch, F.; Dimitriadou, E. mlbench: Machine Learning Benchmark Problems. R Package Version 2.1-1. 2010. Available online: https://cran.r-project.org/web/packages/mlbench/index.html (accessed on 15 January 2020).
63. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Science & Business Media: New York, NY, USA, 2017.
64. Landis, J.R.; Koch, G.G. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics **1977**, 33, 363–374.

**Figure 2.** Spatial distribution of the biochemical and geomorphological variables across the subset of the WISER lakes database: (**a**) cyanobacteria biomass (CBB); (**b**) Chl-a; (**c**) total nitrogen (TN); (**d**) total phosphorus (TP); (**e**) mean air temperature; (**f**) mean depth; (**g**) surface area; (**h**) elevation. All variables correspond to the record with the maximum CBB concentration for each lake.

**Figure 3.** Box plots of the biochemical and geomorphological variables across the subset of the WISER lakes database: (**a1**) CBB; (**a2**) CBB after outlier filtering; (**b**) Chl-a; (**c**) TN; (**d**) TP; (**e**) mean air temperature; (**f**) mean depth; (**g**) surface area; (**h**) elevation. Box plots (**b**) to (**h**) present data after outlier filtering.

**Figure 4.** Correlation matrix of the response variable CBB against the candidate explanatory variables for the multiple linear regression and machine learning models. The upper right panel displays the correlation coefficient (r) of each variable pair, the lower left panel displays the scatter plot with a smoothed regression curve, and the diagonal shows the distribution of each variable.

**Figure 5.** Schematic representation of path analysis scenarios predicting CBB using the main variables TN, TP, MeanATemp and Chl-a. Standardized path coefficients and relevant statistics are shown for each scenario. (**a**) TN, TP and MeanATemp have indirect effects while Chl-a has a direct effect on CBB. (**b**) Similar to (**a**) with the addition of TN having a direct effect on CBB. (**c**) Similar to (**a**) with the addition of MeanATemp having a direct effect on CBB.

**Figure 6.** Confusion matrices for Decision Tree (DT), k-Nearest Neighbor (k-NN), Support Vector Machine (SVM) and Random Forest (RF). The performance indices (Accuracy and Kappa) are shown on top of each matrix. (**a**) Results of the DT algorithm, (**b**) results of the k-NN algorithm, (**c**) results of the SVM algorithm and (**d**) results of the RF algorithm.
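The two indices reported with each confusion matrix, Accuracy and Kappa, can be derived from the matrix counts as in the following sketch. It is an illustrative Python version that assumes the unweighted form of Cohen's kappa; it is not the study's R code, and the function name and the matrix values are hypothetical.

```python
def accuracy_and_kappa(matrix):
    """Overall accuracy and unweighted Cohen's kappa of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical 3-class matrix (Low, Medium, High), illustration only.
confusion = [[80, 5, 0],
             [4, 8, 1],
             [1, 0, 1]]
accuracy, kappa = accuracy_and_kappa(confusion)
```

Kappa discounts the agreement that would arise by chance, so with strongly unbalanced classes it can be much lower than accuracy, as in this hypothetical example.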

**Figure 7.** Custom tuning of Random Forest parameters in R. This diagram illustrates how Random Forest performs in terms of accuracy for different combinations of the parameters under tuning (mtry and ntree) and for 10-fold repeated cross-validation.

**Figure 8.** Confusion matrices for the RF algorithm under different combinations of tuning parameters, evaluated on the test subset. The performance indices (Accuracy and Kappa) are shown on top of each matrix. (**a**) RF results for mtry = 2 and ntree = 1000 and (**b**) RF results for mtry = 2 and ntree = 2000.

**Figure 9.** Analysis of the confusion matrix results for the four machine learning algorithms: DT, k-NN, SVM and RF. Results are presented for the three risk level classes (Low, Medium and High) and for three relevant metrics (precision, recall and specificity).
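The three per-class metrics named in this caption can be obtained from a confusion matrix by treating each class in turn as "positive" and the other two as "negative". The sketch below is an illustrative Python version, not the study's R code. It assumes that rows hold predicted classes and columns hold observed classes; the function names and the matrix values are hypothetical.

```python
def _ratio(numerator, denominator):
    # Return NaN instead of failing when a class is never predicted or observed.
    return numerator / denominator if denominator else float("nan")


def per_class_metrics(matrix, labels):
    """One-vs-rest precision, recall and specificity for each class.

    Assumes matrix[i][j] counts cases predicted as class i and observed as class j.
    """
    total = sum(sum(row) for row in matrix)
    results = {}
    for idx, label in enumerate(labels):
        tp = matrix[idx][idx]
        fp = sum(matrix[idx]) - tp                 # predicted this class, observed another
        fn = sum(row[idx] for row in matrix) - tp  # observed this class, predicted another
        tn = total - tp - fp - fn
        results[label] = {
            "precision": _ratio(tp, tp + fp),
            "recall": _ratio(tp, tp + fn),
            "specificity": _ratio(tn, tn + fp),
        }
    return results


# Hypothetical 3-class matrix (Low, Medium, High), illustration only.
confusion_matrix = [[80, 5, 0],
                    [4, 8, 1],
                    [1, 0, 1]]
class_metrics = per_class_metrics(confusion_matrix, ["Low", "Medium", "High"])
```

For a given class, one minus the specificity is the false-positive rate, which is the quantity usually meant by the probability of a false alarm for that risk level.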

**Table 1.** Lake dataset geographical distribution, along with the corresponding number of observations, sampling months and time period.

Country | Number of Lakes | Median/Mean No. of Observations Per Lake | Sampling Month | Time Period |
---|---|---|---|---|
UK | 81 | 3/3.39 | June to October | 2007–2008 |
Denmark | 20 | 19/17.06 | June to September | 1989–2012 |
Norway | 408 | 4/5.41 | May to October | 1988–2009 |
Sweden | 77 | 4.5/6.71 | May to October | 2001–2009 |
Finland | 217 | 1/3.42 | May to October | 1993–2009 |
Lithuania | 19 | 1/2.47 | June to September | 2011–2012 |

Variable | Minimum | Maximum | Mean | Median |
---|---|---|---|---|
Latitude | 50.078 | 69.897 | 59.866 | 60.341 |
Elevation (m a.s.l.) | −4 | 1057 | 126 | 79.9 |
Surface area (km^{2}) | 0.019 | 1377 | 39.1 | 2.1 |
Mean depth (m) | 0.096 | 239 | 12.1 | 6.6 |
Max depth (m) | 1 | 516 | 35.79 | 22 |
Mean air temperature (°C) | −0.3 | 21.1 | 13.6 | 14.1 |
Max air temperature (°C) | 9.8 | 34.1 | 23.58 | 23.8 |
Total nitrogen (μg/L) | 47 | 6841.7 | 656.9 | 435 |
Total phosphorus (μg/L) | 0.5 | 1270 | 28.86 | 12 |
TN/TP | 0.92 | 565 | 34.5 | 43.3 |
Chlorophyll-a (μg/L) | 0 | 310.1 | 10.89 | 4.4 |
Cyanobacteria biomass (mg/L) | 0 | 71.5 | 0.642 | 0.00844 |

**Table 3.** The three categories of risk to human health for the recreational use of lakes, according to cyanobacteria biomass.

Risk Category | Limits According to Cyanobacterial Biomass |
---|---|
Low | CBB ≤ 2 mg/L |
Medium | 2 mg/L < CBB ≤ 10 mg/L |
High | CBB > 10 mg/L |

Linear Model | R^{2} | BIC | AIC | RMSE |
---|---|---|---|---|
All lakes | | | | |
CBB = −0.32 + 27.33 × Chl-a | 0.33 | 20,662 | 20,643 | 2.698 |
CBB = −0.45 + 25.71 × Chl-a + 2.1 × TN | 0.33 | 20,651 | 20,626 | 2.697 |
Shallow lakes | | | | |
CBB = 0.03 + 24.99 × Chl-a | 0.28 | 4704 | 4690 | 5.667 |
CBB = −0.56 + 21.44 × Chl-a + 5.83 × TN | 0.27 | 4698 | 4680 | 5.676 |
CBB = −2.26 + 21.86 × Chl-a + 6.16 × TN + 250.2 × MeanDep | 0.27 | 4695 | 4672 | 5.648 |
CBB = −2.64 + 22.26 × Chl-a + 9.63 × TN + 268.13 × MeanDep − 18.13 × TN/TP | 0.27 | 4694 | 4669 | 5.643 |
Deep lakes | | | | |
CBB = −0.4 + 29.48 × Chl-a | 0.43 | 11,695 | 11,676 | 1.294 |
CBB = −0.4 + 26.98 × Chl-a + 35.67 × TN/TP | 0.43 | 11,598 | 11,568 | 1.291 |
CBB = −0.21 + 26.2 × Chl-a − 3.47 × TN + 86.68 × TN/TP | 0.44 | 11,594 | 11,565 | 1.278 |

**Table 5.** Results of the path analysis for the four competing scenarios tested. For each direct relationship, the z-value and Pr (>|z|) are presented.

Scenario | z-Value | Pr (>\|z\|) |
---|---|---|
1 | | |
Chl-a ~ TN | 21.985 | 0.000 |
Chl-a ~ TP | 24.122 | 0.000 |
Chl-a ~ MeanATemp | 8.346 | 0.000 |
CBB ~ Chl-a | 39.611 | 0.000 |
2 | | |
Chl-a ~ TN, TP, MeanATemp | As in Scenario 1 | |
CBB ~ Chl-a | 32.946 | 0.000 |
CBB ~ TN | 4.389 | 0.000 |
3 | | |
Chl-a ~ TN, TP, MeanATemp | As in Scenario 1 | |
CBB ~ Chl-a | 34.31 | 0.000 |
CBB ~ TP | 0.929 | 0.353 |
4 | | |
Chl-a ~ TN, TP, MeanATemp | As in Scenario 1 | |
CBB ~ Chl-a | 38.327 | 0.000 |
CBB ~ MeanATemp | 1.736 | 0.083 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mellios, N.; Moe, S.J.; Laspidou, C.
Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern European Lakes. *Water* **2020**, *12*, 1191.
https://doi.org/10.3390/w12041191
