Machine Learning Approaches for Predicting Health Risk of Cyanobacterial Blooms in Northern European Lakes

Cyanobacterial blooms are considered a major threat to global water security with documented impacts on lake ecosystems and public health. Given that cyanobacteria possess highly adaptive traits that favor them to prevail under different and often complicated stressor regimes, predicting their abundance is challenging. A dataset from 822 Northern European lakes is used to determine which variables better explain the variation of cyanobacteria biomass (CBB) by means of stepwise multiple linear regression. Chlorophyll-a (Chl-a) and total nitrogen (TN) provided the best modelling structure for the entire dataset, while for subsets of shallow and deep lakes, Chl-a, mean depth, TN and TN/TP explained part of the variance in CBB. Path analysis was performed and corroborated these findings. Finally, CBB was translated to a categorical variable according to risk levels for human health associated with the use of lakes for recreational activities. Several machine learning methods, namely Decision Tree, K-Nearest Neighbors, Support-vector Machine and Random Forest, were applied showing a remarkable ability to predict the risk, while Random Forest parameters were tuned and optimized, achieving a 95.81% accuracy, exceeding the performance of all other machine learning methods tested. A confusion matrix analysis is performed for all machine learning methods, identifying the potential of each method to correctly predict CBB risk levels and assessing the extent of false alarms; random forest clearly outperforms the other methods with very promising results.


Introduction
Cyanobacterial harmful algal blooms pose a major concern for freshwater and coastal ecosystems worldwide. The occurrence, frequency and duration of cyanobacterial blooms has increased significantly in recent years as eutrophication, a state of high primary productivity, is appearing in more and more lake ecosystems [1][2][3]. Cyanobacteria may potentially produce a variety of toxins (cyanotoxins), responsible for a wide range of adverse environmental and health impacts. As a result, authorities are occasionally forced to block the intended use of lakes in order to protect public health [4][5][6]; indeed, surface freshwater uses, such as water supply, irrigation, fishing and recreation, can be critically affected by the excessive presence of cyanobacteria and raise health safety issues [7]. Moreover, cyanobacteria biomass is a component of the classification system of the ecological status for lakes according to the European Water Framework Directive [8], since a high abundance of cyanobacteria can reduce aquatic biodiversity and food quality for zooplankton. The identification of synergistic factors favoring relative risk to human health associated with the level of cyanobacterial abundance in fresh waters, facilitating water management.
Validation is an important component of the modelling procedure; however, many of the empirical models presented in the literature lack a consistent validation methodology, which adds controversy on whether they are capable to perform with similar efficiency on different datasets out of the training ones. Training and validating the models used to predict CBB would empower their capability and effectiveness to adequately perform on a different dataset. In this article, we address this gap and perform model validation of the reported results.
In this study, we use a large-scale monitoring dataset compiled from 822 lakes belonging to six Northern European countries to (i) predict cyanobacterial biomass concentration from a series of possible explanatory variables and check the predictive power of multiple regression on the entire dataset and on subsets of shallow and deep lakes by following a 10-fold cross-validation procedure; (ii) investigate through path analysis how nutrients, chlorophyll-a and air temperature affect cyanobacteria by evaluating all possible direct and indirect pathways; and (iii) calibrate, validate and/or optimize a series of machine learning algorithms, namely Decision Tree (DT), K-Nearest Neighbors (k-NN), Support-vector Machine (SVM) and Random Forest (RF), towards the prediction of risk levels to human health associated with the use of lakes for recreational activities.

Materials and Methods
The methodology followed in this study consists of five steps. In Step 1, the dataset is thoroughly described, followed by spatial visualization of variables and box plots for CBB and the main candidate explanatory variables considered in the analysis. In Step 2, a correlation matrix is constructed, and pairwise linear correlation coefficients of all variables are presented. In Step 3, we perform stepwise multiple linear regression in order to define which variables explain most of the variation of CBB; the analysis is done on the whole dataset and on shallow and deep lakes subsets. In Step 4, we continue with path analysis and examine all possible direct and indirect effects of the predictor variables on CBB. Finally, in Step 5, we implement four different machine learning methods to predict the risk to human health associated with the level of cyanobacterial abundance for recreational water use. For this analysis, continuous CBB data are transformed to categorical classifying data in levels that correspond to three distinct health risk levels (low-medium-high), as defined by the WHO for recreational use [35]. In Figure 1, all five steps describing the sequence of the methodology are presented.
Water 2020, 12, x FOR PEER REVIEW 3 of 20 novelty of this article is that it focusses on predicting the relative risk to human health associated with the level of cyanobacterial abundance in fresh waters, facilitating water management. Validation is an important component of the modelling procedure; however, many of the empirical models presented in the literature lack a consistent validation methodology, which adds controversy on whether they are capable to perform with similar efficiency on different datasets out of the training ones. Training and validating the models used to predict CBB would empower their capability and effectiveness to adequately perform on a different dataset. In this article, we address this gap and perform model validation of the reported results.
In this study, we use a large-scale monitoring dataset compiled from 822 lakes belonging to six Northern European countries to (i) predict cyanobacterial biomass concentration from a series of possible explanatory variables and check the predictive power of multiple regression on the entire dataset and on subsets of shallow and deep lakes by following a 10-fold cross-validation procedure; (ii) investigate through path analysis how nutrients, chlorophyll-a and air temperature affect cyanobacteria by evaluating all possible direct and indirect pathways; and (iii) calibrate, validate and/or optimize a series of machine learning algorithms, namely Decision Tree (DT), K-Nearest Neighbors (k-NN), Support-vector Machine (SVM) and Random Forest (RF), towards the prediction of risk levels to human health associated with the use of lakes for recreational activities.

Materials and Methods
The methodology followed in this study consists of five steps. In Step 1, the dataset is thoroughly described, followed by spatial visualization of variables and box plots for CBB and the main candidate explanatory variables considered in the analysis. In Step 2, a correlation matrix is constructed, and pairwise linear correlation coefficients of all variables are presented. In Step 3, we perform stepwise multiple linear regression in order to define which variables explain most of the variation of CBB; the analysis is done on the whole dataset and on shallow and deep lakes subsets. In Step 4, we continue with path analysis and examine all possible direct and indirect effects of the predictor variables on CBB. Finally, in Step 5, we implement four different machine learning methods to predict the risk to human health associated with the level of cyanobacterial abundance for recreational water use. For this analysis, continuous CBB data are transformed to categorical classifying data in levels that correspond to three distinct health risk levels (low-medium-high), as defined by the WHO for recreational use [35]. In Figure 1, all five steps describing the sequence of the methodology are presented.

Dataset
The original dataset used in this study, which contains a wide range of hydrological, biological, chemical and geomorphological features of European lakes, was extracted from the central database [37] of the EU-funded project WISER, which aimed at developing integrative systems to assess the ecological status of water bodies according to the Water Framework Directive (WFD) [38]. Under the umbrella of the project, monitoring data from rivers, lakes and coastal waters were compiled from several European research institutions and environmental agencies, to analyze the responses of biological quality elements to physico-chemical stressors under varying environmental conditions. The extracted dataset contained data on the abovementioned features for 1851 lakes, covering the growing season period (from the 5th to 10th month) from 1980 to 2009. However, the dataset exhibited a great heterogeneity in terms of the number and frequency of the observations and monitored features for each lake. Consequently, each lake contributed a variable number of observations and features, for a variable time period and for different months during the growing season. In order to obtain the same number of monitored features per lake, a preliminary screening was conducted that produced an optimized dataset with data for as many features as possible. The resulting dataset consists of 822 lakes from six Northern European countries, namely, the UK, Denmark, Norway, Sweden, Finland and Lithuania. The data include observations for cyanobacteria biomass (CBB, mg/L), chlorophyll-a concentration (Chl-a, µg/L), total nitrogen concentration (TN, µg/L), total phosphorus concentration (TP, µg/L), mean air temperature (MeanATemp, • C), max air temperature (MaxATemp, • C) and the geomorphological characteristics of the lakes-elevation (Elev, m above sea level (a.s.l.)), surface area (SurArea, km 2 ), mean lake depth (MeanDep, m) and maximum lake depth (MaxDep, m). Table 1 summarizes the number of lakes per country included in the dataset, the median and mean number of observations per lake (variable), the sampling months and the corresponding sampling years. As stated above, the dataset is temporally unbalanced, reflecting the different monitoring schemes of the countries that provided data to WISER [39]. For some lakes, there may be a complete dataset covering all sampling months with multiple measurements, while for others we may have a single measurement for the whole time period. To deal with this temporal heterogeneity, there were several options, such as averaging all values to one per lake, selecting only the latest observation per lake, or using a mixed-effects hierarchical model. In this research, no temporal aggregation was applied and all values per lake were used with no weighing/selection. This was done because any type of aggregation would shrink the size of the dataset, a factor that greatly impacts the performance of machine learning algorithms, rendering small datasets unsuitable [40]. Across the Northern European lakes included in the dataset, there is a wide variation in environmental variables. The large geographic extent covers various climatic conditions with the mean air temperature ranging between −0.33 and 21.1 • C. This is a typical temperature variation within the temperate zone and especially within the Atlantic and Continental types of climate. Although the cyanobacteria biomass is low in most lakes, it ranges between 0 and 71.47 mg/L under different combinations of biochemical and geomorphological conditions. The distribution of variables across the lakes in this study is summarized in Table 2, which shows the minimum, maximum, median and mean values for each lake variable.  Figure 2 shows the variation in the observed values of the main variables considered for all the lakes under study. Since each lake has one or more observations for each variable, the record with the highest CBB value has been chosen for mapping in the figure. Therefore, for each of the 822 lakes, the visualization was driven by the maximum CBB observed in each lake and the corresponding observations for the remaining variables. The rest of the variables mapped are the ones that correspond to that maximum CBB record for each lake. The analysis was conducted with ArcGIS software.  Exploratory data analysis was carried out and box plots for the dependent (CBB) and the explanatory variables were constructed for displaying the data distribution ( Figure 3). Exploratory data analysis was carried out and box plots for the dependent (CBB) and the explanatory variables were constructed for displaying the data distribution ( Figure 3). All box plots are displayed by filtering the outliers except for CBB ( Figure 3a1). The latter shows that the main bulk of observations is concentrated around zero-indicating that most lakes have very low cyanobacteria biomass concentrations. In Figure a2, CBB is shown after the removal of higher values, with most values ranging between 0 and 0.07. All box plots are displayed by filtering the outliers except for CBB ( Figure 3a1). The latter shows that the main bulk of observations is concentrated around zero-indicating that most lakes have very low cyanobacteria biomass concentrations. In Figure 3a2, CBB is shown after the removal of higher values, with most values ranging between 0 and 0.07.

Explorative Analysis
Prior to the modeling analysis, the set of explanatory variables was processed to establish a unified scale of 0-1 for all features considered. For this purpose, the "min-max" normalization formula was applied: where Z ij is the normalized value for the ith observed value of the jth variable (feature), X ij is the ith observed value of the jth variable, and X jmax and X jmin are the maximum and minimum observed values of the jth variable, respectively. The next step was to investigate how CBB is correlated with all the candidate predictor variables ( Table 2). In this regard, a correlation matrix was produced under the Pearson's correlation coefficient method using the "psych" package [41] and the "pairs.panels" function in the programming environment R version 3.6.2 [42]. The relationships between the lake monitoring, meteorological and geomorphological data were examined by bivariate linear correlation and scatter plot smoothing, while the distribution of each variable was also investigated. In the upper right panel of Figure 4, pairwise linear correlation coefficients of the variables are illustrated, while the lower left panel displays the data as scatter plots with smooth regression curves. According to the correlation matrix in Figure 4, the explanatory variables with the highest linear correlation compared with CBB were Chl-a (r = 0.52), TN (r = 0.3) and TP (r = 0.26), which indicates that the correlation of CBB with Chl-a is the highest; in turn, the correlations of CBB with TN and TP are comparable, and even though lower than those with Chl-a, are still significant.

Stepwise Multiple Linear Regression
Stepwise multiple linear regression was applied to the whole dataset containing all the lakes, to the subset containing shallow lakes (mean depth less or equal to 3 m) and to the subset containing deep lakes (mean depth above 3 m), in order to decide which subset of predictor variables results in the best model that explains the biggest variation in CBB in each case. Even though the response of

Stepwise Multiple Linear Regression
Stepwise multiple linear regression was applied to the whole dataset containing all the lakes, to the subset containing shallow lakes (mean depth less or equal to 3 m) and to the subset containing deep lakes (mean depth above 3 m), in order to decide which subset of predictor variables results in the best model that explains the biggest variation in CBB in each case. Even though the response of cyanobacteria to eutrophication is typically non-linear [8,43], multiple linear regression models are commonly used in predicting cyanobacteria abundance in several studies, which in many cases outperform non-linear ones in predictability and are much less complicated [17,28,44]. Furthermore, cyanobacteria abundance is far from normally distributed, which can be confirmed with the large difference between the mean and median in Table 2 and the histogram in Figure 4. The distribution has a strong positive skewness and normalization is not expected to be very helpful; as a result, assumptions underlying linear regression are violated. This is not uncommon and is tolerated because of the simplicity of multiple linear models and their satisfying predictability.
Among the predictor variables (Figure 4), max depth and max air temperature showed high collinearity with mean depth (0.94) and mean air temperature (0.75), respectively, and were omitted from the final set considered in our analysis. The final set of the predictor variables includes latitude (Lat), elevation (Elev), surface area (SurArea), mean depth (MeanDep), mean air temperature (MeanATemp), total nitrogen (TN), total phosphorus (TP), total nitrogen-to-total phosphorus ratio (TN/TP) and chlorophyll-a (Chl-a). The stepwise regression consists of iteratively adding and removing predictors, in the predictive model, in order to find the subset of variables in the dataset resulting in the best performing model; that is, a model that lowers the prediction error. In our analysis we performed backward selection or backward elimination, which starts with all predictors, iteratively removing the least contributive predictors, and stops when the model includes the predictors that are statistically significant. The Akaike Information Criterion (AIC) was used as an estimator of out-of-sample prediction error and thereby determined the best predictive model produced by the backward selection method. The performance criteria R-squared, Schwarz Bayesian Criterion (BIC) and RMSE were also calculated to compare the produced models. The dataset used in each case-the all-lakes dataset, as well as the shallow-lake and deep-lake subsets-was split in training and testing sets by the 10-fold cross-validation method in order to better assess the prediction strength of each model and to prevent overfitting. The stepwise multiple linear regression analysis was conducted in R by using the "tidyverse", "caret", "leaps" and "MASS" packages [45][46][47][48].

Path Analysis
Path analysis was used to decompose correlations into different pieces for interpretation of effects (e.g., how does TN, TP and MeanATemp influence CBB, directly or indirectly through Chl-a). Although cyanobacteria biomass is a component of total algal biomass, which is also measured by Chl-a, Chl-a is often used as a predictor for CBB [19]. With this type of causal modeling, we can test the cause and effect relationship of various variables that are theoretically assumed to be causally related; we can now see whether and to what degree this assumption is supported by the data. The resulting models show the causal mechanisms and the direct and indirect effects of the explanatory variables on CBB. In our case, Chl-a, TN, TP along with MeanATemp were used to construct models under various scenarios that explore all possibilities, with different combinations of direct and indirect effects of the predictor variables on CBB. All variables were chosen as the strongest predictors resulting from the literature [15], but also from the stepwise multiple regression performed. Validation of the produced models was carried out by examining the z-values and Pr (>|z|); the former is computed as the test statistic for the hypothesis test that the true corresponding regression coefficient β of variable X (Chl-a, TN, TP and MeanATemp) in predicting variable Y (CBB) is 0. The latter is the significance level of the hypothesis test, with the value of 0.05 taken as the threshold. In other words, with path analysis, we determined whether a variable matters or not, i.e., whether it has a significant relationship with variable Y. A series of permutations are explored to define the importance of each variable in predicting Water 2020, 12, 1191 9 of 19 either Chl-a or CBB. Overall statistics, such as AIC, BIC, Root Mean Square Error of Approximation (RMSEA) and Chi-square indices are also used to assess goodness of fit in each case. All statistical calculations and representations were done in R with the "lavaan" package [49].

Machine Learning Methods
The prediction of risk to human health associated with the level of cyanobacterial abundance for recreational water use was conducted by means of a series of machine learning algorithms. The WHO [35] has published a guideline for safe practice in managing recreational water according to which there is a relatively low probability of adverse health effects when cyanobacterial cell count is less than or equal to 20,000 cells/mL; a medium or moderate probability when cyanobacterial cell count ranges from 20,000 cells/mL to 100,000 cells/mL and a high probability when that cell count is above 100,000 cells/mL. Since the dataset includes concentrations and not cell counts, the latter had to be transformed to cyanobacterial biomass, to enable the classification of data in the WHO data ranges. Considering that the mean cyanobacterial cell weighs approximately 10 −7 mg we converted cyanobacterial cell density to biomass [43]. Table 3 presents the three risk categories for recreational use, according to cyanobacteria biomass. Table 3. The three risks to human health categories for the recreational use of lakes according to cyanobacteria biomass.

Risk Category Limits According to Cyanobacterial Biomass
Low CBB ≤ 2 mg/L Medium 2 mg/L< CBB ≤ 10 mg/L High CBB > 10 mg/L By matching CBB to the three risk categories, the continuous dependent variable was transformed to a categorical one representing three levels-low, moderate and high. Subsequently, the whole dataset was divided into training and testing subsets following an 80% to 20% ratio, respectively. Four machine learning algorithms were applied in order to determine which algorithm has better efficiency in predicting the risk categories with the input parameters being the eleven explanatory variables presented in Table 2 and in the correlation matrix in Figure 4. Since machine learning algorithms are based on the simple idea of "the wisdom of the crowd" and the aggregation of the results of multiple predictors gives better predictions, all explanatory variables were used [50]. The machine learning methods used were Decision Tree (DT), K-Nearest Neighbors (k-NN), Support-vector Machine (SVM) and Random Forest (RF), which are briefly described as follows: • DT is a supervised machine learning technique for inducing a decision tree from training data. A decision tree, also referred to as a classification tree, is a flowchart-like diagram that shows the various outcomes from a series of decisions. Practically, it is the mapping of observations about an item to conclusions about its target value [51]. • k-NN is a relatively simple approach to classification that is completely nonparametric. Given a point x 0 that one wishes to classify into one of the K groups, the algorithm finds the k observed data points that are nearest to x 0 . The classification rule is to assign x 0 to the population that has the most observed data points out of the k-nearest neighbors. Points for which there is no majority are either classified to one of the majority populations at random or left unclassified [52]. • SVM is an algorithm that classifies data by determining the optimal hyperplane that separates observations according to their class labels. The central concept of this method is to accommodate classes that are separable by linear and non-linear class boundaries [53]. • RF is a classifier algorithm that evolved from decision trees. It collects the classifications and chooses the most voted prediction as the result. RFs sample data from the original dataset and a subset of features is randomly selected from the optional features to grow the tree at each node. The strength of the RFs relies on the capability to enable a large number of weak or weakly correlated classifiers to form a strong classifier [54].

Identifying Variables Explaining CBB Variation
By applying stepwise linear regression with backward elimination to the original dataset and to the subsets of shallow and deep lakes with AIC as performance indicator, the following results were obtained. Regarding the all-lakes dataset, the best overall model structure that explains the most variation in CBB is the one including Chl-a and TN as predictors (R 2 = 0.33). Although Chl-a, as a single predictor, showed a proportional ability to capture the variation in CBB (R 2 = 0.33), when Chl-a is combined with TN, the overall performance of the model exceeds that of Chl-a, as indicated by the performance criteria summarized in Table 4 (BIC, AIC and RMSE). In the subgroup of shallow lakes, stepwise regression resulted in a four-variable linear model, including Chl-a, TN, MeanDep and TN/TP as best predictors according to the AIC criterion (R 2 = 0.27). Similarly to the all lakes dataset, although the best single-variable linear model (Chl-a) and the best two-variable (Chl-a and TN) and three-variable models (Chl-a, TN and MeanDep) had similar R-square values, their overall performance was weaker compared to the four-variable model. Furthermore, the predictive power of the best linear model for the subgroup of shallow lakes was weaker compared to the model that represents the subgroup of the original dataset for all lakes. Finally, in the subgroup of deep lakes, the model that better predicted the variation of the CBB included Chl-a, TN and TN/TP (R 2 = 0.44). The two-variable (Chl-a and TN/TP) and single-variable (Chl-a) best models in this case, showed lower R 2 values and weaker predictive performance criteria. In general, the linear models predicting CBB in deep lakes exhibited a remarkable improvement against the all-lakes and shallow-lakes models. Although it is somewhat surprising, since cyano-blooms is a more common problem in shallow lakes, this finding is justified by the fact that prediction is more difficult in shallow lakes because of more fluctuating conditions. However, similar results, where modelling efficacy improves in deep lakes, can be found in other studies [17,28]. In Table 4, the best univariate and multivariate models as well as their performance indicators are summarized. Our analysis shows that using linear models to predict CBB concentration from biological and physical-chemical lake variables has a relatively low reliability, which is reflected by the low R 2 values. To deal with this, we explored the possibility of producing more reliable results by predicting health risk levels, instead of concentrations. This is presented in Section 3.3.

Describing Dependent Relationships among Variables
Path analysis was performed for four different competing scenarios, in order to explore how strongly certain variables (TN, TP and MeanATemp) act in mediating the relationship between Chl-a and CBB. We explore how the model regresses Chl-a on TN, TP and MeanATemp and how the model directly regresses CBB on Chl-a, TN, TP and MeanATemp. Results of the path analysis, as produced from the R studio package, are shown in Table 5. In Scenario (1), we explore how the model regresses Chl-a on TN, TP and MeanATemp. For all three variables, we see that z-value > 2 and Pr (>|z|) < 0.05, so they are all significant in determining Chl-a. We also explore how CBB is regressed on Chl-a and again we see that Chl-a is significant. In Scenario (2) and having secured that TN, TP and MeanATemp are significant, we further explore how the model regresses CBB directly on TN (in addition to Chl-a). Results show that TN is indeed significant. Similarly, in Scenarios (3) and (4) we explore sequentially the significance of TP and MeanATemp, respectively, in influencing CBB directly. In Scenario (3), results show that TP is not significant, since the z-value is less than 2 and Pr (>|z|) > 0.05. In Scenario 4, the z-value (1.736) is close to 2 and Pr (>|z|) (0.083) is close to 0.05, indicating a marginal significance of MeanATemp in influencing CBB directly. In Figure 5, we see a schematic representation that shows the standardized path coefficients for Scenarios (1), (2) and (4), while other statistics, such as Chi-square, RMSEA, AIC and BIC are shown as well. According to the statistics, Scenario (2) is the best, since the statistics are the lowest. The dashed lines depict the correlation coefficients between the two variables they connect (repeated from Figure 4). Our path analysis shows that TP is not included in the variables that are significant for CBB, while TN is included. This fact indicates a potential nitrogen limitation, meaning that even when TP increases, it stops being significant since CBB is nitrogen dependent; similar results are reported in the literature [60]. This is especially true since the lakes in our dataset seem to be highly P enriched (see Table 2). A series of machine learning algorithms were used to compare their predictive potential towards identifying the risk to human health associated with CBB level for recreational activities. In Figure 6, confusion matrices and the corresponding performance indices for algorithms DT, k-NN, SVM and RF are presented. These matrices depict the number of instances that the predicted values of the CBB risk levels matched the reference (actual) risk levels or failed to predict the risk levels correctly. The instances presented on the diagonal are all True Positives, since the algorithm prediction matches the actual classification (what is predicted as a low is actually a low, a predicted medium is a medium, etc.) In Figure 6a, this number is 784. The sum of the actual instances that are not predicted correctly are the False Negatives; in Figure 6a, the False Negatives are 17 for "Low", 25 for "Medium" and 10 for "High", and they are calculated as the sum of each column minus the True Positives. False Positives are the instances that are predicted falsely and are the sum of each row minus the True Positives. In Figure 6a, the False Positives are 23 for "Low", 25 for "Medium" and 4 for "High". Finally, the True Negatives for a certain class are those instances that do not belong in that class (either predicted or reference), so in Figure 6a, we have 29 True Negatives for "Low", 771 True Negatives for "Medium" and 820 True Negatives for "High".

Evaluating the Performance of the Machine Learning Methods
A series of machine learning algorithms were used to compare their predictive potential towards identifying the risk to human health associated with CBB level for recreational activities. In Figure 6, confusion matrices and the corresponding performance indices for algorithms DT, k-NN, SVM and RF are presented. These matrices depict the number of instances that the predicted values of the CBB risk levels matched the reference (actual) risk levels or failed to predict the risk levels correctly. The instances presented on the diagonal are all True Positives, since the algorithm prediction matches the actual classification (what is predicted as a low is actually a low, a predicted medium is a medium, etc.) In Figure 6a, this number is 784. The sum of the actual instances that are not predicted correctly are the False Negatives; in Figure 6a, the False Negatives are 17 for "Low", 25 for "Medium" and 10 for "High", and they are calculated as the sum of each column minus the True Positives. False Positives are the instances that are predicted falsely and are the sum of each row minus the True Positives. In Figure 6a, the False Positives are 23 for "Low", 25 for "Medium" and 4 for "High". Finally, the True Negatives for a certain class are those instances that do not belong in that class (either predicted or reference), so in Figure 6a, we have 29 True Negatives for "Low", 771 True Negatives for "Medium" and 820 True Negatives for "High".  The overall metrics we use to assess the performance of these algorithms are "Accuracy" and "Kappa"-a single value per confusion matrix is calculated for these two variables. Accuracy is defined as the ratio of all True Positives of the matrix divided by the sum of all instances in the dataset and it expresses the ability of the model to correctly identify Low, Medium and High instances. Kappa statistics considers the fact that some of the correct predictions may be identified as such by chance, so it adjusts the reported model accuracy by considering the effect of randomness in correct predictions [61]. Comparing the four implemented machine learning algorithms, in Figure 6 we see that both for accuracy and Kappa statistics, the RF performs best. Thus, according to the results, RF has the highest metrics (Accuracy = 95.45% and Kappa = 58.84%) and was selected for tuning its parameters, in order to further improve its performance. The tuning procedure was conducted in R by using the packages "randomForest", "mlbench" and "caret" [46,59,62]. We applied the grid search method under 10-fold repeated cross validation, in order to search for the best combination of parameters, namely the functions "mtry" and "ntree". Mtry is the number of variables randomly sampled as candidates for each split, while ntree is the number of trees to grow [63]. After exhaustively checking all parameters and combinations with the grid search method, we concluded that the highest levels of accuracy are achieved for a mtry equal to 2 and for ntree taking the values The overall metrics we use to assess the performance of these algorithms are "Accuracy" and "Kappa"-a single value per confusion matrix is calculated for these two variables. Accuracy is defined as the ratio of all True Positives of the matrix divided by the sum of all instances in the dataset and it expresses the ability of the model to correctly identify Low, Medium and High instances. Kappa statistics considers the fact that some of the correct predictions may be identified as such by chance, so it adjusts the reported model accuracy by considering the effect of randomness in correct predictions [61]. Comparing the four implemented machine learning algorithms, in Figure 6 we see that both for accuracy and Kappa statistics, the RF performs best. Thus, according to the results, RF has the highest metrics (Accuracy = 95.45% and Kappa = 58.84%) and was selected for tuning its parameters, in order to further improve its performance. The tuning procedure was conducted in R by using the packages "randomForest", "mlbench" and "caret" [46,59,62]. We applied the grid search method under 10-fold repeated cross validation, in order to search for the best combination of parameters, namely the functions "mtry" and "ntree". Mtry is the number of variables randomly sampled as candidates for each split, while ntree is the number of trees to grow [63]. After exhaustively checking all parameters and combinations with the grid search method, we concluded that the highest levels of accuracy are achieved for a mtry equal to 2 and for ntree taking the values of 1000 and 2000. Results of this analysis is shown in Figure 7: Each colored line represents the fluctuation of accuracy (y-axis) when different ntree values (x-axis) are applied on each of the possible values for mtry (colored lines). We focus on the highest accuracy, i.e., which one of the colored lines achieves the highest y value (this is true for the purple line that has the highest peaks and corresponds to mtry = 2, according to the legend in Figure 7). For this line, the highest peaks correspond to ntree (x-axis) values of 1000 and 2000; thus, these are the chosen model parameters that optimize accuracy. . We focus on the highest accuracy, i.e., which one of the colored lines achieves the highest y value (this is true for the purple line that has the highest peaks and corresponds to mtry = 2, according to the legend in Figure 7). For this line, the highest peaks correspond to ntree (x-axis) values of 1000 and 2000; thus, these are the chosen model parameters that optimize accuracy. In order to check the performance of RF under these two prevailing combinations of parameters on the dataset, we produced two confusion matrices-one for mtry = 2 and ntree = 1000 and another for mtry = 2 and ntree = 2000. As we show in Figure 8, the best performance was achieved for mtry = 2 and ntree = 2000 (Accuracy = 95.81% and Kappa = 60.97%) (Figure 8b). According to [64], a model is considered to produce accurate predictions when Kappa exceeds 60%, which in our case is succeeded. In order to check the performance of RF under these two prevailing combinations of parameters on the dataset, we produced two confusion matrices-one for mtry = 2 and ntree = 1000 and another for mtry = 2 and ntree = 2000. As we show in Figure 8, the best performance was achieved for mtry = 2 and ntree = 2000 (Accuracy = 95.81% and Kappa = 60.97%) (Figure 8b). According to [64], a model is considered to produce accurate predictions when Kappa exceeds 60%, which in our case is succeeded.
To further analyze the modeling results, we compared in detail the performance of the four algorithms by using three metrics: precision, recall and specificity [61]. Precision is defined for each class (Low, Medium and High) as the ratio of True Positives by the sum of True Positives and False Positives for that class. In Figure 9, we see that precision is the highest for the Low class, most probably due to the large number of observations in that class; it approaches 1, which means that given many observations, all algorithms predict with great precision. In the High class, where we have the least observations, precision is close to 80% for SVM and RF, with SVM giving slightly better results. It is weakest in the Medium class, with RF still giving the best results. It should be noted that precision becomes important when the cost of False Positives is high. In our case, this means the cost of predicting that CBB concentrations are High (Low) when they are actually medium or Low (High). The former case of having high CBB concentrations being falsely predicted as low, which is an undesirable scenario in terms of public health, has a precision of about 80%. Given the very small number of observations in the High category, this is a strong outcome in terms of the capacity of the optimized RF scheme to predict events with excellent precision.
Forest performs in terms of accuracy for different combinations of the parameters under tuning (mtry and ntree) and for 10-fold repeated cross-validation.
In order to check the performance of RF under these two prevailing combinations of parameters on the dataset, we produced two confusion matrices-one for mtry = 2 and ntree = 1000 and another for mtry = 2 and ntree = 2000. As we show in Figure 8, the best performance was achieved for mtry = 2 and ntree = 2000 (Accuracy = 95.81% and Kappa = 60.97%) (Figure 8b). According to [64], a model is considered to produce accurate predictions when Kappa exceeds 60%, which in our case is succeeded.
(a) (b) To further analyze the modeling results, we compared in detail the performance of the four algorithms by using three metrics: precision, recall and specificity [61]. Precision is defined for each class (Low, Medium and High) as the ratio of True Positives by the sum of True Positives and False Positives for that class. In Figure 9, we see that precision is the highest for the Low class, most probably due to the large number of observations in that class; it approaches 1, which means that given many observations, all algorithms predict with great precision. In the High class, where we have the least observations, precision is close to 80% for SVM and RF, with SVM giving slightly better results. It is weakest in the Medium class, with RF still giving the best results. It should be noted that precision becomes important when the cost of False Positives is high. In our case, this means the cost of predicting that CBB concentrations are High (Low) when they are actually medium or Low (High). The former case of having high CBB concentrations being falsely predicted as low, which is an undesirable scenario in terms of public health, has a precision of about 80%. Given the very small number of observations in the High category, this is a strong outcome in terms of the capacity of the optimized RF scheme to predict events with excellent precision. Recall is calculated as the ratio of True Positives by the sum of True Positives and False Negatives, which is the ratio of True Positives by Total Actual Positives. Recall calculates how many of the actual Lows (Highs) the algorithm captures through labeling them as Lows (Highs). We see that, again, in the Low class where we have many observations, recall is almost 1. In the Medium and High classes, recall is lower, but the optimized RF algorithm performs significantly higher in all classes. The recall metric becomes important when there is a high cost associated with False Negatives; in our case, a False Negative would mean that a High risk level is falsely predicted by the model as either Medium or Low. In terms of protecting public health, this would have a really high cost and authorities want to have the right tools to avoid such situations. In the High and Medium classes, where we have few data, we see that recall values are between 50% and 60%, which indicates Recall is calculated as the ratio of True Positives by the sum of True Positives and False Negatives, which is the ratio of True Positives by Total Actual Positives. Recall calculates how many of the actual Lows (Highs) the algorithm captures through labeling them as Lows (Highs). We see that, again, in the Low class where we have many observations, recall is almost 1. In the Medium and High classes, recall is lower, but the optimized RF algorithm performs significantly higher in all classes. The recall metric becomes important when there is a high cost associated with False Negatives; in our case, a False Negative would mean that a High risk level is falsely predicted by the model as either Medium or Low. In terms of protecting public health, this would have a really high cost and authorities want to have the right tools to avoid such situations. In the High and Medium classes, where we have few data, we see that recall values are between 50% and 60%, which indicates that the instances that the model will predict False Negatives is relatively high. The fact that the numbers are close to 1 for the Low class, in which we have many observations, signifies the importance of the number of data observations. Finally, specificity is the true negative rate and is calculated as the number of correct negative predictions (for the Low class, this is the sum of observations that are actually Medium and High minus the ones that are falsely predicted as Low) divided by the total number of Negatives (True Negatives + False Positives-for the Low class, this includes all observations that are actually Medium and High). This metric is an indication of how good the model is at avoiding false alarms. So, when we report High (Medium), we have a probability of almost 100% to avoid false alarms, which means that the CBB concentrations are indeed High (Medium). When the Low class is reported, then the results show that there is an average probability for avoiding false alarms, which means that it is possible that a false alarm is created. This possibility for a false alarm is the highest in the Low class, which is not ideal, since it means that we may report Low but actual risk levels may indeed be Medium or High. Specificity would improve if we had more data in the Medium of High class. The fact that our dataset is very unbalanced, with many more observations in the Low class than in the other two classes, creates the corresponding imbalance in our results, with excellent predictions for the Low class and less accurate predictions in the other two classes. This of course is an important finding, since it indicates that the multitude of observations is critical in obtaining solid predictions. When authorities have the capacity to obtain such observations through online real-time monitoring, for example, resulting in large datasets, they can count on machine learning algorithms for reliable results that will assist them in taking informed decisions that protect public health and wellbeing.

Conclusions
This article explores the complexity of mechanisms that determine cyanobacteria abundance in lake ecosystems through stepwise multiple linear regression and a series of machine learning methods, namely Decision Tree, K-Nearest Neighbors, Support-vector Machine and Random Forest. The analysis advocates the multifactorial nature of cyanobacteria response to environmental conditions. Path analysis on the direct-indirect and cause-effect relationships of the stressors reveals that when Chl-a is included as one of the predictors of CBB, TN significantly affects CBB, air temperature only marginally affects its abundance, while TP is found to be not significant. Multiple linear regression analysis reports R 2 values that are in line with values reported by previous studies, even though they are relatively low (R 2 = 0.33 for all lakes, R 2 = 0.27 for shallow and R 2 = 0.44 for deep lakes). When differentiating the predicting approach and translating CBB to risk categories associated with impacts on public health by the recreational use of lakes, predictive ability was fundamentally improved. Machine learning methods and especially Random Forest proved to be a reliable and highly accurate tool towards the categorization of lakes to risk levels, achieving model accuracy levels as high as 95.81% after having optimized the parameters of the algorithm. Confusion matrix analysis resulted in the quantification of the probability of false alarms for the three different risk levels. Focusing on current machine learning techniques to assess cyanobacteria risk levels to human health can give crucial insights to water managers and consequently raise public awareness in a timely manner as to when the lake water is inappropriate to serve recreational uses.

Funding:
The work described in this paper has been conducted within the project WATER4CITIES-Holistic Surface Water and Groundwater Management for Sustainable Cities-which is implemented in the framework of the EU Horizon2020 Program, Grant Agreement Number 734409. This paper and the content included in it do not represent the opinion of the European Union, and the European Union is not responsible for any use that might be made of its content.