Machine Learning Classiﬁcation Algorithms for Predicting Karenia brevis Blooms on the West Florida Shelf

: Harmful algal blooms (HABs), events that kill ﬁsh, impact human health in multiple ways, and contaminate water supplies, have increased in frequency, magnitude, and impacts in numerous marine and freshwaters around the world. Blooms of the toxic dinoﬂagellate Karenia brevis have resulted in thousands of tons of dead ﬁsh, deaths to many other marine organisms, numerous respiratory-related hospitalizations, and tens to hundreds of millions of dollars in economic damage along the West Florida coast in recent years. Four types of machine learning algorithms, Support Vector Machine (SVM), Relevance Vector Machine (RVM), Naïve Bayes classiﬁer (NB), and Artiﬁcial Neural Network (ANN), were developed and compared in their ability to predict these blooms. Comparing the 21 year monitoring dataset of K. brevis abundance, RVM and NB were found to have better skills in bloom prediction than the other two approaches. The importance of upwelling-favorable northerly winds in increasing K. brevis probability, and of onshore westerly winds in preventing blooms from dispersing offshore, were quantiﬁed using RVM, and all models were used to explore the importance of large river ﬂows and the nutrients they supply in regulating blooms. These models provide new tools for management of these devastating algal blooms. Author Contributions: Conceptualization, M.F.L.; methodology, M.F.L. V.L.; validation, M.F.L.; resources, M.F.L. and P.M.G.; writing—original draft preparation, M.F.L.; writing—review and editing, P.M.G. and V.L.; supervision, P.M.G.; project administration, P.M.G.; funding acquisition,


Introduction
The incidence of harmful algal blooms (HABs) has increased globally. HABs are now occurring more frequently, in new and different places, and often last longer, having a wide range of environmental and toxic impacts and in numerous fresh, estuarine, and marine waters [1][2][3]. Both nutrient pollution and climate change are now recognized to play important roles in this expansion [4][5][6][7][8]. Nutrient runoff is increasing with increases in human population and associated changes in diets and the food supply chain, and rising temperatures and climate changes are leading to changes in precipitation patterns and increased intensity or frequency of storm events, that, in turn, alter coastal runoff and physical processes, such as upwelling and stratification [8]. From local to global scales, environmental conditions supporting HABs are changing, leading to increasing challenges for understanding-and modeling-the habitats that support and stimulate them.
Blooms of the toxic dinoflagellate Karenia brevis occur almost annually on the West Florida Shelf (WFS), and historical accounts show that they have occurred since at least the 16th century [9]. However, recent analyses suggest that bloom events have increased 15-fold from the 1950s to the 1990s, although quantifying patterns and trends is complicated by the inconsistency of data collection over this period [10]. During 2017-2019, southwest Florida experienced an unusually prolonged (18 months) K. brevis bloom. At its maximum, this Although K. brevis is typically thought of as a coastal bloom species, blooms are generally initiated offshore and then transported to coastal waters where they flourish and persist for months in nutrient-rich waters [9]. Blooms of K. brevis usually begin in the late summer or early fall, and can persist until the late fall or winter [14]. Upwelling transports K. brevis cells to the coast [15][16][17], but strong upwelling over the shelf break may actually suppress K. brevis blooms or favor competing taxa such as diatoms [18,19]. It is thought that northerly wind generates the coastal upwelling that transports K. brevis from offshore regions to coastal waters, producing favorable conditions for growth. Once K. brevis reaches coastal waters, the westerly wind keeps populations near the coast and prevents them from dispersing offshore. In the current study, the effects of winds were further explored herein.
The nutrient sources, pathways, and processes supporting and maintaining K. brevis blooms not only include upwelling, but also riverine nutrient inputs that bring wastewater effluent and agricultural runoff [14]. Other nutrient sources include benthic nutrient fluxes, atmospheric deposition, nutrients released by other phytoplankton and decaying fish from fish deaths, submarine groundwater discharge, and mixotrophic grazing-suggesting complex environmental interactions of nutrients with bloom occurrence and strength [14,[20][21][22][23][24][25]. Nutrient relationships and riverine flow were also tested herein.
The massive bloom of 2017-2019, and another recent, large-scale bloom, observed in 2005, were likely propelled by unusual events. Hu et al. [20] suggested that nutrient inputs resulting from a series of hurricanes in southwest  Although K. brevis is typically thought of as a coastal bloom species, blooms are generally initiated offshore and then transported to coastal waters where they flourish and persist for months in nutrient-rich waters [9]. Blooms of K. brevis usually begin in the late summer or early fall, and can persist until the late fall or winter [14]. Upwelling transports K. brevis cells to the coast [15][16][17], but strong upwelling over the shelf break may actually suppress K. brevis blooms or favor competing taxa such as diatoms [18,19]. It is thought that northerly wind generates the coastal upwelling that transports K. brevis from offshore regions to coastal waters, producing favorable conditions for growth. Once K. brevis reaches coastal waters, the westerly wind keeps populations near the coast and prevents them from dispersing offshore. In the current study, the effects of winds were further explored herein.
The nutrient sources, pathways, and processes supporting and maintaining K. brevis blooms not only include upwelling, but also riverine nutrient inputs that bring wastewater effluent and agricultural runoff [14]. Other nutrient sources include benthic nutrient fluxes, atmospheric deposition, nutrients released by other phytoplankton and decaying fish from fish deaths, submarine groundwater discharge, and mixotrophic grazing-suggesting complex environmental interactions of nutrients with bloom occurrence and strength [14,[20][21][22][23][24][25]. Nutrient relationships and riverine flow were also tested herein.
The massive bloom of 2017-2019, and another recent, large-scale bloom, observed in 2005, were likely propelled by unusual events. Hu et al. [20] suggested that nutrient inputs resulting from a series of hurricanes in southwest Florida in 2004 were linked with the severity of the 2005 bloom. Hurricanes can accelerate the yield of new sources of landbased nutrients from high riverine flow. Similarly, Hurricanes Irma (2017) and Michael (2018), and Tropical Storm Gordon (2018), are suspected of contributing to the severity of the 2017-2019 K. brevis bloom [11]. Moreover, long periods of wet weather through 2018, combined with increased discharges from Lake Okeechobee and the Caloosahatchee River, added nutrients to coastal waters, sustaining large K. brevis blooms through early 2019. In late 2020, another hurricane, Eta, appears to have played a similar role in helping to sustain a very recent large bloom that has lasted through the winter and spring of 2020-2021.
There is a strong need to advance predictions of K. brevis, and other HABs more generally, to protect human health, fisheries, and economies, but there are many challenges in modeling discrete HAB species [26][27][28][29][30][31]. Several types of models have been developed and are currently in operation for predicting K. brevis blooms [9,[32][33][34]. Weisberg et al. [34] developed a high-resolution coastal ocean circulation model to track the movement of water particles associated with K. brevis populations. An operational forecasting modeling system, maintained by the National Oceanic and Atmospheric Administration, provides 3-5 day outlooks of K. brevis blooms, using satellite remote sensing of chlorophyll a, in situ sampling of K. brevis cell density, and wind buoy data [32,33]. The main goal of these forecasts is to inform managers and the public in coastal areas where public health may be compromised [33]. Walsh et al. [35], in addition to Weisberg and He [15], used a threedimensional (3D) biophysical-coupled model to hindcast bloom initiation and explore the impact of individual forcing functions [35][36][37][38]. However, these models utilize many biochemical and physiological parameters, some of which have not been well characterized either in situ or in the laboratory. Furthermore, these 3D-coupled biophysical models are computationally expensive.
Due to their powerful nonlinear modeling capability, machine learning methods are beginning to be used to predict HAB events, including K. brevis events [39]. An Artificial Neural Network (ANN) model was used to predict algal blooms in Hong Kong coastal waters [40] and to predict outbreaks of the dinoflagellate Dinophysis acuminata in southern Spain [41]. More recently, a Neural Network (NN) approach was used to predict the presence/absence and abundance of the dinoflagellate Karlodinium and the diatom Pseudo-nitzschia in Alfacs Bay in the northwest Mediterranean Sea [42], and Support Vector Machine (SVM) models were used to predict blooms in freshwater reservoirs [43]. Shen et al. [44] used SVM models to simulate algal blooms in the tidal freshwater of James River in response to riverine nutrient loading.
Machine learning approaches have also been used in predicting HABs in the Gulf of Mexico, but with different objectives. Liu and Weisberg [16] used such approaches to demonstrate the role of deep-ocean forcing on WFS in the major bloom occurrences. Weisberg et al. [18] reported that the position of the Loop Current can affect blooms. When the Loop Current is in its southern position, it creates an upwelling of deep nutrients and fosters a diatom bloom that may outcompete any nascent K. brevis blooms. Liu et al. [19] used Self-Organizing Maps to classify spatial patterns of the sea surface height anomalies associated with the Loop Current and found no bloom developed when the Loop Current was in the southern position. That work focused exclusively on the potential effects of the Loop Current on K. brevis blooms and did not consider other factors such as river flows and riverine nutrient loading. Gokaraju et al. [45] proposed a machine learning based spatiotemporal data mining approach to detect HABs from SeaWiFS (Sea-viewing Wide Field-of-view) and MODIS (Moderate Resolution Imaging Spectroradiometer)-Aqua space-born sensor measurements. Recently, Hill et al. [46] used satellite remote sensing of chlorophyll a from 2003 to 2018, sea surface temperature, and bathymetry as inputs to a convolutional NN (designed for spatial data) to detect the presence of K. brevis blooms on WFS, achieving a maximum detection accuracy of 91%. Such approaches have yet to be used to assess the effects of winds, river flows, and river nutrient discharge on the likelihood of K. brevis blooms.
The State of Florida, along with numerous other states and government entities around the world, has established, or is working to establish, nutrient reduction targets to mitigate water quality problems in their water bodies. Continued monitoring and assessment methods will be essential, and improved approaches for establishing criteria for additional waters and to manage water quality across greater regions will continue to be required.
In this research, four machine learning algorithms were used to predict K. brevis on the WFS over a 21 year period. Specifically, the performance of these different machine learning approaches was assessed with regard to forecasting the probability of K. brevis blooms with changing wind, discharge from different rivers, differing nutrient loads, and sea surface height (as a proxy for temperature and upwelling strength). New modeling approaches will provide new tools for defining scientifically defensible protective nutrient loads in future re-evaluations of Florida's water quality criteria and in predicting blooms to protect human health and commercial interests.

The Dataset and Preparation of Explanatory and Dependent Variables
To develop the machine learning models, in situ data of K. brevis cell densities (cells L −1 ) over a 21 year period (1998-2018) on the WFS were obtained from the database available from the Florida Fish and Wildlife Conservation Commission (FFWRCC) [47]. All data represent near-surface cell counts taken using light microscopy; samples are collected by various state and county agencies, private research institutions, and university researchers and routinely reported to FFWRCC. These data represent water samples collected during regular monitoring along the Florida coast and during suspected or confirmed K. brevis events. Although data are available for earlier decades, since 1998 more consistent inshore and offshore stations have been sampled, and thus this analysis is limited to data post-1998.
The data used herein were limited to water samples collected between latitudes of 25.85 degrees north (Marco Island) and 29.14 degrees north (Mouth of Suwanee River) and at most 9 km from the coast because most of the K. brevis blooms occurred within this area. Moreover, setting a fixed area for the data analysis ensured data consistency [48]. Because the K. brevis measurements were largely collected if and when blooms were documented, and not made on a continuous or regular basis, the database has an undersampling of K. brevis under low cell density conditions. To overcome the spatial and temporal inconsistency in the data, the 5 highest cell counts across the fixed area were averaged for each week to produce a weekly mean, following the approach used in previous studies [19]. These cell numbers were discretized and weekly averages were combined into a binary variable, with mean cell densities greater than 10 5 cells L −1 counting as K. brevis events, consistent with the commonly used threshold for K. brevis blooms.
Streamflow data were obtained from United States Geological Survey (USGS) stations in the major rivers that discharge onto the WFS [49] (Figure 1). The USGS stations used included: Tampa Bay (USGS 2306647), Peace River (USGS 2296750), Lake Okeechobee (USGS 2274325), Suwanee River (USGS 2323500), Withlacoochee River (USGS 2319000), Hillsborough River (USGS 2303330), Little Manatee River (USGS 2300500), Myakka River (USGS 2298830), and Caloosahatchee Canal (USGS 2292000). Nutrient data from the major rivers, including the total nitrogen (TN) and total phosphorus (TP) concentrations, were downloaded from the Tampa Bay and Charlotte Harbor Water Atlas (University of South Florida Water Institute) [50]. No nutrient concentration data were available for the Suwanee River. Weekly averaged nutrient concentrations were multiplied by weekly averaged streamflow to estimate weekly TN and TP loads.
Hourly wind and temperature data were obtained from the National Data Buoy Center (NDBC) stations [51] (Figure 1) across the WFS. The hourly wind speeds were used to calculate weekly averages using a simple vector average. The hourly temperature was used to calculate weekly averages. Satellite altimetry from the GLOBAL-REANALYSIS-PHY−001-030 reanalysis product, provided by the E.U. Copernicus Marine Service Monitoring Service (CMEMS) [52] was used to calculate the difference in sea surface height at two locations to quantify the strength of the deep-sea coastal upwelling caused by the Loop Current, following Maze et al. [48].
Data were aggregated into the following form; each row i = 1, . . . , 1083 of the dataset is x i 1 , x i 2 , x i 3 , . . . , x i 33 , y i , where x i 1 , x i 2 , x i 3 , . . . , x i 33 are the explanatory variables of river discharge, nutrient concentration, wind speed and direction, temperature, and sea surface height difference, and y i is the dependent variable of discretized K. brevis cell densities. Machine learning algorithms aim to map x i 1 , x i 2 , x i 3 , . . . , x i 33 to y i . The number of HAB events (318, based on the criteria of cell density >10 5 cells L −1 ) was less than half of the number of events without HABs (765, cell density <10 5 cells L −1 ), resulting in an imbalanced classification problem [53]. Machine learning algorithms for classification predictive models are designed assuming an equal distribution of classes. Because there are fewer examples of the minority class (non-HAB events) than the majority class (HAB events), it becomes harder to predict the non-event periods.
Several approaches have been developed to address this issue and two different approaches were applied herein: (1) the minority class of the training data was randomly oversampled such that the sample size of events with and without HABs were roughly equal in the synthetic training dataset [54]; and (2) the minority class was oversampled by generating new synthetic data using a synthetic minority oversampling technique (SMOTE) preprocessing algorithm [55][56][57].

Machine Learning Algorithms
To predict K. brevis cell density and test the strength of various explanatory variables, the following machine learning algorithms were used: a) Support Vector Machine (SVM), b) Relevance Vector Machine (RVM, a modification of SVM), c) Naïve Bayes (NB), and d) Artificial NN (ANN). These approaches represent a range of machine learning algorithms with different methodologies and varying complexity. SVM belongs to a class of algorithms called kernel methods. RVM has an identical functional form to SVM but provides probabilistic classification. NB is a family of simple probabilistic classifiers. ANN is based on a system of connected nodes to mirror neurons in a biological brain.

Support Vector Machine
The SVM model is a supervised machine learning algorithm that seeks the hyperplane that best separates two labeled classes from each other; the optimal hyperplane maximizes the marginal distance from the nearest support vector for each class [45] (Figure 2a): where x represents the vector for the explanatory variables, w, x + b = 0 is the hyperplane that separates the two classes, w is the slope of the hyperplane, b is the intercept, and f(x) is the classifier output which takes the value of either 1 or −1. SVM seeks a solution for the hyperplane by maximizing the width of the gap between the two data clouds, represented by the cost function (CF, Equation (2), Figure 2a): Sometimes the SVM cannot achieve a perfect separation. The soft-margin loss formulation allows some data points to lie within the margin of tolerance but penalizes them in the cost function [58] as follows (Equation (3)): where in the slack variables, ξ i = max(0, 1 − y i · ( w, x i + b)), y i is either 1 or −1, and C is a hyperparameter which determines the trade-off between maximizing the margin width and minimizing the associated error ( Figure 2a). This new cost function is then optimized, yielding the linear support vector expansion for the classifier (Equation (4)): where α i are the Lagrangian multipliers and w is rewritten as a linear combination of the training patterns [59]. The constant b can be found with the Karush-Kuhn-Tucker Conditions [60,61]: is a hyperparameter which determines the trade-off between maximizing the margin width and minimizing the associated error ( Figure 2a). This new cost function is then optimized, yielding the linear support vector expansion for the classifier (Equation (4)): where are the Lagrangian multipliers and is rewritten as a linear combination of the training patterns [59]. The constant b can be found with the Karush-Kuhn-Tucker Conditions [60,61]: (a) A schematic diagram of the Support Vector Machine (SVM) classifier. The SVM model is a supervised machine learning algorithm that seeks a hyperplane that best separates two labeled classes from each other. The SVM maximizes the width of the gap between the two data clouds. In some cases, not all of the data points can be fitted into the two data clouds outside the shaded gap region. In the soft margin formulation of the SVM, points are allowed inside the gap but penalized in the cost function. (b) A schematic diagram of the Artificial Neural Network (ANN) model. The ANN is based on the feedforward multilayer perceptron architecture, consisting of an input layer, one or more sets of hidden layers, and one output layer. The ANN can be turned into a classifier by discretizing the network's output. The basic substructure of the ANN is perceptron. For all but the input layer, the perceptron has an input (the outputs of the previous layer). The vectors of inputs and the neuron's weights are multiplied by a dot product. Then, a transfer function is applied to the sum, giving an output for the next layer of perceptrons.
The linear support vector expansion cannot be used to describe nonlinear relationships between the explanatory and dependent variables. To describe nonlinear datasets, kernel functions are used to map the data to higher dimensions where they exhibit linear patterns and the linear model can be applied in that feature space [62,63]. The Gaussian radial basis function was chosen as the kernel function because of its computational efficiency (Equation (5)), where γ is the kernel parameter controlling the sensitivity of the kernel function. SVM has two hyperparameters that cannot be determined from optimization: C and γ. Both were determined with a grid search method on the training data. The SVM model is a supervised machine learning algorithm that seeks a hyperplane that best separates two labeled classes from each other. The SVM maximizes the width of the gap between the two data clouds. In some cases, not all of the data points can be fitted into the two data clouds outside the shaded gap region. In the soft margin formulation of the SVM, points are allowed inside the gap but penalized in the cost function. (b) A schematic diagram of the Artificial Neural Network (ANN) model. The ANN is based on the feedforward multilayer perceptron architecture, consisting of an input layer, one or more sets of hidden layers, and one output layer. The ANN can be turned into a classifier by discretizing the network's output. The basic substructure of the ANN is perceptron. For all but the input layer, the perceptron has an input (the outputs of the previous layer). The vectors of inputs and the neuron's weights are multiplied by a dot product. Then, a transfer function is applied to the sum, giving an output for the next layer of perceptrons.
The linear support vector expansion cannot be used to describe nonlinear relationships between the explanatory and dependent variables. To describe nonlinear datasets, kernel functions are used to map the data to higher dimensions where they exhibit linear patterns and the linear model can be applied in that feature space [62,63]. The Gaussian radial basis function was chosen as the kernel function because of its computational efficiency (Equation (5)), where γ is the kernel parameter controlling the sensitivity of the kernel function. SVM has two hyperparameters that cannot be determined from optimization: C and γ. Both were determined with a grid search method on the training data.

Relevance Vector Machine
The RVM model has an identical functional form to SVM but uses a Bayesian probabilistic framework to estimate the parameters [64,65]. To obtain the maximum likelihood estimate of w, and to avoid overfitting, the Bayesian approach is taken to constrain the parameters by defining an explicit prior probability distribution over them. The prior probability distribution is chosen to be a Gaussian distribution, and RVM introduces a vector to enforce a preference for smoothness. Then, the posterior estimate of the unknown parameters given the data is obtained using Bayes' rule. Because the posterior probability can be evaluated exactly, RVM seeks to maximize the marginal likelihood with respect to the hyperparameters. RVM typically uses considerably fewer basis functions than SVM. RVM was applied herein using the radial basis function as the kernel function.

Naïve Bayes
The NB classifier is a simple probabilistic classifier based on the Bayes' rule and requires strong "naïve" independence between the features [66,67]. Given a new observation x, it finds the class C k that maximizes the conditional probability p(C k |x) , the likelihood of a class given the observation. Using Bayes' theorem, the conditional probability can be calculated as follows (Equation (6)): where p(C k ) is the prior probability of observing a class C k , p(x|C k ) is the likelihood of observing x given C k , and p(x) is the probability of observing x. Assuming strong naïve independence, the probabilistic chain rule can be used to transform the likelihood p(x|C k ) of x into the probabilities of each of the features of x given a class (Equation (7)): This study used the Gaussian NB in which the Gaussian distribution (Equation (8)), is assumed to underlie the sample distribution. To train the NB classifier, the data were segmented by the classes, and the mean and standard deviation of each of the features for each of the classes were calculated, giving a probability distribution for each of the classes.

Artificial Neural Network
ANN is based on the feedforward multilayer perceptron architecture, consisting of an input layer, one or more sets of hidden layers, and one output layer [68,69]. ANN can be turned into a classifier by discretizing the network's output. The basic substructure of ANN is a perceptron (Figure 2b). Each perceptron has an input (the outputs of the previous layer), a series of weights, a transfer function, and an output. A transfer function is applied to the dot product of the inputs and weights for each perceptron, giving an output for the next layer. The output y (l) j for node j in layer l is as follows (Equation (9)): where x are the input variables, y (l−1) j is the output at layer (l−1), w ji are the synaptic weights, and ϕ is the activation function.
Initially, random numbers are assigned to the synaptic weights. The weights are adjusted with the training data. There are two main steps to the training of the ANN: forward computation and back propagation. In forward propagation, input signals are propagated through the network, layer by layer. In back propagation, the error for the entire network is calculated [70]. Then, the errors are computed for each neuron, and then the local gradients for the synaptic weights of the network are calculated. Gradient descent is then used to adjust the synaptic weights. These steps are repeated until the error falls below a desired threshold. Herein, two hidden layers with 20 and 10 neurons were used in the ANN model, because rules of thumb suggest that two hidden layers are sufficient given the number of explanatory variables in this classification problem and that each layer should have approximately half the number of nodes as the preceding layer [71]. Hyperparameter tuning was undertaken by a nested k-fold cross-validation procedure, which is described in detail in the following section.

Model Evaluation and Metrics
The predictive skill of the machine learning algorithms was evaluated using two approaches. First, a k-fold cross-validation approach that has been widely used in machine learning classification [81,82] was applied. In this approach, the data are randomly divided into k disjointed subsets of equal size. Then, for each combination of k−1 of the k subsets, one of the k models is trained, and the test statistic for that model is evaluated on the remaining subset [83,84]. The mean of the test statistics over all k models is called the cross-validation estimate. In this study, k = 10 so that each subset spanned 2 years of data. The data are assumed to be independent during the k-fold cross-validation. However, this assumption may be inappropriate for time series that may be auto-correlated. Thus, the data herein were further validated by block cross-validation [85][86][87][88]. To do this, the data were divided by chronological order into 10 subsets of 2 years each: 1998-1999, 2000-2001, . . . , 2017-2018. In one iteration of the cross-validation procedure, the models were trained on the data from 1998-2016 and then tested on data from 2017-2018. This procedure was repeated for all the 2 year blocks.
Four metrics were used to evaluate the performance of the machine learning classifiers in predicting K. brevis blooms [53,89]. Accuracy measures the overall accuracy of the prediction (Equation (10)): where TrPos is the number of weeks with blooms predicted correctly (true positives), FNeg is the number of weeks with blooms predicted to be non-HAB weeks (false negatives), TrNeg is the number of non-HAB weeks predicted correctly (true negatives), and FPos is the number of non-HAB weeks predicted to be weeks with HABs (false positives). A is the measure of all the correctly identified cases. Recall (R) is the ratio of the correctly-predicted HAB weeks to the total number of the observed HAB weeks (Equation (11)): Precision (P) is the ratio of the correctly-predicted HAB weeks to the total number of the predicted HAB weeks (Equation (12)): F1 measures the balance between precision and recall (Equation 13): Although A is most often used when there are similar amounts of each class, F1 score is a better metric where there are imbalanced classes. The testing metrics were averaged for both the k-fold or block validation procedures. To further test the models' predictions, a time series of the cross-validation predictions was created.
In addition to the four metrics, the robustness of each model, such as sparsity, was examined. The mean number of support vectors (SVs)/relevance vectors (RVs) was calculated for the SVM and RVM, and the Akaike Information Criterion (AIC) was determined for the ANN.

Sensitivity Analysis
In order to determine how environmental factors affect the probability of K. brevis, Platt scaling [90] was applied. This uses a logistic transformation to convert classifier predictions into probability distributions over the classes. Platt scaling is ideal for this study because of its simplicity and the size of the training dataset. First, each machine learning algorithm was trained on the entire dataset. Platt scaling was then applied to calculate the probability of K. brevis blooms (Equation (14)): where y i is a sample, C +1 is one of the classes, f (x) is the classifier output, and A and B are scalar constants [91].

In Silico Experiments
Using the various models, the impacts of different environmental variables on probabilities of K. brevis occurrence were assessed. To do so, each explanatory variable or variables were varied by 1-2 standard deviations around its mean whereas the other variables were set to their respective annual mean values. Line plots and contour diagrams of HAB probability as a function of explanatory variables were created by varying one or two explanatory variables at a time.
The probability of K. brevis blooms as a function of wind speed components in the north-south direction (negative for northerly wind) and the east-west direction (negative for easterly wind) were examined using SVM. To do so, the wind components were varied 1-2 standard deviations above and below the long-term mean while holding other factors constant.
The probabilities of K. brevis outbreaks as a function of discharge from the Suwanee, Hillsborough, Myakka, Peace, and Caloosahatchee Rivers-all of which discharge into the WFS-were examined using SVM, RVM, and NB. Using the same three models, the probabilities of blooms for each river as a function of their TN or TP loads were also estimated based on variations of 1-2 standard deviations from the mean (and setting other features to the mean).

Overall Model Performance
The predictability of K. brevis blooms over the 21 year time series (1998-2018) was tested relative to the observed K. brevis cell concentrations along the WFS using all four machine learning approaches. All models captured the general time series of K. brevis events, encompassing both prolonged blooms with high cell counts, and periods of only a short duration with relatively low cell counts, but the RVM model was the most robust (Figures 3 and 4).  Results from the random oversampling and SMOTE sampling method were similar for all four approaches (Table 1). According to the block cross-validation, SVM and ANN achieved significantly higher prediction accuracy (0.62 and 0.61, respectively, from random oversampling) than RVM and NB (0.55 and 0.47, respectively). In contrast, when comparing the recall values using the same block cross-validation with random oversampling, NB had the highest recall (0.72), followed by RVM (0.58), implying that these models correctly predicted 72% and 58% of the prior K. brevis blooms, respectively, whereas SVM and ANN had much lower recall values (0.27 and 0.33 respectively). All models predicted a similar number of false positives, as shown by their precision values ranging between 0.32 and 0.35. The F1 score, the balance between recall and precision, was highest for RVM and NB, at 0.43 and 0.45 respectively.

Role of Wind
Winds differed in years with and without a bloom. During years with blooms, stronger northerly and westerly winds occurred ( Figure 5). Using RVM, the probability of K. brevis blooms as a function of wind speed components was examined in the northsouth direction (negative for northerly wind) and the east-west direction (negative for easterly wind). Bloom probability was much higher under northerly winds than under southerly winds ( Figure 6). Bloom probability reached a maximum of 0.57 under northerly wind, whereas strong southerly wind reduced bloom probability to <0.3. Northerly winds drive coastal upwelling, thereby transporting K. brevis from the offshore waters to coastal waters. Westerly winds corresponded to higher probability (up to 0.53) of a bloom, compared with easterly winds, with bloom probability as low as 0.36 for the strongest easterly winds ( Figure 6). Once K. brevis reaches nearshore locations, westerly winds help hold K. brevis blooms against the shore where they can access nutrient sources from land and rivers.

Role of Wind
Winds differed in years with and without a bloom. During years with blooms, stronger northerly and westerly winds occurred ( Figure 5). Using RVM, the probability of K. brevis blooms as a function of wind speed components was examined in the north-south direction (negative for northerly wind) and the east-west direction (negative for easterly wind). Bloom probability was much higher under northerly winds than under southerly winds ( Figure 6). Bloom probability reached a maximum of 0.57 under northerly wind, whereas strong southerly wind reduced bloom probability to <0.3. Northerly winds drive coastal upwelling, thereby transporting K. brevis from the offshore waters to coastal waters. Westerly winds corresponded to higher probability (up to 0.53) of a bloom, compared with easterly winds, with bloom probability as low as 0.36 for the strongest easterly winds ( Figure 6). Once K. brevis reaches nearshore locations, westerly winds help hold K. brevis blooms against the shore where they can access nutrient sources from land and rivers.  Figure 5. Examples of winds associated with blooms (top panels) and winds for periods without blooms (lower panels). Arrows denote wind direction. Figure 5. Examples of winds associated with blooms (top panels) and winds for periods without blooms (lower panels). Arrows denote wind direction. Figure 6. Probability of Karenia brevis as a function of wind speed and direction, obtained from the sensitivity analysis using the RVM.

Role of river flow and associated nutrients
Seasonally, riverine flow typically increases for all rivers in all years during the summer months, but interannual variability is high (illustrated for the 21 year record for the Suwanee, Peace, Myakka, and Hillsborough Rivers, Figure 7a). For all rivers, and across all discharge levels examined, the probability of K. brevis blooms increased with river flow (Figure 7b).
For all rivers, bloom probability predicted by the NB was always greater than that predicted by the RVM, which was always larger than that predicted by the SVM (Figure  7b). This result can be explained by Table 1: the NB had the highest recall value; the RVM ranked second; the SVM had the lowest value. The ANN was excluded from the sensitivity analysis because the ANN yields probability predictions that are either very close to 0 (<0.01) or 1 (>0.99. This presents numerical instability issues when computing the weights Figure 6. Probability of Karenia brevis as a function of wind speed and direction, obtained from the sensitivity analysis using the RVM.

Role of River Flow and Associated Nutrients
Seasonally, riverine flow typically increases for all rivers in all years during the summer months, but interannual variability is high (illustrated for the 21 year record for the Suwanee, Peace, Myakka, and Hillsborough Rivers, Figure 7a). For all rivers, and across all discharge levels examined, the probability of K. brevis blooms increased with river flow (Figure 7b).
For all rivers, bloom probability predicted by the NB was always greater than that predicted by the RVM, which was always larger than that predicted by the SVM (Figure 7b). This result can be explained by Table 1: the NB had the highest recall value; the RVM ranked second; the SVM had the lowest value. The ANN was excluded from the sensitivity analysis because the ANN yields probability predictions that are either very close to 0 (<0.01) or 1 (>0.99. This presents numerical instability issues when computing the weights of the logistic function (Equation (14)) in Platt's scaling, yielding ANN probability curves with discontinuous jumps that are either very close to 0 (<0.01) or 1 (>0.99), like the initial ANN predictions.
As discharge changed, using RVM as the example, the slope in bloom probability was highest with the Hillsborough River, with low discharge yielding a 0.20 probability in blooms, increasing to 0.55 with high discharge (Figure 7b). Increases in discharge from the Peace and Suwanee Rivers also increased bloom probability substantially, from 0.33-0.52 and 0.23-0.54, respectively (with RVM as the example), across the range of typical flows. Changes in discharge from the Myakka River yielded probabilities that changed from 0.34 to 0.55. Across all discharge levels, bloom probability was consistently higher (0.49-0.51 with RVM) with increased Caloosahatchee River discharge than for the other rivers examined, and it increased linearly as river discharge increased. The Caloosahatchee River has the highest discharge of the rivers examined, and it transports the highest amount of nutrients.
The composition of the nutrients discharged by the different rivers also varied and accordingly the probability of blooms varied for their different nutrient loads (Figure 7c,d). Applying the three models SVM, RVM, and NB, with increasing TN, the largest increase in bloom probability was found for the Myakka River, whereas smaller increases were found for the Peace and Caloosahatchee Rivers (Figure 7c). For the Hillsborough River, K. brevis probability as a function of the TN loads resembles a parabolic function. For TP, increases in probability were seen for the Peace and Caloosahatchee Rivers, but a parabolic relationship was noted for the Hillsborough and Myakka Rivers (Figure 7d). As nutrient loads increase, it is possible that K. brevis may be either outcompeted by a different species or and/or become limited by a different growth factor. However, there were some differences among the three models. For example, the bloom probability versus the Peace River TP load had a slope that was less steep in the NB model than in the RVM and SVM. Although the probability increased slightly with TN in the Caloosahatchee River in the NB and SVM, it decreased in the SVM. of the logistic function (Equation (14)) in Platt's scaling, yielding ANN probability curves with discontinuous jumps that are either very close to 0 (<0.01) or 1 (>0.99), like the initial ANN predictions. As discharge changed, using RVM as the example, the slope in bloom probability was highest with the Hillsborough River, with low discharge yielding a 0.20 probability in blooms, increasing to 0.55 with high discharge (Figure 7b). Increases in discharge from the Peace and Suwanee Rivers also increased bloom probability substantially, from 0.33-0.52 and 0.23-0.54, respectively (with RVM as the example), across the range of typical flows. Changes in discharge from the Myakka River yielded probabilities that changed from 0.34 to 0.55. Across all discharge levels, bloom probability was consistently higher (0.49-0.51 with RVM) with increased Caloosahatchee River discharge than for the other rivers examined, and it increased linearly as river discharge increased. The Caloosahatchee River has the highest discharge of the rivers examined, and it transports the highest amount of nutrients.
The composition of the nutrients discharged by the different rivers also varied and accordingly the probability of blooms varied for their different nutrient loads ( Figure  7c,d). Applying the three models SVM, RVM, and NB, with increasing TN, the largest By comparing TN and TP discharge from different rivers, it can be seen that large reductions in both nutrients are needed from multiple sources to substantially reduce the frequency of K. brevis blooms (Figure 8), based on the results from the RVM. These comparisons, based on variations of 1-2 standard deviations from the mean (and setting other features to the mean), illustrate the magnitude of reductions necessary to reduce the probability of blooms from >0.6 to <0.2.

Role of Sea Surface Height
Sea surface height difference was chosen as one of the explanatory variables in our machine learning algorithms because previous studies [48] have related this variable to the position of the Loop Current, and the associated temperature and degree of upwelling. Results for the RVM model were nearly identical with or without this explanatory variable ( Figure 9) and, given this outcome, this factor was not tested with the other models. This may be due to the fact that the Loop Current is inhibited from penetrating the WFS due to the sloping typography; shelf currents are controlled to a larger extent by local winds [92]. Nevertheless, for 1998Nevertheless, for , 2002Nevertheless, for , 2009Nevertheless, for , 2010 Peace River TP load had a slope that was less steep in the NB model than in the RVM and SVM. Although the probability increased slightly with TN in the Caloosahatchee River in the NB and SVM, it decreased in the SVM.
By comparing TN and TP discharge from different rivers, it can be seen that large reductions in both nutrients are needed from multiple sources to substantially reduce the frequency of K. brevis blooms (Figure 8), based on the results from the RVM. These comparisons, based on variations of 1-2 standard deviations from the mean (and setting other features to the mean), illustrate the magnitude of reductions necessary to reduce the probability of blooms from >0.6 to <0.2.

Role of Sea Surface Height
Sea surface height difference was chosen as one of the explanatory variables in our machine learning algorithms because previous studies [48] have related this variable to the position of the Loop Current, and the associated temperature and degree of upwelling. Results for the RVM model were nearly identical with or without this explanatory variable ( Figure 9) and, given this outcome, this factor was not tested with the other models. This may be due to the fact that the Loop Current is inhibited from penetrating the WFS due to the sloping typography; shelf currents are controlled to a larger extent by local winds [92]. Nevertheless, for 1998Nevertheless, for , 2002Nevertheless, for , 2009Nevertheless, for , 2010

Discussion.
Machine learning provides powerful approaches for predicting HAB occurrences and testing their potential change under different conditions. The flexibility of these tools and ability to find data-driven solutions when mechanistic relationships are yet unknown or hard to parameterize are particularly important for advancing research on factors that result in toxin-producing blooms. Prior modeling of HABs accommodating nonlinearity of relationships and non-normality of distributions has been undertaken by statistical regression models, and the power of machine learning techniques is beginning to be recognized [39]. Machine learning has become a powerful tool in water quality assessment, from freshwater to marine waters [93][94][95].
Models, however imperfect, can be useful for testing the strength of particular factors or variables on outcomes. In silico experiments in which nutrient sources are turned on or off, or climate variables altered, provide clear clues regarding the importance of such factors both for present and future conditions [26,27]. Such in silico experiments are also insightful regarding the potential magnitude of impacts of nutrient reductions if undertaken by management.
Models may yield either false positives (indicating blooms or conditions for blooms when they do not occur) or false negatives (indicating no bloom when in fact they do occur). Both can be problematic in terms of protecting human health and economies. False positives may be preferable if the goal is to protect human health-better to be "safe than sorry"-but false positives can also be more expensive economically [31]. For example, a fishery may be closed when it was not necessary to do so. False negatives are not protective of human or ecological health.
In this study, four different machine learning classifiers were used to predict the like-Precision Figure 9. Precision of the RVM model by year.

Discussion
Machine learning provides powerful approaches for predicting HAB occurrences and testing their potential change under different conditions. The flexibility of these tools and ability to find data-driven solutions when mechanistic relationships are yet unknown or hard to parameterize are particularly important for advancing research on factors that result in toxin-producing blooms. Prior modeling of HABs accommodating nonlinearity of relationships and non-normality of distributions has been undertaken by statistical regression models, and the power of machine learning techniques is beginning to be recognized [39]. Machine learning has become a powerful tool in water quality assessment, from freshwater to marine waters [93][94][95].
Models, however imperfect, can be useful for testing the strength of particular factors or variables on outcomes. In silico experiments in which nutrient sources are turned on or off, or climate variables altered, provide clear clues regarding the importance of such factors both for present and future conditions [26,27]. Such in silico experiments are also insightful regarding the potential magnitude of impacts of nutrient reductions if undertaken by management.
Models may yield either false positives (indicating blooms or conditions for blooms when they do not occur) or false negatives (indicating no bloom when in fact they do occur). Both can be problematic in terms of protecting human health and economies. False positives may be preferable if the goal is to protect human health-better to be "safe than sorry"-but false positives can also be more expensive economically [31]. For example, a fishery may be closed when it was not necessary to do so. False negatives are not protective of human or ecological health.
In this study, four different machine learning classifiers were used to predict the likelihood of K. brevis blooms between 1998 and 2018. Comparing the 20 year monitoring dataset of the abundance of this dinoflagellate using all algorithms, RVM and NB were found to have better skills in bloom prediction than the two other approaches. All models were comparable in how frequently false negatives were reported. Because the number of weeks with blooms was about 42% of the number of non-bloom weeks, it required the classifiers to learn from an imbalanced dataset. This challenge was resolved by two different methods that oversample the minority class: random oversampling and generation of synthetic data using SMOTE. The predictive skills were very similar between the two data sampling methods (Table 1). This result is perhaps not surprising because the ratio of the samples in the minority to the samples in the majority class was only 1:2.4, in comparison to models in which the class imbalance reached 1:10 or 1:100 [53].
Another feasible alternative machine learning model is that of the random forest. This approach often outperforms other models and has the option of ranking variables by their individual importance. However, in the current data set for which there are~1000 data points, any resulting decision tree would be based only on 10s−100s of data points. Thus, the approaches used herein were more suitable to the size of the data set.
Both k-fold and block cross-validation methods were used to evaluate the predictive skills of the machine learning classifiers. Although the SVM achieved good scores (recall = 0.63, precision = 0.64, and F1 = 0.64 using random oversampling) during the k-fold cross-validation, its performance deteriorated significantly during the block crossvalidation (recall = 0.27, precision = 0.32, and F1 = 0.29) ( Table 1). A similar deterioration was seen in the ANN between k-fold and block cross-validation methods. When trained using the random oversampling approach and tested with the k-fold cross-validation procedure, the SVM used 533 support vectors and the ANN had a high Akaike Information Criterion score of 1814. It is possible that these two algorithms overfitted the training data and their predictive skills deteriorated when tested on completely independent data, as undertaken in the block cross-validation analysis. In contrast, the RVM had only 19 relevance vectors, and NB was a simple probabilistic classifier, thus producing more robust results. The accuracy, recall, precision, and F1 scores of RVM and NB remained higher, regardless of the cross-validation methods.
Blooms of K. brevis occur almost annually in the eastern Gulf of Mexico, typically initiating in early fall, but varying in intensity and duration. The bloom of 2017-2019 was among the largest and most expensive in recent history. It caused the deaths of hundreds of tons of fish, hundreds of manatees, dolphins, and sea turtles, in addition to many cases of respiratory distress [13]. Fisheries closures, and revenue lost by local businesses, also had economic impacts in the tens to hundreds of millions of dollars [11]. Understanding the links between physical controls (upwelling, river flow), nutrient inputs, and extreme weather events has been a high priority in order to make long-term predictions to protect environmental and human health [14,18,34]. The results reported herein confirm that wind direction, river flow, and nutrient load are important explanatory variables with regard to K. brevis probabilities. Although sea surface height (as a proxy for the Loop Current) did not contribute to improved forecasts, the increase in false positives for select years (lower precision values) suggests that the height values alone do not capture the effect of this current adequately, i.e., it was not a sensitive proxy.
Using a convolutional NN approach, Hill et al. [46] achieved high accuracy in detecting blooms of K. brevis. There are several differences between the methodology applied herein and the analysis of Hill et al. [46]. The study of Hill et al. [46] used satellite remote sensing of chlorophyll a as a proxy to detect K. brevis, whereas direct cell counts were used here. Moreover, Hill et al. [46] did not explore the role of wind speed, river flow, or nutrient loads. These results are complementary and show the promise of machine learning approaches not only in modeling various aspects of K. brevis blooms, but of HAB events more generally.
Although there have been debates about the extent to which anthropogenic nutrients fuel K. brevis blooms [10,14], and references therein], there is no doubt that Florida's continuing population growth has accelerated eutrophication. Florida is working to establish nutrient reduction targets to mitigate water quality problems in the water bodies. With Florida's continuing population growth and its coastal development and dependence on tourism, more people are exposed to K. brevis and its toxins than in previous decades, and the prolonged duration of recent blooms is also increasing the period of exposure when blooms do occur [14]. Blooms, which traditionally occur from the late summer until late winter, have been sustained throughout the summer months in recent years, raising important questions about the interplay of physical controls and biological responses, including changing temperature conditions. The nutritional pathways, and sources of nutrients supporting K. brevis blooms, are complex [14,21,24,25]. The fact that nutrient loads have increased is, in itself, an insufficient explanation for the expansion in K. brevis blooms. The right nutrients are required at the right time to create conditions conducive for these blooms to form [7]. Changes in flow, such as that due to hurricanes or intensive wet weather, bring new nutrients that can help to support blooms. The statistical analysis by Maze et al. [48] indicates that there are significant differences in the Peace and Caloosahatchee River flows between periods of large blooms and periods without blooms. It is important to note that flows from the Caloosahatchee are actively managed and regulated-to reduce potential flooding in the Lake Okeechobee region-whereas those of the Peace are not. The machine learning algorithms used here illustrated strong relationships between river flow and blooms. Strong river flow that occurs following hurricanes, regardless of whether those flows are natural or enhanced by active management, especially when these flows follow extended droughts, deliver substantial nutrient loads.
Air temperatures over Eastern North America (including Florida) are expected to increase~1.5 • C by 2050 and 3-4 • C by 2100 (relative to 2000), according to recent climate projections [96]. Additionally, rainfall over Florida is projected to decrease by 20-30% during the summer but to increase by 10-20% during the fall-winter, which is the season during which K. brevis blooms typically occur. This work underscores the important interactive roles of nutrient pollution and river flow in the increased frequency of K. brevis blooms in Florida. Due to climate change and the predicted increase in extreme precipitation events in a warming climate [97][98][99], it is expected that HABs will occur more frequently in the future, in Florida and elsewhere, unless substantial reductions in TN and TP land-based use and loading in the major rivers are achieved.

Conclusions
In conclusion, four new machine learning models were developed for the WFS and explored with regard to wind direction, temperature, river flow, nutrient load, and sea surface height as explanatory variables in predicting K. brevis blooms. The models had different strengths due to the differing degrees of complexity of the models, and they responded differently to the cross-validation procedures used. Overall, the RVM and NB models performed the best in predicting past events. By manipulating the range of explanatory variables, insight into the strength of their impact on blooms was obtained. These findings highlight that not only are reductions in both N and P necessary to reduce blooms, but reductions from multiple rivers are more effective than reductions from a single river. These models can be helpful in exploring the most effective combinations of nutrient reductions. Because river drainage basins are large, a 10-20% increase in fall-winter rainfall will translate into increases in discharges of multiple rivers with their combined higher nutrient loads during the K. brevis bloom period. This implies that to control blooms through nutrient reductions, greater reductions will be required than under present day flow conditions. Data Availability Statement: All the data and code are publicly available and accessible online. The data and code can be found at: https://github.com/mfli-ml/WFS_ML/.