Machine Learning Methods Applied to the Prediction of Pseudo-nitzschia spp. Blooms in the Galician Rias Baixas (NW Spain)

: This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas . This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and speciﬁcity. Classical machine learning approaches show higher sensitivities, but at a cost of lower speciﬁcity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.


Introduction
Harmful algae blooms (HABs) are an increasingly frequent and intense event in coastal areas worldwide [1,2].HABs affect the ecosystem and human health and impact on fish and aquaculture activities and regional economies [3].
The detection and monitoring of HABs is traditionally based on field samplings [3,4].Recently, increasing attention has been paid to the development of prediction models, which could aid in the early warning of blooms and improve the effectiveness of management programs.The prediction of phytoplankton blooms includes the application of several methods which can vary in modelling approach and complexity [5].
Conventionally, the prediction of harmful algal events is based on statistical numerical models, such as logistic regression, regression trees or Bayesian models [6][7][8][9][10].However, most of these approaches are limited to linear systems, while HABs usually occur in complex and highly dynamic coastal environments [11,12].
Machine learning models have been used due to their capability to deal with complex, often non-linear and noisy datasets and to generate predictive models of relatively high accuracy [13][14][15][16].
Support vector machines (SVMs) were first described in 1992 [17] and later developed for classification and regression [18,19].Briefly, SVMs are a linear classifier operating in a higher dimensional feature space using kernel functions.SVMs search for the optimal hyperplane by maximizing the margin using the Lagrange method to solve a quadratic optimization problem constrained by linear restrictions [20].SVMs have achieved good results in different applications related to the detection of blooms of different types of algae in both freshwater [15,[21][22][23][24][25] and coastal [16,20,[26][27][28][29] environments.
Neural networks (NNs) are widely used in environmental applications due to their ability to model multivariate, complex and non-linear data [30].Theoretical foundations of NNs were introduced by McCulloch and Pitts in the 1940s [31].A multilayer perceptron (MLP) NN is composed by a set of nonlinear computational elements (neurons or nodes) arranged in multiple layers that are interconnected in a feedforward way with each connection defined by a weight value.It usually includes an input layer, one or more hidden layers and an output layer.While the input layer distributes the input signals into the network, neurons in the hidden and output layers process their input signals using an activation function.MLPs are trained by a back-propagation learning procedure which adjusts iteratively the weight values in order to minimize an error function [30].NNs have been extensively applied to HABs prediction, especially in fresh-water systems [15,23,[32][33][34].In coastal waters, NN models have been developed for predicting blooms of specific HAB species [16,28,29,[35][36][37][38] or estimating in advance the chlorophyll-a concentration, which is often used as a proxy for phytoplankton abundance [39,40].
Random Forest (RF) was initially created by Tin Kam Ho applying random decision trees.Breiman and Cutler extended these random decision trees by introducing the bootstrap aggregation (or bagging) technique [41].RF is an ensemble machine learning algorithm which consists of multiple decision trees working in parallel, so that the final output is defined using a "voting" system.In case of binary classification (bloom/no bloom), the final result is the most "popular" class, i.e., the output for the majority of trees [41].RF has been extensively used in ecological modelling [42][43][44], and during the last years, it has been successfully employed to HAB prediction [15,26,28,[45][46][47][48].
Boosting is a machine learning technique based on combining a set of weak classifiers to create a high-performance prediction rule.AdaBoost (Adaptative Boosting), introduced by [49], is one of the most widely used boosting algorithms.AdaBoost combines single-node decision trees named stumps, in series: once a tree is trained, the subsequent tree tries to correct the errors in the previous one, by adapting the weights associated with each tree in order to take the final decision.AdaBoost has been recently applied to different ecological applications [50,51], while few works are also focused on microalgae [16,26,52].
Some studies show comparative results using different machine techniques.For example, Yu et al. [16] found that gradient boosted descent tree achieves a better performance as compared to other methods such as SVM, NN or AdaBoost.Ribeiro and Torgo [23] reported that SVM provides better accuracy than regression trees or NN predicting algae blooms in the Douro River (Portugal).The consensus method showed the best accuracy for most of the phytoplankton datasets analyzed by [26].The latter also obtained good results using boosting, SVM and specially RF.In the approach based on remote sensing data developed by [28], long short-term memory recurrent NN showed better results than MLP, RF or SVM.Results by [29] in Tolo Harbour (Hong Kong) indicated that MLP outperforms SVM or generalized regression neural networks.
In view of these results, there is not an unequivocal "best" machine learning method for HABs prediction, and the development of an adequate approach is dependent on the target species, the region and the characteristics of the available dataset.
This study focuses on the HABs caused by Pseudo-nitzschia spp. in the Rias Baixas area in Galicia (NW Spain).Several species of the genus Pseudo-nitzschia such as Pseudonitzschia multiseries and Pseudo-nitzschia australis have been associated with the production of the neuro-toxin domoic acid (DA), which causes amnesic shellfish poisoning (ASP) toxicity [54,55].This genus is generally common in upwelling systems, for example, in the California Current System [56], in the Iberian System [20,57,58] and the Benguela area [59].
Few works have addressed the prediction of Pseudo-nitzschia spp.blooms on coastal areas.Blum et al. [60] developed multiple linear regression models to predict DA level using data from culture and natural blooms in Cardigan Bay (Canada).Stepwise linear and logistic regression models were developed to predict both DA and Pseudo-nitzschia spp.abundance in the Santa Barbara Channel [61], Monterey Bay [8] and Chesapeake Bay [6].A mechanistic model was proposed by [62] to simulate the DA production.Empirical approaches based on the prediction of favorable conditions for the development of blooms of Pseudo-nitzschia spp.were explored for the Lisbon Bay [58] or the Galician coast [63].Cusack et al. [64] forecasted the onset, abundance and duration of Pseudo-nitzschia blooms in the southwest coastal waters of Ireland using a zero-inflated negative binomial model.The application of particle tracking models to simulate the short-term dynamics of Pseudonitzschia spp.events have been proposed for southwest Ireland [65] and the north-west Pacific [66].Townhill et al. [67] have projected the future distribution of Pseudo-nitzschia spp.and other species in the north-west European shelf using the maximum entropy MaxEnt model.Absence/presence and abundance short-term forecast models based on NNs and using biological and environmental 20-years long time-series from Alfacs Bay (NW Mediterranean) were developed by [36].
This study builds on previous work developed by [20], in which absence/presence and no bloom/bloom Pseudo-nitzschia spp.SVM models were developed for the Galician Rias Baixas using a long-term dataset (1992-2001) of environmental parameters.For no bloom/bloom models, one of the main issues is the imbalance of the dataset, with only around 15% of samples identified as bloom.Consequently, models with a good sensitivity tend to show a higher rate of false alarms.Therefore, the main aim is to explore the potential of different machine learning methods to find a better balance between sensitivity and precision and improve the results found by [20].In addition to the most extended machine learning technologies for HABs prediction (SVM and NN), we introduce recent ensemble learning methods (RF and AdaBoost), which are expected to show a better generalization ability by combining multiple weak learners [16].
In summary, we present a comparison of different machine learning approaches, i.e., MLP, SVM, RF and AdaBoost, to predict blooms of Pseudo-nitzschia spp. in the Rias Baixas.Models were developed and validated using a long-term weekly dataset (2002-2012) of environmental parameters using the same approach and dataset structure (combinations of variables) proposed by [20].

Study Area
A ria (rias in plural) is a Spanish name for the V-shape coastal embayments located on the west coast of Galicia (NW Spain), which were formed by the partial submergence of ancient river valleys.The Rias Baixas are the four southern rias, from south to north: Vigo, Pontevedra, Arousa and Muros (see Figure 1).They are located along the northern boundary of the NW African upwelling system [68].In this region, the seasonality of the ocean dynamics is governed by the relative strengths and latitudinal shifts of the Azores high-pressure and the Iceland low-pressure systems.Wind-driven processes in the area generate strong upwelling of rich nutrient cold waters in the period from May to September (these waters rise up to the photic zone, reaching the surface in the more intense upwelling) [69,70], leading to significant increases in primary production (up to 7.4 g C m −2 d −1 ) during upwelling events [71].This area supports an intensive mollusk (mainly mussels) culture using floating rafts.Galicia is the region with the highest production of aquaculture mussels in Europe and one of the world leaders.Mussel production in this area reaches approximately 250,000 t per year, equivalent to 41% of the European production and 15% of the world production [72].
These activities are seriously hindered by harmful algae blooms (HABs), which cause an important ecological, social and economic impact since they can even force the closure of production areas [73,74].HABs are a frequent and well-documented phenomenon in Galicia, and a great number of studies can be found in the literature since the 1950s [75][76][77][78].One of the main HAB-forming taxonomic groups is Pseudo-nitzschia spp., which has been detected since 1994 as the causative agent of ASP toxic events due to the production of domoic acid (DA) [79][80][81].
Due to the economic importance of aquaculture, a monitoring program of HAB species was set up in Galician waters.The Technological Institute for the Control of the Marine Environment of Galicia (INTECMAR) conducts a weekly routine sampling, measuring the abundance of Pseudo-nitzschia spp.(and other potentially toxic species), water quality parameters and biotoxin levels in mussels and other mollusks [20].
Prediction in advance of HABs is very important for mollusk producers in terms of organization and logistics, as well as for the social policies related to one of the main economic driving forces in the region [20,80,81].

Dataset
The dataset used in this study includes weekly records of water parameters (i.e., temperature, salinity and Pseudo-nitzschia spp.abundance) and upwelling indices between January 2002 and December 2012.We used the same dataset structure and variables presented in [20].
Water parameters (i.e., temperature, salinity, Pseudo-nitzschia spp.abundance) were measured by INTECMAR as part of its monitoring program consisting of a weekly sampling at 38 sampling stations distributed across the four Rias Baixas.We only considered data from the outer stations located at the mouth of each ria (Figure 1; V5 in the ria of Vigo, P4 in the ria of Pontevedra, A0 in the ria of Arousa and M5 in the ria of Muros).These stations are not affected by local processes (e.g., river discharges) and are considered to be representative of the open ocean conditions.
Temperature and salinity measurements were collected in situ from the surface to 5 m depth, approximately every 10 cm, using a Seabird Model 25 CTD.Specifically, temperature was measured using a Seabird SB3 thermistor (range: −5 • to +35 • C; precision: 0.02 • C; resolution: 0.0003 • C) and salinity is based on a Sea Bird SBE4 conductivity cell (range: 0-7 S/m; precision: 0.0003 S/m; resolution: 0.00004 S/m).Temperature and salinity values for each week and station were then computed by integrating all the available valid measurements using the trapezoidal rule.
Regarding phytoplankton, samples were collected using tow nets (10 µm mesh) from the surface to 15 meters depth, fixed with formaldehyde 4% and stored under dark and cool conditions.Total abundances (in cells L −1 ) of Pseudo-nitzschia spp.and other potential toxic taxonomic groups were counted using an inverted light microscope at 250× and 400× magnification [82].
As proposed by [20], five upwelling indices (Iw), one for the sampling day and four for each one of the four previous days, were estimated from wind data measured at Cape Silleiro oceanographic buoy (42.12 • N, 9.43 • W, see Figure 1).This buoy, moored at a depth of 323 m, is a SeaWatch buoy of Puertos del Estado (http://www.puertos.es,accessed on 20 January 2021) equipped with meteorological instruments to measure wind speed (range: 0-60 m/s; accuracy: ±0.3 m/s) and direction (range: 0 to 360 • ; accuracy: ±3 • ) at 10 m above sea level every ten minutes.Winds at this location are considered representative of wind conditions in the Galician area [83].
Upwelling indices were computed using the Bakun's method [84]: In the equation above, τ y is the meridional component of the wind force (N m −2 ), ρ w is the density of seawater (1025 kg m −3 ), f is the factor of Coriolis (9.9 × 10 −5 s −1 at 42 • de latitude), ρ a is the density of the air (1.2 kg m −3 at 15 • C), C D is a dimensional empirical drag coefficient (1.4 × 10 −3 ), and W and W y are the daily average of the wind speed and its meridional component, respectively.Positive upwelling indices indicate upwelling conditions (dominant northerly winds, negative W y values) while negative indices are related to downwelling situations (dominant southerly winds, positive W y values).
Variables are summarized in Table 1.Note that water parameters (temperature, salinity, Pseudo-nitzschia spp.abundance) were not available every week due to two reasons: (1) there was no field sampling campaign because of bad weather or ship breakdown; (2) sampling problems, such as erroneous readings, instruments failures or excessive detritus.The number of valid records for these variables is also shown in Table 1.Models were developed using the same combination of variables proposed by [20] and labelled using letters from A to D: A includes all the available variables; B excludes the spatial and temporal effects (ria code and day of the year); C also discards information about bloom occurrence in previous weeks (bloom-1w, bloom-2w) and D is only based on upwelling indices.A dataset was built for each combination by associating the input variables with the corresponding output (bloom or no bloom) for each day and ria, removing records with invalid or missing values for any of the input or output variables.The variables and number of records available for each combination are shown in Table 2.After removing invalid and missing values, final datasets include between 79.66% (combination A or B) and 83.62% (combination B) of 2296 potential records (i.e., 574 weeks × 4 stations).

Model Selection
We divided the complete dataset for each combination of features (A, B, C or D) into two independent datasets, the training (2/3 of the total records) and validation sets (1/3 of the records).Both subsets were built including a similar percentage of both classes (bloom and no bloom).Models were developed (i.e., trained) using the training set, while the validation set was useful for obtaining an independent set of performance measurements to select the optimal model.Table 3 summarized the number of records included in each dataset for both classes.In the case of the SVM and NN, input data were linearly scaled to a range between −1 and +1 before training in order to avoid a greater influence of variables with larger numeric ranges [85].Categorical features (bloom occurrence in previous weeks and ria code) were firstly converted into numeric data.Table 1 shows the minimum and maximum values used to scale these variables.
The optimal parametric configuration for each combination of features and method was selected by a hyperparameter optimization.This was based on a grid search approach, i.e., models using different parameters controlling the learning process are trained and evaluated, and the best models are selected according two metrics (F1-score and distance to point (0,1) in the operating receiver characteristics (ROC) curve, see Section 2.3).The metrics were computed using the validation set.Selected hyperparameters and swept values are shown in Section 2.4.All the experiments were carried out using the libraries available in MATLAB with the exception of NNs, which were also programmed in Matlab but using a custom code.The code is available in a GitHub repository: https://github.com/currobellas/NeuralNetwork (accessed on 20 January 2021).

Performance Measurements
The models' performance was evaluated by comparing the results with the real output through a confusion matrix [86], a 2 × 2 table showing the number of samples correctly classified for both classes (no bloom and bloom), as well as the number of false positives and false negatives.
Different metrics extracted from the confusion matrix are proposed in the literature.In this work, we considered the sensitivity and the specificity, i.e., the percentage of bloom and no bloom records correctly classified, respectively; and the precision, fraction of true positives with respect to the total number of records classified as bloom.Regarding global metrics, global accuracy (percentage of records correctly classified) could provide misleading information because of the unequal distribution of classes (around 16% of bloom, 84% of no bloom).Hence, we used the F1-score, defined as the harmonic mean of precision and sensitivity, which is widely used with imbalanced datasets [87].
In order to compare models graphically, we also built operating receiver characteristics (ROC) curves plotting the sensitivity against the false positive rate (1-specificity) and computed the distance to the optimal point (0,1) (i.e., all the bloom records are correctly classified at this point) as a metric combining sensitivity and specificity [88,89].
Although confusion matrixes were built from the training and validation subsets, results from the validation set are expected to be more reliable since this subset is not included in the training process.Therefore, optimal models were selected according to two criteria based on metrics computed from the validation set: maximum F1-score and minimum distance to the optimal point (0,1) in the ROC curve.All the metrics shown in this work are based on the validation results.

Support Vector Machines (SVM)
In this work, in addition to the Gaussian radial basis (RBF) kernel chosen by [20], we tested polynomials kernels of various degrees (until 50) for each combination of features (A, B, C or D).

Multilayer Perceptron (MLP)
For each combination of features (A, B, C and D), we performed 660 tests varying the following settings:

•
Number of hidden layers in the neural network: from 1 to 10.

Random Forest (RF)
We employed the following parameters to unravel the optimal parametric configuration for each combination of variables (A, B, C or D):

•
Number of weak predictors selected as subset for each tree: between 2 and 4.
Note that a weak predictor is considered as a variable to make a decision, and hence the number of weak predictors has to be lower than the total number of input variables.

AdaBoost
We have combined two parameters for searching the optimal model for each combination of features (A, B, C or D):

Learning Curves
Learning curves were obtained by plotting the training and validation errors against the number of samples used in the training process [90].Training error is expected to be low when the samples number is low, but it will increase as new samples are included in the training process.The validation error is expected to decrease with an increasing number of training samples.These curves are useful for analyzing the trade-off between bias and variance errors.
If a model shows a good fitting to the training set but fails to fit the validation set, there is a problem of variance or overfitting.In the learning curve, as the number of samples increases, the training error remains low, but the validation error is high.In this case, the model could be improved by increasing the number of training samples.
However, if both training and validation errors are similar but with high values, there is a problem of bias or underfitting.In the learning curves, as the number of samples increases, both error curves tend to touch themselves.In this case, increasing the number of samples would not reduce the error and the model could be improved by adding new variables.

Pseudo-nitzschia Spp. Distribution
Pseudo-nitzschia spp. was detected (above the detection limit) in approximately 63% of samples, although only 15% were identified as bloom (abundances greater than 10 5 cell/L).Eleven records showed abundances over 10 6 cell/L, with a peak of 3.19 × 10 6 cell/L and an average value of around 62,000 cells/L.Table 4 shows the sampling effort, i.e., the total number of weeks sampled, and the bloom incidence, defined as the percentage of weeks affected by a Pseudo-nitzschia spp.bloom, for each ria and year.Note that Muros shows a lower sampling effort because this ria is not protected by islands located at the mouth (Figure 1) and hence the station M5 is more affected by severe weather situations.Vigo was the most affected ria with 94 blooms (near 30% of the total number of blooms), followed by Pontevedra (84), Arousa (76) and Muros (72).Bloom incidences varied from 16.85% in Vigo to 13.74% in Arousa, with values around 15% in both Pontevedra and Muros (Table 4).These results differ from the ones observed in [20] for the previous decade (1992)(1993)(1994)(1995)(1996)(1997)(1998)(1999)(2000), which reported Pontevedra as the most affected ria.In fact, bloom incidence was remarkably lower in Pontevedra (from 24.28% to 15.11%) and Muros (from 21.39% to 14.81%) but showed similar values in Vigo (from 17.13% to 16.85%) and Arousa (from 13.55% to 13.74%).Further research would be required to test if there was an actual change in the Pseudo-nitzschia spp.spatial distribution pattern or these variations simply fall within the expected variability.
There were remarkable variations in the bloom distribution throughout the years covered in the dataset (Table 4).The number of weeks with at least one ria affected by a Pseudo-nitzschia spp.bloom varied from 18 in 2005 and 2009 to less than 10 in 2006 and 2012.Moreover, spatial patterns were also very variable: although Vigo was generally the most affected ria (2003, 2007, 2009, 2011 and 2012), Pontevedra, Arousa and Muros showed the highest incidence in 2005, 2002 and 2010, respectively.As compared to the dataset used in [20], it is noteworthy that none of the years showed more than 20 bloom weeks as observed in 1992 and 1993.
Figure 2 shows the monthly distribution of Pseudo-nitzschia spp. in the area for the complete period (2002-2012).Most of blooms (83.74 %) were identified between May and September, while the number was very low between October and April, and even zero in December and January.The results confirm the findings of previous studies showing that Pseudo-nitzschia spp. is common in spring and late summer as a consequence of the upwelling favorable conditions [20,58].Bloom incidence between May and September remains in values around 30% apart from a sharp decrease in June.This pattern is different from the one observed in the 1992-2002 period [20]: it almost doubles in May and September (from ~15% to ~30%) but decreases in June (from 29.27% to 21.98%) and July (from 42.47% to 31.61%).Variations could be related to changes in upwelling patterns which would need to be further researched.

Input Variables
In this study, we worked with the same input variables selected in [20], which were already reported to be related to the occurrence of Pseudo-nitzschia spp.blooms in the study area.These relationships were confirmed by the exploratory analysis of the dataset.
All the numerical variables are significantly positively correlated (p < 0.01) with the Pseudo-nitzschia spp.abundance (Table 5).Moreover, despite the fact that there is some degree of overlapping between no bloom and bloom classes (check minimum and maximum values in Table 5), significant differences (p < 0.01) were also found for each variable between both classes using Mann-Whitney tests.
Table 5. Basic statistics (average ± standard deviation, minimum -maximum) of each numerical variable for the complete dataset (combination A) and for no bloom and bloom classes, as well as the Pearson correlation coefficient (r) between each variable and Pseudo-nitzschia spp.abundance (transformed as log10 (1 + [Pseudo-nitzschia spp.(cells/L)]) (** significant at p < 0.01).The higher average temperature of bloom samples is explained by the typical seasonal pattern of Pseudo-nitzschia spp., which is more abundant in spring-summer (Figure 2).As indicated in [20], although blooms of Pseudo-nitzschia spp.often coincide with decreases in temperatures associated with upwelling events in spring-summer, temperatures in autumn and winter are even lower.

Variable
Regarding the salinity, increases in precipitation and/or river inputs lead to episodes of low salinity, which are more frequent in autumn-winter [20].Hence, the lower average salinity of no bloom samples could be also related to the seasonal pattern.
Upwelling events (positive upwelling indices) are usually more frequent in springsummer while downwelling events (negative upwelling indices) are common in autumnwinter [20].Moreover, blooms of Pseudo-nitzschia spp.are usually detected during both upwelling events and the posterior relaxation period [20].As a consequence, all the upwelling indices showed average positive values for the bloom class versus negative or nearer zero average values of no bloom samples (Table 5).Note that the maximum correlation and difference between both classes is found using the upwelling index measured three days before the sampling date (Table 5).
The occurrence of blooms of Pseudo-nitzschia spp. in the previous weeks to the observed was included because phytoplankton blooms (including Pseudo-nitzschia spp.) seem to progress along the Galician coast according to the dominant winds and currents [20,74].Significant differences for both bloom-1w and bloom-2w between no bloom and bloom were found using Mann-Whitney tests.Around 89% of no bloom samples were not related to blooms in previous weeks (bloom-1w = 0 and bloom-1w = 0), while 47.62% and 39.12% of bloom samples showed a bloom in the previous week (bloom-1w ≥ 1) and two weeks earlier (i.e., bloom-2w ≥ 1), respectively.
The ria code and the day of the year are related to the spatial and temporal distribution of Pseudo-nitzschia spp.abundance (see Section 3.1.1).For these variables, there is not significant differences between both classes according to the Mann-Whitney tests results because of a higher overlapping.
Comparing the dataset in [20] (1992-2002) with the one used in this work (2002-2012), the slight increase in the average temperature (from 14.75 • to 15.02 • ) and upwelling index (from 39.68 to 170.43, indicating a higher prevalence of upwelling conditions) are remarkable.Further research would be required to check if these variations are related to the climate variability.

Support Vector Machines (SVM)
Figure 3 shows the F1-score computed in the validation set using SVMs with RBF and polynomial (of degrees from 2 to 9) kernels for each combination of features.Overall, polynomial kernels improve the results from the RBF kernel, with an increasing sensitivity and specificity but at the cost of an increasing number of false positives (less precision) as the polynomial degree increases.
The best global results are achieved with model B, specifically using a polynomial kernel of degree 5 (F1-score = 0.46).The best SVM with combination A is achieved with a degree of 4, while combinations C and D require a higher degree of 9.

Neural Networks (NN)
The best NN results are achieved using the combination A, although models B and C show similar results (Table 6).Note that model C requires more computational load (optimal achieved with 10 hidden layers) than A (six hidden layers) or B (four hidden layers).Results using combination D are poorer in terms of both F1-score and distance to point (0,1) in the ROC curve, indicating that upwelling indices are not sufficient to obtain reliable predictions using NNs.Table 6.Results of the multilayer perceptron (MLP) neural network (NN) with the best parametric configuration for each variable combination.Criteria for the selection of the best model are maximum F1-score (F1), minimum distance to the optimal point (0,1) in the operating receiver characteristics curve (ROC) or both (F1/ROC).All the metrics are computed from the validation set.(OA: overall accuracy; Sens.: sensitivity; Spec.: specificity; Prec.: precision; Dist.(0,1): distance to the point (0,1) in ROC curve).The best global MLP NN, with a maximum F1-score of 0.53 and based on combination A, was obtained with six hidden layers, 500 iterations, and a regulation factor of 0.05.The MLP with the minimum distance to point (0,1) in the ROC, which is also based on A, shows a higher sensitivity but at the cost of a lower specificity and precision (more false positives) (Table 6).

Random Forest (RF)
Overall, RF shows good results with a good balance between sensitivity, specificity and precision, regardless of the training parameters.Unfortunately, there is also a high tendency for overfitting, with metrics near 1 in the results computed in the training set, especially in models A, B and C (Table 7).
Table 7. Results of the Random Forest (RF) with the best parametric configuration for each variable combination.Criteria for the selection of the best mode are maximum F1-score (F1), minimum distance to the optimal point (0,1) in the ROC curve (ROC) or both (F1/ROC).All the metrics are computed from the validation set.(OA: Overall accuracy; Sens.: sensitivity; Spec.: specificity; Prec.: precision; Dist.(0,1): distance to the point (0,1) in ROC curve).However, RF tends to improve using less variables, indicating that these algorithms could be improved by including more data instead of more variables.In fact, the optimal predictor (40 bags and four variables selected as weak predictors) was based on combination D (with only five variables), with a maximum F1-score of 0.60 and a minimum distance to point (0,1) in the ROC curve of 0.45.It also shows less overfitting as compared to the other optimal models (Table 7).

AdaBoost
Unlike other machine learning methods shown in this work, AdaBoost models were selected according to the maximum F1-score and minimum distance to point (0,1) in the ROC curve being different for all the combinations of variables (Table 8).In general, models with minimum distance show a higher sensitivity, but at the cost of a lower specificity and precision, producing more false positives.However, while models based on A and B present a better balance among the different metrics and similar results in both predictors, there are more marked differences between both optimal models for combinations C and D. Overall, better models were obtained using the RUSBoost variant, with the exception of the model D based on the GentleBoost maximizing F1-score.
Table 8. Results of the AdaBoost models with the best parametric configuration for each variable combination.Criteria for the selection of the best model are maximum F1-score (F1), minimum distance to the optimal point (0,1) in the ROC curve (ROC) or both (F1/ROC).All the metrics are computed from the validation set.AdaBoost variant is also shown.(OA: Overall accuracy; Sens.: sensitivity; Spec.: specificity; Prec.: precision; Dist.(0,1): distance to the point (0,1) in ROC curve).The best global models are based on combination B, including all the variables except for the day of the year and ria.The model maximizing F1-score, trained with 200 cycles, shows a F1-score of 0.61 and a good balance between sensitivity and specificity.Results of the predictor minimizing the distance to point (0,1) in the ROC curve (based on 10 cycles) are very similar, with a slightly higher sensitivity but lower specificity.The results of the model D maximizing F1-score (F1-score = 0.59) are also remarkable, since this model uses only upwelling indices (Table 8).

Models' Performance
In general, a good binary classifier is expected to show a good balance between sensitivity and specificity, i.e., good individual accuracies for both classes (no bloom and bloom).Hence, the minimum distance to the optima point (0,1) in the ROC curve was used as one of the criteria to search for the optimal models.However, due to the unequal distribution of both classes (~15% of bloom, see Table 3), models selected according to this criterion show a high percentage of false positives (low precision).Therefore, the F1-score was also used for the selection of optimal algorithms, since this metric is considered more balanced for unbalanced datasets [91,92].
On some occasions, models with the maximum F1-score show also the minimum distance to the optimal point (0,1) in the ROC curve.Where there is a discrepancy, algorithms with the maximum F1-score generally show a lower sensitivity than the equivalent ones selected with minimum distance, but a better balance between sensitivity, specificity and precision.
Table 9 compares the metrics computed from the validation set using the best models for each method according to both criteria, also including results from [20].Overall, new algorithms (RF and AdaBoost) provide more suitable results than classic ones (SVM and NN), showing better values for global metrics (F1-score and distance) and a better balance between sensitivity and specificity.Classic algorithms show a higher sensitivity, but at the cost of lower specificity and precision and a high percentage of false alarms.
Table 9. Results of the best models for each machine learning method, as well as for the SVM model based on A developed by Gonzalez Vilas at al. [20].Subscripts indicate the criterion for model selection in the case of discrepancy between the maximizing F1-score and minimizing distance to point (0,1) in the ROC curve.All the metrics are computed from the validation set.(OA: Overall accuracy; Sens.: sensitivity; Spec.: specificity; Prec.: precision; Dist.(0,1): distance to the point (0,1) in ROC curve).Figure 4 shows the ROC curves for the four methods, evincing graphically the models with a better balance between sensitivity and specificity (i.e., with a shorter distance to the optimal point (0,1)).For NNs and SVMs, better models are based on all the variables (combination A).For RF, models using variables in combination D show clearly shorter distances, evidencing that upwelling indices are sufficient to produce reliable predictions results.In case of AdaBoost, models B show the best global results.In fact, RUSBoost algorithms based on combination B show not only a short distance to the optimal point (0,1) in the ROC curve, but also high F1-score values (Table 8).The better results obtained by new algorithms (RF and AdaBoost) could be explained by several reasons.Firstly, classic algorithms require data scaling since they rely on homogenous feature ranges to work properly.However, scaling has no impact on the performance of RF and AdaBoost [85,86,90].Secondly, the combination of learners of ensemble methods leads to a better bias/variance tradeoff and an improvement of the generalization ability as compared to algorithms relying on single hypothesis [16,86].Note that results were computed from a validation subset that was not used in the training procedure.Finally, results also seem to indicate a better adaptation of ensemble algorithms (RF and AdaBoost) to unbalanced datasets [93,94].

Method
Comparing the classical algorithms, MLP improved SVMs despite the fact that SVMs are expected to be a more robust technique for two-class classification problems.However, the SVM results seem to be more affected by the imbalance between both classes.The learning curve (Figure 5) shows some overfitting, so that the model tends to better predict the most frequent class (no bloom) leading to a higher specificity but at the cost of a lower sensitivity.Weighted SVM, i.e., the application of a different weight for each class to correct the imbalance effect [20], could improve the results.On the other hand, the learning curve for MLP (Figure 5) indicates an underfitting situation, and hence the model would be expected to improve by adding new variables.Between RF and AdaBoost, AdaBoost works slightly better.Although the best RF model shows the best values of overall accuracy, specificity and precision, it fails in sensitivity as compared to AdaBoost models, which show the highest sensitivities for the decade 2001-2012 and the best balance between metrics, leading to the maximum F1-score and the minimum distance to the point (0,1) in ROC curve.
As with the SVM, the learning curve of the best RF algorithm (model D, Figure 5) evidences some overfitting, leading to a higher accuracy of the majority class (no bloom) and a lower sensitivity as compared to AdaBoost.In general, models with overfitting are expected to be improved by increasing the number of samples.In case of RF, variance error can be also lowered by reducing the model complexity (i.e., the number of individual decision trees) selecting less input variables.In fact, as explained in Section 3.3, overfitting is clear with RF based on A, B and C, with an almost perfect fitting observed in the training set, while model D shows better results.Note that this model, despite of working with less variables, uses more input samples due to the availability of more valid records (see Tables 2 and 3).
In the case of AdaBoost, the learning curve (Figure 5) of the best model indicates some underfitting, and hence the model could be improved with another selection of input variables.The number of samples does not seem to be critical, since the training error remains more or less constant over 1000 training samples.
One of the main limitations of modelling blooms of Pseudo-nitzschia spp. in Galicia is the imbalance in the dataset.Results could be improved by building a balanced dataset through the selection of a number of no bloom samples approximately equal to the number of bloom records.However, this process could lead to a loss of important information for the discrimination of both classes.
Future work could include the development of specific models for a ria and/or time period to work with more balanced datasets.For instance, for the period from May to September, which sums up more than 80% of all the Pseudo-nitzschia spp.events (see Section 3.1.1),the dataset would include near 30% of bloom (70% of no bloom) samples.However, in addition to their limited temporal and/or spatial applicability, the smaller number of records or the loss of key information might affect the results.
In this study, we trained the models using 2/3 of the total records, using the remaining records (1/3) as an independent set for validation.The comparison of models developed using a growing number of training samples (e.g., 2/3, 3/4 and 4/5), and hence less records for the validation (e.g., 1/3, 1/4 and 1/5), would be an interesting approach for gaining insight into the generalization capability of the different machine learning techniques or feature combinations.
This work mainly focused on the comparison of different machine learning techniques, so that models were developed using default options and a simple grid-search approach for selecting the optimal values of their main parameters.Therefore, all the results could be improved by a finer adjustment or the application of advanced approaches in the pre-processing steps, as scaling (for SVM and MLP) or feature selection.

Variable Contribution
Despite the correlations found between the input variables and the Pseudo-nitzschia spp.abundance (Table 5), discrimination between bloom and no bloom classes is a complex and non-linear problem.None of the variables themselves are sufficient to identify bloom conditions because of the overlapping between both classes.
Overall, models B perform better than models C and similar or better than models A (e.g., AdaBoost, see Table 8), indicating the importance of the occurrence of blooms in the previous week for the predictions.In contrast, several models work correctly without temporal (day of the year) or spatial (ria) information.Note that significant differences between both classes were not found for these two variables using Mann-Whitney tests (see Section 3.1.2),while other variables are directly related to the temporal (e.g., temperature, salinity, upwelling indices) and spatial (bloom occurrence in previous weeks) distribution patterns of Pseudo-nitzschia spp.
Upwelling indices seem to be a critical and necessary variable for prediction, since even models running only with upwelling indices (model D) achieve reasonable results (e.g., RF, see Table 7).Upwelling indices include several positive and negative peaks, so that the good results achieved by RF/D could be explained by its robustness to outliers as compared to other algorithms [41].
The inclusion of new variables could improve the results, especially for models showing underfitting in the learning curves (MLP and AdaBoost).Without using new datasets, variations of temperature or salinity from the previous to the present week could provide important information about the evolution of Pseudo-nitzschia spp.blooms.
Chlophyll-a concentration is a good proxy for phytoplankton abundance.Spyrakos et al. [74] showed the potential of satellite regional chlorophyll-a algorithms to detect "patches" of high Pseudo-nitzschia abundances in the rias.Several authors have proposed the use of satellite data (chlorophyll-a concentration, sea surface temperature) as an input of Pseudo-nitzschia spp.prediction models [61,63,65].
The relationship between the abundance of Pseudo-nitzschia spp.and the concentrations of different nutrients (mainly nitrate, nitrite, phosphate and silicate) have been extensively analyzed in observational studies [95,96].Defining the role of nutrients is a challenging task since concentrations vary during the bloom development and hence relationships can be different depending of the instant in which the abundance is measured.Even so, nutrients concentrations have been successfully included as a forcing driver in Pseudo-nitzschia spp.prediction models [6,8,41].
Other interesting variables could be river input or precipitation, as indirect indicators of the input of nutrients though freshwater discharges.Note that this effect is included, to some extent, in the salinity values.Moreover, some authors ( [97] and references therein) have indicated that upwelling is the main source of nutrients in the Rias Baixas area, while freshwater discharges play a less important role.

Comparison with Other Works
SVM algorithms with data from 1992 to 2002 [20], show a good sensitivity, but at the cost of a high number of false positives, showing a poorer specificity and precision as compared to RF or AdaBoost (Table 9).However, it improves the best SVM for the decade 2002-2012 considering both F1-score and distance to the point (0,1) in the ROC curve.This better performance could be explained because SVMs developed by [20] include a finer adjustment of the models, using a different weight for each class to correct the imbalance effect.
The best-fit linear model for Pseudo-nitzschia spp.abundance developed by [61] for the Santa Barbara Channel was able to correctly classify around 75% of bloom observations and 93% of no bloom observations, resulting in an excellent F1-score of 0.82.Results are not directly comparable, since they worked with a small (n = 77) and balanced dataset with a log-normal distribution of the Pseudo-nitzschia spp.abundance, used a different threshold to discriminate between no bloom and bloom classes and input features included data from satellite images and nutrient concentrations.
Good results were also found by [8] using logistic regression for the annual and seasonal prediction of toxigenic blooms of Pseudo-nitzschia spp. in the Monterey Bay (California), using temperature, salinity, upwelling indices, nutrient concentrations and river flow as input features.Working with more balanced datasets (around 40% of blooms, between 207 and 222 samples), they reported F1-score values between 0.70 and 0.76 in their best models.
Using similar input features as [8] and a long-term 22-year dataset, a logistic generalized linear model (GLM) approach was developed to predict potentially toxigenic Pseudo-nitzschia blooms in the Chesapeake Bay [6], reporting optimal values of sensitivity and precision of 0.75 and 0.48, respectively.They worked with unbalanced datasets (around 10% of blooms with a 100 cell/L threshold), but applied lower thresholds (10, 100 and 1000 cell/L) than in our work to discriminate no bloom and bloom classes.

Conclusions
In this work, we have developed a set of models to predict the appearance of Pseudonitzschia spp.blooms in the Rias Baixas area on a specific date and ria.The models are based on variables that can be operationally monitored and/or predicted in the short-term, and hence could be easily implemented to establish an early warning system.This system could provide useful information to the local mussel producers to complement the monitoring program and help to mitigate the potential impact of blooms on mussel production and public health.
Blooms are defined in terms of overall abundance (10 5 cells/L).Therefore, an important limitation is that toxic blooms are not discriminated, considering that DA production depends on the species and not all the blooms are toxic [14].Although data on specific species are not available, there is limited information about the closure of mussel floating farming parks caused by ASP and DA concentration on molluscs (only since 2016), which could be useful for discrimination toxic blooms and improve the models in future.New variables, e.g., nutrient concentrations or vertical distributions, could also be introduced to improve the algorithms.
Models maximizing the F1-score were selected as the optimal ones to establish a HAB prediction system because they are expected to show a high sensitivity (if a bloom occurs in the study area, the system must predict it) and precision (if a bloom is predicted, it must exist) [60].However, if users are more interested in predicting more blooms at the cost of more false alarms, models minimizing the distance to the optimal point (0,1) in ROC curve could be a better option.
HAB prediction systems will be more accurate if they include monitoring data with a higher temporal frequency and spatial coverage.For instance, the analysis of map products derived from optical satellite images, including chlorophyll-a concentration or species indicators, could complement the existing monitoring program based on direct observations [98].Moreover, the development of automated sensor systems with a high frequency of measurements at different stations and depths, or the use of drones will allow a breakthrough in the detection of toxic algae development events in coastal systems.
Finally, machine learning methods could also be useful for studying potential changes in the distribution patters of HAB-forming species associated with climate variability.

Figure 1 .
Figure 1.The Rias Baixas (Muros, Arousa, Pontevedra and Vigo) are located on the SW coast of Galicia (NW of the Iberian Peninsula).Locations of the stations used for this study (M5, A0, P4, V5 and Buoy of Cabo Silleiro) are shown.

Figure 2 .
Figure 2. Monthly distribution of the total number of blooms of Pseudo-nitzschia spp.(black line, left axis) and bloom incidence (grey bar, right axis) for the Galician Rias Baixas from 2002 to 2012.

Figure 3 .
Figure 3.Comparison of support vector machines (SVM) models using the F1-score statistic.A, B, C, D: combination of features; rbf: Gaussian radial basis kernel; p2-p9: polynomial kernel of degree from 2 to 9.

Figure 4 .
Figure 4. ROC curves plotting classification results computed from the validation set for models based on different parametric configurations and variable combinations (A, B, C or D) for the four machine learning methods used in this work: (a) SVM; (b) MLP NN; (c) RF; and (d) AdaBoost.

Figure 5 .
Figure 5. Learning curves for the best models (maximizing F1) for each machine learning method.

Table 1 .
List of variables included in the models, indicating the number of records and valid records, as well as the minimum and maximum values used to scale each variable.

Table 2 .
Combinations of variables used to develop the models, and total number of records available for each combination.

Table 3 .
Number of records available for de training and validation of the models for each combination of features (A, B, C or D, see Table2).

Table 4 .
Sampling effort (Tot.) and bloom incidence (Bloom) for each ria and year.