A Remote Sensing and Machine Learning-Based Approach to Forecast the Onset of Harmful Algal Bloom

: In the last few decades, harmful algal blooms (HABs, also known as “red tides”) have become one of the most detrimental natural phenomena in Florida’s coastal areas. Karenia brevis produces toxins that have harmful effects on humans, ﬁsheries, and ecosystems. In this study, we developed and compared the efﬁciency of state-of-the-art machine learning models (e.g., XGBoost, Random Forest, and Support Vector Machine) in predicting the occurrence of HABs. In the proposed models the K. brevis abundance is used as the target, and 10 level-02 ocean color products extracted from daily archival MODIS satellite data are used as controlling factors. The adopted approach addresses two main shortcomings of earlier models: (1) the paucity of satellite data due to cloudy scenes and (2) the lag time between the period at which a variable reaches its highest correlation with the target and the time the bloom occurs. Eleven spatio-temporal models were generated, each from 3 consecutive day satellite datasets, with a forecasting span from 1 to 11 days. The 3-day models addressed the potential variations in lag time for some of the temporal variables. One or more of the generated 11 models could be used to predict HAB occurrences depending on availability of the cloud-free consecutive days. Findings indicate that XGBoost outperformed the other methods, and the forecasting models of 5–9 days achieved the best results. The most reliable model can forecast eight days ahead of time with balanced overall accuracy, Kappa coefﬁcient, F-Score, and AUC of 96%, 0.93, 0.97, and 0.98 respectively. The euphotic depth, sea surface temperature, and chlorophyll-a are always among the most signiﬁcant controlling factors. The proposed models could potentially be used to develop an “early warning system” for HABs in southwest Florida.


Introduction
Harmful algal blooms (HABs) in saline waters, often referred to as "red tide" events, have been reported in many areas around the world, including the Gulf of St. Lawrence in Canada, Tampa Bay in the Gulf of Mexico, the Bay of Bengal, the Arabian Sea, the Bay of Biscay in Spain, Paracas Bay in Peru, Lisbon Bay in Portugal, and the Persian Gulf [1][2][3][4][5][6][7][8]. In the USA, red tides have been reported in many locations, including the Gulf Coast of Florida, the Gulf of Maine, and Monterey Bay in California [9][10][11][12]. The reddish color of the ocean water is most often caused by the proliferation of a microscopic photosynthetic organism called Karenia brevis (formerly known as Gymnodinium breve and Ptychodiscus brevis) [13][14][15]. Over the last few decades, K. brevis has become the predominant HAB phytoplankton species among 5000 known species and one of the most harmful natural phenomena in many areas in the Gulf of Mexico [13,16,17] including our study area, Charlotte County in southwest Florida (Figure 1) [18]. HABs can significantly change the quality (i.e., color, odor, and taste) of bodies of water, adversely affect fresh and saltwater ecosystems, and produce neurotoxins called brevetoxins [19] that have harmful effects on human health, the fishing industry, marine mammals, seabirds, and local economies [20][21][22]. These algae-related adverse effects require local authorities to take costly measurements and/or remedies including closure of beaches and conducting costly filtration processes and decontamination activities [23,24]. Several environmental factors contribute to the growth and propagation of algae, such as nutrients introduced from agricultural activities, lighting condition (low irradiance), salinity, and water temperature [9,25]. K. brevis consumes both inorganic and organic nitrogen and phosphorus compounds [26]. Nitrogen-based fertilizers are the main source of nitrogen in the Gulf of Mexico [27,28], where the nutrients of these fertilizers are transported from the agricultural fields within the Mississippi-Atchafalaya River basin [29] into prone water bodies (e.g., bay areas) by surface runoff and infiltration of nutrient-rich waters and groundwater flow towards neighboring water bodies [30,31].
There has been a long standing desire, and a need for, forecasting and mapping HABs [24] given their adverse effects on human health [21,22] and on the biodiversity and habitats of aquatic ecosystems [16,17]. The majority of earlier attempts to detect, map, and forecast HABs can be lumped in two groups: ones that rely heavily on the utilization of satellite remotely acquired data and ones that do not [9,32]. The latter research activities entail the acquisition of in situ real-time field monitoring of relevant parameters, such as chlorophyll-a concentration, dissolved oxygen, and nutrients [33]. Real-time nucleic acid sequence-based amplification assays and simple test kits have been used to detect and quantify the red tide dinoflagellate K. brevis [32]. The HAB Program of Florida Fish and Wildlife Research Institute (FWC) is one such program that designs and employs light and electron microscopy and genetic tools to identify and/or quantify HABs [24]. Hydrometeorological variables (e.g., sea surface temperature [SST], wind speed, cloud amount, salinity, and rainfall) were used in statistical models (fuzzy reasoning and the ensemble method classifiers) to predict HABs occurrences [34,35]. HABs can significantly change the quality (i.e., color, odor, and taste) of bodies of water, adversely affect fresh and saltwater ecosystems, and produce neurotoxins called brevetoxins [19] that have harmful effects on human health, the fishing industry, marine mammals, seabirds, and local economies [20][21][22]. These algae-related adverse effects require local authorities to take costly measurements and/or remedies including closure of beaches and conducting costly filtration processes and decontamination activities [23,24]. Several environmental factors contribute to the growth and propagation of algae, such as nutrients introduced from agricultural activities, lighting condition (low irradiance), salinity, and water temperature [9,25]. K. brevis consumes both inorganic and organic nitrogen and phosphorus compounds [26]. Nitrogen-based fertilizers are the main source of nitrogen in the Gulf of Mexico [27,28], where the nutrients of these fertilizers are transported from the agricultural fields within the Mississippi-Atchafalaya River basin [29] into prone water bodies (e.g., bay areas) by surface runoff and infiltration of nutrient-rich waters and groundwater flow towards neighboring water bodies [30,31].
There has been a long standing desire, and a need for, forecasting and mapping HABs [24] given their adverse effects on human health [21,22] and on the biodiversity and habitats of aquatic ecosystems [16,17]. The majority of earlier attempts to detect, map, and forecast HABs can be lumped in two groups: ones that rely heavily on the utilization of satellite remotely acquired data and ones that do not [9,32]. The latter research activities entail the acquisition of in situ real-time field monitoring of relevant parameters, such as chlorophyll-a concentration, dissolved oxygen, and nutrients [33]. Real-time nucleic acid sequence-based amplification assays and simple test kits have been used to detect and quantify the red tide dinoflagellate K. brevis [32]. The HAB Program of Florida Fish and Wildlife Research Institute (FWC) is one such program that designs and employs light and electron microscopy and genetic tools to identify and/or quantify HABs [24]. Hydro-meteorological variables (e.g., sea surface temperature [SST], wind speed, cloud amount, salinity, and rainfall) were used in statistical models (fuzzy reasoning and the ensemble method classifiers) to predict HABs occurrences [34,35].
Additional approaches involved the construction of wind-driven models or threedimensional physical hydrodynamic models to forecast the dominant regional physical processes that result in water exchange events and bloom propagation [33,34]. Most of these models were designed for same-day mapping and modeling the onset of blooms. Although the above-mentioned field-based approaches have been shown to be successful in detecting HABs [24], their application in many parts of the world has been hindered by their spatiotemporal limitations, high cost, and labor-intensive operational procedures [36].
The footprint of many of the remote-sensing sensors cover large areas with high temporal resolution; thus, they can potentially capture the spatial and temporal variabilities of HABs, as evidenced by the extensive literature describing the detection, monitoring, and forecasting of HABs using remote sensing-based techniques and sensors [9]. Investigations utilizing moderate-resolution imaging spectroradiometers (MODIS-Aqua and MODIS-Terra), SeaWiFS, MERIS, Sentinel-2, and unmanned aerial vehicles have contributed the most to these studies [9,[37][38][39][40][41][42]. The more recent and advanced satellites (e.g., Sentinel-3, launched in February 2016) provide added valuable resources for ocean color products, yet their recent deployment and, hence, their short record of historical data compared to earlier operational satellites (e.g., MODIS: 1999-present) puts them on the waiting list for future machine learning-based forecasting projects.
Many of the earlier attempts for HAB detection used reflectance-based classification algorithms and targeted chlorophyll-a, a good proxy for phytoplankton biomass [43][44][45]. These include classifications based on chlorophyll-a concentration [46], band ratios (e.g., blue-green band ratios) [47], ocean color band difference algorithms such as fluorescence line height (FLH), and maximum chlorophyll index. Chlorophyll-a concentrations derived from Landsat-8 (OLI) images over inland lakes in China using machine learning techniques (XGBoost) were shown to be more reliable than outputs from band ratio algorithms [48]. A comprehensive review of all these remote sensing-based methods was compiled by [43]. The use of these simple and straightforward indices and measurements, although successful, often introduces uncertainties, including false positive detections [49].
One approach to reduce these false positives is to develop statistical models that use more of the available ocean color products [50]. Using remotely sensed data, a number of machine learning studies were conducted to detect, monitor, and forecast HABs. Applying artificial neural networks and multiple linear regression (MLR) techniques in Kuwait Bay, a hybrid method showed a correlation between a variety of spatial and temporal ocean color products and HAB propagation and growth in the bay [50]. In early machine learning (ML) studies, and using remote sensing data over Monterey Bay, random forest (RF) and support vector machines (SVMs) were applied to build a decision support system for predicting the distribution of algal blooms in the bay [51]. A machine learning-based spatio-temporal data mining approach using kernel-based SVM was applied to detect HAB events in the Gulf of Mexico [52]. In a red tide detection study, a deep learning method was applied to Landsat 8 Operational Land Imager data acquired over the southern coastal region of the Korean Peninsula [53]. Additionally, in a recent study, spatiotemporal SVM, RF, and deep learning long-and short-term memory methods were adopted to develop an HAB detection and forecasting system for the whole west coast of Florida [54]. These methods apply a state-of-the-art machine learning algorithm; however, most of them use only a limited number of variables (one to five) and do not consider as one of the main targets of their investigations the lag time between the onset of a bloom and the time it takes for a variable to have a maximum impact on bloom propagation.
MODIS-derived ocean color products, along with field data, were used to develop data-driven statistical models based on MLR expressions to identify factors controlling HAB propagation [55] and to forecast bloom occurrences up to three days in advance. These models assumed a unified lag time for the significant variables, an assumption that does not adequately portray the complex interactions between the controlling factors, leading to the propagation of the HABs [56][57][58]. Addressing this problem will lead to the development of more realistic modeling structures that can better account for the HAB growth patterns [59]. This could be accomplished by allowing each of the independent variables to have different lag times and the model to select the significant variables, each with its optimum lag time. In practice, the more satellite data and lag time choices we provide, the better and more comprehensive the model. There are advantages for selecting statistical models that portray the relative significance of, and the correlation between, the factors controlling the onset of HABs. For example, artificial neural networks and deep learning (DL) models are powerful functions for modeling real-world problems [60,61]. The growth of HABs was successfully predicted using historical data and ecological informatics and applying DL methods [62]. The DL methods were also used to predict algal growth in rivers [63][64][65][66], lakes [67,68], and coastal areas [69,70]. However, these methods function as black boxes, and the neuron connections, their weights, and different layers cannot be associated much with the concept of the physical problem at hand [71]. The paucity of data has always been one of the challenges facing researchers-a good model should work and be trained with limited datasets [71,72]. Different machine learning models (e.g., linear versus non-linear and tree-based versus non-tree-based models) need to be adopted to compare and contrast the results in terms of consistency and model performance. The proposed approach addresses the aforementioned shortcomings, and provides an adequate forecasting period (up to 9 days) because of its spatio-temporal features [9].
In this manuscript, we first demonstrate the enhanced predictive power of the statistical model when: (1) multiple day (>2 days) satellite data acquisitions are utilized instead of single and 2-day models, (2) the optimum forecasting period is identified and the variations in lag times for the independent variables are accommodated, and (3) multiple statistical models are tested and the optimum predictive model is selected. In light of our findings, we then identify and use the optimum predictive statistical model and data structure to develop multiple sequential forecasting models that utilize available cloud-free scenes that span a period ranging from 1 to 11 days ahead of the onset of the HAB bloom. In doing so, our approach addresses the paucity and temporal discontinuity of satellite ocean color products due to cloud coverage or missing values in areas close to shorelines due to masking and processing data (from levels 0 to 2) and random and systematic errors during data acquisition. Additionally, each of the individual models allows for variations in lag times of up to 2 days for the independent variables. In addition, we utilize statistical models that portray the relative significance of the factors controlling the onset of HABs.

Study Area
The study area incorporates the coastal areas (width:~10-60 km, length: 180 km) of Charlotte County in southwest Florida and nearby estuaries, where freshwater and seawater mix ( Figure 1). Like many other coastal zones within the Gulf of Mexico, there have been persistent HAB outbreaks that pose serious environmental challenges to the tourism and fishery industries in the county. Charlotte County has a relatively high density of septic systems in areas where the water table is often less than 2 feet below land surface. Shallow ground water and defective septic systems cause seepage of septic effluent into the water table. The introduction of nitrogen from septic systems into lakes, estuaries, and coastal areas is of concern given that nitrogen is one of the primary nutrients responsible for algal blooms occurrences. The majority of the samples are close to the shoreline and the sampling density decreases as we move away from the shoreline towards the ocean. Unfortunately, the study area lacks comprehensive, continuous, organized field-based monitoring systems.

Data and Methods
In this study, we apply an approach that addresses the paucity of continuous satellite temporal data and lag time variations among the controlling factors, provides predictions for bloom occurrences up to 9 days in advance, and provides insights into the factors controlling the onset of HABs, while predicting optimum solutions. Our approach takes advantage of remote sensing datasets, GIS technologies, and machine learning data-driven modeling. These data-driven models recognize hidden patterns in the collected data (dependent and independent variables) and provide insights into the behavior patterns of the observed ecosystems, namely the factors controlling the onset of HABs. In our case, the factors controlling, or correlating with, HAB bloom growth and propagation are represented by the independent variables and the HAB occurrences are the dependent or the response variable. The workflow ( Figure 2) involved four major steps: (1) downloading and processing of daily MODIS data; (2) developing statistical linear and non-linear models based on historical HAB occurrences and ocean color products derived from consecutive day MODIS data; (3) comparison of the performance of the models, and (4) selection of the optimum model and structure.
In this study, we apply an approach that addresses the paucity of continuous satellite temporal data and lag time variations among the controlling factors, provides predictions for bloom occurrences up to 9 days in advance, and provides insights into the factors controlling the onset of HABs, while predicting optimum solutions. Our approach takes advantage of remote sensing datasets, GIS technologies, and machine learning data-driven modeling. These data-driven models recognize hidden patterns in the collected data (dependent and independent variables) and provide insights into the behavior patterns of the observed ecosystems, namely the factors controlling the onset of HABs. In our case, the factors controlling, or correlating with, HAB bloom growth and propagation are represented by the independent variables and the HAB occurrences are the dependent or the response variable. The workflow ( Figure 2) involved four major steps: (1) downloading and processing of daily MODIS data; (2) developing statistical linear and non-linear models based on historical HAB occurrences and ocean color products derived from consecutive day MODIS data; (3) comparison of the performance of the models, and (4) selection of the optimum model and structure.

Data
Two types of data were used to construct our data-driven models for the period from late 2000 to March 2020. First were the independent variables-daily ocean color satellite products acquired by the National Aeronautics and Space Administration (NASA) MODIS Aqua satellite. Automatic selection of cloud-free (<10%) MODIS data (2905 scenes) was performed and used for this study. Only a small fraction (5%) of the omitted cloudy scenes was found to be cloud-free over the study area based on visual inspection of a subset of those scenes. Second was the dependent variable, daily K. brevis abundance (cells/L) observations from the National Oceanic and Atmospheric Administration (NOAA) and the Florida Fish and Wildlife Conservation Commission (FWC).
The independent variables were extracted from NASA's ocean color website (https://oceancolor.gsfc.nasa.gov/) for the daily acquired user-defined region of interest (ROI). Southwestern Florida was selected as ROI, and MODIS in Aqua mode as the source of data. The automatic data downloading was scheduled within the Linux environment. Following the download of Level 0 data, it was processed to Level 1, then to Level 2 using SeaDAS (NASA, Greenbelt, MD, USA, version 7.4) Ocean Color Science Software. Radio- Figure 2. Flowchart describing the adopted methodology.

Data
Two types of data were used to construct our data-driven models for the period from late 2000 to March 2020. First were the independent variables-daily ocean color satellite products acquired by the National Aeronautics and Space Administration (NASA) MODIS Aqua satellite. Automatic selection of cloud-free (<10%) MODIS data (2905 scenes) was performed and used for this study. Only a small fraction (5%) of the omitted cloudy scenes was found to be cloud-free over the study area based on visual inspection of a subset of those scenes. Second was the dependent variable, daily K. brevis abundance (cells/L) observations from the National Oceanic and Atmospheric Administration (NOAA) and the Florida Fish and Wildlife Conservation Commission (FWC).
The independent variables were extracted from NASA's ocean color website (https: //oceancolor.gsfc.nasa.gov/) for the daily acquired user-defined region of interest (ROI). Southwestern Florida was selected as ROI, and MODIS in Aqua mode as the source of data. The automatic data downloading was scheduled within the Linux environment. Following the download of Level 0 data, it was processed to Level 1, then to Level 2 using SeaDAS (NASA, Greenbelt, MD, USA, version 7.4) Ocean Color Science Software. Radiometric and geometric calibrations were performed to correct for differences in scene acquisition geometries (level 1 processing), and ocean color products were generated (level 2 processing).
The dependent variable (historical occurrences of K. brevis and their cell count) was compiled from two resources, namely from FWC and NOAA. The FWC and NOAA datasets contain daily observation of K. brevis, and both cover the period from 2000 to 2020.

Independent Variables
Daily ocean color satellite products were automatically downloaded and processed. These include euphotic depth (ED), Secchi disk depth, chlorophyll-a, chlorophyll-gsm, chlorophyll-giop, diffuse attenuation coefficient (Kd_490), SST, FLH, particulate backscat-Remote Sens. 2021, 13, 3863 6 of 22 tering coefficient at 547 nm (bbp_547_giop), and turbidity index. Because previous work has shown that one or more of these variables could affect, or correlate with, the onset of HABs, each of these potential controlling factors was included in the statistical analysis; the individual variables are described below with their potential contribution to HABs' growth and propagation.
• Euphotic depth (m) and Secchi disk depth The ED, represents the depth at which about 1 percent of the total incoming light on the ocean's surface can reach [73]. Beyond ED, light cannot penetrate, net photosynthesis and productivity decreases, and nutrients and algae diminish [74]. ED varies with change in season and latitude from only a few centimeters in highly turbid eutrophic waters to around 300 m in the open ocean. Low EDs can represent high nutrient content and provide desirable conditions for HAB growth and propagation [75,76]. The ED was calculated using the approach described in [77]. The Secchi disk depth has a similar concept; it is the depth at which a disk with alternating black and white quadrants disappears as it is lowered in the water column, and thus, it is a measure of the water transparency.

•
Chlorophyll-a (mg/m 3 ) Three common pigments (chlorophyll-a, -b, and -c) can be found in HABs, but the former (chlorophyll-a) was found to be the best proxy for measuring algal growth in aquatic environments [78,79]. Three different semi-analytical algorithms were developed to compute the chlorophyll-a concentration: chlorophyll-a OC3M (ocean chlorophyll threeband algorithm for MODIS [80]), chlorophyll-a GSM (Garver-Siegel-Maritorena [81]), and chlorophyll-a GIOP (Generalized Inherent Optical Property [82]). These three chlorophylla measurements products are highly correlated with HAB cell count, yet they are not redundant; often, one of these algorithms can best estimate the chlorophyll-a concentration in a particular optically complex estuarine environment [83]. In general, the increase in chlorophyll-a concentration has been found to have a strong correlation with the HAB distribution [66].
The diffuse attenuation coefficient of downwelling irradiance at 490 nm reflects the attenuation of the light in blue to green wavelength regions for turbid water and is one of the most important optical properties of ocean water [84]. In one study the Kd_490 coefficient was used as a proxy for the growth of phytoplankton in turbid coastal waters, where the light attenuation was shown to be controlled by the concentration of scattering particles, HABs being one of them [85]. In another study under normal and red tide outbreak conditions in the Persian Gulf, the MODIS Chlorophyll-a normalized line fluorescence height, and Kd_490 were compared; a high correlation was observed between chlorophyll-a and Kd_490 during red tides [86]. The Kd_490 was calculated using the technique described in [87].
Phytoplankton and HAB growth and productivity is directly correlated with SST. The HABs can thrive under specific habitat characteristics and temperature range. The temperature controls the survival of the HABs and the availability and solubility of nutrients that are vital for the growth of HABs as well [88,89]. The correlation between SST and algal bloom growth and its distributions has been successfully demonstrated in various settings worldwide [89][90][91][92].
• Fluorescence line height FLH provides a standard method for measuring radiance, leaving the coastal and ocean surface in the chlorophyll fluorescence emission band (676 nm) [44]. A strong positive correlation was reported between chlorophyll-a concentration and the FLH in ocean waters containing HABs [40]. FLH alone and together with backscattering coefficient have been • Particulate backscattering coefficient This factor represents the backscattering coefficient of water particles at 547 nm. Earlier studies have shown its utility in identifying HABs distribution, particularly the K. brevis in the Gulf of Mexico [96]. In two different studies at the West Florida shelf, the particulate backscattering coefficient at 551 nm, in conjunction with fluorescence (in the first study) and chlorophyll-a (in the second) was utilized to detect K. brevis [94,97] The backscatter coefficient of particles at 547 nm was calculated using an algorithm provided in [98].

• Turbidity index
The turbidity index is based on the reflectance in the green part of the spectrum. It provides a measure of the water clarity based on the amount of the scattered light caused by water-suspended particles [99]. When it is low, water is clearer, and more light can penetrate down into the water column, providing favorable living and growing conditions for HABs [100]. On the other hand, HAB growth increase turbidity, per se. Turbidity alone is not a direct indicator of HAB concentration, but it can be used in conjunction with other aforementioned factors. It has been successfully used to estimate the severity of HABs and to identify phytoplankton blooms [4,101]. The turbidity index was calculated using the method described in a previous study [102].

Target Variable
The number of K. brevis (cells/L)) in shallow (depth: 0.5 m) waters is considered to be the target dependent variable (response variable). A threshold of 10,000 cells/L was adopted for classification purposes, because at concentrations exceeding 10,000 cells/L, respiratory irritation and fish kills are more likely to occur (https://myfwc.com/research/ redtide/statewide/) and the chlorophyll-a concentration is high enough to enable the detection of HABs from satellite data [103]. Moreover, the adopted cell count groupings in this study are those used by the HABs Observing System (+ve: >10,000 cells/L; −ve: <10,000 cells/L) [12].

Data Preparation
In the proposed models, the number of K. brevis cells (cells/L) is used as the response variable. For the classification application a threshold of 10,000 cells/L was adopted to separate cell counts into two classes of positive and negative events. Ten level-02 ocean color products are used as controlling factors. Chromophoric dissolved organic matter index was manually removed from the list of level 02 products due to the discontinuous and patchy nature of this variable over the investigated period. In the generation of the models, the dataset was randomly split into train (70%) and test (30%). All the models were tested on roughly 300 positive and negative events that have not been seen by the models covering the observation time period from 2000 to 2020.

Machine Learning Modeling
We developed data-driven machine learning models to address the problem. The adopted state-of-the-art machine learning models are discussed in two categories: linear versus non-linear, and tree-based versus non-tree-based models. Shrinkage methods were adopted as an example of regularized linear models, SVM for non-tree-based models, and XGBoost and RF as examples for non-linear tree-based models. Due to data size, data distribution, and the complexity of patterns in data, we follow a common practice in which we utilize, compare, and contrast a set of statistical models described in the following sections.

Linear Models
We chose linear models because they have advantage in interpretability. By stacking up the same variables for different days (multicolinearity alert), we significantly increase the feature space (e.g., three consecutive days and 10 predictors for each day). We used the shrinkage method that applies a penalty term to the loss function embedded in the linear regression (LR) to avoid over-and underfitting. Shrinkage models shrink insignificant variables (coefficient estimates) into zero, which leaves us with the most significant variables to address the lag times [72].

Tree-Based Models (Non-Linear)
Since we are stacking up a few sets of the same variables in consecutive days, the correlation among the same variables is very high. This imposes unsolvable multicollinearity to linear models. Therefore, we resorted to nonlinear models, such as tree-based models, to address the nonlinearity content of the problem. These models provide variable importance plots and can handle limited training datasets, which is the case in our investigation. Trees can be non-robust with high variance, which is why we considered ensemble models such as extreme gradient boosting (XGB) and RF to improve the prediction accuracy and lower the variance [72,104].

•
Extreme gradient boosting XGB is a scalable learning algorithm designed for higher speed and performance. It uses a regularized model formalization to control overfitting.
In gradient boosting, the input predictors X (X 1 ... X n ) are utilized to predict the corresponding target values Y (Y 1 ... Y n ). In fact, we need to minimize the sum of the loss function (J) by improving the model F(X). Equations (1)-(5) are from [105,106] where L is a differentiable convex loss function to measure the difference between the predicted values F(x i ) and the real target values (y i ).
In applying the XGB, we went through the following iterations. We first calculated the negative gradients of J with respect to (x i ), which is − ∂j ∂F(xi) . Then, we fit a classification tree, h, to ∂j ∂F(xi) . The new updated (Xi) is (Xi) + γh, where γ is the step size to reach the estimated minimum of J.
The iteration continues to the point at which we achieve the minimum difference between prediction and observation. In XGB, the loss function is: where: where T is the number of tree leaves and ω is the weights of those leaves. The function Ω penalizes the model complexity. The optimal weight ω of leaf j was calculated using Equation (4): where g j = − ∂j ∂F(Xi) and h is the j th classification tree fitted to g j . The optimal value of the loss function was calculated using Equation (5) [106]: Remote Sens. 2021, 13, 3863 9 of 22 The additional regularization term was added to avoid overfitting [105,106]. To achieve the optimum structure for the XGB model, parameters such as number of boosting iterations, gamma, maximum depth, and learning rate (eta) were tuned. Gamma is a pseudo-regularization hyperparameter in gradient boosting (complexity control). The higher the gamma is, the higher the regularization, and the more conservative the algorithm will become. The eta specifies the participation of each tree and reduces overfitting. Maximum depth determines the maximum number of end nodes in each leaf of the trees. These hyperparameters were calculated based on grid search and cross validation in R. We found that the optimum hyper parameters for the XGB including the number of boosting iterations, gamma, maximum depth, and learning rate (eta) were 100, 0, 10, and 0.05, respectively.

•
Random forest In RF, we build hundreds of trees on bootstrapped training samples. However, each time we generate an individual tree and a split in a tree is considered, a random fresh selection of M predictors is chosen as split candidates from the full set of the P predictors to avoid the strongest predictor always being utilized in the process [107]. Typically, M is calculated using Equation (6): At each split a new sample of predictors is considered according to a user-defined number of predictors (M try ). Another user-specified hyperparameter is the number of trees (N tree ). Small values were avoided to enable making of the forest and to enhance variance-bias tradeoff [108,109]. For global optimum, two-third of the samples were used for training, and the remaining out-of-bag (OOB) were used to cross-validate the RF model. In fact, the OOB error was utilized to calculate the prediction error and to evaluate the variable importance measures [110]. The RF hyperparameters (M try and N tree ) that were used in this study to optimize the model performance were 6 and 1000, respectively.

• Support vector machines
To evaluate the reliability of results, a non-parametric (insensitive to the distribution of data) non-tree-based supervised learning method called SVM was adopted [111]. SVM proposes a nonparametric approach of finding linear discriminant functions. It tries to find a unique hyperplane between each pair of the classes in a multidimensional feature space [112].
The linear function general formula is g(x) = W T ·x + b, which is a hyperplane in higher dimensions and is represented in Figure 3. In Figure 3, x1 and x2 are two features, the green and white classes are separated by the line W T x + b = 0, and n is the normal vector of the hyperplane (n = || || ), where ‖w⃗‖ is the Euclidian distance between w⃗ and the origin. The main objective of SVM is to find a set of weights that specify two hyperplanes Equation (7) Given a set of data points {(xi, yi)}, i = 1, 2… n; where: In Figure 3, x 1 and x 2 are two features, the green and white classes are separated by the line W T x + b = 0, and n is the normal vector of the hyperplane (n = w ||w|| ), where w → is the Euclidian distance between w → and the origin. The main objective of SVM is to find a set of weights that specify two hyperplanes Equation (7) Given a set of data points {(x i , y i )}, i = 1, 2 . . . n; where: k = 1 after scale transformation on both w and b.
We have infinite possible discrimination functions. One way to find the optimal hyperplane is by maximizing the width of the margin (margin width: 2 ||w|| ) or minimizing 1 2 ||w|| 2 = 1 2 w t w such that: y i ( w t xi + b) ≥ 1: By solving this optimization equation for w and b, each of the y i data points are correctly classified. In fact, y i indicates the class value (transformed either to +1 or −1).
In the adopted SVM, a radial basis function kernel yielded a better performance and was applied to address nonlinearity and overfitting. Model optimization was performed using the tune function in the R software package on the SVM hyperparameters (gamma [γ] and cost [c]). The cost hyperparameter specifies the cost of a violation to the margin; at small cost values, margins will be wide and many support vectors will be available, and vice versa at high cost values. The model overfits data as the values of c or γ increase and underfits as their values decrease. The average number of support vectors, optimum γ, and optimum c were selected at 95, 0.5, and 0.1, respectively.

Assessment of Models
The performance of the models was evaluated using the test data with binary classes of low and high concentration of K. brevis and applying a confusion matrix.
Some of the important metrics of confusion matrix appropriate for imbalanced datasets are Cohen's kappa, balanced accuracy, and F-score. On top of that, receiver operating characteristic (ROC) curve and the area under the curve (AUC) were calculated to compare different classifiers. ROC is a graphical plot that represents false and true positive rates on the x and y axes, respectively. ROC indicates a model's diagnostic ability when the class discrimination threshold varies [107]. These criteria are calculated as follows [107]: where: Po = TP + TN n and Pe = 1 √ N ((TP + FN)(TP + FP) + (FP + TN) (FN + TN)). (14) In the above equations, N is the total number of cases, n points to the number of accurately categorized incidents or non-incidents, TP, TN, FP, and FN refer to true positive, true negative, false positive, and false negative, respectively.
There are two measures of variable importance in the RF models. The first is based on how much the accuracy decreases when we exclude the variable. The second measure is based on the decrease in Gini impurity when a variable is selected to split a node. For boosting-based models (XGBoost), learning is done in a serial way; when there are several correlated features (as in our case), boosting will tend to choose one and use it in several trees (if necessary), and the use of other correlated features will be limited. On the other hand, not each tree of an RF is built from the same features (there is a random selection of features to use for each tree). Therefore, RF has the most intuitive feature importance for our case. In addition, each time we run the forest-based classification, we get slightly different results due to both randomness introduced in the model to avoid overfitting and the random subsetting of validation data. Therefore, instead of a bar chart, we get a variable importance box-plot that shows the distribution of importance across many runs.

Model Structure Comparison and Selection of Optimum Model Structure
Using the XGB model, we compared the performance of 1-day (−7; −8; −9), 2 consecutive day (−7, −8; −8, −9), and 3 consecutive day (−7, −8, −9) models ( Table 1). Examination of Table 1 shows that the 3 consecutive day structure outperforms the 2 consecutive day models, which in turn outperform the 1-day models. A similar exercise was conducted using the SVM model, and again the 3 consecutive day structure was found to outperform the 2 consecutive day models, which in turn outperform the single day.
Although not shown, we observe a general enhancement in the performance of each of the remaining three models (XGBoost, RF, and SVM) with increasing number of consecutive days. Thus, there are added benefits for increasing the number of consecutive day entries to our models given the same number of variables. Unfortunately, the availability of cloud-free (<10%) MODIS data over the study area limits our ability to develop models that utilize more than three consecutive days. Table 2 shows that out of a total of 2905 scenes that were acquired over the study area and period, the single scenes constituted 36% (1039) of the cloud-free scenes, the 2 consecutive day scenes constituted 20% (562 scenes) of the cloud-free scenes, the 3 consecutive day scenes constituted 9% (260 scenes) of the cloud-free scenes, the 4 consecutive day scenes constituted <2% (57 scenes) of the cloud-free scenes, and each of the 5, 6, and 7 consecutive scenes constituted less than 1% of the cloud-free scenes. Table 2. Availability of cloud-free (<10%) MODIS data (2905 scenes) acquired over the study area and period (2000−2020). Given the paucity of consecutive MODIS data for periods exceeding three days and the lesser chances for finding field observations (dependent variable) in 4 consecutive day scenes for model training purposes, we chose to develop our models based primarily on the 3 consecutive day structure.

Days (2000-2020) < 10% Cloud Scenes Frequency
Eleven 3 consecutive day models were generated (Table 3, top). This structure and bundle of temporally overlapped models provides short-to mid-term HAB forecasting through a range of spatio-temporal models, and addresses, at least in part, the differences in optimum lag times between each of the individual independent variables and the onset of HABs. For example, the first model uses ocean color products acquired a day prior, two days prior, and three days prior to onset of an HAB occurrence; the last model uses data acquired 11, 12, and 13 days in advance. As described earlier, the 3 consecutive day models produce better results than the 2-day models. For comparison purposes, the structure of 11 2 consecutive day models is shown in Table 3 (bottom).

Comparison of the Performance of Statistical Models and Selection of the Optimum Model
A variety of machine learning algorithms were adopted based on the nature of data and the problem in hand. The Lasso regression analysis was first adopted, but all the variables were found to shrink to zero due to very high multicollinearity among the variable sets. Tree-based models (XGBoost, RF, and SVM) were then applied. The ROC curve (AUC), balanced accuracy, kappa, and F-score derived from confusion matrix were adopted to evaluate the performance of models on the test dataset. The comparison of forecasting models is displayed in Table 4 and Figure 4. The model structures are represented by numbers ranging from −1 to −13, representing the days in advance of an HAB occurrence. The best metrics among the three models (XGB, RF, and SVM) are boldfaced. For example, XGBoost achieved the highest performance among the models for eight (−8) days forecasting with all four metrics (accuracy: 96%; Kappa: 0.93; F-score: 0.97; and AUC: 0.98). Figure 4 displays the ROC plots for the train and test datasets for 8-day SVM, RF, and XGBoost models. The three model ROC curves and the area under them (AUC) indicates a slightly higher performance for the XGBoost compared to the other models (AUC: XGBoost, 0.98; RF, 0.96; SVM, 0.94). XGBoost was selected as the optimum model. Table 3. Temporal modeling structure for 2 and 3 consecutive day models.

Comparison of Lag Times and Selection of Optimum Lag Time
Inspection of Table 4 reveals that, in general, the forecasting models of ~5-9 days in advance (5-to 9-day models) achieved relatively more reliable and comparable results, with the 7-and 8-day forecasting models being the optimum models. For example, Figure  5 shows a good correspondence between the reported concentration of K. brevis in 170 random test samples within the study area with the predicted concentration for each of those samples from an 8-day RF model. The samples have been classified correctly with an accuracy of 95%.
XGBoost (the top subtable) also demonstrated a more uniform superiority performance in this interval, which is portrayed by boldfaced figures. The models out of this range (5-9 days) in general showed a relatively lower performance, especially on the

Comparison of Lag Times and Selection of Optimum Lag Time
Inspection of Table 4 reveals that, in general, the forecasting models of~5-9 days in advance (5-to 9-day models) achieved relatively more reliable and comparable results, with the 7-and 8-day forecasting models being the optimum models. For example, Figure 5 shows a good correspondence between the reported concentration of K. brevis in 170 random test samples within the study area with the predicted concentration for each of those

Identification of Controlling Factor Importance
An RF feature importance plot was selected for the depiction of the significant variables because it provides the most intuitive display of variable importance (refer to Section 3.2.3). Figure 6 shows the variable importance boxplot for the 8-day RF model. The x-axis displays 30 controlling factors, 10 factors for each of the 3 days (days 8, 9, and 10) in three sets; the y-axis is the scaled variable importance. Boxplots show the range of variations in variable importance in different model runs (n = 100). Each set of 10 variables is separated by a vertical blue line. Chlorophyll-a, SST, Secchi disc depth, and ED from the 8 th day are amongst the most significant variables. Although not shown, XGBoost showed generally similar, yet not identical results in feature importance. XGBoost (the top subtable) also demonstrated a more uniform superiority performance in this interval, which is portrayed by boldfaced figures. The models out of this range (5-9 days) in general showed a relatively lower performance, especially on the kappa metric, which is an indicator of model performance in comparison to a random guess (random classifier). For example, in general, the 1-, 2-, 3-, 4-, 10-, and 11-day models in all three ML methods showed low kappa values compared to the 5-, 6-, 7-, 8-, and 9-day models.

Identification of Controlling Factor Importance
An RF feature importance plot was selected for the depiction of the significant variables because it provides the most intuitive display of variable importance (refer to Section 3.2.3). Figure 6 shows the variable importance boxplot for the 8-day RF model. The x-axis displays 30 controlling factors, 10 factors for each of the 3 days (days 8, 9, and 10) in three sets; the y-axis is the scaled variable importance. Boxplots show the range of variations in variable importance in different model runs (n = 100). Each set of 10 variables is separated by a vertical blue line. Chlorophyll-a, SST, Secchi disc depth, and ED from the 8th day are amongst the most significant variables. Although not shown, XGBoost showed generally similar, yet not identical results in feature importance.

Discussion
This study was intended to provide guidelines for the development of comprehensive predictive HABs models using temporal remotely acquired data in ways that can address, at least in part, two of the main shortcomings of remote sensing-based HAB predictive models: (1) the paucity of satellite data due to cloudy scenes and other systematic and random missing data that prevent us from making reliable models and (2) the differences in lag time between the period at which the individual variables reach their highest correlation with the target and the time the bloom occurs.
Our findings suggest that these shortcomings could be addressed by using multiple sequential, consecutive day models, as opposed to 1-day models [55]. The larger the number of consecutive days, the better the results. In our case, the paucity of consecutive cloud-free data limited our analysis to 2 and 3 consecutive day models (11 models). The 3 consecutive day models predict the onset of HABs from 1 to 11 days in advance and accommodate differences in lag time of up to 2 days for the independent variables. Three consecutive day models increased the variability within the models and increased the overall performance of all models (SVM, XGBoost, and RF) in comparison with the 1-day or 2-day models. In absence of 3 consecutive day data, one can resort to the use of 2 consecutive day models. Comparisons of our 3-day XGB model outputs with previous 1-day model outputs [54] over the study area reveal enhanced overall performance for our 3 consecutive day models. The 1-day model achieved 65% accuracy for the one day in advance model, and 72.1% for the two-day model [54]. The overall performance of our 1-day

Discussion
This study was intended to provide guidelines for the development of comprehensive predictive HABs models using temporal remotely acquired data in ways that can address, at least in part, two of the main shortcomings of remote sensing-based HAB predictive models: (1) the paucity of satellite data due to cloudy scenes and other systematic and random missing data that prevent us from making reliable models and (2) the differences in lag time between the period at which the individual variables reach their highest correlation with the target and the time the bloom occurs.
Our findings suggest that these shortcomings could be addressed by using multiple sequential, consecutive day models, as opposed to 1-day models [55]. The larger the number of consecutive days, the better the results. In our case, the paucity of consecutive cloud-free data limited our analysis to 2 and 3 consecutive day models (11 models). The 3 consecutive day models predict the onset of HABs from 1 to 11 days in advance and accommodate differences in lag time of up to 2 days for the independent variables. Three consecutive day models increased the variability within the models and increased the overall performance of all models (SVM, XGBoost, and RF) in comparison with the 1-day or 2-day models. In absence of 3 consecutive day data, one can resort to the use of 2 consecutive day models. Comparisons of our 3-day XGB model outputs with previous 1-day model outputs [54] over the study area reveal enhanced overall performance for our 3 consecutive day models. The 1-day model achieved 65% accuracy for the one day in advance model, and 72.1% for the two-day model [54]. The overall performance of our 1-day and 2-day forecasting XGBoost models adjusted for imbalance yielded an accuracy of 73.0% and 73.9, respectively. In both studies, the SST, chlorophyll-a, KD_490, and euphotic depth, were among the most important controlling factors.
Following the general trend in almost all four metrics, the performance of the models was enhanced with the~5-9-day models, and the best results were those obtained from the 7-8-day models. For example, in the XGBoost analysis (Table 4), the kappa ranged from 0.5 to 0.6 for day 1, 2, 3, and 4 models, rose to up to 0.93 in the day-8 model, then decreased to 0.68 in the 11-day model. The mid-term forecasting of 5-9 days is not only more accurate, but is also more functional compared to short-term forecasting models. They provide enough time to execute appropriate warning and mitigation steps prior to the onset of HAB occurrences. One explanation for the above-mentioned findings is that better (−5 to −9) and optimum lag times (−7, −8) are met within these time frames. The VI boxplot shows that the ED and Secchi disc depth were found to be highly significant factors, probably due to the spatial variability of these factors throughout the study area and their correlation with the distribution and concentration of HABs. Chlorophyll-a and SST were found to be significant factors as well. Our findings are consistent with previous non-data-driven studies [57,58,113]. In the Arabian Sea, phytoplankton biomass was found to correlate with SST with a lag time of 2-8 days [58]; in Santa Monica, it was 5 days [114]. In addition, a 3-6 day lag time was reported between the introduction of phosphorus nutrients and the occurrence of HAB events [115].
The XGBoost model outperformed the SVM and RF models. It combines the advantages of RF and gradient boosting. The high performance of the XGBoost model can be attributed to the specific data patterns and size, data imbalance, and more robustness of the model to noisy data and outliers due to its loss function flexibility, which has a regularization term to reduce the complexity of the tree functions.
Our method has its limitations. The application of our methodology in an area will ultimately depend on the availability of consecutive cloud-free data. In arid parts of the world, that should not be a major problem, but in temperate areas, cloud-free data for the application of the optimum 3-day models (5-to 9-day models) might not be available, and in such cases, less accurate 3-day models (1-4, 10, and 11-day models) or even 2-day models will have to be used instead. The generated models are specific for the investigated area, and thus, similar models have to be tailored to individual areas. Moreover, the proposed approach is labor-intensive, since it requires the development of many models-in our case, some 20 models, including extensive data engineering, data blending, and data wrangling. Using archival field and satellite data (as a training dataset) for areas of interest, future work should concentrate on the development of automated systems that can construct tens to hundreds of models at various combinations of consecutive days, (2-day models, 3-day models, 4-day models, 5-day models, etc.), evaluate their performance using test data, rank the models based on their performance, and, depending on availability of cloud-free ocean color data, select and apply the model with the highest performance.

Conclusions
We developed, compared, and contrasted the efficiency of state-of-the-art data-driven machine learning models (XGBoost, RF, and SVM) in predicting the occurrence of HABs. The number of K. brevis cells in surface water samples collected during red tides over the past 20 years were used as a binary response to the environmental controlling factors (target variable) and 10 level-02 ocean color products extracted from daily archival MODIS satellite data were used as environmental controlling factors.
Two main shortcomings of earlier models were addressed: (1) the paucity of satellite data due to cloudy scenes and other systematic and random missing data and (2) the lag time between the period at which a variable reaches its highest correlation with the target and the time the bloom occurs. Eleven spatio-temporal models were generated, each from three consecutive days' satellite datasets, with a forecasting span of 1 to 11 days. One or more of the generated 11 models could be used to predict HAB occurrences with acceptable performance, depending on availability of the cloud-free consecutive days.
Findings indicate: (1) XGBoost outperformed the remaining methods, (2) the forecasting models of 5-9 days achieved the best and most reliable results, (3) the most reliable model can forecast eight days ahead of time, and (4) ED, SST, and chlorophyll-a are always among the most significant variables.
The findings from this study could serve as guidelines for the development of remote sensing-based early warning systems for HABs in southwest Florida with short-to midterm forecasting capabilities. As described above, the generated models are specific to the study area and their development is labor intensive. Thus, speedy and widescale applications of the developed concepts in areas outside of the study area requires development of fully automated algorithms that will accomplish the following functions: downloading of daily data acquisition for the desired study area from a big data platform, designing and maintaining a database management system for data blending and query-based data engineering, online ML modeling, and interactive model evaluation.
In southwest Florida, it was difficult to get cloud-free ocean color acquisition for more than three consecutive days. There are many other coastal areas around the world, especially in arid areas in which cloud free scenes are more available, where the developed methodologies could be readily applied. The development of automated systems that can construct many models at various combinations of consecutive days could facilitate the application of the advocated methods over areas where cloud-free data is limited.
Additional approaches to address the paucity of cloud-free consecutive day data should be explored and their performance evaluated. For example, additional statistical models that rely on non-consecutive day data could be generated; alternatively, the values of the missing days could be estimated using forward window averages or data imputation prior to feeding the data into the ML algorithms.

Conflicts of Interest:
The authors declare no conflict of interest.