Evaluation of Machine Learning Predictions of a Highly Resolved Time Series of Chlorophyll-a Concentration

: Pelagic chlorophyll-a concentrations are key for evaluation of the environmental status and productivity of marine systems, and data can be provided by in situ measurements, remote sensing and modelling. However, modelling chlorophyll-a is not trivial due to its nonlinear dynamics and complexity. In this study, chlorophyll-a concentrations for the Helgoland Roads time series were modeled using a number of measured water and environmental parameters. We chose three common machine learning algorithms from the literature: the support vector machine regressor, neural networks multi-layer perceptron regressor and random forest regressor. Results showed that the support vector machine regressor slightly outperformed other models. The evaluation with a test dataset and veriﬁcation with an independent validation dataset for chlorophyll-a concentrations showed a good generalization capacity, evaluated by the root mean squared errors of less than 1 µ g L − 1 . Feature selection and engineering are important and improved the models signiﬁcantly, as measured in performance, improving the adjusted R 2 by a minimum of 48%. We tested SARIMA in comparison and found that the univariate nature of SARIMA does not allow for better results than the machine learning models. Additionally, the computer processing time needed was much higher (prohibitive) for SARIMA.


Introduction
Pelagic chlorophyll-a concentrations (chl-a) are a common indicator of primary production and key to evaluation of the health and productivity of marine and freshwater systems [1,2].It is therefore of crucial importance to accurately measure/predict chlorophyll from proxy parameters in such systems [3].Accelerated global warming is exacerbating climate change and unsettling ecosystems' processes, while the impacts of this are directly affecting marine primary production and triggering an upwards transfer of effects that reach humans.Thus, the importance of modelling chlorophyll is emphasized in environments undergoing change resulting from global warming [4].
Prediction of chlorophyll-a time series data is a challenge due to their complexity and nonlinearity, and indeed, conventional approaches show limitations with prediction of unobserved data [5,6].To date, all conventional approaches, including factors based on single measurements, are limited with regard to prediction accuracy of chlorophylla concentrations [7].A few previous studies have tried to implement various machine learning techniques to predict chlorophyll concentrations, mainly in fresh water systems, with a few in marine regions [8][9][10][11].
Machine learning (ML) techniques constitute a set of tools belonging to the fields of computer science and artificial intelligence.The versatility of these techniques allow the successful application in many fields of science and to a great variety of problems.The focus is often placed on tackling pattern recognition problems and on the construction of predictive models to make data-driven decisions [12].According to [13], the general benefits of ML algorithms for time series prediction over classical methods include the ability of supporting noisy features, noise and complexity in the relationships between variables and in the handling of irrelevant features.
State-of-the-art ML algorithms for time series regression include random forest regressor (RF), support vector machine regressor (SVR) and neural networks multi-layer perceptron regressor (MLP).All of these have been used to some degree in the literature for the prediction of chlorophyll-a concentrations in aquatic systems, and have achieved significantly accurate results in both error and goodness of fit metrics [3,11,14].These are studies based in chl-a time series either with short length and daily frequency or long-term, low frequency sampling time series, using different ML methods to best predict chl-a behavior.The features applied as predictors in these studies are limited to just a few, but it must be considered that the dynamics in lacustrine systems are distinct from those presented in marine systems.Here we extend these ideas and test these methods on a good quality long-term time series, the Helgoland Roads time series, evaluating the prediction using unseen data.With the purpose to compare ML methods with a classical statistical regression model, we included an improved autoregressive integrated moving average (ARIMA) model, called seasonal ARIMA (SARIMA), which includes seasonal parameters to support data with a seasonal component [15].
The objective of this work is to evaluate the accuracy of machine learning algorithms for the estimation of chlorophyll-a concentration, using in situ high resolution long-term datasets.We (1) assess three ML algorithms-random forest, support vector regressor and neural networks multi-layer perceptron regressor-for chlorophyll-a concentration estimation; (2) examine the importance of feature selection and engineering in the different models; (3) compare with, and evaluate, a univariate SARIMA classical regression model.

Materials and Methods
All the ML models used in this study were implemented applying the "Scikit-Learn package", which is an open-source Python module project that integrates a wide range of common ML algorithms [16,17], while the SARIMA model was implemented with the statsmodels package [18].The preprocessing was also implemented in the Python environment, using the well-known packages Pandas, NumPy and SciPy [19].

Datasets
The Helgoland Roads is a long-term pelagic monitoring site (54 • 11.3 N, 7 • 54.0 E) about 60 km off the German coast and represents a marine transition zone between coastal waters and open sea (Figure 1) [20].Since 1962, surface water samples have been collected on working days, taken with a bucket lowered from a research vessel.Secchi depth and water temperature (SST) are measured in situ and the water samples analyzed in the laboratory for nutrients (nitrate, phosphate and silicate) and salinity.Chlorophyll-a concentration measurements were started at the end of 2001, acquired in laboratory by FluoroProbe (bbe Moldaenke GmbH, Kiel, Germany) [21] and, since, 2004 have been complemented with high-performance liquid chromatography analysis (HPLC) [22,23].
Sunshine duration, wind speed and direction [24][25][26], North Atlantic Oscillation (NAO) daily index (NOAA ESRL Physical Sciences Laboratory, Boulder, CO, USA, 2020) and zooplankton abundance [27], were added to the Helgoland Roads parameter matrix for this work (Table 1).As indicated in the literature [28][29][30], and also from working experience, the included parameters are environmental variables which determine algal verdure and, thus, modulate chlorophyll-a concentrations in marine systems.Sunshine duration, wind speed and direction [24][25][26], North Atlantic Oscillation (NAO) daily index (NOAA ESRL Physical Sciences Laboratory, Boulder, Colorado, USA, 2020) and zooplankton abundance [27], were added to the Helgoland Roads parameter matrix for this work (Table 1).As indicated in the literature [28][29][30], and also from working experience, the included parameters are environmental variables which determine algal verdure and, thus, modulate chlorophyll-a concentrations in marine systems.

Data Preprocessing
The raw data of Helgoland Roads are characterized by long-term measurements on work-daily frequency, with missing values during weekends and extreme bad weather days.When merged with date of other features such as zooplankton abundance, it ends with approximately 40% of missing data in the time series.To fill the missing data and creating a regular sampled daily time series, a number of imputation methods were tested in sunlight duration, a feature added to the Helgoland Roads from an external source, with no missing values.After creating a synthetic missing values dataset with sunlight duration, we calculated root mean square error (RMSE) and coefficient of determination (R 2 ) between the original and interpolated data.Minimum changes in frequency distribution between missing data and interpolated variables, lowest RMSE and highest R 2 , were

Data Preprocessing
The raw data of Helgoland Roads are characterized by long-term measurements on work-daily frequency, with missing values during weekends and extreme bad weather days.When merged with date of other features such as zooplankton abundance, it ends with approximately 40% of missing data in the time series.To fill the missing data and creating a regular sampled daily time series, a number of imputation methods were tested in sunlight duration, a feature added to the Helgoland Roads from an external source, with no missing values.After creating a synthetic missing values dataset with sunlight duration, we calculated root mean square error (RMSE) and coefficient of determination (R 2 ) between the original and interpolated data.Minimum changes in frequency distribution between missing data and interpolated variables, lowest RMSE and highest R 2 , were the basis for the decision to use a linear interpolation, supported by [30].After the interpolation, we have daily datasets of parameters in Table 1 comprising approximately 13 years, from 2 November 2001 to 22 April 2015, and presented in Supplementary Materials, Figure S1.
In this study, to validate the performance of the ML models, the dataset was split in 80% (n = 3940) for model training, and 20% (n = 980) for model testing, so we could investigate the model generalization ability [31].To eliminate the dimensional differences of the data and improve the prediction ability of the models, we used the StandardScaler method from the Scikit-Learn package, which standardizes features by removing the mean and scaling to unit variance.In this study, to validate the performance of the ML models, the dataset was split in 80% (n = 3940) for model training, and 20% (n = 980) for model testing, so we could investigate the model generalization ability [31].To eliminate the dimensional differences of the data and improve the prediction ability of the models, we used the StandardScaler method from the Scikit-Learn package, which standardizes features by removing the mean and scaling to unit variance.
The training dataset, the sample of data used to fit the model, dates from 2 November 2001 to 15 August 2012 (~11 years), while the test set is from 16 August 2012 to 22 April 2015 (~2.5 years) and it is used for model evaluation (Figure 2).For independent validation, we used a linear interpolated time series of HPLC estimated chlorophyll data (05 May 2015 to 27 November 2018, n = 348).

Feature Engineering and Selection
The Pearson correlation coefficients were calculated to investigate linear relationships between chlorophyll-a concentration and the other variables (Table 2).All correlation coefficients were lower than 0.5, indicating no strong linear correlation between chlorophyll and any other variable.

Feature Engineering and Selection
The Pearson correlation coefficients were calculated to investigate linear relationships between chlorophyll-a concentration and the other variables (Table 2).All correlation coefficients were lower than 0.5, indicating no strong linear correlation between chlorophyll and any other variable.Prediction is a major task of time series data mining, which uses known historical values to estimate future values, and feature selection and engineering is essential and crucial for accurate predictions [32].To seek improvement, 15 days lagged predictors were generated, totalizing 211 features [33].The choice of lags was based in a two-week period where all the predictors supposedly influence chlorophyll-a concentration, including chl-a past values, i.e., the lagged target values were used as predictors (t − 1, . . ., t − n; with t as the current time and n = 15).As there are significant seasonal differences, e.g., summer and winter nutrients uptake, the definition of two weeks seemed reasonable for this work to input information, considering that the machine learning algorithms are data-driven and they are not mechanistic models [34].Additionally, date features were generated, namely, "year" and "day of year" from 1 to 365 or 366.The cyclic variables "day of year" and "wind direction" were transformed with sin [2π (day of year)/(number of days in year)] (1) cos [2π (day of year)/(number of days in year)] ( 2) to ensure that the last day of a year was understood to be in sequence with the first day of the next year and 0 • degree in direction was equal to 360 • [35].
A large number of features in the dataset drastically affects both the training time as well as the accuracy of machine learning models.One means to limit model complexity from multiple variables is to reduce the model by selectively eliminating predictors.Feature selection procedure was conducted applying a combination of Recursive Feature Elimination.We used Scikit-Learn module Recursive Feature Elimination with cross-validation (Scikit-Learn feature.selectionRFECV module) and Ridge estimator, to estimate the best number of features balanced with accuracy (Figure 3).After the best number of features were defined with the Ridge cross-validation method, we applied Recursive Feature Elimination (Scikit-Learn feature.selectionRFE module) with SVR linear estimator, this way selecting the 17 best parameters to model chl-a in a robust manner (Table 3) [36].
crucial for accurate predictions [32].To seek improvement, 15 days lagged predictors were generated, totalizing 211 features [33].The choice of lags was based in a two-week period where all the predictors supposedly influence chlorophyll-a concentration, including chla past values, i.e., the lagged target values were used as predictors (t − 1, …, t − n; with t as the current time and n = 15).As there are significant seasonal differences, e.g., summer and winter nutrients uptake, the definition of two weeks seemed reasonable for this work to input information, considering that the machine learning algorithms are data-driven and they are not mechanistic models [34].Additionally, date features were generated, namely, "year" and "day of year" from 1 to 365 or 366.The cyclic variables "day of year" and "wind direction" were transformed with sin [2π (day of year)/(number of days in year)] (1) cos [2π (day of year)/(number of days in year)] ( 2) cos [2π (wind direction (°))/(360)] (4) to ensure that the last day of a year was understood to be in sequence with the first day of the next year and 0° degree in direction was equal to 360° [35].
A large number of features in the dataset drastically affects both the training time as well as the accuracy of machine learning models.One means to limit model complexity from multiple variables is to reduce the model by selectively eliminating predictors.Feature selection procedure was conducted applying a combination of Recursive Feature Elimination.We used Scikit-Learn module Recursive Feature Elimination with cross-validation (Scikit-Learn feature.selectionRFECV module) and Ridge estimator, to estimate the best number of features balanced with accuracy (Figure 3).After the best number of features were defined with the Ridge cross-validation method, we applied Recursive Feature Elimination (Scikit-Learn feature.selectionRFE module) with SVR linear estimator, this way selecting the 17 best parameters to model chl-a in a robust manner (Table 3) [36].

Model Selection and Hyperparameter Tuning
The algorithms evaluated in this study are random forest regressor (RF) [37], support vector machine regressor (SVR) [38] and multi-layer perceptron regressor neutral network (MLP) [39,40].These were chosen to be widely used and to present available information that allows the easy application in any level of knowledge concerning ML.Compared with deep learning approaches, traditional machine learning does not need large amounts of data to train and the computer processing can be performed in low-end machines without a GPU (Graphics Processing Unit) [41].
SVR is a kernel-based nonlinear regression method.It transforms the original input data space into a high-dimensional input space (hyperplanes) and performs linear regression in the high-dimensional space by defining a maximum margin separator, which minimizes expected generalization error instead of the prediction error in the training dataset.The kernel functions, which take as input the dot products of pairs of input points, allows the SVR to map the inputs efficiently compared to calculating the corresponding points of each input in the high-dimensional space.Basically, SVR finds hyperplanes that minimize the errors and maximize the margins of continuous data [6].
RF is a machine learning technique that utilizes an ensemble of decision trees for regression tasks.It randomly takes subsets of the data and input variables, and the results of all trees are averaged to achieve a better result than individual trees.The use of random samples of the training data for multiple decision trees reduces overfitting compared to using the entire training set with a single decision tree [42].
MLP is an artificial neural network and it consists of connected nodes, resembling the neurons in a biological brain.It consists of at least three layers of nodes: the input layer, hidden layer and output layer.Excluding the input layer nodes, each node receives inputs from the other nodes, and the outputs are calculated using a nonlinear activation function.The learning process for MLP involves continually adjusting weights in the network to minimize the error rate using backpropagation.Backpropagation computes the gradient of the loss function with respect to the weights and updates the weights in the network using methods such as stochastic gradient descent [42].
Depending upon the study cases, different ML algorithms usually require some adjustments.These are often crucial for the development of a successful application.Each ML algorithm has parameters, so-called hyperparameters, which define the setup of the machine to modelling the target function.For each model, a search range of hyperparameters was tested.In cases where a value was selected at the edge of the search range, a new cross-validation was conducted including more values.
All hyperparameter tuning of the models (Table 4) is based on GridSearchCV in the Scikit-Learn package, which can evaluate all possible given combinations of hyperparameter values using 10-fold cross-validation.This procedure determines the best combination of hyperparameters of the model that gives the best accuracy, in terms of coefficient of determination (R 2 ).
Cross-validation is a model validation technique for obtaining reliable and stable models.The use of multiple models in the evaluation removes possible biases of some models with some data sets.We used the training dataset to search for the best parameters, and reported the prediction performances on the test dataset using these parameters [43].The mentioned grid search was performed independently for each model on the training subset.R 2 , adjusted coefficient of determination (adj R 2 ) and RMSE were the metrics used in this work to evaluate the predictions.The use of adj R 2 in multiple regression is important because it increases only when new independent variables that increase the explanatory power of the regression equation are added; this makes it a useful measure of how well a multiple regression equation fits the sample data.A linear base model, available in Scikit-Learn, was used to observe the improvements using the more sophisticated algorithms.

SARIMA Model
For the SARIMA model, the univariate chl-a data was used, while maintaining the partitions in the training and test dataset.To test stationarity, the Augmented Dickey-Fuller test (ADF) was applied indicating significant stationarity (p < 0.05) in the train and test datasets.To fill the model (p, d, q) × (P, D, Q) 365 , where 365 represents the seasonality, the best autoregressive (p, P) and moving average (q, Q) parameters were selected using an iterative method in the train dataset.The parameters ranged from 0 to 4 in the nonseasonal parameters (p, q) and 0 to 2 in the seasonal parameters (P, Q), selecting the combination with lowest Akaike information criterion (AIC).The difference order parameters d and D were 0, due to the stationarity results of the ADF test.The best parameters selected using the training dataset were (4, 0, 1) × (2, 0, 1) 365 , and this SARIMA model was used to fit the test dataset.

Results
For this study, the best R 2 , adj R 2 and RMSE achieved for predicting chlorophyll-a using support vector machine regressor, random forest regressor, and neural network multilayer perceptron regressor are presented in Table 5.In a combination of hyperparameters tuning and feature selection, the models showed improvement compared with the default models (no feature selection, no tuning) for the test datasets.Comparing the algorithms, SVR reached the best R 2 (0.78) and RMSE (1.113 µg L −1 ), however, these were only slightly better results (MLP = 0.76; 1.144 µg L −1 and RF = 0.75; 1.189 µg L −1 ).The algorithms presented good performances for the subsets of training dataset during the cross-validation step (Figure 4).In addition, the predicted values were close to the observed data (Figure 5).All the ML algorithms were better than the linear base model.rameters tuning and feature selection, the models showed improvement compared with the default models (no feature selection, no tuning) for the test datasets.Comparing the algorithms, SVR reached the best R 2 (0.78) and RMSE (1.113 µg L −1 ), however, these were only slightly better results (MLP = 0.76; 1.144 µg L −1 and RF = 0.75; 1.189 µg L −1 ).The algorithms presented good performances for the subsets of training dataset during the crossvalidation step (Figure 4).In addition, the predicted values were close to the observed data (Figure 5).All the ML algorithms were better than the linear base model.The algorithms gave a good performance for the training dataset and allowed a good generalization for the test dataset, as it can be seen from how close the predicted values are from those observed in Figure 5.Using all of the 211 features and the default hyperparameters, the results in the test data were not as good as those from the optimized models (Table 5), mainly due to overfitting, when the models are more complex than necessary and the fitting in the training dataset is affected by noise [44].
Appl.Sci.2021, 11, x FOR PEER REVIEW 9 of 14 The algorithms gave a good performance for the training dataset and allowed a good generalization for the test dataset, as it can be seen from how close the predicted values are from those observed in Figure 5.Using all of the 211 features and the default hyperparameters, the results in the test data were not as good as those from the optimized models (Table 5), mainly due to overfitting, when the models are more complex than necessary and the fitting in the training dataset is affected by noise [44].Considering the features used as inputs in each of the algorithms, the Recursive Feature Elimination was implemented by combining Ridge and SVR linear estimators and selecting a maximum number of 17 predictors.This generated the following result: ('SD, 'SST', 'Salinity', 'SD_-1', 'SST_-1', 'SST_-2', 'SST_-9', 'SST_-12', 'SST_-13', 'SST_-14', 'SST_-15', 'Salinity_-1', 'Chl_-1', 'Chl_-4', 'Chl_-5', 'Chl_-7', 'Chl_-8'), with the negative numbers in the codes (Table 2) representing the applied lag in days.The adj R 2 results, which are sensitive to the number of used predictors, showed improvement from 0.02 to 0.76 for MLP, while for SVR the result improved from 0.63 to 0.77 and from 0.15 to 0.74 for RF in the test dataset.
For the independent validation, a chl-a dataset acquired by HPLC, the predictions had better RMSE and R 2 than the test datasets (Figure 6).Again, the higher values had limitations in prediction, but the lower variance compared with the training and testing datasets allowed for better evaluation indicators, with RMSE for all algorithms in the order of 0.3 µg L −1 and R 2 reaching approximately 0.90.For the independent validation, a chl-a dataset acquired by HPLC, the predictions had better RMSE and R 2 than the test datasets (Figure 6).Again, the higher values had limitations in prediction, but the lower variance compared with the training and testing datasets allowed for better evaluation indicators, with RMSE for all algorithms in the order of 0.3 µg L −1 and R 2 reaching approximately 0.90.The iterative SARIMA parameters selection uses much more computer processing time compared with the GridSearchCV method in machine learning.The latter is a scale of seconds to minutes while the former hours to days.It took around two weeks to select the best p, q, P and Q parameters in the daily data considering a yearly seasonality.Fitting The iterative SARIMA parameters selection uses much more computer processing time compared with the GridSearchCV method in machine learning.The latter is a scale of seconds to minutes while the former hours to days.It took around two weeks to select the best p, q, P and Q parameters in the daily data considering a yearly seasonality.Fitting the test dataset with the SARIMA model gave the worst results when compared with the ML models (Figure 7).The iterative SARIMA parameters selection uses much more computer processing time compared with the GridSearchCV method in machine learning.The latter is a scale of seconds to minutes while the former hours to days.It took around two weeks to select the best p, q, P and Q parameters in the daily data considering a yearly seasonality.Fitting the test dataset with the SARIMA model gave the worst results when compared with the ML models (Figure 7).

Discussion
Machine learning analysis was conducted on the Helgoland Roads time series to develop the best fit of chlorophyll-a concentrations over time using different parameters and their lagged correlates.For the three algorithms implemented, the model results were virtually equal in the evaluation metrics, presenting similar results in prediction, with slightly better values for the model SVR.For the time predictions, each of the three models' performances are acceptable with high R 2 values greater than 0.70 and RMSE lower than 1.5 µg L −1 , ~40% smaller than the chlorophyll-a concentration standard deviation of 2.9 µg L −1 .However, all of the algorithms were unable to predict extreme values (Figure 8).It was expected that a certain degree of decrease in accuracy would be incurred because

Discussion
Machine learning analysis was conducted on the Helgoland Roads time series to develop the best fit of chlorophyll-a concentrations over time using different parameters and their lagged correlates.For the three algorithms implemented, the model results were virtually equal in the evaluation metrics, presenting similar results in prediction, with slightly better values for the model SVR.For the time predictions, each of the three models' performances are acceptable with high R 2 values greater than 0.70 and RMSE lower than 1.5 µg L −1 , ~40% smaller than the chlorophyll-a concentration standard deviation of 2.9 µg L −1 .However, all of the algorithms were unable to predict extreme values (Figure 8).It was expected that a certain degree of decrease in accuracy would be incurred because of the difficulty in capturing and reproducing these extreme peaks [45].One hypothesis that would explain the underestimation of extreme values is the absence of predictive features, e.g., hydrodynamics can result in the transport of chlorophyll from other areas as an input event, even though salinity and wind parameters are reliable indicatives for current and wave dynamics in the German Bight [46].As these events do not present as a temporal pattern, the ML models do not recognize the influence on the target. of the difficulty in capturing and reproducing these extreme peaks [45].One hypothesis that would explain the underestimation of extreme values is the absence of predictive features, e.g., hydrodynamics can result in the transport of chlorophyll from other areas as an input event, even though salinity and wind parameters are reliable indicatives for current and wave dynamics in the German Bight [46].As these events do not present as a temporal pattern, the ML models do not recognize the influence on the target.Because each algorithm is based on different algebraic assumptions and procedures, they can result in different predictions.Between SVR and MLP, [14] points to differences in the nonlinear equalization performance and the structural risk minimization principle of SVR being more effective than the empirical risk minimization principle of neural networks in terms of minimizing error.According to [47], in MLP the method for determin- Because each algorithm is based on different algebraic assumptions and procedures, they can result in different predictions.Between SVR and MLP, [14] points to differences in the nonlinear equalization performance and the structural risk minimization principle of SVR being more effective than the empirical risk minimization principle of neural networks in terms of minimizing error.According to [47], in MLP the method for determining global solutions is difficult to converge because of its inherent algorithm design and model parameters are more complex than SVR, whereas the SVR has ready access to global optimal solutions, obtained by solving a linearly constrained quadratic programming problem [14].Between SVR and RF, as we saw, the linear base model gave good results.There is the possibility of a linear dependency that is better captured by SVR, probably a result from the linear interpolation in the preprocessing step of this study.
The feature selection and tuning of hyperparameters was extremely important and improved the results substantially.This was noticeable in the adj R 2 results for default and optimized models.Analyzing the 17 features used in SVR and described in the results section, the algorithm considered SST, lagged SST, lagged chlorophyll, salinity and Secchi depth to reach the best results presented in this work.It is important to point that ML is a data-driven approach, but it is possible to make inferences about the selected features.The number of selected features was a response of balancing bias and variance in the learning algorithms [48].For this study, we noticed the choice of SST as an important feature, probably representing the seasonal patterns in the chlorophyll target.
Better R 2 , adj R 2 and RMSE results in the independent validation dataset are possibly due to less variability and absence of extreme values, and shows the good generalization that the ML models are capable of.All of the good results, for both the test and independent validation data, show the better prediction power of the three ML algorithms evaluated in this study.Comparing with the classical SARIMA model, the univariate and linear background did not achieve the results needed for it to outperform the ML models.Compared with the ML literature, studies such as [3] and [11] achieved results of R 2 ranging from 0.50 to 0.80, analyzing shorter time series of chl-a in lakes.The authors of [49] predicted variations of chlorophyll-a in different sites of the North Sea using generalized additive models (GAM) and the R 2 results ranged from 0.15 to 0.63.In [28], using GAM to predict chl-a in a spatial approach for the North Atlantic, got the best result for R 2 at 0.83.All of these values show how variable different methods' performances in predicting chlorophyll can be, not necessarily meaning one method is better than the other, but more adaptive.ML models proved their generalization capacity and high accuracy.

Conclusions
In this work, we evaluated three machine learning algorithms in a regression task.Support vector regressor presented a slightly better performance, with the advantage that it used less computational time, and generated chlorophyll concentration predictions with 0.78 correlation to the observed data, in comparison to 0.76 and 0.75 for MLP and RF, respectively.Moreover, the root mean square error was approximately 1.1 µg L −1 for the test dataset and less than one for the independent validation data, which is approximately 38% percent smaller than the standard deviation of 2.9 µg L −1 .This study demonstrates the ability of machine learning models to use environmental in situ time series to predict the chlorophyll concentration with significant accuracy (R 2 ), higher than 70%, and the importance of tuning hyperparameters and defining the best predictors (feature selection).Most chlorophyll-a prediction studies are conducted in fresh water environments or using satellite data and limited time series, so this work can be considered a step toward the use of machine learning algorithms in marine areas based on long-term time series.Being aware of the limitations presented in this study, in future works it would be interesting to work with irregular sampled time series, improve the method for feature selection, ensemble results of different ML and classical statistical models, and evaluate the forecasting power of these models in the short and long term.Besides, the use of deep learning approaches

Figure 1 .
Figure 1.Helgoland Roads monitoring site position (black triangle) in the German Bight, between the Helgoland (H) and Dune (D) islands.

Figure 1 .
Figure 1.Helgoland Roads monitoring site position (black triangle) in the German Bight, between the Helgoland (H) and Dune (D) islands.
The training dataset, the sample of data used to fit the model, dates from 2 November 2001 to 15 August 2012 (~11 years), while the test set is from 16 August 2012 to 22 April 2015 (~2.5 years) and it is used for model evaluation (Figure2).For independent validation, we used a linear interpolated time series of HPLC estimated chlorophyll data (5 May 2015 to 27 November 2018, n = 348).
Figure S1.In this study, to validate the performance of the ML models, the dataset was split in 80% (n = 3940) for model training, and 20% (n = 980) for model testing, so we could investigate the model generalization ability[31].To eliminate the dimensional differences of the data and improve the prediction ability of the models, we used the StandardScaler method from the Scikit-Learn package, which standardizes features by removing the mean and scaling to unit variance.The training dataset, the sample of data used to fit the model, dates from 2 November 2001 to 15 August 2012 (~11 years), while the test set is from 16 August 2012 to 22 April 2015 (~2.5 years) and it is used for model evaluation (Figure2).For independent validation, we used a linear interpolated time series of HPLC estimated chlorophyll data (05 May 2015 to 27 November 2018, n = 348).

Figure 2 .
Figure 2. The train and test partition in chlorophyll-a concentration target (black solid and gray solid lines, respectively), and the HPLC chl-a validation dataset (black dashed).After the split, the testing dataset will remain untouched, to guarantee no leakage of information to the training step.The validation dataset is the independent validation.

Figure 2 .
Figure 2. The train and test partition in chlorophyll-a concentration target (black solid and gray solid lines, respectively), and the HPLC chl-a validation dataset (black dashed).After the split, the testing dataset will remain untouched, to guarantee no leakage of information to the training step.The validation dataset is the independent validation.

Figure 3 .
Figure 3. Result of RFECV with Ridge estimator.The black dot represents the maximum value of 17 selected features (predictors) to reach the highest explained variance.After the maximum value, there is an exponential decay/increase in the R 2 /RMSE.RMSE unit is µg L −1 .

Figure 3 .Table 3 .
Figure 3. Result of RFECV with Ridge estimator.The black dot represents the maximum value of 17 selected features (predictors) to reach the highest explained variance.After the maximum value, there is an exponential decay/increase in the R 2 /RMSE.RMSE unit is µg L −1 .

Figure 4 .
Figure 4. Boxplot of accuracy in the 10-fold cross-validation training step for the SVR, MLP and RF models, showing the mean and the number of folds (n) or subsets in the training data used to define the best hyperparameters.

Figure 4 .
Figure 4. Boxplot of accuracy in the 10-fold cross-validation training step for the SVR, MLP and RF models, showing the mean and the number of folds (n) or subsets in the training data used to define the best hyperparameters.

Figure 5 .
Figure 5. Results of prediction (black dashed) and comparison with the observed test dataset (gray solid).For the three algorithms, R 2 is higher than 0.7 and RMSE lower than 1.2 µg L −1 .(a) SVR, (b) MLP and (c) RF.

Figure 5 .
Figure 5. Results of prediction (black dashed) and comparison with the observed test dataset (gray solid).For the three algorithms, R 2 is higher than 0.7 and RMSE lower than 1.2 µg L −1 .(a) SVR, (b) MLP and (c) RF.

14 Figure 6 .
Figure 6.Results of prediction (black dashed) and comparison with the validation dataset (gray solid).For the three algorithms, R 2 is approximately 0.9 and RMSE lower than 0.3 µg L −1 .(a) SVR, (b) MLP and (c) RF.

Figure 6 .
Figure 6.Results of prediction (black dashed) and comparison with the validation dataset (gray solid).For the three algorithms, R 2 is approximately 0.9 and RMSE lower than 0.3 µg L −1 .(a) SVR, (b) MLP and (c) RF.

Figure 6 .
Figure 6.Results of prediction (black dashed) and comparison with the validation dataset (gray solid).For the three algorithms, R 2 is approximately 0.9 and RMSE lower than 0.3 µg L −1 .(a) SVR, (b) MLP and (c) RF.

Figure 7 .
Figure 7. Result of SARIMA fit (black dashed) in the test dataset (gray solid).The better fit in extreme values is counter-balanced by the estimation of negative values, decreasing/increasing R 2 /RMSE compared to the ML models results.

Figure 7 .
Figure 7. Result of SARIMA fit (black dashed) in the test dataset (gray solid).The better fit in extreme values is counter-balanced by the estimation of negative values, decreasing/increasing R 2 /RMSE compared to the ML models results.

Figure 8 .
Figure 8. Cross-plots of the modeled and observed chlorophyll values in (a) SVR, (b) MLP and (c) RF.It is possible to notice the deviation in extreme values, showing the limitation of the ML models in deal with these data values.

Figure 8 .
Figure 8. Cross-plots of the modeled and observed chlorophyll values in (a) SVR, (b) MLP and (c) RF.It is possible to notice the deviation in extreme values, showing the limitation of the ML models in deal with these data values.

Table 1 .
Statistical description of parameters used as determinants to predict chlorophyll-a concentration (target) after linear interpolation (std, min and max are standard deviation, minimum and maximum values, respectively).

Table 1 .
Statistical description of parameters used as determinants to predict chlorophyll-a concentration (target) after linear interpolation (std, min and max are standard deviation, minimum and maximum values, respectively).

Table 2 .
Pearson correlation among predictors and the target chlorophyll-a concentration.

Table 2 .
Pearson correlation among predictors and the target chlorophyll-a concentration.

Table 4 .
Hyperparameters tested in GridSearchCV and those applied to each ML algorithms.

Table 5 .
Comparison of nonoptimized (default) and optimized model performances for predicting chlorophyll-a concentration during training (train) and testing (test) steps.The linear model serves as a base model.

Table 5 .
Comparison of nonoptimized (default) and optimized model performances for predicting chlorophyll-a concentration during training (train) and testing (test) steps.The linear model serves as a base model.