Machine Learning Regression Approaches for Colored Dissolved Organic Matter ( CDOM ) Retrieval with S 2-MSI and S 3-OLCI Simulated Data

The colored dissolved organic matter (CDOM) variable is the standard measure of humic substance in waters optics. CDOM is optically characterized by its spectral absorption coefficient, aCDOM at at reference wavelength (e.g., ≈ 440 nm). Retrieval of CDOM is traditionally done using bio-optical models. As an alternative, this paper presents a comparison of five machine learning methods applied to Sentinel-2 and Sentinel-3 simulated reflectance (Rrs) data for the retrieval of CDOM: regularized linear regression (RLR), random forest regression (RFR), kernel ridge regression (KRR), Gaussian process regression (GPR) and support vector machines (SVR). Two different datasets of radiative transfer simulations are used for the development and training of the machine learning regression approaches. Statistics comparison with well-established polynomial regression algorithms shows optimistic results for all models and band combinations, highlighting the good performance of the methods, especially the GPR approach, when all bands are used as input. Application to an atmospheric corrected OLCI image using the reflectance derived form the alternative neural network (Case 2 Regional) is also shown. Python scripts and notebooks are provided to interested users.


Introduction
Ocean color retrievals from remote sensing have been produced regularly in the last decades with more or less accuracy in different regions of the planet and for several water types.The theory of aquatic optics was explained by Preisendorfer [1] and Jerlov [2]; and the related radiance transfer equations have been adapted by Mobley [3] for the Hydrolight model [4].The development and validation of water quality algorithms, many of them empirically developed and implemented using in situ data from very specific locations, are the main topic of many of the published investigations.First more focused on open ocean or Case-1 waters-where the optical properties are determined mainly by phytoplankton-, later with further development of algorithms for more complex or Case-2 waters (coastal and lakes), the water quality variables reported as able to be estimated by remote sensing are: concentration of total suspended matter (TSM), turbidity, colored dissolved organic matter (CDOM), concentration of chlorophyll_a (Chl-a), occurrence of surface accumulating algal blooms, concentration of phycocyanin, and Secchi depth, e.g., [5][6][7][8][9][10].The development of algorithms that do not require extensive in situ sampling for training is a central aim in remote sensing of water quality [11].The atmospheric correction, previous step to derive water leaving reflectance, has turned out to be demanding particularly for non-oligotrophic waters [12,13], and especially complicated for darker CDOM-rich waters so called Case-2 absorbing (C2A) or extreme absorbing waters (C2AX) [14].The water leaving signal is very low and the proportion of atmospheric noise can be very high (until 95% of the signal [15]).This atmospheric correction issue is, however, not part of the developments of this research, even though is a factor to take into account.
C2A and C2AX waters dominated by dissolved organic matter (DOM) are the focus of this research.DOM in the ocean has a relevant role and impact on the global carbon cycle, in concrete the colored component of DOM, which absorbs light exponentially decreasing from the ultra-violet (UV) to the visible parts of the spectrum.The estimation of CDOM from remote sensing data, as a proxy for dissolved organic carbon (DOC), requires of accurate algorithms [16].In the boreal temperate and cold regions like Finland, Sweden and Estonia, humic waters in lakes and some coastal zones are abundant.These waters typically have fairly low TSM and Chl-a concentrations, even though some cases of "black lakes" with high Chl-a and TSM values have been reported too [17].In these cases, the reflectance is negligible in the visible, and only in the red-near infrared is some times possible to detect the Chl-a signal.Within ESA's C2X project, extreme absorbing waters were characterized by CDOM absorption a CDOM (440) > 1 m −1 , which results in very low reflectance, typically with maximum below 0.005 sr −1 [14].In Finnish lakes, for instance, the median absorption coefficient of CDOM at 443 nm is around 3.7 m −1 [18,19].In Finland the humic matter concentration of lakes correlates with the share of peat land in the drainage area [20].Humic lakes can also originate from peat dredging, e.g., in the Netherlands.Information on humic substances is utilized in the application of official directives, lake management and climate change studies.The a CDOM parameter is considered to be a measure of dissolve organic carbon, which could help to estimate CO 2 efflux and to assess the carbon pool in carbon budget studies [19].An accurate measurement of the a CDOM parameter from remote sensing seems crucial in these types of water.However, it is known that CDOM is one of the most critical and uncertain ocean color (OC) product [21,22].
In the work presented here, we focus on the CDOM estimation, showing results of the application of several machine learning (ML) algorithms in 'typical boreal waters', with medium to extreme CDOM absorption.The main objective is the retrieval of the CDOM variable using the sensors developed by the European Space Agency (ESA) as part of the Copernicus Earth Observation Programme: Sentinel-2 Multi-Spectral Instrument (S2-MSI) and Sentinel-3 Ocean and Land Colour Instrument (S3-OLCI).The S2-MSI sensor and other high-spatial-resolution instruments have the drawback that, since they are designed initially for terrestrial applications, their spectral resolution, measurement frequency, and radiometric characteristics are not optimized for water quality mapping.Even though, S2-MSI gives more accurate water quality estimates through its enhanced channel configuration and better temporal resolution than other high resolution satellites like Landsat (see Section 2 for details).S3-OLCI is designed to measure ocean color over ocean and coastal zones with 300 m of spatial resolution (see Section 2 for specifications).With a very good signal-to-noise ratio, good radiometric stability, mitigation of sun-glint contamination and excellent cover of the global ocean, S3-OLCI images the spectral distribution of the radiance at top-of-atmosphere.After atmospheric correction, the upwelling radiance just above the sea surface (the water-leaving radiance) is retrieved and used to estimate a number of geophysical parameters through the application of specific bio-optical algorithms.S3-OLCI provides information on the atmosphere too, especially on the aerosols characterization necessary for the atmospheric correction process.
Concerning the retrieval of CDOM absorption, several band ratios have been proposed as predictive models for estimating CDOM from spectral data [11,23,24].These parametric approaches only take into account a few spectral bands and thus they disregard the information contained in other bands.Furthermore, they are usually based on models determined by training data, like in situ measurements that correlate with certain band combinations or ratios.Non-parametric regression algorithms can alternatively exploit the information contained in all spectral bands.For instance, semi-analytical or even neural network (NN) algorithms, i.e., MERIS Case II [25], Case 2 Regional Coast Color (C2RCC) [26,27], Boreal Lakes Procesor [19], S3-OLCI Neural Network Swarm (ONNS) [28], are used to parametrize the inverse relationship between inherent optical properties (IOPs) and reflectance, allowing the retrieval of certain concentrations like Chl-a, TSM and a CDOM .We assess here the performance of alternative non-parametric machine learning (ML) algorithms for CDOM estimation over typical boreal waters.The specific goals of the study are: (i)to test five different ML regression methods: multivariate regression (RLR), random forests (RFR), kernel ridge regression (KKR), Gaussian process regression (GPR) and support vector regression (SVR); (ii) to apply the methods on two datasets coming from two separate sources to demonstrate a wide range of applicability; (iii) to evaluate the validity by comparing with established methods, like empirically derived polynomial algorithms or neural networks (C2RCC and ONNS).
The remainder of the paper is organized as follows.§2 describes the materials and methods used, giving details about the distribution and components of the datasets; explaining details and giving references on the traditional band ratio methods; and finally introducing the ML approaches tested and associated statistics.§3 gives an empirical evidence of the performance of the proposed methods in comparison to standard bio-optical models for the particular datasets used; and it shows the validation results for the S3-OLCI-Case2Xtreme (C2X) dataset using an independent validation dataset.In §4 we focus the analysis on the C2X dataset, comparing the results of the S3-OLCI ML approaches with the results of the ONNS; finally, an application of the ML methods is tested on an S3-OLCI scene and the best method is compared with the C2RCC NN products included by default in the S3-OLCI L2.We conclude in §5 with remarks about the analysis done and an outline of the future work.

Datasets
The datasets used are based on simulated remote sensing reflectance (R rs ) with Hydrolight 5.2 [4].R rs is the ratio of water-leaving radiance to downwelling irradiance above the sea surface.Here, R rs refers to the simulations of clear atmosphere reflectance with Sun at zenith and viewing angle exactly perpendicular.We apply the ML approaches on two datasets: the first dataset comes from the Finnish Environmental Institute in Helsinki (Suomen ympäristökeskus, SYKE) , and it is used to test and establish the methodology.The SYKE dataset is derived from routine monitoring measurements, with a broad representation of the water types and water status that can be found trough the year in many of Finland's lakes and coastal waters.Specific inherent optical properties (SIOPs) were derived from literature [11,29].The SYKE dataset is divided into two configurations: the first configuration simulates the Multi-Spectral Instrument carried in the Sentinel-2 satellite (S2-MSI).The S2-MSI samples 13 spectral bands.The wavelengths and bandwidths of the S2-MSI bands are shown in Table 1.The second SYKE configuration is related to the Ocean and Land Colour Instrument (S3-OLCI).S3-OLCI is based on the opto-mechanical and imaging design of ENVISAT MERIS.The S3-OLCI bands have been optimized to measure ocean color over open ocean and coastal zones and the sensor consists of 21 spectral bands with characteristics summarized in Table 1 as well.
The second dataset consist of S3-OLCI simulations calculated for the ESA's C2X project.Within the framework of the C2X project, in-water radiative transfer simulations for S3-OLCI were likewise carried out with the commercial software Hydrolight [4].The source of the simulations is described in [30,31].In the C2X project, the results of the simulations were grouped in five subcategories: Case 1, C2A, C2AX, Case 2 extreme scattering (C2S) and C2SX.Each subcategory comprehends 20,000 individual combinations of concentration of water constituents, inherent optical properties (IOPs), and sun positions.This large dataset is used for training and testing of the S3-OLCI ONNS in-water processor [28].One part of the S3-OLCI simulated dataset is put aside for validation purposes, with more than 4000 spectra per sub-category reserved exclusively for it.The C2X dataset contains simulations in 21 bands from where a subset of 15 bands is used here for water quality parameter estimation.

Established Approaches Using Band Ratio Algorithms
Reports on established band ratio algorithms are mostly based on airborne and field measurements with negligible or only small atmospheric influence, like the simulated datasets used here.Important wavelength regions for the band ratio algorithms for CDOM retrieval are the ones between 400-600 nm, having 660-720 nm as reference.Following [11,23], CDOM can be estimated by a ratio of reflectance at wavelengths > 600 nm to reflectance in the 400-550 nm range.This ratio is valid for a wide range of water constituent combinations.In the work of [24] and [11] in situ measured data is used, and derived algorithms work well for a CDOM and TSM.The TSM is estimated using a single band at 709 nm.Kallio [18] compared two band ratios using three wavebands at 490, 665 and 709 nm.Following that research, the optimized regression in our dataset has a polynomial form with coefficients varying depending on the ratio used: Two band ratios are calculated with these simulated datasets and their correlation with the a400 nm value is shown in Figure 1.The 400 nm band is used here because its availability and because the CDOM absorption values at this ultraviolet (UV) wavelength are expected to be very high [18].The first ratio combines the bands 665 nm with the 490 nm (Ratio 1); the second ratio combines the band in 705 or 708.75 nm with the 490 nm (Ratio 2).Both ratios are compared with the in situ absorption measured at 400 nm (a400 m −1 ) and four new polynomial relations are then derived, two per each sensor configuration.The coefficients vary slightly from the ones in Tables 3 and 4 because all data available is used to study the ratio-parameter relationship.When applying the machine learning models, the dataset is divided randomly into a training set (75%) and a test set (25%).The S2-MSI and S3-OLCI configurations of the SYKE dataset include the whole range of Chl-a (0-120 mg m −3 ), with variable TSM (0-123 g m −3 ) and extreme CDOM at 400 nm (1-86 m −1 ).

Machine Learning Approaches
The methods based on ratios are parametric models, that is, an explicit relation is assumed between a subset of reflectance bands and the parameter of interest (CDOM in our case).Despite the good performance in general of such approximation, parametric models are generally limited because they do not exploit the wealth of spectral information and can only cope with (often too) simplistic and rigid relationships.An alternative approach is provided by nonparametric regression.In this case, an explicit relationship between reflectance and the parameter of interest is not assume, and the functional form is actually learned (adjusted, inferred, fitted) from the data.Nonlinear very flexible relations can be accommodated, and very often improved performance in accuracy and bias is obtained.In addition, the trained models offer very low computational cost in the production (test) phase.On top of these advantages, the fields of machine learning and statistics offer a solid mathematical background, which allows to obtain bounds of performance and in some case confidence intervals for the predictions.
Currently there are a great many machine learning methods available for tackling nonlinear regression.They can be categorized in several families: multivariate linear regression, tree-based algorithms, neural networks, and kernel methods.In this work, we explore and compare experimentally a representative method from each family.In particular, five machine learning algorithms for linear and non-linear regression are tested and compared to the polynomial regression explained above: (multivariate) RLR, RFR [32], KRR [33], GPR [34], and SVR [35].The selected methods cover all the machine learning families: decision trees, randomized methods, kernel methods and probabilistic nonparametric approaches.

Multivariate Linear Regression
In RLR the output y (CDOM) is assumed to be a weighted sum of B input variables, x = [x 1 , . . ., x B ] , that is ŷ = x w.Maximizing the likelihood is equivalent to minimizing the sum of squared errors, and hence one can estimate the weights w = [w 1 , . . ., w B ] by least squares minimization.Very often one imposes some smoothness constraints to the model and also minimizes the weights power, w 2 , thus leading to the regularized linear regression (RLR) method.

Decision Trees and Random Forests
An alternative method to linear regression is that of RFR [32].An RFR model is an ensemble learning method for regression that operates by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees.They combine many decision trees working with different subsets of features.The RFR strategy is very beneficial by alleviating the often reported over-fitting problem of simple decision trees.In addition, random forests (RFs) are quite robust to including a large number of input variables, excel in the presence of missing entries, heterogeneous variables, and can be easily parallelized to tackle large scale problems.RFs classification and regression have been applied in different areas of concern in forest ecology, such as modeling the gradient of coniferous species [36], the occurrence of fire in Mediterranean regions [37], the classification of species or land cover type [38,39], and the analysis of the relative importance of the proposed drivers [39] or the selection of drivers [38,40,41].

Kernel Methods
Kernel methods constitute a family of successful methods for regression [42].We explore the performance of three instantiations: (i) the KRR is considered to be the (non-linear) version of the RLR [33]; (ii) GPR is a probabilistic approximation to non-parametric kernel-based regression, where both a predictive mean and predictive variance can be derived [43]; and (iii) the support vector regression (SVR) is the regression counterpart of the traditional support vector classifier, and delivers sparse solutions.
Kernel methods offer the same explicit form of the predictive model, which establishes a relation between the input (e.g., spectral data) x ∈ R B and the output variable (CDOM) is denoted as y ∈ R. The prediction for a new input (radiance or reflectance) vector x * can be obtained as: where {x i } N i=1 are the spectra used in the training phase, α i is the weight assigned to each one of them, α o is the bias in the regression function, and K θ is a kernel or covariance function (parametrized by a set of hyper-parameters θ) that evaluates the similarity between the test spectrum x * and all N training spectra.Depending on the functional to be minimized and the constraints imposed, different kernel methods can be derived giving rise to different values of the weights α i .

Kernel ridge regression
KRR is the kernel version of the regularized least squares linear regression [33,44].The KRR essentially performs a regularized linear least squares regression in a feature space where the samples have been transformed to by a nonlinear mapping.It can be shown that the solution is analytical and reduces to solving a matrix inversion.As in any kernel method, one has to select the form of the similarity kernel function.In our case, we selected the standard radial basis function (RBF) kernel, defined as where σ is the lengthscale parameter which is typically adjusted by cross-validation.
Gaussian process regression GPR model assumes that the observed CDOM content is a function of the remote sensing reflectance using a multivariate joint Gaussian distribution of the available observations with zero mean and covariance matrix K: where κ * is the covariance between the training vector and the test point, κ * * is the covariance between the test point with itself, and K + σ 2 I n is the noisy covariance matrix of the training inputs.
Applying Bayesian inversion is possible to compute the posterior distribution over the output x * given the new input and the training dataset.

Support vector regression
The SVR is the SVM implementation for regression and function approximation [35,49].Unlike the previous KRR and GPR, the standard SVR formulation uses Vapnik's ε-insensitive cost function in which an error e = y − ŷ up to ε is not penalized, otherwise will incur in a linear penalization.This gives rise to a solution that is sparse, thus many α i become zero, meaning that the associated data points x i are irrelevant in the model.The SVR model has three hyper-parameters to be tuned: the C penalization factor, the ε-insensitive zone, and the kernel parameter σ, which as for KRR, we used the RBF kernel.Tuning all these hyper-parameters together is challenging.In our implementation, we followed a cross-validation procedure to do this.

Comparison, Implementation and Reproducibility
A summary compilation of the main characteristics of the methods is given in Table 2.The running time of the training of the algorithms varies from model to model and it also depends on the inputs and amount of fine-tuned hyper-parameters.With these two particular databases the training time in a standard workstation with 16 GB of RAM is around 3 days for all the combinations of bands and models.In addition to this summary, we show a thorough comparison to a state-of-the-art neural network in §4.The code reproducing the results of this paper can be found in https://github.com/IPL-UV/mlregocean.An operational Matlab toolbox implementing all machine learning methods is available too at https://github.com/IPL-UV/simpleR.

Experiments Setup
The two datasets, SYKE and C2X, are used separately, each one with a different strategy to train and test the six statistical models.The SYKE dataset is randomly split into training and test datasets before fitting the models (test size = 25%, random state = 42).The C2X dataset has two separated sets for training and testing.The calculations are carried out keeping the independence of both datasets and results are specific for each one.All the models have specific hyper-parameters for controlling over-fitting.We adjusted this hyper-parameters using standard ten fold cross validation on the training data.The experiments include five configurations of the input bands, which are common for the two datasets (Tables 3 and 4): the two firsts use one single ratio as input (S*-Ratio 1 and S*-Ratio 2); the third configuration uses the two ratios together as input (S* two ratios); the fourth configuration takes the full spectral range found in the simulation dataset (S* all bands); and the fifth is a mix of the third and fourth configurations using the two ratios as well as the full spectra (S* all bands + ratios).The spectral range of the simulated Sentinel-2 data consisted of 6 bands corresponding to some of the S2-MSI sensor wavelengths.The Sentinel-3 OLCI simulations included a total of 12 to 15 bands from the visible to the NIR.These are the common set found in many of the ocean color sensors used for water quality retrievals (see bands marked in blue in Table 1).Statistics used to check the validity of the methods are: the coefficient of determination (R 2 ), the bias, root mean squared error (RMSE) absolute (abs) and relative (rel), the mean absolute error (MAE), the residual error (RES) and the Pearson's coefficient of correlation (R).

Sentinel 2 MSI Data
Table 3 offers an overview of the metrics for all six methods tested with the five different combinations of input variables for S2-MSI.When using only one input (S2-Ratio 1 = R rs (665)/R rs (490 nm), S2-Ratio 2 = R rs (709)/R rs (490 nm)) the five ML methods tested do not improve the results strikingly in comparison with the polynomial regression.The metrics are very similar to each other and the polynomial regression is especially close to the KRR and GPR methods.Both approaches give very good results, but computation time will then be the determining factor, and the polynomial regression requires less time and it still efficient.However, when using more than one variable, non-linear ML methods get more relevance, as it can be seen in the S2 two ratios, S2 all bands and S2 all bands + ratios sections on Table 3. GPR and SVR methods show the best results in terms of correlation coefficients (> 0.98) and errors (< 0.12).When the two ratios are used as inputs, the RLR approach could still be considered as the simplest solution and probably preferred over the more sophisticated and computational time consuming ML techniques.However, it is clear that using all bands available as inputs leads to a significant improvement on the statistics (S2 all bands + ratios), especially using GPRs and also SVRs as mentioned before, reducing considerably the residuals and bias and improving the RMSEs.
The distribution of residuals for S2 two ratios and S2 all bands + ratios are shown in Figure 2. The box-plots of the errors show the clear improvement of the non-linear ML models in comparison with the linear one (RLR) for all cases (residuals closer to zero).The use of all bands and band ratios available leads to an amelioration in the errors.
The regression plots in Figure 3 compare the performance of the models as the expression of the value of the ratio against the values of the test dataset.Data plotted are the training (gray dots) and test (pink dots) points.The Polyfit line is the polynomial regression performance of the single ratios, in dark red, an it will be taken as reference (see also Figure 1).The plot on the left shows the results for S-2 Ratio 1. Data behaves quite regularly until the absorption values are higher than 10 m −1 , where they start to show more dispersion.At this point, it is useful to see how the RFR model behaves (orange line), because the relative over-fitting of this model is indicating areas with outliers in the training data.The RLR is following a straight line (in blue), not fitting much the data at the beginning in the lower values and at the end of the relationship, although it has the same trend than the other models.The KRR (green line) and SVR (red line) reveal a better fit with the data and between them.The GPR (purple line) model seems similar to the KRR with the final portion closer to the SVR model, taking into account the higher a CDOM (400) values.The area in light gray is the 95% confidence interval of the GPR prediction (see Section 2.3).The plot on the right shows the same regression fitting for the models compared with the polynomial fitting of the S-2 Ratio 2. In this case many of the models approach the Polyfit regression closer than the the S-2 Ratio 1.The regression line of the RFR seems to be smoother than the previous one and the uncertainty of the GPR tighter.This would probably mean that the S-2 Ratio 2 is a better method to derive calibration coefficients for CDOM algorithm development with the SYKE dataset.An interesting tool for the analysis of the models is the permutation plots shown in Figure 4.These plots are the result of the permutation test, which calculates the probability of getting a value equal to or more extreme than an observed value of a test statistic under a specified null hypothesis by recalculating the test statistic after random re-orderings (shuffling) of the data [50].The permutation plot can be seen as a mean of feature ranking.The statistic used is the MAE, which means that Figure 4 shows the MAE after shuffling each of the regressors P = 30 times.According to our previous experience [51], permutation analysis results in robust estimates of variable relevance, and it is possible to achieve stable feature rankings with a moderate number of runs.More than 30 realizations does not reduce much the variance of the estimates.By using the permutation plots, we are measuring the importance of each of the bands to the proposed model.The S-2 all bands + ratios composition permutation plots of 4 models are shown as example.The higher the MAE value when a particular band is removed, the more weight it has in the model: that is, for the RLR bands in the blue part of the spectrum (443 and 490 nm) caused a higher MAE when removed compared with the other bands, followed by the 560 and the 705 nm bands.Precisely the combination of the 705 and the 490 is the S2-Ratio 2, which is crucial for the RLR method as observed in the plot.For the KRR method bands seem to have more balanced distribution, highlighting the importance of band at 443 nm.The GPR approach shows major weight of band 740 nm, followed by 665 nm, diminishing the role of the blue and ratio bands compared with other models.In the SVR approach, with lower MAE values than the rest, the two ratios play a major role.In general all methods highlight the importance of the bands used for the ratios or the ratios themselves, which agrees with the physical knowledge about the bands contributing more to CDOM determination.

Sentinel 3 OLCI Data
Statistic results are very similar for the S3-OLCI configuration (Table 4, left part), with in general good values for all approaches when using S3-Ratio 1 or S3-Ratio 2 as inputs (R > 0.94 and RMSEs < 0.7).The Polyfit and KRR methods stay within the same ranges, with very good performance in both cases.Increasing the number of inputs shows better results, especially with the more sophisticated ML approaches.The RLR fails when using S3 all bands, due to the non-linear behavior of the inputs as it happened with S2-MSI (R = 0.763, RMSEr > 1.4).The KRR, GPR and SVR approaches show the best statistics in S3 all bands and S3 all bands + ratios.
Figure 5 shows the residuals values in box plots for S3 two ratios and S3 all bands + ratios.Again the errors are reduced, in this case quite spectacularly, when using more sophisticated kernel-based methods (KRR, GPR and SVR) and they are especially low when using all inputs S3 all bands + ratios.In Figure 6, a similar pattern to the one described in the previous subsection with S2-MSI data (Figure 3) is found with the S-3 Ratio 1 and S-3 Ratio 2, which leads to similar conclusions.The main difference between S-2 Ratio 1 and S-3 Ratio 1 is that the latter suffers less over-fitting in the smaller values of the RFR regression line.

RLR
Figure 7 shows the permutation plots of the S3-OLCI bands for four models and the S3 all bands + ratios configuration.They are, in general, very similar to the ones for S2-MSI.For RLR the importance of the bands in the blue and red spectrum seems to be more relevant.The RFR model stresses the importance of the ratios when those are used as inputs (not shown), behavior also observed in S2-MSI.For the KRR model, bands at 665, 681.25 and 753.75 nm point out, but generally the weight of the bands seems too be more distributed with lower MAE than the RLR.The GPR approach shows a major weight of the red and NIR bands (673.75, 681.25 and 753.75 nm), similar to the KRR.Finally, the bands for the SVR approach behaves in a similar way in S3-OLCI and S2-MSI, with the S3-Ratio 2 pointing out.The higher weight of the blue bands in some of the approaches is expected, because the absorption maximum of CDOM takes place in these spectral ranges (400-490 nm), as it happens when the S3 all bands configuration is used (plots nor shown).The contrast of these blue bands with the spectrum in the red and NIR (620-709 nm) are the inputs used in CDOM band-ratio algorithms in many studies.When the ratios are used as inputs as well, they gain relevance, as shown for S2-MSI.

Application of the Models to the C2X Dataset
From the C2X dataset, we have selected all data classified as C2A and C2AX and in which chlorophyll is the main component of the specific absorption, that is only brown algae group dominates the signal.In total we have 5570 data that we use to train the models.Figure 8 shows the C2A data on the left and the C2AX on the right, both datasets are joined for the training of the models.Fifteen out of the 21 S3-OLCI wavebands are selected as inputs, including bands at 778.75, 865, 885 nm, which enlarge the experiment with new spectra.S3-Ratio 1 and S3-Ratio 2 are calculated to be included as inputs as well.The a CDOM (440) nm absorption coefficient, using both subgroups, has a range between 0.098 and 20 m −1 ; while the CHL content range from 0.03 rises until 200 mg m −3 ; TSM ranges from 0 to 10 g m −3 .It is worth noting that the absorption values are measure at a different wavelength than the SYKE dataset, which has to be taken into account when making comparisons.The C2X exclusive validation dataset is also filtered and 1783 spectra are left in an independent dataset, which is only used for validation purposes.
Table 4 (right part) shows the statistics of the model for comparison with the SYKE dataset.If the S3-Ratio 1 is used as the only input, statistics show quite good results using the Polyfit model, outperformed slightly by the KRR and GPR methods, which raises the question of model complexity vs. simple models with fair results.S3-Ratio 2 statistics show lower numbers than for S3-Ratio 1.
The RLR fails compared with other approaches, having lower R 2 (0.65) and higher RMSEr (8.47) and MAE (1.376).S3-Ratio 2 seems to under-perform if we compare with the results with the SYKE dataset.When the S3 two ratios are used as inputs, the RLR method gives similar results than when using only the S3-Ratio 1.The influence of the S3-Ratio 1 is then affecting the RLR and the rest of the models stronger than the S3-Ratio 2. However, if we compare the RLR and GPR permutation plots, the influence of S3-Ratio 2 is still visible in the GPR model (Figure 9), though not higher than S3-Ratio 1.If all bands are used as input, generally good results are obtained for all models except the simplest one (RLR), with R > 0.9 and MAE < 0.4.In terms of error, using all bands, the RFR does a very good job, with a high coefficient of determination (0.992) and one of the lowest RMSEr (0.881) and MAE (0.144) errors.If all reflectance bands and the ratios are used as inputs, improvements are clear with the RLR method in comparison with the other experiments.The best model is again the GPR, but all approaches have coefficients of determination and correlation > 0.96 and low errors and biases.
The permutation analysis shows that the GPR model has comparable results with the SYKE dataset (Figure 9, right plot), with major weight of the blue bands when no ratios are taken as inputs (not shown), changing the weights to bands 510 and 560 nm followed by the blue bands, and highlighting the importance of the ratios when they are taken into account.

Predicted vs. Measured CDOM with the C2X Dataset
As an example, plots and statistics comparing CDOM predicted (or retrieved) and CDOM measured (or simulated) for the six models and all possible band combinations have been generated.Figure 10 summarizes the main statistics using a linear regression of the two variables for some of models and band combinations.In general we obtain coefficients of correlation > 0.9 for all cases, except for the RLR method in S3 all bands with 0.782, and the RLR for the S3-Ratio 2. These two correlations also have the biggest standard error and deviate from the 1:1 line in the regression plots showing a negative bias.The best correlation results uses the S3 all bands + ratios combination, as pointed out before, with r values > 0.98 in all cases.The GPR and the SVR methods have similar results, with a slight lower error of the GPR.
Figure 10 (bottom row) shows an example of the best result -best band combination and best model-, which is the GPR S3 all bands + ratios, together with the RLR model with the same input.This comparison is made to observe the better adjustment of the results using the GPR approach, with a distribution of the data around the 1:1 line almost perfect when compared with the RLR results.The same figure shows the results for the simple Polyfit method on the two separated ratios on the top row, for a better understanding of the overall statistics and data distribution.

GPR_S3bands&ratios
Figure 10.Retrieval performance with respect to the C2A(X) CDOM simulations: on the top left the scatter plot of the Polyfit method with the Ratio1; on the top right the scatter plot of the Polyfit method with the Ratio2.On the bottom left RLR method using all available bands and the two ratios as input; on the bottom right the GPR method with all available bands and the two ratios.

Comparison of the Results Extracted with OLCI Neural Net Swarm
As it is the case with the machine learning methods, ONNS is an in-water algorithm, especially designed to retrieve water quality parameters from S3-OLCI satellite data.The aim of ONNS is to provide one algorithm that is suitable to all natural waters, from clearest oceanic waters to very turbid coastal or highly absorbing inland waters.For this purpose, a fuzzy logic optical water type classification is applied in conjunction with a set of specific neural networks.Input to the algorithm are normalized remote sensing reflectance at 11 OLCI bands, i.e., atmospheric corrected satellite data.ONNS retrieves 12 ocean color products with corresponding uncertainties, concentrations and optical properties, which form a system-inherent optical closure; CDOM absorption at 440 nm is one of the outputs.ONNS is based on the C2X dataset.The CDOM retrieval capabilities of ONNS are shown in Figure 10.Retrieval performance with respect to the C2A(X) CDOM simulations: on the top left the scatter plot of the Polyfit method with the Ratio1; on the top right the scatter plot of the Polyfit method with the Ratio2.On the bottom left RLR method using all available bands and the two ratios as input; on the bottom right the GPR method with all available bands and the two ratios.

Comparison of the Results Extracted with OLCI Neural Net Swarm
As it is the case with the machine learning methods, ONNS is an in-water algorithm, especially designed to retrieve water quality parameters from S3-OLCI satellite data.The aim of ONNS is to provide one algorithm that is suitable to all natural waters, from clearest oceanic waters to very turbid coastal or highly absorbing inland waters.For this purpose, a fuzzy logic optical water type classification is applied in conjunction with a set of specific neural networks.Input to the algorithm are normalized remote sensing reflectance at 11 OLCI bands, i.e., atmospheric corrected satellite data.ONNS retrieves 12 ocean color products with corresponding uncertainties, concentrations and optical properties, which form a system-inherent optical closure; CDOM absorption at 440 nm is one of the outputs.ONNS is based on the C2X dataset.The CDOM retrieval capabilities of ONNS are shown in comparison with the validation data in [28]; they state that lowest CDOM concentrations (in Case-1 waters) vanish in the noise, whereas high concentrations (C2AX) can be retrieved accurately.For C2A waters, the a CDOM (440) obtains an absolute RMSE (evaluated in the linear space) equal to 0.023; the coefficient of correlation is 0.985; and the bias is slightly negative with −0.005.CDOM concentration is higher in C2AX waters, thus, RMSE and bias are lager, 0.435 and −0.021 respectively, but the correlation keeps high at 0.993.
The C2X dataset used in the present work is a small subset of the original database, only the 'brown spectral phytoplankton group'.As an aside, in both cases C2A and C2AX, approximately 1.2 of cases are non-classifiable with ONNS, which results in NaNs after the application of the NN.Attending at the statistics of the linear regression between predicted and C2A(X) test data (see Figure 11 in log10 scale), with slope of 1.003, intercept of 0.004, correlation coefficient of 0.998, and standard error of 1.6 × 10 −3 , we can conclude that ONNS shows similar good results, with slightly higher standard errors but same correlation coefficients than the GPR-S3 all bands + ratios, the method that we have considered the best one in the previous section.

Comparison Against the Standard S3-OLCI Product
ML approaches have statistically demonstrated to be very promising techniques for bio-geophysical land and water quality parameter retrieval [46,48].To set the applicability of these methods for satellite image processing, here there is an example of the processing of an OLCI scene using the GPR model.This exercise cannot be considered a proper validation, since there are too many differences in the approaches, starting with the source of the reflectance bands.The ML methods rely on atmospherically corrected data, the training of the models is done with the C2X dataset, and then apply to the reflectance on the scene.The standard OLCI ocean processing module provides water-leaving reflectance in several steps (https://sentinel.esa.int/web/sentinel/user-guides/sentinel-3-olci/processing-levels/level-2).The first step is performed using two algorithms, selectable by a dedicated switch:

•
The Baseline Atmospheric Correction (BPAC) removes all the contributions to TOA reflectance, including glint correction and white cap effects.Later in the processing it estimates the near-infra-red water-leaving reflectance to perform the atmospheric correction.

•
The Alternate Atmospheric Correction (AAC) uses a neural network approach to provide water-leaving reflectance, but these products are not written by default in the standard L2 provided by EUMETSAT.
Several other processors compute all the needed products from the water-leaving reflectance (OC4ME Chlorophyll, IMT Neural Net, Transparency Product, PAR Product).The AAC atmospheric correction scheme originally designed for MERIS (Medium Resolution Imaging Spectrometer), the MERIS Case-2 neural network or C2R [25], is used as basis for the OLCI NN development [26].Some of the products retrieved using the C2R neural net are concentrations (CHL_NN, TSM_NN, ADG443_NN) together with atmospheric data (aerosol optical thickness and angstrom exponent at 865, integrated water column, etc.).
The scene processed here is downloaded from the CODA EUMETSAT service (https://eoportal.eumetsat.int),with the date of acquisition on the 24 May 2016 (09:09:53-09:11:53 am).Processor version at time of downloading is IPF-OL-2 06.11.It is difficult, due to the short operational time of the S3 satellite and the complex atmospheric situation of the area, to find perfectly clear scenes over the Baltic Sea area.Other issues regarding the atmospheric correction and its potential effect on CDOM retrieval is the overcorrection of naturally low radiances in the Baltic area, which can lead to estimations of negative reflectance when using the BPAC processing.This bias can affect the CDOM retrieval, with extremely high and unrealistic values.In order to overcome some of these potential problems, we downloaded the Level-1 image and processed it with the C2RCC algorithm, available in the Sentinel-3 Toolbox of the SeNtinel Application Platfor m (SNAP) (http://step.esa.int/main/toolboxes/snap/).We used the remote sensing reflectance bands for the GPR model previously trained with C2X dataset.The OLCI scene has dimensions 2728 × 4865, from where 2,625,816 is the number of valid pixels left after application of the C2RCC flags (Cloud_risk, Rtosa_OOS, Rtosa_OOR, Rhow_OOS, Rhow_OOR).Figure 12 shows in a bar plot how long it takes for each model and band combination to compute the results.As the model begins to be more complex, the computation time increases.We observe a minimum of a few seconds when we apply the RLR until approximately 10 min of the GPR using all bands and ratios.The comparison of the ADG443_NN product and the CDOM product calculated with the the GPR model is shown within a snapshot of the SNAP software in Figure 13.Both images show similar patterns: the ADG443_NN is shown in the left, in the right the GPR_CDOM output.The area in the map shows the Gulf of Finland and the eastern part of the Baltic Sea.Only valid sea pixels are shown, with land pixels masked in gray and clouds masked in white.Red pixels are the invalid pixels flagged by WQSF_lsb_OCNN_fail, Rtosa_OOS and Rhow_OOS (BPAC and C2R NN flags).A cloudy area in the center of the Baltic Sea is masked by the Cloud_risk flag and appears in white in the ADG443_NN product.This flag is applied previously in the GPR_CDOM processing and that is the reason the pixels of this area appear as black (Not A Number) in the derived CDOM product.The uncertainty maps of both products are shown in Figure 14.ADG443_NN_err is shown in the left (higher uncertainties in cyan and green); in the right the GPR associated uncertainties (higher uncertainties in green and red).The ADG443_NN and associated uncertainties include the CDOM absorption plus the absorption of the detritus (minerals), which makes the quantitative comparison of both products a bit more difficult.However, the patterns of the products and the map of the uncertainties look very similar, with higher uncertainties in the same areas, for instance in the southern part of Lake Peipus or the Gulf of Riga.
To better understand the differences, it is important to highlight that the C2R NN has been trained with a large amount of Hydrolight simulations generated by the forward model [27], plus a bio-optical model relating scattering and absorption coefficients to concentrations.The bio-optical model is also based on a large dataset of in situ measurements of inherent optical properties [26].The C2R NN converts the directional water leaving radiance reflectance (which recalculates within the NN) into a number of inherent optical properties.These are then converted into concentrations of water constituents by simple regression [26].The complexity of this NN, including the improved atmospheric correction, is being compared here with less complex, but still sophisticated algorithms, with results that are quite optimized.ML approaches can help to reduce the training phase -and the generation of simulations-and processing time of satellite data in complex waters giving more than fair results.Other models, including SVR and KRR are also applied, and they show very similar patterns to the ADG443_NN output (not shown here) and quantitative values in the Gulf of Finland and norther part of Lake Peipus closer to each other than the GPR output.However, these methods do not allow to derive the uncertainty maps.In any case, if we consider the ADG443_NN as a reference, we could say that the SVR, KRR and the GPR underestimate CDOM in the western part of this scene.In the Gulf of Finland the images are very similar, with a slight underestimation going further to the west too; and a slight overestimation of the GPR (not in KRR or SVR) in the Gulf of Riga, the area showing higher uncertainties in the standard product.

Conclusions
This study presents the statistical evaluation of five machine learning approaches for the retrieval of ocean color remote sensing products, focused in the CDOM parameter.The advanced statistical approaches used are classical and well-established machine learning methods: multivariate regression (RLR), random forests regression (RFR), kernel ridge regression (KKR), Gaussian process regression (GPR) and support vector regression (SVR).They are applied on remote sensing reflectance simulating S2-MSI and S3-OLCI bands.The experiments are made using two different datasets: the SYKE dataset (S2-MSI and S3-OLCI) and the C2X dataset (S3-OLCI).The datasets are representative of water types that can be classified as Case 2 absorbing and extreme absorbing waters (C2A and C2AX respectively).The results of these advanced statistic approaches are compared with more traditional empirical algorithms (second grade polynomial); and the inputs of each method is studied through permutation analysis to determine their weight in the models.The analysis tries to determine the relationship of the results of each model with observed natural behaviors and with the results of simpler algorithms.
The main message after analyzing the statistics is that the more input bands in the model, the better, independently of the instrument (MSI or OLCI).However, this would assume that the input values are perfect and no errors after the atmospheric corrections are included.In the real cases, errors are larger in some bands than in others, which makes the option of not including all of them well-founded.Furthermore, if the inputs contain already some features (ratios) that correlate with the output products (CDOM in this case), the gains are clearly visible since they help to improve the statistics significantly.The complexity of the models has a relevant role when all the available inputs are used; but with the proper inputs it is possible to rely in simpler models and obtain still quite good results.
Part of the validation of the models is made with an independent dataset derived from the C2X dataset and the results shown in this paper are limited to the OLCI sensor.Comparison between predicted values and measured/simulated ones is made by linear regression statistics.Basically all methods give quite good coefficients o f correlation (0.782 < r < 0.99), with greater success of the GPR and SVR methods in all the experiments.Results are also compared with other machine learning approaches like neural nets.The OLCI Neural Nat Swarm (ONNS) is an algorithm specifically design with the complete C2X dataset for all five main water types (C1, C2A, C2S, C2AX, C2SX).Only results for C2A and C2AX water types are compared here, with the GPR model (followed by SVR) and the ONNS offering similar good performance for both water types.This evaluation increase the confidence in ML approaches utilization for OC retrievals.
An application to a real OLCI scene is also made for verification of the applicability of the methods on atmospherically corrected satellite data.The R rs data used from the image is derived with the C2RCC processor.This fact can introduce already important discrepancies since the input reflectance data come from different sources.The standard proxy for CDOM (ADG443_NN) is found in the OLCI L2 product derived by ESA/EUMETSAT and calculated using the C2R NN.The C2R NN relies on the derivation of IOPs and calculates concentrations applying a bio-optical model, which introduces another important source of uncertainty in the comparison.However, results with the GPR, SVR and KRR outputs show similar patters and small differences in those areas with lower uncertainties, which gives hope on the good performance of a faster to train and simpler mathematical algorithms for operational production.
A rigorous validation procedure cannot be followed since there is still not a large match-up dataset available.Hopefully, the Sentinel-3 Validation Team and other public and private initiatives would help to solve this issue in the next few years, and more quantitatively validation of the products will be possible.
Future work will be focused on three main issues: (i) retrieval of other OC variables like chlorophyll-a or total suspended matter; (ii) adding new approaches like joint or deep Gaussian processes; (iii) extent and improve the quantitative validation with proper data and methods, including application to imagery that has been atmospherically corrected with the same or similar methods than the input data use for the training of the models.

Figure 2 .Figure 3 .Figure 4 .
Figure 2.Box-plots of the residuals S2 two ratios and S2 All bands + ratios.On each box, the central mark is the median, the edges of the box are the lower hinge (defined as the 25th percentile) and the upper hinge (the 75th percentile), the whiskers extend to the most extreme data points not considered outliers.

Figure 5 .
Figure 5. Box-plots of the residuals for S3 two ratios and S3 All bands + ratios.

Figure 6 .Figure 7 .
Figure 6.Comparison of the performance of the models using the linear regression representation, S3-OLCI.

Figure 8 .Figure 9 .
Figure 8. Ranges and mean of reflectance spectra at S3-OLCI wavebands from the C2X dataset for C2A and C2AX subsets.The utilized S2-MSI and S3-OLCI bands are highlighted for convenience.

Figure 11 .
Figure 11.Absorption coefficient of CDOM at 440 nm retrieved by ONNS (IOP NNs) for the C2A (red dots) and C2AX (blue dots) spectral types.Plot is in log10 scale.

Figure 12 .
Figure 12.Computational time in seconds of each model and band combination on an standard OLCI scene.

Table 2 .
Main characteristics of the machine learning methods considered in this work: regularized linear regression (RLR), random forest regression (RFR), kernel ridge regression (KRR), Gaussian process regression (GPR), and support vector regression (SVR), grouped by family.We give explicit descriptors, such as the order O of the computational and memory cost for training and testing as a function of dimension d (spectral bands), number if samples n for training and number of nodes t in the RFR model.Also the qualitative characteristics of the considered methods (strengths and weaknesses) is shown.
Generally accurate and robust to outliers Slow, three hyper-parameters to tune by cross-validation

Table 3 .
Results obtained obtained with empirical fitting and several machine learning methods on the simulated S2-MSI data (a CDOM (400)).

Table 4 .
Results obtained obtained with empirical fitting and the five machine learning methods on the simulated S3-OLCI data of the C2A(X) dataset.