Evaluation of Satellite-Based Algorithms to Retrieve Chlorophyll-a Concentration in the Canadian Atlantic and Paciﬁc Oceans

: Remote-sensing reﬂectance data collected by ocean colour satellites are processed using bio-optical algorithms to retrieve biogeochemical properties of the ocean. One such important property is the concentration of chlorophyll-a, an indicator of phytoplankton biomass that serves a multitude of purposes in various ocean science studies. Here, the performance of two generic chlorophyll-a algorithms (i.e., a band ratio one, Ocean Colour X (OCx), and a semi-analytical one, Garver–Siegel Maritorena (GSM)) was assessed against two large in situ datasets of chlorophyll-a concentration collected between 1999 and 2016 in the Northeast Paciﬁc (NEP) and Northwest Atlantic (NWA) for three ocean colour sensors: Sea-viewing Wide Field-of-view Sensor (SeaWiFS), Moderate Resolution Imaging Spectroradiometer (MODIS), and Visible Infrared Imaging Radiometer Suite (VIIRS). In addition, new regionally-tuned versions of these two algorithms are presented, which reduced the mean error (mg m − 3 ) of chlorophyll-a concentration modelled by OCx in the NWA from − 0.40, − 0.58 and − 0.45 to 0.037, − 0.087 and − 0.018 for MODIS, SeaWiFS, and VIIRS respectively, and − 0.34 and − 0.36 to − 0.0055 and − 0.17 for SeaWiFS and VIIRS in the NEP. An analysis of the uncertainties in chlorophyll-a concentration retrieval showed a strong seasonal pattern in the NWA, which could be attributed to changes in phytoplankton community composition, but no long-term trends were found for all sensors and regions. It was also found that removing the 443 nm waveband for the OCx algorithms signiﬁcantly improved the results in the NWA. Overall, GSM performed better than the OCx algorithms in both regions for all three sensors but generated fewer chlorophyll-a retrievals than the OCx algorithms.


Introduction
The product derived from satellite ocean colour that is the most used is undoubtedly chlorophyll-a (chla) concentration, an index of phytoplankton biomass, which has numerous applications in biogeochemical oceanography, such as phytoplankton ecology and phenology [1,2], carbon cycles [3], climate change, transfer of energy to higher trophic levels, and water quality [4]. Satellite-derived chla concentration is a mature product of ocean colour that is used not only by experts in the field of bio-optics but also by the entire oceanographic community in various ways, such as modelling and data assimilation [5], fisheries applications [6], and ecosystem management [7]. For instance, European programs such as Copernicus (https://www.copernicus.eu/) or NOAA's COASTWATCH (https://coastwatch.noaa.gov/cw/index.html) provide daily chla data that are accessible to the public for fisheries and water quality applications. In Canada, Fisheries and Oceans have been relying on ocean colour for several decades to monitor the state of the marine ecosystem [8,9], and in particular on chla concentration, and its derived product primary production, to infer ecosystem indicators that are at the foundation of ocean management.
Chla concentration is derived from remote sensing reflectance (R rs ), which is the ratio of water-leaving radiance to downwelling irradiance corrected for sun geometry to allow for comparisons independent of locations, times and dates. Several approaches to infer chla from R rs have been developed and embedded by space agencies in their data processing software, including band ratio [10] and semi-analytical models [11]. As its name suggests, band ratio algorithms exploit the ratio of wavebands in the blue and green to retrieve chla; the Ocean Colour X (OCx) (x stands for 2, 3, or 4 and indicates the number of bands that were used in the algorithm) suite of empirical algorithms have been developed by the National Aeronautics and Space Administration (NASA) using a global dataset of in situ measurements of chla concentration fitted to remote sensing reflectance (NOMAD, the NASA bio-Optical Marine Algorithm Dataset [12]) and a fourth-degree polynomial expression. On the other hand, semi-analytical algorithms (e.g., Garver-Siegel Maritorena (GSM)) consist of optimizing bio-optical parameters (including chla concentration) in an approximate solution of the radiative transfer equation to match modelled reflectance to the reflectance measured by the satellite. This type of algorithm has the advantage of decoupling the contribution of the optically active components (i.e., phytoplankton, non-algal particles and coloured dissolved organic carbon) such that chla concentration should, in theory, be retrieved with higher accuracy than the band ratio algorithms. These two types of approaches are well-suited for the so-called case-1 waters (i.e., waters where chla concentration drives the optical characteristics of bulk seawater) but do not perform as well in case-2 waters (i.e., where optical signals of marine components are uncorrelated). For case-2 waters, models that use fluorescence [13] or more advanced statistical methods, such as neural networks [14] or principal component analysis [15], have been demonstrated to perform better than band ratio or semi-analytical approaches. Note that the performance of algorithms that use remote sensing reflectance will be inherently dependent on the performance of the atmospheric correction procedure, which will not be addressed here.
It has been shown that OCx [15,16] and GSM [17] algorithms contain regional biases, such that even if its overall performance is satisfying, application to a given region results in systematic bias. Here, we assessed the performance of the OCx and GSM algorithms in Canadian waters (Northwest Atlantic (NWA) and Northeast Pacific (NEP)) using a dataset of in situ chla concentration that was collected by Fisheries and Oceans Canada as part of their monitoring of the marine ecosystem. Performance of these algorithms was evaluated for NASA's three sensors, namely the Sea-viewing Wide-field-of-View Sensor (SeaWiFS, 1998(SeaWiFS, -2010, the Moderate Resolution Imaging Spectrometer on the Aqua platform (MODIS-Aqua, 2002-current) and the Visible and Infrared Imaging Spectrometer on the NPP platform (VIIRS, 2012-current). Uncertainties of the algorithms with respect to environmental variables were also discussed.

Regions of Interest
Fisheries and Oceans Canada (DFO) carries out sea-going expeditions as part of its oceanographic monitoring programs to obtain information on the health of the marine ecosystem to support decision-making in the management of the ocean. In the Northwest Atlantic (NWA) Ocean, more than a hundred stations are visited twice a year in spring and fall on the Scotian Shelf and once a year (in spring) in the Labrador Sea as part of the fieldwork of the Atlantic Zone Monitoring Program (AZMP) and Atlantic Zone Offshore Monitoring Program (AZOMP) respectively ( Figure 1). During these missions, a large number of physical, chemical and biological oceanographic parameters are collected [8]. As part of its remote sensing operations, DFO archives satellite data of ocean colour and sea-surface temperature in a region bounded by 39 • to 82 • north and 95 • to 42 • west. These data were used to assess the performance of ocean colour chla data products in the NWA. In a similar fashion, DFO runs monitoring programs in the NEP along Line P three times a year (in winter, spring and summer), off the west coast of Vancouver Island twice a year (in spring and summer) and in the Strait of Georgia three times a year (in spring, summer and fall), where physical, chemical and biological data are collected [18]. The in situ data from this region used in this study are bounded by 47 • to 57 • north, 148 • to 123 • west. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q 60°W 50°W q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 130°W Figure 1. Locations of satellite matchups (a) in the Northwest Atlantic (NWA) from January to June, (b) from July to December, (c) in the Northeast Pacific (NEP) from January to June, and (d) from July to December.

In Situ Samples
DFO collects data on chla concentration using two methods, namely, Turner fluorescence (TF) and high-performance liquid chromatography (HPLC). While the most comprehensive dataset is the one obtained using the TF method (tens of thousands of data), here we used the dataset containing HPLC-derived chla concentration since it provides the most accurate estimates of chla and it is also the method recommended by space agencies for validation activities. A large dataset was compiled for the purpose of this study, consisting of the in situ samples gathered in the NWA and NEP which were analyzed in their respective DFO regional laboratories. In brief, water samples were rapidly filtered after collection through 25 mm GF/F glass fiber filters, immediately flash-frozen using liquid nitrogen, and stored in a −80 • C freezer until analysis in the laboratory. The NWA samples were analyzed using a Beckman-Coulter Gold HPLC system (1998-2013) and more recently an Agilent 1200 (2013-2014) HPLC instrument following the protocol of Head and Horne [19] as modified by Stuart and Head [20]. Similarly, the NEP samples were analyzed using a Waters Alliance System HPLC following the protocol of Zapata et al. [21]. The analysis was performed for 23 pigments for both locations, but only chla concentrations from both regions and fucoxanthin ( f ucox) concentrations from the NWA were used in this study. Note that chlorophyll-a concentration includes chlorophyll-a epimer and allomer pigments.
The complete dataset consisted of 1857 samples from the NWA between 1999 and 2014, and 1231 samples from the NEP between 2006 and 2016, all collected within ten metres from the surface. Both datasets showed a skewed normal distribution with a peak around 0.43 (0.42) mg m −3 in the NWA (NEP) Ocean and a similar range of variation of chla from 0.03 to 29.41 mg m −3 , but samples collected in the NEP had a mean concentration (2.26 mg m −3 ) greater than in the NWA, where the mean concentration was 1.68 mg m −3 (Figure 2a). In the NWA, samples were collected mainly in spring (April to June) and fall (September to November) with a few exceptions of data collected in July and August, whereas the NEP dataset consisted of samples collected in winter (February to March), late spring-early summer (June to July) and late summer to early fall (August through October) (Figure 2b). Both datasets captured the seasonal cycle of phytoplankton, which was consistent with the wide range of biomass concentrations. The dataset was reduced to measurements collected within a day of satellite passes and ten kilometres from a valid pixel.

Satellite Matchups
The 2014 reprocessed datasets of SeaWiFS merged local area coverage (MLAC), MODIS-Aqua local area coverage (LAC), and VIIRS-Suomi NPP level-2 satellite scenes were downloaded from NASA's ocean colour L1/L2 browser (https://oceancolor.gsfc.nasa.gov/cgi/browse.pl?sen=am). For each image, a mask was applied to remove pixels with invalid data, defined as any pixel where one or more of the following criteria were met: 1. The pixel was marked by any of the following level 2 flags: atmospheric correction failure (ATMFAIL), deleted overlapping pixels (BOWTIEDEL), pixel overland (LAND), high sun glint (HIGLINT), pixel contains cloud or ice (CLDICE), radiance too high (HILT), high solar zenith (HISOLZEN), or high sensor zenith angle (HISATZEN). 2. The pixel contained a no-data (NA) value. 3. The pixel had more than one negative remote sensing reflectance (R rs ) value, implying the data might be flawed by the atmospheric correction procedure.
A search was performed for potential matches in any satellite image taken within a day of the in situ sample, as exact sampling times were not always available in the database. In order to match a satellite value to an in situ measurement, the satellite image was projected using a Gnomonic projection centred on the in situ point, which stretched great circles to straight lines. In other words, the image is projected onto a plane tangent to the in situ point, from which Cartesian coordinates (x, y) can be extracted for each corresponding (longitude, latitude) pair. For the sake of speed and simplicity, the Pythagorean theorem was then used to compute the distance between the in situ point and each satellite pixel. Though this method did not account for Earth's curvature, the computed distances at the maximum allowed radius around the in situ point (i.e., 10 km) were found to be within ±23 m of the more accurate geodetic distance as calculated by the distGeo() function using R Statistical Software [22]. A successful match was defined as the closest pixel, within a 10 km radius, that contained at least three valid pixels in a 3 × 3 pixel box extracted around the matching point. The distance between the in situ sample and the closest satellite pixel was recorded in the database. The median of the 3 × 3 pixel matrix was computed for each of the extracted remote sensing reflectance (R rs ) wavelengths, and the resulting dataset was considered as the total number of matches, mapped in Figure 1.
From the initial dataset of 1857 and 1231 samples for the NWA and NEP respectively, only 416 (SeaWiFS), 530 (MODIS) and 176 (VIIRS) in situ samples were matched to valid satellite pixels in the NWA, and 45 (SeaWiFS), 487 (MODIS) and 342 (VIIRS) in the NEP ( Table 1). The small number of matchups compared to the initial dataset attests to the difficulty of compiling archives of satellite data that match in situ samples for validation purposes and emphasizes the need for recurrent measurements such as monitoring programs and moored automatic sensors. Finally, chla concentration was derived using the OCx algorithm (Section 2.4.1) and pixels with a coefficient of variation (computed as the ratio of the standard deviation to the mean of all pixels available in the 3 × 3 matrix) exceeding 0.5 were removed from the matchup dataset.

Chlorophyll-a Algorithms
Two main algorithms were tested: The Ocean Colour (OCx) [10] and the Garver-Siegel Maritorena (GSM) [23] algorithms, as well as variations of these original versions that were regionally-tuned. The variations consist of four new versions for OCx, following the same polynomial format in degrees 1-4 with regional coefficients, and three optimal versions of GSM, with different combinations of three modifications to the algorithm, described in Section 2.4.4. For the sake of simplicity, results from only six algorithms were analyzed in this study: the two original algorithms (OCx and GSM), two regionally-tuned counterparts for OCx, and two for GSM. The results of the remaining algorithms have been included in Appendix A. The algorithms were implemented using version 3.3.2 of R Statistical Software [24] and several functions and packages referenced in the text.

OCx-Type Algorithms
These empirical algorithms use a polynomial equation to express the log-10 transformed chla as a function of the log10-transformed ratio of blue to green remote sensing reflectance (Table 1). The blue band is selected to maximise the blue-to-green ratio, R OCx : The ratio of blue to green R rs values is used as opposed to a single R rs value in an attempt to normalize the data, as R rs values in the blue part of the spectrum vary greatly with chla, whereas the green wavebands have a more limited range of variation. As chla increases, R rs (443) will decrease and become close to zero, reaching the noise level of the sensor. To avoid this issue, the algorithm switches to longer wavebands where the signal still correlates with chla and remains detectable. Additionally, longer wavelengths (e.g., 490 and 510) are less affected than the 443 nm band by variations in the concentration of coloured dissolved organic matter (i.e., yellow substances), and are therefore selected in the numerator of the ratio. Log10 chla is calculated using the following polynomial algorithm: NASA uses the method prescribed in O'Reilly et al. [10] to retrieve the optimal coefficients a, b, c, d, and e. This is an iterative method that constrains the slope and intercept of the linear regression between model and in situ chla to 1 and 0 respectively, minimizing the RMSLE (root mean squared logarithmic error, see Equation (12) in Section 2.5.1) and maximizing the coefficient of determination, r 2 . The comparison between band ratio models is then simplified since the slope, intercept, and consequently, the bias are all equal, and only the RMSLE and r 2 need to be evaluated. The in situ data used for the optimization of the original OCx algorithm is the global NASA Bio-Optical Marine Algorithm Dataset (NOMAD, version 2). Each sensor has a different set of wavelengths in the blue and green part of the spectrum (Table 1), such that the band ratio algorithm parameters are optimized for each sensor and are referred to as OC3M for MODIS, OC4 for SeaWiFS, and OC3V for VIIRS.

Regional Tuning of the OCx Algorithm
To remain consistent with the matchup exercise, the median value of the 3 × 3 pixel matrix of each waveband was computed to represent the R rs (λ) of the matchup. Note that the 443 nm waveband was removed from the potential blue wavebands in the band ratio as it was found to negatively affect the retrieval of chla (see discussion in Section 3.1). Several polynomial formulations from degrees one to four were tested and are hereafter referred to as "POLY1", "POLY2", "POLY3", and "POLY4" respectively. The optimal coefficients were retrieved using an iterative procedure to force the slope and intercept of the linear regression between satellite-derived and in situ chla to one and zero respectively as in O'Reilly et al. [10]. A 95% confidence interval for each optimal coefficient was computed using the boot() function in R with 2000 iterations [25,26]. For the sake of clarity and to investigate any potential benefits of adding higher-degree terms to the polynomial, the scores were computed for every polynomial degree, but only the results of the first and fourth-degree polynomials are discussed in the study. The parametrization and full results for the second and third polynomial are reported in Tables A1, A2 and A7 of Appendix A.

Original GSM
The GSM semi-analytical algorithm was developed by Garver and Siegel [27], and modified by Maritorena et al. [23]. The model uses a quadratic formulation (Equation (3)) to relate the underwater remote-sensing reflectance values to the ratio of total absorption, a (m −1 ), and backscattering, b b (m −1 ) [28]. Note that the wavelength dependence in the following equations has been removed for the sake of legibility: where g 1 = 0.0949 and g 2 = 0.0794. Total absorption and backscattering are defined as follows: = a w (λ) + chla × a ph (λ) + a dg (443) exp(−S(λ − 443)), where λ is the wavelength (nm). The term a corresponds to the total absorption term and is the sum of the absorption terms of pure seawater (a w ), phytoplankton (a ph ), and coloured detrital and dissolved organic materials (a dg ), which are combined in a single term given their similar spectral shapes; S expresses the exponential decrease of a dg with increasing wavelength using the reference wavelength at 443 nm (i.e., a dg (443)). The phytoplankton absorption term is defined as the concentration of chla multiplied by its specific absorption coefficient, a ph (i.e., absorption per unit chla), to account for its spectral variation. The total backscattering term is the sum of the backscattering terms of pure seawater (b bw ) and particulate matter (b bp ), the latter expressed as the product between a reference value (i.e., b bp (443)) and the power-law-like decrease with increasing wavelength by the exponent Y.
The coefficient a ph , given in Table A10 of Appendix A, was determined using in situ measurements of chla and phytoplankton absorption at the wavelengths of interest, and computing the mean value of absorption divided by chla across all available records within the AZMP database that spans from 1998 to 2014. The a w coefficients were derived from tabulated values with a 2.5 nm resolution from 400 to 700 nm [29], and when necessary, interpolated to the corresponding wavelengths of each sensor. Similarly, the b bw coefficients were derived from a 10 nm resolution table of values generated by Smith and Baker [30]. The spectral slope of yellow substances absorption S and the power-law decrease of particulate backscattering Y were optimized for use in the original algorithm to 0.02061 and 1.03373 respectively using a large simulated dataset [31]. The updated GSM models in this study tuned S and Y using the dataset collected from the NWA and NEP (see Section 2.4.4 for details). R rs,satellite (0 − ) was retrieved by taking the median R rs,satellite (0 + ) of the 3 × 3 matrix at each wavelength for a given match, then converting it from surface to underwater reflectance using Lee et al. [32]: Non-linear least squares regression, using the Gauss-Newton algorithm of the nls() function [24] to minimize the difference between modelled and satellite-derived R rs , was implemented with a maximum of 30 iterations to derive the remaining three unknowns in Equation (3): chla, the absorption coefficient of coloured detrital and dissolved material at the reference wavelength (a dg (443)), and the backscattering coefficient of particulate matter at the reference wavelength (b bp (443)). After retrieving these unknowns, a correction factor of 0.754188 was applied to the a dg (443) value according to Maritorena et al. [23]. The following constraints were then placed on each to ensure realistic values: The original version of the model is abbreviated here as GSM_ORIG.

Regional Tuning of GSM
Similarly to the optimization of the OCx algorithm, R rs (λ) for a given satellite matchup was computed as the median of the corresponding 3 × 3 pixel matrix centered on the matchup pixel. Three modifications to the original GSM algorithm were tested. First, a new optimized exponent, P, was added to the chla term to correct for systematic underestimation at high chla and overestimation at low chla (see Sections 3.2 and 3.3) that were potentially caused by the change of a ph with increasing chla (i.e., packaging effect [33]). The expression of the total absorption became: Second, the g parameters (see Equation (3) and Table A8) were optimized and spectral dependency was included as well as a new term g 3 (Equation (11)) using the synthetic dataset from the IOCCG working group on remote sensing of inherent optical properties [31]. Coefficients were retrieved at 10 nm intervals from 400 nm to 700 nm and interpolated to sensor-specific wavelengths when necessary.
Third, a sensitivity analysis was performed to compute the optimal exponents S (in the a dg term), Y (in the b bp term), and the new exponent P (in the a ph term), which provided the best agreement between in situ and satellite-derived chla for the NWA and NEP regions. A total of 5355 (17 × 35 × 9) possible combinations of the three exponents were tested, ranging from 0.008 to 0.04 with an increasing step of 0.002 for S, 0.5 to 2.2 with an increasing step of 0.05 for Y, and 0.4 to 0.8 with an increasing step of 0.05 for P. These combinations of exponents were evaluated for all sensors and regions, as well as for both constant g values (see Equation (3)) and the new spectral g values (see Equation (11)). The best set of exponents was determined using a similar scoring method as that used for the algorithm performance evaluation (see Section 2.5.2), where each set of exponents was ranked according to a selected set of metrics. To remain consistent with the methods of optimizing band ratio coefficients, the slope and intercept, as well as RMSLE and r 2 , were selected for this test, and a score was assigned to each statistic based on its mean across all possible combinations and its value relative to other combinations. The optimal set of exponents was defined as the median of all sets with the highest sum of scores. At least 50% valid retrievals were required for each potential combination in the test, in order to avoid sets with low error values but with a poor predictive capacity. The optimal exponents are given in Table A9.
The modified versions of GSM were divided into three separate algorithms according to the changes that were made. All new versions included the exponent P on the chla term: 1) GSM_GC used the original g coefficients and regionally-tuned P, S, and Y, 2) GSM_GCGS used the spectral g coefficients in combination with the optimized P, S, and Y parameters from GSM_GC; and 3) GSM_GS used the spectral g coefficients optimized simultaneously with P, S, and Y. The results of GSM_GC and GSM_GS are discussed in this study to demonstrate improvement in chla retrieval when using the new spectrally-dependent g coefficients and the P exponent in the phytoplankton-dependent term, however only the scores of GSM_GCGS are included in the main text for simplicity's sake, and the full results in Appendix A (see Tables A1 and A2).

Statistical Models
Several sets of statistics were used for the evaluation of the algorithm. First, the total number of matchups (N) and the number of valid retrievals (n), were used to calculate the percentage of retrievals. Second, several formulations were used to express the error between in situ and satellite-derived chla, including the mean error (µ error , units of mg m −3 ), as well as the root mean squared logarithmic error (RMSLE): the mean log-transformed error (MLE): and the mean magnitude of log-transformed error (MMLE), where C * i corresponds to satellite-derived chla of matchup i and C i corresponds to in situ chla of matchup i. These three metrics are unitless as a result of the log transformation, and the latter two metrics have an ideal value of one, stemming from the reverse log transformation that converts them from linear to multiplicative space. Third, the intercept, slope, and coefficient of determination (r 2 ) were computed for the linear regression of log-transformed satellite-derived chla on log-transformed in situ chla using the Standard Major Axis method of Type 2 regression in the lmodel2() function [34]. Finally, a metric referred to as the "win ratio" based on Seegers et. al. [35], was also computed. The full dataset of in situ/satellite matchups for a region/sensor was pared down to a new total value, T valid , which only included matchups with valid satellite-derived chla for all algorithms. This facilitated comparisons between the errors of individual matchups for each algorithm. For a single in situ/satellite pair, the algorithm that generated the lowest magnitude of error was the "winner". Each model's total number of "wins" in the set, wins model , was then divided by T valid : For example, if a dataset contains 500 matchups for which every algorithm generates a valid satellite chla concentration, and one of the algorithms (e.g., POLY1) produces the lowest error out of all algorithms for 70 of the 500 matchups, then the win ratio of POLY1 would be 70/500 = 0.14.

Scoring Method
The performance of all algorithms was evaluated using a modified version of the ranking approach developed by Brewin et al. [36], which relies on a slightly different set of metrics. Five statistics were selected as suggested in Seegers et al. [35] in an attempt to summarize the bias, accuracy, and precision of the model, as described below: 1. MLE, to account for the possible bias in the algorithm, 2. MMLE, to determine the accuracy of the algorithm, 3. r 2 of the linear regression, to test the precision or "goodness of fit" of the model, 4. percentage of valid retrievals, to ensure that a high score and low error are not reported as a result of a small number of retrievals, and 5. win ratio, to judge algorithm performance based on individual matches rather than a summary statistic.
The RMSLE was excluded from the scoring method in order to avoid redundant metrics that could skew the results by adding more emphasis to one element of the assessment (for example, bias) instead of giving all elements equal weight, however, its result is presented here for comparison with the literature. In our evaluation approach, all the metrics used were considered of equal weight. Each statistic was transformed to its magnitude of difference from the ideal value (for example, r 2 = 0.6 would be transformed to 0.4, as the ideal r 2 value is one) to give each the same reference point of zero so that lower statistics get higher scores. The mean of these transformed statistics was calculated across the results for all six algorithms. For a given algorithm, each of the five statistics was then scored as follows, taking into account their value relative to the mean and to the individual values of other algorithms: • two points if the algorithm's transformed statistic was in the lowest 20% of the statistic's values across all algorithms (i.e., closer to the ideal value of zero) and below the mean, • zero points if the algorithm's statistic was in the highest 20% of the statistic's values across all algorithms (i.e., further from the ideal) and above the mean, and • one point otherwise Statistical scores for each algorithm were then summed to give the algorithm's overall performance in comparison to others (Tables 3 and 4).

Temporal, Geographic and Phytoplankton Composition Influences on Algorithm Performance
The performance of the algorithms was also evaluated against environmental properties to detect any systematic bias or increase in uncertainties due to external factors. We studied the variability in the error (E chla ) as a function of time (defined as year + day of year/365) using boxplots generated by ggplot2() and geom_boxplot() in R [37], given that the validation exercises span over several years for all three sensors and both regions. To examine the relationships between the uncertainties and other environment variables, a correlation analysis was performed on the magnitude of error (|E chla |) between in situ and satellite-derived chla against the variables listed below, using the shapiro.test() function to check that each subset followed a normal distribution, and the cor.test() function to retrieve the correlation coefficient and p-value The ratio of f ucox to chla was selected to provide the ability to detect any bias due to change in phytoplankton composition, as the f ucox pigment is an indicator of the presence of diatoms [38,39].
To test if the seasonal cycle induced different uncertainties we performed a Student's t-test on µ error for each sensor and algorithm between spring and fall in the NWA, and similarly a One-Way ANOVA test between spring, summer and fall in the NEP. A separate t-test was also performed on µ error between coastal and open ocean bathymetry subsets (as defined above). Again, the assumption of normality of each subset was checked using the shapiro.test() function and assumptions of equal variance between subsets were checked using the var.test() and f ligner.test() functions for comparisons of two groups and three groups respectively. The tests for differences in means between subsets were conducted using the functions t.test() (for two groups) and aov() (three groups). Tukey's honest significant difference was performed on the ANOVA results from the NEP to determine which seasons generated mean errors that were statistically different from each other, using the TukeyHSD() function [24].

Parameters of the Optimized Band Ratio Algorithms
Regionalization of the OCx and GSM algorithms was carried out using satellite-derived and in situ chla, which differs from the original algorithms where the coefficients of the models were derived using remote sensing reflectances and chla measured in situ. The regionally-tuned coefficients of the band ratio models remained consistent with the original parameters and notably, the first two coefficients (i.e., a and b, Table 2) were within the same order of magnitude and of the same sign as the original coefficients. In general, the first term a was higher in the regional versions of the algorithms. This can be explained by the overall underestimation of chla by the original OCx algorithms. The a and b coefficients for the NEP for all sensors were closer to the original ones than for the NWA, which was consistent with the better performance of the original algorithms in the NEP than in the NWA (see Section 3.2 and 3.3). Regarding the coefficients for high degree polynomials (i.e., >= 2), there were no distinct patterns and the coefficients of the regional algorithms tend to be different and often opposite in sign from the original coefficients. The removal of the 443 nm band in the new band ratio algorithms may explain the differences between the original and regional coefficients of the algorithms. The numerator of the band ratio in the OCx algorithm varies between two or three possible wavebands depending on the sensor, where one of the options is the 443 nm band. Our analysis revealed that this band produced more outliers in the polynomial relationship between log-transformed in situ chla and the log-transformed band ratio than other numerator wavebands, which is particularly obvious in the NWA MODIS dataset (Figure 3). The blue boxes in Figure 3a,b show how the polynomial fit translates to a small range of modelled chla corresponding to a wide range of in situ chla. Removing the 443 nm waveband from the potential blue wavebands produced a tighter fit with the polynomial (Figure 3c,d). SeaWiFS and VIIRS NWA matchups datasets showed similar improvements after removing the 443 nm band, while the NEP datasets did not show significant changes after removing this band.

Algorithms performance in the NWA
The statistic metrics remain consistent across the three sensors, with the regionally-tuned algorithms providing the best performance, as expected. In general, the polynomial-based approach performed better than the semi-analytical algorithms ( Figure 4a and Table 3, see scores). It is noteworthy that the polynomial approaches yielded more matchups (96, 81 and 98% of possible matchups for MODIS, SeaWiFS and VIIRS respectively) than the semi-analytical methods (between 57 and 87% of possible matchups depending on sensors, global and regional methods). For the semi-analytical approach, the regional versions systematically provided more valid pixels that the original one. Table 3. Statistics computed for OCx, POLY1, POLY4, Garver-Siegel Maritorena (GSM)_ORIG, GSM_GC, and GSM_GS algorithms for the Northwest Atlantic. Note that other versions of the algorithms OCx and GSM were tested but only included in Appendix A, so the "win ratio" column does not add up to one for each sensor as it does not include all values. Also recall that as a result of the log and reverse log transformations, the ideal value of MLE and MMLE is one. N is the total number of matchups as defined in Section 2.3, and n is the number of matchups that returned valid chla for the particular algorithm.   Table 3, all the generic algorithms (i.e., OCx and GSM_ORIG) had a slope of the linear regression of satellite-derived chla on in situ chla lower than 0.84 (0.839 for OC3M) and as low as 0.600 for VIIRS GSM_ORIG. Note that the slope of the linear regression of the tuned algorithms was forced to one. The regional algorithms also had a higher correlation coefficient than the original methods for all sensors, all algorithms. The RMSLE exhibited values (from 0.29 to 0.47) that were in agreement with values reported in the literature for regional algorithms for the retrieval of chla [40][41][42]. Tuned algorithms showed smaller RMSLE than original algorithms for both semi-analytical and band ratio approaches except for the GSM approaches for VIIRS. The MLE (Equation (13)) was lower than one for the original algorithms for all three sensors, as they have an overall negative bias, as seen in Figure 4b. After correcting the bias in the regional algorithms, the MLE was closer to one (particularly in the band ratio models) and the MMLE was reduced. In general, the improvement in the MMLE was greater for the GSM-type approaches than for the OCx-type approaches. The better performance of the regional algorithms is most noticeable in the mean error, which generally improved by a full order of magnitude. When the algorithms were applied to the identical subset of retrievals, GSM_ORIG showed the highest win ratio for MODIS and VIIRS, while narrowly losing to OC4 in the SeaWiFS dataset. The regional linear and fourth-degree polynomial fits (POLY1 and POLY4) provided very similar results for all sensors, which is consistent with the study of Laliberté et al. [15] in the Gulf of Saint-Lawrence (Canada).

Algorithm Performance in the NEP
For the NEP, the original OCx algorithms exhibited a better linear fit than for the NWA (Table 4 and Figure 5b). For instance, OC3M showed a slope and intercept of the regression of satellite versus in situ data of 0.966 and 0.021 respectively against 0.839 and -0.067 for the NWA. The good performance of the original algorithms is also reflected in the correlation coefficients (r 2 ≥ 0.55) for both original algorithms (OCx and GSM_ORIG) and all sensors, as they have higher values than for the NWA, with the exception of GSM_ORIG applied to VIIRS data where r 2 is 0.48 ( Table 4). The regional algorithms, which aimed at reducing the RMSLE and forcing the slope and intercept to one and zero respectively, also improved the MLE. The RMSLE for all algorithms and sensors in the NEP are within the same order of magnitude as in the NWA except for the regional GSM models for SeaWiFS (i.e., GSM_GC and GSM_GS), which exhibited low RMSLE of 0.16 and 0.17 for GSM_GC and GSM_GS, respectively. As in the NWA, regional tuning generally improved the mean error, with the exception of the OCx and band ratio models in the MODIS dataset, where the magnitude of mean error increased, attesting to the good performance of the ocean colour model currently in use for this region and sensor. The semi-analytical algorithms (GSM-type) exhibited higher win ratios in general across all sensors, with GSM_ORIG showing the highest win ratio for MODIS and VIIRS (0.28 and 0.25 respectively), while GSM_GS showed the highest win ratio for SeaWiFS. Ultimately, the simple linear band ratio model (POLY1) obtained the highest score across the three sensors but tied with POLY4 and the regional GSM models in the SeaWiFS dataset. It is worth noting that other variations of the band ratio models (POLY2 and POLY3) also performed well, with the third-degree polynomial (POLY3) outperforming POLY1 in the SeaWiFS set, and POLY2 obtaining a higher score than other polynomials in the NWA dataset (Table A1). Similarly, the MODIS NEP dataset contained an exception to the pattern of high-scoring GSM_GC and GSM_GS, where GSM_GCGS outperformed the two.   Figure 4, for the NEP. Note the colour-coding for the solid circles in (b): blue corresponds to winter, green to spring, purple to summer, and red to fall.

Patterns in Chla Uncertainties with Phytoplankton Composition, Time, and Number of Retrievals
The correlation analysis described in Section 2.6 revealed that the magnitude of uncertainty in the retrieved chla (|E chla |) was strongly positively correlated to chla concentration for all sensors, regions, and algorithms, with r 2 varying from 0.40 (VIIRS, GSM_GC, NWA) to 0.97 (GSM_ORIG, NWA, all sensors). GSM_GC presented an exception in the NEP SeaWiFS dataset with a weaker correlation between |E chla | and chla that can be explained by the very low number of matchups (Tables 5 and  A2). Similarly, we found a relationship between |E chla | and the ratio of f ucox to chla concentration for all algorithms, which was weaker in the regionally-tuned algorithms (Table 5) indicating that phytoplankton taxonomic composition has an impact on these types of algorithms, which can be reduced by the use of the regional models. There were no significant long-term trends of error magnitude in the time series for all sensors and both regions (Table 5). However, there was a negative correlation between the magnitude of error and the day of year across all sensors in the NWA, which is also apparent in the larger spread of error (the interquartile range) occurring during the first half of each year (1.5 mg m −3 on average across sensors for both POLY4 and GSM_GS) than during the last half (0.59 and 0.63 for POLY4 and GSM_GS respectively), with a general negative bias in the spring across the whole time series ( Figure 6 and Table A6). The differences in the spread of error between the first and last halves of the year is generally less noticeable in the NEP (Figure 7), where there is a smaller positive correlation between error magnitude and day of year, corresponding to a larger average spread of error in the second half of each year (1.0 and 0.57 mg m −3 for POLY4 and GSM_GS respectively) than in the first half (0.51 and 0.34). This can possibly be attributed in part to the timing of the samples, which were spread out over the year instead of concentrated in the spring and fall as in the NWA. A t-test and ANOVA also revealed that in the NWA, the average µ error across algorithms in the spring (−0.51, −0.60, and −0.60 mg m −3 for MODIS, SeaWiFS, and VIIRS respectively) were generally significantly different (opposite in sign and higher in magnitude) than in the fall (0.20, 0.19, and 0.032) (see Table A3), but there were no significant differences in the mean error between seasons in the NEP (Table A4). We did not find any correlation between the number of matchups and the magnitude of the uncertainties for a given season/sensor/region except when only a very low number of matchups was available (e.g., MODIS in spring 2005 in the NWA, 2 matchups, Figure 6) and one bad matchup could negatively weight the entire matchup dataset. In the NEP (Figure 7), there was a large number of outliers spread evenly across spring and fall seasons, unlike the NWA, where spring blooms generated the highest spread of error. Only a few data were available for SeaWiFS in the NEP, which were concentrated mainly in 2009 and 2010, and therefore the correlation and ANOVA results for this region/sensor dataset were less reliable. All variations of GSM also returned fewer valid retrievals than the band ratio algorithms, mainly due to negative remote sensing reflectances at either end of the spectrum, which prevented the model from generating a valid fit (Table 6). Table 6. Missing retrievals by GSM_GS algorithm for each region/sensor combination, subdivided by the cause of lost data, with the percentage of lost retrievals given for each cause. R rs (41X) is the remote sensing reflectance value at 412 nm for MODIS and SeaWiFS, and 410 nm for VIIRS. R rs (6XX) varies from 645 to 671 nm depending on sensor. Note that there were no instances of negative model chla, unlike a dg (443) and b bp (443). "Unrealistic unknowns" occur when chla, a dg (443), or b bp (443) are outside their acceptable ranges, given in Equation (9). Multiple issues were a combination of a dg (443) < 0 and chla above the acceptable range, with two instances where b bp (443) was also too high (>0.18).

Uncertainties Related to Geographic Considerations
The correlation analysis of |E chla | on latitude, bathymetry and distance from in situ to satellite measurements was carried out to investigate if these variables could explain some of the biases and discrepancies observed between the in situ and satellite-derived chla (Table 7). There were no correlations observed between |E chla | and latitude, which is an interesting result given the range of latitudes in the NWA dataset from approximately 38 • to 61 • N. Possible exceptions exist in the NWA VIIRS and NEP SeaWiFS datasets, but as mentioned in the previous section, the SeaWiFS dataset had very few data, and the high latitude data in the VIIRS dataset were all retrieved during the spring bloom (see Figure 1a,b), when chla concentration would typically be higher and thus |E chla | would likely also be higher as the two were collinear (see Section 3.4.1).
There were low negative correlations between |E chla | and open ocean bathymetry in the NWA MODIS dataset, and larger negative correlations in the NEP, which again can be explained by the collinearity of chla concentration and |E chla | and the fact that in the open ocean subset, bathymetry was also negatively correlated with chla concentration (r = −0.47). As for seasonal bias, the t-test on the coastal and offshore datasets in the NWA revealed that the mean uncertainties for both datasets were statistically different in the regional band ratio models for MODIS and SeaWiFS, with µ error ranging from 0.14 to 0.42 mg m −3 for the coastal regions and -0.31 to -0.20 in open ocean, indicating that this family of algorithms systematically underestimated chla offshore and overestimates in shallower water, while the difference was less significant with the GSM-type algorithms (see Table A5). The pattern reversed in the NEP, where the regional band ratio underestimated coastal chla (−0.90 to −0.77 mg m −3 ) and overestimated chla in the open ocean (0.076 to 0.27), again ignoring the SeaWiFS dataset due to insufficient data.
Distance between a satellite matchup and the in situ measurement was up to 10 km to maximize the number of matchups, though over half the matchups in each region/sensor subset were within 2 km of the measurement. Regression of |E chla | on the distance did not reveal a relationship (|r| < 0.1) for any datasets except NWA VIIRS GSM_ORIG, and the small NEP SeaWiFS dataset, suggesting that in open waters, where spatial variations in chla are small (mesoscale) compared to the coastal environment, the stringent criteria of pixel to in situ collocation can be relaxed to increase the number of matchups.

Discussion
Satellite-derived chla concentration remains the most used product to infer global scale information on the status of the marine ecosystem, and non-specialists represent an important fraction of ocean colour data users. Here we have assessed the performance of two algorithms currently implemented in NASA's SeaDAS software (https://seadas.gsfc.nasa.gov) in Canadian waters (Atlantic and Pacific Oceans) to inform on biases associated with these algorithms in those regions. In addition, we carried out modifications of these original models by optimizing their parameterization and formulation to correct for regional bias in the NWA and NEP. Both OCx and GSM exhibited a negative overall bias particularly in the NWA, where low chla were overestimated and high concentrations were underestimated, which resulted in a linear fit of satellite-derived on in situ chla with a slope lower than one. This bias was partially corrected in the modified versions of the models, which also improved the tightness of the relationship as indicated by higher r 2 than for the original algorithms, but at the expense of the number of individual retrievals with the lowest magnitude of error between different algorithms ("win ratio", Tables 3 and 4). Significantly different means in chla concentration between spring and fall in the NWA (see Section 3.4.1) suggest that the dataset could be further subdivided by season, but at the expense of a synoptical approach. It would also require a larger matchup dataset to achieve statistical significance.
Our approach differed from NASA's in that the algorithms were fitted directly between satellite-derived and in situ chla, under the assumption that the satellite R rs were reliable, while the original algorithms were developed using both in situ radiometric and chla measurements. Furthermore, the regional parameters were optimized, tested, and compared to the original algorithms using the entire dataset for a given region and sensor, rather than extracting a subset for training and using the remainder for testing. Confidence intervals for the polynomial algorithm coefficients were retrieved using a bootstrap method that subsamples the original dataset to compute 2000 different sets of coefficients (see Section 2.4.2), revealing interval widths ranging from 0.08 to 1.26 for the first three coefficients and 0.38 to 2.39 for the last two, which had the smallest impact on the shape of the polynomial in the region defined by the band ratio values. This suggests that the coefficients derived from the full dataset are representative of any subsets. For the GSM algorithms, the sensitivity study on combinations of spectral variations in P, S, and Y (see Section 2.4.4) was performed for each region, sensor, and set of g coefficients (spectral or constant), and the median of the subset of highest-scoring combinations was selected. For this reason, we can assume that using the entire dataset or a subset for each combination would have led to similar results after retrieving the median of the optimal combinations. Finally, the small size of some of the datasets, particularly the NEP SeaWiFS set, presented a challenge in developing a reliable set of parameters spanning a wide range of values for in situ chla and the environmental variables that were tested for correlations with algorithm error. For these reasons, the regional tuning of the algorithms was carried out on the full datasets rather than subsets.
One of the main results of our studies was the identification of the 443 nm waveband as a strong contributor to the overall uncertainty in the OCx algorithms. Removing this waveband from the OCx algorithms and iteratively forcing the slope and intercept of the linear model between satellite-derived and in situ chla to one and zero respectively provided better results than the original algorithms (Table 3). Increasing the degree of the polynomial improved the results for some combinations of region and sensor (see scores in Tables 3 and 4), but overall offered minimal improvements. The negative bias of the original GSM model was corrected by introducing a new exponent, P, on the chlorophyll term, which lowered satellite-derived chla concentrations <1 mg m −3 and increased concentrations >1 mg m −3 . This exponent P was derived in combination with the exponents on the a dg and b bp terms, giving the best set of spectral slopes for each region and sensor. Inclusion of spectral dependence in the g coefficients in Equation (11) slightly improved the results, however, optimization of the spectral slopes of the phytoplankton and yellow substances absorption terms, as well as the particulate backscattering term, had a more positive impact. This finding highlights the importance of accounting for regional properties of absorption and backscattering terms, and particularly their spectral dependence. Overall, the GSM-type algorithms provided the highest win ratio (i.e., the highest number of matchups with the smallest chla uncertainties), however, this good performance was hampered by the lower number of retrievals compared to the polynomial formulations (Tables 3 and 4). This limitation has to be considered according to the end-user's applications of the chla product.
Uncertainties in satellite-derived products are currently an active field of research (IOCCG report on Uncertainties in Ocean Colour Remote Sensing, in preparation) that remains highly complex given the possible sources of uncertainties associated with satellite observations that include, among others, uncertainties in radiometry (and calibration), atmospheric corrections, data binning, and bio-optical algorithms. For instance, it has been shown that at the global scale, the OCx algorithm contains an inherent uncertainty of about 35% [43]. The difference in scales between satellite and in situ measurements also creates another source of uncertainty, as in situ data refer to about two litres or less of seawater at a discrete depth while satellites integrate a volume of tens of millions of litres of seawater. Note that uncertainties in HPLC-derived chla were partially accounted for in the linear model comparing the results by using Type 2 linear regression as described in Section 2.5.1, which minimizes the sum of areas of triangles between each point and the regression line, assuming that both variables contain errors. Here, we have addressed discrepancies between satellite-derived chla and in situ chla and included temporal and spatial effects. For the two areas of interest, namely the Northwest Atlantic and the Northeast Pacific, we did not find significant patterns of temporal drift, thanks to regular reprocessing of ocean colour products by NASA following a vicarious calibration exercise [44] that ensures stable measurements of the radiative field with time. We did not find spatial patterns, with the exception of minor positive correlations between latitude and magnitude of error in the NWA VIIRS dataset, and negative correlations between open ocean bathymetry and error magnitude in the NEP datasets, which are likely due to changes in chla concentration that affect the degree of error. Seasonal biases were observed with an overall underestimation of chla in the spring and a small overestimation in the fall. This could be explained by the change in phytoplankton community composition ( Table 5), such that when the ratio of fucoxanthin, a marker pigment for diatoms, to chla concentration increased, the magnitude of error in chla retrieval also increased. This pattern was observed across all algorithms, but weaker in the regionally-tuned algorithms. GSM-type algorithms offered the greatest improvements, in particular, GSM_GC in the VIIRS dataset where the correlation coefficient of |E chla | and the ratio of fucox to chla was reduced from 0.54 to 0.27. This demonstrates the effectiveness of regional algorithms at addressing changes in phytoplankton absorption properties, and notably the modified GSM algorithms, which include an exponent on the phytoplankton absorption term to account for changes in absorption properties with changes in phytoplankton community composition in the Northwest Atlantic [45][46][47]. However, the fact that the f ucox to chla ratio impacted the regional algorithms for SeaWiFS (i.e., both OCx-like and GSM-like) to a similar degree as the original models suggests that the 510 nm band, which is used across all models for the SeaWiFS sensor, can contain information on the presence of diatoms, as highlighted in Sathyendranath et al. [48] who developed an algorithm to discriminate diatoms from other phytoplankton types for SeaWiFS.
Finally, another notable source of uncertainty is the atmospheric correction method applied to NASA's level 2 images, based on the "black pixel assumption", which assumes that scattering in the near-infrared bands from approximately 670 nm to 865 nm [49] is negligible, thus the ocean appears black. This was a reasonable assumption for case 1 waters [50], but optically-complex case 2 waters can contain elements that contribute to the scattering in those bands, giving an inaccurate correction of remote sensing reflectance values used in chla algorithms [51]. Given that many matchups in this report were near coastal or shelf areas, there could have been a degree of error associated with poor atmospheric correction. As discussed in Section 3.4.1, many satellite matchups did contain negative R rs in a waveband near one end of the visible spectrum. These matchups were incapable of computing valid chla concentrations for the GSM algorithms but were included in the band ratio algorithms.

Conclusions
This comparison exercise between satellite-derived and in situ chla concentration measurements for three ocean colour sensors (i.e., SeaWiFS, MODIS and VIIRS), two widely-used algorithms based on different approaches (i.e., band ratio and semi-analytical) and for two regions (i.e., Northwest Atlantic and Northeast Pacific) revealed systematic bias (slopes varying from 0.60 to 0.84 in the NWA and from 0.63 to 0.97 in the NEP) and relatively large uncertainties. While regional tuning permitted correction for the biases, the uncertainties remained substantial, attesting to the fact that natural variations of phytoplankton optical properties, the presence of other optically active components, and accuracy of the atmospheric correction will challenge accurate retrieval of chla concentration. In the Northwest Atlantic, the removal of the 443 nm band in the regional band ratio algorithm improved the performance of the OCx-type algorithms. It is noteworthy that while band ratio algorithms provided the most matchups between satellite and in situ data, the GSM-type approach appeared to be most accurate thanks to its ability to decipher the phytoplankton signal from yellow substances, but its performance was hindered by the significantly lower number of retrievals, mainly due to negative radiances present in the blue or red part of the visible spectrum, perhaps due to poor atmospheric corrections. Acknowledgments: The authors would like to thank NASA's Ocean Biology Processing Group (OBPG) for supplying a large up-to-date satellite ocean colour dataset available for public use (https://oceancolor.gsfc.nasa. gov/), as well as detailed information on the current ocean colour algorithms.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:   Table A8. g coefficients used in Equation (11) after performing linear interpolation to the necessary wavelengths. λ g 1 g 2 g 3 Table A9. Exponents used in Equations (6) and (10), where the g c columns give exponents optimized to the original constant g 1 and g 2 coefficients, and g s columns give the exponents optimized for the use of the spectral g coefficients g 1 , g 2 , and g 3 .