Large-Scale Retrieval of Coloured Dissolved Organic Matter in Northern Lakes Using Sentinel-2 Data

Owing to the significant societal value of inland water resources, there is a need for cost-effective monitoring of water quality on large scales. We tested the suitability of the recently launched Sentinel-2A to monitor a key water quality parameter, coloured dissolved organic matter (CDOM), in various types of lakes in northern Sweden. Values of a(420)CDOM (CDOM absorption at 420 nm wavelength) were obtained by analyzing water samples from 46 lakes in five districts across Sweden within an area of approximately 800 km2. We evaluated the relationships between a(420)CDOM and band ratios derived from Sentinel-2A Level-1C and Level-2A products. The band ratios B2/B3 (460 nm/560 nm) and B3/B5 (560 nm/705 nm) showed poor relationships with a(420)CDOM in Level-1C and 2A data both before and after the removal of outliers. However, there was a slightly stronger power relationship between the atmospherically-corrected B3/B4 ratio and a(420)CDOM (R2 = 0.28, n = 46), and this relationship was further improved (R2 = 0.65, n = 41) by removing observations affected by light haze and cirrus clouds. This study covered a wide range of lakes in different landscape settings and demonstrates the broad applicability of a(420)CDOM retrieval algorithms based on the B3/B4 ratio derived from Sentinel-2A.


Introduction
Long-term water quality information is important for securing and restoring ecosystem services provided by lakes [1]. Lakes are generally small in size and widely distributed in remote locations, which make regular monitoring difficult [2]. For example, Sweden has more than 126,000 lakes, but owing to the time and effort required to conduct monitoring with conventional sampling techniques, the annual Swedish lake monitoring program focuses on only around 100 lakes [3]. Similar challenges are faced by countries in northern boreal and Arctic regions that have an abundance of lakes scattered over large areas with low accessibility.
Water monitoring programs in Europe and North America have observed increases in dissolved organic matter (DOM) over the past decades [4], but knowledge of the geographic extent of these changes remains poor. DOM not only increases the cost of drinking water production and decreases recreational value of lakes, but also impacts the structure and function of aquatic ecosystems [5] continuous spruce forest and peat wetlands; Västerbotten W and Jämtland districts, which have strong alpine gradients from lowland spruce to birch forest and further to high alpine conditions; and Värmland, which has a relatively productive boreal forest. All lakes were sampled more than once, but we decided to use data from visits close to dates with Sentinel-2A scenes that had minimum cloud cover. These lakes have no monitoring plan as they are relatively inaccessible, but the site selection was motivated by the need to test the performance of Sentinel-2A in regions where in situ data are rare. from multiple depths (ca. 4-5 evenly distributed depths) within the mixed layer above the thermocline at the deepest point of the lakes. The water samples were kept cool until arrival at the laboratory, where samples were filtered (GF/F = 0.7 µm) and analyzed for absorbance using a Jasco V-560 UV/vis spectrophotometer (Easton, MD, USA). Filtered samples were refrigerated for up to one week after sampling before absorbance analysis. The measured absorbance values were converted to a(420)CDOM in Napierian units (based on a natural logarithm) with the equation a(420)CDOM = 2.303D/r, where D is the measured absorbance and r is the cell path length (in m) [10,31]. Limnologists have used a variety of wavelengths to describe CDOM, and proposed 440 nm as a common standard in North American studies [10]. However, we used 420 nm as it is the wavelength used in the national Swedish monitoring program. The difference between the date of image acquisition and the corresponding in situ data collection ranged approximately between one week to ten weeks, potentially decreasing the accuracy of our models [29]. However, past studies [15,32] have shown that CDOM concentration in lakes is a relatively stable parameter over these durations.

ID Lake Name and Region Latitude Longitude
A representative water sample was collected from each lake by sampling and pooling water from multiple depths (ca. 4-5 evenly distributed depths) within the mixed layer above the thermocline at the deepest point of the lakes. The water samples were kept cool until arrival at the laboratory, where samples were filtered (GF/F = 0.7 µm) and analyzed for absorbance using a Jasco V-560 UV/vis spectrophotometer (Easton, MD, USA). Filtered samples were refrigerated for up to one week after sampling before absorbance analysis. The measured absorbance values were converted to a(420) CDOM in Napierian units (based on a natural logarithm) with the equation a(420) CDOM = 2.303D/r, where D is the measured absorbance and r is the cell path length (in m) [10,31]. Limnologists have used a variety of wavelengths to describe CDOM, and proposed 440 nm as a common standard in North American studies [10]. However, we used 420 nm as it is the wavelength used in the national Swedish monitoring program. The difference between the date of image acquisition and the corresponding in situ data collection ranged approximately between one week to ten weeks, potentially decreasing the Remote Sens. 2020, 12, 157 5 of 16 accuracy of our models [29]. However, past studies [15,32] have shown that CDOM concentration in lakes is a relatively stable parameter over these durations.

Satellite Image Processing
Level-1 Sentinel-2 MSI data (L1C) were downloaded from the Scientific Data Hub (https://scihub. copernicus.eu). A total of eight images from between June and October 2016 were downloaded, which covered all 46 lakes from the five different regions. Per-pixel radiometric measurements were provided as top-of-atmosphere (ToA) reflectance. The L1C products had been resampled with a constant ground sampling distance of 10, 20, and 60 m, depending on the native resolution of the different bands, and were delivered as scenes with UTM Zone 33N projection (WGS-84 datum). Images with 10 m and 20 m resolution were used [33]. Sentinel-2 Toolbox version 2.2.4 within the Sentinel Application Platform version 2.2.3 was used to process the images into Level-2A (L2A) surface reflectance using the Sen2Cor atmospheric correction module [33]. Sen2Cor is a processor used to generate Sentinel-2 L2A products, performing atmospheric, terrain, and cirrus cloud correction on the L1C data. Then, 3 × 3 pixel values were extracted from the middle of each lake and the mean values of the pixels were used for analyses. All processing was performed on a 64-bit Windows 10 workstation.

Statistical Analysis
Partial least squares regression (PLSR) was used to explore the relationships between lake CDOM concentrations and band ratios derived from Sentinel-2A bands 2, 3, 4, and 5 (Table 3). In this way, we were able to systematically evaluate and compare how accurate different ratios performed as CDOM predictors. Any ratio that was most strongly related to CDOM was then regressed against observed a(420) CDOM using power curves [29], which yielded the strongest relationships. A retrieval algorithm was then derived from the resulting regression model of different band ratios. Repeated k-fold cross-validation was performed in order to test the validity of the model. During this procedure, the data were divided into k = 10 portions of equal size, where one of these were retained for validation, while the rest (k -1) were used for training. The procedure was repeated five times to reduce variance. Model performance was assessed using the root-mean-square error (RMSE) and mean absolute error (MAE).

In Situ a(420) CDOM Measurements and Satellite Image Acquisition
The a(420) CDOM in the 46 lakes, as measured in the laboratory, varied from 0.30 m −1 to 29.93 m −1 ( Table 4). The highest a(420) CDOM was recorded in seven lakes (IDs: 3,24,28,30,32,33,38) in Värmland and Västerbotten E with values between 21.2 m −1 and 29.9 m −1 . The clearest lakes were located in Västerbotten W, Jämtland, and Norrbotten, with observed a(420) CDOM between 0.4 m −1 and 2.4 m −1 . The reflectance of the lakes in the different regions was generally high in the green band (B3: 560 nm) and low in the red band (B4: 665 nm) ( Figure 2). and Västerbotten E with values between 21.2 m −1 and 29.9 m −1 . The clearest lakes were located in Västerbotten W, Jämtland, and Norrbotten, with observed a(420)CDOM between 0.4 m −1 and 2.4 m −1 . The reflectance of the lakes in the different regions was generally high in the green band (B3: 560 nm) and low in the red band (B4: 665 nm) ( Figure 2).  Table 3.   Table 3.

Partial Least Squares Regression (PLSR) Model to Explore the Relationship between CDOM and Band Ratios
A PLSR analysis was used to explore the relationships between a(420) CDOM sampled in the lakes and the atmospherically-corrected band ratios. Two significant components were evident from the PLSR analysis. The first explained 40% of the variance in CDOM and was characterized by positive loadings weights for all band ratios. The second component explained that 34% of the variance in CDOM was characterized by positive loadings for ratios with B2 as the numerator, but with negative loadings for ratios with B3 as the numerator (Figure 3). The band ratio B3/B4 clearly showed the strongest relationship to a(420) CDOM as these two variables were located on the diagonally opposite end in the loading plot ( Figure 3). The other band ratios, especially B2/B3, B2/B5, and B2/B4, were less clearly linked with a(420) CDOM . The PLSR of the non-atmospherically-corrected L1C data (not shown) was excluded because of poor model fit.

Testing and Validating a(420) CDOM Retrieval Models
The B2/B3 explained 24% of the a(420) CDOM variation before atmospheric correction ( Figure 4A), but this relationship was not significant after atmospheric correction ( Figure 4B). The green to red band ratio (B3/B4), which has been commonly used to retrieve CDOM in lakes in the past, had a relatively stronger correlation with the lake a(420) CDOM ( Figure 4D). Before atmospheric correction, B3/B4 did not significantly explain any variance (R 2 = 0.01) in a(420) CDOM ( Figure 4C), but after atmospheric correction the relationship was markedly improved (R 2 = 0.28; Figure 4D). The B3/B5 ratio showed patterns similar to those of B3/B4, but with slightly lower R 2 ( Figure 4E,F). Model coefficients estimated from Sentinel-2A before (L1C) and after atmospheric correction (L2A) ( Table 5) showed that B2/B3 had relatively higher explained variance and lower error and bias, in comparison with B2/B3 and B3/B5, before atmospheric correction. On the other hand, B3/B4 after Model coefficients estimated from Sentinel-2A before (L1C) and after atmospheric correction (L2A) ( Table 5) showed that B2/B3 had relatively higher explained variance and lower error and bias, in comparison with B2/B3 and B3/B5, before atmospheric correction. On the other hand, B3/B4 after the atmospheric correction showed the highest explained variance (R 2 = 0.28). Surprisingly, B3/B4 after atmospheric correction had slightly higher error and bias than B3/B4 and B3/B5 (Table 5). Table 5. Retrieval models for a(420) CDOM estimated from Sentinel-2A before ( b , no shading) and after ( a , grey shading) atmospheric correction. RMSE, root-mean-square error; MAE, mean absolute error; the variable X denotes the band ratio in the model equation.

Band Ratio
Model The repeated k-fold cross-validation showed that, after atmospheric correction, the B3/B4 a ratio had the highest explained variance (R 2 = 0.49; Table 6). The RMSE showed substantially lower error and bias in B3/B4 a in comparison with that found for other band ratios. This bolsters our confidence in using B3/B4 as a robust approach to predicting a(420) CDOM in lake water ( Figure 5).  The repeated k-fold cross-validation showed that, after atmospheric correction, the B3/B4 a ratio had the highest explained variance (R 2 = 0.49; Table 6). The RMSE showed substantially lower error and bias in B3/B4 a in comparison with that found for other band ratios. This bolsters our confidence in using B3/B4 as a robust approach to predicting a(420)CDOM in lake water ( Figure 5).   The final model explained a relatively high proportion of the variance in CDOM (R 2 = 0.65) after the removal of outliers that were strongly affected by atmospheric conditions like haze and thin cirrus clouds ( Figure 5). These outliers were not detected during the initial screening process in which we removed lakes that were completely covered by clouds because they had some cirrus clouds and haze that were harder to detect with the naked eye. Thus, they were removed in a secondary, and more detailed, inspection. As this study aims to test the efficiency of the MSI sensor for mapping CDOM concentrations at large scales rather than developing a new algorithm, we applied the final model from Table 7 to retrieve a(420) CDOM for the 41 lakes.
where K is 2.809, B3 is centred at 560 nm, B4 is centred at 665nm, and m is -2.341. Table 7. The best performing a(420) CDOM retrieval model estimated from the validation subset after atmospheric correction ( a ) and removal of outliers (n = 5). The variable X denotes the band ratio in the model equation.

Comparing Observed and Modeled a(420) CDOM and Mapping Its Variability
Modeled and in situ, a(420) CDOM exhibited similar patterns and ranges across the study regions ( Figure 6). Västerbotten E and Värmland had the highest a(420) CDOM concentrations and the highest variability. On the other hand, Jämtland and Norrbotten had the lowest variability in remotely sensed a(420) CDOM and clearer water compared with that of the other regions.
Remote Sens. 2020, 11, x FOR PEER REVIEW 11 of 17 removed lakes that were completely covered by clouds because they had some cirrus clouds and haze that were harder to detect with the naked eye. Thus, they were removed in a secondary, and more detailed, inspection. As this study aims to test the efficiency of the MSI sensor for mapping CDOM concentrations at large scales rather than developing a new algorithm, we applied the final model from Table 7 to retrieve a(420)CDOM for the 41 lakes.
where K is 2.809, B3 is centred at 560 nm, B4 is centred at 665nm, and m is -2.341.

Comparing Observed and Modeled a(420)CDOM and Mapping Its Variability
Modeled and in situ, a(420)CDOM exhibited similar patterns and ranges across the study regions ( Figure 6). Västerbotten E and Värmland had the highest a(420)CDOM concentrations and the highest variability. On the other hand, Jämtland and Norrbotten had the lowest variability in remotely sensed a(420)CDOM and clearer water compared with that of the other regions. The concentration of a(420)CDOM varied considerably within some of the lakes, as shown in Figure  7, where the spatial variation in a(420)CDOM of ten lakes in Västerbotten W is presented. For example, modeled a(420)CDOM was notably higher at the edges of several lakes compared with values derived from the middle pixels (Figure 7). On the other hand, for example, lakes 9-10 showed considerably less a(420)CDOM variability (Figure 7). In spite of this edge effect, the a(420)CDOM retrieval also worked well in very small lakes. For example, one of the smallest lakes in the dataset (Lake 1, Skrapmiejaure), which is 0.07 km 2 in area, was still large enough for a(420)CDOM to be acquired (0-0.3 m −1 ) from clean mid-lake pixels, owing to the high spatial resolution of the Sentinel-2A data. The concentration of a(420) CDOM varied considerably within some of the lakes, as shown in Figure 7, where the spatial variation in a(420) CDOM of ten lakes in Västerbotten W is presented. For example, modeled a(420) CDOM was notably higher at the edges of several lakes compared with values derived from the middle pixels (Figure 7). On the other hand, for example, lakes 9-10 showed considerably less a(420) CDOM variability (Figure 7). In spite of this edge effect, the a(420) CDOM retrieval also worked well in very small lakes. For example, one of the smallest lakes in the dataset (Lake 1, Skrapmiejaure), which is 0.07 km 2 in area, was still large enough for a(420) CDOM to be acquired (0-0.3 m −1 ) from clean mid-lake pixels, owing to the high spatial resolution of the Sentinel-2A data.  Table 4 and Figure 5), which explains the low maximum in this figure.  Table 4 and Figure 5), which explains the low maximum in this figure.

Band Ratio Algorithms
Several band ratio combinations were tested in this study and the B3/B4 ratio was found to perform best. In lakes with low CDOM concentrations, phytoplankton and other particles cause the green range (~560 nm, band 3) of the electromagnetic spectrum to reflect highly, while the red portion (665 nm, band 4) is mostly absorbed [17,34]. As CDOM increases, relatively more of the green light, as compared with the red light, will be absorbed, and thus relatively more red light, compared with green, remains available to be reflected back to the atmosphere (Figure 2). CDOM level may vary in the water columns, depending on the spectral interference from high concentrations of particulate matter ( Figure 2). Zhu, W et al. [34] tested many empirical algorithms with lake data and suggested that wavelengths greater than 600 nm are critical for correct estimation of CDOM in complex freshwater ecosystems. In this study, we emphasize that the combination of the bands B3 and B4 in a ratio can capture the contrasting responses of these bands, and give a reasonable measure of a(420) CDOM content in the water.
The applicability of the Sentinel-2A data for estimating lake a(420) CDOM was highlighted by Toming et al. [29], who applied it to a relatively small area in Estonia. The primary limitation of their analysis is that the derived empirical relationship might only be valid for the small subset of lakes that were included in the study. We expanded the scope to include a much larger region and nearly three times the number of lakes and found a higher explained variance (R 2 = 0.65) for the relationship between a(420) CDOM and the B3/B4 a ratio than Toming et al. [29] (R 2 = 0.52). It is important to note that lake a(420) CDOM concentrations are influenced by several factors such as lake conditions, surrounding land cover and land use, topography, and atmospheric interference. In this regard, the strength of our approach is that the model developed involves a larger sample of lakes across a relatively larger spatial area with heterogeneous land cover and topography.

Atmospheric Effects
Generally, atmospheric correction must be applied to avoid unwanted illumination effects that may be larger than the variability in water parameters. However, Toming et al. [29] found that, for Sentinel-2A, the a(420) CDOM retrieval on small scale (single image) results were better when atmospherically uncorrected imagery was used. On a large scale, involving multiple images, we found that a(420) CDOM retrieval was more successful when atmospheric correction was applied to the imagery. We evaluated the added-value of using Sen2Cor by comparing the in situ and modelled a(420) CDOM before and after atmospheric correction using the B3/B4 a band ratios. On the basis of this analysis, we found that atmospheric correction worked well in general and the algorithm performance was evaluated using repeated k-fold cross-validation to obtain more reasonable estimates of model coefficients. As shown in Table 5, any differences in the spectra can lead to large differences in the absolute accuracy of band ratios retrieval of a(420) CDOM . This can be explained in terms of biases and uncertainties that are dependent on spectral measurements of each band ratio.
Although we were not able to collect reflectance data in the field to compare it with the reflectance of acquired images, we were able to rate the performance of Sen2Cor by comparing the ToA and BoA reflectance spectra. We found that the reflectance spectra performed better in most regions after atmospheric correction. That being said, data from some sampling stations do not fit the general relationship well ( Figure 4D). We investigated the specific images collocated with these sampling stations and found that there are residual signals in the blue and near-infrared wavelengths where water reflectance is supposed to be zero. This was because of persistent atmospheric effects, which indicate that either sun glint, light haze, or cirrus clouds were present, or potentially a combination of all of these conditions. Therefore, five lakes were removed from the dataset, and data from 41 lakes were used in the subsequent analyses ( Figure 5).

Spatial Variability of a(420) CDOM
In complex freshwater systems, a(420) CDOM often displays a broad range of values, likely influenced by catchment characteristics (e.g., vegetation type or density, or land cover variety). Such heterogeneous and often small lakes present a challenge to current remote sensing methods used to estimate biological and chemical properties of water [34,35]. Previous studies found that surrounding sources of carbon (e.g., vegetation, soils) enhance the a(420) CDOM values in lake water [36][37][38][39][40]. In our study area, specifically in Västerbotten E, wetlands are significantly represented in the total catchment area. Moreover, Västerbotten E and Värmland are dominated by a mix of broadleaf and coniferous forests, which could further explain the high values of a(420) CDOM in those regions (<5.5 to >10.5 m −1 ; Figure 6, Table 4) [41,42]. On the other hand, Jämtland has the lowest percentage of forest and is mostly covered by herbaceous cover (Table 1). This could be the reason for the low a(420) CDOM levels found in this region (<2 m −1 , Figure 6 and Table 4) that cause more transparent water compared with other regions. Västerbotten W has a large proportion of mixed types of forest and herbaceous cover, which probably explains why a(420) CDOM at the edges of lakes in this region has high values in comparison with those found at lake centres. Depending on the size and depth of a lake and the heterogeneity of the topography, land cover, and land-use surrounding it, there could be a high variability of a(420) CDOM from the periphery to the lake centre. Therefore, it cannot be ruled out that lake bottom reflectance biases remote sensing measurements near the lake shore. Norrbotten has a mixture of different types of broadleaf forests and coniferous, herbaceous cover, and wetlands. The lakes in Norrbotten show a significant variation in a(420) CDOM (0.6-4.6 m −1 ; Figure 6, Table 4). Overall, the lakes in areas surrounded by mixed forests had the highest levels of a(420) CDOM , as compared with lakes surrounded by herbaceous vegetation or heathlands (Table 1). The contribution of different land cover types to the spatial patterns of a(420) CDOM in these lakes will be comprehensively assessed in a follow-up study (Supplementary Data; Figures S1 and S2).

Temporal Variability of a(420) CDOM
The in situ a(420) CDOM measurements were obtained between July and August of 2016, and the Sentinel-2A images were collected in June, July, August, September, and October of the same year. Matching water sampling and satellite image acquisition in time was a challenge. The time difference was up to two weeks in all areas, except in boreal lakes in Västerbotten E; these had a difference in time of up to ten weeks.
Thus, a potential problem with our analysis is that there was a mismatch of up to approximately 10 weeks between the field sampling date and the date of the image that was available for analysis. However, Cardille et al. [32] suggested that a such a time difference will not significantly affect the relationship between measured and remotely sensed CDOM. In our case, the timing appeared to be sufficient for this study, and justified by the fact that our model was able to explain 65% of the variability in a(420) CDOM , which is relatively high compared with other studies [29,30,43] that used Sentinel-2 data. The a(420) CDOM in boreal headwater lakes varies only marginally during the summer and we can expect relatively small temporal a(420) CDOM variations [44]. Weather conditions were stable during the study period, for example, maximum daily temperatures from June to September ranged between 18 • C and 25 • C, and precipitation was relatively low, between 25 and 45 mm. There were no heavy rain events between image acquisition and in situ data measurements that could bring particles or dissolved materials into the lakes, nor were algal blooms reported during the study period.

Summary and Conclusions
We presented an application of mapping lake coloured dissolved organic matter (CDOM) by modelling a(420) CDOM that may cause changes to water colour. In situ field measurements were taken from 46 lakes across northern Sweden during the summer of 2016 and eight images from the Sentinel-2A Multispectral Imager were acquired from the same season. Band ratios based on Sentinel-2A data were used to model lake a(420) CDOM . The best performing algorithm was based on the B3/B4 ratio, and it showed robust and consistent results across a broad range of lakes of different sizes distributed over varying landscapes. Sixty-five percent (R 2 = 0.65) of the variability of in situ CDOM was explained by the model. We found that atmospherically-corrected Sentinel-2 imagery produced better results than uncorrected imagery. However, we observed that the residual signal in atmospherically-corrected imagery may have been the result of sun glint, haze, or thin clouds that persisted despite the use of atmospheric correction. This study demonstrates the effective use of Sentinel-2 for this application over a wide range of inland waters that are situated in complex and inaccessible regions that are not well-monitored.