1. Introduction
Inland waterbodies provide essential services for various human uses, particularly water supply and recreation, as well as habitat and ecosystem-regulating services, including nutrient and carbon cycling, or effects on the local climate [1,2,3,4,5]. However, inland waterbodies are increasingly threatened by anthropogenic exploitation and multiple environmental pressures such as organic and inorganic pollution, eutrophication, climate change effects, and toxic cyanobacteria blooms [2,6]. Therefore, the monitoring of inland waterbodies, with a special focus on water quality and their ecosystem status, is of global concern and a major prerequisite to better understand the effects of environmental changes on inland waters and to identify drivers of future change [7].
Despite the increasing need for more frequent and comprehensive monitoring arising, for example, from legislation such as the European Union’s Water Framework Directive [8], only a small fraction of inland waterbodies are part of in situ monitoring networks [7,9]. Although the Water Framework Directive was a major breakthrough in the monitoring of surface water, its implementation is limited by logistic and economic factors [10,11]. For example, sampling is not carried out every year and usually takes place only monthly during the growing season. The data may allow an assessment of the status of the ecosystem but do not provide deeper insights into the underlying dynamics and stressor–response relationships. It is therefore difficult to identify appropriate sustainable management strategies to improve ecological and chemical status [12,13].
In support of classical monitoring of inland water quality, remote sensing can be a valuable tool to provide water quality variables at a relatively low cost, at spatial scales from local to global, and at an improved temporal resolution through relatively frequent revisits [10,14]. In this way, remote sensing can assist in identifying long-term trends and effects of climate change, point- and non-point source contaminants [11], or the emergence of extreme events such as algal blooms [15,16]. For the latter, near real-time and frequent information can be provided on algal dynamics, enabling early detection of phytoplankton blooms, and early warnings or tailored management reactions [11,16]. This could be relevant and helpful for urban waterbodies or bathing sites where algal blooms may pose health risks, and may strengthen the implementation of algal bloom monitoring under regulations such as the European Union’s Bathing Water Directive [17] and its locally derived decrees. Satellite-based remote sensing can also provide data on the ecological classification and assessment of trophic status, e.g., in the form of products such as eutrophication indices [18,19,20,21,22].
However, the application of satellite-based remote sensing is limited in the number of measurable variables, the frequency of satellite overpasses, and the spatial resolution of the sensors on board, and mostly reflects the conditions just at the water’s surface (i.e., the visible water column, depending on the pixel size and penetration depth of light) [7,10,23,24]. Technically, the optical complexity of inland waters, referring to the intricate and fluctuating interactions of light with the varied and dynamic composition of optically active constituents in the water, still presents a challenge [7,16,24]. In addition, atmospheric correction, adjacency effects, bottom reflectance effects, and the sensor’s design are challenging factors [7,9,16,24,25]. Besides these, weather conditions such as clouds, rainfall, ice coverage, or waves during storm events may interfere with satellite-based remote sensing for certain applications [11,14,26].
On the basis of the information presented, it is evident that remote sensing could enhance water quality information through increased spatial and temporal coverage, cost-effectiveness, and relatively quick availability of data [10,14,26], thereby complementing in situ data. However, the synergies between in situ monitoring and remote sensing have not been fully realized due to challenges such as limited temporal capacities and a lack of support from the organizational management within water administrations, concerns about products’ accuracy and data continuity, and the absence of legal frameworks explicitly incorporating or permitting remote sensing-derived observations [27,28].
The integration of processed remote sensing products by water management authorities requires a rigorous comparison of satellite-based and in situ observations, and an analysis and interpretation of the quality, accuracy, and uncertainties (validation [29]). Quantifying the reliability of remote sensing products and accounting for all components of uncertainty, however, is a challenging task because of the complex processing chain that takes every satellite-derived variable from sensor-level signals to mass concentrations [14,29]. Additional uncertainties are introduced through the in situ dataset as well as the spatiotemporal sampling mismatch between the satellite data and the in situ data [29,30]. Even though in situ measurements are often referred to as “ground truth” measurements, they come with measurement uncertainties of their own, caused in part by the different methods of sampling and analysis applied [31].
While the aim of validation is quite clear, the implementation often involves various steps that are subject to assumptions and potentially require the user’s decisions, which affect the validation results [29]. One aspect of this is the choice of the optimal spatial and temporal scales for in situ and remote sensing data. This requires a decision on which mismatch in scales can be accepted [23,25,29,32]. As a general rule, match-ups between in situ and remote sensing data should be as close to each other as possible in time and space (horizontal, vertical, temporal) so they represent the same or at least comparable conditions [29]. However, to allow for robust statistical analysis, a sufficient number of match-ups is also needed. Therefore, a choice has to strike a balance between minimizing the spatiotemporal mismatch and producing a large number of match-ups in order to have a representative sample size [25,29]. Moreover, in situ monitoring is rarely aligned with satellite overpasses, and a match of both monitoring activities in space and time down to minutes is the exception. Even when both take place on the same day and thus appear to be temporally aligned, this can be misleading if the observed water parcels differ substantially or if one observation is made in the morning and the other in the evening [33].
In order to address these complications in the validation of remote sensing products for water quality, we designed this study. We collected in situ observations of 112 inland waterbodies from different monitoring agencies in Germany from 2016 to 2020. The waterbodies varied in their morphometry and trophic state, resulting in 37,930 observations for all three variables prior to preprocessing. We generated the corresponding satellite-based products from the Copernicus satellites Sentinel-2 MSI and Sentinel-3 OLCI, and used two different processing schemes (CyanoAlert® from Brockmann Consult GmbH in Hamburg, Germany and eoapp® AQUA from EOMAP GmbH & Co. KG in Seefeld, Germany). Subsequently, we compared the remote sensing data for different levels of spatial aggregation (pixel windows or “macropixels” of 1 × 1, 3 × 3, 5 × 5, and 15 × 15 as well as at the scale of the whole waterbody) (see Figure 1) and for temporal windows from the same day to up to ±5 days for different variables of water quality. We intentionally focused on processing workflows that are commercially available because applications in governmental or societal contexts are usually carried out in a consultancy setting. We quantified the deviations between satellite and in situ observations by using three different error measures. Thus, the research questions of this study were to find (i) what spatial aggregations of remote sensing data are appropriate for validation, (ii) which time window is adequate when comparing remote sensing and in situ data, and (iii) whether there are systematic differences in these characteristics depending on the variable of water quality, the satellite, or the processor.
2. Materials and Methods
2.1. In Situ Data Observations of More than 100 Lakes and Reservoirs
Laboratory in situ measurements of three target variables of water quality, namely Secchi depth, chlorophyll-a, and turbidity, for 112 waterbodies (218 measurement stations) were collected from various water authorities of 13 federal states and research institutes in Germany (see Figure 1). The measurements cover 5 years from 2016 to 2020. Since the monitoring of water quality and the effective implementation of water protection policies in Germany are the responsibility of the federal states, the number of in situ samples and the frequency of sampling differ among waterbodies, federal states, and target variables. All available data were collected for the purpose of assessing water quality to meet the objectives of existing policies at the EU or federal level, and ultimately to ensure the good ecological status of the waterbodies. It should be noted that the in situ data were collected across federal states according to different methods and protocols. In the following, we describe the steps we took to homogenize the data and mitigate the differences in the protocols.
Several steps were taken to optimally prepare the in situ data for comparison with satellite data. The maximum depth of the in situ data was restricted, measurement sites located in shallow water zones as well as probe data were excluded, and extreme values were removed from the dataset. These steps are described in detail in the following section.
Remote sensing data capture conditions only within the visible water column. Therefore, to improve vertical comparability between in situ and remote sensing data, discrete in situ measurements with a sampling depth of more than 2 m from the water’s surface were removed from the dataset for chlorophyll-a and turbidity. In the case of integral measurements (taken between 0 m and 25 m max.), all values were retained and arithmetically averaged per time point. Only 10% of the integral measurements had a maximum sampling depth of more than 10 m, so that the potential contribution from deep chlorophyll or turbidity maxima, which can hardly be detected by satellites, remained small.
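The depth rules above can be sketched as follows. This is a minimal illustration only, assuming a simple list-of-dicts layout; the function name `prepare_samples` and the record keys (`site`, `date`, `kind`, `depth_m`, `value`) are hypothetical and not taken from the study.

```python
from collections import defaultdict
from statistics import mean

MAX_DISCRETE_DEPTH_M = 2.0  # discrete samples deeper than this are dropped


def prepare_samples(records):
    """Apply the depth rules to a list of in situ records.

    Each record is a dict with the hypothetical keys 'site', 'date',
    'kind' ('discrete' or 'integral'), 'depth_m', and 'value'.
    """
    kept = []
    integral = defaultdict(list)
    for rec in records:
        if rec["kind"] == "discrete":
            # Keep only near-surface discrete samples (depth <= 2 m).
            if rec["depth_m"] <= MAX_DISCRETE_DEPTH_M:
                kept.append({"site": rec["site"], "date": rec["date"],
                             "kind": "discrete", "value": rec["value"]})
        else:
            # Integral samples are all retained and grouped per site and date.
            integral[(rec["site"], rec["date"])].append(rec["value"])
    for (site, date), values in integral.items():
        # Arithmetic mean per time point.
        kept.append({"site": site, "date": date,
                     "kind": "integral", "value": mean(values)})
    return kept
```

In the study, the 2 m restriction was applied to chlorophyll-a and turbidity; a per-variable switch would be needed to reproduce that distinction.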
In addition, optically shallow waters, where the penetration depth of light exceeds the physical depth of the water column, are prone to detection errors arising from the contributions of bottom reflectance. Therefore, in this study, measurement points located in shallow water zones were removed from the dataset by visually inspecting high-resolution satellite images together with the measurement points.
Furthermore, probe data with nearly daily measurements were not integrated into the analysis to balance the number of matches across the waterbodies. Finally, extreme values in both the in situ and satellite-based water quality data were removed if they were outside the data ranges as given in Table A3. The resulting in situ dataset consisted of 7286 data points in total and encompassed both discrete (n = 5437) and integral (n = 1849) measurements with varying numbers of observations per site and parameter (Table 1 and Table S1).
2.2. Satellite-Based Detection of Water Quality with Sentinel-2 MSI and -3 OLCI
The remote sensing data originated from optical sensors on board the Copernicus satellites Sentinel-2 and Sentinel-3 over the timespan from 2016 to 2020. Sentinel-2 satellites (A and B), launched in 2015 and 2017, with multispectral instruments (MSI) on board, offer high-resolution imagery at spatial resolutions of 10 m, 20 m, and 60 m, depending on the spectral band [34,35]. The MSI measure 13 spectral bands and provide data every 2–5 days depending on the latitude [34,35]. Likewise, the Sentinel-3 mission consists of two satellites (A and B), that were launched in 2016 and 2017, respectively [36]. Sentinel-3 satellites carry ocean and land color instruments (OLCI) on board [36,37]. The OLCI measure 21 spectral bands at a spatial resolution of 300 m and provide temporal coverage every 2 days at the equator and up to twice a day in midlatitudes [36,37].
For the purpose of quantifying the selected variables of water quality (chlorophyll, turbidity, Secchi depth), the radiance leaving the water, i.e., the spectrally resolved light, needed to be determined on the basis of the signal detected by the sensor [14,38]. For this, the influence of the atmosphere and the reflection of light at the air–water interface had to be determined and subtracted from the top-of-atmosphere radiance to derive estimates of the radiance leaving the water [38,39,40,41]. The spectrally resolved remotely sensed signal was then used to retrieve the variables of water quality [40,42].
In this study, two commercially available but scientifically documented operational processing schemes, CyanoAlert® (Brockmann Consult GmbH) and eoapp® AQUA (EOMAP GmbH & Co. KG), were applied. These are both based on analytical expressions, incorporating the radiative transfer equation to retrieve the concentrations of optically active constituents of water from the radiance leaving the water, which was derived from top-of-atmosphere measurements of various optical satellite-based data sources. Both processing schemes are globally applicable for almost any kind of water and without a priori knowledge of the particular waterbody. The core element of the processing chain applied within CyanoAlert® is C2RCC (Case 2 Regional CoastColour), which is composed of a set of neural nets (NNs) to derive the atmospheric and in-water properties. The NNs are trained by a set of approximately 10 million simulated reflectance spectra representing a wide range of in-water and atmospheric conditions [43]. Within the processing chain, C2RCC is complemented by cloud detection (Idepix), MPH (maximum peak height) algorithms for the detection of chlorophyll-a (only OLCI data), and the Nechad algorithm for determining the turbidity. Further information on this processing scheme was presented by Brockmann et al. [43] (C2RCC), Matthews et al. [18,44] (MPH), Wevers et al. [45,46] (Idepix), and Nechad et al. [47]. The various processing elements of this processing scheme are referred to as CyanoAlert. In comparison, the processor underlying eoapp® AQUA is called MIP (modular inversion and processing system). MIP is purely physics-based and consists of a sensor-independent suite of algorithms and databases to derive atmospheric and in-water properties [28,48]. The model encompasses all the relevant processing steps and necessary corrections such as detection of the surface type (land, water, or cloud), correction of adjacency and sun glint, and atmospheric correction. Retrieval of the variables of water quality was performed by modeling the influence of their respective optically active components on the measured radiance [28,48]. Details of this processing scheme were documented by Heege et al. [48] (MIP). The processor is referred to as EOMAP-MIP. Note that in accordance with company recommendations, the Sentinel-2 MSI data processed by CyanoAlert have a resolution of 60 m, whereas the Sentinel-2 MSI data processed by EOMAP-MIP have a resolution of 10 m.
Due to the lower spatial resolution of Sentinel-3 OLCI (300 m pixels versus 10 or 60 m pixels in Sentinel-2 MSI), only 38 waterbodies were suitable for OLCI imagery, as determined by the area and shape of the waterbody. It was assumed that a spatial window of 5 × 5 water surface pixels, i.e., about 1.5 × 1.5 km in dimension, should be present inside a waterbody for it to be suitable for evaluation based on OLCI. To select the waterbodies, the point furthest away from the entire shoreline was calculated and buffered with a radius of 750 m, from which a bounding box with an edge length of 1500 m was derived. The bounding box, corresponding to a macropixel of 5 × 5 grid cells, was intersected with the respective shape of the waterbody. All waterbodies whose shoreline shapes intersected with the bounding box were considered to be unsuitable.
Each processing scheme incorporates algorithms to mask out clouds, cloud shadows, and haze, and to handle adjacency effects or sun glint. The data processed by CyanoAlert came with a quality indicator (quality band) to differentiate between valid (quality indicator == 1) and invalid (quality indicator == 0) pixels. The quality indicator is composed of a combination of quality flags generated by the C2RCC processor raised for invalid processing conditions and by the Idepix pixel classification scheme identifying disturbed water pixels. Data processed by EOMAP-MIP contained a quality score ranging from 0 to 100 (low to high quality), and pixels with quality score values smaller than 50 were removed from the data. This score was calculated from the influences of atmospheric and surface effects, the angles of the sun and the sensor view, and detectable concentration limits defined within the processor’s definitions. In addition, pixel outliers within the macropixel were removed when they were outside the range of the mean ± 1.5 times the standard deviation, as suggested by Bailey and Werdell [31]. The respective scene was evaluated only if more than 30% of the pixels within the macropixel or waterbody were valid; otherwise, it was excluded. The remaining valid pixels were spatially aggregated by calculating the median, 25th percentile, and 75th percentile, as well as the coefficient of variation (CV; standard deviation/mean). The CV was calculated to characterize the level of spatial variability within the different spatial aggregations. As an example, Table 2 shows an overview of the amount of data extracted for chlorophyll-a and the associated spatial aggregations. Data on turbidity and Secchi depth were in the same order of magnitude and are given in Appendix A (Table A1 and Table A2).
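The per-scene aggregation described above can be sketched as follows. This is an illustration under stated assumptions rather than the processors' actual implementation: the function name `aggregate_macropixel` is hypothetical, the 30% check is assumed to be applied after quality masking and before outlier removal, the sample standard deviation is assumed, and the percentile interpolation method is a free choice.

```python
from statistics import mean, median, quantiles, stdev

MIN_VALID_FRACTION = 0.30  # "more than 30%" of pixels must be valid
OUTLIER_SD_FACTOR = 1.5    # keep pixels within mean +/- 1.5 standard deviations


def aggregate_macropixel(values, valid):
    """Aggregate one macropixel (or waterbody) extraction.

    values: pixel values; valid: booleans from the quality mask
    (quality indicator == 1, or quality score >= 50).
    Returns None if the scene does not pass the valid-pixel threshold.
    """
    good = [v for v, ok in zip(values, valid) if ok]
    if not values or len(good) / len(values) <= MIN_VALID_FRACTION:
        return None
    if len(good) > 1:
        # Assumption: sample standard deviation for the outlier filter.
        m, sd = mean(good), stdev(good)
        good = [v for v in good if abs(v - m) <= OUTLIER_SD_FACTOR * sd]
    if len(good) > 1:
        p25, _, p75 = quantiles(good, n=4, method="inclusive")
        sd = stdev(good)
    else:
        p25 = p75 = good[0]
        sd = 0.0
    m = mean(good)
    cv = sd / m if m else float("nan")  # CV = standard deviation / mean
    return {"n": len(good), "median": median(good),
            "p25": p25, "p75": p75, "cv": cv}
```

For a 3 × 3 window with one extreme pixel, the filter drops that pixel and the summary statistics are computed from the remaining eight values.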
Finally, extreme values in both the in situ and satellite-based water quality data were removed if they were outside the data ranges as given in Table A3.
2.3. Statistical Analyses and Comparison of In Situ and Satellite-Based Observations
In this study, temporal windows of in situ and remote sensing data from the same day up to ±5 days were considered. In this process, the optimal temporal match was determined for each in situ data point, ensuring that each satellite-based data point appeared only once in the dataset. Note that the time window was expanded for this research, meaning that the ±5-day window included all matches from the same day to ±5 days. The number of matches generated per waterbody varied greatly due to differences in both the available in situ and usable remote sensing data, ranging from 1 to 108 matches per waterbody. In addition, since the literature on remote sensing of water quality reports a wide range of temporal matches to be considered near-coincident, ranging from ±3 h to ±10 days [23], we also formed all possible matches for each in situ data point up to ±10 days. For each time lag, we calculated the residuals along with the associated statistics (mean, median, and interquartile range). Note that the in situ data did not have temporal information specified to the hour.
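One way to realize such a one-to-one temporal matching is sketched below. The text does not specify the matching algorithm, so the greedy smallest-lag-first assignment shown here is an assumption, as are the function name `match_nearest` and the use of day-resolution `datetime.date` objects.

```python
from datetime import date


def match_nearest(in_situ_dates, satellite_dates, max_lag_days=5):
    """Pair in situ and satellite observations within +/- max_lag_days.

    Greedy assumption: candidate pairs are assigned in order of increasing
    absolute lag, and every observation is used at most once.
    Returns a sorted list of (in_situ_index, satellite_index, lag_days).
    """
    candidates = []
    for i, d_in in enumerate(in_situ_dates):
        for j, d_sat in enumerate(satellite_dates):
            lag = (d_sat - d_in).days
            if abs(lag) <= max_lag_days:
                candidates.append((abs(lag), i, j, lag))
    candidates.sort()  # smallest absolute lag first
    used_in_situ, used_satellite, matches = set(), set(), []
    for _, i, j, lag in candidates:
        if i not in used_in_situ and j not in used_satellite:
            used_in_situ.add(i)
            used_satellite.add(j)
            matches.append((i, j, lag))
    return sorted(matches)


in_situ = [date(2018, 6, 1), date(2018, 6, 3)]
satellite = [date(2018, 6, 2), date(2018, 6, 3), date(2018, 6, 20)]
same_day = match_nearest(in_situ, satellite, max_lag_days=0)
five_day = match_nearest(in_situ, satellite, max_lag_days=5)
```

Because the ±5-day window includes same-day pairs, widening the window can only add matches in this scheme; the same-day pair is always assigned first.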
In this study, three error metrics were calculated for evaluation of the performance, namely the mean absolute error of the log-transformed data (MAE) and bias, in line with [49], as well as the root mean square error (RMSE). The bias quantified systematic differences between the two datasets, namely systematic over- or underestimation, and was defined as the difference of the mean values for the in situ and satellite-based values, and hence was not sensitive to random errors [29,49,50]. The RMSE and the MAE are both metrics describing the accuracy or the pairwise agreement between matched in situ and satellite-based observations [49]. The RMSE is frequently used in validation analyses of remote sensing data but can become strongly influenced by larger deviations [51], in contrast to MAE, which is more robust against outliers [49,51]. In addition, our data showed a logarithmic distribution of error (see Figure A1), and we therefore followed the recommendation of Seegers et al. [49] and calculated the MAE and bias using log-transformed data, followed up by back-transformation to linear space to facilitate interpretation. RMSE was calculated using untransformed data, i.e., in the linear space, to facilitate comparability with other studies and for easier interpretation. For details on the performance metrics for the assessment of satellite-based data products, see Seegers et al. [49]. The error metrics were computed using Equations (1)–(3).
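A minimal sketch of how the three metrics can be computed is given below, assuming the log10-based formulations of MAE and bias described by Seegers et al. [49] and a linear-space RMSE. The function names and the sign convention (satellite minus in situ, so that a bias above 1 indicates overestimation by the satellite product) are illustrative assumptions.

```python
import math


def log_mae(satellite, in_situ):
    """Mean absolute error in log10 space, back-transformed (always >= 1)."""
    diffs = [abs(math.log10(s) - math.log10(o))
             for s, o in zip(satellite, in_situ)]
    return 10 ** (sum(diffs) / len(diffs))


def log_bias(satellite, in_situ):
    """Mean signed error in log10 space, back-transformed.

    Values above 1 indicate overestimation by the satellite product,
    values below 1 underestimation (assumed sign convention).
    """
    diffs = [math.log10(s) - math.log10(o)
             for s, o in zip(satellite, in_situ)]
    return 10 ** (sum(diffs) / len(diffs))


def rmse(satellite, in_situ):
    """Root mean square error in linear space (units of the variable)."""
    squared = [(s - o) ** 2 for s, o in zip(satellite, in_situ)]
    return math.sqrt(sum(squared) / len(squared))
```

All inputs must be strictly positive for the log-based metrics. The back-transformed MAE and bias are multiplicative: an MAE of 1.5, for example, corresponds to a typical relative deviation of 50%.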
We defined different variants for spatial and temporal matching (Table 3) in order to identify the performance of satellite-borne data and to derive recommendations for practical matching. The smaller spatial aggregations (1 pixel, 3 × 3 macropixels, and 5 × 5 macropixels) were selected following the recommendations of EUMETSAT [52], which suggested these aggregations depending on the local conditions. A larger spatial aggregation (15 × 15 macropixels) was added for the S2-MSI data to approximate comparability with the S3-OLCI data. Finally, a waterbody-scale variant was added to analyze to what extent the in situ data represented the entirety of the waterbody. With regard to the selection of temporal windows, the usage of same-day matching constituted the minimum possible temporal window. This was extended to time windows of ±1 day or ±5 days as alternatives in temporal matching, following previous studies that have applied temporal intervals of in situ and remote sensing data ranging from ±3 h to ±10 days [23]. For the comparison of spatial aggregations, the temporal windows were kept to same-day matches in order to restrict the variability only to the effects of spatial aggregation. Accordingly, in a comparison of the temporal windows, the spatial aggregation was kept constant at 3 × 3 macropixels. The abovementioned error metrics were used to compare and evaluate these different variants. Moreover, the respective sample sizes, i.e., the number of matches between in situ and satellite-based observations, were included in our evaluation. Note that the applied spatial aggregations differed between S3-OLCI and S2-MSI in order to account for the different spatial resolutions of these sensors.
All analyses and visualizations were performed using R Statistical Software (v4.3.0; R Core Team 2021). Geospatial analysis was performed using ArcGIS Desktop (Esri 2019).
4. Discussion
Restating the established objectives, the study aimed to identify (i) appropriate spatial aggregations of remote sensing data for a good representation of observations obtained through in situ measurements, (ii) adequate temporal windows when comparing remote sensing and in situ data, and (iii) systematic differences in these characteristics depending on the variable of water quality, satellite, or processor.
4.1. Spatial Aggregations
Regarding Objective (i), the study showed no clear pattern regarding the spatial aggregation. This finding was invariant to the processing scheme used and was confirmed when all results for both processors (EOMAP-MIP, CyanoAlert) were merged into one analysis (only possible for the 3 × 3 macropixel and the waterbody-scale variants) (Figure A2 and Figure A3). In detail, the pairwise agreement (MAE, RMSE) was surprisingly similar across the different spatial aggregations (different sizes of macropixels versus the waterbody scale) and showed no clear patterns, despite large differences in their spatial extent and variability, whereby the waterbody-scale variant performed similarly well overall compared with the macropixel variants (see Figure 4). As an exception, the performance of the waterbody-scale evaluation was significantly better for OLCI data processed with EOMAP-MIP when evaluated with the RMSE. Similarly, systematic differences (bias) between the in situ and remote sensing data showed only minor differences among the different spatial aggregations. This is astonishing, because the CV in the satellite data clearly showed that the variability in the target variables (e.g., chlorophyll-a) increased with an increase in the spatial scale and was at maximum at the waterbody scale. This spatial variability at the scale of the whole waterbody actually suggests that macropixels should be superior in terms of validation, which could not be confirmed statistically. Note that in this context, the comparison of spatial aggregations only used same-day matches (Figure 4) and the temporal dynamics of the ecosystem had only a limited impact (as discussed in Section 4.2).
In addition, the results showed only minor differences among the macropixel variants. However, to account for processing errors and to avoid the risk of operating with a faulty pixel, very small spatial windows (e.g., only one pixel) are not recommended [31,53]. This was also partly reflected in this study, with the 1-pixel variant of aggregated OLCI EOMAP-MIP data performing slightly worse than the 3 × 3 macropixel variant for all target variables. For the MSI data, the number of matches differed noticeably between the smaller macropixel variants and the 15 × 15 macropixel variant. On average, the 15 × 15 macropixel variant produced 8% fewer matches than the smaller macropixel variants, probably due to a higher number of invalid pixels when the measurement points were closer to land.
In the context of governmental monitoring programs, where the assessment of the status of the entire waterbody is the goal, aggregating at the waterbody scale can be advantageous for comparing in situ and remote sensing data. Spatial variability is averaged out, so the aggregate accounts for the spatial heterogeneity of the waterbody. Another advantage of waterbody-wide aggregation is the mitigation of effects occurring primarily close to the shore such as adjacency effects, shade from hills, or bottom reflection. This is particularly relevant for waterbodies where the proportion of pixels close to the shore is high compared with the total number of pixels. Furthermore, waterbody-scale extractions usually yield slightly higher numbers of matches (Figure 4), because the exact location of the invalid pixels matters less for a waterbody-scale extraction to remain valid. For statistical purposes, it may thus also be advantageous to aim for waterbody-scale extractions.
In contrast, macropixel-based products can account for local conditions but can also be severely influenced by the neighboring pixels in a locally heterogeneous environment. The latter becomes even more influential when part of the spatial heterogeneity is attributable to random errors. Note that in this respect, in situ samples are always point samples attributed to precise locations with a very small spatial extent. The fact that waterbody-wide averaging does not deteriorate the statistical performance of satellite-based products in most cases (Figure 4) may indicate that spatial dynamics at small scales are, in many cases, weak enough that large-scale averaging yields similar patterns.
It needs to be noted that we conducted a broad-scale validation study using administrative data. Therefore, the validation here aimed to examine whether in situ and remote sensing data can be used comparably well for official monitoring. In a different context, for example, when it comes to optimizing algorithms, the existing protocols should be followed. These typically rely on macropixel variants and well-positioned sampling sites not too close to shore, where the variables of water quality in question vary little across the whole macropixel. In addition, a very tight match of satellite overpasses and in situ sampling is crucial (at temporal scales far below daily; see the discussion below) for these purposes; otherwise, the sampled water parcels do not exhibit the same characteristics of water quality due to possible temporal fluctuations in the concentrations at small time scales.
4.2. Temporal Windows
Regarding Objective (ii), different time windows performed similarly well with only minor differences and without clear patterns when evaluated with the MAE and bias (see Figure 6). However, RMSE increased with larger time windows for various target variables (except Secchi depth) and sensor–processor combinations. The differences between MAE (log scale) and RMSE (linear) indicated an increase in the skewness of the data distribution and the presence of outliers with increasing time windows (see Section 4.5). Overall, this suggests that expanding the time window to up to 5 days can be useful in validation studies for all target variables because of a noticeable increase in the number of matches, as long as special attention is given to the appropriate treatment of outliers.
In addition, an expansion of the time lags for up to 10 days showed that the distribution of residuals between in situ and remote sensing data and the associated summary statistics (mean, median, and interquartile range) were relatively stable even for time lags of more than 5 days (Figures S1–S3).
4.3. Interplay between Temporal and Spatial Scales
A waterbody surface can be seen as a heterogeneous system in motion, so that the spatial patterns depend on the time window. Even two hours may lead to different spatial patterns at the surface due to the high advective transport in waterbodies, particularly under windy conditions [33] and during ice-melt or phytoplankton blooms. Such comparably rapid events impair the agreement between in situ and remote sensing data more strongly for macropixel variants than for evaluations based on the whole waterbody [9]. It would therefore seem advisable, particularly for the macropixel approaches, to match only in situ and satellite data that are within 2 h of each other [7]. In the present study, this was not possible because the time information in the in situ data was not specified to the hour. The smallest time window available for a comparison of different spatial aggregations was the same day, which translated to a temporal mismatch between in situ and remote sensing data of up to 8 h. This is one explanation for some errors being relatively high, which complicated the evaluation of the spatial patterns.
4.4. Systematic Differences among Variables of Water Quality, Satellites, or Processors
The agreement between in situ and remote sensing data depends on multiple factors, e.g., the environmental and in-water conditions, the placement of the sensors, and the sensitivity of the algorithms, among others. For this study, two operational processing schemes were applied that use different approaches to remove atmospheric influences, sun and sky glint, and other interfering factors from the radiance leaving the water in order to derive the concentrations of the variables of water quality. These processing schemes have to work over a wide range of geographical regions, may be affected by unpredictable uncertainties, and are continuously being developed further. The abovementioned aspects led to the differences between sensor–processor combinations and the occurrence of occasionally strong outliers irrespective of the spatial aggregation or time window applied. However, integrating all the alternative satellite products (S2-MSI or S3-OLCI, EOMAP-MIP or CyanoAlert) into one waterbody-specific average yielded a robust value when compared with the in situ-based values (unpublished data).
4.5. Error Metrics
MAE, bias, and RMSE are often applied in validation studies in parallel, although they are only partially complementary [
29]. MAE and RMSE differ in their sensitivity to outliers and the distribution of the data [
49]. In contrast to MAE, RMSE penalizes large absolute deviations and unevenly distributed errors because it is calculated in linear space and squares the errors. Both metrics nevertheless address the pairwise agreement (accuracy) between remote sensing and in situ data. Since the residuals in regressions of satellite versus in situ observations increase with the mean (an argument for using a log scale), the RMSE for the concentration of chlorophyll-a should be much higher in eutrophic than in oligotrophic waterbodies. The opposite holds true for the log-based MAE, in which small absolute deviations have a large impact when the chlorophyll value is low. For example, the log-based MAE yields the same magnitude of error for a pair of detected and measured chlorophyll values of 0.1 and 1 µg/L as for a pair of 10 and 100 µg/L because both pairs have the same ratio. The RMSE, however, weighs the pair at 10 and 100 µg/L as far more erroneous because it is based on squared deviations at the linear scale. From the limnological point of view, a difference between 0.1 and 1 µg/L is irrelevant and hardly measurable in the laboratory, whereas a difference between 10 and 100 µg/L strongly changes the evaluation of the status. In summary, RMSE stresses errors at larger values (e.g., high chlorophyll), while the log-based MAE emphasizes deviations at low true values (e.g., low chlorophyll).
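The numerical example above can be reproduced with a minimal sketch. It assumes that the log-based MAE is the mean absolute difference of log10-transformed values; the exact formulation used in this study (for example, a back-transformed multiplicative variant) may differ. The function names and the two value pairs are illustrative only.

```python
import math

def log_mae(estimated, observed):
    """Mean absolute error of log10-transformed values (unit: decades)."""
    diffs = [abs(math.log10(e) - math.log10(o))
             for e, o in zip(estimated, observed)]
    return sum(diffs) / len(diffs)

def rmse(estimated, observed):
    """Root mean squared error in linear space (unit: same as the input)."""
    squared = [(e - o) ** 2 for e, o in zip(estimated, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Two single-pair cases with the same ratio (factor 10) between the
# satellite estimate and the in situ value, in µg/L chlorophyll-a.
low_estimated, low_observed = [1.0], [0.1]
high_estimated, high_observed = [100.0], [10.0]

log_mae_low = log_mae(low_estimated, low_observed)
log_mae_high = log_mae(high_estimated, high_observed)
rmse_low = rmse(low_estimated, low_observed)
rmse_high = rmse(high_estimated, high_observed)

print(f"log-MAE low: {log_mae_low:.2f}, high: {log_mae_high:.2f}")
print(f"RMSE    low: {rmse_low:.2f}, high: {rmse_high:.2f}")
```

Both pairs give a log-based MAE of one decade, whereas the linear RMSE is about 0.9 µg/L for the low pair and about 90 µg/L for the high pair, i.e., larger by a factor of 100.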
Therefore, the small and random differences between the spatial aggregations across all error measures, with MAE and RMSE differing only slightly from each other, indicated that neither the presence of outliers nor the distribution of the errors differed significantly among the spatial aggregations. However, we noted that RMSE increased with larger time windows for various target variables and sensor–processor combinations. This suggests that influential outliers and the skewness of the error distribution increase with the time window, which should be addressed when larger time windows are applied. This is also partially reflected in
Figure S1, with the median and interquartile range being relatively stable up to a time lag of ±5 days, whereas the mean increased more.
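The reasoning that a widening gap between RMSE and MAE, together with a stable median, points to outliers can be illustrated with a small sketch. The absolute errors below are invented for illustration and are not taken from this study; the variable names are likewise hypothetical.

```python
import math
import statistics

def mae(abs_errors):
    """Mean of the absolute errors."""
    return sum(abs_errors) / len(abs_errors)

def rmse(abs_errors):
    """Root of the mean squared error."""
    return math.sqrt(sum(e ** 2 for e in abs_errors) / len(abs_errors))

# Invented absolute errors for a narrow time window ...
narrow_window = [0.2, 0.3, 0.25, 0.35, 0.3]
# ... and for a wider window that adds one strongly deviating match-up.
wide_window = narrow_window + [5.0]

for label, errors in (("narrow", narrow_window), ("wide", wide_window)):
    print(
        f"{label}: median={statistics.median(errors):.2f} "
        f"MAE={mae(errors):.2f} RMSE={rmse(errors):.2f}"
    )

# The RMSE-to-MAE ratio rises when a few large errors dominate.
ratio_narrow = rmse(narrow_window) / mae(narrow_window)
ratio_wide = rmse(wide_window) / mae(wide_window)
```

With these invented values the median stays at 0.3, while the single outlier raises the mean error and, even more strongly, the RMSE.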
Lastly, it needs to be considered that both datasets, remote sensing and in situ, contained errors and uncertainties. Therefore, the assumption that the error metrics fully evaluated the dataset is disputable. Given these results and sources of error, we agree with Papathanasopoulou et al. [
54] and IOCCG [
55] that satellite-based monitoring will complement rather than replace in situ sampling to increase the understanding of the spatiotemporal development of waterbodies, as well as to provide information where in situ data were (in the past) or are still sparse or non-existent.
5. Conclusions
In general, our results provide clear evidence that satellite-based products reflect the average condition of a given waterbody and are therefore suited for assessments of the status, as required by national and international legislation for the protection of water. The results showed that focusing spatially on the exact sampling point does not necessarily improve the agreement. In many cases, waterbody-scale values achieved similar or slightly better accuracies and biases despite the large differences in the spatial aggregation and spatial variability. They also provided slightly larger sample sizes. Data on Secchi depth produced the least pronounced differences among all spatial aggregations, independent of the sensor and processor. Overall, the results did not show large differences among the spatial aggregations, and no clear preference for a type of spatial aggregation emerged.
The study has also shown that the time window can be expanded to up to ±5 days under certain conditions. Data products for Secchi depth showed only small and random differences among different temporal windows. In contrast, data products for turbidity and chlorophyll-a showed an increase in outliers with increasing time window for various sensor–processor combinations. Therefore, the increasing occurrence of outliers must be considered when the time window is extended. In summary, the results indicated that expanding the time window to ±1 day or ±5 days can be useful in validation studies because of the marked rise in the number of matches, which was, on average, about 2.5 times (±1 day) or 5 times (±5 days) the number of same-day matches.
Besides these details on procedures of validation for satellite-based monitoring of water quality, our data on more than 100 waterbodies showed that averaging all the available values for a given waterbody can very likely reflect the status of the waterbody. Satellite-based information can therefore supplement the information collected via in situ monitoring, which can only be conducted at limited temporal and spatial scales, and thus can improve assessments of the ecosystem status of a given waterbody.