#### 4.3. Quality Assessment and Discussion

In this paper, the 32 study areas listed in Table 4 are used to validate the proposed approach. Following the detailed discussion of three selected water bodies (Ray Roberts, Poço da Cruz, Tharthar), this section presents a general quality assessment of all 32 globally distributed study areas.

The lakes and reservoirs under study have different characteristics. The long-term water level variations range from 2.26 m for Clear Lake to 43.56 m for Lake Mead. The smallest investigated inland water body is the Jenipapeiro reservoir, whose surface area varies between 0.0 km${}^{2}$ and 9.15 km${}^{2}$; here, a dam was constructed at the turn of the millennium. Lake Tharthar is the largest study area and also shows the strongest area variations, between 1579.39 km${}^{2}$ and 2476.11 km${}^{2}$. In addition to the surface area, Table 4 lists the lake-shore length of each study area and the ratio between the two. For study areas in mountainous regions, which are mainly reservoirs, the ratio between surface area and shore length is lower than for natural lakes. For example, the shape of Salton Sea is close to circular, which leads to a high ratio of 2.37 between surface area and shore length. In contrast, Jenipapeiro is a reservoir located in the mountains whose outline is more frayed because the reservoir filled up a valley, which leads to a long shoreline compared to its surface area. Across all study areas, the ratio varies between 0.11 and 2.37. In theory, the shape of lakes and reservoirs has a strong impact on the quality of the resulting surface area time series: in the land-water classification, indexing errors occur mainly along the shore and not inside the lake, so reservoirs with long shores are potentially subject to higher errors than natural lakes. However, a comparison between the ratio of surface area to shore length (Table 4) and the resulting correlations of the validation (Table 5) shows no direct link.
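To make the geometric argument concrete, the following minimal sketch compares the area-to-shore-length ratio of a circular outline with that of an elongated rectangular outline of equal area. The shapes, the 100 km${}^{2}$ area, and the aspect ratio are illustrative assumptions and are not taken from Table 4.

```python
import math


def circle_ratio(area_km2):
    """Area-to-perimeter ratio (in km) of a circle with the given area."""
    radius = math.sqrt(area_km2 / math.pi)
    return area_km2 / (2.0 * math.pi * radius)  # simplifies to radius / 2


def rectangle_ratio(area_km2, aspect):
    """Area-to-perimeter ratio (in km) of a rectangle with length = aspect * width."""
    width = math.sqrt(area_km2 / aspect)
    length = aspect * width
    return area_km2 / (2.0 * (length + width))


area = 100.0  # km^2, hypothetical value chosen for illustration
compact = circle_ratio(area)                     # lake-like, nearly circular outline
elongated = rectangle_ratio(area, aspect=25.0)   # valley-like, stretched outline

print(f"circular outline:  {compact:.2f} km")
print(f"elongated outline: {elongated:.2f} km")
```

For the same area, the compact outline yields the higher ratio, which is the pattern described above for natural lakes versus valley reservoirs.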

Based on the resulting monthly surface area time series, we validate our results by calculating the squared Spearman correlation coefficient ${R}^{2}$ with respect to water level time series from gauging stations or satellite altimetry, depending on availability. To demonstrate the improvement achieved by our new gap-filling approach, we also compute ${R}^{2}$ between the water level data sets and the initial monthly masks that still contain data gaps. Because the water level time series differ in length, the number of points used differs for each study area. However, the comparison of initial and final areas is always based on the same number of data points.
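The validation metric can be sketched as follows: a minimal, dependency-free computation of the squared Spearman correlation between monthly surface areas and water levels, evaluated on the same months for initial and final areas. The function names and all numbers are hypothetical and only illustrate the procedure; they are not taken from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def squared_spearman(areas, levels):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    rho = pearson(average_ranks(areas), average_ranks(levels))
    return rho * rho


# Hypothetical monthly values for the same six months.
levels = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5]         # water level in m
initial_areas = [50.0, 47.0, 55.0, 40.0, 61.0, 58.0]  # km^2, masks with data gaps
final_areas = [50.0, 52.0, 55.0, 57.0, 61.0, 64.0]    # km^2, after gap filling

r2_initial = squared_spearman(initial_areas, levels)
r2_final = squared_spearman(final_areas, levels)
print(f"R^2 initial: {r2_initial:.3f}, R^2 final: {r2_final:.3f}")
```

Because Spearman's coefficient works on ranks, it only requires a monotonic relationship between area and water level and not a linear one, which suits lakes with unknown hypsometry.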

Figure 5 shows the detailed validation of the results for all 32 study areas with water level time series from in situ data and satellite altimetry. For each study area and each validation data set, it gives the number of points used, the correlation coefficient ${R}^{2}$ with the initial areas, the correlation coefficient ${R}^{2}$ with the final areas, and the performance.

In addition to the geometry of the study areas, their location and climate conditions are expected to have an impact on the results. The study areas are located in different climate zones, which are listed in Table 4. Climate zones are defined mainly by precipitation and temperature, which in turn lead to differences in cloud coverage and rainfall. In the data processing, higher cloud coverage leads to more data gaps that have to be reconstructed. In our study, the annual cloud coverage varies between 22% for Salton Sea and 65% for Poço da Cruz and Forggen Lake. The annual rainfall is not directly connected to cloud coverage, as it also depends on temperature; it varies between 71 mm/y for Salton Sea and 1749 mm/y for Lake Bankim.

Besides lake geometry and climate conditions, the data holdings and their availability can be another limiting factor in processing the surface area time series: the length and quality of the resulting time series depend on how many scenes are available. The availability of Landsat and Sentinel scenes varies globally [39]. For example, not all acquired Landsat-5 scenes have yet been processed by the agencies and made available. The lack of Landsat-5 scenes is particularly visible in Africa, for example for Lake Bankim and Lake Lagdo in Cameroon, where only around 6% of all Landsat-5 images are available.

Table 5 gives an overview of the data used and its availability for the six missions for each study area. The average data availability generally increases the newer the satellite mission is: Landsat-4 (3.18%), Landsat-5 (66.63%), Landsat-7 (88.51%), Landsat-8 (98.86%), Sentinel-2A (98.19%), and Sentinel-2B (99.48%). For the individual study areas, one can clearly see that data availability varies with location, especially for Landsat-5 and Landsat-7. In this study, the time span extends from January 1984 to June 2018, so that in the best case the resulting surface area time series contains 414 monthly data points. In practice, the number of monthly data points is lower, depending on factors such as input data, data gaps, land-water indexing, and threshold computation. For the selected study areas, the resulting number of monthly masks varies between 114 for Lake Bankim and 352 for Salton Sea.
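The upper bound of 414 monthly data points follows directly from the study period. As a quick check, the sketch below counts the calendar months from January 1984 to June 2018 inclusive and relates the smallest and largest reported number of monthly masks to that bound. The helper name `months_inclusive` is ours and purely illustrative.

```python
from datetime import date


def months_inclusive(start, end):
    """Number of calendar months from start to end, counting both end months."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


max_points = months_inclusive(date(1984, 1, 1), date(2018, 6, 1))
print(max_points)  # 414

# Share of the maximum reached by the shortest and longest time series.
for name, n_masks in [("Lake Bankim", 114), ("Salton Sea", 352)]:
    print(f"{name}: {n_masks} of {max_points} months ({n_masks / max_points:.1%})")
```

The reported extremes thus correspond to roughly 28% and 85% of the theoretical maximum.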

For 15 of the 32 study areas, water level time series from gauging stations are used for validation. All in situ measurements are given on a daily basis and are averaged to monthly values, since each monthly surface area is itself derived from several daily scenes. For all of these study areas, the new gap-filling approach improves the correlation coefficient ${R}^{2}$ from the initial to the final area time series. The improvement in ${R}^{2}$ ranges between 0.035 and 0.441 for all comparisons with in situ data. On average, the correlation coefficient ${R}^{2}$ increases from 0.610 for the initial areas to 0.834 for the final areas, which is an improvement of about 37%. Lake Claiborne and Clear Lake have a very low correlation coefficient but still show a good improvement. The reason is that both reservoirs show only small variations in height and area over the three decades, except for a few months. On the other hand, Lake Mead ($\Delta {R}^{2}$: 0.035) and Poço da Cruz ($\Delta {R}^{2}$: 0.060) show only small improvements, because the initial masks are already very good, with correlations larger than 0.9.
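The reduction of daily gauge readings to monthly values can be sketched as a simple grouping by calendar month. The function name, the data layout (pairs of date and water level), and the sample values are assumptions made for illustration; the actual processing chain may handle missing days or outliers differently.

```python
from collections import defaultdict
from datetime import date


def monthly_means(daily_levels):
    """Average (date, level) pairs per calendar month.

    Returns a dict mapping (year, month) to the mean level, in chronological order.
    """
    buckets = defaultdict(list)
    for day, level in daily_levels:
        buckets[(day.year, day.month)].append(level)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


# Hypothetical daily water levels in m.
daily = [
    (date(2000, 1, 1), 100.0),
    (date(2000, 1, 15), 100.4),
    (date(2000, 1, 31), 100.8),
    (date(2000, 2, 10), 101.0),
    (date(2000, 2, 20), 101.2),
]

means = monthly_means(daily)
for (year, month), level in means.items():
    print(f"{year}-{month:02d}: {level:.2f} m")
```

The monthly means can then be paired with the monthly surface areas of the same months before computing ${R}^{2}$.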

We use water level time series from satellite altimetry for validation in remote areas where no in situ data are available, as altimetry can be used globally as long as the lake or reservoir is crossed by a satellite track. Compared to in situ data, the quality of water level time series from satellite altimetry is lower: they provide cm-accuracy for large lakes and dm-accuracy for smaller lakes [31]. The original temporal resolution varies between about 10 and 35 days depending on the altimeter missions used; the measurements are merged to monthly values for validation. In this study, water level time series from satellite altimetry are used for the 28 study areas that are crossed by at least one satellite track. For satellite altimetry as well, the new gap-filling approach improves the correlation coefficient ${R}^{2}$ from the initial to the final area time series for all study areas. The improvement in ${R}^{2}$ ranges between 0.014 and 0.734 for the 28 comparisons. Overall, the average correlation coefficient ${R}^{2}$ increases from 0.611 for the initial areas to 0.880 for the final areas, which is an improvement of about 44%. Although the set of targets differs, the performance is similar to that obtained with in situ data. The behavior of the correlation coefficients ${R}^{2}$ resulting from satellite altimetry is comparable to that of the coefficients obtained with in situ data, even though the lengths of the validation time series differ between the two data sets.

One can conclude that validating surface area time series using either in situ data or satellite altimetry provides reliable results. Regarding the potential of the gap-filling approach, an average improvement of about 37–44% in ${R}^{2}$ can be achieved. In general, the largest improvement is achieved when the initial correlation coefficients ${R}^{2}$ with water levels are between 0.4 and 0.6. In those cases, the final correlations increase to more than 0.8 or even 0.9 for almost all study areas.
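The two percentages are relative improvements of the average ${R}^{2}$, obtained from the average values reported for the initial and final areas:

$$\frac{0.834 - 0.610}{0.610} \approx 0.367 \quad \text{(in situ)}, \qquad \frac{0.880 - 0.611}{0.611} \approx 0.440 \quad \text{(satellite altimetry)}.$$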

In addition to the validation using external data sets, an area error is given for each month. This area error is a composite of the uncertainties from the land-water indexing and the gap-filling step. All pixels within the AOI that are not clearly identified as water or land contribute to the final error. Moreover, all pixels that are filled in the gap-filling step and have a long-term water probability smaller than the computed threshold plus 5% also contribute to the error. This usually affects pixels near the lake shore, as pixels with a higher water probability are located inside the lake or reservoir. The resulting average area error lies between 1.31% and 9.68% of the AOI for all 32 study areas. Additionally, comparisons showed no direct dependency between the final correlation coefficients ${R}^{2}$ and the resulting surface areas.
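One possible reading of this error definition is sketched below. It is an illustration under explicit assumptions, not the implementation used in this study: pixels are given as flat lists, the class codes are hypothetical, `gap_filled` marks pixels whose value was assigned in the gap-filling step, and "threshold plus 5%" is interpreted as five percentage points of water probability above the threshold.

```python
WATER, LAND, UNCLEAR = 1, 0, -1  # hypothetical class codes


def area_error_fraction(classes, gap_filled, water_probability, threshold, margin=0.05):
    """Fraction of AOI pixels that count toward the monthly area error.

    A pixel counts if it is not clearly classified, or if it was gap-filled and
    its long-term water probability is below threshold + margin (assumed reading).
    """
    uncertain = 0
    for cls, filled, prob in zip(classes, gap_filled, water_probability):
        if cls == UNCLEAR:
            uncertain += 1
        elif filled and prob < threshold + margin:
            uncertain += 1
    return uncertain / len(classes)


# Ten hypothetical AOI pixels.
classes = [1, 1, 1, 1, 0, 0, 0, -1, 1, 0]
gap_filled = [False, False, True, True, False, False, False, False, True, False]
probability = [0.95, 0.90, 0.85, 0.62, 0.10, 0.30, 0.05, 0.50, 0.64, 0.20]

error = area_error_fraction(classes, gap_filled, probability, threshold=0.60)
print(f"area error: {error:.0%} of the AOI")
```

In this toy example, one unclassified pixel and two gap-filled pixels with probabilities just above the threshold contribute to the error, while the gap-filled pixel with a probability of 0.85 does not.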

To provide RMS values for validation, we performed a cross-validation for all study areas, as already shown in detail for Ray Roberts, Poço da Cruz, and Tharthar. The methodology is described in Section 4.2.1. It is based on a relationship function between water levels and surface areas.

Table 6 shows the cross-validation results for all study areas. It also provides the number of points used for the regression and the number of points used for the validation of each study area. As can be seen from the table, the relative area errors remain below 8% for all 32 targets. The absolute RMS errors vary between less than 1 km${}^{2}$ and about 40 km${}^{2}$. These values include not only the area uncertainties but may also reflect errors in the water level measurements or in the assumed hypsometry function. Comparing the two fits (linear and polynomial) clearly shows the influence of this choice: the optimal fitting function depends on the lake's (unknown) bathymetry. The results will naturally improve when more data points are used for the fitting process, but the validation is then no longer based on independent input data.
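The idea of such a cross-validation can be sketched as follows: a relationship between water level and surface area is fitted on one subset of the data, and the RMS error is computed on the remaining, independent subset. This is a minimal illustration under stated assumptions and not the procedure of Section 4.2.1: the data are synthetic, the split into alternating points is arbitrary, and a second-degree polynomial stands in for the polynomial fit, whose degree is not specified here.

```python
import math


def polyfit(x, y, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [[sum(xi ** (i + j) for xi in x) for j in range(n)]
            + [sum(yi * xi ** i for xi, yi in zip(x, y))] for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(rows[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (rows[i][n] - tail) / rows[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def rms_error(coeffs, levels, areas):
    """RMS difference between fitted and observed areas."""
    residuals = [polyval(coeffs, h) - a for h, a in zip(levels, areas)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


# Synthetic data: area (km^2) as an exactly quadratic function of level (m).
levels = [float(h) for h in range(10)]
areas = [50.0 + 2.0 * h + 0.5 * h * h for h in levels]

# Alternating split: even indices for fitting, odd indices for validation.
fit_h, fit_a = levels[0::2], areas[0::2]
val_h, val_a = levels[1::2], areas[1::2]

rms_linear = rms_error(polyfit(fit_h, fit_a, 1), val_h, val_a)
rms_quadratic = rms_error(polyfit(fit_h, fit_a, 2), val_h, val_a)
print(f"linear fit RMS:    {rms_linear:.2f} km^2")
print(f"quadratic fit RMS: {rms_quadratic:.2e} km^2")
```

Because the synthetic hypsometry is quadratic, the linear fit leaves a systematic residual on the validation points while the quadratic fit does not, which mirrors how the choice of fitting function can dominate the reported RMS.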

Comparing the cross-validation results with the internal errors derived in the AWAX approach (provided in Table 5) shows very good agreement. For Lake Ray Roberts, the area error derived from the AWAX approach is 2.18 km${}^{2}$ (1.59%), whereas the cross-validation yields an RMS of 3.04 km${}^{2}$ (2.21%) for the linear fit and 2.85 km${}^{2}$ (2.07%) for the polynomial fit. Another example is Poço da Cruz, where the linear and polynomial fits behave differently. Here, the cross-validation shows an RMS of 3.53 km${}^{2}$ (6.07%) for the linear fit and 1.10 km${}^{2}$ (1.90%) for the polynomial fit. The surface area error derived from the AWAX approach is 3.23 km${}^{2}$ (5.56%), which lies between the two cross-validation results. The external validation thus confirms the order of magnitude of the internal AWAX errors and indicates that they are reliable.

Further investigations also showed no clear dependencies between the surface area or the final correlation coefficients ${R}^{2}$ and other parameters such as lake-shore length, climate zone, annual cloud coverage, or annual rainfall. Finally, one can conclude that this new approach can be used for nearly all lakes and reservoirs, as long as they are large enough to be observed successfully by Landsat or Sentinel-2.