Uncertainties Involved in the Use of Thresholds for the Detection of Water Bodies in Multitemporal Analysis from Landsat-8 and Sentinel-2 Images

Although the single threshold is still considered a suitable and easy-to-apply technique to extract water features in spatiotemporal analysis, it leads to unavoidable errors. This paper uses an enumerative search to optimize thresholds over the satellite-derived modified normalized difference water index (MNDWI). We employed a cross-validation approach and treated accuracy as a random variable in order to: (a) investigate the uncertainty related to its application; (b) estimate non-optimistic errors involved in single thresholding; (c) investigate the main factors that affect the model's accuracy; and (d) compare satellite sensor performance. We also used a high-resolution digital elevation model to extract water elevation values, making it possible to remove topographic effects and estimate non-optimistic errors exclusively from orbital imagery. Our findings show that there is a region within which threshold values can vary without causing accuracy loss. Moreover, by constraining threshold variation to these limits, accuracy is dramatically improved and outperforms the Otsu method. Finally, the number of scenes employed to optimize a single threshold drastically affects accuracy: using a single scene is not appropriate, since it leads to overfitted threshold values, and more than three scenes are recommended.


Introduction
Several studies have been dedicated to enhancing water features extraction by remote sensing techniques aiming to monitor and characterize hydrological dynamics of lakes and reservoirs [1][2][3][4][5][6].
The accuracy of water feature extraction by remote sensing has been constantly pursued through the development of several methods and approaches, such as: enhancement of water spectral indices [7][8][9][10][11]; use of supervised classification algorithms based on machine learning [12] to segment water/non-water features and tailor an appropriate threshold [13]; and combining different bands, sensors, and spectral indices [14][15][16][17].
Threshold segmentation remains attractive due to its easy implementation, lower investment of time, and relative accuracy in extracting water features [8,13]. However, selecting an appropriate threshold to maximize accuracy is challenging and time-consuming [7], and can underestimate or overestimate surface water features due to its spatial-temporal variability [17].
Despite the well-known limitation of employing a single threshold to correctly classify all instances, especially in spatiotemporal classification problems [13,18], the results of Bangira [12] and Sarp and Ozcelik [19] show that the performance of supervised classification methods is only slightly superior to that achieved through threshold segmentation, which keeps thresholding a feasible solution capable of producing accurate results.
In addition, factors such as complex land covers near the water boundary, background effects, physical-chemical and biological water characteristics (e.g., turbidity, color, chlorophyll), presence of aquatic vegetation, shadows, mixed pixels, and coarse pixel resolution negatively affect more sophisticated classification methods as well [7,12]; these limitations are therefore not exclusive to threshold methods.
On the one hand, several works recurrently point out that a single global optimum threshold is a suitable technique for time series analysis and for tracking water dynamics [8,13]. On the other hand, the accuracy variability resulting from its application is understudied. To our knowledge, there is no research in which the accuracy variability, calculated over water level estimation using Landsat-8 and Sentinel-2 imagery, has been used to support the choice of the threshold. Furthermore, the accuracy obtained exclusively from orbital imagery, disregarding errors due to coarse digital elevation model resolution, has not yet been investigated.
Apart from that, most optimization approaches are based on the premise that the lack of threshold stability may be a problem [18], which leads to a subjective choice [7]. A stable optimal threshold value has been regarded by many authors as an indication that a single threshold value is suitable to classify multiple scenes across space and time [20]. Conversely, results from Du [21] and Weekley and Li [1] indicate that the accuracy stability could be reached even in a threshold range variation, depending on the index and methodology used.
The purpose of this paper is to investigate the uncertainty involved in using a single optimized threshold over multitemporal Landsat-8 and Sentinel-2 imagery scenes. Our primary objectives were: (1) to understand how accuracy varies with threshold values, in order to identify ranges less subject to errors; (2) to investigate accuracy statistics exclusively from orbital imagery by employing a high-resolution digital elevation model, and to compare each sensor's performance; (3) to compare the advantage of the proposed cross-validation approach to support threshold choice versus a nonparametric, unsupervised method of automatic threshold selection; and (4) to contribute to improving the operational steps of adjusting a single threshold in multitemporal classification problems, regarding accuracy and computational effort.

Study Area
The 'Poço da Cruz' (PC) reservoir (Figure 1) was commissioned in 1958 and is located on the Moxotó River, an intermittent tributary of the São Francisco River, one of the most important river basins in Brazil. According to recent bathymetric surveys, its storage capacity is 0.484 km³, corresponding to a maximum surface area of 56.34 km². The Moxotó River Basin drains about 9619 km², and the area affluent to the reservoir is 4716 km². The region's climate is semi-arid, with a mean annual rainfall of 421.8 mm, highly concentrated during four months of the December-to-May period. Evaporation rates are about 1568 mm/year. The study area lies within the Caatinga biome, a seasonal tropical dry forest present in South America, characterized by a high albedo during the dry season and vegetation health highly correlated to precipitation and soil moisture.

Digital Elevation Model
The digital elevation model (DEM) of the PC reservoir was provided by the Brazilian Water Agency (ANA) and consisted of an aerial laser survey using the LiDAR (Light Detection and Ranging) method for the land part, integrated with a bathymetric survey for the flooded area. The latter was carried out with an echo sounder, using a single beam for shallow waters and a multibeam for deep waters. Both were referenced to the Brazilian Geodetic System (SGB), resulting in a DEM with a density of 2 points/m² and a 1 m resolution image [22].

Data Acquirement and Image Pre-Processing
The image acquisition and pre-processing were carried out using Google Earth Engine (GEE), a platform for geospatial cloud processing that makes available several spatial Earth-imaging missions and collections [23]. Satellite imagery was selected and processed through this tool to generate the Modified Normalized Difference Water Index (MNDWI) images. This index, proposed by Xu [11], introduced the short-wave infrared (SWIR) band in place of the NIR band [10]. Such modification enhanced: (1) the ability to suppress vegetation and built-up noise; (2) the stability of the adjusted thresholds [11,18]; and (3) the accuracy in classifying impure pixels [14]. Furthermore, the MNDWI is one of the most widely used water indices in surface water mapping, providing a large empirical application base.

MNDWI images were composited from Landsat-8/OLI (Operational Land Imager) and Sentinel-2/MSI (Multispectral Instrument) using the following datasets available in GEE:

•	Landsat-8 Surface Reflectance Tier 1 collection, provided by the United States Geological Survey (USGS): a Level-1 precision- and terrain-corrected product with atmospherically corrected surface reflectance.
•	Sentinel-2 collection, provided by the European Space Agency (ESA): a Level-2A orthorectified product, with atmospherically corrected surface reflectance obtained using the Sen2Cor algorithm. Along with a 5-day temporal resolution, the SWIR band (B11), available at 20 m spatial resolution, and the Green band (B3), available at 10 m and resampled to 20 m, were used in the MNDWI composition. We used 24 free-of-cloud images available from 22 December 2018 to 17 September 2020.
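The MNDWI used throughout this paper combines the Green and SWIR bands. The authors composited it inside GEE; purely as an illustrative sketch (not the GEE code), the index can be computed over surface-reflectance arrays as follows:

```python
import numpy as np

def mndwi(green, swir):
    """Modified Normalized Difference Water Index (Xu):
    MNDWI = (Green - SWIR) / (Green + SWIR).
    Inputs are surface-reflectance arrays on a common 20 m grid
    (e.g., Sentinel-2 B3 resampled to match B11)."""
    green = np.asarray(green, dtype=float)
    swir = np.asarray(swir, dtype=float)
    denom = green + swir
    # Guard against division by zero over no-data pixels.
    return np.where(denom != 0, (green - swir) / denom, 0.0)

# Toy 2x2 example: water pixels (green >> swir) yield positive MNDWI.
green = np.array([[0.10, 0.30], [0.05, 0.25]])
swir = np.array([[0.20, 0.05], [0.15, 0.05]])
index = mndwi(green, swir)
```

Water pixels thus receive positive values, while high-SWIR land features receive negative ones, which is what makes a single threshold on the index meaningful.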

Hydrological Monitoring Data
The observed water levels (OWL) for the PC reservoir were provided by the Water and Climate Agency of Pernambuco State (APAC), available at a daily temporal resolution starting on 1 July 2000. The OWL were converted to the SGB by adding 0.9 m. Gaps in the OWL corresponding to the image acquisition dates were filled using linear interpolation, according to Table 1.

Image Segmentation and Water Level Estimation
The image segmentation and water level (WL) estimation process is illustrated in Figure 2. All geospatial operations necessary for image segmentation and water elevation extraction (Yc) were automated using the following R packages: sf [24] and raster [25]. In addition, data manipulation and statistical analysis were also conducted in R, using the packages dplyr [26], ggplot2 [27], and reshape2 [28].

Threshold Optimization and Accuracy Evaluation
The general threshold optimization process is illustrated in Figure 3. It consists of an enumerative search, in which all the MNDWI images in the input dataset are segmented using the same single threshold value. The process illustrated in Figure 2 is carried out for each threshold T[i] in the predefined threshold array, which varies between a minimum and a maximum value in increments of 0.01. Some preliminary simulations were run to define the range of the threshold array, aiming to minimize the computational effort.
By running the Figure 2 process recursively, each image in the input dataset has a corresponding estimated WL (Yc), which in turn is compared to its corresponding observed or filled water level (Yo), described in Section 2.2.3, to calculate the RMSE (Equation (2)). First, the predefined threshold T was used to segment all MNDWI images in the dataset (Figure 2a), whereby values greater than T were classified as water (Figure 2b). After water/non-water segmentation, a mask derived from the DEM, corresponding to the maximum elevation of 437 m (2 m higher than the maximum operational limit), was applied over the binary classified raster layer to discard pixels outside its boundaries (Figure 2c).
This geospatial operation was carried out to avoid WL values arising from misclassified pixels and disconnected water bodies. Similar techniques were used by [29] to constrain the study area and exclude effects from built-up areas, anthropogenic land cover, and buildings. After masking, the raster was converted to a polygon (Figure 2d), and then its vertices were transformed to points (Figure 2e). A WL dataset was extracted from the DEM at the point coordinates (Figure 2f).
After DEM extraction, a large dataset of WL values is generated as output, making it necessary to choose a unique representative WL (Yc) value to calculate accuracy (Equation (2)). This problem is circumvented by adopting the median, a statistic that represents the central tendency of the data. Before the median calculation, the Tukey method [30] was applied to remove extreme values from the extracted WL dataset (Figure 2g). This procedure reinforces noise reduction and removes undesirable extreme WL values, such as remaining misclassified pixels at higher land and border points identified inside the water body (arising from image failures such as the vertical lines in Figure 2a,d,e).
In addition, some characteristics present in the PC inlet branches, such as agriculture, shallow waters, pastures, macrophyte vegetation, and eutrophication, have a substantial similarity to wetlands, marshlands, and inundated areas. Hence, they cannot be correctly addressed without a sub-pixel-level approach [31], which leads to omission errors in water detection [32]. WL values extracted from these areas probably correspond to the lower tail in Figure 2g and were effectively trimmed after filtering (Figure 2h). Therefore, the values beyond the upper and lower limits shown in Equation (1) are removed to add more confidence and stability to the extracted WL dataset. The variables Q1 and Q3 represent the first and third quartiles of the extracted WL values, respectively. These steps are illustrated in Figure 2g,h.
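The outlier-filter-plus-median reduction described above can be sketched in a few lines; the authors worked in R, and the 1.5 × IQR fences below are the standard Tukey constants, assumed to match Equation (1):

```python
import numpy as np

def representative_wl(extracted_wl):
    """Reduce the DEM-extracted water-level sample to a single value (Yc):
    Tukey outlier removal followed by the median. The 1.5 * IQR fences
    are the conventional Tukey constants (assumed here)."""
    wl = np.asarray(extracted_wl, dtype=float)
    q1, q3 = np.percentile(wl, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
    kept = wl[(wl >= lower) & (wl <= upper)]       # trim both tails
    return float(np.median(kept))

# A high outlier (e.g., a misclassified pixel on higher land) is discarded.
yc = representative_wl([429.9, 430.0, 430.1, 430.2, 500.0])
```

Because the fences are computed from the quartiles of each scene's own extraction, the filter adapts to each image without manual tuning.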

Finally, the optimum threshold value T* is taken as the one corresponding to the minimum RMSE value after the enumerative search described above.

Figure 3. The optimization process runs over a selected image dataset and a predefined threshold array to find T*, which minimizes the RMSE.
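The enumerative search can be sketched as follows (the authors' implementation was in R; this Python version is illustrative only, and `estimate_wl` is a hypothetical callable standing in for the entire Figure 2 pipeline):

```python
import numpy as np

def rmse(yc, yo):
    """Root mean square error between estimated (Yc) and
    observed/filled (Yo) water levels (Equation (2))."""
    yc, yo = np.asarray(yc, float), np.asarray(yo, float)
    return float(np.sqrt(np.mean((yc - yo) ** 2)))

def optimize_threshold(images, observed_wl, estimate_wl,
                       t_min=-0.5, t_max=0.5, step=0.01):
    """Enumerative search: segment every image in the input dataset with
    the same threshold T[i], estimate the WL for each, and keep the T*
    that minimizes the RMSE over the whole dataset. The [t_min, t_max]
    range is an assumption; the paper narrowed it via prior simulations."""
    thresholds = np.arange(t_min, t_max + step / 2, step)
    best_t, best_rmse = None, np.inf
    for t in thresholds:
        yc = [estimate_wl(img, t) for img in images]
        err = rmse(yc, observed_wl)
        if err < best_rmse:
            best_t, best_rmse = float(t), err
    return best_t, best_rmse
```

Note that each candidate threshold triggers the full water-level pipeline on every image, which is why the paper stresses the computational cost of this step.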


Uncertainty Analysis
This work aimed to investigate uncertainties concerning the use of the single threshold approach. To select a method suited to this purpose, we considered that:

•	The single threshold method is a stump, a machine learning model consisting of a one-level decision tree, and therefore an adaptive non-linear technique;
•	Our interest is to support constructing a model capable of predicting WL on independent test data;
•	A fitting method typically adapts to the training data, and hence the training error is commonly an optimistic estimate of the test error (generalization error);
•	The computational effort required to tune our model (enumerative search) is considerable due to the complexity of the WL estimation process described in Figure 2;
•	The Bootstrap methodology [34,35] is the most suitable approach to be applied.

By mimicking the cross-validation method, this approach fits the model on a set of bootstrap samples drawn with replacement from the original dataset and then evaluates how well it predicts the original training set [33]. Given the nature of our optimization problem (Figure 3), which demands all the geospatial operations described in Figure 2 recursively, training and testing were carried out as distinct processes, making it impossible to keep track of predictions not containing information used in threshold optimization.
Considering the computational burden of the optimization by enumerative search, 20 threshold values (T*) were generated for each image dataset. The number of MNDWI images (n) taken as input in the training set of the optimization process (Figure 3) was varied and assumed the predefined values n = (1, 3, 7) to investigate how this factor affects accuracy. Therefore, 6 datasets of 20 optimized thresholds (T*) were obtained: 3 for MNDWI Landsat-8 and 3 for MNDWI Sentinel-2.
The input dataset images were randomly selected, with replacement. In this approach, the whole dataset is assumed to be the population of size N (Figure 4), following the Bootstrap methodology mentioned above. The accuracy empirical probability density function was derived by employing the flowchart illustrated in Figure 5, which generated a dataset of 500 RMSE and T* values. It was then used to analyze the uncertainties concerning thresholding.

Figure 4. Bootstrap generation of optimized thresholds: (1) the Landsat-8 and Sentinel-2 (N = 24) image dataset is assumed to be the population; (2) n = (1, 3, 7) images are drawn randomly from the population with replacement; (3) the optimization process described above and illustrated in Figure 3 is run, with the images sampled in the former step as input, to obtain T* as output; (4) the process is repeated 20 times for each image collection and each sample size n = (1, 3, 7).
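The Figure 4 scheme reduces to a short resampling loop; in the sketch below, `optimize` is a hypothetical callable standing in for the Figure 3 enumerative search, and the function names are illustrative:

```python
import random

def bootstrap_thresholds(image_ids, n, optimize, repetitions=20, seed=42):
    """Draw n images with replacement from the population, run the
    enumerative optimization on the sample to get one T*, and repeat
    20 times, yielding one dataset of optimized thresholds."""
    rng = random.Random(seed)
    t_stars = []
    for _ in range(repetitions):
        # Sampling WITH replacement: an image may appear more than once.
        sample = [rng.choice(image_ids) for _ in range(n)]
        t_stars.append(optimize(sample))
    return t_stars
```

Running this for n = 1, 3, 7 over each of the two image collections produces the six datasets of 20 thresholds described above.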

Figure 5. Estimation of the RMSE empirical probability density functions from the mimicked cross-validation: (1) each dataset collection of optimized thresholds (T*) is supposed to be the population; (2) three elements of T* are drawn by chance with replacement; (3) a mean optimized threshold value (T̄*) is calculated; (4) a single image is chosen by chance, with replacement, from the original MNDWI Sentinel-2 or Landsat-8 dataset; (5) the process detailed in Figure 2 is carried out, the water level is estimated and compared to the observed or interpolated value, and the RMSE is calculated; (6) steps 1 to 5 are repeated 500 times, and the pairs [T̄*, RMSE] are stored for the statistical analysis of accuracy.
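The mimicked cross-validation above can be sketched as follows; `evaluate` is a hypothetical callable standing in for the Figure 2 pipeline plus the RMSE calculation, and the function name is illustrative:

```python
import random
import statistics

def mimicked_cv(t_stars, image_ids, evaluate, runs=500, seed=7):
    """Figure 5 scheme: (1-3) draw three optimized thresholds with
    replacement and average them; (4) draw one image with replacement;
    (5) estimate the WL with the mean threshold and compute the RMSE;
    (6) repeat 500 times, storing the [T_bar, RMSE] pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(runs):
        t_bar = statistics.mean(rng.choices(t_stars, k=3))
        image = rng.choice(image_ids)
        pairs.append((t_bar, evaluate(image, t_bar)))
    return pairs  # basis for the empirical density of accuracy
```

Because thresholds and test images are drawn independently, the resulting RMSE distribution reflects how a threshold tuned on some scenes performs on others.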

Threshold Stability Analysis
The RMSE and T* datasets, generated by mimicked cross-validation, were analyzed to investigate intervals a ≤ T* ≤ b where the threshold selection leads to minimal and more stable RMSE values.
The Otsu algorithm, an automatic, unsupervised method [32], was applied for comparison with the cross-validation results. The EBImage R package [36] was used to segment each MNDWI image by maximizing the between-class water/non-water pixel variance. Before the Otsu segmentation, the images were masked to constrain the study area (as described in Section 2.3.1).
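The paper applied Otsu via the EBImage R package; an equivalent NumPy sketch of the method, which picks the histogram split maximizing the between-class variance, is:

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method on a 1-D sample of MNDWI values: evaluate every
    histogram split and return the bin center that maximizes the
    between-class (water / non-water) variance."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist.astype(float) / hist.sum()           # bin probabilities
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                             # weight of class 0
    w1 = 1.0 - w0                                 # weight of class 1
    mu = np.cumsum(p * centers)                   # cumulative mean
    mu_t = mu[-1]                                 # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        # sigma_b^2 = (mu_t * w0 - mu)^2 / (w0 * w1)
        between = (mu_t * w0 - mu) ** 2 / (w0 * w1)
    between[~np.isfinite(between)] = 0.0          # empty-class splits
    return centers[int(np.argmax(between))]
```

On a strongly bimodal MNDWI histogram the returned threshold falls between the land and water modes, which is why Otsu needs no training data but also cannot exploit field observations.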

Errors Involving Single Threshold
Errors involving thresholding arise from two main sources:

•	Threshold values are not constant in space and time, varying due to the subpixel water/land-cover composition [18] and the environmental optical complexity [7] that affects the reflected spectral profile, such as biological water characteristics, the presence of aquatic vegetation, and complex land covers near the water boundary;
•	Noise arising from misclassified non-water features, whose spectral index values are similar to those of water [7,11,37].

Figure 6 suggests that the lower RMSE values represent errors from spatiotemporal reflectance variability, forming a block spreading randomly across the threshold axis. Visually, in Landsat-8 imagery, this type of error is concentrated at RMSE values below 2.5 m (Figure 6a), while in Sentinel-2 it is concentrated below 1 m (Figure 6b). Figure 6 also illustrates that the noise from misclassification corresponds to higher RMSE values concentrated on the left side (lower thresholds). These errors are associated with highly negative MNDWI values, represented by pixels with high reflectance in the SWIR band, such as built-up areas, rock, roads, and sand [18], and low-albedo features, such as asphalt, shadows, mountains, buildings, and clouds [7].


Accuracy, Stability, and Precision
As illustrated in Figure 6, errors arising from pixel reflectance variability are more limited and predictable, yet unavoidable. Conversely, errors arising from noisy misclassification are higher, more spread out, and unstable. In this work, we removed noise effects by (a) masking the MNDWI images and constraining the area, (b) applying outlier filters, and (c) using the median statistic of the extracted water levels. However, as illustrated in Figure 7, when threshold values are lower than the optimum threshold value (T*), noise drastically degrades accuracy. These results follow Yang [37], who found that when the threshold is lower than a specific value, it brings too much noise, which cannot be effectively eliminated even by a filter.
Therefore, the optimized threshold T* can also be considered a critical value that delimits an unstable region in terms of accuracy. Aiming to minimize the RMSE, the optimization process may drive thresholds to highly negative values, which work only in the specific scene and produce too much noise when applied to another one, negatively affecting accuracy. Hence, it is desirable to choose threshold values higher than T*, where error variations are lower, increase linearly and smoothly, and are more stable.
The cross-validation approach allowed estimating the cross-effects of optimized threshold values (T*) over imagery dataset samples and thus identifying the stability region. However, as shown in Figure 6, the stability region can also be identified using the whole imagery dataset (represented by the thick black line) without a cross-validation approach, requiring less computational effort and being less time-consuming.

Table 2 illustrates that the mean RMSE* increases as the number of images employed in the threshold optimization grows. This can be explained by inter-scene reflectance variability, which makes it more difficult to adjust a single threshold over a larger image dataset.

The optimized thresholds' statistics illustrated in Table 2 show that the mean values stay very close to the lower limit of the stability region. In terms of standard deviation (SD), the upper limits for Landsat-8 and Sentinel-2 can be reached by adding a multiple of the SD to the mean value: 1.39 × SD and 1.5 × SD, respectively. This procedure was used by [29] to adjust threshold values for different dates and regions when the statistics are beyond the normal range. Note: T* is the optimized threshold and RMSE* is the accuracy obtained by varying the sample size (n) and the imagery collection. SD: standard deviation. CV: coefficient of variation.
Regarding the threshold variation, Landsat-8 thresholds varied by 42.4% within the range −0.33 < T < −0.19, and Sentinel-2 thresholds varied by 11.1%. These results exceed the limit of 10% considered by Herndon [20] as the possible range of optimal threshold value variation. However, this must be observed with a caveat, since the performance metric used is different and the study area is very constrained.
As illustrated in Table 4, there is a slight generalized precision gain for both satellite imagery datasets. For n = 3, precision is increased for all instances: 0.5 m, 1.0 m, 1.25 m, and 2.0 m. After restricting the threshold variation for Sentinel-2, ~99% of the errors are lower than 1 m, and ~100% of the RMSE values are lower than 1.25 m.

The Non-Optimistic Error
By comparing the ratio between the mean RMSE (Table 3) and RMSE* (Table 2), it can be noticed how optimistic the error estimator is in the calibration step, depending on the number of images employed. Optimization that uses a single image (n = 1) chosen by chance can produce overfitted thresholds, too specialized in extracting water features from a specific scene, which, when applied to another image, can lead to errors 20.8 times higher than expected. Table 3 also shows that, using three scenes (n = 3) in the optimization, the ratio RMSE/RMSE* is significantly reduced, indicating that the calibration and test sample variability become closer.
Conversely, the ratio RMSE/RMSE* lower than 1 when n = 7, illustrated in Table 3, suggests that the same information employed in threshold calibration is used in the test samples. This is because we did not employ pure cross-validation but a mimicked approach. Therefore, as the sample size (n) increases, considering the small Landsat-8 and Sentinel-2 imagery datasets employed in this experiment, the probability of reusing the same information also grows.
Our results indicate that using a sample size greater than three (n > 3) to optimize single thresholds (T*) leads to non-overfitted models that are more suitable for multitemporal analysis, since it creates a good balance between bias and variance: a desirable characteristic of predictive models [33].

Comparison between Otsu and Single Optimized Threshold
By comparing the optimized thresholds (Table 2) to the Otsu values (Table 5), one can notice that the automated threshold values calculated by the Otsu method are higher than the optimized ones. Mean RMSE values from Otsu (Table 6) are also higher than those from the optimized thresholds (Table 3); however, the maximum RMSE values from Otsu are lower than the optimized ones. When comparing results constrained to the stability region, the optimized thresholds dramatically outperform Otsu in terms of mean and maximum RMSE values. These results follow Ji and Wylie [18] and Herndon [20], who recommend adjusting the threshold to the actual situation and calibrating it to the local environment to improve performance.

Ensemble as Alternative to Single Threshold Approach
Our primary objective was to investigate uncertainty concerning the single threshold method and support its best choice. It was possible to identify a more stable region where thresholds can vary without impacting accuracy negatively. We investigated how better a model performs in the stability region and evaluated ways to securely set threshold values in the stability region using optimized threshold statistics, aiming to provide elements to support threshold choice.
Nevertheless, as pointed out, the single threshold can also be classified as a machine learning model, more specifically a one-level decision tree. This understanding brings new insight into the results and the applied methodology, helping construct an easy-to-implement alternative to the single threshold.
Usually, the decision tree model is sensitive to the training data [38], tending to produce unstable results with high variance [33]. Bagging, or bootstrap aggregation [39], was one of the first ensemble techniques developed. It combines the predictions of multiple models fitted over bootstrap samples, which are averaged to produce the bagged model's prediction [40]. Mainly applied to decision trees, bagging can reduce the variance of unstable models and give a substantial gain in accuracy, making the final prediction more stable [39,40]. This occurs because averaging reduces variance while keeping bias unchanged [33]. Therefore, the power of bagging consists of combining models that have different perspectives on the data [38].
As an alternative to setting the single threshold in the stability region, a bootstrap aggregation model could be constructed by ensembling the optimized thresholds (T*) shown in Table 2 or by using the constrained interval.
It is important to highlight that the RMSE values shown in Table 3 do not correspond to the bagged prediction accuracy, since they were calculated as the mean of 500 RMSE values. Moreover, the bagged averaged prediction must consider the number of models trained on bootstrap samples (in our experiment, N = 20), and the WL estimations (Yc) should be averaged before the RMSE calculation.
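The distinction matters operationally: in bagging, the WL predictions are averaged before the error calculation, not the per-run RMSE values. A minimal sketch, with `estimate_wl` as a hypothetical stand-in for the Figure 2 pipeline:

```python
import numpy as np

def bagged_wl(image, t_stars, estimate_wl):
    """Bootstrap-aggregated prediction: estimate the water level with
    each optimized threshold (T*) in the ensemble of one-level decision
    trees and average the predictions."""
    return float(np.mean([estimate_wl(image, t) for t in t_stars]))

def bagged_rmse(images, observed, t_stars, estimate_wl):
    """RMSE of the bagged predictions: Yc is averaged across the
    ensemble *before* the error calculation (Equation (2))."""
    yc = np.array([bagged_wl(img, t_stars, estimate_wl) for img in images])
    yo = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((yc - yo) ** 2)))
```

Averaging at the prediction level is what lets bagging cancel the variance of individual overfitted thresholds, which averaging RMSE values cannot do.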
Another observation is that bagging is suitable for models with high variance. Hence, the recommendation of employing more than three image scenes in the optimization (discussed in Section 3.3) requires more investigation before constructing the ensemble model.

Conclusions
Using a single threshold to extract water features in multitemporal analysis leads to unavoidable errors that can be minimized by constraining values to specific limits, within which errors remain stable and threshold values can vary without causing loss of accuracy.
The accuracy stability region can be identified by employing a cross-validation approach or by using the whole imagery dataset. This is possible because the limits of the accuracy stability region do not depend on the sample size employed in the optimization. Conversely, the cross-validation approach is more time-consuming, requires more computational effort, and is not easy to use. The results also showed that the mean statistics of the optimized threshold values, even when computed over single scenes, can take thresholds close to the lower limit of the stability region.
This paper also examined the relationship between the number of scenes employed in single threshold optimization and the non-optimistic error. The results showed that the number of images plays a critical role in model accuracy. More than three randomly selected scenes are always recommended. If a large number of free-of-cloud images is available, cross-validation (not mimicked) can be employed to provide this information more precisely.
The single threshold optimized with field observations was superior to the automated water extraction by the Otsu method. The performance of Sentinel-2 was superior to that of Landsat-8 imagery. Additionally, the Otsu method could not extract the maximum potential of the Sentinel-2 imagery compared to the threshold method. Further research will investigate whether these results hold when varying factors such as spatial resolution, water indices from different spectral bands, and other surface correction algorithms.
This paper showed that the non-optimistic errors involved in applying a single threshold over the MNDWI might not be compatible with the purposes of operational monitoring, even when employing a high-resolution digital elevation model. Considering that the single threshold method is a one-level decision tree, better accuracy would be expected by ensembling multiple optimized threshold values.
Future research may use high-resolution digital elevation models associated with observed water levels and their respective contour levels to identify water pixels with high confidence and use them as training data. Experiments carried out at the subpixel scale by DeVries [32] and Jones [31] indicate that machine learning algorithms and empirical decision rules, fitted over satellite bands and water and vegetation spectral indices, can add more flexibility, improving water classification.