A Comparative Study of Water Indices and Image Classiﬁcation Algorithms for Mapping Inland Surface Water Bodies Using Landsat Imagery

: A comparative study of water indices and image classiﬁcation algorithms for mapping inland water bodies using Landsat imagery was carried out through obtaining 24 high-resolution ( ≤ 5 m) and cloud-free images archived in Google Earth with the same (or ± 1 day) acquisition dates as the Landsat-8 OLI images over 24 selected lakes across the globe, and developing a method to generate the alternate ground truth data from the Google Earth images for properly evaluating the Landsat image classiﬁcation results. In addition to the commonly used green band-based water indices, Landsat-8 OLI’s ultra-blue, blue, and red band-based water indices were also tested in this research. Two unsupervised (the zero-water index threshold H0 method and Otsu’s automatic threshold selection method) and one supervised (the k -nearest neighbor (KNN) method) image classiﬁcation algorithms were employed for conducting the image classiﬁcation. Through comparing a total of 2880 Landsat image classiﬁcation results with the alternate ground truth data, this study showed that (1) it is not necessary to use some supervised image classiﬁcation methods for extracting water bodies from Landsat imagery given the high computational cost associated with the supervised image classiﬁcation algorithms; (2) the unsupervised classiﬁcation algorithms such as the H0 and Otsu methods could achieve comparable accuracy as the KNN method, although the H0 method produced more large error outliers than the Otsu method, thus the Otsu method is better than the H0 method; and (3) the ultra-blue band-based AWEI nsuB is the best water index for the H0 method, and the ultra-blue band-based MNDWI 2uB is the best water index for both the Otsu and KNN methods. n Gw is number of water pixels classiﬁed by the GE image, n Lw | Gw is number of water pixels classiﬁed by the Landsat given that they are classiﬁed as water pixels by the GE image, n Ln | Gn is number of non-water pixels classiﬁed by the Landsat given that they are classiﬁed as non-water pixels by the GE image, and n is number of total pixels inside the bu ﬀ er zone. These two errors are computed inside the deﬁned bu ﬀ er zone for each study area. The computed relative errors can reveal if the Landsat-extracted water body areas are overestimated (i.e., positive REs) or underestimated (i.e., negative REs). The overall errors are computed as 100% minus the overall accuracy of Landsat image classiﬁcation results, as shown in Equation (10), and the overall image classiﬁcation accuracies are evaluated based on error or confusion matrices [48], thus the computed overall errors can reveal the omission or commission errors in the Landsat image classiﬁcations of water and non-water pixels.


Introduction
The Earth's inland surface water body consists of rivers, freshwater or saltwater lakes, and marshes with a total surface area of about 5.6 million km 2 (about 1.1% of Earth's surface area), i.e., 0.8 million km 2 for rivers [1], 2.1 million km 2 for lakes, and 2.7 million km 2 for marshes [2]. Although these inland surface water bodies only hold about 0.013% of Earth's total water, i.e., about 178,000 km 3 [2][3][4], they are important compartments in the global terrestrial water cycle, and mapping inundation areas of inland surface water bodies is of great significance for flood prediction and prevention [5][6][7][8]; flood risk and damage assessments [9][10][11][12][13][14][15][16][17]; estimation of water storage in rivers, lakes, and reservoirs [18][19][20]; calculation of evaporation from wetlands and lakes/reservoirs [21]; retrieval of lake water level and river stage [22][23][24][25]; reservoir operation and management [26]; and assessment of ecological functions and health in wetlands and marshes [27,28]. In addition to the above-mentioned practical applications, where GREEN and NIR are spectral reflectance in the green light band and the near-infrared band, respectively. Xu (2006) showed that the NDWI had trouble with eliminating the build-up land noise from the extracted water bodies. Considering that the build-up land exhibits relatively higher reflectance in the shortwave infrared (1.5-3.0 µm) band than in the near-infrared (0.7-1.3 µm) band, Xu (2006) replaced the near-infrared band with the shortwave infrared (SWIR) band in the NDWI and referred to it as the modified normalized difference water index (MNDWI). Although Landsat-5 TM has two SWIR bands, i.e., band 5 (1.55-1.75 µm) and band 7 (2.09-2.35 µm), Xu (2006) found that band 5 of Landsat-5 was better than band 7, thus band 5 was used in the MNDWI. Landsat-8 OLI also has two SWIR bands, i.e., band 6 (1.57-1.65 µm) and band 7 (2.11-2.29 µm). In this study, we referred to the first SWIR band (i.e., band 5 in Landsat-5 and band 6 in Landsat-8) as SWIR 1 , and the second SWIR band (i.e., band 7 in both Landsat-5 and Landsat-8) as SWIR 2 . To differentiate from MNDWI, we call the SWIR 2 -based water index MNDWI 2 . Both MNDWI and MNDWI 2 are defined as follows: Remote Sens. 2020, 12, 1611 3 of 18 In water body classification, shadows produced by mountains, trees, buildings, and river banks can contaminate satellite imagery classification of water bodies. To remove the impact of shadows, Feyisa et al. (2014) proposed the AWEI using five bands, given as follows: AWEI s = BLUE + 2.5GREEN − 1.5(NIR + SWIR 1 ) − 0.25SWIR 2 (4) where BLUE, GREEN, NIR, SWIR 1 , and SWIR 2 are spectral reflectance in the blue light, green light, near infrared, shortwave infrared 1, and shortwave infrared 2 bands, respectively. To use the AWEI, Feyisa et al. (2014) suggested the following criteria: (1) for the areas without high albedo surfaces such as snow cover, and where shadows are the main factor causing errors in the extracted water bodies, AWEI s alone is sufficient to identify water; (2) if there is no shadow, AWEI ns alone is sufficient; (3) if both high albedo surfaces and shadow/dark surfaces are present, Equations (3) and (4) are used sequentially; and (4) without any shadow/dark surfaces and high albedo surfaces, either one alone can be used. According to Equations (1)-(4), unsurprisingly, we can find that the spectral reflectance in the green band is the key variable in these three commonly used water indices, given the relatively high reflectance in green light associated with turbid or algae-laden water measured in the fields [41,42]. Thus, in this paper, we referred to these commonly used water indices as the green band-based water indices. However, if we pay attention to the measured spectral reflectance of clear water in Meaden and Kapetsky (1991), we can find that the reflectance in blue light is actually the highest among all visible lights. On the other hand, if water contains a certain amount of sediments, the spectral reflectance in red light should be the highest [41]. It seems that these kinds of questions have not been paid much attention in the literature; therefore, one goal of this study is to evaluate performances of all water indices including the green (commonly used), ultra-blue (only Landsat 8 has this ultra-blue band), blue, and red band-based water indices.
One key purpose of utilizing water indices in the extraction of water bodies from satellite imagery is to simplify the image classification by defining the zero-water index value as the threshold to differentiate water and non-water pixels. However, this single zero-water index threshold method may not work well, and studies showed that a dynamic or automatic selected threshold method such as the Otsu method [47] was better than the zero-water index threshold method [32]. No matter the single threshold method or the automatic selected threshold method, they all belong to the unsupervised image classification. Compared to the unsupervised image classification, the supervised image classification should perform better because human intervention and input of training data could assist computers with identifying water and non-water pixels, although computational efficiency of the supervised methods is usually lower than that of the unsurprised methods. Obviously, there is a tradeoff between computational efficiency and accuracy of image classification, thus the following questions should be answered: (1) Which image classification algorithm is proper for a particular research study? (2) How should one choose an image classification method? Therefore, the second goal of this study is to address these questions and provide recommendations regarding selection of image classification methods through comparing performances of different image classification methods.
Accuracy assessment is the critical final step in image classification that also has some critical issues to be addressed, such as (1) how to collect ground truth data to validate the image classification results and (2) how to properly compare the computed accuracy (e.g., the Kappa coefficient, relative error, overall error, omission error, and commission error) among tested sites. Collecting ground truth data for validating extracted water bodies from Landsat imagery is time-consuming and labor-intensive, especially to draw a general conclusion. Multiple Landsat images across the globe are necessary for the accuracy assessment, which make it even more difficult to facilitate a ground campaign to collect all ground truth data for validating all the selected Landsat images across the globe. Considering the difficulty in collecting ground truth data, this study took advantage of high-resolution satellite imagery (spatial resolution ≤ 5 m) archived in Google Earth and utilized the Google Earth imagery as the "alternate" ground truth data for evaluating Landsat imagery classification results.
In this study, we targeted the three critical issues discussed above and conducted a systematic study to shed new light on the problems and provided our recommendations for the remote sensing community with regard to the best water index(es) and image classification method(s) in terms of the accuracy of extracted water bodies from Landsat imagery. The remainder of this paper is organized into four sections: Section 2 describes the study area and data sources. Section 3 introduces the methods employed in this study. Results and discussion are presented in Section 4. Conclusions are given in Section 5.

Study Areas and Data
The number of high-resolution satellite images archived in Google Earth generally is much less than the number of Landsat images over a specified area because high-resolution satellite (e.g., Ikonos, QuickBird, WorldView, and GeoEye) operators do not provide free daily images to Google Earth. To match the image acquisition date between Landsat imagery and high-resolution satellite imagery over a particular water body (such as lakes and rivers), manually searching high-resolution satellite images archived in Google Earth and Landsat imagery is needed. The strategy taken in this study is to first find high-resolution satellite imagery (click "Show historical imagery" button on Google Earth) over a selected lake, then go to the United States Geological Survey (USGS) EarthExplorer website (earthexplorer.usgs.gov) to search around the same date (±1 day) for a Landsat image without cloud cover over the selected lake.
The method for building the ground truth data based on Google Earth images is described in Section 3. Through searching for high-resolution satellite images archived in Google Earth and Landsat images at the USGS EarthExplorer website, this study selected 24 lakes across the globe as shown in Figure 1. The acquisition date, pixel cell size, water surface elevation (WSE), latitudinal range, and longitudinal range of each Google Earth image are listed in Table 1, along with the acquisition date, and path and row indices of the corresponding Landsat-8 OLI scene. For each Landsat-8 OLI scene, both Precision and Terrain corrected (L1TP) Level-1 (scaled and calibrated digital number) and Level-2 (computed surface reflectance) data were downloaded from the USGS EarthExplorer website. Landsat image processing is also described in Section 3.   Figure 2 is a flowchart illustrating steps for building the alternate ground truth data from Google Earth (GE) images, extracting and processing Landsat image data, classifying GE and Landsat images, and comparing GE and Landsat image classification results. Each saved GE image was first georeferenced through selecting the World WGS84 as the Geographic Coordinate System and entering coordinates of four corners using the georeferencing function in ArcGIS. Then, each georeferenced GE image was projected from the Geographic projection to the Universal Transverse Mercator (UTM) projection to match the projection of the corresponding Landsat image. The corresponding Landsat-8 OLI image was subsequently clipped to the same spatial extent of the GE image and then resampled into the same grid cell size as the GE image using the bilinear resampling method embedded in ArcGIS.

Data Processing
Among the downloaded Landsat-8 OLI Level-1 images in eight bands, the band 8 image was only used for correcting the possible errors in the georeferenced Google Earth image. Two steps were taken in this study to reduce such errors: (1) first, visually inspecting the images to identify a couple of benchmark pixels in both Google Earth image and the resampled (i.e., with the same pixel cell size as the corresponding Google Earth image) Landsat-8 OLI band 8 image, and then determining the average shift in the pixel distance and applying the average pixel shift distance to correct the Google Earth image; and (2) computing the spatial correlation between the Google Earth image and the resampled Landsat-8 OLI band 8 image over a range of shifts in x and y directions to determine the optimal shifts in x and y directions that are associated with the maximum spatial correlation coefficient. Then, use the optimal shifts to correct the Google Earth image.
as the corresponding Google Earth image) Landsat-8 OLI band 8 image, and then determining the average shift in the pixel distance and applying the average pixel shift distance to correct the Google Earth image; and (2) computing the spatial correlation between the Google Earth image and the resampled Landsat-8 OLI band 8 image over a range of shifts in x and y directions to determine the optimal shifts in x and y directions that are associated with the maximum spatial correlation coefficient. Then, use the optimal shifts to correct the Google Earth image.

Water Index
As discussed in Section 1, the commonly used water indices are referred as the green band-based water indices. In this study, in addition to these green band-based water indices, we also compared the performances of the other three sets of water indices, i.e., ultra-blue band, blue band, and red band-based water indices. To keep the commonly used notations of various water indices, this study added a subscript to each water index term for representing the visible band used in the water index as follows:

Water Index
As discussed in Section 1, the commonly used water indices are referred as the green band-based water indices. In this study, in addition to these green band-based water indices, we also compared the performances of the other three sets of water indices, i.e., ultra-blue band, blue band, and red band-based water indices. To keep the commonly used notations of various water indices, this study added a subscript to each water index term for representing the visible band used in the water index as follows: where X (uB, B, G, R) stands for the ultra-blue, blue, green, or red band used in computing water index. This study utilized two types of Landsat data for computing water indices: surface reflectance given in the Level-2 products (hereafter SR water index) and top-of-atmosphere (TOA) spectral reflectance R λ computed from the digital number (DN) of each Landsat-8 imagery pixel using the following equation (hereafter TR water index): where M = 2 × 10 −5 and A = −0.1 are rescaling factors for converting the digital number to reflectance in band λ, and θ is the solar zenith angle in degrees which is given in the metadata file of each Landsat scene.

Unsupervised Image Classification
The simplest unsupervised image classification method is to select a single threshold to differentiate water and non-water pixels. Without computing any water index, a simple density slice method can be used to determine a threshold from the histogram of an image, i.e., choosing the digital number associated with the valley of the histogram of the image as the threshold. Although this approach is simple to carry out, it is subject to uncertainty and errors if a histogram does not show a distinct valley. Using the computed water indices for image classification, instead of selecting a threshold based on the histogram, a zero-water index threshold method is usually chosen for extracting water bodies. This method can improve the efficiency of image classification, but it is also subject to uncertainty and errors, because a threshold value of the zero-water index might not achieve the most accurate extraction of the water body. Therefore, this study evaluated the accuracy of the extracted water bodies based on the zero-water index threshold method (hereafter the H0 method).
In addition to the H0 method, a nonparametric and unsupervised automatic threshold selection method proposed by Otsu (1979) was evaluated in this study. The principle of Otsu's method is to maximize the following objective function f : where P W and P NW are probabilities of water pixels and non-water pixels, respectively, and µ W and µ NW are mean water index values of classified water pixels and non-water pixels, respectively. The optimal water index threshold is determined through searching the water index threshold (WIT) between −1 and 1 with an interval of 0.01 for maximizing the objective function shown in Equation (7). All terms on the right hand of Equation (7) are computed as follows: where WI i is the water index of pixel i; and n, n W , and n NW are numbers of total pixels, pixels with WI > WIT, and pixels with WI ≤ WIT, respectively.

Supervised Image Classification
Given the relative simplicity of identifying water and water-land boundaries with human visual inspection, supervised classification might be a good choice for fulfilling the task of water pixel classification by inputting a training dataset for classification. There are several supervised classification methods, such as maximum likelihood, Gaussian mixture, minimum distance, nearest neighbor, k-nearest neighbor, etc. This study chose the k-nearest neighbor classifier (hereafter the KNN method) to be evaluated because of its simplicity and effectiveness [48]. Applying the KNN method to determine if an unknown pixel x belongs to water class or non-water class, we first need to compute the spectral distance between the pixel x and each training pixel in m-dimensional spectral space as follows: where i is the index of training pixels, n is the number of training pixels, j is the band index, m is the number of bands to be used in image classification, R x,j is the spectral reflectance of the pixel x to be classified in band j, and R i,j is the spectral reflectance of the training pixel i in band j. All the computed spectral distances between the unknown pixel x and all the training pixels will be ranked from the lowest to the highest. Based on k ranked spectral distances, the final step is to determine which class the pixel x belongs to, and k is the number of the nearest training pixels to be considered in image classification. There are two questions that must be answered before we can accomplish the final step: (1) What is the suitable k value? (2) What is the proper method for classifying unknown pixels? In this study, since we are interested in identifying water-body class, there are actually only two classes to be determined, water and non-water, thus the training pixels either belong to water class or non-water class. Therefore, we set k to be the number of training water pixels (n w ).
With regard to the second question, two possible approaches can be used to solve this problem. (1) Count the numbers of the nearest neighbors that belong to the water class (k w ) and the non-water class (k nw ) among the k ranked nearest training pixels (k = k w + k nw ). If k w is greater than k nw , then the unknown pixel belongs to the water class, otherwise it belongs to the non-water class. (2) Compute the average spectral distance (d w ) to the k w nearest water pixels and the average distance (d nw ) to the k nw nearest non-water pixels. If d w is less than d nw , the unknown pixel belongs to the water class, otherwise it belongs to the non-water class. However, these two methods are subject to uncertainty associated with the selected k value, because a small variation in the selected k value could result in a different image classification result. To eliminate such uncertainty, in this study, we proposed to compute the sum of the inverse distances of each class among the identified k ranked nearest training pixels. For the water-body identification problem, there are only two classes, water or non-water, thus we only need to compute two sums of the inverse distances. To avoid the division by zero (i.e., as the spectral distance is zero), we first identify the minimum non-zero spectral distance (d min ) among the distances from pixel x to all training pixels. If the computed spectral distance is zero, we set the inverse spectral distance to be 2/d min , otherwise the inverse spectral distance is 1/d i , where d i is the spectral distance from pixel x to training pixel i. If the sum of the spectral distances to k w nearest water training pixels is greater than that to k nw nearest non-water training pixels, pixel x is a water pixel, otherwise it is a non-water pixel.

Assessment of Image Classification Results
To evaluate the Landsat image classification results, first, a polygon covering a portion of the lake water body and a portion of land on each GE image were defined (Figure 3, as an example), and the KNN method was then used for classifying the water body and land inside the predefined polygon. The reason for choosing the KNN method is that each GE image is an RGB image and the digital numbers (DN) in the three channels (RGB) do not necessarily correspond to the same bands as the Landsat, thus each digital number (DN) could not be converted into the spectral reflectance, and furthermore, each GE image does not contain any near-infrared or shortwave infrared band images which are required for computing water index. Therefore, the H0 method is not applicable for GE images. On the other hand, as a supervised image classification method, the KNN method with input of training data would achieve a higher accuracy than other unsupervised image classification algorithms.
training pixels were selected from for the KNN method. For example, the GE image overlaid by the predefined polygon (in white) and the identified water-land boundary (in green) and the buffer zone (in red) are shown in the left panel of Figure 3, and the corresponding Landsat-8 OLI band 5 image overlaid by the predefined polygon (in white) and the buffer zone (in red) are shown in the right column of Figure 3. The GE and Landsat-8 OLI band 5 images overlaid by the defined polygons and buffer zones for all 24 lakes are shown in the Supplementary Material section. Using the GE image classification results as the alternate ground truth data, two measures were employed in this study for assessing the Landsat image classification results: (1) relative error (RE) of the extracted water body area, and (2) overall error (OE) of the Landsat image classification: where is number of water pixels classified by the Landsat, is number of water pixels classified by the GE image, | is number of water pixels classified by the Landsat given that they are classified as water pixels by the GE image, | is number of non-water pixels classified by the Landsat given that they are classified as non-water pixels by the GE image, and is number of total pixels inside the buffer zone. These two errors are computed inside the defined buffer zone for each study area. The computed relative errors can reveal if the Landsat-extracted water body areas are overestimated (i.e., positive REs) or underestimated (i.e., negative REs). The overall errors are computed as 100% minus the overall accuracy of Landsat image classification results, as shown in Equation (10), and the overall image classification accuracies are evaluated based on error or confusion matrices [48], thus the computed overall errors can reveal the omission or commission errors in the Landsat image classifications of water and non-water pixels. Digital numbers (DN) in three RGB channels of GE images were directly used to compute the digital number distances for identifying water or non-water pixels using the KNN method as described in Section 3.3.2. The classification results were checked and corrected if man-made structures, boats, or clouds appeared in the identified water body areas through human visual inspection. Both the high spatial resolution associated with GE images and visual inspection and correction ensured a relatively high accuracy of the GE image classification results.
After the extraction of the water body inside the predefined polygon, the water-land boundary was identified for defining a buffer zone with a width of 300 m (each side has a perpendicular distance of 150 m to the water-land boundary). The buffer zones were the domains where the image classification results were evaluated. The predefined polygon over each lake was the domain where the optimal water index threshold was determined by the Otsu method, and water and non-water training pixels were selected from for the KNN method. For example, the GE image overlaid by the predefined polygon (in white) and the identified water-land boundary (in green) and the buffer zone (in red) are shown in the left panel of Figure 3, and the corresponding Landsat-8 OLI band 5 image overlaid by the predefined polygon (in white) and the buffer zone (in red) are shown in the right column of Figure 3. The GE and Landsat-8 OLI band 5 images overlaid by the defined polygons and buffer zones for all 24 lakes are shown in the Supplementary Material section.
Using the GE image classification results as the alternate ground truth data, two measures were employed in this study for assessing the Landsat image classification results: (1) relative error (RE) of the extracted water body area, and (2) overall error (OE) of the Landsat image classification: where n Lw is number of water pixels classified by the Landsat, n Gw is number of water pixels classified by the GE image, n Lw|Gw is number of water pixels classified by the Landsat given that they are classified as water pixels by the GE image, n Ln|Gn is number of non-water pixels classified by the Landsat given that they are classified as non-water pixels by the GE image, and n is number of total pixels inside the buffer zone. These two errors are computed inside the defined buffer zone for each study area. The computed relative errors can reveal if the Landsat-extracted water body areas are overestimated (i.e., positive REs) or underestimated (i.e., negative REs). The overall errors are computed as 100% minus the overall accuracy of Landsat image classification results, as shown in Equation (10), and the overall image classification accuracies are evaluated based on error or confusion matrices [48], thus the computed overall errors can reveal the omission or commission errors in the Landsat image classifications of water and non-water pixels.

Results and Discussion
This study tested four sets (i.e., ultra-blue, blue, green, and red band-based) of five different water indices (Equations (5a-c)) with three different image classification algorithms (i.e., the H0 method, the Otsu method, and the KNN method). In addition to the Level-1 Landsat data, the Level-2 Landsat surface reflectance data for each scene were also used in computing water indices for image classification and ultimately for evaluating the Landsat image classification results over 24 selected lakes across the globe (Figure 1). Therefore, in total, 4 × 5 × 3 × 2 × 24 = 2880 Landsat image classification results were evaluated in this study. The relative errors of the Landsat-extracted water body areas and overall image classification errors of these 2880 cases are listed in the Supplementary Materials section. Given the large number of cases, the boxplot was employed in this study for illustrating the results. All boxplots in this paper show the 5th percentile, 25th percentile, 50th percentile (i.e., medium), 75th percentile, and 95th percentile values, and dots represent data points outside the range of the 5th-95th percentile.

Impact of Different Landsat Products on Water Classification Results
To assess the impact of different Landsat products on the Landsat-extracted water bodies, three boxplots of water indices versus relative errors of the Landsat-extracted water body areas compared against the water body areas identified from the GE images corresponding to three different image classification methods (H0, Otsu, and KNN) using the water indices computed from the TOA reflectance (i.e., TR water index) and the surface reflectance (i.e., SR water index) are shown in Figures 4 and 5, respectively. Comparisons of these two figures indicate that, as the surface reflectance was used for computing water index, the Landsat-extracted water body areas were generally underestimated, especially as the H0 method was used for image classification; all medians of the relative errors are negative, as shown in Figure 5.  To demonstrate the cause for such underestimations associated with the SR water indices, Figure 6 shows the computed NDWIs over Chuzenji Lake using the TOA reflectance and the surface reflectance. The TR water indices for the water body inside the buffer zone (white polygon) are greater than zero, while the SR water indices for the water body inside the buffer zone (red polygon) are less than zero, thus the H0 method underestimated the water body area inside the buffer zone if the surface reflectance was used for computing water index. To demonstrate the cause for such underestimations associated with the SR water indices, Figure  6 shows the computed NDWIs over Chuzenji Lake using the TOA reflectance and the surface reflectance. The TR water indices for the water body inside the buffer zone (white polygon) are greater than zero, while the SR water indices for the water body inside the buffer zone (red polygon) are less than zero, thus the H0 method underestimated the water body area inside the buffer zone if the surface reflectance was used for computing water index. Theoretically speaking, using the surface reflectance to compute water index should yield a higher accuracy in image classification than the TR water index. However, comparisons of relative errors in the Landsat-extracted water body areas and overall image classification errors between the TR water index and SR water index for the three different image classification methods listed in Table 2 all show that the SR water index led to about 75% cases (out of 480 cases) with a worse accuracy than the TR water index, no matter which of three image classification methods (i.e., H0, Otsu, and KNN) was used. The results suggest that the water indices computed based on the Landsat current version Level-2 surface reflectance products might be subject to higher errors and uncertainties than the TR water indices in some regions; therefore, in the remainder of this paper, without specifying, all water indices were computed from the TOA reflectance (i.e., TR water indices).

Comparisons of Three Image Classification Algorithms
In addition to the boxplots of the relative errors of the Landsat-extracted water body areas versus the TR water index for three different image classification algorithms illustrated in Figure 4, the boxplots of the overall Landsat image classification errors versus the TR water index are plotted in Figure 7. Both Figures 4 and 7 suggest that the H0 method did not perform well compared to the Otsu and KNN methods because of the larger error outliers produced by the H0 method, especially as the MNDWI 2 type water index was utilized in the H0 method for image classification. According to Figures 4 and 7, the water index MNDWI 2 employed in the H0 method for image classification produced larger relative errors and overall image classification errors than any other water indices, no matter which visible light band (i.e., ultra-blue, blue, green, or red) was used in the water index MNDWI 2 . Unlike the H0 method, as the Otsu and KNN methods were employed for image classification, the water index MNDWI 2 yielded comparable relative errors in the Landsat-extracted water body areas and overall image classification errors as the other water indices. These results indicate that if the H0 method is used for classifying water and non-water pixels, the water index MNDWI 2 should be avoided, i.e., SWIR 2 should not be used in the MNDWI. Actually, as Xu (2006) first proposed the MNDWI, SWIR 1 rather than SWIR 2 was chosen for computing the MNDWI. The reason for the overestimated water body areas is that, in some cases, reflectance in SWIR 2 band for non-water pixels was less than the visible band reflectance of these non-water pixels, thereby producing positive MNDWI 2 for these non-water pixels, thus the H0 method based on the water index WI 3x overestimated water body areas, such as in Brown Lake and Lake Okeechobee (results listed in Tables ST1 and ST4 in the Supplementary Materials section).
According to Figures 4 and 7, differences in the relative errors of the Landsat-extracted water body areas and overall Landsat image classification errors between the Otsu method and the KNN method are insignificant. Table 3 presents comparisons of relative errors in the Landsat-extracted water body and overall Landsat image classification errors among three image classification algorithms (i.e., H0, Otsu, and KNN). Surprisingly, the numbers of smaller REs or OEs produced by the H0 method are slightly greater than those produced by the Otsu and KNN methods. According to Table 3, as a supervised image classification method, the KNN method is unsurprisingly better (in terms of the numbers of smaller REs and OEs) than the Otsu method, which is an unsupervised method, but surprisingly, it is not better than the H0 method. However, larger error outliers produced by the H0 as shown in Figures 4 and 7 indicate that the H0 method is more sensitive to the water index than the other two methods (Otsu and KNN). These results suggest that: (1) it is not necessary to use some supervised image classification methods for identifying water bodies from Landsat imagery given the high computational cost associated with the supervised image classification methods; (2) the unsupervised image classification algorithms such as the Otsu and H0 methods could yield comparable accuracy in the Landsat-extracted water body areas and image classification of water and non-water classes as the KNN method; and (3) although the zero-water index threshold (i.e., the H0 method) worked better in slightly more than 50% of cases compared to the automatic threshold determined by the Otsu method, the Otsu method produced less large error outliers than the H0 method as shown in Figures 4 and 7. Therefore, if there is no preference when selecting water index for classifying water and non-water classes, the Otsu method is preferable to the H0 method.

Comparisons of Twenty Water Indices
The results presented in Section 4.2 demonstrate that the accuracies of Landsat-extracted water body areas and Landsat classifications of water and non-water pixels depend on the water index used for classifying water and non-water classes, no matter which image classification method is used, especially for the H0 method, which is very sensitive to water index. That is probably one reason for some debate on the performances of various water indices in the literature. In this paper we reviewed eight relevant studies [32][33][34][35][36][37][38][39] and found that 43.75% (3.5/8) studies claimed that MNDWI was the best, 37.5% (3/8) studies claimed that NDWI was the best, and 18.75% (1.5/8) studies claimed that AWEI was the best, in terms of accuracy of Landsat image classifications. None of the water indices exhibited outstanding superiority (i.e., more than 50%) to others, which is also the case in this study. According to the relative errors of the Landsat-extracted water body areas listed in ST1 (for the H0 method) and ST2 (for the Otsu method) for five commonly used green band-based water indices, i.e., NDWI, MNDWI, MNDWI 2 , AWEI ns , and AWEI s , the corresponding percentages of rank-one in terms of accuracy among 24 lakes are 12.5%, 25%, 12.5%, 25%, 25%, and 25% for the H0 method, and 8.33%,12.5%, 25%, 25%, and 29.17% for the Otsu method, respectively.
To compare the performances of twenty water indices tested in this study, means of absolute relative errors (MARE) of Landsat-extracted water body areas and overall errors (MOE) of Landsat image classifications over 24 tested lakes for three different image classification methods (i.e., H0, Otsu, and KNN) were computed and listed in Table 4. If we only focus on the commonly used green band-based water indices, according to Table 4, we can find that AWEI sG produced both the lowest MARE and MOE for the H0 method, AWEI nsG produced both the lowest MARE and MOE for the Otsu method, and MNDWI 2G produced both the lowest MARE and MOE for the KNN method. However, if we consider all twenty water indices, Table 4 shows that the ultra-blue band-based AWEI nsuB is the best water index for the H0 method, and the ultra-blue band-based MNDWI 2uB is the best water index for both the Otsu and KNN methods, because they produced the smallest MAREs and MOEs compared to all other water indices for the same image classification algorithm. None of the red band-based water indices showed any improvement in extracting water features compared to the green band-based water indices, which is probably due to the fact that none of the 24 selected lakes in this study had high sediments loads leading to high reflectance in red light.

Conclusions
This paper addressed three important issues related to extraction of water bodies from Landsat imagery: How to collect ground truth data across the globe for validating Landsat image classification results? Which water indices (among NDWI, MNDWI, AWEI) and which image classification (unsupervised or supervised) methods are the best for extracting water bodies from Landsat images? First, this study took advantage of high-resolution satellite images archived in Google Earth to obtain 24 high-resolution (≤5 m) and cloud-free images and each image covers a portion of 24 selected lakes across the globe, and then a method was developed to generate the alternate ground truth data from the Google Earth images for properly evaluating the Landsat image classification results.
With regard to the computed water indices for identifying water pixels from Landsat imagery, in addition to the commonly used green band-based water indices (i.e., NDWI, MNDWI, and AWEI), Landsat-8 OLI's ultra-blue, blue, and red band-based water indices were also tested in this research, thus a total of 20 types of water indices were evaluated. Both Level-1 Landsat images (used for computing the top-of-atmosphere reflectance) and Level-2 Landsat surface reflectance data were utilized for computing water indices that are referred to as TR and SR water indices, respectively. With regard to the image classification, two unsupervised methods i.e., the single zero-water index threshold method (i.e., the H0 method), and Otsu's automatic threshold selection method, and the supervised KNN method were employed for conducting the image classification. Through comparing a total of 24 × 20 × 3 × 2 = 2880 image classification results with the alternate ground truth data derived from the Google Earth images, the following conclusions are drawn: (1) The top-of-atmosphere reflectance computed from the Level-1 Landsat image data are better than the current Level-2 Landsat surface reflectance products for computing water indices, because the water indices computed based on the Landsat current version Level-2 surface reflectance products might be subject to higher errors and uncertainties than the TR water indices in some regions. (2) It is not necessary to use some supervised image classification methods for identifying water bodies from Landsat imagery given the high computational cost associated with the supervised image classification methods. The unsupervised image algorithms such as the Otsu and H0 methods could yield comparable accuracy in the Landsat-extracted water body areas and image classification of water and non-water classes as the KNN method. (3) Although the zero-water index threshold (i.e., the H0 method) worked better in slightly more than 50% cases compared to the automatic threshold determined by the Otsu method, the Otsu method produced less large error outliers than the H0 method. Therefore, if there is no preference when selecting water index for classifying water and non-water classes, the Otsu method is preferable to the H0 method. (4) Among five commonly used green band-based water indices, AWEI s produced both the lowest mean absolute relative errors (MARE) in the Landsat-extracted water body areas and mean overall errors in the Landsat image classifications (MOE) for the H0 method, AWEI ns produced both the lowest MARE and MOE for the Otsu method, and MNDWI 2 produced both the lowest MARE and MOE for the KNN method. (5) Comparisons among twenty water indices over 24 lakes across the globe showed that the ultra-blue band-based AWEI nsuB is the best water index for the H0 method, and the ultra-blue band-based MNDWI 2uB is the best water index for both the Otsu and KNN methods.
In this study, none of the red band-based water indices showed any improvement in extracting water features compared to the green band-based water indices, which is probably due to the fact that none of the 24 selected lakes had high sediments loads leading to high reflectance in red light. Due to limited numbers of high-resolution satellite images archived in Google Earth that can be used for assessing the Landsat water body mapping results, the evaluations of different visible band-based water indices (including both TR and SR water indices) and three image classification algorithms were only carried out on 24 individual lakes with low turbidity, less vegetation cover, and single image acquisition dates. The performances of various water indices and image classification algorithms might differ from the results presented in this study if multiple Landsat images with different acquisition dates over a single lake are used for mapping flooded areas, because as water level declines, a number of issues such as mixed water-vegetation-sediment pixel, vegetation cover, and turbidity will arise, which deserve further research.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/12/10/1611/s1. Figure SF1: (left panel): 24 Google Earth images overlaid by the predefined polygon (in white) and the identified water-land boundary (in green) and the buffer zone (in red); (right panel): 24 Landsat-8 OLI band 5 images overlaid by the predefined polygon and the identified buffer zone. Table ST1: Relative errors (%) of Landsat-8 OLI water classification results using the zero-water index threshold (H0) method. Table ST2: Relative errors (%) of Landsat-8 OLI water classification results using the Otsu method. Table ST3: Relative errors (%) of Landsat-8 OLI water classification results using the KNN method. Table ST4: Overall errors (%) of Landsat-8 OLI water/land classification results using the zero-water index threshold (H0) method. Table ST5: Overall errors (%) of Landsat-8 OLI water/land classification results using the Otsu method. Table ST6: Overall errors (%) of Landsat-8 OLI water/land classification results using the KNN method. Table ST7: Relative errors (%) of Landsat-8 OLI SR water classification results using the zero-water index threshold (H0) method. Table ST8: Relative errors (%) of Landsat-8 OLI SR water classification results using the Otsu method. Table ST9: Relative errors (%) of Landsat-8 OLI SR water classification results using the KNN method. Table ST10: Overall errors (%) of Landsat-8 OLI SR water/land classification results using the zero-water index threshold (H0) method. Table ST11: Overall errors (%) of Landsat-8 OLI SR water/land classification results using the Otsu method. Table ST12: Overall errors (%) of Landsat-8 OLI SR water/land classification results using the KNN method.