Comparative Analysis of Machine Learning Algorithms in Automatic Identification and Extraction of Water Boundaries

Abstract: Monitoring open water bodies accurately is important for assessing the role of ecosystem services in the context of human survival and climate change. There are many methods available for water body extraction based on remote sensing images, such as the normalized difference water index (NDWI), modified NDWI (MNDWI), and machine learning algorithms. Based on Landsat-8 remote sensing images, this study examines six machine learning algorithms and three threshold methods for extracting water bodies, evaluates the transfer performance of the models when applied to remote sensing images from different periods, and compares the differences among these models. The results are as follows. (1) The algorithms require different numbers of samples to reach their optimal performance. The logistic regression algorithm requires the fewest, about 110 samples. As the number of samples increases, the remaining algorithms reach their optimum in the order of support vector machine, neural network, random forest, decision tree, and XGBoost. (2) The accuracy of each machine learning algorithm on the test set cannot represent its performance in a local area. (3) When these models are directly applied to remote sensing images from different periods, the area under curve (AUC) indicators of each machine learning algorithm for the three regions all show a significant decline, with a decrease range of 0.33–66.52%, and the differences among the algorithms' performances in the three areas are obvious. Generally, the decision tree algorithm has good transfer performance among the machine learning algorithms, with AUC indexes of 0.790, 0.518, and 0.697 in the three areas, respectively, and an average value of 0.668. The Otsu threshold algorithm is the best of the threshold methods, with AUC indexes of 0.970, 0.617, and 0.908 in the three regions, respectively, and an average AUC of 0.832.


Introduction
Water is the source of life: open water bodies account for about 74% of the earth's total surface area, water is an essential resource for the survival of all life, and it is also the most important component of living organisms [1,2]. In China, the distribution of water resources is quite uneven, and pollution is serious. How to identify water bodies efficiently and accurately has therefore become a pressing issue [3,4].
With the rapid development of aviation and aerospace technology, remote sensing technology has provided advanced support for many fields, including resource survey, environmental monitoring, mapping, and geography [5,6].
The development of remote sensing technology makes it possible to extract water information quickly and accurately, which is substantially different from conventional field survey methods employed in the past [7][8][9][10].
Monitoring open water bodies accurately is an important and basic application in remote sensing. Various water body mapping approaches have been developed to extract water bodies from multispectral images [11][12][13]. Using remote sensing images to monitor a water body is mainly based on spectral bands and each image's spatial feature, so the identification methods can be categorized into three types from different perspectives.
(1) Water body index method: This method is based on the spectral curves of water bodies, and thresholds are used to distinguish water bodies from the background [14]. Several water indexes have been proposed in the past few decades. Specifically, in 1996, McFeeters [15] introduced the normalized difference water index (NDWI) model to extract water bodies. However, this model is unable to distinguish between dark shadow and water bodies. To overcome the shortcomings of NDWI, in 2006, Xu [16] proposed the modified normalized difference water index (MNDWI) to enhance open water features in remotely sensed imagery; this model gives better results for the extraction of urban water bodies. The water body index method offers high precision and low computational cost and has been widely used in practical applications. The MNDWI of Xu has become one of the most widely used water indices in various fields, including surface water mapping, land use/cover change analyses, and ecological research [17][18][19][20].
(3) Object-based image analysis methods (OBIA): Due to the limitations of pixel-based classification methods, such as the salt and pepper phenomenon in classification results, object-based classification techniques have been increasingly applied in remote sensing classification in recent years [37,38]. Many successful cases of water body extraction using OBIA methods have been reported [39][40][41][42][43]. Given that urban functional zones (UFZs) are composed of diverse geographic objects, Du et al. [44] presented a novel object-based UFZ mapping method using very-high-resolution (VHR) remote sensing images. Based on object-oriented analysis technology and multi-source data, Guo et al. [45] proposed a multi-level classification scheme based on goals and rules to study the changes of glacier environments.
In addition, some studies also have used synthetic aperture radar (SAR) data to monitor the surface dynamics, because these data are insensitive to clouds [14,46,47]; the area of surface water can be extracted from SAR data based on textural analysis [48], change detection [49], automatic segmentation [50], and classification [51].
At present, machine learning algorithms to extract water bodies mainly include neural networks, support vector machines, and random forest algorithms. The studies carried out in the past have identified the best performing classification algorithm by comparing different classification algorithms. However, none of them provides a comprehensive comparative analysis of some popular classification algorithms [37,52].
There are few studies on the evaluation of the transfer performance of each machine learning algorithm applied to remote sensing images in different periods. Based on Landsat-8 images, this study uses machine learning algorithms such as decision tree, logistic regression, random forest, and neural network to extract water bodies. First of all, the effect of each machine learning algorithm on the test set is discussed. After that, each machine learning algorithm is applied to three different local areas, and its effect on each local area is evaluated. At last, each machine learning algorithm is applied to remote sensing images in different periods to evaluate the model transfer performance of each machine learning algorithm, and three threshold methods are compared. The results could shed light on the future work of water body extraction based on remote sensing.

Data
Landsat-8 data from the website (http://glovis.usgs.gov/ (accessed on 20 October 2021)) of the United States Geological Survey are used. Landsat-8, launched as a collaboration between the United States Geological Survey (USGS) and the National Aeronautics and Space Administration (NASA) on 11 February 2013, carries onboard the OLI push broom multispectral radiometer [53]. As shown in Table 1, the Landsat-8 OLI/TIRS imagery has 11 spectral bands in total, including eight spectral bands (i.e., three visible bands, two bands for describing aerosol, water vapor, and cirrus clouds, two short-wave infrared (SWIR) bands, and one near infrared (NIR) band) with a spatial resolution of 30 m, one panchromatic spectral band with a spatial resolution of 15 m, and two thermal spectral bands with a spatial resolution of 100 m [54]. Landsat-8 remote sensing images (path 123; row 039) of the same area acquired on 4 October 2019 and 20 October 2019 are used in our experiment. Specifically, the data from 20 October 2019 are used to establish the model and compare the effect of each algorithm, and the data from 4 October 2019 are used to examine the performance of model transfer. Three areas with different surface features are selected from the remote sensing images. As shown in Figure 1, Area1 has a large area of water with relatively simple surface object types, while Area2 has a small water area and a complex surface environment, and its water extraction is affected by extensive vegetation and mountain shadow. Area3 is located in the urban built-up area and has multiple contiguous water bodies; thus, the water extraction is affected by nearby buildings and roads.
To avoid the effects of excessive cloud and aerosol, images with little cloud cover are selected here. All original data are processed by converting the original digital number (DN) values into spectral radiance through Equation (1) [55]:

$$L_\lambda = M_L Q_{cal} + A_L \qquad (1)$$

where $L_\lambda$ is the spectral radiance (W/(m²·sr·μm)); $M_L$ is the radiance multiplicative scaling factor for the spectral band (radiance_mult_band_n from the metadata); $A_L$ is the radiance additive scaling factor for the spectral band (radiance_add_band_n from the metadata); and $Q_{cal}$ is the raw digital number (DN).
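As a minimal sketch of Equation (1), the conversion can be written as a one-line function. The function name and the scaling factors below are illustrative assumptions; in practice the factors are read from the metadata file of each scene.

```python
def dn_to_radiance(q_cal, m_l, a_l):
    """Convert a raw digital number to spectral radiance.

    Implements L_lambda = M_L * Q_cal + A_L, in W/(m^2 sr um).
    """
    return m_l * q_cal + a_l


# Hypothetical scaling factors, for illustration only; real values come
# from the radiance_mult_band_n / radiance_add_band_n metadata entries.
M_L = 0.012
A_L = -60.0

radiance = [dn_to_radiance(dn, M_L, A_L) for dn in (5000, 10000, 20000)]
```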


Pre-Processing
By adopting spectral band combinations 7/5/4, 7/4/3, 6/5/4, and 4/3/2 combined with visual interpretation, a sample dataset is selected from Landsat images for classification; the sample set contains 340 water samples and 454 non-water samples. To avoid the influences of heterogeneous categories in the subsequent classification, the ratio of other ground object samples to the water body samples remains at 1.3:1.
The characteristics of the data, such as a large correlation between multiple spectral bands in the original images and similar information and structures between different spectral bands, generally bring significant amounts of redundancy. For this reason, principal component analysis (PCA) for dimensionality reduction is applied to remove repetitive and redundant information between various spectral bands [56]. The first and second principal components in the PCA with a cumulative variance contribution of 99% are selected as classification characteristics.
Based on the PCA, four commonly used texture features, i.e., contrast, autocorrelation, dissimilarity, and entropy, are extracted. The distance is set to 1 pixel (30 m), 2 pixels (60 m), and 3 pixels (90 m), and windows of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 are tested with orientations of 0°, 45°, 90°, and 135°. The optimal combined features are selected as the characteristic bands for water body extraction. As the two parameters (window size and distance) increase, the edges in the images become blurred, and the window size has a greater effect than the distance. Considering the correlation between ground objects and the image resolution, we set the distance to 1 pixel and select a 3 × 3 window with the four orientations of 0°, 45°, 90°, and 135°.
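The four texture features are conventionally derived from a grey-level co-occurrence matrix (GLCM). The paper does not name the texture operator, so the following is only a sketch under that assumption: it builds a symmetric GLCM for a single window at a distance of 1 pixel and averages the features over the four orientations. The function names, the averaging over orientations, the zero-based grey levels, and the sample windows are all illustrative choices.

```python
import math

# Pixel offsets (row, col) for the four orientations at distance d = 1.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(window, offset, levels):
    """Symmetric, normalised grey-level co-occurrence matrix of a window."""
    dr, dc = offset
    rows, cols = len(window), len(window[0])
    counts = [[0.0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = window[r][c], window[r2][c2]
                counts[i][j] += 1.0
                counts[j][i] += 1.0  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Contrast, autocorrelation, dissimilarity and entropy of a GLCM."""
    feats = {"contrast": 0.0, "autocorrelation": 0.0,
             "dissimilarity": 0.0, "entropy": 0.0}
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            feats["contrast"] += pij * (i - j) ** 2
            feats["autocorrelation"] += pij * i * j
            feats["dissimilarity"] += pij * abs(i - j)
            if pij > 0:
                feats["entropy"] -= pij * math.log(pij)
    return feats


def window_texture(window, levels):
    """Average the four features over the 0, 45, 90 and 135 degree GLCMs."""
    per_angle = [texture_features(glcm(window, off, levels))
                 for off in OFFSETS.values()]
    return {k: sum(f[k] for f in per_angle) / len(per_angle)
            for k in per_angle[0]}


# Two 3 x 3 windows already quantised to 4 grey levels (made-up values).
flat = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
edge = [[0, 0, 3], [0, 0, 3], [0, 0, 3]]
flat_features = window_texture(flat, levels=4)
edge_features = window_texture(edge, levels=4)
```

A homogeneous window yields zero contrast and entropy, whereas a window containing an edge yields clearly higher values, which is the behaviour that makes these features useful at water boundaries.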
After the distance and window parameters are determined, the J-M distance [57,58] and the transformed divergence (T-D) [59] of the extracted texture features are used to study the separability of ground objects, which determines the characteristics ultimately used for classification. As shown in Table 2, the separability of the first principal component (PCA1) and the second principal component (PCA2) is compared in detail, and the dissimilarity feature of PCA2 has the best J-M separability. Therefore, a total of six characteristics are selected for the later classifications.
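For illustration, the J-M distance between two classes can be sketched for a single feature under the common assumption that each class is normally distributed. The scaling to the range [0, 2] is one of several conventions in use, and the sample values are made up; none of this is taken from the paper's implementation.

```python
import math
from statistics import mean, pvariance


def jm_distance(class_a, class_b):
    """Jeffries-Matusita distance between two classes for one feature.

    Assumes each class is normally distributed with non-zero variance.
    Returns a value in [0, 2]; values near 2 indicate good separability.
    """
    mu_a, mu_b = mean(class_a), mean(class_b)
    var_a, var_b = pvariance(class_a), pvariance(class_b)
    # Bhattacharyya distance for two univariate Gaussians.
    b = ((mu_a - mu_b) ** 2) / (4.0 * (var_a + var_b)) \
        + 0.5 * math.log((var_a + var_b) / (2.0 * math.sqrt(var_a * var_b)))
    return 2.0 * (1.0 - math.exp(-b))


# Made-up feature values for three groups of samples.
water = [0.10, 0.12, 0.11, 0.09]
land = [0.50, 0.55, 0.45, 0.52]
shadow = [0.11, 0.13, 0.10, 0.12]

jm_water_land = jm_distance(water, land)      # well separated
jm_water_shadow = jm_distance(water, shadow)  # strongly overlapping
```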

Research Methods
First of all, the performance of machine learning algorithms with a different sample number is discussed. During this process, the optimal parameters of the models are determined and the indices, such as precision and AUC, are used to evaluate the performances of algorithms in the test set. Then, according to spectral characteristics, the water indices are constructed, and on this basis, thresholds are selected; thus, water bodies and other ground objects are classified and identified. Moreover, machine learning methods, such as SVM, decision tree, and random forest, are used to extract water bodies. At last, the accuracy of the test results is verified for the same area at different times.

MNDWI
In 2006, Xu [16] presented the modified normalized difference water index (MNDWI) (Equation (2)), which replaces the NIR spectral band used in NDWI with the SWIR spectral band to reduce the influence of building information on water bodies. With the MNDWI water index method, the MNDWI image is binarized by selecting an appropriate threshold to extract water bodies. The choice of threshold affects the accuracy of water body extraction, and different people may choose different thresholds by subjective judgment. To reduce such influences, three methods for determining thresholds are compared and discussed in this article: (1) the user-defined threshold method, in which the threshold is determined according to the visual effect through multiple experiments; (2) the Otsu threshold method [60,61]; and (3) the adaptive threshold method, which scans the image through a 3 × 3 window.
The MNDWI is expressed as follows:

$$\mathrm{MNDWI} = \frac{\mathrm{Green} - \mathrm{SWIR}}{\mathrm{Green} + \mathrm{SWIR}} \qquad (2)$$

where Green is the radiance of the green band, which corresponds to band 3 of the Landsat-8 image, and SWIR is the radiance of the short-wave infrared band, namely band 6 of the Landsat-8 image.
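The index and the Otsu threshold can be sketched together as follows. This is a simplified illustration on a handful of made-up pixel radiances, not the implementation used in the study; the function names, the 256-bin histogram, and the convention that pixels above the threshold are water are assumptions.

```python
def mndwi(green, swir):
    """Modified NDWI for one pixel: (Green - SWIR) / (Green + SWIR)."""
    total = green + swir
    return (green - swir) / total if total != 0 else 0.0


def otsu_threshold(values, bins=256):
    """Threshold that maximises the between-class variance (Otsu)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for i, h in enumerate(hist):
        w0 += h
        sum0 += i * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_bin, best_var = i, between
    # Upper edge of the best bin, mapped back to index units.
    return lo + (best_bin + 1) * width


# Made-up (green, swir) radiances: three water pixels, then three land pixels.
pixels = [(0.08, 0.02), (0.09, 0.02), (0.07, 0.01),
          (0.10, 0.20), (0.12, 0.25), (0.08, 0.18)]
index = [mndwi(g, s) for g, s in pixels]
threshold = otsu_threshold(index)
water_mask = [v > threshold for v in index]
```

Because the threshold is recomputed from the histogram of each image, this step needs no training samples, in contrast to the supervised classifiers described next.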

Machine Learning Algorithms
In this research, six machine learning algorithms are selected, all of which use the same sample set, and the samples are divided into a training set and a test set at a ratio of 7:3. Furthermore, during model training, the relevant parameters of the models are tuned by using 10-fold cross-validation with stratified sampling of the training set. Finally, indices such as accuracy, recall rate [62], and AUC [63] are used to assess the results.
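The splitting scheme can be sketched with the standard library alone. The sample counts (340 water, 454 non-water) come from the text; the function names, the random seed, the stratification of the 7:3 split, and the round-robin fold assignment are assumptions made for illustration.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split sample indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(labels, indices, k=10, seed=0):
    """Assign the given indices to k folds, class by class (round robin)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for pos, idx in enumerate(class_indices):
            folds[pos % k].append(idx)
    return folds


labels = [1] * 340 + [0] * 454          # 1 = water, 0 = non-water
train_idx, test_idx = stratified_split(labels)
folds = stratified_folds(labels, train_idx)
```

Each fold in turn serves as the validation set while the remaining nine are used for fitting, so every training sample is validated exactly once per parameter setting.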

SVM
SVM has a simple structure but a strong generalization ability, and it is well suited to problems with high dimensionality and small sample sizes [64,65]. In this study, the Gaussian radial basis function is selected as the kernel function. By using the grid search method in combination with 10-fold cross-validation, the optimal parameters are determined as C = 3 and γ = 0.003.

Decision Tree
The decision tree determines the category of each sample in the dataset by assigning it to a leaf node. There are many methods for constructing a decision tree, which differ in the purity index used and in how sample attributes are selected for splitting [66]; ID3, C4.5, and C5.0 are commonly used. The classification and regression tree (CART) algorithm is used in this study, with pre-pruning to avoid overfitting. The main parameters are the maximum depth of the tree, the minimum number of samples in a leaf node, and the minimum number of samples required to split a node. Using the grid search method and 10-fold cross-validation, the final parameters are determined as follows: entropy is selected as the purity index, the maximum depth is 7, the minimum number of samples required to split a node is 8, and the minimum number of samples in a leaf node is 1.

Multi-Hidden-Layer Neural Network
A neural network can learn from data through a variety of learning algorithms; in general, the network is trained by iteratively modifying the connection weights and biases until the error between the output generated by the network and the expected output is smaller than a specified threshold [21]. The input characteristics are passed to the neurons of the next layer through a non-linear activation function, and the activated outputs of that layer are passed on in turn. This process is repeated until the output layer is reached. The repeated superposition of these non-linear functions gives the neural network sufficient non-linear fitting ability, and the choice of activation function affects the output of the network. A sigmoid activation function is selected, and multiple tests with cross-validation determine that the network should have four layers. Apart from the input and output layers, the two hidden layers contain eight and six neurons, respectively.
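The layer-by-layer propagation described above can be sketched as a forward pass through a network with hidden layers of eight and six neurons. The six inputs correspond to the six selected characteristics; the single sigmoid output unit, the random placeholder weights, and the input values are assumptions, and no training is performed here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained placeholder values)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, layers):
    """Propagate one sample through every layer in turn."""
    activation = features
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation


rng = random.Random(0)
sizes = [6, 8, 6, 1]   # 6 input features, hidden layers of 8 and 6, 1 output
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(sizes, sizes[1:])]
water_probability = forward([0.2, -0.1, 0.4, 0.0, 0.3, -0.2], layers)[0]
```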

Random Forest
The random forest is an ensemble method designed specifically for decision tree classifiers, in which a random selection of attributes is added to the training process. Using parameters similar to those of the decision tree, the random forest model is easy to implement and shows good effects [32,33]. In this research, the parameters are determined by using cross-validation and grid search. The main parameters of the random forest are as follows: the forest consists of 10 weak estimators (decision trees), the maximum depth is 4, and the Gini index is selected as the purity index.

XGBoost
The core of XGBoost is an ensemble algorithm based on the gradient boosting decision tree (GBDT), and it can be used for classification or regression problems. Its modelling process is as follows: a decision tree is built, and one more tree is added at each iteration, forming a strong estimator that integrates many weak models [67,68]. Its accuracy is superior to that of a single weak estimator, and its calculation speed and performance are good [69]. The main parameters are set as follows: the maximum depth of each tree is 3, 300 decision trees are used as weak estimators, and the learning rate is set to 0.01.
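For reference, the hyperparameters reported in the preceding subsections can be collected in one place. The key names below follow scikit-learn and XGBoost parameter naming; the paper does not state which libraries were used, so this mapping is an assumption, and only the values themselves come from the text.

```python
# Hyperparameters as reported in the text; key names follow scikit-learn /
# XGBoost conventions, which is an assumption about the implementation.
REPORTED_PARAMS = {
    "svm": {"kernel": "rbf", "C": 3, "gamma": 0.003},
    "decision_tree": {"criterion": "entropy", "max_depth": 7,
                      "min_samples_split": 8, "min_samples_leaf": 1},
    "neural_network": {"hidden_layer_sizes": (8, 6), "activation": "logistic"},
    "random_forest": {"n_estimators": 10, "max_depth": 4, "criterion": "gini"},
    "xgboost": {"max_depth": 3, "n_estimators": 300, "learning_rate": 0.01},
}

for name, params in REPORTED_PARAMS.items():
    print(name, params)
```

If those libraries are available, each dictionary could be unpacked into the matching estimator, for example `SVC(**REPORTED_PARAMS["svm"])` or `RandomForestClassifier(**REPORTED_PARAMS["random_forest"])`.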

Logistic Regression Algorithm
The logistic regression is a type of classification model. It establishes a regression formula for samples and a sigmoid function is used for classification. For more information, please refer to references [70,71].
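The idea can be illustrated with a one-feature model fitted by batch gradient descent. This is a teaching sketch rather than the fitting procedure used in the study; the learning rate, the number of epochs, the 0.5 cutoff, and the sample values are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit weight and bias for one feature by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(x, w, b, cutoff=0.5):
    """Label a sample as water (1) when the sigmoid output reaches the cutoff."""
    return 1 if sigmoid(w * x + b) >= cutoff else 0


# Made-up MNDWI-like feature values: positive for water, negative otherwise.
xs = [0.6, 0.5, 0.7, -0.3, -0.4, -0.2]
ys = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(xs, ys)
predictions = [predict(x, w, b) for x in xs]
```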

Effects of the Sample Number on Learning Algorithms
For each classification algorithm in machine learning, the basic requirement is that the training and test set are reliable and there are enough samples for training. In this way, a good classifier can be trained. It is assumed that the samples selected by visual interpretation are reliable: namely, the various classes of the sample points are assigned to correct labels. Based on this, a small sample is randomly selected from the training set and divided into a training set and a validation set in the proportion of 7:3. By using the accuracy of the validation set of the small sample as an evaluation index, the effects of the sample number on the classification effects of each algorithm are discussed, so as to judge whether the sample number selected is sufficient to achieve the purpose of the training model.
As demonstrated in Figure 2, the accuracies of the classification algorithms on the validation set all tend to increase with the sample number, and they show only a small error relative to the accuracy on the training set. Moreover, the two accuracies gradually converge. This indicates that there is almost no underfitting and that the parameters of each algorithm are well adjusted. The accuracy of the logistic regression algorithm improves rapidly and approaches the training-set accuracy when the sample number is still small, suggesting that there is almost no overfitting; as the sample number increases, its accuracy stabilizes. The other classification algorithms need larger samples to achieve this stability, and their accuracy fluctuates (albeit within a small range). Therefore, the number of training samples selected in the experiment can meet the needs of model training.

Analysis of Performance Indices of Machine Learning Algorithms
After testing the performance of the models when using each algorithm on sets of different sample numbers, the effect of each model in the same test set is further evaluated, so as to reflect the predictive abilities of the models to some extent and judge the generalization abilities of the algorithms. As shown in Table 3, the value of the accuracy index and recall index of each model in classifying water bodies and other ground objects are high, the accuracy index is in the range of 0.945-1, and the recall index is in the range of 0.911-1. However, the AUC index can better represent the comprehensive performances of the models and the higher the value, the better the performance [63]. There is little difference in the effect of each machine learning algorithm on the test set, and the AUC index ranges from 0.956 to 0.987; by analyzing AUC data, the logistic regression and XGBoost algorithm are found to perform best on the test set, followed by the SVM, the neural network, then the random forest, while the decision tree has (in general) the worst performance. Whether the evaluation of these algorithms in the test set can accurately represent the generalization abilities of the algorithms for classifying water bodies in the remote sensing images needs to be discussed and studied using remote sensing images acquired under different conditions. Appl. Sci. 2021, 11, x FOR PEER REVIEW 8 of 20
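The three indices can be computed from labels and classifier scores as sketched below. The AUC is written in its rank form (the probability that a randomly chosen water sample receives a higher score than a randomly chosen non-water sample); the labels, scores, and the 0.5 cutoff are made-up illustration values.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of true positive-class samples that were detected."""
    hits = sum(1 for t, p in zip(y_true, y_pred)
               if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return hits / actual


def auc(y_true, scores):
    """Probability that a random water sample outscores a non-water one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

test_accuracy = accuracy(y_true, y_pred)
test_recall = recall(y_true, y_pred)
test_auc = auc(y_true, scores)
```

Unlike accuracy and recall, the AUC does not depend on a single cutoff, which is one reason it summarizes overall performance better.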


Comparative Analysis of NDWI and Machine Learning Algorithms
The model established with the training data of 20 October 2019 is used for water extraction in the three areas on the same date. The statistical results of the AUC indicators of each algorithm are shown in Figure 3 (for more details, see Tables A1-A4 in Appendix A). In general, the XGBoost algorithm has the best accuracy, with an average AUC of 0.966 and AUC indicators of 0.985, 0.972, and 0.941 in the three regions, respectively. It is followed by the random forest algorithm, with an average AUC of 0.964 and AUC indicators of 0.985, 0.973, and 0.935. The SVM algorithm has the worst accuracy among the machine learning algorithms, with an average AUC of 0.898 and AUC indicators of 0.982, 0.789, and 0.923, respectively. When each machine learning algorithm is applied to the three local regions, the average AUC index ranges from 0.898 to 0.966 (for more details, see Table A1 in Appendix A), and the algorithms rank, in descending order of AUC, as XGBoost, random forest, decision tree, logistic regression, neural network, and SVM. However, this is inconsistent with the conclusion of Section 4.2, where there is little difference in the accuracy of the machine learning algorithms on the test set: the AUC index ranges from 0.956 to 0.987, and the algorithms rank, in descending order of AUC, as XGBoost, LR, SVM, NN, RF, and DT. This indicates that the evaluation on the test set cannot represent the effect of each algorithm applied in a local area. Among the threshold classification methods, the Otsu threshold algorithm is the best, with an average AUC of 0.957 and AUC indicators of 0.985, 0.922, and 0.964 in the three regions, respectively, followed by the custom threshold algorithm. The adaptive threshold algorithm performs worst among all algorithms, with an average AUC of only 0.764.
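The averages quoted above can be reproduced directly from the per-region values given in the text. The sketch below covers only the four methods whose regional values are stated in this section; the variable names are illustrative.

```python
from statistics import mean

# Per-region AUC values quoted in the text for the 20 October 2019 image.
AUC_BY_REGION = {
    "XGBoost": (0.985, 0.972, 0.941),
    "Random forest": (0.985, 0.973, 0.935),
    "SVM": (0.982, 0.789, 0.923),
    "Otsu threshold": (0.985, 0.922, 0.964),
}

average_auc = {name: round(mean(values), 3)
               for name, values in AUC_BY_REGION.items()}
ranking = sorted(average_auc, key=average_auc.get, reverse=True)
```

The large gap between the SVM's Area2 value (0.789) and its values in the other two areas is what pulls its average down, even though it is competitive in Area1.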

The water extraction results of each algorithm for 20 October are provided in the supplementary materials (Figure S1: Classification results of each algorithm in Area1 on October 20; Figure S2: Classification results of each algorithm in Area2 on October 20; Figure S3: Classification results of each algorithm in Area3 on October 20). As can be seen from the result maps, the salt and pepper phenomenon is very serious for the adaptive threshold and custom threshold methods, with a large amount of non-water "noise". The other algorithms have essentially the same visual effect with no obvious difference, although the edges differ slightly owing to the influence of adjacent features.

Reliability Test
To discuss the effects of the aforementioned algorithms in water body extraction from remote sensing images in different periods, a remote sensing image captured on 4 October 2019 in the same region is selected. Based on this, the water bodies are classified using the same algorithms and parameters. The aim is to verify whether the experimental results of each algorithm under different image conditions are reliable and decide whether the models are universal.
The model established with the data of 20 October 2019 is applied to the data of 4 October 2019 for water body extraction. The statistical results of the AUC indicators of each algorithm are shown in Figure 4 (for more details, see Tables A5-A8 in Appendix A). As shown in Table 4, the AUC indicators of each machine learning algorithm for the three regions all show a significant decline, with a decrease range of 0.33–66.52%. As shown in Figure 4, the differences among the algorithms' performances in the three areas are obvious. In Area2, which has a complex surface, the AUC index of the machine learning algorithms is near 0.5, which means that it is difficult to extract water bodies accurately. In Area1, which has a simple surface environment, the accuracy of all machine learning algorithms decreases, but the errors are still within an acceptable range. In general, the decision tree algorithm has the best transfer performance among the machine learning algorithms, with an average AUC of 0.668 and AUC indexes of 0.790, 0.518, and 0.697 in the three regions, respectively. The XGBoost algorithm has an average AUC of 0.631, and its AUC indexes in the three regions are 0.718, 0.512, and 0.665, respectively. The logistic regression algorithm has the worst accuracy, with an average AUC of 0.392 and AUC indexes of 0.329, 0.489, and 0.357 in the three regions, respectively, which is inconsistent with the conclusions in Sections 4.2 and 4.3. When the model is directly transferred to remote sensing images of different periods for water extraction, the generalization ability of each machine learning algorithm is different. Among the threshold classification methods, the Otsu threshold algorithm is optimal, with an average AUC of 0.832 and AUC indexes of 0.970, 0.617, and 0.908 in the three regions, respectively, which exceed those of all the machine learning algorithms.
Of the other two threshold algorithms, the custom threshold has an average AUC of 0.700, with AUC indexes of 0.842, 0.549, and 0.708 in the three regions, respectively, and the adaptive threshold has an average AUC of 0.611, with AUC indexes of 0.703, 0.506, and 0.623, respectively. All in all, for remote sensing images from different periods, the threshold method is better than most of the machine learning algorithms, because imaging is affected by clouds, the sun angle, and the sensor itself. Owing to these factors, the characteristics of remote sensing images can differ considerably even between adjacent acquisition times. Even if there is no major change in the surface features, the pixel values of the remote sensing image can change significantly. Therefore, the machine learning models trained on the data of 20 October 2019 may not be suitable for other periods. By contrast, the threshold method depends only on the image being processed, so the water extraction results for images from different periods do not affect each other.
The water extraction results of each algorithm were placed in the supplementary materials, as shown in Figure S4: Classification results of each algorithm in Area1 on October 4; Figure S5: Classification results of each algorithm in Area2 on October 4; Figure S6: Classification results of each algorithm in Area3 on October 4. It can be seen from the classification result maps that the salt and pepper phenomenon is very serious for most of the machine learning algorithms, with a large amount of non-water "noise". The visual effects of the various algorithms are also significantly different.

Discussion
This study selects six machine learning algorithms, namely neural network, support vector machine (SVM), logistic regression, random forest, decision tree, and XGBoost, together with the MNDWI water index combined with three threshold methods, to extract water bodies. Michael Schmitt [72] pointed out that for a simple surface environment, the threshold method alone can achieve satisfactory results, and when the surface environment is slightly more complicated, a supervised classification method, such as SVM, needs to be introduced. However, for supervised classification, how to choose the appropriate number of samples is a problem worthy of research. For example, Deepakrishna Somasundaram et al. [73] selected 3765 water samples and 2685 non-water samples from four Landsat-8 OLI scenes; Wei Jiang et al. [74] selected more than 10,000 water and non-water samples in each study area. Choosing such large numbers of training samples brings additional costs. To study the influence of sample size on the various algorithms, an experiment was designed in this paper, as outlined in Section 4.1. As shown in Figure 2, the number of samples required for the algorithms to reach their optimum differs greatly. The logistic regression algorithm requires the lowest number of samples, which is close to 110. The SVM algorithm performs best when the number of samples reaches 150. As the number of samples increases further, the remaining algorithms reach their optimum in the order of neural network, random forest, decision tree, and XGBoost. The primary task of water body extraction is to select a certain number of samples for training the model. The sample number requirements found for each machine learning algorithm in this paper can be used as a reference for other similar applications to reduce the cost of sample selection.
Most studies use only test set samples to select the optimal model and then apply the selected model to the final classification of images. However, Liu Yang et al. [75] pointed out that in different surface environments, various types of shadows or background noise need to be considered. For example, the influence of vegetation on water extraction matters more in humid areas than in arid areas, and in mountainous areas the extracted water is often mixed with mountain shadow. These types of background information affect different water extraction algorithms differently [61,76]. It is therefore worth discussing whether the evaluation on the test set reflects the actual generalization performance of the model, that is, whether the evaluation on the test set is consistent with the evaluation on a local area. For this reason, three local areas with different ground conditions were selected. As shown in Figure 3, in general, the simpler the ground scene, the better the classification accuracy. If the ground scene is complex, the accuracies of the algorithms differ greatly. Generally, three algorithms (decision tree, XGBoost, and Otsu) perform well in the various scenarios. Where the ground background contains mountain shadow, it is suggested to give priority to the XGBoost algorithm. Where the ground background contains roads and buildings, a logistic regression algorithm with a relatively simple model can also be tried in addition to the XGBoost or decision tree algorithms.
However, when multi-temporal extraction of water bodies is needed, the natural choice is to apply the original model directly to remote sensing images from other periods. As shown in Table 4, when the machine learning algorithms are directly used to extract water bodies from remote sensing images in different periods, the AUC indicators of each algorithm for the three regions all show a significant decline, with a decrease range of 0.33–66.52%. Generally, simple ground scenes yield higher accuracy, while complex ground scenes affect the machine learning algorithms to differing degrees. Among all the machine learning algorithms, the accuracy of the decision tree decreased the least across the three regions, with an average decrease in the AUC index of 30.43%, followed by XGBoost. Among the threshold methods, although the accuracy of the adaptive threshold changes little, it is always very low, whereas the Otsu algorithm not only has good accuracy but also a small average decline in the AUC index, 13.46%. The decision tree algorithm can still achieve relatively good classification results, and the Otsu algorithm also performs well. These experiments show that directly using a trained machine learning model to extract water from remote sensing images in different periods is not recommended. Instead, the Otsu classification result can be used as a reference, so that training samples can be selected quickly and conveniently in other periods for extracting water bodies with machine learning algorithms.
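The arithmetic behind these transfer figures can be sketched briefly. The per-region transfer AUC values below are those reported in this paper for the decision tree and Otsu algorithms; the baseline pair passed to `relative_decline` is hypothetical and only illustrates the formula. It is an assumption that the decrease is measured relative to the same-period AUC.

```python
def mean(values):
    return sum(values) / len(values)


def relative_decline(auc_same_period, auc_other_period):
    """Percentage drop in AUC relative to the same-period baseline (assumed definition)."""
    return (auc_same_period - auc_other_period) / auc_same_period * 100.0


# Transfer AUC values reported in the paper for Area1, Area2, and Area3.
transfer_auc = {
    "decision tree": [0.790, 0.518, 0.697],
    "Otsu": [0.970, 0.617, 0.908],
}

for name, aucs in transfer_auc.items():
    print(name, round(mean(aucs), 3))  # decision tree 0.668, Otsu 0.832

# Hypothetical baseline and transfer AUC, for illustration only (not from the paper).
print(round(relative_decline(0.90, 0.45), 2))  # 50.0
```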
In summary, although various algorithms can achieve satisfactory water extraction results under certain conditions, none of them can be applied to all remote sensing images and scenes. The factors affecting the classification accuracy of remote sensing images mainly include the complexity of the field landscape, the availability of data, the effectiveness of the processing method, and the experience and judgment of the processing personnel [5,76]. Therefore, on the basis of this study, when extracting water from remote sensing images, a water index (MNDWI preferred) can be used first and combined with the Otsu algorithm to classify water bodies. This result is in agreement with the results obtained by Ya'nan Zhou et al. [38], who used the NDWI image to select water samples from the input image. If the accuracy does not meet the requirements of the application, researchers can build on this classification to select the number of samples required by the chosen machine learning algorithm (Figure 2) and train the corresponding model. Among the various machine learning algorithms, XGBoost, decision tree, and logistic regression are preferentially recommended.
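The recommended first step, MNDWI followed by Otsu thresholding, can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: it assumes MNDWI is computed as (Green - SWIR) / (Green + SWIR), with Landsat-8 OLI Band 3 as green and Band 6 as SWIR, treats the image as a flat list of reflectance values, and uses a hypothetical bin count of 256. The sample reflectances are invented toy values.

```python
def mndwi(green, swir):
    """Per-pixel MNDWI = (Green - SWIR) / (Green + SWIR); 0.0 where the sum is zero."""
    return [(g - s) / (g + s) if (g + s) != 0 else 0.0 for g, s in zip(green, swir)]


def otsu_threshold(values, bins=256):
    """Otsu's method: pick the histogram split that maximizes between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1

    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0, sum0 = 0, 0
    best_var, best_bin = -1.0, 0
    for i in range(bins - 1):
        w0 += hist[i]           # pixel count of the lower class (bins 0..i)
        sum0 += i * hist[i]
        w1 = total - w0         # pixel count of the upper class
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, i
    return lo + (best_bin + 1) * width  # upper edge of the last lower-class bin


# Toy reflectances: first four pixels water-like, last four land-like (invented values).
green = [0.10, 0.12, 0.09, 0.11, 0.05, 0.06, 0.04, 0.05]
swir = [0.02, 0.03, 0.01, 0.02, 0.20, 0.25, 0.18, 0.22]

index = mndwi(green, swir)
threshold = otsu_threshold(index)
water_mask = [v > threshold for v in index]
print(round(threshold, 3), water_mask)
```

The resulting mask could then serve as a reference layer from which training samples for a machine learning classifier are drawn, as suggested above.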

Conclusions
Based on Landsat-8 images, decision tree, logistic regression, random forest, neural network, support vector machine, and XGBoost algorithms are used to extract water bodies. Firstly, the effect of each machine learning algorithm on the test set is discussed. Secondly, each machine learning algorithm is applied to three different local areas, and the consistency between the accuracy of each algorithm on the test set and its accuracy in the local areas is evaluated. Finally, each machine learning algorithm is applied to remote sensing images in different periods, the model transfer performance of each algorithm is examined, and three threshold methods are compared. The following conclusions are drawn:
(1) The number of samples required for each algorithm to reach its optimal performance differs greatly. The logistic regression algorithm requires the fewest samples, about 110. The SVM algorithm performs best when the number of samples reaches 150. As the number of samples increases further, the remaining algorithms reach their optimum in the order neural network, random forest, decision tree, and XGBoost.
(2) The accuracy of each machine learning algorithm on the test set cannot represent its accuracy in a local area, because the surface complexity is not the same in the three local areas. In Area1, with a single surface type, the AUC range is 0.982–0.985; in Area2, with a complex surface environment (abundant vegetation and mountain shadow), the AUC range is 0.789–0.973; in Area3, an urban built-up area with widely distributed water, the AUC range is 0.923–0.941.
(3) When the models are directly applied to remote sensing images in different periods, the model accuracy is greatly reduced: the AUC indicators of each machine learning algorithm for the three regions all show a significant decline, with a decrease range of 0.33–66.52%. In general, among the machine learning algorithms, the decision tree algorithm has good transfer performance, with AUC indexes of 0.790, 0.518, and 0.697 in the three regions, respectively, and an average AUC of 0.668. Among the threshold methods, the Otsu threshold algorithm performs best, with AUC indexes of 0.970, 0.617, and 0.908 in the three regions, respectively, and an average AUC of 0.832.
(4) Owing to the complex distribution of ground objects and the many factors that influence remote sensing image classification, it was difficult to collect samples of small and dispersed water bodies in this research. This limits the performance of these models in environments with many hill shadows and complex ground objects. The accuracy of these models needs to be further improved; in the future, more samples should be collected from images over different areas and periods to train the models.