1. Introduction
Water is the source of life and a primary factor in maintaining the sustainable development of the earth's ecological environment, and it has an important impact on public health, the living environment and economic development [1]. Therefore, timely and accurate large-scale regional water body surveys and dynamic monitoring are of great significance for water resource planning, flood control and disaster reduction [2]. In recent years, satellite remote sensing technology has developed rapidly [3]. As an important source of water body information, remote sensing images have been widely used in mapping, geography, environmental protection, military reconnaissance and other fields [4]. Most previous water resource surveys were based on medium- and low-resolution remote sensing images [5], whose limited spatial resolution made it difficult to extract small-area water bodies, such as fragmented lakes and slender rivers, in complex areas [6]. With the successful launch of high-resolution optical satellites (such as Worldview-2 and GF-2), the spatial resolution of satellite remote sensing images has improved from the meter level to the submeter level [7]. High-resolution remote sensing images contain more detailed spatial, textural, geometric and other ground feature information [8] and can thus more clearly differentiate water bodies in complex scenes. However, optical satellite remote sensing can also produce low-quality images because of cloud cover in bad weather, and long revisit periods make it difficult to collect and process real-time data [9]. Unmanned aerial vehicle (UAV) remote sensing can overcome these limitations [10]: the resolution of UAV orthophotos reaches the centimeter level, so the images present extremely rich ground object details [11]. At present, an increasing number of scholars use high-resolution optical satellite remote sensing images and UAV orthophotos to extract surface water [12].
Water body extraction methods can be mainly divided into (1) methods based on image spectral characteristics (the single-band threshold method, the multiband spectral relationship method and the water index method) [13,14,15]; (2) classifier methods (the object-oriented, decision tree, support vector machine (SVM) and random forest (RF) methods, etc.) [16,17,18,19]; and (3) deep-learning methods [20,21,22].
(1) The traditional water index method based on image spectral characteristics is widely used in water body extraction research because of its fast calculation speed, high precision and wide applicability [23]. McFeeters [24] constructed the normalized difference water index (NDWI) using the green and near-infrared bands of TM images, which suppresses vegetation information to the maximum extent and highlights water information. However, because water bodies and shadows have similar spectral characteristics, and buildings and soil in urban areas reflect more strongly than water, soil, buildings and shadows are easily confused with water. Xu [25] constructed the modified normalized difference water index (MNDWI) by replacing the near-infrared band in the NDWI with the mid-infrared band, which effectively reduces the influence of buildings and soil on water extraction, although shadow information still interferes with water detection. The optimal threshold of the water index method differs greatly between scenes, and its selection must be determined from research experience [26]; the method is therefore subjective and offers low automation [27]. Although new algorithms have been proposed to alleviate these difficulties [28,29], the improved methods still concentrate on the spectral information of remote sensing images and do not consider their spatial and textural characteristics [30]. Moreover, these methods cannot be fully applied to high-spatial-resolution remote sensing images with only four bands (red, green, blue and near-infrared), which seriously restricts the detection of small or narrow water bodies in complex regions [31]. In high-spatial-resolution imagery, the widespread shadow problem is further complicated by tall urban buildings and mountain vegetation [32].
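The two indices discussed above can be sketched in a few lines. The reflectance values and the zero threshold below are illustrative assumptions, since, as noted, the optimal threshold is scene dependent:

```python
import numpy as np

def ndwi(green, nir, eps=1e-9):
    """McFeeters' NDWI = (Green - NIR) / (Green + NIR)."""
    g, n = green.astype(np.float64), nir.astype(np.float64)
    return (g - n) / (g + n + eps)  # eps guards against division by zero

def mndwi(green, mir, eps=1e-9):
    """Xu's MNDWI replaces the NIR band with the mid-infrared (SWIR) band."""
    g, m = green.astype(np.float64), mir.astype(np.float64)
    return (g - m) / (g + m + eps)

# Illustrative reflectances: water absorbs NIR strongly, so water pixels
# score high; thresholding (here at 0, a scene-dependent choice) yields
# a binary water mask.
green = np.array([[0.30, 0.05],
                  [0.28, 0.08]])
nir   = np.array([[0.05, 0.40],
                  [0.06, 0.35]])
water_mask = ndwi(green, nir) > 0.0
```

Because shadows also suppress NIR reflectance, shadow pixels can pass this threshold just like water, which is exactly the confusion described above.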
(2) Considering the above difficulties, Li et al. [33] combined the statistical characteristics of the spectral information of image segments with the shape characteristics used for shadow removal to extract water bodies from high-resolution images at different levels. Their results improved the extraction accuracy of water bodies from high-resolution images and reduced the influence of shadows. Although this method considers the shape characteristics of high-resolution images, its degree of automation is still low, and it must be combined with a supervised classification method. Huang et al. [34] first calculated water, shadow and vegetation indices as initial pixel-level results and then combined them with an object-based machine learning method for water-type recognition in GeoEye-1 and WorldView-2 high-resolution images. The results showed that a decision tree using texture and geometric features far outperforms other machine learning classifiers and water index methods, greatly improves the automation of water body recognition and further reduces the impact of shadows in urban areas. Classifier methods remove part of the influence of features with spectra similar to water, such as shadows and buildings, although their feature extraction and classifier design are cumbersome. Their results depend excessively on the selection of limited sample sets [35] and cannot deeply mine the information characteristics of ground objects, which results in insufficient generalization ability and poor transferability across seasons and images [36].
(3) With the development of deep learning technology, end-to-end convolutional neural networks (CNNs) have shown great advantages in extracting the spatial context of images [37] and can automatically derive more abstract high-dimensional features from low-level input image features [38]. Various new deep learning models have been applied to the classification and change detection of ground objects in high-resolution remote sensing images, such as VGGNet [39], the FCN (fully convolutional network) [40], ResNet [41] and U-Net [42]. Compared with machine learning classification methods, even if the initial sample size is insufficient, deep learning methods can enrich the sample set through data augmentation [43] to achieve higher extraction accuracy. For example, Li et al. [44] used the FCN model to extract water bodies from GF-2 high-resolution remote sensing images under the condition of limited training samples, and the results were significantly better than those of the NDWI and SVM methods. Li et al. [45] combined two visual features, the gray-level co-occurrence matrix (GLCM) [46] and Gabor filtering, and input them into an improved U-Net to extract water bodies from UAV high-resolution remote sensing images; the results demonstrated the stability and accuracy of the improved U-Net combined with GLCM features. Successive deep learning networks have continuously improved classification accuracy and network performance. However, deep neural networks are also becoming deeper and structurally more complex [47], which leads to a rapid increase in the number of model parameters and in the consumption of computational resources, thereby complicating model training and hindering application in practical scenarios [48]. Accordingly, Howard et al. [49] proposed MobileNetV1 in 2017, which uses depthwise separable convolutions to construct a lightweight deep neural network that not only reduces the number of parameters and the computational load but also maintains high image classification accuracy. The subsequently proposed MobileNetV2 [50] further improved network performance by using an inverted residual structure. At present, water body surveys require lightweight networks with high efficiency, convenience and accuracy for the intelligent extraction of high-resolution remote sensing images.
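The parameter savings from depthwise separable convolution can be illustrated by simple counting; the layer sizes below are arbitrary examples, not MobileNet's actual configuration:

```python
def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution learns a k x k x c_in kernel
    # for each of the c_out output channels.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k kernel per input channel;
    # pointwise step: a 1 x 1 x c_in kernel per output channel.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 64, 128)        # 73,728 weights
sep = depthwise_separable_params(3, 64, 128)  #  8,768 weights
reduction = std / sep                         # roughly 8x fewer parameters
```

For a 3 x 3 kernel, the saving approaches a factor of about k^2 = 9 as the channel counts grow, which is why the MobileNet family can stay accurate while remaining small.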
In summary, traditional methods of extracting water bodies from high-spatial-resolution remote sensing images easily misidentify shadows and buildings as water, struggle to extract small and mountainous water bodies, and generalize poorly. Existing deep convolutional neural network models have long training times and high computational resource requirements and thus cannot meet real-time demands. In this paper, the lightweight network MobileNetV2 was used to extract water bodies in complex scenes from high-resolution GF-2 and Worldview-2 images and UAV orthophotos, covering three different sensors, and the extraction results, accuracy and model efficiency were compared with those of SVM, RF and U-Net. The purpose was to select an efficient, convenient and accurate water body extraction method for complex geographical conditions based on high-resolution remote sensing images and to provide a reference for the real-time, rapid extraction of water distribution information and the rational utilization of water resources.
5. Discussion
5.1. Parameter Sensitivity Analysis
This paper analyzes the sensitivity of the training and verification accuracy of the MobileNetV2 network to the number of epochs, as shown in Figure 10. When the number of epochs was five, the model precision was 66.7%, the verification accuracy was 96.6%, and the recall was 95.6%. As the number of epochs increased, the verification accuracy increased, and at 20 epochs the precision reached its highest value of 87.2%. However, the accuracy decreased as the number of epochs increased further, indicating that both too many and too few epochs are detrimental to the semantic segmentation task. With too few epochs, model training misses a large number of image features; with too many, the model may overfit. The model accuracy was highest only when the number of epochs was moderate, which is beneficial for obtaining fine and accurate extraction results.
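In practice, the epoch-selection logic described above amounts to keeping the checkpoint from the epoch with the best validation score rather than simply training for as long as possible. A minimal sketch, where the per-epoch values are hypothetical numbers loosely echoing the reported trend:

```python
# Hypothetical validation precision per epoch (illustrative values only,
# following the reported trend: low at 5 epochs, peaking near 20).
val_precision = {5: 0.667, 10: 0.79, 15: 0.84, 20: 0.872, 25: 0.85, 30: 0.83}

# Select the best-scoring epoch instead of the last one, which guards
# against both underfitting (too few epochs) and overfitting (too many).
best_epoch = max(val_precision, key=val_precision.get)
best_score = val_precision[best_epoch]
```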
5.2. Efficiency Comparison of the Deep Learning Models
The two deep learning models were compared to determine the efficiency of the water body extraction model. According to Formulas (1)-(2), the Params (number of parameters) and FLOPs of each deep learning model were calculated, and their average training times are shown in Table 5. The average training time of the MobileNetV2 model was much shorter than that of the U-Net model, which took approximately 645 s, because the number of parameters and the computational resources of the MobileNetV2 model were less than half those of the U-Net model. This greatly improves the efficiency of the water body extraction model and enables the rapid and convenient extraction of complex water bodies from high-resolution images.
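Since Formulas (1)-(2) are not reproduced here, the sketch below uses the common per-layer counting conventions for a convolution, a plausible form of those formulas rather than the paper's exact definitions:

```python
def conv_params(k, c_in, c_out, bias=True):
    # Weights of a k x k convolution: one k x k x c_in kernel per output
    # channel, plus an optional bias term per output channel.
    return (k * k * c_in + (1 if bias else 0)) * c_out

def conv_flops(k, c_in, c_out, h_out, w_out):
    # Multiply-accumulate count: each of the h_out * w_out output positions
    # of each output channel applies one k x k x c_in kernel. (Some
    # conventions double this to count multiplies and adds separately.)
    return k * k * c_in * c_out * h_out * w_out

# Example: a first layer common in CNNs, 3x3 conv from RGB to 32 channels
# producing a 112 x 112 output map.
p = conv_params(3, 3, 32)            # 896 parameters
f = conv_flops(3, 3, 32, 112, 112)   # ~10.8 million MACs
```

Summing these per-layer counts over a network is what makes the gap between U-Net and MobileNetV2 in Table 5 quantifiable.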
5.3. Influence of Spatial Resolution Change on the Accuracy of Water Body Extraction by the MobileNetV2 Model
To verify the influence of different spatial resolutions on the accuracy of water body extraction by the MobileNetV2 model, we grouped the training data sets of GF-2 (1 m), Worldview-2 (0.5 m) and UAV (0.2 m) images into four combinations, input them into the MobileNetV2 model for training, and then tested the generalization images of the different sensors shown in Figure 9. The generalization test results without any combined sensor training data, given in Table 4, were used as a benchmark to calculate the accuracy changes for the different combinations, which are shown in Table 6. The results show that training the model with images of different spatial resolutions has a significant impact on water body extraction. Images of different spatial resolutions imply different complexities of the background and water body samples, which may cause a lack of feature similarity between the test images from an individual sensor and training images combined from multiple sensors. For example, when the MobileNetV2 model was trained using only the lower-spatial-resolution training dataset I, the water body extraction accuracy for the UAV generalization test images decreased significantly, with the F1-score and Kappa reduced by 0.46 and 0.56, respectively. Moreover, when the training data included higher-spatial-resolution images, the MobileNetV2 model improved the accuracy of water body extraction from lower-spatial-resolution generalization test images to varying degrees. The F1-score and Kappa with training dataset II (combining the GF-2 and UAV images) were 0.08 and 0.09 higher, respectively, than those with training dataset I (combining the Worldview-2 and GF-2 images). The F1-score and Kappa with training dataset III (combining the Worldview-2 and UAV images) were 0.03 and 0.03 higher, respectively, than those with the training dataset of only Worldview-2 images.
It is worth noting that with training dataset IV, composed of images of all three spatial resolutions, the F1-score and Kappa of the MobileNetV2 model on the generalization test images improved by 0.04 and 0.04 for GF-2 and by 0.06 and 0.06 for Worldview-2, respectively. This indicates that training data combining relatively low-spatial-resolution images with high-spatial-resolution images are beneficial for improving the accuracy of water body extraction from low-spatial-resolution test images. In addition, compared with the classification results obtained with training data of only UAV images, the MobileNetV2 model trained with any combination of UAV and other sensor data could not improve the water body extraction accuracy for the UAV generalization test images. This indicates that adding lower-spatial-resolution images to the training data may reduce the accuracy of water body extraction from higher-spatial-resolution images.
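The F1-score and Kappa reported throughout these comparisons follow the standard confusion-matrix definitions; the counts in the example below are made-up illustrative values, not results from this study:

```python
def f1_and_kappa(tp, fp, fn, tn):
    """Binary F1-score and Cohen's kappa from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    n = tp + fp + fn + tn
    po = (tp + tn) / n  # observed agreement (overall accuracy)
    # Chance agreement: sum over classes of the products of the
    # predicted and reference marginal proportions.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (po - pe) / (1 - pe)
    return f1, kappa

# Illustrative counts: 8 water pixels detected correctly, 2 false alarms,
# 2 missed water pixels, 88 correct background pixels.
f1, kappa = f1_and_kappa(tp=8, fp=2, fn=2, tn=88)
```

Because Kappa discounts chance agreement, it penalizes the dominant-background imbalance typical of water masks, which is why it can drop faster than the F1-score when misclassifications increase.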
5.4. Extraction Error Analysis of Mixed Water Bodies and Small Area Water Bodies
In this paper, MobileNetV2 was used to extract complex water bodies from three different high-spatial-resolution remote sensing images, and its accuracy was compared with that of the SVM, RF and U-Net models. MobileNetV2 shows greater advantages in efficiency and accuracy than the other models, performs well with strong robustness across the test images of all three sensors, and can be further applied to actual water body investigation tasks. However, in complex environments where water bodies are mixed with other ground objects, such as irrigated farmland with crops, ditch edges with vegetation, and eutrophic water bodies, all of these models produce considerable misclassifications and omissions. We analyzed the reasons for the extraction errors affecting mixed water bodies and broken, narrow, small-area water bodies.
A decrease in spatial resolution may lead to insufficient feature extraction for mixed water and small-area water. The three images in this paper come from different sensors with different spatial resolutions. Figure 9 shows that images with different spatial resolutions yield very different extraction results. The MobileNetV2 model performed best on the UAV generalization test images, which may be related to the high spatial resolution of the UAV images: they clearly display the boundary, texture and shape of water and other mixed ground objects, so the model can clearly extract and distinguish the characteristics of different ground objects during training. However, on the Worldview-2 generalization test image, with slightly lower spatial resolution than the UAV image, MobileNetV2 mistakenly identified several farmland plots and artificial facilities as water bodies and did not completely extract small-area eutrophic water bodies and farmland irrigation water. On the GF-2 images, with lower spatial resolution still, not only farmland but also large areas of dark vegetation around lakes were misidentified as water. As seen in Table 6, these misclassifications and omissions of mixed water and small water areas reduced the F1-score and Kappa of the MobileNetV2 model from 0.82 and 0.81 for Worldview-2 to 0.64 and 0.61 for GF-2, respectively. The reason may be that as the spatial resolution coarsens, the ground area covered by a single pixel gradually increases, so information about small water bodies with broken shapes, narrow boundaries and slender ditches is lost; moreover, the spatial structure of the image changes [65], producing mixed pixels and leading to insufficient feature extraction for mixed water bodies and small water bodies when the model is trained on complex scenes. Training data from higher-resolution images may alleviate the incomplete feature extraction caused by mixed pixels. As shown in Table 6, with training dataset IV, after adding Worldview-2 and UAV training data of higher spatial resolution than the GF-2 training data, the MobileNetV2 model improved the accuracy of water extraction in the GF-2 and Worldview-2 generalization test images.
The performance of the model itself has a great influence on the water body extraction results. The SVM, RF, U-Net and MobileNetV2 models used in this paper have different structures and performance characteristics. SVM and RF are classic machine learning models with well-known strengths and weaknesses. The SVM model handles high-dimensional data well but is sensitive to missing data; it can therefore extract water bodies with obvious characteristics in each image, although its accuracy cannot be guaranteed for complex water bodies. The RF model has strong anti-noise ability and can handle missing data, and its extraction performance is relatively stable across the images, although in complex scenes it can also overfit, reducing test accuracy. As deep learning algorithms, U-Net and MobileNetV2 have significant advantages for complex, large-scale data processing. However, against complex backgrounds of mixed ground objects and dark objects with similar spectral characteristics, their feature extraction remains insufficient, and the feature extraction structure of the network must be further improved or redesigned. In the future, combining GIS spatial analysis technology or other data sources could enhance the recognition of mixed water bodies and small-area water bodies in complex scenes.
6. Conclusions
In this paper, we applied the lightweight network MobileNetV2 to extract water bodies from GF-2, Worldview-2 and UAV images. The results show that the F1-score and Kappa of the MobileNetV2 water body extraction results were higher than those of the SVM, RF and U-Net models, reaching 0.75 and 0.72 for GF-2, 0.86 and 0.85 for Worldview-2, and 0.98 and 0.98 for UAV, respectively. Our model alleviates the difficulty of feature extraction in traditional methods and is less affected by interfering features such as irrigated farmland, shadows and buildings. Moreover, this method uses far fewer parameters, much less computation and much less training time than the U-Net model, which greatly improves the efficiency of the algorithm. To verify the generalization ability of the MobileNetV2 model, we selected areas with abundant cultivated land, building facilities, shadows and complex water bodies from the images of the three sensors. The results show that our model still maintains high extraction accuracy, with F1-score and Kappa of 0.64 and 0.61 for GF-2, 0.82 and 0.81 for Worldview-2, and 0.98 and 0.98 for UAV, respectively. Additionally, we analyzed the influence of images of different spatial resolutions from multiple sensors. The analysis reveals that the MobileNetV2 model can achieve higher water body extraction accuracy by training with only the higher-spatial-resolution sample data, or with a combination of lower- and higher-spatial-resolution images, depending on the available remote sensing images. To apply the MobileNetV2 model to mixed and small-area water bodies, we will further improve the feature extraction structure of the network and combine it with GIS spatial analysis technology.