Water Body Extraction from Very High Spatial Resolution Remote Sensing Data Based on Fully Convolutional Networks

: This paper studies the use of the Fully Convolutional Networks (FCN) model in the extraction of water bodies from Very High spatial Resolution (VHR) optical images in the case of limited training samples. Two di ﬀ erent seasonal GaoFen-2 images with a spatial resolution of 0.8 m in the south of the Beijing metropolitan area were used to extensively validate the FCN model. Four key factors including input features, training data, transfer learning, and data augmentation related to the performance of the FCN model were empirically analyzed by using 36 combinations of various parameter settings. Our ﬁndings indicate that the FCN-based method can work as a robust and cost-e ﬀ ective tool in the extraction of water bodies from VHR images. The FCN-based method trained on a small amount of labeled L1A data can also signiﬁcantly outperform the Normalized Di ﬀ erence Water Index (NDWI) based method, the Support Vector Machine (SVM) based method, and the Sparsity Model (SM) based method, even when radiometric normalization and spatial contexts are introduced to preprocess the input data for the latter three methods. The advantages of the FCN-based method are mainly due to its capability to exploit spatial contexts in the image, especially in urban areas with mixed water and shadows. Though the settings of four key factors signiﬁcantly a ﬀ ect the performance of the FCN based method, choosing a qualiﬁed setting for the FCN model is not di ﬃ cult. Our lessons learned from the successful use of the FCN model for the extraction of water from VHR images can be extended to extract other land covers.


Introduction
The rapid development of urbanization in the last decades has greatly changed the spatial distribution and quality of the surface water body in urban areas in China. One noticeable consequence is the deterioration of water quality due to frequent human activities. Thus, timely water body spatial distribution and quality information in urban areas is important in the management of public health and the living environment [1,2]. With the need of high frequency monitoring in large areas, traditional ground surveying cannot meet current demands due to its high labor cost and low efficiency. The development of remote sensing technology in recent years has largely improved the quality and availability of Very High spatial Resolution (VHR) optical remote sensing images (usually <1 m spatial resolution), which have great potential for monitoring the fine scale surface water body in urban areas. However, developing a fast and effective way to extract surface water bodies from VHR images is still a significant problem.
Over the past decades, various water body mapping methods have been developed for remote sensing data [3][4][5][6][7][8][9][10][11][12][13][14][15]. A commonly used approach is based on water spectral indices such as the Normalized Difference Water Index (NDWI) [3,4]. The difficulty of using existing water indices is that the optimal thresholds vary a lot across regions. Although novel methods have been developed to alleviate this difficulty [5][6][7][8][9][10][11], most of them concentrate on low or middle spatial resolution optical images such as Sentinel-2 and Landsat. Low spatial resolution limits the capability of the data to detect small or narrow water bodies in urban areas [12]. Additionally, in the case of high spatial resolution optical images, the prevalence of shadows from tall buildings in urban areas makes the problem more complex, especially when the sun's zenith angle is low.
To overcome the limitation of these indices in high spatial resolution images, Huang et al. [13] made use of a group of information indexes and a machine learning technique to differentiate non-water responses such as shadows from water bodies. However, this strategy is sensitive to the quality of the threshold for the indexes and the training data for machine learning techniques. Yao et al. [14] selected the optimal threshold by a Support Vector Machine (SVM) based learning process to derive an initial result, and then refined it by a shadow removal process. However, the shadow removal method has a strong assumption of a rigid geometric relationship between a tall building and its shadow in the VHR image, which does not always hold true due to the existence of artificial targets with irregular shapes in urban areas. Recently Wu et al. [15] proposed a Two-Step Urban Water Index to extract water bodies from high spatial resolution images in a robust way. The method follows a strategy similar to [14] but it models the spectral features of water and shadows differently by combining an Urban Water Index (UWI) and an Urban Shadow Index (USI) both of which are established on the basis of spectral analysis and linear SVM. Although the effectiveness of the method has been extensively validated on high spatial resolution images from various sites, it heavily relies on atmospheric correction of the test image and is also not well validated on data from different seasons.
Different from using traditional expert designed features to capture spatial contexts in images, Convolutional Neural Networks (CNN) are shown to have great advantages in representing spatial contexts in images based on an end-to-end learning framework [16,17]. The deep learning model has gained increasing popularity in the remote sensing community nowadays [18,19]. Various novel deep learning models have been developed for scene classification and change detection on remote sensing data [20,21]. Among the developed deep learning models, the recently developed Fully Convolutional Networks (FCN) [22,23] can perform sematic segmentation and fits well to the pixel-wise classification of remote sensing data. Based on the FCN architecture, Isikdogan et al. [24] proposed a novel method to map water from Landsat data. The method can distinguish water from snow, ice, cloud, and terrain shadows, without requiring a locally varying threshold. However, the model was trained from scratch and was largely based on the freely available Landsat archives and the global inland water mask layer. The situation is quite different for VHR images in urban areas in several aspects such as the abundant availability of free training samples, the stability of spectral reflectance, the characteristics of spatial contexts, and the range of spectral coverage.
As far as we know, no previous work has seriously studied the extraction of water bodies from VHR images based on the FCN model. In this paper, we study the capability of the FCN model for the extraction of water bodies from VHR images, especially in the case of limited training samples. We extensively evaluate the performance of the FCN based method and compare it with the commonly used spectral index based method and supervised learning methods. We also empirically analyze several key factors in the FCN model implementation. Our lessons learned from this study can be extended to extract other land covers from VHR images.

Data
The study area selected was in the south of the Beijing metropolitan area. The high intensity of the human activity in this area has brought great attention to the surface water quality. Thus, fine scale and high frequency surface water body monitoring in this area has now become a routine task. Two different seasonal images from Chinese GaoFen-2, namely D302 and D609 acquired on March 2th 2017 and June 9th 2017, respectively, were used to extensively validate the FCN model by visual inspection and quantitative analysis. Each image has four bands (Blue (B), Green (G), Red (R), and Near Infrared (NIR)) and 0.8 m spatial resolution after pan sharpening. As most absolute methods for atmospheric correction of VHR images require the properties of atmosphere at the image acquisition time, which are usually difficult to obtain [25], all images used in the FCN-based method are L1A data products based on systematic radiometric correction without atmospheric correction for the purpose of generalization. For training data preparation, major water bodies in D302 were manually interpreted, and a typical region (namely W1) in D302 with abundant water and shadows was selected. Typical water and shadow areas in the W1 were also selected. For testing data preparation, 10 subsets (1500 × 1500 pixels for each) representing typical water and shadow areas, namely T1-T10, were selected based on visual inspection. Among them, T1-T6 were from D302 and T7-T10 were from D609. Meanwhile, the two full images were also used in the test to further validate the results by visual inspection. The study data and the related subsets are illustrated in Figure 1. The study area selected was in the south of the Beijing metropolitan area. The high intensity of the human activity in this area has brought great attention to the surface water quality. Thus, fine scale and high frequency surface water body monitoring in this area has now become a routine task. Two different seasonal images from Chinese GaoFen-2, namely D302 and D609 acquired on March 2th 2017 and June 9th 2017, respectively, were used to extensively validate the FCN model by visual inspection and quantitative analysis. Each image has four bands (Blue (B), Green (G), Red (R), and Near Infrared (NIR)) and 0.8 m spatial resolution after pan sharpening. As most absolute methods for atmospheric correction of VHR images require the properties of atmosphere at the image acquisition time, which are usually difficult to obtain [25], all images used in the FCN-based method are L1A data products based on systematic radiometric correction without atmospheric correction for the purpose of generalization. For training data preparation, major water bodies in D302 were manually interpreted, and a typical region (namely W1) in D302 with abundant water and shadows was selected. Typical water and shadow areas in the W1 were also selected. For testing data preparation, 10 subsets (1500×1500 pixels for each) representing typical water and shadow areas, namely T1-T10, were selected based on visual inspection. Among them, T1-T6 were from D302 and T7-T10 were from D609. Meanwhile, the two full images were also used in the test to further validate the results by visual inspection. The study data and the related subsets are illustrated in Figure 1. The two upper images are pan-sharpened GaoFen-2 images (namely D302 and D609) and the related training and test data in our experiment. W1 refers to the training subset. T1 to T10 refer to subsets for the test, and the size of each test data is 1500×1500 pixels. Water areas enclosed by light green lines in D302 indicate manually interpreted ground truths. The images are aligned based on their true geographical locations and are displayed with R (NIR)-G (R)-B (G) after being linearly stretched to [0,255]. The original spatial resolution of the image is 0.8 m. At the bottom, the W1 is Figure 1. The two upper images are pan-sharpened GaoFen-2 images (namely D302 and D609) and the related training and test data in our experiment. W1 refers to the training subset. T1 to T10 refer to subsets for the test, and the size of each test data is 1500 × 1500 pixels. Water areas enclosed by light green lines in D302 indicate manually interpreted ground truths. The images are aligned based on their true geographical locations and are displayed with R (NIR)-G (R)-B (G) after being linearly stretched to [0,255]. The original spatial resolution of the image is 0.8 m. At the bottom, the W1 is specifically enlarged with its ground truth, and typical water (in red) and shadow samples (in green) selected from the W1 are displayed on the W1.

FCN Based Method
The flowchart of our FCN-based method is illustrated at the top of Figure 2. An input image is firstly clipped into patches (256×256 in our study). The patches are fed into a trained FCN model as shown at the bottom of Figure 2. The resulting patches of the FCN inference are spatially aligned accordingly. The output of the FCN based method is a map referring to water bodies in the input VHR image. To alleviate the possible boundary effect caused by spatial alignment in the final result, patches prepared for the inference are also clipped with spatial overlaps of 32 pixels, which are dropped in the following results alignment. The strategy is illustrated simply in Figure 3.
bars above each main connector. The two numbers in the color bar refer to the sizes of the filter and stride, respectively. The output of the encoder is fed into the decoder, which contains a sequence of upsampling layers to gradually recover spatial contexts of the label. The number of features in each decoded layer is equal to the number of classes (two in our case, water and non-water). The upsampling is realized by a transposed convolution with a fixed filter defined by the bilinear interpolation. Two skip layers are also introduced to enhance the spatial detail of label recovery. Two encoding features with finer details are convoluted into features with the dimension of the number of classes, and the convoluted features are added to corresponding decoding layers. In the output layer, a normalized exponential function, namely softmax, transforms the decoded features into probabilities of water and non-water, and then argmax selects the label with the highest probability for each location and gets a pixel-wise map of the water body mask. The one-to-one convolution at the front makes the model flexible for reusing weights of the VGG-16 model trained on the ImageNet. This is useful in situations when the number of training samples is small [27].

Patching
Alignment FCN Inference Water map The range with the size of 224(256-32) for alignment after inference The patch with the size of 256 for inference upsampling is realized by a transposed convolution with a fixed filter defined by the bilinear interpolation. Two skip layers are also introduced to enhance the spatial detail of label recovery. Two encoding features with finer details are convoluted into features with the dimension of the number of classes, and the convoluted features are added to corresponding decoding layers. In the output layer, a normalized exponential function, namely softmax, transforms the decoded features into probabilities of water and non-water, and then argmax selects the label with the highest probability for each location and gets a pixel-wise map of the water body mask. The one-to-one convolution at the front makes the model flexible for reusing weights of the VGG-16 model trained on the ImageNet. This is useful in situations when the number of training samples is small [27].

VHR images
Patching Alignment FCN Inference Water map Figure 2. Illustration of the FCN-based method. The top refers to the main procedure; the bottom refers to the FCN model. In the model, the large blue boxes refer to intermediate features whose sizes are shown in each box. Each color bar above the main connector represents a specific module whose function is indicated by its color and is explained at the lower-left corner.
The range with the size of 224(256-32) for alignment after inference The patch with the size of 256 for inference The FCN model used here can be divided into four parts: a preprocessing layer, an encoder, a decoder, and an output layer. In the preprocessing layer, the input patch is optionally clipped and contrast enhanced in a random manner, and then transformed into a 3-channel patch with the same spatial size of the input patch by a one-to-one convolution. In the encoder, five layers in sequence similar to those in the VGG-16 model [26] are used to gradually encode each 3-channel patch into a group of small but informative features. The detail of each layer is illustrated by a sequence of color bars above each main connector. The two numbers in the color bar refer to the sizes of the filter and stride, respectively. The output of the encoder is fed into the decoder, which contains a sequence of upsampling layers to gradually recover spatial contexts of the label. The number of features in each decoded layer is equal to the number of classes (two in our case, water and non-water). The upsampling is realized by a transposed convolution with a fixed filter defined by the bilinear interpolation. Two skip layers are also introduced to enhance the spatial detail of label recovery. Two encoding features with finer details are convoluted into features with the dimension of the number of classes, and the convoluted features are added to corresponding decoding layers. In the output layer, a normalized Remote Sens. 2019, 11, 1162 5 of 19 exponential function, namely softmax, transforms the decoded features into probabilities of water and non-water, and then argmax selects the label with the highest probability for each location and gets a pixel-wise map of the water body mask. The one-to-one convolution at the front makes the model flexible for reusing weights of the VGG-16 model trained on the ImageNet. This is useful in situations when the number of training samples is small [27].

Experimental Setup
Our experiments are designed to validate the capability of the FCN-based method for water body extraction from VHR images and also to analyze key factors affecting its performance. The number of factors affecting the FCN model's behavior is quite large. In this study, we focus on the analysis of four key factors including input features, training data, transfer learning, and input augmentation, while setting others to default values, as listed in the Table 1.
A total of 36 models with various parameters as listed in Table 2 were trained, and the results were analyzed in terms of overall performance and for the four selected factors. To be specific, for each selected factor, each choice of factor is fully studied under different combinations of other parameters. Additionally, to evaluate the stability of training of the FCN based method, we further selected 3 typical groups of parameters to train the FCN model independently 3 times.
To further validate the effectiveness of the FCN-based method, we compared it with three classic methods, including the NDWI based method, the SVM-based method, and the Sparsity Model (SM)-based method. All components of the methods are comparatively described in Table 3. In total, we used 32 combinations for the three methods compared, as listed in Table 4. Finally, we selected the combination with the highest average accuracy on the 10 test data for each method. It should be noted that because the three methods may be sensitive to radiometric changes between training and testing images, we employed the histogram matching based method [28] and the Iterative Reweighted Multivariate Alternate Detection (IR-MAD) based method [25] to normalize all images to the same radiometric level. The three methods also do not exploit spatial context information as the FCN-based method does in nature. For the NDWI and SVM based methods, we used Simple Linear Iterative Clustering (SLIC) [29] to segment the image into objects and then used the mean spectral feature of each object as the input for training and testing. For the SM-based method, we adapted the joint sparsity model by assuming that all pixels within a small neighborhood share the same group of words but have different coefficients [30].

Factors Choices Explanations
Input feature Reusing trained weights of VGG-16 on ImageNet as shown in Figure 2.
Training data td1 Patches covering at least one pixel of water in the W1; 70 in total; all patches are extracted based on a moving step that is the same as the patch size. The following is the same td2 All patches in the W1; 340 in total td3 Patches covering at least one pixel of water in the image of D302; 1666 in total The sequential combination of random clip and contrast enhancement as shown in Figure 2.

Initializer Xavier
Initialize the weight of the network before training [31] Batch size 1 The number of patch used in each round of training Learning rate 0.00001 Key parameter in the Adam Table 2. List of experimental setups for analyzing overall performance of the FCN-based method and four key factors of the model, respectively.

Purposes Experiment Setup (the Order of Factor Is Irrelevance)
Analysis of overall performance of the FCN-based method with all combinations of selected parameters Analysis of which type of input feature is more effective Analysis of whether transfer learning is useful Analysis of which group of training data is more effective Analysis of whether data augmentation is useful Analysis of the stability of the training process Table 3. List of components of the methods in the comparison.

Component Description
water We randomly select 10000 samples from the W1 and then divide them into water and others based on their labels.

water-shadow
We manually collect typical water and shadows samples in the W1 as illustrated in the lower-left of the Figure 1. These samples lie near the boundary of decision function and is useful for discriminative methods such as the NDWI and SVM based methods [13].
norm norm refers to the IR-MAD based radiometric normalization method. Here we select the atmospherically corrected Sentinel-2 image spatially coded as T50TMK and acquired on 9 March 2017 as reference. All VHR images are normalized to the reference.
hm hm refers to histogram matching based radiometric normalization. D302 works as reference and D609 is transformed in our experiment.
grid-svm grid-svm refers to the linear SVM model with an optimized penalty C. the linear SVM model is employed here for its efficiency and effectiveness compared with the RBF based SVM model after rigorous comparisons on training samples. The C is set by a grid search.
ndwi ndwi refers to the NDWI based method which uses the mean index value of the selected water and other samples in the training data as the threshold. The index is calculated according to Equation (1).
slic slic refers to the SLIC method and is used to segment an input image into small objects based on which spatial contexts can be exploited in the water extraction. SLIC on a large image is computational expensive. Here we cut a large input image into patches with a size of 500*500 pixels, and then segment each patch into approximately 10000 regions, finally all segmented patched are combined into a single image. To mitigate the boundary effect, we overlap 200 pixels vertically and horizontally in the patch cutting.
best-sm best-sm refers to the joint sparsity model. Since it does not scale well with the size of dictionary, we randomly select 250 samples for water and others, respectively from the training data. To keep the uncertainty brought by training samples to the minimum level, we run the SM based method for 10 times and select the one with the best performance. Table 4. List of experimental setups for analyzing overall performance of the three methods in comparison. The combination in the table is composed by components listed in Table 3. For example, norm-hm-water-shadow-best-sm indicates a customized SM-based method. The method is trained on water-shadow samples from the normalized D302 image and is tested on the normalized D302 image and the preprocessed D609 image. In the preprocessing, the D609 image is firstly normalized to the Sentinel-2 image, and then has its histogram matched to the normalized D302 image.

The SM Based Method The SVM Based Method The NDWI Based Method
water-best-sm water-shadow-best-sm norm-water-best-sm norm-water-shadow-best-sm norm-hm-water-best-sm norm-hm-water-shadow-best-sm hm-water-best-sm hm-water-shadow-best-sm water-grid-svm water-shadow-grid-svm norm-water-grid-svm norm-water-shadow-grid-svm norm-hm-water-grid-svm norm-hm-water-shadow-grid-svm hm-water-grid-svm hm-water-shadow-grid-svm water-slic-grid-svm water-shadow-slic-grid-svm norm-water-slic-grid-svm norm-water-shadow-slic-grid-svm water-ndwi water-shadow-ndwi norm-water-ndwi norm-water-shadow-ndwi norm-hm-water-ndwi norm-hm-water-shadow-ndwi hm-water-ndwi hm-water-shadow-ndwi water-slic-ndwi water-shadow-slic-ndwi norm-water-slic-ndwi norm-water-shadow-slic-ndwi In all experiments, we used the F1 score to assess the performance of each method. The F1 score is the harmonic average of the precision and recall, as indicated in Equation (2) where an F1 score reaches its best value at 1 and worst at 0. It is more objective than overall accuracy in our binary classification case because a water body mostly covers a small portion of the image under evaluation.
where precision is the number of correct positive pixels divided by the number of all positive pixels returned by the method, and recall is the number of correct positive pixels divided by the number of all relevant pixels.  Figure 4 illustrates the mean and variance of F1 scores on all 10 test data based on the 36 FCN models with different parameter settings, as listed in Table 2. The mean F1 score has a large dynamic range from 0 to~0.92. There are 15 trained models with a mean F1 score above 0.8, and there are 7 trained models with a mean F1 score below 0.5. Meanwhile, the variance of the F1 score varies from 0 to~0.2 with the models. It can be inferred from Figure 4 that models with large mean values have low variances. Figure 5 illustrates the mean and variance of F1 scores on the 36 FCN models for each of the 10 test data with different parameter settings as indicated in Table 2. The mean F1 score fluctuates from 0.5 to~0.8, and the variance remains relatively stable around 0.1. T5 and T7 are the most and the least accurately processed test data, respectively. In general, results of the test data (T1-T6) from D302 where training data were collected are better than those from D609 (T7-T10).
The results imply that the performance of the FCN based method is significantly affected by the parameter setup, though the choice of a qualified parameter setting for the FCN model is not hard to find. The bad average performance on T7 may be due to the ratio of the area of water bodies to shadows in this test data being small compared to that of others. Thus few omissions or false alarms may lead to a large change of the F1 score. The situations for T1, T4, and T5 are largely the opposite. To further validate the effectiveness of the FCN-based method, we applied a few well-trained FCN models to the two full images. Consistent results were obtained based on visual inspection of the resultant water maps.  Figure 4 illustrates the mean and variance of F1 scores on all 10 test data based on the 36 FCN models with different parameter settings, as listed in Table 2. The mean F1 score has a large dynamic range from 0 to ~0.92. There are 15 trained models with a mean F1 score above 0.8, and there are 7 trained models with a mean F1 score below 0.5. Meanwhile, the variance of the F1 score varies from 0 to ~0.2 with the models. It can be inferred from Figure 4 that models with large mean values have low variances. Figure 5 illustrates the mean and variance of F1 scores on the 36 FCN models for each of the 10 test data with different parameter settings as indicated in Table 2. The mean F1 score fluctuates from ~0.5 to ~0.8, and the variance remains relatively stable around 0.1. T5 and T7 are the most and the least accurately processed test data, respectively. In general, results of the test data (T1-T6) from D302 where training data were collected are better than those from D609 (T7-T10).
The results imply that the performance of the FCN based method is significantly affected by the parameter setup, though the choice of a qualified parameter setting for the FCN model is not hard to find. The bad average performance on T7 may be due to the ratio of the area of water bodies to shadows in this test data being small compared to that of others. Thus few omissions or false alarms may lead to a large change of the F1 score. The situations for T1, T4, and T5 are largely the opposite. To further validate the effectiveness of the FCN-based method, we applied a few well-trained FCN models to the two full images. Consistent results were obtained based on visual inspection of the resultant water maps.   Table 2.  Table 2. Figure 6 shows the mean accuracies on all 10 test data and on test data T7-T10 of all combinations as listed in Table 4. Generally, the classic methods performed better on test data T1-T6 than on test data T7-T10. The boundary training samples, namely water-shadow, did not necessarily improve the accuracy of the methods, nor did the SLIC and radiometric normalization. Nevertheless, the best combinations for each of the three methods are norm-water-slic-ndwi, hm-water-grid-svm, and water-best-smc, respectively, and the mean F1 scores on all 10 test data of the three best combinations are about 0.85, 0.88 and 0.89. This implies that the strategies employed in our experiments such as radiometric normalization and segmentation are helpful for improving the three methods on the extraction of water from VHR. However, we did not find a single best strategy for all three methods. Here, we focus on the best combination from each of the three methods. Deeper analysis of the classic methods is beyond the scope of this paper. Figure 7 shows the comparison of the best FCN-based method with the best SM-based method (best-SM), the best SVM based method (best-SVM), and the best NDWI (best-NDWI) based method in terms of per test data F1 score. Figure 8 illustrates the extracted water body maps from 10 test data by all 4 methods. From Figure 7, best-FCN outperforms the other three methods, especially on the test data T2, T7, and T8. Best-SM achieves slightly better results than best-SVM, while best-NDWI is comparatively the worst in terms of overall accuracy. According to Figure 8, there are plenty of small false alarms in the results except that of best-FCN. However, the distribution of false alarms for each of the three methods differs a lot on specific test data. Overall best-FCN is obviously more effective than the other three methods, as a result of a prominent advantage for dealing with areas mixed with water and shadow. This advantage of best-FCN comes from its ability to exploit the spatial context in the image so as to discriminate the shadows of high-rising buildings from water bodies and becoming largely invariant to the radiometric change. The latter merit is quite distinct as the input to the FCN-based method is in L1A level. Figure 9 illustrates the mean sample spectrum of water in W1 and four typical test data from D609. The radiometric differences between W1 and other test data are quite clear in L1A data. hm and norm as described in Table 3 do reduce radiometric changes between images acquired in different conditions, especially for hm. Their effects will be more distinct if the whole images are taken into consideration. However, due to hm being a nonlinear point operator concentrating on matching the global histogram between different images, it may not be helpful to discriminate between water and shadow, which are located at the low end of the radiometric range and share similar spectral features that vary a lot from space and time. The situation is the same for norm, which also works on the whole image. The effect of slic on radiometric normalization is not as distinct as that of the other two strategies, but it may help in reducing spatial noises in the image.  T2  T3  T4  T5  T6  T7  T8  T9  T10 VARIANCE MEAN F1 score Figure 5. Illustration of the mean and variance of F1 scores on the 36 FCN models for each of the 10 test data with different parameter settings, as indicated in Table 2. Figure 6 shows the mean accuracies on all 10 test data and on test data T7-T10 of all combinations as listed in Table 4. Generally, the classic methods performed better on test data T1-T6 than on test data T7-T10. The boundary training samples, namely water-shadow, did not necessarily improve the accuracy of the methods, nor did the SLIC and radiometric normalization. Nevertheless, the best combinations for each of the three methods are norm-water-slic-ndwi, hm-water-grid-svm, and water-best-smc, respectively, and the mean F1 scores on all 10 test data of the three best combinations are about 0.85, 0.88 and 0.89. This implies that the strategies employed in our experiments such as radiometric normalization and segmentation are helpful for improving the three methods on the extraction of water from VHR. However, we did not find a single best strategy for all three methods. Here, we focus on the best combination from each of the three methods. Deeper analysis of the classic methods is beyond the scope of this paper. Figure 7 shows the comparison of the best FCN-based method with the best SM-based method (best-SM), the best SVM based method (best-SVM), and the best NDWI (best-NDWI) based method in terms of per test data F1 score. Figure 8 illustrates the extracted water body maps from 10 test data by all 4 methods. From Figure 7, best-FCN outperforms the other three methods, especially on the test data T2, T7, and T8. Best-SM achieves slightly better results than best-SVM, while best-NDWI is comparatively the worst in terms of overall accuracy. According to Figure 8, there are plenty of small false alarms in the results except that of best-FCN. However, the distribution of false alarms for each of the three methods differs a lot on specific test data. Overall best-FCN is obviously more effective than the other three methods, as a result of a prominent advantage for dealing with areas mixed with water and shadow. This advantage of best-FCN comes from its ability to exploit the spatial context in the image so as to discriminate the shadows of high-rising buildings from water bodies and becoming largely invariant to the radiometric change. The latter merit is quite distinct as the input to the FCN-based method is in L1A level. Figure 9 illustrates the mean sample spectrum of water in W1 and four typical test data from D609. The radiometric differences between W1 and other test data are quite clear in L1A data. hm and norm as described in Table 3 do reduce radiometric changes between images acquired in different conditions, especially for hm. Their effects will be more distinct if the whole images are taken into consideration. However, due to hm being a nonlinear point operator concentrating on matching the global histogram between different images, it may not be helpful to discriminate between water and shadow, which are located at the low end of the radiometric range and share similar spectral features that vary a lot from space and time. The situation is the same for norm, which also works on the whole image. The effect of slic on radiometric normalization is not as distinct as that of the other two strategies, but it may help in reducing spatial noises in the image. Remote Sens. 2018, 10, x FOR PEER REVIEW 10 of 20 Figure 6. Illustration of the mean and variance of the F1 scores on all 10 test data based on the 32 combinations from the three classic methods, as indicated in Table 4. water-best-sm water-grid-svm water-ndwi water-shadow-best-sm water-shadow-grid-svm water-shadow-ndwi water-shadow-slic-grid-svm water-shadow-slic-ndwi water-slic-grid-svm water-slic-ndwi norm-water-best-sm norm-water-grid-svm norm-water-ndwi norm-water-shadow-best-sm norm-water-shadow-grid-svm norm-water-shadow-ndwi norm-hm-water-best-sm norm-hm-water-grid-svm norm-hm-water-ndwi norm-hm-water-shadow-best-sm norm-hm-water-shadow-grid-… norm-hm-water-shadow-ndwi norm-water-shadow-slic-grid-… norm-water-shadow-slic-ndwi norm-water-slic-grid-svm norm-water-slic-ndwi hm-water-best-sm hm-water-grid-svm hm-water-ndwi hm-water-shadow-best-sm hm-water-shadow-grid-svm hm-water-shadow-ndwi  T1  T2  T3  T4  T5  T6  T7  T8  T9  T10 best-FCN( if2-tf1-td2-da2) best-NDWI(norm-water-slic-ndwi) best-SVM(hm-water-grid-svm) best-SM(water-best-smc) F1 score False Color Ground Truth best-NDWI best -SVM best -SM best-FCN F1 score Figure 6. Illustration of the mean and variance of the F1 scores on all 10 test data based on the 32 combinations from the three classic methods, as indicated in Table 4.  water-best-sm water-grid-svm water-ndwi water-shadow-best-sm water-shadow-grid-svm water-shadow-ndwi water-shadow-slic-grid-svm water-shadow-slic-ndwi water-slic-grid-svm water-slic-ndwi norm-water-best-sm norm-water-grid-svm norm-water-ndwi norm-water-shadow-best-sm norm-water-shadow-grid-svm norm-water-shadow-ndwi norm-hm-water-best-sm norm-hm-water-grid-svm norm-hm-water-ndwi norm-hm-water-shadow-best-sm norm-hm-water-shadow-grid-… norm-hm-water-shadow-ndwi norm-water-shadow-slic-grid-… norm-water-shadow-slic-ndwi norm-water-slic-grid-svm norm-water-slic-ndwi hm-water-best-sm hm-water-grid-svm hm-water-ndwi hm-water-shadow-best-sm hm-water-shadow-grid-svm hm-water-shadow-ndwi  T1  T2  T3  T4  T5  T6  T7  T8  T9  T10 best-FCN( if2-tf1-td2-da2) best-NDWI(norm-water-slic-ndwi) best-SVM(hm-water-grid-svm) best-SM(water-best-smc) F1 score False Color Ground Truth best-NDWI best -SVM best -SM best-FCN F1 score  Table 3A for a detailed explanation of those terms. Values on the y-axis refer to the digital number for the first three groups and reflectance with a scale factor of 10000 for the last three groups. Figure 10 shows the average F1 scores of 10 test areas for 12 groups of models with different input features, as explained in Table 2. Figure 11 shows the average F1 scores of 12 models for 10 test data with different input features. For each group of models, the model parameters used in the training are the same except for the input feature, as indicated in Figure 10. Overall, input feature 2 clearly gives better performance than that of input future 1 and input feature 3. Input feature 1 is also slightly better than input feature 3. The results indicate that the NIR band may contribute most to the water body extraction, while the blue band may be least helpful or even harmful to the water body extraction. This phenomenon can be explained based on the fact that the blue band is more sensitive to atmospheric scattering as well as suspended matters on the surface of the water when compared with the NIR band. Value raw hm slic norm norm-hm norm-slic F1 score Figure 9. Each group of spectra refers to the mean spectrum of water samples from W1 and T7-T10 (D609). From left to right, they are from L1A data, hm enhanced image, slic enhanced image, norm enhanced image, norm-hm enhanced image, and norm-slic enhanced image. Refer to Table 3 for a detailed explanation of those terms. Values on the y-axis refer to the digital number for the first three groups and reflectance with a scale factor of 10000 for the last three groups. Figure 10 shows the average F1 scores of 10 test areas for 12 groups of models with different input features, as explained in Table 2. Figure 11 shows the average F1 scores of 12 models for 10 test data with different input features. For each group of models, the model parameters used in the training are the same except for the input feature, as indicated in Figure 10. Overall, input feature 2 clearly gives better performance than that of input future 1 and input feature 3. Input feature 1 is also slightly better than input feature 3. The results indicate that the NIR band may contribute most to the water body extraction, while the blue band may be least helpful or even harmful to the water body extraction. This phenomenon can be explained based on the fact that the blue band is more sensitive to atmospheric scattering as well as suspended matters on the surface of the water when compared with the NIR band.  Table 3A for a detailed explanation of those terms. Values on the y-axis refer to the digital number for the first three groups and reflectance with a scale factor of 10000 for the last three groups. Figure 10 shows the average F1 scores of 10 test areas for 12 groups of models with different input features, as explained in Table 2. Figure 11 shows the average F1 scores of 12 models for 10 test data with different input features. For each group of models, the model parameters used in the training are the same except for the input feature, as indicated in Figure 10. Overall, input feature 2 clearly gives better performance than that of input future 1 and input feature 3. Input feature 1 is also slightly better than input feature 3. The results indicate that the NIR band may contribute most to the water body extraction, while the blue band may be least helpful or even harmful to the water body extraction. This phenomenon can be explained based on the fact that the blue band is more sensitive to atmospheric scattering as well as suspended matters on the surface of the water when compared with the NIR band. Value raw hm slic norm norm-hm norm-slic F1 score Figure 10. Average F1 scores of 10 test areas for 12 groups of models with different input features, as explained in Table 2.

Analysis of the Input Feature
Remote Sens. 2018, 10, x FOR PEER REVIEW 13 of 20 Figure 10. Average F1 scores of 10 test areas for 12 groups of models with different input features, as explained in Table 2. Figure 11. Average F1 scores of 12 models for 10 test data with different input features. Figure 12 shows the average F1 scores of 10 test areas for 12 groups of models with different training data, as explained in Table 2. Figure 13 shows the average F1 scores of 12 models for 10 test data with different training data. For each group of models, the model parameters used in the training are the same, except for training data, as indicated in Figure 12. Overall, training data 2 and training data 1 clearly show better performance than that of training data 3. Training data 2 is also slightly better than training data 1. The results largely indicate that representativeness is more important than the amount of size in terms of the quality of training data in our experiment. The high quality of training data 2 may be due to it containing samples covering water body areas as well as typical high-rising building areas with plenty of shadows in a balanced manner. The shadow is quite spectrally similar with the water body but has a different spatial context. Although the size of the samples in the training data 3 is larger than that of the samples in training data 2, the bad performance of training data 3 may be due to its lack of balance in the amount of samples covering water body areas and other low reflectance targets such as shadows. This should be taken into consideration in the application of deep learning models in remote sensing data when the total amount of training samples is relatively small.  T1  T2  T3  T4  T5  T6  T7  T8  T9  T10   if1 Figure 12 shows the average F1 scores of 10 test areas for 12 groups of models with different training data, as explained in Table 2. Figure 13 shows the average F1 scores of 12 models for 10 test data with different training data. For each group of models, the model parameters used in the training are the same, except for training data, as indicated in Figure 12. Overall, training data 2 and training data 1 clearly show better performance than that of training data 3. Training data 2 is also slightly better than training data 1. The results largely indicate that representativeness is more important than the amount of size in terms of the quality of training data in our experiment. The high quality of training data 2 may be due to it containing samples covering water body areas as well as typical high-rising building areas with plenty of shadows in a balanced manner. The shadow is quite spectrally similar with the water body but has a different spatial context. Although the size of the samples in the training data 3 is larger than that of the samples in training data 2, the bad performance of training data 3 may be due to its lack of balance in the amount of samples covering water body areas and other low reflectance targets such as shadows. This should be taken into consideration in the application of deep learning models in remote sensing data when the total amount of training samples is relatively small. Remote Sens. 2018, 10, x FOR PEER REVIEW 13 of 20 Figure 10. Average F1 scores of 10 test areas for 12 groups of models with different input features, as explained in Table 2. Figure 11. Average F1 scores of 12 models for 10 test data with different input features. Figure 12 shows the average F1 scores of 10 test areas for 12 groups of models with different training data, as explained in Table 2. Figure 13 shows the average F1 scores of 12 models for 10 test data with different training data. For each group of models, the model parameters used in the training are the same, except for training data, as indicated in Figure 12. Overall, training data 2 and training data 1 clearly show better performance than that of training data 3. Training data 2 is also slightly better than training data 1. The results largely indicate that representativeness is more important than the amount of size in terms of the quality of training data in our experiment. The high quality of training data 2 may be due to it containing samples covering water body areas as well as typical high-rising building areas with plenty of shadows in a balanced manner. The shadow is quite spectrally similar with the water body but has a different spatial context. Although the size of the samples in the training data 3 is larger than that of the samples in training data 2, the bad performance of training data 3 may be due to its lack of balance in the amount of samples covering water body areas and other low reflectance targets such as shadows. This should be taken into consideration in the application of deep learning models in remote sensing data when the total amount of training samples is relatively small.  T1  T2  T3  T4  T5  T6  T7  T8  T9  T10   if1 Table 2.

Analysis of the Training Data
Remote Sens. 2018, 10, x FOR PEER REVIEW 14 of 20 Figure 12. Average F1 scores of 10 test areas for 12 groups of models with different training data as explained in Table 2. Figure 13. Average F1 scores of 12 models for 10 test areas with different training data. Figure 14 shows the average F1 scores of 10 test data for 18 groups of models with and without transfer learning. Figure 15 shows the average F1 scores of 18 models for 10 test data with and without transfer learning. For each group of models, the model parameter used in the training is the same, except for transfer learning, as indicated in Figure 14. In nearly all cases, a model with transfer learning is better than the one without it in terms of the performance measured by the F1 score. Introduction of the transfer learning in the training process increases the F1 score by about 10% for the FCN-based method, as indicated in Figure 15. We consider that the merit of the transfer learning is based on the similarity of the spatial context between VHR images and natural images. For model training without transfer learning, the model initialization plays an important role. Xavier [24], used in our experiment, initializes weights with a properly scaled uniform distribution and has been proven to be substantially helpful for increasing the training convergence. That may be the reason that, in a few cases, trained models without transfer learning can still achieve very good results. As shown in Figure 14, the trained model with the best mean F1 score on all test data was actually achieved without transfer learning. Overall, taking the stability of the training into consideration, the inclusion of transfer learning in the FCN model training for processing VHR images is strongly suggested in practical use.  T2  T3  T4  T5  T6  T7  T8  T9  T10   td1 td2 td3 F1 score Figure 13. Average F1 scores of 12 models for 10 test areas with different training data. Figure 14 shows the average F1 scores of 10 test data for 18 groups of models with and without transfer learning. Figure 15 shows the average F1 scores of 18 models for 10 test data with and without transfer learning. For each group of models, the model parameter used in the training is the same, except for transfer learning, as indicated in Figure 14. In nearly all cases, a model with transfer learning is better than the one without it in terms of the performance measured by the F1 score. Introduction of the transfer learning in the training process increases the F1 score by about 10% for the FCN-based method, as indicated in Figure 15. We consider that the merit of the transfer learning is based on the similarity of the spatial context between VHR images and natural images. For model training without transfer learning, the model initialization plays an important role. Xavier [24], used in our experiment, initializes weights with a properly scaled uniform distribution and has been proven to be substantially helpful for increasing the training convergence. That may be the reason that, in a few cases, trained models without transfer learning can still achieve very good results. As shown in Figure 14, the trained model with the best mean F1 score on all test data was actually achieved without transfer learning. Overall, taking the stability of the training into consideration, the inclusion of transfer learning in the FCN model training for processing VHR images is strongly suggested in practical use.     Figure 16 shows the average F1 scores on 10 test data for 18 groups of models with and without data augmentation. Figure 17 shows the average F1 scores on 18 models for 10 test data with and without data augmentation. For each group of models, the model parameter used in the training is the same, except data augmentation, as indicated in Figure 16. Overall, data augmentation does not show clear advantages over the original input data in our experiment. However, trained models without data augmentation perform slightly better for test data for images where training data is collected, but the situation is vice versa for test data in another image. The result shows weak support for the effectiveness of data augmentation in the training. However, the typical contrast enhancement does not work in a distinct way as it works for natural images. This may due to the remote sensing image being relatively physical calibrated in its quantity compared with natural images and also the self-similarity of geometric structures being represented in the remote sensing image prevailing on the earth surface. The latter would confuse the trained model by targets that show similar textures of the water body but have different spectral properties from the water body such as soil and large roofs.  T2  T3  T4  T5  T6  T7  T8  T9  T10 tf1 tf2 F1 score Figure 15. Average F1 scores of 18 models for 10 test data with and without transfer learning. Figure 16 shows the average F1 scores on 10 test data for 18 groups of models with and without data augmentation. Figure 17 shows the average F1 scores on 18 models for 10 test data with and without data augmentation. For each group of models, the model parameter used in the training is the same, except data augmentation, as indicated in Figure 16. Overall, data augmentation does not show clear advantages over the original input data in our experiment. However, trained models without data augmentation perform slightly better for test data for images where training data is collected, but the situation is vice versa for test data in another image. The result shows weak support for the effectiveness of data augmentation in the training. However, the typical contrast enhancement does not work in a distinct way as it works for natural images. This may due to the remote sensing image being relatively physical calibrated in its quantity compared with natural images and also the self-similarity of geometric structures being represented in the remote sensing image prevailing on the earth surface. The latter would confuse the trained model by targets that show similar textures of the water body but have different spectral properties from the water body such as soil and large roofs.

Analysis of the Data Augmentation
1 F1 score re Figure 16. Average F1 scores of 10 test areas for 18 groups of models with and without data augmentation as explained in Table 2. Figure 16. Average F1 scores of 10 test areas for 18 groups of models with and without data augmentation as explained in Table 2.  Figure 18 shows the average F1 scores on 10 test data for 9 trained models as indicated in the last row of Table 2. It can be seen that the stability of the model training is relatively weak. This is especially true for models without transfer learning. With a better input feature such as input feature 2, the stability of the training can also be largely enhanced. The randomness of the results is mainly brought by the stochastic method exploited in the model training. The results encourage the inclusion of transfer learning and good input features to weaken the side-effect of randomness in the FCN-based method, meanwhile, any parameter setting should be tried at least twice before rejection.  T2  T3  T4  T5  T6  T7  T8  T9  T10   da1  da2 F1 score Figure 17. Average F1 scores of 18 models for 10 test areas with and without data augmentation. Figure 18 shows the average F1 scores on 10 test data for 9 trained models as indicated in the last row of Table 2. It can be seen that the stability of the model training is relatively weak. This is especially true for models without transfer learning. With a better input feature such as input feature 2, the stability of the training can also be largely enhanced. The randomness of the results is mainly brought by the stochastic method exploited in the model training. The results encourage the inclusion of transfer learning and good input features to weaken the side-effect of randomness in the FCN-based method, meanwhile, any parameter setting should be tried at least twice before rejection.

Discussion and Conclusions
In this paper, we study the use of the FCN model to extract water bodies from VHR images. To better adapt to the property of remote sensing data, we introduce a flexible one-to-one convolution layer in the typical FCN model. Two seasonal representative Chinese GaoFen-2 images with a spatial resolution of 0.8 m experimentally validated the FCN based method in the extraction of water bodies from VHR images. Meanwhile, we selected and analyzed four key factors using 36 FCN models with different parameter settings with the purpose of understanding the contribution of each key factor and how to make good choices of them in practical use.
The FCN-based method can work as a robust and effective tool in the extraction of water bodies from VHR images. If properly trained with a small number of labeled samples, the FCN based method can also significantly outperform the SM-based method, the SVM-based method, and the NDWI-based method on either capability or transferability, especially for urban areas with mixed water and shadows. The advantage of the FCN based method remains even when radiometric normalization and spatial context information are introduced to preprocess the input data for the other three methods. This is largely due to the capability of the FCN model to exploit the spatial context in VHR images well. This capability also makes the FCN-based method R2 R3 if1-tf1-td1-da1 if3-tf2-td3-da2 if2-tf2-td2-da2 F1 score Figure 18. Average F1 scores on 10 test data for 9 trained models in the last row of Table 2.

Discussion and Conclusions
In this paper, we study the use of the FCN model to extract water bodies from VHR images. To better adapt to the property of remote sensing data, we introduce a flexible one-to-one convolution layer in the typical FCN model. Two seasonal representative Chinese GaoFen-2 images with a spatial resolution of 0.8 m experimentally validated the FCN based method in the extraction of water bodies from VHR images. Meanwhile, we selected and analyzed four key factors using 36 FCN models with different parameter settings with the purpose of understanding the contribution of each key factor and how to make good choices of them in practical use.
The FCN-based method can work as a robust and effective tool in the extraction of water bodies from VHR images. If properly trained with a small number of labeled samples, the FCN based method can also significantly outperform the SM-based method, the SVM-based method, and the NDWI-based method on either capability or transferability, especially for urban areas with mixed water and shadows.
The advantage of the FCN based method remains even when radiometric normalization and spatial context information are introduced to preprocess the input data for the other three methods. This is largely due to the capability of the FCN model to exploit the spatial context in VHR images well. This capability also makes the FCN-based method withstand radiometric changes, which is a quite challenging task for typical feature extraction methods [13][14][15]. It should be noted that the high accuracy of the FCN-based method in our experiments is at the expense of efficiency compared with the NDWI based method. We still consider that the NDWI is the best choice in appropriate sceneries when efficiency is more important than precision.
Our study supports the prevalent existence of the qualified FCN models with various parameter settings. Although the randomness in the training is unavoidable due to the nature of the optimization method for the FCN, and the performance of the FCN-based method is easily affected by model parameter settings, a well-trained FCN model for water body extraction from VHR images is not hard to find and the choice of a qualified parameter setting for the FCN model varies. This eases the difficulty of the selection of the optimal parameter setting in the use of the FCN model. Empirical results encourage the selection of the input feature, the inclusion of transfer learning, and the use of training data with balanced positive and negative samples to improve the stability and accuracy of the FCN based method. These findings help to build a stable and accurate FCN-based method in real applications. Finally, due to the holistic nature of remote sensing data, the extraction of a specific type of target from VHR images covering a large area is commonly seen in real remote sensing applications. Though the appearance of the type of target in the image may diverge a lot depending on applications, our lessons learned from the successful use of the FCN model in the water body extraction from VHR images can be extended to extract other land covers especially in cases with limited training samples.
We obtained FCN models with satisfactory results based on relatively small training data. However, our experimental images were acquired with clear sky and flat terrain. These criteria may not always be fulfilled in practical use. The remote sensing image may be contaminated by vapor and haze. Shadows of mountains may be different from that of buildings. One research direction is to enrich the training data by accounting for more situations. As the manual interpretation of water body from VHR images from scratch is quite tedious work, we may collect the training data in an iterative way by using extracted results. However, the representativeness of the samples should be kept in mind in the preparation of the training data.
Our experiments were carried out on 0.8 m GaoFen-2 images. The feasibility of the proposed FCN-based method on more images from other VHR sensors with different specifications should be validated further because short-term temporal monitoring of water bodies needs collaborative data from multiple sensors in practice, but the accuracy of water body extraction is sensitive to minor spatial resolution changes [33] of the data. Very narrow rivers also are not extracted well by the FCN-based method in our experiments. We are considering improving this by introducing an advanced FCN model [34]. This will be studied in the future. Last but not least, treating water bodies as a single class like in our experiments is not enough in practical use. Our next step is to work on extracting more subtle types of water bodies with different depths and turbidity.
Author Contributions: B.Z. and L.L. came up with the original idea for the study, and L.L. and Q.S. carried out the design. L.L. was responsible for recruitment and follow-up of study participants. Z.Y. and G.C. were responsible for programming and data processing. L.L. and L.G. conceived the experiments and carried out the analysis with assistance from Z.Y., L.L. structured and drafted the manuscript.