1. Introduction
Water-body extraction is of great significance in water resources monitoring, natural disaster assessment, and environmental protection [1,2,3]. These applications rely on quantifying water-body change, and accurately segmenting water-bodies from remote sensing images is therefore an important task for monitoring such change. In this paper, we aim to accurately delineate water-bodies in complicated and challenging scenes from very high resolution (VHR) remote sensing imagery. Instruments onboard satellites and aerial vehicles provide remote sensing imagery that covers large-scale water surfaces on Earth. As shown in
Figure 1, the contours of water-bodies in VHR remote sensing images are often unclear. Such degradations are typically caused by occlusion from aquatic vegetation, silt and boats near the bank, and shadows cast by surrounding tall plants, while the distinct colors commonly result from imaging conditions, water quality, and microorganisms. Hence, accurately extracting the outlines of water-bodies in complex scenes from VHR remote sensing imagery remains a great challenge.
Traditionally, existing methods for extracting water-bodies from remote sensing images mainly rely on the spectral characteristics of each band and on manually designed features, such as band threshold-based methods [4], supervised classification-based methods [5], water and vegetation index-based methods [6], and spectral relationship-based methods [7]. However, these methods pay little attention to the spatial information of water-bodies (i.e., shape, size, texture, edge, shadow, and contextual semantics), which significantly affects classification accuracy. For massive volumes of remote sensing imagery, traditional methods additionally suffer from a low degree of automation.
Convolutional neural networks (CNNs) have shown remarkable performance in image classification, object detection, and semantic segmentation [8,9,10,11,12,13], thanks to their strong feature representation ability. Long et al. [8] first proposed the fully convolutional network (FCN), which replaces the last fully connected layers with convolutional ones to achieve end-to-end semantic segmentation. Since then, end-to-end FCNs have been widely applied and extensively developed, becoming a mainstream technology in semantic segmentation and edge detection [12,13,14,15,16,17,18]. Ronneberger et al. [9] designed a contracting path and a symmetric expanding path to merge different semantic features for biomedical image segmentation. Lin et al. [10] made full use of the feature information available in the down-sampling process and used long-distance residual connections to achieve high-resolution prediction. Yu et al. [11] proposed an end-to-end deep semantic edge learning architecture for category-aware semantic edge detection. Bertasius et al. [12] presented a multi-scale bifurcated deep network that exploits object-related features as high-level cues for contour detection. Xie et al. [13] developed a novel CNN-based edge detection system by combining multi-scale and multi-level visual responses.
Recently, deep learning-based water-body segmentation from remote sensing imagery has attracted considerable attention [14,15,16,17,18]. Yu et al. [14] pioneered a CNN-based method for water-body extraction from Landsat imagery that considers both spectral and spatial information. However, this method cuts an image into small tiles for pixel-level prediction, which introduces substantial redundancy and is inefficient. Miao et al. [15] proposed a restricted receptive field deconvolution network to extract water-bodies from high-resolution remote sensing images. Li et al. [16] adopted a typical FCN model to extract water-bodies from VHR images and significantly outperformed methods based on the normalized difference water index (NDWI), the support vector machine (SVM), and the sparsity model (SM). However, these two approaches consider neither the multi-scale information from different decoder layers nor the channel relationships of the feature maps in the encoder, which leads to incomplete extraction of water-bodies in complex scenes. Duan et al. [17] proposed a novel multi-scale refinement network (MSR-Net) for water-body segmentation, which makes full use of multi-scale features for more accurate segmentation. However, the MSR-Net does not reuse high-level semantic information, and its multi-scale module does not consider the channel relationships between feature maps. Guo et al. [18] adopted a simple FCN-based method for water-body extraction and presented a multi-scale feature extractor, consisting of four dilated convolutions with different rates, deployed on top of the encoder. This method exploits only the multi-scale information of high-level semantic features and does not fully extract features at other scales. Evidently, current FCN-based water extraction studies have emphasized feature extraction and prediction optimization, yet considerable room for improvement remains. Feature fusion in FCN-based methods should combine semantically rich features with precisely localized ones, which facilitates water-body identification and the accurate extraction of water-body edges. In this work, we design our method around three aspects: feature extraction, prediction optimization, and the fusion of shallow and deep features.
Feature extraction: How to design optimal multi-layer convolution structures to extract discriminative features from images has been widely studied in visual tasks. Simonyan and Zisserman [19] stacked deep convolutional layers to enhance feature representation, which proved effective in large-scale image classification. He et al. [20] presented a residual learning framework to further deepen networks and thus improve their representation ability. Huang et al. [21] established dense connections between front and back layers to promote feature reuse. These methods mainly rely on the convolution operation itself to learn layer-wise local feature representations and use pooling operations to expand the receptive field; however, their between-layer and local-global representation ability can be further improved. Zhang et al. [22] proposed a split-attention module that focuses on the relationships between different feature groups to achieve better feature extraction. However, this approach mainly considers local information and the between-channel relationships of the feature maps at each scale, while neglecting information from larger receptive fields. To fully extract water-body features in complex scenes from VHR remote sensing imagery, we design a multi-feature extraction and combination module that gathers rich features from both small and large receptive fields as well as between-channel information, thereby increasing the feature representation ability.
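The core idea of combining small and large receptive fields with between-channel reweighting can be illustrated with a minimal NumPy sketch. This is not the paper's exact module; the branch design, window sizes, and function names here are our illustrative assumptions:

```python
import numpy as np

def box_filter(x, k):
    """Naive same-size box filter over each channel (stride 1, zero padding).
    x: feature map of shape (C, H, W); k: window size."""
    c, h, w = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def multi_receptive_field(x):
    """Combine a small and a large receptive field by pooling
    at two window sizes and summing the responses."""
    small = box_filter(x, k=3)   # local context
    large = box_filter(x, k=7)   # wider context
    return small + large

def channel_attention(x):
    """Squeeze-and-excitation-style channel reweighting from
    global statistics of each channel."""
    desc = x.mean(axis=(1, 2))          # global average pooling -> (C,)
    w = np.exp(desc - desc.max())       # softmax over channels
    w /= w.sum()
    return x * w[:, None, None]

feat = np.random.rand(4, 8, 8).astype(np.float32)
fused = channel_attention(multi_receptive_field(feat))
print(fused.shape)  # (4, 8, 8)
```

In a real network, the box filters would be learnable convolutions and the channel weights would be produced by a small fully connected sub-network; the sketch only conveys how local, wider, and channel-wise information are combined into one feature map.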
Prediction optimization: To obtain more refined semantic segmentation results, especially better edges and boundaries, many researchers optimize the rough prediction results [10,23,24,25]. Lin et al. [10] used long-distance residual connections for all multi-scale features in the down-sampling process to achieve high-resolution prediction. Qin et al. [23] designed an independent encoder-decoder, named the residual refinement module (RRM), to post-process semantic segmentation results. Yu et al. [24] proposed the refinement residual block (RRB) to optimize the feature maps. Cheng et al. [25] designed a special-purpose refinement network that performs global and local refinement of the rough prediction results. However, most of these methods may introduce redundancy due to repeated structural design. In our method, building on the features extracted by our feature extraction module, we propose a simple and effective multi-scale prediction optimization module to refine the water-body predictions from different scales.
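The general principle of aggregating predictions from several decoder scales can be sketched as follows. This is a simplified illustration under our own assumptions (nearest-neighbour upsampling and plain averaging), not the paper's actual optimization module:

```python
import numpy as np

def upsample(pred, factor):
    """Nearest-neighbour upsampling of an (H, W) probability map."""
    return pred.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_predictions(preds):
    """Fuse water-body probability maps predicted at several decoder
    scales: upsample each to the finest resolution, then average."""
    target = max(p.shape[0] for p in preds)
    up = [upsample(p, target // p.shape[0]) for p in preds]
    return np.mean(up, axis=0)

# Hypothetical coarse-to-fine predictions at 1/4, 1/2, and full resolution.
p4 = np.full((4, 4), 0.8)
p2 = np.full((8, 8), 0.6)
p1 = np.full((16, 16), 0.7)
fused = fuse_predictions([p4, p2, p1])
print(fused.shape)  # (16, 16)
```

In practice, learned weights or convolutions typically replace the plain average, so the network can decide how much each scale contributes to the final water-body map.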
Feature fusion: In semantic segmentation, shallow features provide accurate localization, while deep features carry high-level semantic information. Fusing deep and shallow features therefore plays an important role in achieving high-precision semantic segmentation [9,26,27,28]. Ronneberger et al. [9] directly concatenated shallow and deep features. Liu et al. [27] designed a feature aggregation module that uses pooling operations to learn features at multiple scales and sums them to obtain the integrated result. Our previous work [28] promoted the fusion of different semantic spatial-temporal features by learning the global information of 3D feature maps, an approach that has proven effective for fusing complicated spatial-temporal features. In this study, we extend that work by introducing a semantic feature fusion module between the encoder and decoder for 2D water-body feature fusion, which reduces semantic inconsistency.
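The notion of fusing encoder and decoder features under the guidance of their global statistics can be sketched in a few lines. The gating scheme below is our own illustrative assumption (a per-channel convex combination driven by global average pooling), not the paper's exact fusion module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_gated_fusion(enc, dec):
    """Fuse encoder and decoder features of shape (C, H, W): a gate
    derived from the *global* statistics of both maps decides, per
    channel, how much of each source to keep."""
    g = sigmoid(enc.mean(axis=(1, 2)) + dec.mean(axis=(1, 2)))  # (C,)
    g = g[:, None, None]
    return g * enc + (1.0 - g) * dec  # per-channel convex combination

enc = np.random.rand(4, 8, 8)
dec = np.random.rand(4, 8, 8)
out = global_gated_fusion(enc, dec)
print(out.shape)  # (4, 8, 8)
```

Because the gate is computed from global pooling rather than per-pixel statistics, the fused result favors globally consistent semantics, which mirrors the motivation for placing such a module between the encoder and decoder.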
In summary, this study makes three contributions:
We propose a rich feature extraction network for extracting water-bodies in complex scenes from VHR remote sensing imagery. A novel multi-feature extraction and combination module is designed that considers feature information from both small and large receptive fields as well as between-channel relationships. As the basic unit of the encoder, this module fully extracts feature information at each scale.
We present a simple and effective multi-scale prediction optimization module to achieve finer water-body segmentation by aggregating prediction results from different scales.
An encoder-decoder semantic feature fusion module is designed to promote the global consistency of feature representation between the encoder and decoder.
4. Discussion
The boundaries of water-bodies in VHR remote sensing imagery are irregular, unclear, and complex across various scenes. In view of these difficulties, our proposed MEC module adopts three different feature extraction sub-modules to obtain more comprehensive and richer information based on the spatial and channel correlations of the feature maps at each scale, compared with the other methods discussed in this paper. Our method is also applicable to other tasks, such as semantic segmentation and object detection.
To obtain both high pixel classification accuracy and accurate localization, a simple multi-scale prediction fusion (MPF) module is designed to make full use of prior knowledge, benefiting from our proposed MEC module, which provides rich, high-level water-body features in complex remote sensing imagery. This simple and effective design is much more efficient than designing an independent complex network, such as CascadePSP, and offers greater advantages in practical applications.
We designed a semantic feature fusion module (DSFF) to improve the semantic consistency between the encoder and decoder. This structure has not only proven effective in crop classification but is also effective for water-body segmentation in VHR remote sensing imagery. However, this design focuses on the global information of the feature maps while ignoring the spatial relationships between them; this will be a focus of our future work.