#### 2.2. Self-Adaptive Pooling Convolutional Neural Networks (CNN) Architecture

A convolutional neural network (CNN) is a type of artificial neural network that draws inspiration from the biological visual cortex [21,22,23]. Compared with shallow machine-learning algorithms, it has the advantages of strong applicability, parallel processing ability, and weight sharing, which greatly reduces the number of globally optimized training parameters. CNN has become a hot topic in the field of deep learning [24]. The CNN architecture often consists of an input layer, convolution layers, pooling layers, fully connected layers and an output layer, as shown in Figure 2.

The convolutional layer consists of multiple feature maps, each composed of multiple neurons. Each neuron is connected to a local area of the previous feature map through the convolution kernel [25]. The convolution kernel is a matrix of weights (such as a 3 × 3 or 5 × 5 matrix for two-dimensional images). The convolutional layer extracts different features of the input through convolution operations: the first convolution layer extracts low-level features such as edges, lines, and corners, while higher-level convolution layers extract more advanced features.

The input image is convolved in the convolutional and filtering layers. Generally, convolutional and filtering layers require an activation function to connect [26]. We use ${G}_{i}$ to represent the feature map of the $i$th layer of the convolutional neural network. The convolution process can be described as:

${G}_{i}=f({G}_{i-1}\otimes {W}_{i}+{b}_{i})$ (1)

where ${W}_{i}$ represents the weight feature vector of the $i$th convolution kernel, the operation symbol $\otimes$ represents the convolution of the $(i-1)$th-layer feature map with the $i$th-layer kernel, and ${b}_{i}$ is the offset vector. Finally, the feature map ${G}_{i}$ of the $i$th layer is obtained through the activation function $f(\cdot)$.

There are two kinds of activation functions: linear activation functions and non-linear activation functions, such as the sigmoid, hyperbolic tangent and rectified functions. The rectified function is currently the most used in the literature because neurons with a rectified function work better to avoid saturation during the learning process, induce sparsity in the hidden units and do not face the gradient vanishing problem, which occurs when the gradient norm becomes smaller after successive updates in the back-propagation process [27]. So, in this paper, we use the rectified linear unit (ReLU) $f(x)=\mathrm{max}(0,x)$ as the activation function.
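The convolution-plus-ReLU step described above can be sketched in pure Python for a single 2-D feature map. This is a minimal illustration (real CNNs operate on multi-channel tensors, and, as in most CNN libraries, the operation is implemented as cross-correlation; flip the kernel for a strict convolution):

```python
def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)

def conv2d_valid(image, kernel, bias=0.0):
    """Valid 2-D convolution of a feature map followed by ReLU,
    i.e. one instance of G_i = f(G_{i-1} (*) W_i + b_i)."""
    n, m = len(image), len(image[0])
    k = len(kernel)
    out = []
    for r in range(n - k + 1):
        row = []
        for s in range(m - k + 1):
            acc = sum(image[r + i][s + j] * kernel[i][j]
                      for i in range(k) for j in range(k))
            row.append(relu(acc + bias))
        out.append(row)
    return out
```

Stacking such layers, each followed by pooling, yields the feature hierarchy described above.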

Pooling performs sampling along the spatial dimensions of the feature maps via a predefined function (e.g., maximum or average) over a local region. Although high-level feature maps are more abstract, they lose a lot of detail due to the pooling operation. In order to reduce the loss of image features during pooling, this paper presents an adaptive pooling model.

Due to the complexity of the objects in high-resolution images, the traditional pooling models cannot extract the image features very well. Therefore, this research considers two kinds of pooling areas in the pooling layer, as shown in Figure 3. The blank space indicates that the pixel value is 0, the shaded areas are composed of different pixel values, and A represents the maximum-value area. In Figure 3a, the features of the whole feature map are mainly concentrated at A; if pooling is done with the average pooling model, the features of the entire feature map will be weakened. In Figure 3b, the features of the feature map are mainly distributed over A, B and C; when the relationship between A, B and C is unknown, the features of the entire feature map will be weakened by using the max pooling model. This will eventually affect the extraction accuracy of water bodies in remote-sensing images.

There are two main pooling models: the max pooling model shown in Equation (2), and the average pooling model shown in Equation (3). The feature map obtained by the convolution layer is ${G}_{ij}$, the size of the pooling area is $c\times c$, the pooling step length is $c$, and ${b}_{i}$ is the offset. The max pooling model can be expressed as:

${P}_{i}=\underset{i=1,j=1}{\overset{c}{\mathrm{max}}}({G}_{ij})+{b}_{i}$ (2)

The average pooling model can be expressed as:

${P}_{i}=\frac{1}{{c}^{2}}\sum _{i=1}^{c}\sum _{j=1}^{c}{G}_{ij}+{b}_{i}$ (3)

where ${P}_{i}$ is the pooled output, and $\underset{i=1,j=1}{\overset{c}{\mathrm{max}}}({G}_{ij})$ represents the max element of the feature map $G$ in the pooled region of size $c\times c$.
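The max and average pooling models of Equations (2) and (3) can be illustrated with a short pure-Python routine that pools a 2-D feature map with a $c\times c$ window and stride $c$ (a sketch; the offset ${b}_{i}$ is passed explicitly):

```python
def pool(feature_map, c, b=0.0, mode="max"):
    """Pool a 2-D feature map with a c x c window and stride c.
    mode="max" gives max pooling; mode="avg" gives average pooling."""
    rows, cols = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, rows - c + 1, c):
        row = []
        for s in range(0, cols - c + 1, c):
            window = [feature_map[r + i][s + j]
                      for i in range(c) for j in range(c)]
            val = max(window) if mode == "max" else sum(window) / (c * c)
            row.append(val + b)   # add the offset b_i
        out.append(row)
    return out
```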

In order to reduce the loss of image features during pooling, this paper presents an adaptive pooling model according to the principle of interpolation, based on the max pooling model and the average pooling model. The model can adaptively adjust the pooling process through the pooling factor $u$ in the complex pooled area. The expression is given in Equation (4), where $u$ indicates the pooling factor. The role of $u$ is to dynamically optimize the traditional pooling model based on different pooled areas; its expression is given in Equation (5), where $a$ is the average of all elements except the max element in the pooled area, and ${b}_{\mathrm{max}}$ is the max element in the pooled area. The range of $u$ is (0, 1). The model takes into account the advantages of both the max pooling model and the average pooling model. According to the characteristics of different pooling regions, the adaptive pooling model preserves the features of the map as much as possible, so as to improve the extraction accuracy of the convolutional neural network.
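Since Equations (4) and (5) are not reproduced here, the following sketch only illustrates one plausible interpolation of the kind described. It assumes $u=a/{b}_{\mathrm{max}}$ and an output $(1-u)\cdot {b}_{\mathrm{max}}+u\cdot \overline{G}$, so that a region dominated by a single strong response behaves like max pooling ($u\to 0$) and a homogeneous region behaves like average pooling ($u\to 1$); both forms are assumptions, not the paper's exact model:

```python
def adaptive_pool_value(window):
    """One adaptively pooled value for a c x c window (at least 2 elements).
    ASSUMED forms: u = a / b_max and out = (1 - u) * max + u * mean."""
    b_max = max(window)
    others = list(window)
    others.remove(b_max)                    # drop one instance of the max element
    a = sum(others) / len(others)           # average of the remaining elements
    u = a / b_max if b_max != 0 else 0.0    # assumed pooling factor, u in (0, 1)
    avg = sum(window) / len(window)
    return (1.0 - u) * b_max + u * avg      # assumed max/average interpolation
```

Under these assumptions, a concentrated window such as [8, 0, 0, 0] pools to the max, a flat window such as [4, 4, 4, 4] pools to the average, and mixed windows fall in between.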

In order to verify that the proposed self-adaptive pooling model can reduce the loss of features during pooling, an example image with a size of 300 × 300 pixels is input into a simple four-layer network. Figure 4a is the original image, Figure 4b is the feature map obtained from the self-adaptive pooling model, Figure 4c is the feature map obtained from the max pooling model, and Figure 4d is the feature map obtained from the average pooling model. From Figure 4b–d, the feature map obtained from the self-adaptive pooling model retains obvious features, whereas the max pooling model and the average pooling model weaken the image features.

As demonstrated in Figure 5, the overall architecture of the designed self-adaptive pooling convolutional neural network (SAPCNN) contains one input patch, two convolutional layers, two self-adaptive pooling layers, and two fully connected layers. The input patch is 3@28 × 28, consisting of three channels, each with a dimension of 28 × 28. The first convolution layer is 128@24 × 24, composed of 128 filters, followed by self-adaptive pooling of dimension 2 × 2, resulting in 128@12 × 12. This is followed by another convolution layer and self-adaptive pooling: the convolution layer is 256@8 × 8, composed of 256 filters, and the self-adaptive pooling output is 256@4 × 4. All convolution layers have a stride of one pixel, and the size of the filters is 5 × 5. In this paper, the output of the last fully connected layer indicates the probabilities that the input patch belongs to water or non-water, meaning that the unit number of the output layer is two.
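The layer dimensions above follow directly from valid 5 × 5 convolutions with stride one and non-overlapping 2 × 2 pooling; a quick arithmetic check:

```python
def conv_out(size, k=5, stride=1):
    """Spatial size after a valid convolution with a k x k filter."""
    return (size - k) // stride + 1

def pool_out(size, c=2):
    """Spatial size after non-overlapping c x c pooling."""
    return size // c

# SAPCNN layer sequence (square maps; channels change 3 -> 128 -> 256)
conv1 = conv_out(28)       # 3@28x28  -> 128@24x24
pool1 = pool_out(conv1)    #          -> 128@12x12
conv2 = conv_out(pool1)    #          -> 256@8x8
pool2 = pool_out(conv2)    #          -> 256@4x4
```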

#### 2.3. Pre-Processing

The convolutional neural network extracts water bodies, but it does not guarantee continuous water bodies and water boundaries. Similarly, building shadows, vegetation shadows, and mountain shadows do not have compact contours and, hence, may be misclassified as water bodies. Therefore, a pre-processing step is required to reduce misclassified water bodies.

Superpixels refer to adjacent image blocks with similar color and brightness characteristics [28]. Superpixel segmentation groups pixels based on the similarities of features between pixels and captures the redundant information of the image, which greatly reduces the complexity of subsequent image-processing tasks.

In this work, the image is segmented into superpixels, which are used as the basic units for extracting water bodies. As a widely used superpixel algorithm [29], the simple linear iterative clustering (SLIC) algorithm can output superpixels of good quality that are compact and roughly equally sized, but it still has some problems, such as the fact that the number of superpixels must be set artificially and that superpixel edges are delineated vaguely. Moreover, because SLIC obtains initial cluster centers by dividing the image into several equal-size grids and its search space is limited to a local region [30], the superpixels produced cannot adhere to weak water boundaries well and the water bodies will be over-segmented. In this paper, the SLIC algorithm is improved by affinity propagation clustering and by expanding the search space.

#### 2.3.1. Color Space Transformation

Generally speaking, the color of water bodies is black or azure, with low reflectivity and high saturation. According to the features of the reflection spectrum of water bodies, a water body's region is prominent in bands B1, B2, and B4 of the data used in this study. Analogous to the RGB color model, a color space transformation to the hue, saturation, and intensity (HSI) color model is first performed using these three bands [31]. The transformation from the RGB to the HSI color model is expressed as follows:

where $R$, $G$, and $B$ are the values of the B1, B2, and B4 channels of the input remote-sensing image, and $H$, $S$, and $I$ are the values of the hue, saturation, and intensity components in the HSI space.
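A sketch of the standard RGB-to-HSI conversion (the paper's exact equations are displayed as images and may differ in normalization; hue here is returned in radians):

```python
import math

def rgb_to_hsi(r, g, b):
    """Standard RGB -> HSI conversion; inputs in [0, 1],
    returns (h, s, i) with h in radians in [0, 2*pi)."""
    i = (r + g + b) / 3.0
    s = 0.0 if (r + g + b) == 0 else 1.0 - 3.0 * min(r, g, b) / (r + g + b)
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        h = 0.0                                   # hue undefined for gray pixels
    else:
        h = math.acos(max(-1.0, min(1.0, num / den)))
        if b > g:
            h = 2.0 * math.pi - h                 # lower half of the hue circle
    return h, s, i
```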

Figure 6 shows the HSI color space of an example remote-sensing image: Figure 6a is the original RGB color image, Figure 6b is the intensity component image, Figure 6c is the hue component image, and Figure 6d is the saturation component image. From Figure 6b–d, it can be seen that the water-body region is prominent in the $H$ and $S$ components. Therefore, the $H$ and $S$ components are used in our improved SLIC algorithm.

#### 2.3.2. Adaptive Simple Linear Iterative Clustering (A-SLIC) Algorithm

In the standard SLIC algorithm, the number of superpixels, as well as the initial clustering, must be set artificially. In this paper, the idea of the affinity propagation algorithm is introduced to reduce the dependence of the segmentation on initial conditions.

Usually, a weighted similarity measure combining color and spatial proximity is needed for the SLIC algorithm. In this study, the spatial distance between the $i$th pixel and the $j$th pixel is expressed as follows:

where ${c}_{j}$ is the cluster center of the $j$th cluster.

We define the color difference between the $i$th and $j$th pixels as:

The similarity measure between the $i$th pixel and the $j$th cluster center ${c}_{j}$ is expressed as follows:

where $S$ is the area of the $j$th cluster in the current loop, and the parameter $\alpha $ controls the relative importance of color similarity versus spatial proximity.

By defining the attribution function (Equation (13)) and the attraction function (Equation (15)), the number and location of the cluster centers are adjusted during the iteration to complete the adaptive superpixel segmentation. The attribution function reflects the possibility that pixel $i$ attracts pixel $j$ into its cluster [32]. The attribution function is expressed as:

The iteration relationship of the attribution function is expressed as:

The attraction function reflects the possibility of pixel $j$ attracting pixel $i$ into its cluster [33]. The attraction function is expressed as:

The iteration relationship of the attraction function is expressed as:

where $s(i,j)=-d(i,j)$ is the similarity between pixel $i$ and pixel $j$, $s(i,{j}^{\prime})=-d(i,{j}^{\prime})$ is the similarity between pixel $i$ and a pixel ${j}^{\prime}\ne j$, and $t$ is the number of iterations.

Using both the attraction and attribution functions, two types of messages are continuously transmitted to possible clustering centers to increase their likelihood of becoming cluster centers. The larger the sum of $\alpha (i,j)$ and $\beta (i,j)$, the more likely pixel $j$ is a cluster center, and the greater the probability that pixel $i$ belongs to its cluster; in that case, the point is updated as a new cluster center. In order to reduce the computational complexity, the image is first divided into $n$ regions, and $\alpha (i,j)$ and $\beta (i,j)$ are calculated within each local area. In this study, the main process of the A-SLIC algorithm is as follows:

**Step 1.** For an image containing $M$ pixels, if the size of each pre-divided region is $N$, then the number of regions is $n$. Each pre-divided area is labeled as $\eta $. In this paper, $\alpha (i,j)$ and $\beta (i,j)$ are initialized to zero, and $t$ is initialized to one.

**Step 2.** The HSI transformation is performed on each pre-divided area. In the $\eta $th region, according to Equation (10), the similarity between each pair of pixels is calculated in turn.

**Step 3.** According to Equations (14) and (16), the sum of ${\beta}^{t}(i,j)$ and ${\alpha}^{t}(i,j)$ is calculated and the iteration begins.

**Step 4.** If ${\beta}^{t}(i,j)$ and ${\alpha}^{t}(i,j)$ no longer change, or the maximum number of iterations is reached, the iteration is terminated. The point where the sum of ${\beta}^{t}(i,j)$ and ${\alpha}^{t}(i,j)$ is maximal is regarded as a cluster center (${R}_{i}^{\eta}$, where $i=1,2,3\cdots {W}_{\eta}$).

**Step 5.** Repeat steps 3 to 4 until the entire image is traversed, and adaptively determine the number of superpixels (${R}^{\prime}={\displaystyle {\displaystyle \sum}_{\eta =1}^{n}{W}_{\eta}}$). In this paper, the HSI values of each cluster center are used as its features. Finally, the superpixel segmentation is complete.
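The attraction/attribution iteration of steps 3–4 follows the standard affinity propagation message-passing scheme. A minimal pure-Python sketch (with damping, which practical implementations add for stability; the similarity matrix and preference construction here are illustrative):

```python
def affinity_propagation(S, damping=0.5, iters=200):
    """Alternate responsibility (attraction) and availability (attribution)
    updates on a similarity matrix S; S[k][k] holds the exemplar preference."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]   # responsibilities r(i,k)
    A = [[0.0] * n for _ in range(n)]   # availabilities   a(i,k)
    for _ in range(iters):
        for i in range(n):  # r(i,k) = s(i,k) - max_{k' != k}(a(i,k') + s(i,k'))
            vals = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                m = max(v for kk, v in enumerate(vals) if kk != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - m)
        for k in range(n):  # a(i,k) = min(0, r(k,k) + sum_{i' != i,k} max(0, r(i',k)))
            pos = [max(0.0, R[ip][k]) for ip in range(n)]
            tot = sum(pos)
            for i in range(n):
                new = (tot - pos[k] if i == k
                       else min(0.0, R[k][k] + tot - pos[i] - pos[k]))
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    labels = [max(exemplars, key=lambda k: S[i][k]) if exemplars else -1
              for i in range(n)]
    for k in exemplars:
        labels[k] = k
    return exemplars, labels

# toy demo: two tight 1-D groups; similarity = negative squared distance
pts = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2]
S = [[-(a - b) ** 2 for b in pts] for a in pts]
pref = sorted(S[i][j] for i in range(6) for j in range(6) if i != j)[15]
for k in range(6):
    S[k][k] = pref          # median-like preference on the diagonal
exemplars, labels = affinity_propagation(S)
```

The number of exemplars (here, superpixel seeds) emerges from the data and the preference values rather than being fixed in advance, which is exactly the property the A-SLIC algorithm exploits.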

#### 2.4. Network Semi-Supervised Training and Water Extraction

Convolutional neural network training requires a large number of samples, but building a sample library requires a lot of time and manpower. In this paper, semi-supervised training is proposed: we use principal component analysis (PCA) to initialize the network structure [9], and then the entire network is fine-tuned using the water label data.

Assume the input image set has $N$ scenes, each of size $m\times n$, and the convolution filter size is ${g}_{1}\times {g}_{2}$. In the $i$th scene of the training images, all image blocks of size ${g}_{1}\times {g}_{2}$ are extracted and expressed in vector form as ${X}_{i}=\{{x}_{i,1},{x}_{i,2},{x}_{i,3}\cdots ,{x}_{i,nm}\}$. The algorithm then removes the mean of each ${x}_{i,nm}$, giving the vector form ${\overline{X}}_{i}=\{{\overline{x}}_{i,1},{\overline{x}}_{i,2},{\overline{x}}_{i,3}\cdots ,{\overline{x}}_{i,nm}\}$. So the image blocks of the training data can be expressed as [9]:

where $i$ is the index of the scene image.
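The block-extraction and mean-removal step can be sketched as follows (pure Python, one scene; patch-wise mean removal as commonly used in PCA-based filter initialization):

```python
def mean_removed_patches(image, g1, g2):
    """Extract all g1 x g2 blocks of a 2-D image, flatten each to a vector,
    and subtract that vector's own mean (the X_i -> X-bar_i step)."""
    n, m = len(image), len(image[0])
    patches = []
    for r in range(n - g1 + 1):
        for s in range(m - g2 + 1):
            v = [image[r + i][s + j] for i in range(g1) for j in range(g2)]
            mu = sum(v) / len(v)
            patches.append([x - mu for x in v])
    return patches
```

The covariance of these vectors is what the PCA step below diagonalizes to obtain the initial filters.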

The principal component analysis method minimizes the reconstruction error to solve for the feature vectors:

where ${I}_{H}$ is a unit matrix and $V$ is the set of $H$ feature vectors of the covariance matrix ($X{X}^{\mathrm{T}}$); $V$ represents the main features of the input image blocks. The ${W}_{h}$ filters of the convolutional neural network are initialized by the principal component analysis, which can be expressed as follows:

where ${m}_{{g}_{1}{g}_{2}}({V}_{h})$ indicates that the vector ${V}_{h}$ is mapped to the filter ${W}_{h}$, and ${V}_{h}$ represents the $h$th main feature of the image.

In the training stage, we use a semi-supervised method to train the network. First, the images of the training set are cut into blocks of the same size as the filter (5 × 5) according to Equation (17) to create the training data set. According to Equations (18) and (19), principal component analysis is used to obtain the initialized filter weights. Training is carried out by optimizing the logistic regression objective using stochastic gradient descent with a mini-batch size of 128 and a momentum of 0.8. The training is regularized by weight decay set to 0.0001 and by dropout for all fully connected layers with the dropout ratio set to 0.1. The learning rate starts from 0.01 and is divided by 10 when the error plateaus. Finally, the algorithm fine-tunes the entire network using the water label data to complete the final network training. Through training, an SAPCNN classifier with two class predictions is generated for the extraction of urban water bodies.
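The reported hyperparameters correspond to a classic momentum update with L2 weight decay; a single parameter's step can be sketched as follows (the authors do not specify the exact update form, so this classic form is an assumption):

```python
def sgd_step(w, grad, v, lr=0.01, momentum=0.8, weight_decay=0.0001):
    """One SGD update with momentum and L2 weight decay (assumed classic form).
    Returns the updated weight and velocity."""
    v = momentum * v - lr * (grad + weight_decay * w)
    return w + v, v
```

In a full training loop this update is applied element-wise to every weight after each mini-batch of 128 patches, with `lr` divided by 10 when the error plateaus.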

In the extraction stage, superpixels are first obtained from the test remote-sensing image using the adaptive simple linear iterative clustering algorithm described in Section 2.3.2. For each superpixel, an image patch with a size of 28 × 28 centered at its geometric center pixel is extracted. Finally, the 28 × 28 image patches are input into the trained SAPCNN model. The procedure of water extraction is demonstrated in Figure 7.

#### 2.5. Accuracy Assessment Method

Reference water mapping is manually digitized by a visual interpretation process of the high-resolution imagery with reference to Google Earth. We evaluate the algorithm performance for the water extraction in two aspects: (i) water classification accuracy, and (ii) water edge pixel extraction accuracy. Therefore, six metrics are used including overall accuracy (OA), producer’s accuracy (PA), user’s accuracy (UA), edge overall accuracy (EOA), edge omission error (EOE), and edge commission error (ECE).

Unit rates (Equation (20)) based on the confusion matrix are utilized to evaluate the final water maps produced by the different methods, including PA, UA and OA [4]. The definitions are as follows:

where $T$ is the total number of pixels in the experimental remote-sensing image, and $TP$, $FN$, $FP$, and $TN$ are the pixel counts obtained by comparing the extracted water pixels with the ground-truth reference:

$TP$: true positives, i.e., the number of correctly extracted water pixels;

$FN$: false negatives, i.e., the number of water pixels not extracted;

$FP$: false positives, i.e., the number of incorrectly extracted pixels;

$TN$: true negatives, i.e., the number of non-water pixels that were correctly rejected.
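The unit rates behind Equation (20) reduce to simple ratios of these four counts; a sketch:

```python
def water_accuracy(tp, fn, fp, tn):
    """OA, PA and UA from confusion-matrix counts (standard definitions)."""
    t = tp + fn + fp + tn        # total number of pixels T
    oa = (tp + tn) / t           # overall accuracy
    pa = tp / (tp + fn)          # producer's accuracy (1 - omission error)
    ua = tp / (tp + fp)          # user's accuracy (1 - commission error)
    return oa, pa, ua
```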

This paper defines the evaluation of water-edge pixel extraction accuracy as follows: (1) first, obtain the boundary of the water body by manual drawing; (2) perform morphological dilation on the water-body boundary obtained in step (1) to create a buffer zone centered on the boundary line with a radius of 3 pixels; (3) finally, judge the pixels in the buffer area. Suppose the total number of pixels in the buffer area is $M$, the number of correctly classified pixels is ${M}_{R}$, the number of missing pixels is ${M}_{O}$, and the number of false-alarm pixels is ${M}_{C}$. Then EOA, EOE and ECE are defined as:
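Assuming EOA, EOE and ECE are the corresponding ratios over the buffer-zone pixel count $M$ (the displayed equations are not reproduced here, so these exact forms are an assumption), the computation is:

```python
def edge_metrics(m_total, m_r, m_o, m_c):
    """ASSUMED definitions: EOA = M_R / M, EOE = M_O / M, ECE = M_C / M,
    evaluated only over the 3-pixel buffer around the reference boundary."""
    eoa = m_r / m_total   # edge overall accuracy
    eoe = m_o / m_total   # edge omission error
    ece = m_c / m_total   # edge commission error
    return eoa, eoe, ece
```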