RCSANet: A Full Convolutional Network for Extracting Inland Aquaculture Ponds from High-Spatial-Resolution Images

Abstract: Numerous aquaculture ponds are intensively distributed around inland natural lakes and mixed with cropland, especially in areas with high population density in Asia. Information about the distribution of aquaculture ponds is essential for monitoring the impact of human activities on inland lakes. Accurate and efficient mapping of inland aquaculture ponds using high-spatial-resolution remote-sensing images is a challenging task because aquaculture ponds are mingled with other land cover types. Considering that aquaculture ponds have intertwining regular embankments and that these salient features are prominent at different scales, a Row-wise and Column-wise Self-Attention (RCSA) mechanism that adaptively exploits the identical directional dependency among pixels is proposed. Then a fully convolutional network (FCN) combined with the RCSA mechanism (RCSANet) is proposed for large-scale extraction of aquaculture ponds from high-spatial-resolution remote-sensing imagery. In addition, a fusion strategy is implemented using a water index and the RCSANet prediction to further improve extraction quality. Experiments on high-spatial-resolution images using pansharpened multispectral and 2 m panchromatic images show that the proposed methods gain at least 2-4% overall accuracy over other state-of-the-art methods regardless of region and achieve an overall accuracy of 85% in the Lake Hong region and 83% in the Lake Liangzi region for aquaculture pond extraction.


Introduction
Aquaculture has become one of the main sources of animal protein and increasingly contributes to food security for many inland cities with large populations in Asia. Freshwater aquaculture products such as fish, crustaceans, and molluscs are supplied from aquaculture ponds built around natural lakes. Aquaculture in China already accounts for 60% of global production [1]. Aquaculture foods provided by inland aquaculture ponds have become predominant contributors of aquatic foods in Chinese banquets [2]. Provinces in the middle and lower reaches of the Yangtze River basin account for more than half the country's total freshwater production. In recent years, pond aquaculture has become predominant, contributing on average 71 percent of total freshwater production (China Fishery Statistical Yearbook 2004-2016) and maintaining an average growth rate of 5.8 percent per year. The area under pond aquaculture has greatly increased. However, intensive aquaculture has a severely destructive effect on the environment. On remote-sensing images, aquaculture ponds with their intertwining regular boundaries are visual attention areas, because human perception commonly pays attention to parts of visual space where patterns can be acquired, according to the neuroscience and cognitive science literature [27].
Attention mechanisms have been extensively used for various visual tasks. The recurrent attention model is used for object recognition through a recurrent neural network (RNN) integrated with reinforcement learning to mimic the process of the human visual system as it recurrently determines the attention region. The attention mechanism on top of the RNN proposed by the neural machine translation community [28,29], was also adopted to perform image captioning by assigning different weights to image representations [30]. The self-attention mechanism without the RNN model is exploited in a super-resolution image generator [31], which is a variant of the TRANSFORMER [32], a cutting-edge deep neural network for language translation. Furthermore, self-attention mechanisms have been introduced into scene segmentation for modelling feature dependencies from spatial and channel dimensions [33]. In remote sensing, attention models have also been used for object classification in various satellite images. For instance, attention mechanisms are integrated into multi-scale and feedback strategies of deep neural networks for pixel-wise classification of very-high-resolution satellite images [34]. The attention model is combined with a learning layer to capture class-specific feature dependencies [35].
When human beings visually identify densely distributed aquaculture ponds on remote-sensing images, the intertwining regular embankments around these ponds are prominent visual attention features. This paper is inspired by this visual attention mechanism used in human interpretation of satellite images. Moreover, the intertwining regular embankments are a salient feature that is available at different scales. The two motivations of this study are, first, to develop a novel attention mechanism that can mimic the process of the human visual system as it recurrently determines the attention region, which here is the intertwining regular embankments of aquaculture ponds; and second, to evolve multi-scale visual attention through an encoder-decoder fully convolutional network architecture that integrates the attention mechanism with atrous convolutions to better extract aquaculture ponds.
Therefore, the main contributions of the paper can be summarized as follows: (1) Propose the Row-wise and Column-wise Self-Attention (RCSA) mechanism, which can work in parallel to capture visual emphasis on salient pixels in the context of rows and columns from a remote-sensing image. (2) Propose an improved fully convolutional network based on the RCSA mechanism that is combined with an ASPP structure for multi-scale attention. (3) Evaluate the validity of the proposed method on a developed dataset that contains abundant aquaculture ponds around inland lakes.

Study Area
Hubei Province, known as the province of thousands of lakes, lies in the middle reaches of the Yangtze River and has densely distributed lakes. Hubei has a mature freshwater aquaculture industry, with large numbers of aquaculture ponds developed around natural lakes. As shown in Figure 1, six regions with densely distributed aquaculture ponds were selected as study areas from three large lakes (Lake Liangzi, Lake Futou, and Lake Hong) along the Yangtze River because these are typical inland aquaculture areas in China. Among them, Lake Hong and Lake Liangzi are the two largest freshwater lakes in Hubei Province. The population in this part of China is dense, and aquaculture is highly developed. Lake Liangzi and its surroundings, however, have been relatively well protected since the 1980s. The six selected regions were divided into two categories: type I and type II. The type I regions, comprising regions A and B, are used for testing, whereas the type II regions are used for training. Region A is an area of 73.76 km² close to eastern Lake Hong, which is an artificial lake, and region B is an area of 33.92 km² close to eastern Lake Liangzi, which has been preserved in a state more like a natural lake.

Figure 1. Location of the study area. The pseudo-colour images (A,B) are pansharpened images using the near-infrared, red, and green bands as red, green, and blue. The corresponding labelling image for each is given below.

Dataset
The Landsat multispectral images were selected because of their long history. Bands such as the near infrared are beneficial for extracting water bodies. However, the spatial resolution of Landsat multispectral data is only 30 m. Panchromatic images with 2-2.5 m spatial resolution from the panchromatic and multispectral (PMS) camera of the GaoFen-1 (GF-1) satellite [36], the panchromatic remote-sensing instrument for stereo mapping (PRISM) of the ALOS satellite, and the NAD panchromatic sensor of the ZiYuan-3 (ZY-3) satellite [37] were therefore also used to improve recognition and extraction of aquaculture ponds and natural water bodies. Table 1 lists the images used for the selected study regions. The Landsat multispectral images used in this study were captured in the winters of 2010-2011 and 2013-2014 and the spring of 2015. High-resolution panchromatic images were used for fusion with the multispectral images. The panchromatic images were mainly selected from the GF-1 satellite and had acquisition dates close to those of the corresponding OLI images from the Landsat satellite, whereas panchromatic images from the ALOS satellite were used instead for the 2010 TM images. However, when ALOS or GF-1 panchromatic images with similar acquisition dates could not be found, panchromatic images from the ZY-3 satellite captured in the same season as the Landsat images, from a nearby year, were selected, because the ZY-3 satellite was not launched until 2012.
The reference dataset (Figure 1) includes three classes: aquaculture ponds (artificial water surfaces), natural water surfaces, and background (non-water surfaces). It was mainly generated by human visual interpretation. Field investigations were also conducted on some difficult-to-identify features, in cases where aquaculture ponds were mixed with small natural water surfaces (Figure 2).

Methodology
To better understand the effectiveness of the proposed method for aquaculture pond segmentation, the methodology will be introduced in three parts: data pre-processing, the basic model, and a fusion strategy designed to further improve accuracy. In the preprocessing stage, the multi-spectral image and the corresponding 2 m panchromatic image were pansharpened. The pansharpened image was then fed into the proposed network, i.e., RCSANet, for semantic segmentation. The result generated from the network was finally fused with a water surface extraction image using the water index to further improve segmentation quality.

Preprocessing
Multi-spectral satellite images contain more spectral information, especially in the infrared spectral bands, which is beneficial for aquaculture pond identification, whereas panchromatic satellite images have higher spatial resolution, which helps to better distinguish the shape of the aquaculture ponds. To use both together, the multi-spectral and high-spatial-resolution panchromatic images must be pansharpened to obtain images with both spectral information and higher spatial resolution. First, multi-spectral images were synthesized by selecting the three bands (green, red, NIR) that are useful for water body identification. The pixel values were normalized and then mapped to the range (0, 255). Similarly, the gray values of the panchromatic images were also normalized and mapped to the range (0, 255). The multi-spectral images were re-projected into the coordinate system of the corresponding panchromatic images to ensure consistent coordinates. The multi-spectral and panchromatic images were fused by the Gram-Schmidt method [38], which is a widely used high-quality pansharpening method providing a fusion of a panchromatic image and multi-spectral images with any number of bands through orthogonalization of the different multi-spectral bands [39].
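As a rough illustration only, the Gram-Schmidt pansharpening idea can be sketched as below. This is a simplified single-pass version (simulated pan band as a weighted band average, detail injection with per-band regression gains) and assumes the multispectral bands have already been resampled to the panchromatic grid; it is not the exact implementation used in the paper.

```python
import numpy as np

def gram_schmidt_pansharpen(ms, pan, weights=None):
    """Simplified Gram-Schmidt-style pansharpening sketch.

    ms  : (bands, h, w) multispectral image resampled to pan resolution
    pan : (h, w) panchromatic image
    """
    bands, h, w = ms.shape
    if weights is None:
        weights = np.full(bands, 1.0 / bands)
    # 1. Simulate a low-resolution pan band as a weighted sum of MS bands.
    sim_pan = np.tensordot(weights, ms, axes=1)
    # 2. Regression gain of each band on the simulated pan (GS coefficient).
    sp = sim_pan.ravel() - sim_pan.mean()
    gains = np.array([
        np.dot(b.ravel() - b.mean(), sp) / np.dot(sp, sp) for b in ms
    ])
    # 3. Histogram-match the real pan to the simulated pan.
    pan_m = (pan - pan.mean()) * (sim_pan.std() / pan.std()) + sim_pan.mean()
    # 4. Inject the spatial detail into each band.
    detail = pan_m - sim_pan
    return ms + gains[:, None, None] * detail
```

If the real pan band carries no extra detail (equals the simulated pan), the output reduces to the input multispectral image, which is a useful sanity check on the injection step.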

Network Architecture
The deep neural network architecture for semantic segmentation of aquaculture ponds in the proposed method, depicted in Figure 3a, is based on an FCN framework that uses ResNet-101 [15] as the encoder to generate multiple semantic features. The encoding part produces feature maps through five convolution layers: the first convolution layer (Conv1), which is followed by a pooling layer, and four further convolution layers (Res-1 to Res-4), all of which are residual subnetworks. The feature maps are abstract representations of the input image at different levels. Semantic segmentation by the FCN framework is a dense prediction procedure in which the coarse outputs of the convolution layers are connected by upsampling to produce pixel-level predictions. In the proposed method, the RCSA mechanism (introduced in Section 3.2.2) operates on the coarse outputs at different levels of abstract representation (detailed in Section 3.2.3). Next, channel attention blocks (CAB), which were designed to assign different weights to features at different stages for consistency [40], connect the coarse abstract representations from the encoder with the upsampled features in the decoder throughout the dense prediction procedure. The spatial size of the coarse outputs derived from the different convolution layers is kept consistent by the upsampling blocks (Figure 3c) to achieve end-to-end learning through backward propagation. Specifically, to accurately capture aquaculture ponds and their context information at multiple scales, the ASPP module combined with the RCSA mechanism (ASPP-RC) forms a branch from Conv1 to the end of the decoder, before a 1 × 1 convolution layer, and is integrated with the corresponding feature as a skip connection. To extract spatial context information at different scales, atrous convolutions with different rates, each followed by the RCSA mechanism, are performed in parallel on the low-level feature map in the ASPP-RC module.
These branches for capturing features at different scales are connected by weighting each branch in terms of its own importance ( Figure 3b, introduced in Section 3.2.4).

RCSA Mechanism
When human beings use visual perception to understand remote-sensing images containing inland lakes with densely distributed aquaculture ponds, the ponds as a group will be eye-catching. The attention focuses on the spatial dependencies of aquaculture ponds and their surroundings. To mimic this human visual mechanism, the proposed model first establishes inter-pixel contextual dependencies through bidirectional gated recurrent units (GRUs) [41], which are a powerful variant of RNN, and then the self-attention modules are used on top of the bidirectional GRUs to establish this visual attention.
The self-attention mechanism is essentially a special case of the attention model. The unified attention model has three types of inputs: key, value, and query [42], as depicted in Figure 4. The key and the value form a pair of data representations. Assume that there are T pairs <k_i, v_i> (i ∈ 1, ..., T). By evaluating the similarity between a query q and each key, the model captures the weight coefficient of each key and then weights the corresponding values to derive the final attention value. The attention mechanism first scores the similarity between a query and a key by the function f: e_i = f(q, k_i). Then the original scores e_i are normalized by a Softmax function to obtain the weight coefficients: α_i = exp(e_i) / Σ_{j=1}^{T} exp(e_j). Finally, the context vector c_t is evaluated as a weighted sum of the values: c_t = Σ_{i=1}^{T} α_i v_i. The attention model can thus be presented in the unified form Attention(q, K, V) = Σ_{i=1}^{T} Softmax(f(q, k_i)) v_i. The attention model becomes a self-attention mechanism when all inputs, including the query, the key, and the value, take the same value.
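The score-normalize-sum steps above can be illustrated with a minimal instance of the unified model. Here a plain dot product is assumed for the similarity function f (the unified model leaves f unspecified):

```python
import numpy as np

def attention(q, K, V):
    """Unified attention: score each key against the query (e_i = f(q, k_i)),
    softmax the scores into weights (alpha_i), and return the weighted sum
    of the values (context vector c)."""
    e = K @ q                      # dot-product similarity scores e_i
    a = np.exp(e - e.max())
    a /= a.sum()                   # softmax weights alpha_i
    return a @ V                   # c = sum_i alpha_i * v_i
```

When the query strongly matches one key, the corresponding value dominates the context vector; when q, K, and V all come from the same input, this becomes self-attention.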
Figure 3. RCSANet: FCN architecture combined with the RCSA mechanism for semantic segmentation of aquaculture ponds: (a) network architecture; (b) ASPP-RC module; (c) upsampling block. The input image of the entire deep neural network is a 256 × 256 pansharpened patch with three spectral channels. Through encoding and decoding, a three-channel matrix for classification is output through a 1 × 1 convolution layer at the end, and finally a Softmax layer gives a prediction map with the same size as the input image.

The RCSA mechanism takes a feature map, which is the convolutional result from the previous layer or the input image, as an input x ∈ R^{h×w×C}, where h, w, and C are the numbers of rows, columns, and channels, respectively. The feature map can be spatially divided into h rows r_i ∈ R^{1×w×C} (i ∈ 1, ..., h) or w columns c_j ∈ R^{h×1×C} (j ∈ 1, ..., w). RCSA constructs spatial dependencies between pixels within a row or a column by the self-attention mechanism. Hence, the RCSA mechanism consists of two parallel branches, column-wise and row-wise self-attention, which are subsequently combined by summation, as shown in detail in Figure 5. In the upper branch, the row-wise self-attention mechanism first uses the bidirectional GRU model to depict the dependencies between the pixels in a row of the feature map. The outcome from the GRUs, r̃_i, is then fed into the self-attention model, by which the importance of the dependencies between pixels in the row is evaluated. The self-attention model is a specific variant of the attention model in which the input query, key, and value take the same value, as shown in Figure 4b. The r̃_i are processed by three 1 × 1 convolution kernels, W_Q, W_K, and W_V, so that the query, the key, and the value are obtained as Q = r̃_i W_Q, K = r̃_i W_K, and V = r̃_i W_V. They are then substituted into Equation (4), yielding Attention(Q, K, V) = Softmax(QK^T / √d_k) V, where the similarity function is the scaled dot product and d_k is the dimension of the key.

The computation for one row is repeated for every row of the feature map. Equivalently, in the bottom branch, the same operations are performed in parallel on each column of the feature map. Eventually, the two branches are combined with equal weights.

Figure 5. Attention layer consisting of column-wise and row-wise self-attention models.
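Omitting the bidirectional GRU step for brevity and treating the 1 × 1 projections as plain matrices, the two-branch RCSA computation might be sketched as follows. This is a simplified illustration under those assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one row or column (len, C):
    query, key, and value all come from the same sequence."""
    Q, K, V = seq @ Wq, seq @ Wk, seq @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def rcsa(x, Wq, Wk, Wv):
    """Row-wise and column-wise self-attention on a feature map x (h, w, C),
    combined with equal weights by summation. The bidirectional GRU that
    precedes the attention in the paper is omitted here."""
    rows = np.stack([self_attention(x[i], Wq, Wk, Wv)
                     for i in range(x.shape[0])])
    cols = np.stack([self_attention(x[:, j], Wq, Wk, Wv)
                     for j in range(x.shape[1])], axis=1)
    return rows + cols
```

The two branches are independent, so they can run in parallel, and the output keeps the input's spatial shape, which lets RCSA be inserted between convolution layers.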

RCSA for Dense Prediction
In semantic segmentation of remote-sensing images, dense prediction must fuse abstract representations of different levels from the encoder to improve pixel-level prediction. Visual attention on densely distributed aquaculture ponds can thus be involved in the dense prediction procedure. Consequently, the outputs of the different convolution blocks in the encoding part are processed by RCSA and then participate in dense prediction. These RCSA modules in the lateral connections enhance the features pixel-wise by assigning different weights, achieving a reasonable optimization of visual attention. In fact, this optimization takes place in a two-dimensional space made up of row and column vectors. However, the importance of different band channels must also be emphasized. The CAB module is therefore used to fuse encoder and decoder features by assigning different weights to channels.

ASPP-RC Module
Atrous convolutions at different rates can enlarge the field of view so that spatial information at different scales can be extracted. Aquaculture ponds, which are water bodies surrounded by dikes with regular shapes, are densely distributed close to inland lakes. These features show visual salience in remote-sensing images. Hence, the RCSA block is arranged after each atrous convolution to selectively focus attention. After the first convolution blocks of the encoder, in the ASPP-RC module, atrous convolutions with different rates combined with RCSA are applied in parallel to the low-level feature map. Eventually, the branches are connected by y = Σ_i w_i b_i, where b_i is the feature map produced by the i-th branch, in which the atrous convolution and RCSA are conducted in sequence, and w_i is the weight of the i-th branch, which evaluates the importance of different scales. This is unlike the original ASPP structure, in which each branch has the same importance; in the proposed ASPP-RC module, the importance of each branch is adjusted adaptively. All weight parameters are initially defined by a random vector w_i^0, which is optimized during backpropagation when training the whole network. Finally, these weights are normalized using a Softmax function: w_i = exp(w_i^0) / Σ_j exp(w_j^0).
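The adaptive branch fusion can be sketched as follows, with `raw_weights` standing in for the learnable parameters w_i^0 (the names are illustrative, and the atrous-convolution/RCSA computation inside each branch is assumed to have already produced the branch feature maps):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def fuse_branches(branches, raw_weights):
    """Weighted fusion of ASPP-RC branches: y = sum_i w_i * b_i, where
    w = softmax(raw_weights). raw_weights would be learned by
    backpropagation; here they are plain numbers."""
    w = softmax(np.asarray(raw_weights, dtype=float))
    return sum(wi * b for wi, b in zip(w, branches))
```

Because the softmax weights sum to 1, fusing identical branches returns the branch unchanged, while training can push the weights toward the scales that matter most; the original ASPP corresponds to fixing all raw weights to the same value.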

Fusion Strategy
To further improve the segmentation quality of aquaculture ponds, the normalized difference water index (NDWI) maps from pansharpened images are fused with the prediction probability matrices from the proposed network to produce the final classification result (Figure 6). This implementation is called "RCSANet-NDWI". The classification probability matrices are produced from the probability maps of the three classes (aquaculture ponds, natural water surfaces, and background) after the Softmax layer. Both aquaculture ponds and natural water surfaces are water bodies surrounding inland lakes. Hence, the water extraction index, which is a typical representation of the spectral characteristics of a water body used to distinguish ground features, has been extensively used. The NDWI maps were used to provide prior knowledge for aquaculture pond extraction. Through OTSU threshold binary segmentation [43], the NDWI maps were divided into water and non-water parts. The water parts in the NDWI maps were used to refine the three-class probability matrix described earlier. Assume that the original probability matrix P_0 and the refined matrix P are both h × w × c in size, whereas the NDWI map S is h × w in size; c is the channel number, k is the channel ID, and the k-th channel represents the k-th class. Hence, k = 1, 2, 3 represent background, water, and aquaculture ponds, respectively. The fusion operation can be defined as P_ij^k = y · P0_ij^k, where y is an indicator variable with y = 0 if k = 1 and S_ij = 1, and y = 1 otherwise. For the pixel in the i-th row and j-th column of S, if its value is 1 (representing water), the corresponding background probability (k = 1) in P_0 is set to 0, and the fused matrix P is generated. The final classification maps can be obtained using the maximum-probability judgment.
The maximum-probability judgment is the usual method for mapping the probability matrix to the final label image: the classification label of a pixel is the class with maximum probability, l_ij = argmax_k (p_ij^k), where p_ij^k is the probability that the pixel in the i-th row and j-th column belongs to class k. With the NDWI, the interference from the background in water body extraction is eliminated, because the background probability of the water parts is set to 0.
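Putting the water index and the probability refinement together, a minimal sketch is given below. The OTSU thresholding step is replaced here by a precomputed binary water mask, and the channels are 0-indexed with background first, both of which are simplifying assumptions:

```python
import numpy as np

def ndwi(green, nir, eps=1e-9):
    """Normalized difference water index: (Green - NIR) / (Green + NIR)."""
    return (green - nir) / (green + nir + eps)

def fuse_with_water_mask(probs, water_mask):
    """probs: (h, w, 3) softmax output with channels (background,
    natural water, aquaculture pond). Where the mask says 'water',
    the background probability is zeroed before the argmax decision
    l_ij = argmax_k p_ij^k."""
    fused = probs.copy()
    fused[..., 0] *= (1 - water_mask)   # suppress background at water pixels
    return fused.argmax(axis=-1)
```

A pixel the network weakly labels as background but that the NDWI mask marks as water is thereby reassigned to the more probable water class, which is exactly the refinement the fusion strategy describes.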

Experiments
This section describes a series of qualitative and quantitative comprehensive evaluations that were conducted using the proposed methods with the dataset introduced in Section 2.

Experimental Set-Up
The inputs of the proposed network were 256 × 256 pansharpened patches with three spectral channels. Table 2 lists the parameters of the convolution kernels, which are the basic operators of the different modules in the proposed RCSANet. The rate parameter means that the convolution kernels in the different atrous convolution branches of the ASPP-RC module have different padding and dilation configurations, which are set to 6, 12, and 18, respectively, according to Figure 3b. Validation consisted of two parts: (1) Evaluating the performance of the proposed methods. The pansharpened images of the six regions (both type I and type II in Figure 1) were segmented into image patches 256 × 256 pixels in size. These image slices were randomly divided into training and test sets, of which 80% (4488 images) made up the training set and 20% (1122 images) made up the test set. The overall accuracy, user's accuracy, producer's accuracy, and kappa coefficient were used as the main evaluation metrics. (2) To assess the quality of aquaculture pond extraction and evaluate the generalization and migration capabilities of RCSANet, the four type II regions were used as training data, and the two type I regions (A and B) were used as test areas. The overall accuracy, user's accuracy, producer's accuracy, and kappa coefficient were calculated to assess aquaculture pond extraction accuracy on the 2 m spatial resolution pansharpened images. In addition, the proposed methods were evaluated in two versions: RCSANet (without NDWI fusion) and RCSANet-NDWI (with NDWI fusion), to verify the role of NDWI fusion. Three state-of-the-art segmentation methods, DeeplabV3+ [20], Reseg [44], and the Homogeneous Convolutional Neural Network (HCN) [25], were selected for comparison. In addition, the performance of SVM was also assessed as a representative of traditional machine learning methods that directly use each pixel as a feature.
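For reference, the overall accuracy, kappa coefficient, and user's/producer's accuracies can all be derived from a single confusion matrix. A small sketch (the function name and return order are illustrative, not from the paper):

```python
import numpy as np

def accuracy_metrics(cm):
    """Classification metrics from a confusion matrix
    (rows = reference classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                       # overall accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2  # chance agreement
    kappa = (po - pe) / (1 - pe)                # Cohen's kappa
    users = np.diag(cm) / cm.sum(axis=0)        # user's accuracy per class
    producers = np.diag(cm) / cm.sum(axis=1)    # producer's accuracy per class
    return po, kappa, users, producers
```

Note that kappa discounts the agreement expected by chance, which is why a method can exceed 80% overall accuracy while its kappa stays below 0.7, as discussed for Reseg in the Results.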
DeeplabV3+ is an FCN method for semantic segmentation that uses an ASPP module. Reseg is a hybrid deep network for semantic segmentation; in addition to a CNN, Reseg also uses bidirectional GRUs to capture contextual dependencies. HCN was originally proposed for automatic raft labelling and is considered to have potential for aquaculture pond extraction. HCN was implemented following the settings in [25], and ResNet-101 was used as the encoder in DeeplabV3+, Reseg, and the proposed methods.
In the present experiments, the parameters of the proposed methods were optimized by mini-batch stochastic gradient descent with momentum and a batch size of 2. The learning rate was set to 10^-2 and decayed with the training epoch according to the "polynomial" strategy. The number of training epochs was set to 40. The SVM was implemented with the LIBSVM package [45], and its two important hyperparameters, C and γ, were determined through a five-fold cross-validation grid search. Except for HCN, which was run using TensorFlow 1.9.0, the deep learning-based algorithms were implemented in PyTorch 1.1.0. All deep learning methods were run on a single NVIDIA GeForce GTX 1080 GPU.
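The "polynomial" decay strategy is commonly implemented as lr = base_lr · (1 − epoch / max_epochs)^power. A sketch, with power = 0.9 used as an assumed default since the paper names the strategy but not the exponent:

```python
def poly_lr(base_lr, epoch, max_epochs, power=0.9):
    """Polynomial learning-rate decay: the rate shrinks smoothly from
    base_lr at epoch 0 to 0 at max_epochs. power=0.9 is a common
    default and an assumption here."""
    return base_lr * (1 - epoch / max_epochs) ** power
```

With base_lr = 1e-2 and max_epochs = 40 as in the paper's setup, the rate starts at 1e-2 and reaches 0 at the final epoch, decaying faster near the end of training than a linear schedule would.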

Results
The performance of the various semantic segmentation methods in Part 1 of the experiments is presented in Table 3. Clearly, the deep learning-based methods perform better than the traditional SVM algorithm because the latter cannot perceive spatial semantic information in the image. DeeplabV3+ is a state-of-the-art FCN method that has been widely used; ResNet-101 was also chosen as its backbone. HCN is a deep convolutional neural network for automatic raft labelling, and Reseg is a deep recurrent neural network for semantic segmentation. The classification accuracy of the proposed methods for natural water surfaces and aquaculture ponds was consistently better than that of the other methods. Meanwhile, compared with DeeplabV3+, the two versions of the proposed method improved overall accuracy by more than 7%, and their Kappa coefficients were greater than 0.72, indicating that the proposed method is significantly better than DeeplabV3+. Moreover, the results also demonstrate the effect of the proposed fusion strategy, because RCSANet-NDWI further surpassed RCSANet on most metrics. Figure 7 gives a detailed view of the classification results in Part 1 of the experiment. Inland water areas contain various natural water bodies as well as aquaculture ponds. These natural water bodies greatly interfere with the segmentation of aquaculture ponds, making pixel-scale classification intricate. Figure 7 shows that SVM misclassified many aquaculture ponds as natural water bodies and many natural water bodies as aquaculture ponds, indicating that the traditional pixel-based method cannot efficiently distinguish natural water bodies from aquaculture ponds.
The segmentation maps created by DeeplabV3+ look significantly better than those from SVM, but in some difficult zones where natural water bodies resemble aquaculture ponds, DeeplabV3+ also runs into its performance limits, misclassifying natural water bodies as aquaculture ponds (area in the 7th row) or aquaculture ponds as natural water bodies (districts in the 5th row). HCN, which performs well for raft-culture extraction in offshore waters, performed poorly on semantic segmentation of inland aquaculture ponds, and serious misclassifications also occurred with HCN. Reseg, which combines CNN and bidirectional GRU, can perform semantic segmentation of aquaculture ponds. However, its identification of natural water bodies that closely resemble aquaculture ponds around inland lakes is not as good as that of the proposed methods. In Table 3, the overall accuracy of the Reseg method exceeds 80%, but its Kappa coefficient is less than 0.7. This shows that Reseg establishes spatial relationships through the GRU, which has a certain effect on the segmentation of aquaculture ponds around inland lakes, but is not good enough. In the Reseg segmentation map, many objects are stuck together, and the edges of aquaculture ponds are not well delineated. Among these result maps, the two versions of the proposed method separated natural water bodies and aquaculture ponds more satisfactorily than the other methods. The ASPP-RC module of the proposed method feeds the details at different scales back into the low-level feature map, which draws visual attention into the decoding part. This facilitates identification of the thin edges surrounding the aquaculture ponds in semantic segmentation. Hence, the edges of aquaculture ponds were clearly identified in most cases, as shown in the results from RCSANet and RCSANet-NDWI.
Finally, note that RCSANet-NDWI further improved the quality of aquaculture pond extraction compared with RCSANet. Table 4 provides assessment results for the various algorithms in Part 2 of the experiment and shows the corresponding extraction accuracies of the aquaculture pond and natural water surface classes in the two experimental areas (regions A and B in Figure 1) by different sensors. The overall accuracy and Kappa coefficient show that the two versions of the proposed method (RCSANet and RCSANet-NDWI) both performed better than the other methods, regardless of sensor or area. Moreover, compared with RCSANet, the accuracy of RCSANet-NDWI was further improved with the aid of NDWI fusion. In region A, their overall accuracies on pansharpened images from different sensors were greater than 85 percent, and the Kappa coefficients were consistently greater than 0.7. These results were better than those of the other deep learning-based methods, not to mention SVM. In region B, the proposed methods still performed best. Unlike region A, where the lake is greatly influenced by nearby residents, causing the aquaculture ponds to be neatly and regularly distributed, the aquaculture ponds in region B have a sparser distribution.
Region B is relatively well protected, and some small natural water bodies, which are easily confused with aquaculture ponds and interfere with network identification, were produced when the lake was split for artificial development. Hence, the situations in the two regions are completely different, which demonstrates the stability of the proposed methods under various scenarios. The overall accuracies of the proposed methods on pansharpened images from different regions were close to or greater than 80 percent. In addition, it should be noted that the user's accuracy in identifying natural water bodies is relatively high for almost all methods. This is because natural water bodies tend to be extensive, homogeneous, self-contained, and distributed in aggregates, a situation that is easier for the classifier to recognize. Compared with the proposed methods, Reseg and DeeplabV3+ may also obtain higher user's accuracy in some cases. However, because of their limited recognition ability, they cannot explicitly judge the difference between aquaculture ponds and natural water bodies (Figure 8). Figure 8 shows the classification results in the two study regions. Extracting aquaculture ponds in region B is more difficult than in region A because region B contains more natural water bodies that are hard to distinguish from aquaculture ponds. The two versions of the proposed method performed significantly better than the other methods for aquaculture pond extraction. The proposed methods were predominantly successful in predicting aquaculture ponds that are divided into regular shapes by embankments, as well as the natural water bodies in the two regions. In region A, the proposed methods extracted almost all aquaculture ponds compared with the ground truth, whereas the other methods failed, especially in the upper part of the scene.
In region B, compared with Reseg, the proposed methods had lower misclassification rates, and the natural rivers located at the bottom, which could not be identified by Reseg, were not misclassified as aquaculture ponds by the proposed methods. Moreover, the shapes of the ponds are best retained in the results of the proposed method. The advantage of the proposed method is the RCSA mechanism for determining salient pixels in a row or column, which is essentially a description of the pixel-level context. This enables the proposed method to identify detailed features of the 2 m spatial resolution image, where the dikes around aquaculture ponds are exactly such pixel-level details. Hence, the aquaculture ponds in region B were more fully extracted by the proposed RCSANet than by other state-of-the-art methods, such as DeeplabV3+ and Reseg. On the other hand, fusion using NDWI better distinguishes water surfaces, including natural water bodies and aquaculture ponds, from the background. In effect, the proposed method with NDWI re-segments water surfaces that leaked into the background, which improves the producer's accuracy for aquaculture ponds. However, this also means that a small part of the background is mistakenly classified as water surface.

Discussion
This study has used a fully convolutional network architecture with row- and column-wise self-attention to semantically segment aquaculture ponds around inland lakes. Artificial aquaculture ponds around inland lakes are small, and the dikes between them are only about 2 m wide. On medium-resolution multispectral images, water pixels are first separated from land, water objects are then formed based on connectivity, and finally these objects are classified as natural water bodies or aquaculture ponds using geometric characteristics [8]. However, for inland lake areas where aquaculture ponds are intensively distributed with narrow dikes (e.g., Lake Hong), the 15-30 m spatial resolution of the images limits the ability of such object-based methods to extract aquaculture ponds accurately. Hence, finer-spatial-resolution images are considered for pond extraction. By fusing multi-spectral information into panchromatic images from the GF-1, ZY-3, or ALOS satellites, the spatial resolution of the resulting images reaches 2 m, enabling the identification of thin, narrow dams, while the multi-spectral information is used to recognize water. From the segmentation results, the proposed network structure was shown to be capable of extracting these regular pond boundaries, mainly because semantic segmentation of the aquaculture ponds benefits from the self-attention model establishing a spatial relationship between pixels along the same direction. Although HCN is also an FCN-based method, used for automatic raft labeling [25], its performance in extracting aquaculture ponds around inland lakes is not as effective as in labeling raft culture, because the spatial context of raft culture in coastal areas is much simpler than that of inland lake areas.
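The directional dependency exploited by the RCSA mechanism can be illustrated with a minimal numpy sketch, in which dot-product self-attention is computed only among pixels sharing a row (or, by transposing the spatial axes, a column). This is a deliberate simplification, not the paper's exact formulation: the shared query/key/value projections and the function names are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rowwise_attention(feat):
    """feat: (H, W, C) feature map. Attention is computed only among
    pixels in the same row, so each output pixel aggregates context
    along its horizontal direction."""
    H, W, C = feat.shape
    out = np.empty_like(feat)
    for r in range(H):
        q = k = v = feat[r]                    # (W, C); shared projections for brevity
        scores = q @ k.T / np.sqrt(C)          # (W, W) similarities within the row
        out[r] = softmax(scores, axis=-1) @ v  # attention-weighted sum along the row
    return out

def colwise_attention(feat):
    # Column-wise case: swap the spatial axes and reuse the row-wise routine.
    return rowwise_attention(feat.transpose(1, 0, 2)).transpose(1, 0, 2)
```

Restricting attention to one row or column at a time is what lets salient directional structures, such as straight dikes, dominate the aggregated context for each pixel.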
In general, using high-spatial-resolution images that incorporate multi-spectral and panchromatic data, the proposed RCSANet enables large-scale extraction of aquaculture ponds around inland lakes where complex spatial contexts of water surfaces exist. However, recognizing small water bodies in such complex spatial contexts remains challenging. Experimental region B was in the process of being restored from aquaculture ponds and farmland back to lake area between 2011 and 2014. Therefore, various aquaculture ponds and natural water bodies are spatially mixed in the images pansharpened from multispectral and panchromatic data acquired in 2011 and 2014, which poses great challenges for semantic segmentation of aquaculture ponds. For example, Figure 9c,d show images of the same area, which changed significantly between 2011 and 2014. Several small reservoirs are apparent in Figure 9c, but their profiles had changed significantly in Figure 9d, and the left side of this area had been restored to a large lake. The segmentation results in Figure 9g,f show that the restored large lake was well segmented, but the small reservoirs are easily classified as aquaculture ponds or missed entirely.
In this paper, aquaculture pond extraction is performed on images produced by pansharpening multi-spectral data from Landsat satellites with panchromatic data from other satellites acquired in the same period, and therefore the semantic segmentation might also be affected by the spectral range of the panchromatic image. Table 5 gives the results of an accuracy analysis that divided the training data of Part 2 of the experiment into two portions: pansharpened TM images and pansharpened OLI images. The predicted results for pansharpened TM images from Region B are based on RCSANet trained on TM images fused with panchromatic images from the ZY-3 or ALOS satellites. The predicted results for pansharpened OLI images from Region B are based on RCSANet trained on OLI images fused with panchromatic images from the GF-1 satellites. Table 5 shows that the results for pansharpened OLI images with GF-1 panchromatic data are significantly better than those for pansharpened TM images with ZY-3 or ALOS panchromatic data. The spectrum of GF-1 panchromatic images ranges from 0.45 to 0.90 µm, which completely covers the NIR, red, and green bands of Landsat OLI data, whereas the spectrum of ZY-3 or ALOS panchromatic images only partly covers the NIR band of the TM sensor. Because the TM images were acquired before 2012, GF-1 panchromatic images could not be used to pansharpen them. The RCSANet can extract aquaculture ponds around inland lakes from 2 m satellite images more accurately than other methods because of the involvement of two connection groups from the encoder to the decoder. The first connection group is the combination of the RCSA module and the ASPP-RC module, which links Conv1 of the encoder to the decoder. The second consists of the RCSA modules linking Res-1, Res-2, and Res-3 of the encoder to the decoder.
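The pansharpening referred to throughout can be illustrated with a simple Brovey-style band-ratio fusion. This is only a hedged sketch: the paper does not state which pansharpening algorithm was used, and the function name is hypothetical.

```python
import numpy as np

def brovey_pansharpen(ms, pan):
    """Brovey-style pansharpening sketch.
    ms:  (H, W, B) multispectral bands resampled onto the panchromatic grid.
    pan: (H, W) high-resolution panchromatic band.
    Each band is scaled by the ratio of the panchromatic value to the
    mean of the multispectral bands at that pixel."""
    intensity = ms.mean(axis=-1) + 1e-8  # per-pixel intensity; epsilon avoids division by zero
    return ms * (pan / intensity)[..., None]
```

The spectral-coverage issue discussed above matters here: a ratio-based fusion implicitly assumes the panchromatic band spans the multispectral bands being sharpened, which holds for GF-1 over OLI but only partly for ZY-3/ALOS over TM.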
Table 6 indicates that the first connection group of RCSANet achieves an additional 2.32% overall accuracy gain over RCSANet1, and the second connection group brings a 3.59% overall accuracy gain over RCSANet2, i.e., a plain FCN architecture based on the ResNet-101 model. Nevertheless, the connections consume more computing resources because they involve the non-local self-attention mechanism, which contains many inner-product operations. Moreover, RCSANet is an encoder-decoder architecture in which gradual upsampling is conducted, requiring more memory and computation time. Table 7 shows that RCSANet consumes more memory and more training and prediction time than the Deeplabv3+ and Reseg methods. Sacrificing some computing resources to achieve higher aquaculture pond extraction accuracy is feasible, especially as GPU performance continues to improve.

Conclusions
This study has implemented a semantic segmentation network on high-spatial-resolution satellite images for aquaculture pond extraction. A row- and column-wise self-attention (RCSA) mechanism has been proposed to capture the intertwining regular embankments of aquaculture ponds in feature maps, and a fully convolutional network framework combined with the RCSA mechanism has been proposed for semantic segmentation of aquaculture ponds. The proposed methods have been evaluated on high-spatial-resolution pansharpened images, obtained by fusing multi-spectral and panchromatic images, in typical regions with inland lakes and densely distributed aquaculture ponds. Experiments on satellite images of both a highly developed lake and a protected lake show that the overall accuracy of the proposed method is significantly better than those of other methods (3-8% overall accuracy gains at Lake Liangzi and 1-2% gains at Lake Hong over the best of the other methods). In particular, the experimental semantic segmentation results for large regions show that detailed information, such as the embankments of aquaculture ponds, is identified more accurately by the proposed method. It can be concluded that the proposed method is effective for large-scale extraction of aquaculture ponds. In addition, RCSANet-NDWI further improves on the accuracy of RCSANet, indicating the significance of the proposed NDWI fusion strategy. In future work, the proposed methods could be extended to raft-culture extraction in offshore waters.