FCAU-Net for the Semantic Segmentation of Fine-Resolution Remotely Sensed Images

: The semantic segmentation of ﬁne-resolution remotely sensed images is an urgent issue in satellite image processing. Solving this problem can help overcome various obstacles


Introduction
Semantic image segmentation is one of the most important image processing tasks [1,2]. Its purpose is to classify each pixel in the input image and attach a semantic label [3][4][5][6]. The semantic segmentation of remote sensing images often refers to the pixel-level classification and prediction of geographic entities (e.g., buildings, water bodies, roads, cars, and vegetation) [7]. Therefore, semantic segmentation is a critical tool for improving image comprehension [8]. With the advancement of remote sensing technologies, a constellation of Earth observation satellites has been launched by China [9][10][11], and these satellites acquire substantial fine-resolution images that can be used for semantic segmentation [4,5,8].
This process has crucial research significance for urban planning [12], vehicle monitoring [13], land cover mapping [14], and change detection [15], as well as building and road extraction [16,17]. As semantic segmentation is a continually growing technique, several classifiers have been created for this process in the field of remote sensing [18], including traditional methods (e.g., logistic regression [19], distance-based metrics [20], clustering [21]) and machine learning (e.g., support vector machines (SVMs) [22], random forests (RFs) [23], artificial neural networks (ANNs) [24], and multilayer perceptions (MLPs) [25]), but the flexibility and adaptability of these approaches are limited due to their high dependency on hand made features and information transformations [4,5,7,26,27]. For example, spectral, spatial, and texture characteristics are difficult to optimize, resulting in insufficient dependability [8]. The advancement of deep learning has encouraged the use of convolutional neural networks (CNNs) for image processing [28]. CNNs, which are independent of handcrafted descriptors, have the powerful ability to automatically capture nonlinear and hierarchical features, remarkably influencing the field of computer vision (CV) [1,4,29]. A CNN loses part of the semantic information of the input image because of the local nature of the convolution kernel, leading to a lack of long-term relationship determination ability for image segmentation [5,30]. Therefore, it is more difficult to achieve accurate classification in fine-resolution images, and the segmentation of fine-resolution images remains a challenging topic.
Based on the advantages of local texture extraction, a fully convolutional network (FCN) [31] was the first demonstrated and effective end-to-end CNN structure. Skip connections enhance encoder feature information and upsample decoder feature information according to the size of the original input data, demonstrating their significant generalization ability and high efficiency [6,7]. A series of FCN-based networks for scene segmentation have been proposed (such as the segmentation network (SegNet) [32] and U-Net [33]). Although FCNs have shown elegant structures and remarkable achievements, the abstraction capabilities of FCNs are insufficient for considering meaningful global context information for high-level features. Precise boundaries cannot be recovered correctly by performing eight upsampling operations [7]; therefore, the insufficient utilization of information flows hinders the original U-Net architecture's potential [18]. A more detailed encoder-decoder structure was proposed to address this issue [4]. Generally, the feature maps produced by an encoder contain low-level and fine-grained semantic information, while maps produced by a decoder contain high-level and coarse-grained semantic data [34,35]. Skip connections are additional methods that act as bridges between low-level and high-level feature maps [18]. For example, in U-Net++ [34], nested and dense skip connections are used instead of direct skip connections; this not only improves the strength of the skip connections but also minimizes the semantic gaps between encoders and decoders [18]. Full-scale skip connections are used in U-Net3+ [36] to improve the capabilities of skip connections and extract characteristic information from the network. The pyramid scene parsing network (PSPNet) [37] directly builds feature maps of varying resolutions through the global average pooling technique. Stride-based spatial pyramid pooling (SSPP) [38] alters the sizes of feature maps by using a pooling procedure using strides. These approaches described above have significance for semantic segmentation and multiscale feature information extraction [4]. Although these approaches can gather context information to some extent, they merely mix features with distinct receptive fields via concatenation procedures. Moreover, the different feature representations and context information extraction capabilities of these neural networks have been ignored; therefore, they cannot explore global context information [2,7,8,29,35].
A self-attention mechanism is simply a method of imitating how humans observe objects. For example, when viewing character pictures, most people focus on crucial local information (such as the character itself) rather than the visual backdrop. This form of self-attention mechanism was originally introduced in natural language processing and has been widely used in CV and remote sensing since its vast potential was first discovered [39]. Attention mechanisms [40,41] are a hot topic in convolution and recurrence research. The long-term dependencies of feature maps and the features extracted via refinement improve the segmentation capabilities of deep networks [35,42,43]. For example, the squeeze-and-excitation network (SENet) [44] uses a channel attention structure to effectively establish interdependencies between channels and selects the most suitable channel by itself. Nevertheless, it ignores the importance of the position dimension for semantic segmentation. The dual-attention network (DANet) [45] designs spatial and channel attention modules according to a dot-product attention mechanism to extract rich context. CBAM [46] utilizes a spatial attention module and channel module to refine intermediate feature maps adaptively. MAResU-Net [35] embeds a multistage attention model into the direct skip connections of the original U-Net, thereby refining the multiscale feature maps. Unlike these methods that use expensive and heavyweight nonlocal or self-attention blocks, a coordinate attention (CA) mechanism [47] that effectively captures the position information and channel-wise relationships has been proposed. The CA mechanism enhances the feature representations of networks and obtains essential contextual information and the long-distance dependencies of geographic entities, improving the final segmentation results.
A standard convolution kernel that extracts information with irregular proportions has a more extensive weight range at the central crisscross positions. The points in the corners contain less information that may be used to extract features. Therefore, we used an asymmetric convolution block (ACB) to enhance the spatial details of high-level abstract characteristics by intensifying the weights of the central crisscross portions [18]. ACB convolution is incorporated into FCAU-Net for semantic segmentation. Finally, for the fusion of features, many researchers have proposed effective feature fusion strategies from the perspective of feature-level fusion. For example, Liu et al. [48] proposed a novel cross-resolution hidden layer features fusion (CRHFF) approach for the joint classification of multi-resolution MS and PAN images. The CRHFF solved the inconsistent feature representation problem of the local patches, and the objects can be modeled in a more comprehensive way while increasing the classification accuracy. Zheng et al. [49] proposed a novel multitemporal deep fusion network (MDFN) for short-term multitemporal HR image classification, which includes a long short-term memory (LSTM) and a convolution neural network (CNN). The spatio-temporal-spectral features are extracted and fused by integrating LSTM and CNN branches, improving the classification accuracy. However, a shallow feature mapping contains rough semantics and introduces noise information during feature extraction; the fusion of features with different resolutions leads to the insufficient utilization of information flows. To address the inadequate feature utilization issue, a refinement fusion block (RFB) is designed to merge high-level abstract features and low-level spatial features, thereby eliminating the background noise and reducing the fitting residuals after feature fusion.
Experiments on two public remote sensing image datasets (ZY-3 and DeepGlobe dataset) prove the efficacy of our fusion coordinate and asymmetry-based U-Net (FCAU-Net). For the binary classification problem in our experiment, 0 and 1 represent background and arable land in the ZY-3 dataset, respectively, and 0 and 1 denote background and building in the DeepGlobe dataset. Furthermore, a well-designed model structure can offer a unified solution for semantic segmentation [50], object recognition [51], and change detection [15], which undoubtedly promotes the use of deep learning technology. In summary, the main contributions of this paper are as follows: (1) A novel CA mechanism is introduced into the encoding process to effectively simulate channel-wise relationships. Accurate position information is used to capture long-term dependencies, enabling the model to accurately locate and identify objects of interest. (2) In the decoding process, we use an ACB to capture and refine the obtained features by enhancing the weights of the central crisscross positions to improve the convolutional layer's representation capabilities. (3) We design an RFB to combine low-level spatial data with high-level abstract features to take advantage of feature information. The RFB can fully utilize the benefits of advantages of these aspects based on the representations of various levels. (4) To avoid the imbalance between the target and nontarget areas, which may cause the learning process to fall into the local minimum of the loss function and strongly bias the classifier toward the background class, we utilize a combination of the cross-entropy loss function and Dice loss function, which solves the sample imbalance issue.
The flowchart of the FCAU-Net is shown in Figure 1. It includes the CA, the ACB, and the RFB. The FCAU-Net solves the feature fusion problem of pixel-level segmentation and accurately extracts different types of contour information regarding target objects. Furthermore, the target item's location, forms, and spatial distribution are more precise. The following section introduces the architecture and components of the FCAU-Net in detail. Experimental comparisons on two public remote sensing image datasets (ZY-3 and DeepGlobe) are provided in Section 3. A discussion is presented in Section 4. Finally, conclusions are drawn in Section 5. (2) In the decoding process, we use an ACB to capture and refine the obtained features by enhancing the weights of the central crisscross positions to improve the convolutional layer's representation capabilities. (3) We design an RFB to combine low-level spatial data with high-level abstract features to take advantage of feature information. The RFB can fully utilize the benefits of advantages of these aspects based on the representations of various levels. (4) To avoid the imbalance between the target and nontarget areas, which may cause the learning process to fall into the local minimum of the loss function and strongly bias the classifier toward the background class, we utilize a combination of the cross-entropy loss function and Dice loss function, which solves the sample imbalance issue.
The flowchart of the FCAU-Net is shown in Figure 1. It includes the CA, the ACB, and the RFB. The FCAU-Net solves the feature fusion problem of pixel-level segmentation and accurately extracts different types of contour information regarding target objects. Furthermore, the target item's location, forms, and spatial distribution are more precise. The following section introduces the architecture and components of the FCAU-Net in detail. Experimental comparisons on two public remote sensing image datasets (ZY-3 and DeepGlobe) are provided in Section 3. A discussion is presented in Section 4. Finally, conclusions are drawn in Section 5.

Methodology
With the continuous improvement of the encoder-decoder network structure, feature information is more abstract, and the feature representation capacity of a network becomes more vital. Nevertheless, position data is lost during encoding and cannot be effectively restored during the upsampling process (as shown in Figure 2). The feature maps must be recovered to the same size as the original image in the decoding process, and the decoder combines low-level spatial characteristics with high-level abstract notions for simple summation or concatenation. Therefore, mainstream deep CNNs (DCNNs) are not capable of dealing with the feature fusion problem successfully.

Methodology
With the continuous improvement of the encoder-decoder network structure, feature information is more abstract, and the feature representation capacity of a network becomes more vital. Nevertheless, position data is lost during encoding and cannot be effectively restored during the upsampling process (as shown in Figure 2). The feature maps must be recovered to the same size as the original image in the decoding process, and the decoder combines low-level spatial characteristics with high-level abstract notions for simple summation or concatenation. Therefore, mainstream deep CNNs (DCNNs) are not capable of dealing with the feature fusion problem successfully. To address the issue of long-term dependencies and feature fusion, we designed a semantic segmentation network (FCAU-Net) for fine-resolution remotely sensed imagery with the idea of an encoder-decoder structure. First, in the encoding process, we designed two stacked convolutional layers with an embedded CA module in each level. The CA To address the issue of long-term dependencies and feature fusion, we designed a semantic segmentation network (FCAU-Net) for fine-resolution remotely sensed imagery with the idea of an encoder-decoder structure. First, in the encoding process, we designed two stacked convolutional layers with an embedded CA module in each level. The CA module utilizes accurate position information to address the issues of encoding channelwise relationships and long-term dependency correlations. Second, we used the ACB to acquire and enhance feature information in each encoder layer. Finally, the RFB in the feature fusion stage was designed to effectively fuse different level features, thereby eliminating background noise and reducing the fitting residuals after feature fusion; this improved the segmentation performance of the network. The overall architecture of the proposed FCAU-Net is shown in Figure 3. To address the issue of long-term dependencies and feature fusion, we designed a semantic segmentation network (FCAU-Net) for fine-resolution remotely sensed imagery with the idea of an encoder-decoder structure. First, in the encoding process, we designed two stacked convolutional layers with an embedded CA module in each level. The CA module utilizes accurate position information to address the issues of encoding channelwise relationships and long-term dependency correlations. Second, we used the ACB to acquire and enhance feature information in each encoder layer. Finally, the RFB in the feature fusion stage was designed to effectively fuse different level features, thereby eliminating background noise and reducing the fitting residuals after feature fusion; this improved the segmentation performance of the network. The overall architecture of the proposed FCAU-Net is shown in Figure 3.

CA Module
Capturing the long-range dependencies of features through horizontal and vertical attention maps, CA can embed position information into channel attention. To construct an accurate position information model, two steps are required: coordinate information embedding (CIE) and CA generation (CAG). The CA module designed in this study is shown in Figure 4.

CA Module
Capturing the long-range dependencies of features through horizontal and vertical attention maps, CA can embed position information into channel attention. To construct an accurate position information model, two steps are required: coordinate information embedding (CIE) and CA generation (CAG). The CA module designed in this study is shown in Figure 4.  In Figure 4, any feature tensor X = [x 1 ,x 2 ,x 3 ,…,x n ]∈R C×H×W is taken as input, where C, H, and W are the pixel sizes along the channel dimension, height, and width of the feature tensor X, respectively. Given the input X in the attention block, we utilized two spatial pooling kernels, (H, 1) or (1, W), to acquire a pair of 1D vectors along the horizontal coor- In Figure 4, any feature tensor X = [x 1 , x 2 , x 3 , . . . , x n ] ∈ R C×H×W is taken as input, where C, H, and W are the pixel sizes along the channel dimension, height, and width of the feature tensor X, respectively. Given the input X in the attention block, we utilized two spatial pooling kernels, (H, 1) or (1, W), to acquire a pair of 1D vectors along the horizontal coordinate and the vertical coordinate. Thus, the output of the c-th channel at a height h and at a width w can be written as: Specifically, CAG combines the feature maps extracted via CIE. We first performed the concatenation operation to obtain feature maps and then sent them to a shared 1 × 1 convolution transformation function F1 to compress the channels, yielding: where [I, I] is the spatial dimension concatenation procedure and f ∈ R C×(H+W) is the intermediate feature map that encodes spatial information in the horizontal and vertical directions. Specifically, δ denotes the hard-Swish nonlinear activation function, which is utilized instead of the sigmoid layer in the rectified linear unit 6 (ReLU6) and SE blocks. The effect of hard-Swish activation on the deep model is better than that of the commonly used ReLU and Swish models, and the hard-Swish activation function is shown as follows: The performance of the hard-Swish nonlinear activation function and other related activation functions is shown in Figure 5. Then, the feature maps were encoded by combing batch normalization (BN) and non-Linear. f is split into two tensors along the spatial dimension: f h ∈R C×H and f w ∈R C×W . Two 1 × 1 convolutional transformation functions Fh and Fw were used to transform f h and f w into tensors with the same dimensions as those of the input X.
where σ is the sigmoid function, which smoothly maps the real number domain to [0, 1], and g h and g w are exaggerated and utilized as attention weights. The output of the CA Then, the feature maps were encoded by combing batch normalization (BN) and non-Linear. f is split into two tensors along the spatial dimension: f h ∈ R C×H and f w ∈ R C×W . Two 1 × 1 convolutional transformation functions F h and F w were used to transform f h and f w into tensors with the same dimensions as those of the input X.
where σ is the sigmoid function, which smoothly maps the real number domain to [0, 1], and g h and g w are exaggerated and utilized as attention weights. The output of the CA mechanism Y is described as:

ACB Module
As mentioned in [52], the amplitudes of the weights of the middle crisscross spots (i.e., the nuclear skeleton) are bigger. Still, the points in the corners offer less information regarding feature extraction and the standard convolution kernel refines features unevenly. Therefore, we introduced the ACB in each decoder layer to generate a smooth image (as shown in Figure 6). The ACB uses three branches: the 3 × 3 convolution captures features through a relatively large receptive field. In contrast, the 1 × 3 convolution and 3 × 1 convolution achieve crisscross receptive fields, ensuring the relevance of the skeletal characteristics and expanding the network's depth. Then, to obtain fusion results, the three distinct convolution feature maps were combined. Finally, a BN and a ReLU were utilized to improve the numerical stability of the process and activate the output in a nonlinear manner. The calculation of the ACB is expressed as: where x i and x i−1 are the output and input of the ACB, respectively. v is the variance function of the input, and u is the expectation of the input. ε i represents a tiny constant that ensures numerical stability. γ and β are two trainable parameters for the BN layer, where γ scales the normalization result and β shifts it. σ indicates the ReLU activation function.
Remote Sens. 2022, 14, x FOR PEER REVIEW 8 of 21 Figure 6. Structure of the ACB.

RFB Module
With the application of DCNN, the repeated pooling layers and stride convolutions led to the loss of a quantity of feature information during the process of feature extraction. In addition, the shallow feature mappings of the network contained rough semantics, thus introducing noise information into the target object extraction process. Therefore, to effec-

RFB Module
With the application of DCNN, the repeated pooling layers and stride convolutions led to the loss of a quantity of feature information during the process of feature extraction. In addition, the shallow feature mappings of the network contained rough semantics, thus introducing noise information into the target object extraction process. Therefore, to effectively unitize different level features information, we designed the RFB, which is composed of three parts, namely, a feature aggregation module (FAM), a linear attention module (LAM), and a compact spatial attention module (CSAM). Furthermore, we constructed parallel and serial connections to link the LAM and CSAM, and the structures are shown in Figure 7a,b.

RFB Module
With the application of DCNN, the repeated pooling layers and stride convolutions led to the loss of a quantity of feature information during the process of feature extraction. In addition, the shallow feature mappings of the network contained rough semantics, thus introducing noise information into the target object extraction process. Therefore, to effectively unitize different level features information, we designed the RFB, which is composed of three parts, namely, a feature aggregation module (FAM), a linear attention module (LAM), and a compact spatial attention module (CSAM). Furthermore, we constructed parallel and serial connections to link the LAM and CSAM, and the structures are shown in Figure 7a The low-level features have the advantage of high spatial resolution, maintaining complete spatial information, and are sensitive to gradient changes at different points. The high-level features have fully extracted abstract information. Therefore, the purpose of the RFB is to comprehensively utilize the advantages of low-level spatial information and high-level semantic information to obtain refined aggregation features. The input features of the RFB involve the CA features (CAFs) and ACB features (ACBFs). Then, the fusion characteristics are subjected to a convolution operation, a BN operation, and a ReLU activation function to obtain aggregate features (AFs), which are fusions of different level information flows. The specific expression is as follows: x(ACBF, CAF) = (U(ACBF), CAF) where C denotes the concatenation function. U indicates a two-scale factor upsampling technique. • indicates the convolution process. W × is a size of n × n convolutional ker- The low-level features have the advantage of high spatial resolution, maintaining complete spatial information, and are sensitive to gradient changes at different points. The high-level features have fully extracted abstract information. Therefore, the purpose of the RFB is to comprehensively utilize the advantages of low-level spatial information and high-level semantic information to obtain refined aggregation features. The input features of the RFB involve the CA features (CAFs) and ACB features (ACBFs). Then, the fusion characteristics are subjected to a convolution operation, a BN operation, and a ReLU activation function to obtain aggregate features (AFs), which are fusions of different level information flows. The specific expression is as follows: x(ACBF, CAF) = C(U(ACBF), CAF) where C denotes the concatenation function. U indicates a two-scale factor upsampling technique. • indicates the convolution process. W n×n is a size of n × n convolutional kernel. b denotes the bias vector, and x is the input pixel. W 1 is a 1 × 1 convolutional layer. f BN denotes batch normalization. f ReLU indicates the activation function of the corrected linear unit. Next, the AFs were subject to background noise elimination and residual reduction fitting to achieve refinement feature maps (RFs). Finally, a matrix multiplication technique between the AFs and RFs was executed to achieve a refined aggregation feature (RAF). The specific expression is as follows: serial structures : RFs = CSAM (LAM (AFs)), RFB (RAF) = AFs • RFs (13) parallel structures : RFs = LAM (AFs) + CSAM (AFs), RFB (RAF) = AFs • RFs (14) LAM: The LAM cements the spatial relationships among the AFs, thus restraining the fitting residuals. Supposing that the normalization function is softmax, the i-th row in the output matrix produced by the dot-product attention mechanism can be written as: Then, Equation (15) can be generalized and rewritten as: According to the first-order approximation of the Taylor expansion in Equation (15), we designed the LAM, which distinguishes the feature of our study from the mechanisms utilized in previous studies. e q T i k j ≈ 1 + q T i k j However, the above approximation cannot ensure the nonnegative property of the normalization function. Therefore, we normalize q i and k j by the l 2 norm to guarantee that q T i k j ≥ −1.
Equation (16) can be rewritten as Equations (19) and (20): The vectorized form of Equation (20) is: where ∑ N j=1 k j ||kj|| 2 v T j and ∑ N j=1 k j ||kj|| 2 can be calculated once and then reutilized for each query.

Datasets
To assess the effectiveness of the FCAU-Net, experiments were conducted on two public datasets. These datasets consist of remotely sensed cultivated land images obtained by Ziyuan-3 satellite sensors (ZY-3) and building images extracted from the Digital Earth Worldview-3 satellite sensor (DeepGlobe). Furthermore, the ZY-3 dataset was used for the MathorCup University Mathematical Modeling Challenge in 2021 (https://www.saikr. com/c/nd/7256, accessed on 23 November 2021). The DeepGlobe dataset was employed in the SpaceNet (https://registry.opendata.aws/spacenet, accessed on 23 November 2021) Building Detection Challenge.
ZY-3 dataset: The spectrum contains visible bands (i.e., red, green, and blue). The images are labeled with two classes (background and arable land) with spatial resolutions of 2 m and 600 × 500 pixels. Considering that a small number of dataset samples can cause difficulties during model training, a series of enhancement methods (e.g., rotation by 90 • , rotation by 180 • , rotation by 270 • , flipping, light adjustment, blurring, and adding increased noise) were used to expand the training dataset to 3000. The dataset was then split randomly into a training set, validation set, and test set at a ratio of 8:1:1. Examples of augmented images are shown in Figure 8. DeepGlobe dataset: The source images detected by the Worldview-3 satellite sensor, which contain RGB and 8-band multispectral data including 302,701 building footprints with 30 cm ground sample distances (GSDs). The dataset includes 650 × 650-pixel images of all areas of interest and is labeled with two classes (background and building). In this experiment, the Las Vegas subsets were selected to evaluate the generalization performance of the proposed algorithm. Approximately 151,367 buildings were utilized, covering over 216 km 2 . Finally, the dataset was randomly divided into training/validation/testing subsets at a ratio of 8:1:1. Example images and labels from both the ZY-3 dataset and DeepGlobe dataset: The source images detected by the Worldview-3 satellite sensor, which contain RGB and 8-band multispectral data including 302,701 building footprints with 30 cm ground sample distances (GSDs). The dataset includes 650 × 650-pixel images of all areas of interest and is labeled with two classes (background and building). In this experiment, the Las Vegas subsets were selected to evaluate the generalization performance of the proposed algorithm. Approximately 151,367 buildings were utilized, covering over 216 km 2 . Finally, the dataset was randomly divided into training/validation/testing subsets at a ratio of 8:1:1. Example images and labels from both the ZY-3 dataset and the DeepGlobe dataset are shown in Figure 9. DeepGlobe dataset: The source images detected by the Worldview-3 satellite sensor, which contain RGB and 8-band multispectral data including 302,701 building footprints with 30 cm ground sample distances (GSDs). The dataset includes 650 × 650-pixel images of all areas of interest and is labeled with two classes (background and building). In this experiment, the Las Vegas subsets were selected to evaluate the generalization performance of the proposed algorithm. Approximately 151,367 buildings were utilized, covering over 216 km 2 . Finally, the dataset was randomly divided into training/validation/testing subsets at a ratio of 8:1:1. Example images and labels from both the ZY-3 dataset and the DeepGlobe dataset are shown in Figure 9.

Implementation Details
We used the PyTorch framework on the Windows 10 platform with an NVIDIA GeForce RTX3090 24 GB GPU to complete the training and testing processes in this study. The best hyperparameters were selected based on a comparison of the results obtained from repeated experiments. The input size and batch size were set to 480 × 480 × 3 and 8, respectively. The number of training epochs was 100. Adam [53] was chosen as an optimizer, with an initial learning rate of 0.001, and β 1 and β 2 were set to their default values as recommended, namely β 1 = 0.9, β 2 = 0.999. Furthermore, we combined the cross-entropy loss function and Dice loss function to calculate the differences between the obtained segmentation maps and the ground reference, and both ratio coefficients were set to 1:1. Finally, to assess the performance of the FCAU-Net, U-Net [33], Attention U-Net [54], the PSPNet [37], DeepLab v3+ [55], the MAResU-Net [35], MACU-Net [18], and the TransUNet [56] were implemented.

The Loss Function
As the number of non-target pixels in most scenes is substantially more than the number of target pixels, the effect of class imbalance may lead the learning process to become caught in a local minimum of the loss function, severely biasing the classifiers toward the background class [57,58]. To tackle the issue of class disparities, the objective function was calculated by the Dice loss and cross-entropy function with a pixel-wise softmax function over the final feature maps. The formulas are expressed as: where y i and p i indicate the pixel values of the ground truth (GT) and the predicted probability map achieved through the sigmoid function, respectively. The total number of pixels is denoted by N. In our experiment, α = β = 1.0, ε represents a factor that is applied to smooth the loss and gradient, where ε = 1 × 10 −5 .

Evaluation Metrics
To quantitatively assess the performance of the FCAU-Net, we estimated the accuracy of the target extraction results based on the associated confusion matrix. The form of the confusion matrix is shown in Table 1. In the confusion matrix, the true positive (TP) entries represent the number of positives that are predicted to be positive, the false positive (FP) entries indicate the number of negatives that are predicted to be positive, and the true negative (TN) and false negative (FN) entries refer to the number of negative and positive results that are predicted to be negative, respectively. Based on the confusion matrix, we used the mean intersection over union (mIoU), pixel accuracy (PA), mean pixel accuracy (mPA), and frequency-weighted IoU (FWIoU) as the critical evaluation metrics to calculate the difference between the predicted mask and the GT. The mIoU, PA, mPA, and FWIoU are calculated as: where n is the total number of categories (including background categories). P ii and P ij are the total numbers of pixels belonging to true pixel category i that are predicted to belong to i and j, respectively.

Experimental Results
We evaluated the accuracy of the FCAU-Net and other architectures on the ZY-3 and DeepGlobe datasets (as shown in Table 2) based on the PA, mPA, mIoU, and FWIoU evaluation metrics. According to the calculation results, the FCAU-Net not only has obvious advantages over the contextual information aggregation methods (e.g., DeepLab v3+ and the PSPNet) that were originally designed for natural images but also outperformed the latest feature aggregation models (e.g., the MAResU-Net and MACU-Net). In addition, the proposed FCAU-Net transcends the TransUNet (which is an architecture based on the latest Transformer), achieving the highest PA of 97.97%, mPA of 98.53%, mIoU of 95.17%, and FWIoU of 96.07% on the ZY-3 dataset. The PA, mPA, mIoU, and FWIoU are 0.82%, 0.78%, 1.86%, and 1.52% higher than those of the TransUNet, respectively. For the DeepGlobe dataset, the proposed FCAU-Net achieves 95.05% PA, 91.27% mPA, 85.54% mIoU, and 90.74% FWIoU, bringing an immediate 0.77% increase in PA, a 1.08% increase in mPA, a 1.96% increase in the mIoU, and a 1.31% increase in the FWIoU over the values of the TransUNet. Therefore, the proposed FCAU-Net is effective and stable. For a qualitative efficacy verification, we showed the segmentation images generated by the proposed FCAU-Net and the comparative approaches in Figures 10 and 11. White represents an object (cultivated land or building), and black represents the background. According to the visualization results obtained on the ZY-3 and DeepGlobe datasets, it is clear that the proposed FCAU-Net captures more delicate features and has smaller target misclassification rates than other architectures (e.g., the MAResU-Net, U-Net, and the PSPNet). On the ZY-3 dataset, the direct use of multiscale modeling methods (e.g., the PSPNet and MACU-Net) and attention mechanism architectures (e.g., the MAResU-Net and Attention U-Net) yield specific improvements; a severe misclassification phenomenon occurs for edge pixels. The classification results of DeepLab v3+ and the TransUNet are the same as the GT. However, the proposed FCAU-Net can ultimately retain the contour information of complex objects. The cultivated land and building contours created by our FCAU-Net are smoother than those achieved by other approaches. Comparing the visualization results obtained on the DeepGlobe dataset, the segmentation results generated by our FCAU-Net are also closest to the GT.
Considering that the number of parameters and the computational complexity are essential to the advantages of the efficiency of the evaluated architecture, we listed the training time required per each epoch. The numbers of floating-point operations per second (Flops) and parameters used by different algorithms (see Table 3) were counted, where "M" represents millions. The comparison indicates that the spatial efficiency and computation efficiency of the FCAU-Net were the third-best among the seven models, with 32.42 M parameters and 101.04 GMac. Regarding the training time, although the FCAU-Net occupied less memory, it was slower than DeepLab v3+ and other models but was notably faster than Attention U-Net. and Attention U-Net) yield specific improvements; a severe misclassification phenomenon occurs for edge pixels. The classification results of DeepLab v3+ and the TransUNet are the same as the GT. However, the proposed FCAU-Net can ultimately retain the contour information of complex objects. The cultivated land and building contours created by our FCAU-Net are smoother than those achieved by other approaches. Comparing the visualization results obtained on the DeepGlobe dataset, the segmentation results generated by our FCAU-Net are also closest to the GT.  Considering that the number of parameters and the computational complexity are essential to the advantages of the efficiency of the evaluated architecture, we listed the training time required per each epoch. The numbers of floating-point operations per second (Flops) and parameters used by different algorithms (see Table 3) were counted, where "M" represents millions. The comparison indicates that the spatial efficiency and computation efficiency of the FCAU-Net were the third-best among the seven models, and Attention U-Net) yield specific improvements; a severe misclassification phenomenon occurs for edge pixels. The classification results of DeepLab v3+ and the TransUNet are the same as the GT. However, the proposed FCAU-Net can ultimately retain the contour information of complex objects. The cultivated land and building contours created by our FCAU-Net are smoother than those achieved by other approaches. Comparing the visualization results obtained on the DeepGlobe dataset, the segmentation results generated by our FCAU-Net are also closest to the GT.  Considering that the number of parameters and the computational complexity are essential to the advantages of the efficiency of the evaluated architecture, we listed the training time required per each epoch. The numbers of floating-point operations per second (Flops) and parameters used by different algorithms (see Table 3) were counted, where "M" represents millions. The comparison indicates that the spatial efficiency and computation efficiency of the FCAU-Net were the third-best among the seven models, Figure 11. The experimental results obtained on the DeepGlobe dataset.

Ablation Study
Ablation experiments were separately carried out on ZY-3 and DeepGlobe datasets to assess the efficacy of various combinations of the ACB, RFB, and CA modules in the FCAU-Net. Table 4 display the experimental settings and quantitative comparative results. Baseline: We selected the original U-Net as the baseline in our ablation experiments. The low-level feature maps created by each encoder layer through the convolutional layer, and the high-level feature maps created by the decoder through upsampling, were directly subjected to feature aggregation to restore the final segmented shape.
Ablation for the ACB: Since rich spatial details are significant for segmentation, we replaced the ordinary convolution in the decoder part of the FCAU-Net with the ACB. The features in each encoder layer were captured and refined, and the generated feature maps were smoother. Ablation for the RFB: As the feature information generated by the encoder-decoder structure was contained in different domains, simple feature map channel summation and concatenation were not the best feature aggregation methods. To eliminate background noise information and reduce the fitting residuals after feature fusion, we designed the RFB to effectively fuse different level features characteristics in this study.
Ablation for the CA module: Considering that inter-channel information and spatial position information are essential for capturing the long-term dependencies of object architectures and modeling in visual tasks, we proposed a novel CA mechanism that captures cross-channel, directional, and position-sensitive information. Therefore, the CA module can help the model to accurately locate and identify the object of interest. Table 4 show that the performance of the network with the ACB is superior to that of the encoder-decoder baseline. For the ZY-3 dataset, with the introduction of the ACB, we found that the PA increased by 2.50%, the mPA increased by 1.76%, the mIoU increased by 5.04%, and the FWIoU increased by 4.33%. The PA, mPA, mIoU, and FWIoU increasd by 0.59%, 2.98%, 2.10%, and 1.16%, respectively, for the DeepGlobe dataset. As the ACB cannot fully capture the feature information, we developed the RFB to realize the merging of low-level position information and high-level abstract features. For the ZY-3 dataset, the introduction of RFB increased the PA by more than 0.90%, the mPA by 0.64%, the mIoU by 1.94%, and the FWIoU by 1.63%, while the improvements achieved on the DeepGlobe dataset were 1.05%, 0.65%, 2.35%, and 1.69%, respectively. We designed a CA module to embed position information into the channel attention mechanism, enhancing the representations of objects of interest. The utilization of the CA module contributed to increases of 0.58% in the PA, 0.40% in the mPA, 1.28% in the mIoU, and 1.06% in the FWIoU on the ZY-3 dataset, while the improvements achieved on the DeepGlobe dataset were 0.47%, 0.50%, 1.15%, and 0.78%, respectively.

Influence of the Input Size
In this subsection, the different input sizes are inputted into the network for training to further evaluate the FCAU-Net. We discovered that the greater the input size, the better the performance effect. The input sizes were set at 480, 256, and 224. The accuracy of the test set was greatest at a size of 480 (seen Table 5). As the sample library involves single-target segmentation, we needed to select as large a slice size as possible to stabilize the network training procedure. Due to GPU source constraints, the batch size and input size were set at 8 and 480, respectively.

Optimization
In this experiment, adaptive moment estimation (Adam) was selected as the FCAU-Net optimizer. The stochastic gradient descent (SGD) optimization method is affected by the learning rate. However, the Adam optimization technique is not affected by the learning rate, and Adam may regulate the learning rate to a specific extent. Thereby, Adam was selected as the FCAU-Net optimizer to accelerate network convergence and obtain the best performance on the ZY-3 and DeepGlobe datasets. Figure 12a,b illustrate the loss values achieved on the training and validation sets. The training and validation loss function gradually decreased at the beginning; however, after a certain epoch, the loss no longer decreased and began to balance. The accuracy values of different approaches are shown in Figure 12c, which presents the finding that the FCAU-Net proposed by us has the best performance. Therefore, our experimental setting has certain feasibility. In addition, to examine the influence of the sensitivity of the hyperparameters (α and β in Equation (25)) on model accuracy, we conducted a test on the ZY-3 dataset as an example. As shown in Table 6, different scale factors have prominent effects on the performance of the model. The best performance was found when the cross-entropy (CE) and Dice losses were used and when α and β were set to 1. The FCAU-Net achieved 97.27% in In addition, to examine the influence of the sensitivity of the hyperparameters (α and β in Equation (25)) on model accuracy, we conducted a test on the ZY-3 dataset as an example. As shown in Table 6, different scale factors have prominent effects on the performance of the model. The best performance was found when the cross-entropy (CE) and Dice losses were used and when α and β were set to 1. The FCAU-Net achieved 97.27% in PA, a 97.87% mPA, a 93.60% mIoU, and a 94.77% FWIoU for the ZY-3 dataset.

Limitations and Future Work
Although the proposed FCAU-Net bridges the gap in feature fusion between low-level spatial information and high-level semantic information, some inherent issues need to be considered. As shown in Figure 13, the total number of trainable parameters in the FCAU-Net was 32.42 M, which is smaller than that required by medium-scale networks (e.g., the PSPNet (62.97 M) and DeepLab v3+(40.35 M)) but more significant than that required by small-scale networks (e.g., MACU-Net (5.15 M) and the MAResU-Net (26.79 M)). Therefore, the network efficiency of the FCAU-Net is relatively low and is not suitable for mobile platforms. Next, Transformer has gained prominence in a variety of CV tasks (e.g., image classification, object detection, and semantic segmentation) in recent years due to its powerful long-distance dependency capturing and sequence-based graphics modeling capabilities. Transformer divides the input image into non-overlapping continuous patches. It constructs a feature extraction block composed of a multihead self-attention (MHSA) module and an MLP to capture long-distance dependencies. Due to its nonconvolutional architecture and attention mechanism, Transformer can capture long-term dependencies more effectively than other approaches. Based on the above limitations and advantages, we will dedicate ourselves to studying semantic remote sensing image segmentation based on the Transformer architecture in future work.
Remote Sens. 2022, 14, x FOR PEER REVIEW 18 of 21 M)). Therefore, the network efficiency of the FCAU-Net is relatively low and is not suitable for mobile platforms. Next, Transformer has gained prominence in a variety of CV tasks (e.g., image classification, object detection, and semantic segmentation) in recent years due to its powerful long-distance dependency capturing and sequence-based graphics modeling capabilities. Transformer divides the input image into non-overlapping continuous patches. It constructs a feature extraction block composed of a multihead self-attention (MHSA) module and an MLP to capture long-distance dependencies. Due to its nonconvolutional architecture and attention mechanism, Transformer can capture long-term dependencies more effectively than other approaches. Based on the above limitations and advantages, we will dedicate ourselves to studying semantic remote sensing image segmentation based on the Transformer architecture in future work.

Conclusions
In this paper, we proposed an FCAU-Net for the semantic segmentation of fine-resolution remote sensing images. Specifically, a CA mechanism first embeds position information into a channel attention mechanism in the encoding stage to capture long-range relationships, providing a clear segmentation boundary and improving the accuracy of semantic segmentation. Second, an ACB was utilized to improve the representation ability of the standard convolution layer, and the features in each layer of the encoder were captured and refined, achieving a smooth image. Finally, we designed an RFB to achieve the

Conclusions
In this paper, we proposed an FCAU-Net for the semantic segmentation of fineresolution remote sensing images. Specifically, a CA mechanism first embeds position information into a channel attention mechanism in the encoding stage to capture long-range relationships, providing a clear segmentation boundary and improving the accuracy of semantic segmentation. Second, an ACB was utilized to improve the representation ability of the standard convolution layer, and the features in each layer of the encoder were captured and refined, achieving a smooth image. Finally, we designed an RFB to achieve the successful integration of low-level semantic and high-level abstract characteristics, eliminating the background noise when extracting feature information, reducing the fitting residuals of the fused features, and enhancing the ability of the network to capture information flows.
We performed extensive experiments and ablation studies on the public ZY-3 (0: background, 1: arable land) and DeepGlobe datasets (0: background, 1: building) to demonstrate the superiority of the FCAU-Net. Namely, the FCAU-Net can capture more delicate features, and the segmentation results have smoother and clearer visual effects than those of other methods.