1. Introduction
Semantic image segmentation is one of the most important image processing tasks [1,2]. Its purpose is to classify each pixel in the input image and attach a semantic label [3,4,5,6]. The semantic segmentation of remote sensing images often refers to the pixel-level classification and prediction of geographic entities (e.g., buildings, water bodies, roads, cars, and vegetation) [7]. Semantic segmentation is therefore a critical tool for improving image comprehension [8]. With the advancement of remote sensing technologies, a constellation of Earth observation satellites has been launched by China [9,10,11], and these satellites acquire substantial fine-resolution images that can be used for semantic segmentation [4,5,8]. This process has crucial research significance for urban planning [12], vehicle monitoring [13], land cover mapping [14], and change detection [15], as well as building and road extraction [16,17]. As semantic segmentation is a continually evolving technique, several classifiers have been developed for this task in the field of remote sensing [18], including traditional methods (e.g., logistic regression [19], distance-based metrics [20], and clustering [21]) and machine learning methods (e.g., support vector machines (SVMs) [22], random forests (RFs) [23], artificial neural networks (ANNs) [24], and multilayer perceptrons (MLPs) [25]). However, the flexibility and adaptability of these approaches are limited by their heavy dependence on handcrafted features and information transformations [4,5,7,26,27]. For example, spectral, spatial, and texture characteristics are difficult to optimize, resulting in insufficient reliability [8]. The advancement of deep learning has encouraged the use of convolutional neural networks (CNNs) for image processing [28]. CNNs, which are independent of handcrafted descriptors, can automatically capture nonlinear and hierarchical features, and they have remarkably influenced the field of computer vision (CV) [1,4,29]. However, a CNN loses part of the semantic information of the input image because of the local nature of the convolution kernel, leaving it unable to model long-range relationships for image segmentation [5,30]. Consequently, accurate classification is more difficult in fine-resolution images, and their segmentation remains a challenging topic.
Benefiting from local texture extraction, the fully convolutional network (FCN) [31] was the first effective end-to-end CNN structure demonstrated for this task. Skip connections enhance encoder feature information, and decoder feature maps are upsampled to the size of the original input, demonstrating significant generalization ability and high efficiency [6,7]. A series of FCN-based networks for scene segmentation have since been proposed, such as the segmentation network (SegNet) [32] and U-Net [33]. Although FCNs have elegant structures and remarkable achievements, their abstraction capabilities are insufficient to consider meaningful global context information for high-level features. Precise boundaries cannot be recovered correctly by 8× upsampling [7]; consequently, the insufficient utilization of information flows hinders the potential of the original U-Net architecture [18]. More detailed encoder-decoder structures have been proposed to address this issue [4]. Generally, the feature maps produced by an encoder contain low-level, fine-grained semantic information, while those produced by a decoder contain high-level, coarse-grained semantic information [34,35]. Skip connections act as bridges between low-level and high-level feature maps [18]. For example, U-Net++ [34] replaces direct skip connections with nested, dense skip connections; this not only strengthens the skip connections but also reduces the semantic gaps between the encoder and decoder [18]. U-Net3+ [36] employs full-scale skip connections to improve the capabilities of skip connections and extract characteristic information from the network. The pyramid scene parsing network (PSPNet) [37] directly builds feature maps of varying resolutions through global average pooling. Stride-based spatial pyramid pooling (SSPP) [38] alters the sizes of feature maps via strided pooling. These approaches are significant for semantic segmentation and multiscale feature extraction [4]. However, although they can gather context information to some extent, they merely mix features with distinct receptive fields via concatenation. Moreover, the differing feature representations and context extraction capabilities of these networks have been ignored; therefore, they cannot fully explore global context information [2,7,8,29,35].
A self-attention mechanism is essentially a method of imitating how humans observe objects. For example, when viewing pictures of people, most observers focus on crucial local information (such as the person themselves) rather than the visual backdrop. The self-attention mechanism was originally introduced in natural language processing and has been widely used in CV and remote sensing since its vast potential was first recognized [39]. Attention mechanisms [40,41] are a hot topic in research on convolution and recurrence. Modeling the long-range dependencies of feature maps and refining the extracted features improve the segmentation capabilities of deep networks [35,42,43]. For example, the squeeze-and-excitation network (SENet) [44] uses a channel attention structure to effectively establish interdependencies between channels and adaptively recalibrate channel responses; nevertheless, it ignores the importance of the position dimension for semantic segmentation. The dual-attention network (DANet) [45] designs spatial and channel attention modules based on a dot-product attention mechanism to extract rich context. The convolutional block attention module (CBAM) [46] utilizes spatial and channel attention modules to adaptively refine intermediate feature maps. MAResU-Net [35] embeds a multistage attention model into the direct skip connections of the original U-Net, thereby refining the multiscale feature maps. Unlike these methods, which use expensive, heavyweight nonlocal or self-attention blocks, the coordinate attention (CA) mechanism [47] effectively captures position information and channel-wise relationships. The CA mechanism enhances the feature representations of networks and obtains essential contextual information and the long-distance dependencies of geographic entities, improving the final segmentation results.
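To make the mechanism concrete, the following is a minimal PyTorch sketch of a coordinate attention block following the published formulation [47]; the module name, the channel reduction ratio, and the use of ReLU in the shared transform are illustrative choices rather than details taken from this paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention: spatial attention is factorized
    into two 1-D encodings along height and width, so long-range
    dependencies are captured along one direction while precise
    position information is preserved along the other."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (N, C, 1, W)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.attn_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.attn_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Direction-aware 1-D pooling along height and width.
        feat_h = self.pool_h(x)                      # (N, C, H, 1)
        feat_w = self.pool_w(x).permute(0, 1, 3, 2)  # (N, C, W, 1)
        # Joint encoding, then split back into the two directions.
        y = self.shared(torch.cat([feat_h, feat_w], dim=2))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.attn_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.attn_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * a_h * a_w
```

Factorizing spatial attention in this way lets the block capture long-range interactions along one spatial direction while retaining exact positions along the other, at a fraction of the cost of full nonlocal self-attention.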
A standard convolution kernel that extracts information with irregular proportions has a more extensive weight range at the central crisscross positions, whereas the points in the corners contribute less information for feature extraction. Therefore, we use an asymmetric convolution block (ACB) to enhance the spatial details of high-level abstract features by intensifying the weights of the central crisscross positions [18]; the ACB is incorporated into FCAU-Net for semantic segmentation (a minimal sketch is given below). Finally, regarding the fusion of features, many researchers have proposed effective feature-level fusion strategies. For example, Liu et al. [48] proposed a novel cross-resolution hidden layer feature fusion (CRHFF) approach for the joint classification of multi-resolution multispectral (MS) and panchromatic (PAN) images. CRHFF addresses the inconsistent feature representations of local patches, allowing objects to be modeled more comprehensively and increasing classification accuracy. Zheng et al. [49] proposed a novel multitemporal deep fusion network (MDFN) for short-term multitemporal high-resolution (HR) image classification, which integrates a long short-term memory (LSTM) branch and a CNN branch; the spatio-temporal-spectral features extracted and fused by these branches improve classification accuracy. However, a shallow feature map contains rough semantics and introduces noise during feature extraction, and the fusion of features with different resolutions leads to the insufficient utilization of information flows. To address this inadequate feature utilization, we design a refinement fusion block (RFB) that merges high-level abstract features with low-level spatial features, thereby eliminating background noise and reducing the fitting residuals after feature fusion.
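The following PyTorch sketch illustrates the ACB idea in the style of asymmetric convolution networks: a square 3×3 branch runs in parallel with 1×3 and 3×1 branches, and the three outputs are summed, so the central row and column of the kernel are reinforced. The branch layout and the single post-sum batch normalization are simplifying assumptions for illustration (ACNet-style blocks normalize each branch separately), not the exact configuration used in FCAU-Net.

```python
import torch
import torch.nn as nn

class AsymmetricConvBlock(nn.Module):
    """Sketch of an asymmetric convolution block (ACB): a square 3x3
    branch plus horizontal 1x3 and vertical 3x1 branches. Because the
    1x3 and 3x1 kernels overlap the square kernel only along its
    central row and column, summing the branches strengthens the
    crisscross weights."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.square = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.horizontal = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1))
        self.vertical = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0))
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # All three branches preserve the spatial size, so their
        # responses can be fused by element-wise addition.
        y = self.square(x) + self.horizontal(x) + self.vertical(x)
        return self.act(self.bn(y))
```

For the convolution part, summing the branch outputs is equivalent to a single 3×3 convolution whose central row and column carry larger effective weights, which is the intuition behind using the ACB to sharpen spatial detail.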
Experiments on two public remote sensing image datasets (ZY-3 and DeepGlobe) prove the efficacy of our fusion coordinate and asymmetry-based U-Net (FCAU-Net). For the binary classification problem in our experiments, labels 0 and 1 represent the background and arable land, respectively, in the ZY-3 dataset, and the background and buildings, respectively, in the DeepGlobe dataset. Furthermore, a well-designed model structure can offer a unified solution for semantic segmentation [50], object recognition [51], and change detection [15], which undoubtedly promotes the use of deep learning technology. In summary, the main contributions of this paper are as follows:
- (1)
A novel CA mechanism is introduced into the encoding process to effectively model channel-wise relationships. Accurate position information is used to capture long-range dependencies, enabling the model to accurately locate and identify objects of interest.
- (2)
In the decoding process, we use an ACB to capture and refine the obtained features by enhancing the weights of the central crisscross positions to improve the convolutional layer’s representation capabilities.
- (3)
We design an RFB to combine low-level spatial features with high-level abstract features and thereby take full advantage of the available feature information. The RFB fully exploits the complementary advantages of representations at various levels.
- (4)
To avoid the imbalance between target and nontarget areas, which may cause the learning process to fall into a local minimum of the loss function and strongly bias the classifier toward the background class, we utilize a combination of the cross-entropy loss function and the Dice loss function, which alleviates the sample imbalance issue (see the sketch after this list).
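As an illustration of this combined objective, the following is a minimal PyTorch sketch for the binary segmentation setting described above; the equal weighting of the two terms and the smoothing constant are assumptions for illustration, not values reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    """Sketch of a combined cross-entropy + Dice objective for binary
    segmentation (labels 0 = background, 1 = target class)."""
    def __init__(self, dice_weight: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.dice_weight = dice_weight
        self.eps = eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (N, 1, H, W) raw scores; target: (N, 1, H, W) in {0, 1}.
        target = target.float()
        ce = F.binary_cross_entropy_with_logits(logits, target)
        prob = torch.sigmoid(logits)
        # Soft Dice computed per sample, then averaged over the batch.
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = 1.0 - (2.0 * inter + self.eps) / (denom + self.eps)
        return ce + self.dice_weight * dice.mean()
```

The cross-entropy term provides stable per-pixel gradients, while the Dice term directly optimizes region overlap and is insensitive to the foreground-to-background pixel ratio, which counteracts the bias toward the background class.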
The flowchart of the FCAU-Net is shown in Figure 1. The network includes the CA mechanism, the ACB, and the RFB. The FCAU-Net solves the feature fusion problem of pixel-level segmentation and accurately extracts different types of contour information for target objects; furthermore, the locations, shapes, and spatial distributions of target objects are delineated more precisely. The following section introduces the architecture and components of the FCAU-Net in detail. Experimental comparisons on the two public remote sensing image datasets (ZY-3 and DeepGlobe) are provided in Section 3. A discussion is presented in Section 4. Finally, conclusions are drawn in Section 5.