1. Introduction
With the continuous advancement of technology, the resolution of remote sensing images has been steadily improving [1,2]. Through remote sensing images, we can clearly obtain the texture and spatial features of buildings and water bodies. The characteristics of water bodies and buildings contribute to a clearer understanding of urban land planning and water coverage [3,4], helping to overcome many challenges related to urban planning, environmental engineering, and natural landscape monitoring [5,6]. Therefore, semantic segmentation of remote sensing images is of great practical significance. Due to the significant differences in the spatial distribution and spectral characteristics of buildings and water bodies, traditional semantic segmentation models struggle to effectively capture and distinguish the subtle features of these objects. In high-resolution images, the geometric structure of buildings and the dynamic textures of water bodies pose severe challenges to the feature-extraction and region-segmentation capabilities of segmentation models. In particular, the boundaries of these objects are often more blurred and the scale variations more pronounced, so models must handle more complex spatial information during feature extraction and semantic mapping. The rapid development of high-resolution remote sensing satellite technology has simultaneously brought significant challenges to the accuracy of land surface classification [7,8,9]. The complex features and background noise in high-resolution images impose higher demands on the robustness and generalization ability of models [10]. In recent years, with the rapid development of artificial intelligence, a novel and efficient approach has emerged for achieving automatic image segmentation [11].
Semantic segmentation is primarily employed to generate an accurate prediction label for each pixel in an input image [12]. Based on these labels, the image is classified pixel by pixel, and the pixels may belong to objects of different categories, including objects that are small, distant, or otherwise difficult to identify. Effectively classifying these objects has become one of the prominent topics in computer vision and machine learning. Existing deep learning models enhance adaptability to dynamic changes by integrating multiscale features and handling background interference; however, this also introduces new challenges for the design and training of these models. Therefore, in the semantic segmentation of high-resolution remote sensing images, how to effectively reduce computational complexity while preserving details, and how to improve model robustness when dealing with diverse features, have become current research hotspots and challenges. This not only involves innovation in algorithm design but also challenges existing data-processing and model-training methods, driving the development of remote sensing image-analysis technology towards higher precision and efficiency.
Later, with the fast development of deep learning, an increasing number of researchers applied deep learning to semantic segmentation tasks and proposed many efficient deep learning-based methods for semantic segmentation [13,14,15]. Deep learning methods can learn multi-level semantic information, enabling models to acquire a deeper understanding of the different features of images [16,17,18].
In 2006, Hinton [19,20] first proposed a deep learning method, which achieved significant advances in image-processing tasks and excelled in particular at semantic segmentation. However, several challenges arise when applying deep learning methods to datasets of remote sensing images of buildings and water. Remote sensing images are complex, often containing noise such as clouds and shadows in addition to building and water features; instability in image quality can lead to overfitting or underfitting, affecting segmentation accuracy. Remote sensing images typically encompass numerous categories, and during training there may be a bias toward common categories in the dataset, resulting in inaccurate segmentation of rare categories. In semantic segmentation of remote sensing images, spatial information is crucial: the shape and even the number of objects in the same scene can vary when captured from different angles, so capturing such change information to improve accuracy poses a significant challenge [21,22]. It is evident that while simple deep learning methods can enhance accuracy through extensive training and learning of more useful information, they still face substantial challenges when dealing with complex datasets [23].
In recent years, with the rapid advancement of deep learning, an increasing number of efficient methods for semantic segmentation of remote sensing images have been proposed. Long et al. [24] introduced the Fully Convolutional Network (FCN), the first end-to-end Convolutional Neural Network (CNN) structure designed specifically for semantic segmentation. Before the advent of FCN, image features were extracted manually and then classified pixel by pixel; these methods not only required multiple processing stages but also yielded suboptimal results. FCN revolutionized semantic segmentation by implementing an end-to-end process that associates each pixel in the image with its corresponding class. FCN replaced the fully connected layers of a CNN with convolutional layers, enabling it to accept larger images and produce output maps of the same size. This simplification of the semantic segmentation process improved segmentation accuracy and drove the advancement of the field. Since then, a series of efficient networks for semantic segmentation have been continuously proposed.
However, FCN still lacks attention to global information [
25]. Reference [
26] introduced the DeepLab network, which maintains high image resolution by reducing down-sampling operations. This network incorporates Fully Connected Conditional Random Field (FCCRF) by considering pixel relationships using image labeling, leading to improved segmentation. Subsequently, variations such as DeepLabV2 [
27], DeepLabV3 [
28], and DeepLabV3+ [
17] were proposed based on this model. DeepLabV2 introduced Atrous Spatial Pyramid Pooling (ASPP) to extract and integrate multiscale information. Chen et al., drawing inspiration from methods in references [
18,
29], proposed the DeepLabV3 network. This network eliminates the use of FCCRF and improves ASPP. Building upon this, the DeepLabV3+ network was introduced, adding a decoder module to extract boundary information from features. Additionally, it utilizes depthwise separable convolutions to enhance model performance and reduce parameter count compared to DeepLabV3.
Subsequently, some researchers proposed encoder–decoder-based methods [
30,
31,
32]. Ronneberger et al. [
16] introduced UNet, which strengthens the extraction of spatial semantic information. Similar to FCN, UNet is primarily applied in the semantic segmentation of medical images. Its unique design includes skip connections, connecting feature maps from the encoder and decoder to merge information from different layers, thereby improving segmentation accuracy. The later improvement, UNet++ [
33], enhanced UNet by incorporating short-link skip connections and feature fusion. Badrinarayanan et al. [
34] then introduced SegNet, another encoder–decoder architecture. SegNet is a variant of FCN, and while it shares similarities with UNet, its encoder utilizes the initial 13 layers of the VGG16 convolutional network. Each encoder layer is matched with a corresponding decoder layer. The encoder is utilized to reduce the resolution of feature maps, preserving high-level features, while the decoder restores lower-resolution feature maps to the original resolution. SegNet uses pooling indices instead of traditional weight parameters for upsampling in the decoder stage, minimizing the parameter count and improving memory efficiency. However, SegNet requires extended training time, and optimal results are achieved with demanding parameter settings.
Subsequently, researchers integrated attention mechanisms into networks to enhance segmentation accuracy. Attention-based methods focus more on the features we want to extract, disregarding redundant features [
35,
36,
37], and reducing the impact of noise on feature extraction. Attention mechanisms learn feature weights through forward propagation and backward feedback, extracting features based on these learned weights. Zhao et al. [
38] proposed PSANet, which introduces self-attention mechanisms, allowing each pixel to interact with other pixels in the image, addressing long-range dependencies between pixels. Following this, a series of attention-based networks have been proposed, with CCNet [
39] introducing a crisscross module for extracting global contextual information. BiSeNet [
40] uniquely employs a dual-branch approach for semantic segmentation, dedicated to extracting global contextual information and local details. The feature-fusion module of this network adapts to a multiscale feature pyramid, enabling the segmentation of images with different sizes and resolutions. While BiSeNet is more commonly used in lightweight models, it may perform less optimally in more complex models. Wang et al. [
41] introduced self-attention mechanisms into semantic segmentation tasks, achieving significant results [
42,
43]. The paper proposes a non-local block to obtain long-range spatial dependency information, thereby capturing the global contextual information of the image and effectively improving segmentation accuracy.
To address the challenge of extracting features at different levels, some scholars proposed pyramid structures. By utilizing pyramids with varying scales, this approach extracts features at different hierarchies, obtaining more context information while minimizing performance loss. Zhao et al. [18] introduced PSPNet (Pyramid Scene Parsing Network), which incorporates the Pyramid Pooling Module (PPM). This module employs dilated convolutions to increase the receptive field; the dilated convolutions capture more context information without themselves increasing the parameter count. However, the module as a whole still adds computational complexity, limiting the applicability of the model and, in resource-constrained situations, leading to longer training times. Yang et al. [44] combined DeepLab’s ASPP with DenseNet’s dense connections to create the DenseASPP network. The uniqueness of this network lies in its denser sampling points, allowing it to acquire more information.
With the introduction of transformers, semantic segmentation technology has seen significant improvements. Transformers replace the convolutional layers in traditional models by employing self-attention to compute inputs and outputs. Unlike traditional convolutions that rely on local receptive fields, transformers excel in modeling long-range dependencies and capturing global contextual information. Due to their unique attention mechanism, transformers can acquire more comprehensive global information, thereby enhancing their ability to handle more complex scenarios. The introduction of Vision Transformers (ViT) in 2020 [
45] marked the beginning of transformer research in the visual domain. ViT divides images into fixed-size patches and treats each patch as a sequence for processing. Segformer [
46] innovatively combines transformers with a lightweight MLP decoder, achieving efficient segmentation at various resolutions. STT (Sparse Token Transformer) [
47] deepens the interdependence between spatial and channel dimensions through a dual-branch structure, allowing it to capture global context and obtain a global receptive field.
The above-mentioned networks have addressed many challenges in semantic segmentation. However, some limitations persist, particularly in the segmentation of buildings and water bodies [48,49]. One issue is the neglect of spatial information in many networks: as the network deepens, the image resolution decreases due to repeated downsampling, leading to the loss of crucial boundary information and inaccuracies in segmentation [50]. Preserving boundary information while deepening the network is therefore a critical challenge for improving semantic segmentation accuracy. Another concern is that many networks focus on extracting high-level semantic features from deeper layers while overlooking low-level semantic features, resulting in the omission of basic image features and attributes. Effectively integrating high-level and low-level semantic features can significantly enhance segmentation accuracy. To address these challenges, this paper introduces a dual-branch semantic segmentation network with spatial complementary information. The network uses two branches to further process the features extracted by the deep network: one branch extracts contextual information from the images, while the other reinforces the extraction of spatial information. To mitigate the impact of noise on spatial information, spatial attention is introduced. Finally, to generate feature maps that incorporate both high-level and high-resolution features, we combine low-level feature maps with high-level feature maps. Both the deep and shallow features pass through a multichannel deep feature-extraction module, which controls the extraction of useful semantic information by measuring the utility of different features. Experiments demonstrate that the proposed network achieves a notable improvement in the precise extraction of features.
Our work has made the following contributions:
The Context-aware Spatial Feature-Extractor Unit is introduced, composed of two branches dedicated to extracting contextual semantic information and spatial semantic information from images. This module is designed to enhance attention to the relative positional relationships between target objects and their surroundings while also modeling the relationship between target objects and the wider environment. The spatial information branch supplements the boundary information partially lost as the network deepens, thereby addressing the issue of edge blurring in semantic segmentation.
In the Multichannel Deep Feature-Extraction Module of the Context-aware Spatial Feature-Extractor Unit, a spatial attention module has been incorporated. This attention module, instead of focusing more on the target objects, places greater emphasis on the position of the target objects, intensifying attention to the targets and reducing the impact of noise on semantic segmentation accuracy.
Utilizing the Feature-Interaction Module, the contextual semantic information of high-level semantic features is effectively integrated with spatial semantic information. The spatial information is employed to complement the contextual information, resulting in a more comprehensive extraction of features from high-level semantic information.
The Multichannel Deep Feature-Extraction Module is introduced, designed for extracting features at various levels and fusing high-level semantic features with low-level semantic features. This module supplements high-level features with low-level semantic features, controlling information propagation by predicting the usefulness of each pixel. When the information at the current layer is deemed unnecessary, the layer forwards useful information to other layers or receives information from other layers.
2. Methodology
With the continuous enhancement of remote sensing image resolution, the complexity of target objects and background information increases. Spatial information has become a crucial factor affecting the accuracy of semantic segmentation [
51]. In complex spatial scenarios, issues such as misclassification and loss of information may arise, particularly concerning structures and water bodies. To tackle the issue of inaccurate segmentation caused by the loss of spatial information in semantic segmentation, this paper introduces a dual-branch semantic segmentation network with spatial supplementary information. This network effectively extracts spatial semantic information and integrates low-level semantic information with high-level semantic information to enhance segmentation accuracy.
Figure 1 illustrates the overall design of the network. We choose ResNet [52] as the backbone network for feature extraction at different stages, and then employ the Context-aware Spatial Feature-Extractor Unit to acquire contextual and spatial information from the images. The backbone feature maps not only contain the semantic information of the target objects but also retain a certain level of spatial resolution, and they are passed to two main branches: the spatial information-extraction branch and the contextual information-extraction branch. The spatial information-extraction branch primarily focuses on capturing spatial relationships within the image, enhancing attention to critical spatial information through a spatial attention mechanism while suppressing unnecessary noise; the extracted spatial features are then passed to the Feature-Interaction Module. Meanwhile, the contextual information-extraction branch focuses on capturing global semantic information from the features extracted at different stages of ResNet. The global features within the contextual information are fused with the spatial information through the Feature-Interaction Module, which coordinates and combines the features obtained from the two branches to ensure collaboration between global contextual information and local spatial information. Finally, the Multichannel Deep Feature-Extraction Module extracts low-level semantic information from the shallow layers of the network. These low-level features supplement and enhance the high-level semantic information obtained earlier, and the features from the low and high levels are fused to generate the final output feature map. This fusion ensures that the segmentation results possess high-resolution spatial information while retaining comprehensive semantic context, thereby significantly improving segmentation accuracy and robustness.
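To make the data flow above concrete, the following is a minimal PyTorch-style sketch of the dual-branch pipeline, assuming illustrative module interfaces and channel widths (a common width of 256 channels and three backbone stages); it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SPNetSketch(nn.Module):
    """Illustrative data flow only: backbone -> dual branches -> feature interaction -> low/high fusion."""
    def __init__(self, backbone, context_branch, spatial_branch,
                 feature_interaction, multichannel_fusion, num_classes=3, channels=256):
        super().__init__()
        self.backbone = backbone                    # truncated ResNet50 returning three stage outputs
        self.context_branch = context_branch        # Environmental Perception Feature-Extraction Module
        self.spatial_branch = spatial_branch        # Spatial Relation Perception Feature-Extraction Module
        self.feature_interaction = feature_interaction
        self.multichannel_fusion = multichannel_fusion
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        f1, f2, f3 = self.backbone(x)               # low- to high-level backbone features
        ctx = self.context_branch(f3)               # global contextual information
        spa = self.spatial_branch(f3)               # spatial positions and relationships
        high = self.feature_interaction(ctx, spa)   # cross-fused high-level features
        low = self.multichannel_fusion(f1, f2, f3)  # gated low-level supplement (assumed same channel width)
        fused = high + nn.functional.interpolate(
            low, size=high.shape[-2:], mode="bilinear", align_corners=False)
        logits = self.classifier(fused)
        return nn.functional.interpolate(           # restore the input resolution
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
```

Sketches of the individual submodules assumed here are given in the corresponding subsections below.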
2.1. Backbone
The backbone network plays a crucial role in semantic segmentation, where an appropriate backbone network can more accurately extract features from images, resulting in enhanced segmentation accuracy. With the success of AlexNet [
53], an increasing number of complex neural networks have gradually come into the public eye. VGG [
54] enhances network performance by increasing its depth, while GoogleNet [
55] executes multiple convolution and pooling operations in parallel, achieving deeper networks with reduced parameters. MobileNet [
56] finds wide applications in mobile and embedded systems, and FCN employs fully convolutional layers to process input images, enabling pixel-level labeling or segmentation. The mentioned networks are currently popular backbone networks that exhibit good performance in semantic segmentation tasks. Deepening the network layers enables the extraction of richer and more intricate features from images, thereby enhancing segmentation accuracy. However, deeper networks may face challenges such as gradient explosions and overfitting. Moreover, increasing parameters with network depth raises computational demands, limiting the network’s applicability. In summary, in this article, we have chosen ResNet50 as the backbone network. The residual network structure of ResNet effectively addresses issues like gradient explosions and overfitting [
57]. The residual structure, depicted in
Figure 2, differs from regular connections by introducing a shortcut mapping that directly connects the input and output. This approach captures the differences between input and output.
The formula for this residual structure is as follows:

$$x_{i+1} = \sigma\big(W_2\,\sigma(W_1 x_i) + x_i\big)$$

where $x_i$ is the input matrix of the $i$-th residual module, $x_{i+1}$ is the output of the $i$-th residual block, $W_1$ and $W_2$ are weight matrices, and $\sigma$ represents the ReLU function.
In this paper, to reduce parameters, we modified the ResNet by removing the last layer; only the first three layers were utilized for semantic feature extraction.
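As a reference for this modification, the following is a minimal torchvision-based sketch of a ResNet50 truncated after its third stage; returning all three stage outputs is an assumption consistent with the description above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TruncatedResNet50(nn.Module):
    """ResNet50 with the final stage (layer4), average pooling and fc head removed;
    returns the outputs of the first three residual stages."""
    def __init__(self, pretrained=False):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1" if pretrained else None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1 = net.layer1   # 256 channels, 1/4 resolution
        self.layer2 = net.layer2   # 512 channels, 1/8 resolution
        self.layer3 = net.layer3   # 1024 channels, 1/16 resolution

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)
        f2 = self.layer2(f1)
        f3 = self.layer3(f2)
        return f1, f2, f3

# Example: three feature maps at decreasing resolution.
feats = TruncatedResNet50()(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])
```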
2.2. Context-Aware Spatial Feature-Extractor Unit
Many existing networks have incorporated attention to spatial and contextual information. For example, the PSPNet and DeepLab series focus on contextual information, while Unet and SegNet emphasize spatial information. These networks have shown excellent performance in semantic segmentation tasks. However, the above networks typically focus on either spatial or contextual information. Contextual information is concerned with global semantic details, whereas spatial information focuses on local details. Focusing on only one type of information results in features that are local rather than comprehensive. In contrast, the Context-aware Spatial Feature-Extractor Unit proposed in this paper simultaneously attends to both contextual and spatial information, producing more complete features compared to existing mature networks.
This module consists of two components, the Environmental Perception Feature-Extraction Module and the Spatial Relationship Perception Feature-Extraction Module. The Environmental Perception Feature-Extraction Module extracts contextual information from the image [
58], while the Spatial Relationship Perception Feature-Extraction Module focuses on extracting spatial information. Contextual information primarily addresses the environment in which pixels are located but lacks attention to the relationships between pixels. The proposed network in this paper incorporates a branch for extracting spatial information to complement contextual information, thereby enhancing the overall understanding of the layout and improving the comprehension of contextual information.
2.2.1. Environmental Perception Feature-Extraction Module
This module is primarily crafted for extracting contextual information from images. Contextual information in images refers to the environmental information surrounding the features of interest, specifically the contextual information of pixels. Strengthening the focus on contextual information aids in a more comprehensive understanding of the image, thereby improving segmentation accuracy.
This module obtains multiscale contextual information by using convolutions with different dilation rates. The structure of this module is shown in
Figure 3. Regular convolutions, due to their smaller receptive field, can only capture local features. In this module, we use
convolutions with dilation rates of 6, 12, and 18 to achieve larger receptive fields. Convolutions with different dilation rates focus on different features, thereby helping us obtain more contextual information. Max pooling involves downsampling the input feature map to enlarge the receptive field and gather more contextual information by reducing resolution. Additionally, the module includes a
convolution aimed at reducing the number of channels to decrease computational complexity. Finally, the features obtained from these operations are integrated to produce the ultimate feature map. The feature map obtained through this module contains more contextual information compared to previous ones, effectively improving segmentation accuracy.
The formula for the dilated convolution (DWconv) is shown below:

$$f(i,j) = \sum_{m}\sum_{n} x\big(i + d\cdot m,\; j + d\cdot n\big)\, w(m,n)$$

where $f$ is the output feature map, $x$ is the input image, $w$ is the convolution kernel weight, $d$ is the dilation rate, and $i$ and $j$ are the coordinates of the output position.
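A plausible sketch of this module is given below, in the spirit of ASPP: parallel dilated convolutions with rates 6, 12, and 18, a pooling branch, and a 1 × 1 channel-reduction branch. The 3 × 3 kernel size, the channel widths, and the use of global max pooling are assumptions, since the text does not fix them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnvironmentalPerceptionModule(nn.Module):
    """ASPP-style sketch: parallel dilated convolutions (rates 6/12/18), a max-pooled
    context branch and a 1x1 channel-reduction branch, fused by a final 1x1 projection."""
    def __init__(self, in_ch=1024, out_ch=256):
        super().__init__()
        def dilated(rate):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=rate, dilation=rate, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.d6, self.d12, self.d18 = dilated(6), dilated(12), dilated(18)
        self.pool = nn.Sequential(nn.AdaptiveMaxPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                  nn.ReLU(inplace=True))
        self.project = nn.Sequential(nn.Conv2d(5 * out_ch, out_ch, 1, bias=False),
                                     nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.pool(x), size=(h, w), mode="bilinear", align_corners=False)
        feats = [self.reduce(x), self.d6(x), self.d12(x), self.d18(x), pooled]
        return self.project(torch.cat(feats, dim=1))
```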
2.2.2. Spatial Relation Perception Feature-Extraction Module
This module is designed primarily to extract spatial information from images. Spatial information in images refers to the spatial position of pixels and their relationships with other pixels. In the previous Environmental Perception Feature-Extraction Module, we neglected the spatial information of features. Therefore, this module is introduced to extract and supplement the missing spatial information. The module comprises two parts: the Spatial Feature-Extraction Module and the Spatial Attention Module.
The Spatial Feature-Extraction Module employs a convolutional neural network to extract features because convolutional networks can automatically learn the features needed for semantic segmentation. This module consists of four convolutional layers, including three
convolutional layers and one
convolutional layer. The first three convolutional layers are mainly used for feature extraction, while the last
convolutional layer is primarily for channel restoration. To obtain more features, we double the number of channels in the second and third
convolutional layers. Increasing the number of channels enhances feature representation capacity, as different channels learn different features, and more channels can capture a more diverse range of features. Nevertheless, the rise in the quantity of channels brings the problem of increased parameters. To tackle this concern, we employ depthwise separable convolutional layers instead of regular convolutional layers, as depthwise separable convolutions have fewer parameters. Finally, a
convolutional layer is applied to restore the quantity of channels to its original size, reducing the extraction of irrelevant information. The model diagram for this module is shown in
Figure 4, where C represents the number of channels.
Depthwise separable convolution consists of a depthwise convolution and a pointwise convolution. In depthwise convolution, each input channel has its own convolution kernel, which significantly reduces the number of parameters; each kernel has a small receptive field and operates within a single channel, focusing on local features of the image, which helps strengthen attention to spatial information and acquire more spatial details. Pointwise convolution then applies a 1 × 1 convolution across channels to combine the per-channel outputs. Therefore, we replace regular convolutions with depthwise separable convolutions, reducing the number of parameters while paying more attention to the spatial information of features.
Figure 5 illustrates the difference between regular convolution and depthwise separable convolution, demonstrating that regular convolution involves more computations when the number of channels is the same.
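The parameter saving can be verified with a short sketch; the 256-channel, 3 × 3 configuration below is chosen only for illustration.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one kernel per input channel) followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for a 256 -> 256 channel, 3x3 layer:
regular = nn.Conv2d(256, 256, 3, padding=1, bias=False)
separable = DepthwiseSeparableConv(256, 256)
print(sum(p.numel() for p in regular.parameters()))    # 589,824
print(sum(p.numel() for p in separable.parameters()))  # 2,304 + 65,536 = 67,840
```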
In the Spatial Attention Module of the Spatial Feature-Extraction Module, spatial attention assigns different weights to the different positions of targets in the input feature map. It focuses on the positions of features by obtaining global features of the input data through average pooling and max pooling; subsequently, a convolutional layer and the sigmoid function are used to generate the corresponding weights. The formula for spatial attention is as follows:

$$M_s(F) = \sigma\Big(f^{k\times k}\big(\big[\mathrm{AvgPool}(F);\,\mathrm{MaxPool}(F)\big]\big)\Big)$$

where $M_s(F)$ represents the final attention obtained, $F$ is the input feature map, $f^{k\times k}$ is the convolutional layer with a kernel size of $k$, $\sigma$ is the sigmoid function, and $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ are the average pooling and max pooling operations on the input feature, respectively.
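A minimal sketch of this spatial attention, following the common channel-pooled formulation, is shown below; the kernel size of 7 is an assumption, since the text leaves $k$ unspecified.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise average and max pooling, a k x k convolution over the two pooled maps,
    and a sigmoid that produces one weight per spatial position."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)    # average pooling across channels
        mx, _ = torch.max(x, dim=1, keepdim=True)   # max pooling across channels
        weights = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights                          # reweight every spatial position
```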
The Spatial Feature-Extraction Module is used for the preliminary extraction of spatial information, and then the spatial attention is incorporated to better comprehend the structure and spatial relationships within the input data. The Spatial Information-Extraction Module, composed of these two modules, can comprehensively extract spatial information that was previously overlooked by the network. This aids in understanding the relative positions and distances between features in the image, as well as grasping the location, size, and shape of features. Ultimately, it enhances our understanding of the image.
2.3. Feature-Interaction Module
In semantic segmentation, both contextual information and spatial information are crucial for effective segmentation. Effectively combining the two can help us understand the relationships between features and grasp the location, size, and shape of features in the image. By focusing on the features themselves and their related content, we can gain a more comprehensive understanding of the features. To address this issue, we propose a Feature-Interaction Module to efficiently integrate the information obtained from the Environmental Perception Feature-Extraction Module and the Spatial Relation Perception Feature-Extraction Module.
This module is illustrated in
Figure 6. Initially, we cross-fuse the feature maps obtained from the Environmental Perception Feature-Extraction Module and the Spatial Relation Perception Feature-Extraction Module. Each branch’s feature map is passed into the other branch to create a fused map that combines simple semantic and spatial information. Subsequently, the fused maps from both branches are input into the Goal-Oriented Attention Mechanism to obtain a noise-filtered feature map. This attention-filtered feature map is then concatenated with the original feature map through residual connections, effectively improving feature quality and better integrating global and local information to enhance segmentation accuracy. Following this, the feature maps from both branches are fused to obtain the final complete feature-fusion map. The purpose of the two fusion steps is to enhance the model’s representation and generalization capabilities. The introduction of attention during the fusion stages allows the model to selectively emphasize features in different regions or channels adaptively, contributing to a better understanding of various scenes.
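A sketch of this two-step fusion is given below. Element-wise addition is used for the cross-fusion and residual steps, and a single shared attention module filters both fused maps; these are simplifying assumptions, not the exact operations of the paper.

```python
import torch
import torch.nn as nn

class FeatureInteractionModule(nn.Module):
    """Cross-fuse the context and spatial maps, filter each fused map with a
    Goal-Oriented Attention module, add residual connections, then merge both paths."""
    def __init__(self, channels, attention):
        super().__init__()
        self.attention = attention                   # e.g. the Goal-Oriented Attention sketched below
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, ctx, spa):
        fused_ctx = ctx + spa                        # spatial cues injected into the context path
        fused_spa = spa + ctx                        # context cues injected into the spatial path
        ctx_out = ctx + self.attention(fused_ctx)    # attention-filtered map plus residual
        spa_out = spa + self.attention(fused_spa)
        return self.merge(torch.cat([ctx_out, spa_out], dim=1))
```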
The Goal-Oriented Attention Mechanism in this module, as shown in
Figure 7, primarily involves adjusting the feature map effectively using weights. This mechanism aims to make the model focus more on important regions, reducing the impact of noise on segmentation, and thereby enhancing the model’s expressiveness and generalization capability.
The formula for the Goal-Oriented Attention Mechanism is as follows:

$$A = \mathrm{Drop}\Big(\mathrm{Norm}\big(\mathrm{Softmax}(X \otimes W_1)\big)\Big), \qquad X_{out} = A \otimes W_2$$

where $X \in \mathbb{R}^{N\times d}$ is the input feature map, $N$ is the number of pixels in the image and $d$ is the dimensionality of the features. $W_1$ and $W_2$ are linear layers with 64 nodes, ⊗ represents the dot product, $\mathrm{Softmax}$ is the softmax function, $\mathrm{Norm}$ stands for L1 normalization, and $\mathrm{Drop}$ refers to the Dropout layer.
The two linear layers in this section perform linear transformations on the input and take the roles of query and value generation in a standard attention mechanism. Here, $W_1$ maps the input features to a new space with dimension $S$, which is equivalent to generating queries (Query). $W_2$ maps the output of dimension $S$ back to the original feature dimension, which is equivalent to generating values (Value). This design provides a clear way to generate queries and values, which are then used to compute attention scores; the weighted combination of input features through these attention scores achieves our goal.
The softmax function is defined as:

$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

Softmax is computed over each row of the matrix generated by $X \otimes W_1$, transforming the elements of that row into a probability distribution between 0 and 1 whose elements sum to 1. The purpose of applying softmax is to assign a weight to each position [59], thereby enhancing the network’s attention to specific locations.
After applying softmax, we further use L1 normalization so that the output distribution sums to 1. The formula for L1 normalization is as follows:

$$\hat{a}_i = \frac{a_i}{\sum_{j} |a_j|}$$

where $a_i$ is the output of softmax and $\sum_{j} |a_j|$ is the sum of all current elements. Finally, we use a Dropout layer to prevent overfitting and improve the model’s generalization ability. The formula for Dropout is as follows:
$$y_i = x_i \cdot r_i, \qquad r_i \sim \mathrm{Bernoulli}(1-p)$$

where $x_i$ represents the elements of the input tensor and $r_i$ is drawn from a Bernoulli distribution that returns 1 with probability $1-p$ and 0 with probability $p$; here, $p$ is the dropout probability. The Feature-Interaction Module effectively handles the complex relationships between the features of the Environmental Perception Feature-Extraction Module and those of the Spatial Relation Perception Feature-Extraction Module. This enables the model to better capture multiscale features of the image, enhancing its ability to process both details and the overall image. Additionally, we introduce linear layers and a softmax function to learn complex image features, and a Dropout layer to prevent overfitting, enabling the model to concentrate attention on crucial feature parts and enhancing the expressiveness of the network.
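Putting the pieces above together, the following is one reading of the Goal-Oriented Attention Mechanism, written in the style of external attention: the first linear layer produces the $S = 64$ scores, softmax and L1 normalisation with dropout form the attention map, and the second linear layer maps back to the feature dimension. The normalisation axes and the token-wise treatment of pixels are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalOrientedAttention(nn.Module):
    """External-attention-style sketch operating on the N = H*W pixel features of a map."""
    def __init__(self, dim, s=64, dropout=0.1):
        super().__init__()
        self.to_scores = nn.Linear(dim, s, bias=False)   # W1: d -> S (query-like projection)
        self.to_values = nn.Linear(s, dim, bias=False)   # W2: S -> d (value-like projection)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, N, d)
        attn = F.softmax(self.to_scores(tokens), dim=-1) # row-wise softmax over the S scores
        attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-6)  # L1 normalisation across pixels
        attn = self.dropout(attn)
        out = self.to_values(attn)                       # weighted combination back to d dimensions
        return out.transpose(1, 2).reshape(b, c, h, w)
```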
2.4. Multichannel Deep Feature-Extraction Module
The final layer of ResNet primarily captures high-level semantic information from the image, obtaining more abstract features such as object shapes and categories. In contrast, the earlier layers of ResNet capture low-level semantic information, including details like edges and textures [
8], as well as local features. While the top feature map contains rich semantic information, its drawback lies in its lower resolution, lacking detailed semantic information. Conversely, the lower-level feature maps can effectively complement the missing parts of high-level semantic information. Therefore, by using low-level semantic features to supplement high-level semantic features, we can obtain a more complete feature map with a resolution restored to that of the original image. This effectively enhances segmentation accuracy.
This module integrates features from the first three layers of ResNet by assigning different parameters to each layer, extracting and fusing shallow network features. Through the analysis of each feature, irrelevant interference is removed, and useful information is fused. In this paper, the sigmoid function is utilized to classify the usefulness of features. When the sigmoid value is closer to 1, it indicates more usefulness and such information is retained. If the sigmoid value is closer to 0, it signifies lower usefulness and other features are used to complement and fuse with this feature. The module is illustrated in
Figure 8.
The formula for this module is as follows:

$$\hat{F}_i = F_i + \big(1-\sigma(F_i)\big)\odot\Big(\sigma(F_j)\odot F_j + \sigma(F_k)\odot F_k\Big)$$

We use this module to extract and fuse features from the first three layers of ResNet. If $F_i$ represents one layer of ResNet, then $F_j$ and $F_k$ represent the remaining two layers, which are primarily used to complement the features of the current layer. We sequentially use each of the first three layers of ResNet as the input $F_i$ to assess the importance of the information extracted at that layer, while the remaining two layers act as $F_j$ and $F_k$ to complement its features. $\sigma$ is the sigmoid function, and $F_i$, $F_j$, and $F_k$ each first pass through the sigmoid function to obtain a weight. When $\sigma(F_i)$ approaches 1, $1-\sigma(F_i)$ approaches 0, indicating that the information in that layer is important and no supplementary information from the other two layers is needed. Conversely, when $\sigma(F_i)$ approaches 0, the information in that layer is relatively unimportant and supplementary information from the other two layers is required. In this way, the final features of each layer exclude unimportant information while retaining important information. Finally, the supplemented features are added to the original features to obtain a more complete feature map.
By adding the weighted feature maps to the original feature map, the model can adaptively choose the degree to which the feature maps are fused. This process preserves the relatively important information in the original feature map while discarding less important information. It uses features with stronger semantic information to complement features with weaker semantic information, making better use of the information from each layer and achieving accurate image segmentation. The module adaptively weights the contributions of the feature maps from the first three layers of ResNet, allowing the network to learn the most useful information. Additionally, it can dynamically highlight the significance of each feature map, allowing the network to concentrate attention on critical areas, reduce attention to noise, and enhance the model’s segmentation precision.
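The gating described above can be sketched as follows. The 1 × 1 projections that bring the three stages to a common width and resolution are assumptions needed to make the element-wise operations well defined; the gating itself follows the formula given earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultichannelDeepFeatureExtraction(nn.Module):
    """Gated fusion sketch: each ResNet stage is projected to a common width and size,
    an element-wise sigmoid scores its usefulness, and low-usefulness positions are
    supplemented by the gated features of the other two stages."""
    def __init__(self, in_chs=(256, 512, 1024), out_ch=256):
        super().__init__()
        self.projs = nn.ModuleList([nn.Conv2d(c, out_ch, 1, bias=False) for c in in_chs])

    def forward(self, f1, f2, f3):
        size = f1.shape[-2:]
        feats = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
                 for p, f in zip(self.projs, (f1, f2, f3))]
        gates = [torch.sigmoid(f) for f in feats]        # usefulness weights in (0, 1)
        fused = []
        for i in range(3):
            # Supplement layer i with the other two layers, weighted by their own gates.
            supplement = sum(gates[j] * feats[j] for j in range(3) if j != i)
            fused.append(feats[i] + (1.0 - gates[i]) * supplement)
        return sum(fused)
```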
3. Experiment
3.1. Dataset
In this paper, we selected datasets of water bodies and buildings. Accurately segmenting water bodies and buildings can provide technical support for applications such as urban planning, environmental engineering, and natural landscape monitoring. The characteristics of water bodies and buildings differ significantly. Buildings typically have clear boundaries and geometric shapes, with strong edge information and regular geometric features; for them, our focus is on edge information, shape regularity, and texture details. In contrast, the shape and boundaries of water bodies are variable, and they are more strongly affected by lighting conditions, exhibiting reflection, transparency, or translucency, which are major differences from buildings. As a result, when segmenting water bodies it is crucial to understand color gradients, changes in reflected light, and fluidity characteristics, and to incorporate contextual information that captures global cues, accurately distinguishing water bodies from their surrounding environment. The key information to focus on during segmentation thus differs between the two: buildings require an emphasis on detail-oriented information, whereas water bodies necessitate a focus on global information. Balancing attention to these different key features poses the greatest challenge to the network proposed in this paper.
3.1.1. Building Dataset
The dataset is derived from Google Earth maps and consists of 300 images with a resolution of
. These images were cropped to
, and after careful inspection and filtering to remove images with a single label, a total of 2000 images were obtained for subsequent experiments. The dataset covers a diverse geographical range, including suburban areas in North America, coastal residential areas in China, and rural parks in Europe, captured from various angles. We applied data augmentation to the images by flipping them horizontally (
), vertically (
), and randomly in orientation (−10 degrees to 10 degrees) after cropping. This enhances the diversity of the data, reduces overfitting, and improves the model’s capacity for generalization. The dataset is structured as a three-class dataset, with labels for buildings, water, and background. Subsequently, we split the dataset into training and validation sets with a ratio of 4:6. Partial displays of this dataset are shown in
Figure 9.
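For reference, the augmentation described above can be expressed with torchvision transforms as in the sketch below; the flip probabilities are assumptions (the text does not state them), and in practice the same geometric transform must be applied to the image and its label mask.

```python
from torchvision import transforms

# Horizontal/vertical flips with an assumed probability of 0.5 and a random
# rotation in the stated range of -10 to 10 degrees.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
])
```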
In this dataset, accurate segmentation faces several challenges. For instance, under poor lighting conditions, buildings and water bodies may have similar shapes, making them difficult to differentiate; this effectively tests the model’s representational capability. Additionally, shadows from buildings and tall trees may obscure buildings and water bodies, leading to blurred boundaries and segmentation difficulties, which tests the model’s resilience to interference. The dataset’s rich variety of scenes poses a challenge to the model’s detection capabilities.
3.1.2. Water Dataset
The water body dataset is derived from the multispectral remote sensing images of the Chinese HJ-1A (HJ-1B) environmental remote sensing satellite and the multispectral remote sensing satellite Landsat-8 from NASA. The Landsat-8 satellite images consist of 11 bands, and the water body dataset is primarily composed of images from the 4th, 3rd, and 2nd bands. The resolution of Landsat-8 satellite images is 10,000 × 10,000, and the resolution of Google images is
. We cropped them to images with a resolution of
. After cropping, we performed data augmentation, resulting in a total of 8000 images in this dataset. We split the dataset into training and validation sets with a ratio of 4:1. This dataset is a binary classification dataset, where the labels of the images fall into two categories: water and background. The dataset encompasses rich information on water, including various shapes, types, and colors of water, effectively evaluating the model’s generalization ability. Partial displays of this dataset are shown in
Figure 10.
3.2. Experimental Parameter Setting
In the experimental process, we utilized an Intel Core i5-13400F CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA RTX4060ti GPU (NVIDIA Corporation, Santa Clara, CA, USA) as the hardware environment, paired with the Windows 10 operating system and the PyTorch deep learning framework. For optimization, we employed the adaptive learning rate algorithm Adam, which adjusts the learning rate of each parameter by computing first- and second-order moment estimates of the gradients and effectively combines the benefits of the momentum method and RMSProp (Root Mean Square Propagation). The learning rate used with Adam follows a polynomial ("poly") decay schedule:

$$lr = lr_{base}\times\left(1-\frac{iter}{iter_{max}}\right)^{power}$$

where $lr$ is the updated learning rate, $lr_{base}$ is the base learning rate, $iter$ is the current iteration count, $iter_{max}$ is the total iteration count, and $power$ controls the shape of the learning rate curve. In this paper, we set the base learning rate to 0.001 because in most models the loss stabilizes after around 200 epochs in this experiment; to prevent overfitting from too many epochs, we set the total iteration count to 300 and the power to 0.9. We use BCEWithLogitsLoss as the loss function and, due to GPU memory limitations, a batch size of 16. The evaluation metrics in this experiment are pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (mIoU), with the following formulas:

$$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}, \qquad MPA = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}, \qquad mIoU = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$
The variables are represented by
Table 1.
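Assuming the usual confusion-matrix notation (entry [i, j] counts pixels of class i predicted as class j), the three metrics can be computed as in the following sketch.

```python
import numpy as np

def segmentation_metrics(conf_matrix: np.ndarray):
    """PA, MPA and mIoU from a (k+1) x (k+1) confusion matrix."""
    tp = np.diag(conf_matrix).astype(float)
    gt_total = conf_matrix.sum(axis=1).astype(float)     # ground-truth pixels per class
    pred_total = conf_matrix.sum(axis=0).astype(float)   # predicted pixels per class
    pa = tp.sum() / conf_matrix.sum()
    mpa = np.mean(tp / np.maximum(gt_total, 1))
    iou = tp / np.maximum(gt_total + pred_total - tp, 1)
    return pa, mpa, iou.mean()

# Example with a hypothetical two-class (water / background) confusion matrix.
cm = np.array([[900, 100],
               [ 50, 950]])
print(segmentation_metrics(cm))
```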
3.3. Backbone Network Comparative Experiment
In this paper, we chose ResNet as the foundational network for our model. We compared three ResNet architectures (ResNet18, ResNet50, and ResNet101) to determine the most effective backbone for our network. ResNet18 has 18 layers, ResNet50 has 50 layers, and ResNet101 has 101 layers.
Table 2 shows our experimental results.
We observe that ResNet18 has the smallest number of parameters but the lowest mIoU, differing significantly from the other two, so ResNet18 is not selected. ResNet50 and ResNet101 have similar mIoU scores, but ResNet50 has relatively higher accuracy and only about half the parameter count of ResNet101. Consequently, we choose ResNet50 as the backbone network for our model.
3.4. Ablation Experiments
We conducted ablation experiments to assess the impact of each module on overall segmentation performance. We used the modified ResNet50 as the backbone network and added each module in turn, using mIoU as the primary evaluation metric. The results are presented in Table 3. Incorporating only the Context-aware Spatial Feature-Extractor Unit increased mIoU by 1.209. Adding the Feature-Interaction Module on top of it brought a further increase of 1.72 in mIoU. Integrating the Multichannel Deep Feature-Extraction Module as well led to an additional improvement of 3.376. In total, the full model yields a 6.305 enhancement over the backbone network, with mIoU reaching a peak of 87.572. This indicates that all modules used in this paper effectively improve the model’s performance, demonstrating strong performance in semantic segmentation tasks.
To obtain a clearer understanding of the impact of different modules on semantic segmentation tasks, we selected results from several ablation experiments for comparison, as shown in
Figure 11. The backbone network shown in
Figure 11b can only capture rough contours and fails to obtain more detailed semantic information. After incorporating the Context-Aware Spatial Feature extraction shown in
Figure 11c, the model shows a noticeable improvement in attention to contextual semantic information and spatial details. However, the handling of some detailed features still performs suboptimally, and the feature map still contains some noise. Subsequently, with the addition of the Feature-Interaction Module shown in
Figure 11d, noise reduction is achieved by aggregating contextual information and spatial information while introducing attention to handle noise. After incorporating the Multi-Channel Deep Feature-Extraction Module shown in
Figure 11e, the model attends to the low-level semantic information of the backbone network, whose higher resolution provides more detailed features, and the model’s handling of image details visibly improves. Finally, when all modules are combined into the complete model, the resulting feature map shows better attention to detail than the feature map obtained by the backbone network alone. The full model simultaneously enhances attention to the surroundings of each pixel and to the relationships between pixels, contributing to a thorough understanding of pixels for classification. Moreover, it effectively handles noise, reducing misclassifications caused by noise. In comparison to the backbone network, the entire network obtains more complete feature maps, leading to a noticeable improvement in segmentation accuracy.
Later, we further compared the heatmaps of the ablation experiments, as shown in
Figure 12. In the heatmap, shades of orange-red represent pixels of primary interest, while yellow and blue indicate secondary pixels. The first row depicts the results for water, and the second row illustrates the results for buildings. From column
Figure 12a, it can be observed that although the backbone network can roughly capture the areas of buildings and water bodies, the boundaries are too blurry, and the handling of surrounding noise is poor. After incorporating the Context-Aware Spatial Feature-extraction module shown in
Figure 12b (column), the boundary information of water bodies and buildings becomes clearer. Subsequently, with the addition of the Feature-Interaction Module
Figure 12c (column), it can be seen that noise within the heatmap is notably diminished, and the model concentrates more on pixels corresponding to buildings and water bodies. After incorporating the Multichannel Deep Feature-Extraction Module, the overall network shows a noticeable improvement in handling image details. The colors of building and water-body areas become darker, and the contours become more pronounced, achieving precise segmentation of the image.
3.5. Comparison Experiments Based on the Building Dataset
To assess the efficacy of our model, we performed comparative experiments using the building dataset, comparing it with some popular existing networks. The results demonstrate that our proposed model attains higher accuracy than all existing models, indicating that our SPNet model performs well in land cover segmentation on the dataset.
We selected DeepLabV3+ [
17], Unet [
16], PSPNet [
18], SegNet [
34], DABNet [
60], DDRNet [
61], EDANet [
62], FCN [
24], and DSFNet [
63] for comparison. DDRNet is also a dual-branch network, using a deep dual-resolution branch for semantic segmentation of high-resolution images. While DDRNet is a lightweight network, its accuracy is lower, with PA, MPA, and mIoU of 89.94, 88.96, and 79.15, respectively. UNet, DABNet, SegNet, and FCN focus on spatial information in images. UNet employs an encoder–decoder structure with skip connections in between to preserve spatial information, resulting in PA, MPA, and mIoU of 90.37, 89.67, and 80.89. DABNet incorporates dense attention branches to deepen attention to spatial information, yielding PA, MPA, and mIoU of 90.09, 90.83, and 81.19. SegNet, like UNet, has an encoder–decoder structure, but it uses hierarchical upsampling and downsampling to retain and restore spatial information, with PA, MPA, and mIoU of 87.62, 89.00, and 79.88. FCN uses a fully convolutional structure for dense pixel-level predictions of spatial information, resulting in PA, MPA, and mIoU of 90.55, 89.60, and 81.07. DeepLabV3+ and PSPNet focus on contextual information in images. DeepLabV3+ introduces dilated convolutions to expand the receptive field and proposes Atrous Spatial Pyramid Pooling (ASPP) for multiscale context-information fusion, achieving PA, MPA, and mIoU of 90.38, 90.22, and 82.26. PSPNet incorporates a pyramid pooling module to extract information at different scales, resulting in PA, MPA, and mIoU of 90.63, 91.49, and 82.74. Although DeepLabV3+ and PSPNet demonstrate good segmentation accuracy, this accuracy comes at the expense of computational complexity and parameter count. EDANet and DSFNet aim to improve segmentation accuracy by introducing attention mechanisms. EDANet adds a progressive attention mechanism, with PA, MPA, and mIoU of 89.42, 89.83, and 81.09. DSFNet enhances attention between horizontal directions and positions to boost segmentation accuracy, yielding PA, MPA, and mIoU of 91.03, 91.52, and 83.73.
The proposed SPNet is a dual-branch network that simultaneously focuses on the contextual information of the image and supplements the spatial information of the features. Additionally, it incorporates processing for low-resolution feature maps, enabling the feature maps to possess both rich spatial information and detailed local features. The network achieves PA, MPA, and mIoU of 93.54, 93.41, and 87.57, respectively, making it the most accurate among all networks in the comparative experiments.
From
Table 4, we can observe that the computational complexity and parameter count of the proposed SPNet are 11.56 GMac and 39.03 M, respectively. Compared to other semantic segmentation networks, SPNet has lower computational complexity than the majority of networks. It is also noteworthy that although DeepLabV3+ and PSPNet exhibit better segmentation accuracy than most of the compared networks, this level of accuracy is achieved at the cost of heightened computational complexity and an increased number of parameters. In contrast, the proposed SPNet achieves improved segmentation accuracy while simultaneously reducing computational complexity and parameter count.
To obtain a clearer insight into the processing performance of each network, we visualized the segmentation outcomes for six images from the dataset, as depicted in
Figure 13. Each column represents the predicted output of each model, and the last column represents the original labels. From
Figure 13, it is evident that conventional semantic segmentation networks effectively segment rivers and buildings, but the results mostly provide a rough outline and shape, losing many details, and traditional networks may make misjudgments in more complex areas. In contrast, the proposed network handles details better, achieving accurate segmentation of buildings and water bodies. Traditional semantic segmentation networks tend to overlook small areas of water bodies and buildings, as highlighted by the red circles in
Figure 13. However, our network effectively avoids this issue by incorporating a Multichannel Deep Feature-Extraction Module to handle image details. In cases where buildings and water bodies are similar to the background, as indicated by the red rectangles in
Figure 13, traditional networks may make misjudgments. We address this by introducing a Spatial Relation Perception Feature-Extraction Module to enhance spatial information processing, effectively distinguishing between the background and buildings or water bodies, thereby reducing misjudgments. Furthermore, our model exhibits fewer noise artifacts in the predicted images compared to SegNet, as attention mechanisms are incorporated in both the Spatial Relation Perception Feature-Extraction Module and the Feature-Interaction Module, mitigating the impact of noise.
3.6. Comparison Experiments Based on the Water Dataset
In order to test the model’s generalization ability, we conducted a comparative experiment on a water dataset, and the results are shown in
Table 5. In the comparative experiments on this dataset, we also included STT (Sparse Token Transformer) for binary classification comparison; this model is a transformer-based semantic segmentation network. In this experiment, the proposed model achieved the highest accuracy, with PA, MPA, and mIoU values of 98.8, 98.7, and 96.8, respectively, demonstrating that SPNet exhibits strong generalization ability.
As shown in
Figure 14, the proposed model outperforms the other networks on the water dataset. In the regions highlighted by circles, some water areas with colors similar to the background resulted in misjudgments; among all the compared networks, the SPNet proposed in this paper performs best, with the fewest misjudgments. The dataset also contains many small water areas that are often ignored by networks due to their size, as indicated by the rectangular marked areas in
Figure 14. However, SPNet, introduced in this paper, incorporates a spatial information-extraction module, enhancing the understanding of features regarding their position, size, and shape. This effectively reduces instances of missing information.
5. Discussion
This study primarily utilized two distinct datasets: the land cover (building) dataset and the water dataset. These datasets are entirely different: the land cover dataset is a three-class dataset consisting of villa areas in North America, coastal residential areas in China, and rural parks in Europe, while the water dataset is a binary classification dataset primarily composed of various rivers. The evaluation metrics in this study are pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (mIoU), which are compared to determine model performance. Initially, a comparison was made among ResNet18, ResNet50, and ResNet101, with ResNet50 exhibiting the highest accuracy and the most suitable parameter count; therefore, we selected ResNet50 as the backbone. Ablation experiments were then conducted to validate the effectiveness of each module. Through these experiments, it was observed that the model’s accuracy improved with the addition of each module. Specifically, after adding the Context-aware Spatial Feature-Extractor Unit, mIoU increased by 1.209 to reach 82.476. Further enhancement was achieved with the incorporation of the Feature-Interaction Module, resulting in an increase of 1.72 to reach 84.196. Finally, integrating the Multichannel Deep Feature-Extraction Module led to a significant improvement in mIoU of 3.376, reaching 87.572. The effectiveness of each module was further confirmed through experimental visualization. In the comparative experiments on the land cover dataset, the proposed SPNet achieved an mIoU of 87.572, surpassing several mature networks while maintaining a smaller parameter count, indicating that SPNet effectively improved accuracy without significantly increasing the parameter count. Lastly, another comparative experiment was conducted on the water dataset, resulting in an mIoU of 96.8, the highest among all networks; this not only reaffirmed the superior segmentation performance of SPNet but also highlighted its good generalization capability.
This study’s findings are of significant importance and value, exerting a positive influence on the development of the semantic segmentation field. Firstly, our experimental results demonstrate that the proposed model excels in segmenting images of specific categories, exhibiting higher accuracy and robustness compared to traditional methods. This offers new methods and insights for addressing image-segmentation issues in practical applications. Secondly, our comparison of different models and parameter settings reveals significant improvements across multiple evaluation metrics, proving the effectiveness and superiority of the proposed model in semantic segmentation tasks. This holds significant implications for advancing the research field, providing strong references and insights for further optimization and improvement of semantic segmentation models. Additionally, our study explores the trends in model performance across different scenarios and analyzes the challenges and limitations the model may encounter when handling specific situations. This aids researchers in better understanding the model’s applicability and limitations, offering directions for future research improvements and extensions. Last but not least, in this project we also surveyed the current development status and background significance of buildings and water bodies, which provides valuable reference for subsequent researchers. In summary, the findings of this study hold both theoretical and practical significance, offering valuable insights and inspiration for the development of the semantic segmentation field.