Attention Swin Transformer UNet for Landslide Segmentation in Remotely Sensed Images

Liu, Bingxue; Wang, Wei; Wu, Yuming; Gao, Xing

doi:10.3390/rs16234464

¹ State Key Laboratory of Resources and Environmental Information System, Institution of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China

² University of Chinese Academy of Sciences, Beijing 100049, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2024, 16(23), 4464; https://doi.org/10.3390/rs16234464

Submission received: 28 October 2024 / Revised: 23 November 2024 / Accepted: 26 November 2024 / Published: 28 November 2024

Abstract:

The development of artificial intelligence makes it possible to rapidly segment landslides. However, landslide segmentation based on remote sensing images still faces challenges, such as low segmentation accuracy caused by similar features, inhomogeneous features, and blurred boundaries. To address these issues, we propose a novel deep learning model called AST-UNet. The model is based on the structure of SwinUNet, attaching a channel attention and spatial intersection (CASI) module as a parallel branch of the encoder and a spatial detail enhancement (SDE) module in the skip connection. Specifically, (1) the spatial intersection module expands the spatial attention range, alleviating noise in the image and enhancing the continuity of landslides in the segmentation results; (2) the channel attention module refines the spatial attention weights by feature modeling in the channel dimension, improving the model’s ability to differentiate targets that closely resemble landslides; and (3) the spatial detail enhancement module increases the accuracy of landslide boundaries by strengthening the attention of the decoder to detailed features. We conduct experiments using landslide data from the Luding area, Sichuan. Comparative analyses with state-of-the-art (SOTA) models, including FCN, UNet, DeepLab V3+, TransFuse, TransUNet, and SwinUNet, demonstrate the superiority of AST-UNet for landslide segmentation. The generalization of our model is also verified in the experiments. The proposed AST-UNet obtains an F1-score of 90.14%, an mIoU of 83.45%, a foreground IoU of 70.81%, and a Hausdorff distance of 3.73 on the experimental datasets.

Keywords:

remote sensing (RS); landslide segmentation; attention mechanism; spatial detailed information

1. Introduction

Mass movement, such as landslides, is one of the main natural disasters in mountainous areas [1]. Sudden landslides usually have a serious impact on their surroundings, including environmental degradation, damage to public facilities, and even threats to human property and life [2]. Landslide identification is therefore important, and disaster rescue requires landslide information to be acquired in a timely manner. The rapid development of remote sensing technology has provided abundant real-time data for landslide recognition, such as high-resolution optical remote sensing images [3]. Consequently, automatic landslide segmentation methods based on artificial intelligence are urgently needed.

In recent years, with the continuous development of artificial intelligence, deep learning models based on convolution have become the mainstream in the field of image segmentation. For landslide mapping, field surveying is time-consuming, costly, and inefficient [4], and it is difficult to carry out in settlement areas or areas with high vegetation coverage [5]. Deep learning models can greatly reduce the labor required and offer potential for automatic landslide monitoring based on remote sensing images [6,7]. Research on landslide segmentation using CNN-based models is increasing. FCN was the first fully convolutional network used for image segmentation [8], and traditional segmentation models such as UNet [9] and DeepLabV3+ [10] were developed from it. Studies on landslide segmentation using these traditional CNN-based models have achieved good accuracy [11,12,13,14]. In these studies, CNN-based models such as UNet extract landslide features automatically, achieving higher landslide segmentation accuracy than traditional machine learning methods that rely on manual feature modeling [15]. To reduce the uncertainty of single-model predictions, a deep learning integration model was proposed to increase landslide prediction accuracy [16]. Additionally, it has been shown that terrain information modeled by convolution operations can also improve landslide segmentation results [17]. However, these traditional CNN-based models still have problems that need to be addressed, including the neglect of contextual features, the loss of detailed information, and insufficient special attention.

There have been many optimization methods to improve the segmentation performance of traditional CNN-based models. Feature pyramids [18] and dilated convolutions [19] expand the receptive field of convolution to model multi-scale information. Some modules based on spatial pyramid pooling have been proposed to extract multi-scale features of topography [20,21,22], and some models such as DMMSTNet and SSMSN have achieved high accuracy in segmentation of remote sensing images [23,24,25]. The studies using ASPP modules, including ICSSN [26] and a UNet-based model [27], have shown that multi-scale features can effectively improve the accuracy of landslide segmentation. On the one hand, multi-scale modeling can expand spatial attention to construct global features. On the other hand, it can restore the detailed features which are lost in the hierarchical structure. In addition, the effective skip connection structure in LSC [28], skip connection auto-encoders [29], and the added skip-connection module [30] can also compensate for the lost details. Special attention relies on the attention mechanisms. Spatial attention mechanisms in subtask attention [31] and combining residual units [32] can concentrate attention of the model on small features and boundary details. The channel attention mechanism solves the problem of repeated modeling of features [33] and can expand the feature modeling space. Squeeze-and-Excitation networks can recalibrate features by emphasizing useful information and suppressing irrelevant information in the channel dimension [34]. The models combining attention mechanisms, including attention-boosting mechanisms [4] and SENet-optimized Deeplabv3+ [35], have shown better recognition performance on the boundary details of landslides, and the overall accuracy is higher.

However, the CNN-based models mentioned above still have limitations in capturing long-range dependencies [36]. Remote sensing images usually cover a vast expanse [37] and contain intricate features [38]. The interdependence between distant pixels, referred to as “long-range dependencies”, plays a crucial role in organizing this complex information. To address this challenge, the transformer structure [39] has emerged as an effective solution, which models long-range dependencies by explicitly computing the attention weights of a pixel with respect to its surrounding pixels. The pioneering network that introduced the transformer structure to computer vision is ViT [40]. ViT partitions the image into patches and encodes each patch into a vector, allowing the model to establish relationships between vectors in a way similar to constructing a sentence. Experimental results of combining the transformer structure with CNN-based models, such as ShapeFormer [41], the model for forested landslide detection [42,43], and HPCL-Net [44], have shown promising potential for detecting landslides in remote sensing images. By integrating convolution and transformers, as in MFFSP [45] and ResU-net with a transformer [46], the strengths of both can be fully harnessed. A model combined with DETR has also achieved higher accuracy in landslide extraction than pure CNN-based networks [47]. Subsequently, the swin transformer was developed to model the interactions among patches. Studies using the swin transformer as the backbone of Mask R-CNN [48], the MAST net [49], and the ST-UNet [50] have demonstrated its effectiveness in optimizing feature extraction for landslide segmentation. This was followed by the introduction of SwinUNet [51], a U-shaped symmetrical transformer network, which showcased significant advantages in image segmentation. Experiments in remote sensing image segmentation using SwinUNet have exhibited superior accuracy compared to CNN-based and earlier transformer-based methods [52,53]. Chen et al. [54] further improved SwinUNet by incorporating a ConvNeXt block, enhancing its ability to capture landslide features. SwinUNet directly captures the global contextual information of an image through a pure transformer structure. Despite this, there remains a paucity of studies focusing on landslide segmentation based on SwinUNet. Addressing the similar features, inhomogeneities, and blurred boundaries present in remote sensing images during landslide segmentation is an ongoing challenge that requires further exploration.

In this paper, we introduce a novel deep learning model called Attention Swin Transformer U-shaped net (AST-UNet) for landslide segmentation in remote sensing images, with SwinUNet serving as the backbone. Additionally, two supplementary modules are incorporated: (1) a channel attention and spatial intersection (CASI) module on the encoding side, and (2) a spatial detail enhancement (SDE) module on the skip-connection to enrich transmitted information. Our goal is to enhance the accuracy of landslide segmentation by separately optimizing the feature modeling and detail restoration performance of the U-shaped network. The key contributions of this paper are as follows:

(1): The CASI module introduces spatial intersection attention to reduce noise within landslides and address the issue of “holes”. It also helps to compensate for the loss of global contextual information resulting from the hierarchical structure of SwinUNet, ultimately optimizing feature extraction in the encoder.
(2): The channel attention in the CASI module enhances the differentiation of visually similar targets in images and fine-tunes the outputs of the spatial attention module. The fusion of the convolution branch and transformer backbone maximizes the strengths of both, thus improving the feature modeling performance for landslide segmentation in the encoder.
(3): The SDE module establishes a connection between the shallow feature maps of the network and the decoding side to enhance detailed information in the skip connection. This optimization enhances the decoder’s ability to recover detailed features, consequently improving accuracy in identifying landslide boundaries.

2. Materials and Methods

2.1. Data

Our experiments are based on data sourced from GF-6 remote sensing images of the Luding region, Sichuan. This region is located in the transition zone from the Qinghai-Tibet Plateau to the Sichuan Basin and has a typical subtropical monsoon climate. The landform types range from low to medium mountain canyon areas to high mountains and extremely high mountainous areas, which makes the region representative for landslide research. The GF-6 image has a spatial resolution of 8 m and comprises four bands: red, green, blue, and near-infrared. The selected area covers an actual ground range of approximately 13 km × 25 km. In this area, landslides are mainly distributed in mountainous regions near rivers, while human settlements are primarily found in flatter areas. The sizes and shapes of the landslides in the area are significantly influenced by the topography. To train the model, labels are created through manual visual interpretation, where landslide vectors are rasterized and binarized into 0/1 labels (Figure 1). In Figure 1, the adjacency between landslides and surrounding vegetation and rivers, as well as the diversity and complexity of landslide characteristics, are evident. The entire research range is divided into three areas, used to create the training, validation, and test datasets, respectively (Figure 2).

In producing the training and validation datasets, the original remote sensing image is cropped to a size of 80 × 80, with a cropping overlap rate of 0.75. To ensure the effectiveness of the training samples, images with a minimum count of 400 landslide pixels are selected for the training datasets. Given the significant differences in the range of pixel values among the channels in remote sensing images, which can influence feature construction, we normalize each channel of the image before input. The normalization in our experiments is specifically calculated as (1):

x_{\mathrm{norm},\, n, i, j}^{c} = \frac{x_{n, i, j}^{c} - \bar{x}^{c}}{s^{c}}

(1)

where $c$ is the index of the image channel, $x_{n, i, j}^{c}$ represents the pixel value in row $i$ and column $j$ of channel $c$ of image $n$, $\bar{x}^{c}$ represents the average pixel value of channel $c$ over all images in the datasets, $s^{c}$ is the corresponding standard deviation, and $x_{\mathrm{norm},\, n, i, j}^{c}$ is the normalized result for each pixel.
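As an illustration of Equation (1), the following minimal sketch computes per-channel statistics over a set of images and applies the normalization. It uses nested Python lists rather than tensors, the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import fmean, pstdev


def channel_stats(images):
    """Return one (mean, std) pair per channel.

    images: list of images; each image is a list of channels;
    each channel is a 2-D list of pixel values.
    """
    n_channels = len(images[0])
    stats = []
    for c in range(n_channels):
        # Pool every pixel of channel c across all images in the dataset.
        values = [v for img in images for row in img[c] for v in row]
        # Assumption: population standard deviation (not stated in the text).
        stats.append((fmean(values), pstdev(values)))
    return stats


def normalize(image, stats):
    """Apply Equation (1) to every pixel of every channel of one image."""
    return [
        [[(v - mean) / std for v in row] for row in channel]
        for channel, (mean, std) in zip(image, stats)
    ]


# Two tiny single-channel "images" of size 1 x 2.
images = [[[[0.0, 2.0]]], [[[4.0, 6.0]]]]
stats = channel_stats(images)
normalized = [normalize(img, stats) for img in images]
```

After normalization, each channel has zero mean and unit standard deviation over the dataset, which removes the large differences in value range between bands.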

The sizes of the training and validation datasets in our experiments are 1237 and 264 samples, respectively. Before training, the training and validation samples are randomly cropped to a size of 64 × 64. Training data augmentation includes random flipping, random rotation, and random scaling. The test datasets are directly cropped to a size of 64 × 64 and normalized before input. In our experiments, the size of the test dataset is 171 samples. The points in Figure 2 are randomly selected samples used to display landslide segmentation results in the experiments.
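The overlapping cropping used to build the training and validation samples can be sketched as follows. This is only an illustration: it assumes that an overlap rate of 0.75 means consecutive 80 × 80 tiles share 75% of their extent along each axis (a stride of 20 pixels), and the function name `tile_offsets` is hypothetical.

```python
def tile_offsets(length, tile=80, overlap=0.75):
    """Start offsets of overlapping tiles along one image axis.

    Assumption: the stride is tile * (1 - overlap), i.e. 20 pixels
    for 80-pixel tiles with an overlap rate of 0.75.
    """
    stride = int(round(tile * (1 - overlap)))
    return list(range(0, length - tile + 1, stride))


def tile_boxes(height, width, tile=80, overlap=0.75):
    """(top, left, bottom, right) boxes of every tile in a height x width image."""
    return [
        (top, left, top + tile, left + tile)
        for top in tile_offsets(height, tile, overlap)
        for left in tile_offsets(width, tile, overlap)
    ]


boxes = tile_boxes(120, 100)
```

Tiles that would extend past the image border are simply not generated in this sketch; how the original pipeline treats the border is not described in the text.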

2.2. Method

In this section, we first provide an overview of the proposed AST-UNet model, and then introduce the structure of the swin transformer block serving as the backbone, along with the supplementary CASI and SDE modules.

2.2.1. Overview of the Network

Our proposed AST-UNet model consists of the SwinUNet backbone, the CASI module, and the SDE module. Figure 3 illustrates the overall architecture of the network, which follows an encoder–decoder structure. The encoder is responsible for encoding both foreground and background information in the image, while the decoder focuses on recovering image details and size from the encoded feature maps. The number of channels remains consistent on both the encoding and decoding sides of the network.

Within the backbone, swin transformer blocks establish the interdependent relationships between pixels to build differential features of the foreground and background in the encoder. The decoder hierarchically up-samples the feature maps to generate binary classification results. The CASI module is incorporated into the encoding side and comprises the CA module and the SI module. The SDE module functions as a connection between the encoder and decoder. In this way, we optimize landslide feature construction in the encoder and improve the decoder’s capability to restore the details of landslide boundaries.

Specifically, the process begins with remote sensing images being partitioned into non-overlapping patches of size 4 × 4, which are then encoded into embedding vectors through convolution. These vectors are subsequently input as tokens into the following swin transformer blocks. The encoder is organized into three stages, each containing two swin transformer blocks. At each stage, the feature map’s size is halved while the channel dimension is doubled, achieved through the rearrangement of pixels in the patch merging module. The CASI branch attached to the encoder comprises three CASI modules connected hierarchically. The shape of the feature map output from each CASI module matches the output of the corresponding backbone stage. The attention weight map from the CASI module is combined with the output of the backbone stage through multiplication. The encoder is connected to the decoder through a bottleneck consisting of two swin transformer blocks. The decoder is also structured into three stages, which double the feature map size and halve the channel dimension through the patch expansion module. The SDE module establishes the connection between each stage of the encoder and the corresponding layers of the decoder. As the feature map is restored to its original size, the channel dimension aligns with that of the tokens in the encoder. Finally, the output produces 2-channel landslide segmentation maps through linear mapping.
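The halving and doubling described above can be traced with a short shape calculation. The sketch below is only a bookkeeping aid under stated assumptions: a 64 × 64 input as used in the experiments, a hypothetical embedding dimension of 96 (the text does not give this value), and halving applied once per encoder stage.

```python
def trace_shapes(image_size=64, patch_size=4, embed_dim=96, stages=3):
    """Return (name, height, width, channels) for each step of the backbone.

    embed_dim is a hypothetical value chosen for illustration only.
    """
    side = image_size // patch_size          # 4 x 4 patches -> 16 x 16 tokens
    channels = embed_dim
    shapes = [("patch_embed", side, side, channels)]
    for s in range(1, stages + 1):
        # Patch merging: spatial size halved, channel dimension doubled.
        side, channels = side // 2, channels * 2
        shapes.append((f"encoder_stage_{s}", side, side, channels))
    shapes.append(("bottleneck", side, side, channels))
    for s in range(1, stages + 1):
        # Patch expansion: spatial size doubled, channel dimension halved.
        side, channels = side * 2, channels // 2
        shapes.append((f"decoder_stage_{s}", side, side, channels))
    return shapes


shapes = trace_shapes()
for name, h, w, c in shapes:
    print(f"{name:16s} {h:2d} x {w:2d} x {c}")
```

Under these assumptions the last decoder stage returns to the token grid and channel width produced by the patch embedding, consistent with the statement that the channel dimension aligns with the tokens in the encoder.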

2.2.2. Swin Transformer Block

The formation of landslides typically occurs when the local slope stability is disrupted, causing rock and soil mass to move along fracture surfaces due to gravity. In mountainous regions, rivers can significantly impact the stability of nearby slopes. The infiltration effect occurs when water from the river flows into the pores or cracks of rocks and soil, increasing pressure in the cracks and leading to slope instability. This elevated instability, combined with external disturbances, increases the likelihood of landslides occurring, particularly near rivers as observed in remote sensing images (Figure 4a). Landslides can also alter the original landscape of slopes, causing notable changes in color, texture, and other features of the affected area, disrupting the continuity of vegetation cover (Figure 4b,c). Therefore, identifying abnormal areas as landslides in remote sensing images becomes more achievable. By leveraging the interplay between landslide characteristics and their surroundings, the accuracy of landslide identification can be enhanced. Swin transformer blocks enable the modeling of global relationships among pixels, facilitating the creation of high-level semantic information on landslides and surrounding objects in the image.

The swin transformer module in our experiments treats the feature maps as word sequences, segmenting them into patches of size 2 × 2 in the W-MSA module, where a multi-head attention mechanism is employed to enhance modeling capabilities.

To construct attention within patches, the MSA module calculates the Query (Q), Key (K), and Value (V) in each head. The calculation of attention in the MSA module is as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right)

(2)

where $d_{k}$ represents the dimension of K. Then, the updated patches are obtained by the following equation:

Z = \left(\mathrm{Attention}(Q, K, V) + \mathrm{RPB}\right) \times V

(3)

where RPB is the relative position bias. Finally, the results of each head are concatenated in the channel dimension to restore the dimension of pixels. Following the MSA module is the MLP module, which is composed of two fully connected layers and an activation function layer, aimed at improving the complexity of feature mining.
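Equations (2) and (3) can be followed step by step for a single head and a single window with the sketch below. It uses plain nested lists, the helper names are hypothetical, and it follows the equations exactly as written here, with the relative position bias added to the softmax output; in the original swin transformer formulation the bias is instead added to the scaled scores inside the softmax.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def window_attention(q, k, v, rpb):
    """Single-head attention for the tokens of one window.

    q, k, v: one row per token; rpb: tokens x tokens relative position bias.
    """
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Equation (2): row-wise softmax of the scaled dot products.
    attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Equation (3): add the bias as written in the text, then weight the values.
    biased = [[a + b for a, b in zip(a_row, b_row)] for a_row, b_row in zip(attn, rpb)]
    return matmul(biased, v)


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0], [1.0]]
zero_bias = [[0.0, 0.0], [0.0, 0.0]]
z = window_attention(q, k, v, zero_bias)
```

With a zero bias each output row is a convex combination of the value rows, because every row of the softmax output sums to one.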

The shifted-window multi-head self-attention (SW-MSA) module is the second stage of the swin transformer block. The SW-MSA module shifts all the windows by half the window size toward the upper left, building connections between pixels that are not in the same window in the W-MSA module, thereby achieving interactive modeling across windows.
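The effect of the window shift can be illustrated on a small token grid. The sketch below assumes the shift is realized as a cyclic roll of the feature map by half the window size toward the upper left, which is how shifted windows are commonly implemented; the attention masking that normally accompanies the wrap-around is omitted, and the function names are hypothetical.

```python
def cyclic_shift(grid, shift):
    """Roll a 2-D grid `shift` positions up and to the left, wrapping around."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]


def window_id(token, size, window):
    """Index of the regular (unshifted) window that a token belongs to."""
    r, c = divmod(token, size)
    return (r // window) * (size // window) + (c // window)


size, window = 4, 2
grid = [[r * size + c for c in range(size)] for r in range(size)]  # tokens 0..15
shifted = cyclic_shift(grid, window // 2)

# Tokens that now share the top-left 2 x 2 window after the shift.
top_left = [shifted[r][c] for r in range(window) for c in range(window)]
origins = {window_id(t, size, window) for t in top_left}
```

After the shift, the top-left window holds tokens 5, 6, 9, and 10, which belonged to four different windows before the shift, so attention inside this window links pixels that the regular partition kept apart.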

In the swin transformer block, a layer normalization module is connected before each MSA and MLP module. Additionally, the residual structure is also a component of the block.

The swin transformer block creates long-range contextual relationships within the feature maps, capturing the interdependence between a pixel and its surroundings. This capability is particularly essential in landslide segmentation.

2.2.3. Channel Attention and Spatial Intersection (CASI) Module

The properties of landslides exhibit spatial non-uniformity under the influence of external disturbances and topography. Factors such as the heterogeneity of landslide accumulation thickness and the variability of landslide water content contribute to this phenomenon. Such heterogeneity often leads to noise within landslides in remote sensing images (Figure 5c). When the extent of this noise is significant, it can create a “hole” within the landslides (Figure 5a,b). To address this issue, we incorporate a spatial intersection (SI) module [50] into the encoder. The SI module performs average pooling of the feature map in an expanded receptive field, which helps to regularize the internal features of the landslide and mitigate the influence of noise. This average pooling enhances the contrast between foreground and background pixel values, thereby improving the network’s focus on landslide areas. Furthermore, the expanded attention provided by the SI module can mitigate the loss of global information resulting from hierarchical modeling and window partitioning in the swin transformer block. Moreover, in remote sensing images, there are usually some landscapes with features intuitively similar to landslides, such as residential areas and clouds (Figure 5d–f), which can potentially increase the error rate in landslide recognition. We introduce a channel attention (CA) module alongside the SI module to address this problem. The CA module generates new features by considering various combinations of bands, a particularly advantageous approach in remote sensing images with multiple channels. Additionally, the channel attention weights can further refine the spatial attention weights derived from the SI module. The parallel integration of the spatial attention module and the channel attention module forms the comprehensive CASI module, as illustrated in Figure 6.

The additional branch is composed of three cascading CASI modules, corresponding to the three stages of the backbone. The CASI branch shares the same input as the first encoding stage of the backbone, represented as $X \in \mathbb{R}^{L \times C}$. Before entering the first CASI module, it is reshaped into a two-dimensional feature map $X_{\mathrm{CASI-r}} \in \mathbb{R}^{H \times W \times C}$. The attention modeling process begins with two convolution blocks. In the first block, the kernel size is set to 3 with a padding size of 1, and the number of output channels is doubled. The second block utilizes a kernel size of 1, with the number of output channels reduced to half of the input. Following the convolution layers, a batch-normalization layer and a ReLU activation function layer are sequentially applied. The output from the first convolution block undergoes a 2 × 2 average pooling before serving as the input to the subsequent CASI module. The output of the second convolution block is the input for both the CA module and the SI module. In the SI module, row pooling is conducted on the feature map $X_{\mathrm{CASI-conv2}} \in \mathbb{R}^{H \times W \times C}$ as follows:

S_{w} = \mathrm{AvgPool}_{\mathrm{row}}(X_{\mathrm{CASI-conv2}})

(4)

The shape of $S_{w}$ is $H \times 1 \times C$. Similarly to row pooling, column pooling calculates the average value of each column in every channel. The size changes from $H \times W \times C$ to $1 \times W \times C$, as:

S_{h} = \mathrm{AvgPool}_{\mathrm{column}}(X_{\mathrm{CASI-conv2}})

(5)

The output of the SI module is obtained through matrix multiplication of the two pooling results, resulting in a matrix of shape $H \times W \times C$. The spatial weight feature map $W_{s}$ is obtained through normalization by the sigmoid function:

W_{s} = \mathrm{Sigmoid}(S_{w} \times S_{h})

(6)

The CA module also takes the feature map $X_{\mathrm{CASI-conv2}}$ as input. The channel attention weights are calculated as in (7): global average pooling is first performed on each channel of the feature map, and the results are then normalized to obtain the channel attention weights $W_{c}$.

W_{c} = \mathrm{Sigmoid}(\mathrm{AvgPool}_{\mathrm{global}}(X_{\mathrm{CASI-conv2}}))

(7)

$W_{c}$ is used to update $W_{s}$ as in (8) and (9). The shape of the final weighted feature map $W_{\mathrm{CASI}}$ is $H \times W \times C$, which aligns with the feature map $X_{\mathrm{trans}}$ from the swin transformer blocks of the backbone. The multiplication of $X_{\mathrm{trans}}$ and $W_{\mathrm{CASI}}$ yields the output of a whole encoder stage.

W_{\mathrm{CASI}} = W_{c} \odot W_{s}

(8)

X_{t_C} = X_{\mathrm{trans}} \odot W_{\mathrm{CASI}}

(9)

where $\odot$ represents element-wise multiplication. The CASI module opens up new spaces for modeling landslide features by expanding the attention field in both spatial and channel dimensions, which is extremely beneficial for identifying landslides with complex characteristics in remote sensing images. In addition, the independence between the CASI branch and the backbone also reduces the redundancy of the network to some extent.
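The weight computation in Equations (4) to (9) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it operates on nested lists indexed as channel, row, column; it omits the two convolution blocks, batch normalization, and ReLU that precede the pooling; and the function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def casi_weights(x):
    """Compute W_CASI for a feature map x indexed as x[channel][row][column]."""
    weights = []
    for channel in x:
        h, w = len(channel), len(channel[0])
        s_w = [sum(row) / w for row in channel]            # Eq. (4): H x 1 row averages
        s_h = [sum(col) / h for col in zip(*channel)]      # Eq. (5): 1 x W column averages
        # Eq. (6): the (H x 1) by (1 x W) product gives an H x W map, then sigmoid.
        w_s = [[sigmoid(a * b) for b in s_h] for a in s_w]
        # Eq. (7): global average of the channel (mean of the row means), then sigmoid.
        w_c = sigmoid(sum(s_w) / h)
        # Eq. (8): scale the spatial weights by the channel weight.
        weights.append([[w_c * v for v in row] for row in w_s])
    return weights


def apply_casi(x_trans, weights):
    """Eq. (9): element-wise product of the backbone output and W_CASI."""
    return [
        [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
        for c1, c2 in zip(x_trans, weights)
    ]


x_conv2 = [[[0.0, 0.0], [0.0, 0.0]]]            # one channel, 2 x 2
w_casi = casi_weights(x_conv2)
stage_out = apply_casi([[[4.0, 4.0], [4.0, 4.0]]], w_casi)
```

Because every pixel of the spatial weight map depends on the average of its whole row and whole column, an isolated noisy pixel inside a landslide has little influence on the weight it receives.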

2.2.4. Spatial Detail Enhancement (SDE) Module

The accuracy of landslide segmentation in remote sensing images is inherently tied to the accuracy of recognized landslide boundaries. In the backbone, as the feature map size diminishes, detailed features such as the position and shape of the landslide boundary are gradually lost on the encoding side. For an encoder consisting of transformer blocks, this issue becomes more pronounced, as shown in Figure 7, in which the images in the middle column represent the outputs of our SDE module in the second stage of the network, while the images in the right column depict the outputs of the second encoding swin transformer blocks. The transformer structure, used to establish contextual features between landslides and their surroundings, inadvertently diminishes its focus on retaining image details. Even though a skip connection structure provides reference information for the decoding process, the information transmitted is often insufficient to recover the details of the landslide boundary. Average pooling acts as a straightforward blur of an image, applying equal-weight processing to the image features. Therefore, average pooling does not inherently erode the detailed boundary features of landslides. This approach amplifies the weight of the detailed boundary features in the original skip connection, ultimately rendering the transmitted detailed information more clearly, and effectively compensating for the loss of details.

Based on this, we introduce our SDE module, which modifies the original skip-connection. The feature map input into the first swin transformer block in the encoder is located in a shallower layer. We establish a connection between this feature map and all stages of the decoder to provide detailed information for reference during up-sampling. The SDE module is implemented through down-sampling and consists of a series connection of a 1×1 convolution layer, an activation function layer, and a spatial pooling layer. Figure 8 illustrates the shape change in the feature map.

X is the input of the SDE module. A 1 × 1 convolution layer doubles the number of channels of X, followed by a nonlinear activation function. The feature map size is halved by spatial average pooling with a window size of 2 × 2. The down-sampled feature map is concatenated with the feature map from the encoder stage in the channel dimension as $C_{n}$, which is then appended to the feature map $X_{\mathrm{decode}}^{n}$ of the same stage in the decoder. Before entering the swin transformer blocks in the decoder, the channel dimension of the feature map is restored to its original size using a linear mapping:

X_{\mathrm{decode}'}^{n} = \mathrm{Linear}(\mathrm{Concatenate}(C_{n}, X_{\mathrm{decode}}^{n}))

(10)

The SDE module includes the details from the shallower layers of the network in the reference information and enhances the attention to detailed features of the landslide boundary, which improves the richness and effectiveness of the transmitted information in the skip connection structure. It is aimed at increasing the accuracy of landslide segmentation by improving the restoration performance of landslide boundaries.
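One down-sampling step of the SDE module (1 × 1 convolution, activation, 2 × 2 average pooling) can be sketched as below. The sketch makes several assumptions: ReLU is used as the activation because the text does not name the function, the convolution has no bias term, the weights are arbitrary illustrative values, and the concatenation and linear mapping of Equation (10) are left out.

```python
def conv1x1(x, weight):
    """1 x 1 convolution: every output channel is a per-pixel linear mix of inputs.

    x: x[channel][row][column]; weight: one row of input coefficients per output channel.
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(wk * x[k][i][j] for k, wk in enumerate(w_row)) for j in range(width)]
         for i in range(height)]
        for w_row in weight
    ]


def relu(x):
    # Assumption: ReLU stands in for the unspecified activation function.
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]


def avg_pool2x2(x):
    """Non-overlapping 2 x 2 average pooling, halving height and width."""
    return [
        [[(ch[i][j] + ch[i][j + 1] + ch[i + 1][j] + ch[i + 1][j + 1]) / 4.0
          for j in range(0, len(ch[0]) - 1, 2)]
         for i in range(0, len(ch) - 1, 2)]
        for ch in x
    ]


def sde_step(x, weight):
    """Channels doubled by the 1 x 1 convolution, spatial size halved by pooling."""
    return avg_pool2x2(relu(conv1x1(x, weight)))


x = [[[1.0, 2.0], [3.0, 4.0]]]           # one channel, 2 x 2
weight = [[1.0], [-1.0]]                 # illustrative weights: 1 channel -> 2 channels
out = sde_step(x, weight)
```

Applying the step repeatedly to the shallow feature map would yield one down-sampled copy per decoder stage, each with the spatial size required for concatenation at that stage.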

2.2.5. Experimental Settings

Our training experiments are conducted using a single NVIDIA Tesla V100 GPU (16 GB graphics memory, NVIDIA, Santa Clara, CA, USA), and the deep learning models are constructed using PyTorch (Python 3.9). For training, the optimizer is Adam with a learning rate of 0.001. The shuffle parameter of the training datasets is set to “True”. All models are trained for 200 epochs, a number chosen based on the convergence of all the models in the experiments. Validation is performed after every training epoch. The weight file corresponding to the highest validation accuracy is retained for subsequent testing. The cross-entropy loss is selected as the loss function in model training, as in (11):

L_{CE} = - \sum_{i=1}^{h} \sum_{j=1}^{w} \sum_{c=1}^{C} y_{i, j}^{c} \times \log \frac{e^{\hat{y}_{i, j}^{c}}}{\sum_{c'=1}^{C} e^{\hat{y}_{i, j}^{c'}}}

(11)

where $y_{i, j}^{c}$ represents the value of class $c$ in the one-hot label of the pixel located at position $(i, j)$ in the image, and $\hat{y}_{i, j}^{c}$ represents the predicted score of that pixel for class $c$, which the softmax inside the logarithm converts into a probability. The final loss is the average of all the pixel losses.
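The loss in Equation (11), averaged over pixels as stated, can be written out for a small map as follows. This is a minimal sketch with hypothetical function names; it treats the network outputs as raw class scores and uses the log-sum-exp form for numerical stability, which is mathematically equivalent to the expression in the equation.

```python
import math


def pixel_cross_entropy(scores, one_hot):
    """Cross-entropy of one pixel from raw class scores and a one-hot label."""
    m = max(scores)
    # log of the softmax denominator, computed stably (log-sum-exp).
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return -sum(y * (s - log_sum) for s, y in zip(scores, one_hot))


def mean_cross_entropy(score_map, label_map):
    """Average the per-pixel losses over an h x w map."""
    losses = [
        pixel_cross_entropy(scores, label)
        for score_row, label_row in zip(score_map, label_map)
        for scores, label in zip(score_row, label_row)
    ]
    return sum(losses) / len(losses)


# 1 x 2 map with two classes (background, landslide) and uninformative scores.
score_map = [[[0.0, 0.0], [0.0, 0.0]]]
label_map = [[[1, 0], [0, 1]]]
loss = mean_cross_entropy(score_map, label_map)
```

When the two class scores are equal, the predicted probability of each class is 0.5 and the loss equals ln 2, the value expected from an uninformed binary classifier.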

In the experiments, we select several state-of-the-art (SOTA) models for comparison in landslide segmentation. The models include FCN, UNet, DeepLabV3+, SwinUNet, TransUNet, and TransFuse. FCN, UNet, and DeepLabV3+ are all based on pure convolution structures. SwinUNet is a U-shaped network based on a pure transformer structure. TransUNet and TransFuse integrate convolution and transformer structures. Our proposed model also combines the swin transformer with convolution structures. We analyze the advantages of our model by comparing the landslide segmentation results. For all models, the datasets and training parameter settings are kept the same, following the principle of controlling variables. We select intersection over union (IoU), F1-score, and Hausdorff distance (HD) as the evaluation indices for landslide segmentation. They are calculated as follows:

\mathrm{IoU} = \frac{TP}{TP + FP + FN}

(12)

\mathrm{F1\text{-}score} = \frac{2 TP}{2 TP + FP + FN}

(13)

HD(X, Y) = \max \{ d_{XY}, d_{YX} \}

(14)

in which TP, FP, and FN represent the number of true positive, false positive, and false negative pixels in the prediction, respectively. X and Y represent the real and predicted point sets of boundaries. $d_{XY}$ and $d_{YX}$ represent the unidirectional distances between X and Y.
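Equations (12) to (14) translate directly into code. In the sketch below the unidirectional distance is assumed to be the usual directed Hausdorff distance, i.e. the largest Euclidean distance from a point of one set to its nearest point in the other set; the text does not spell this definition out, and the function names are hypothetical.

```python
import math


def iou(tp, fp, fn):
    """Equation (12)."""
    return tp / (tp + fp + fn)


def f1_score(tp, fp, fn):
    """Equation (13)."""
    return 2 * tp / (2 * tp + fp + fn)


def directed_distance(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`.

    Assumption: this is the unidirectional distance d used in Equation (14).
    """
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(x, y):
    """Equation (14): the larger of the two unidirectional distances."""
    return max(directed_distance(x, y), directed_distance(y, x))


label_boundary = [(0, 0), (0, 1)]
predicted_boundary = [(0, 0), (0, 3)]
hd = hausdorff(label_boundary, predicted_boundary)
```

In this example the distance from the label to the prediction is 1 but the distance from the prediction to the label is 2, so the symmetric Hausdorff distance is 2; a single stray boundary point is enough to raise the value.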

To express the average performance per parameter and the time efficiency of the models more intuitively, we calculate the average parameter contribution (APC), average parameter loss (APL), and unit time benefit (UTB) of each model according to the following definitions:

\mathrm{APC} = \frac{\mathrm{mIoU}}{\mathrm{Params}}

(15)

\mathrm{APL} = \frac{\mathrm{HD95}}{\mathrm{Params}}

(16)

\mathrm{UTB} = \mathrm{mIoU} \times \mathrm{FPS} \times 10^{-2}

(17)

where Params and FPS represent the number of parameters and frames per second of the model. APC is the average allocation of landslide segmentation accuracy to the parameters, measuring the average performance of the parameters. APL is the average of landslide boundary recognition errors on parameters. UTB combines FPS with landslide segmentation accuracy. High FPS and mIoU indicate that the model has superior performance.
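The three efficiency indices in Equations (15) to (17) are simple ratios and products, sketched below. The unit in which Params is expressed is not stated in the text, so the argument values in the example are hypothetical placeholders chosen only to show the arithmetic; they are not measurements of any model.

```python
def apc(miou, params):
    """Equation (15): average parameter contribution."""
    return miou / params


def apl(hd95, params):
    """Equation (16): average parameter loss."""
    return hd95 / params


def utb(miou, fps):
    """Equation (17): unit time benefit."""
    return miou * fps * 1e-2


# Hypothetical placeholder inputs, for illustration only.
example = {
    "APC": apc(miou=80.0, params=40.0),
    "APL": apl(hd95=5.0, params=40.0),
    "UTB": utb(miou=80.0, fps=50.0),
}
```

A higher APC and UTB and a lower APL indicate that a model obtains more accuracy per parameter and per unit of inference time.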

We compare the landslide segmentation accuracy of different models based on the above indices and analyze their advantages and disadvantages. In addition, we also conduct qualitative analysis on the landslide segmentation results of all the models.

3. Results

3.1. Quantitative Results

Table 1 shows the segmentation accuracy of all the models on the landslide datasets. FIoU and BIoU are the foreground IoU and background IoU, respectively, and the mIoU is the average of the two. The data in the table show that our model achieves the best value on every evaluation index of landslide segmentation. The F1-score, which combines precision and recall, reaches 90.14% for our model, the only one to achieve over 90%. Our model’s HD95 is 3.73, the smallest among all the models, indicating a high similarity between the predicted boundary and the label and a lower offset of the predicted boundary. In addition, the mIoU and FIoU of our model are 83.45% and 70.81%, respectively, both higher than those of the other SOTA methods. The mIoU values of UNet, TransUNet, and SwinUNet all exceed 80%, indicating that models based on simple U-shaped structures are more likely to achieve better results in landslide segmentation. The models with HD95 below 10 include UNet and SwinUNet, indicating that an effective skip connection structure can improve the boundary accuracy of landslide segmentation.

The higher IoU and F1-score of our model suggest that the encoder has better feature modeling performance on landslides. This is due to the additional CASI module on the encoder, which improves the structure by adding spatial and channel attention. It expands the receptive field of the encoding layer and eliminates the internal noise of the landslides through spatial intersection pooling. The channel weights further calibrate the landslide features. The combination of the CASI module with swin transformer blocks achieves superior results in modeling landslide features. The optimal HD95 of our model is attributed to the SDE module, which transmits the boundary details retained in the shallow feature map to the decoding side, providing reference information for the recovery of landslide boundaries and thereby improving the final boundary accuracy. The mIoU values of FCN, DeepLabV3+, and TransFuse are all below 80%, and their HD95 values are all above 10. Because the decoders in these models are either too simple or highly redundant, they restore landslide boundaries poorly, resulting in lower segmentation accuracy.
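HD95, the boundary index discussed above, can be illustrated with a small pure-Python sketch. It assumes one common convention (the larger of the two directed 95th-percentile nearest-point distances, with linear-interpolation percentiles); the convention used in the experiments is not stated in this section, and the function names are hypothetical.

```python
import math


def directed_distances(a, b):
    """For each point in a, the Euclidean distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def hd95(pred_boundary, label_boundary):
    """Symmetric 95th-percentile Hausdorff distance between two point sets."""
    d_pred_to_label = directed_distances(pred_boundary, label_boundary)
    d_label_to_pred = directed_distances(label_boundary, pred_boundary)
    return max(percentile(d_pred_to_label, 95), percentile(d_label_to_pred, 95))


# Toy boundaries: the predicted edge is shifted by one pixel from the label.
pred = [(0, 0), (0, 1), (0, 2)]
label = [(1, 0), (1, 1), (1, 2)]
print(hd95(pred, label))
```

Using the 95th percentile instead of the maximum makes the index less sensitive to a few outlying boundary pixels, which is why a lower HD95 is read as a smaller overall boundary offset.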

3.2. Qualitative Results on Validation Datasets

In order to provide a more intuitive depiction of the landslide segmentation results, we visualize the predicted results of the validation datasets as shown in Figure 9. The results show the strengths of our model in dealing with fuzzy pixels, recovering boundary details, identifying small landslides, and maintaining the continuity of landslides. Compared with other models, our model achieves better performance in landslide segmentation.

Specifically, in Figure 9, rows 1, 2, and 5, there are small and fuzzy landslides, and the recognition capabilities of FCN, DeepLabV3+, and TransFuse on these areas are less effective. According to the results, FCN and DeepLabV3+ exhibit poor recognition abilities for small-scale landslides due to the loss of detailed information and insufficient compensation. Although TransFuse combines the U-shaped transformer structure with UNet, it falls short in achieving competitive results due to inadequate structural enhancements tailored for identifying small-scale landslides. UNet and SwinUNet perform relatively better in recognizing small landslides. However, some information of small landslides is still lost during down-sampling in the UNet due to the absence of global information modeling, as indicated in the segmentation results in the first row of Figure 9. In contrast to the aforementioned models, our model exhibits a dual advantage in small landslide identification and detail restoration, thanks to the fusion of transformer and convolution attention structures in the encoder, expanding the feature encoding space. Additionally, the SDE structure linked to the decoder enhances the model’s ability to recover intricate landslide features.

In addition to small landslide boundary recognition, our model also showcases exceptional performance in Figure 9 rows 4, 5, and 6. By referencing boundaries in the label and original image, our model extracts landslide boundaries more accurately. A comparison with the SwinUNet results suggests that the SDE module notably enhances the model’s ability to recover landslide boundaries.

In Figure 9 row 2, the upper rectangle in the image contains a mixed area of landslide and background located at the landslide’s edge, resulting in fuzzy features. Our model’s segmentation of this region aligns most closely with the label. The CASI module reduces noise in mixed areas through average pooling of rows and columns, while the modeling in channel dimensions further optimizes feature construction, ultimately improving the recognition performance in fuzzy areas.

Due to the influence of topography, high altitudes in certain areas may result in the phenomenon of “holes” and disrupt the continuity of landslides. In the experiments, our model achieves improved continuity within landslides compared to other SOTA models, as evidenced by the right rectangle in Figure 9, row 4, and the middle rectangle in row 6. This outstanding performance is attributed to intersected average pooling, which increases the probability of identifying “hole” pixels within landslides as positive pixels by taking into account whole rows and columns of pixels in the image.

3.3. Qualitative Results on Test Datasets

In order to evaluate the generalization of our model, we visualize the landslide segmentation results on the test datasets as shown in Figure 10. As these results are derived from test datasets without labels, the comparative analysis is based on the original remote sensing images. Overall, our model exhibits significant advantages in landslide segmentation performance compared to other models. The performance of FCN, DeepLabV3+, and TransFuse on the test datasets is relatively poor.

Specifically, our model maintains optimal recognition of small-scale landslides, as shown in Figure 10 rows 1, 4 and 5. Conversely, FCN, DeepLabV3+, and TransUNet display weak capabilities in recognizing small-scale landslides. These models fail to effectively compensate for lost information regarding small-scale targets in the encoder, leading to poor recognition of small-scale landslides.

Recognition of fuzzy areas is evident in row 2 and the right rectangle in row 6. In the results, UNet, SwinUNet, and our model achieve higher recognition accuracy. In addition, in row 4, where small landslides appear blurred, only SwinUNet and our model effectively recognize these areas, with our model’s result being more consistent with the original image. This is due to the average pooling structure, which enhances the homogeneity of fuzzy regions.

In Figure 10, row 3, there is a background area in the right rectangle that shares similar characteristics with the landslide region. The key distinction of this area from actual landslides is its weaker brightness. Among all models, only our model successfully eliminates this area from the segmentation results. This is mainly due to the channel feature encoding in our network. Different classes that may be challenging to distinguish based on visual features can exhibit significant differences in channel features.

In the right rectangle in Figure 10, row 4, our model delivers superior outcomes. Other methods struggle to identify the landslide area effectively due to interference from the “hole” present in the original images. Our model not only accurately identifies landslide areas but also maintains good continuity, indicating the robust generalization capabilities of the CASI module in the test datasets.

The left rectangle in the last row contains two different landslide bodies. UNet and our model have better segmentation performances, while the other models fail to clearly distinguish the two landslide bodies. With the strong noise suppression ability within landslides, the overall segmentation of landslides of our model outperforms UNet. The results highlight that our model can preserve continuity within landslides while enhancing the independence between different landslide bodies.

In conclusion, our model demonstrates strong generalization performance on the test datasets, underscoring its advantages in landslide segmentation. The swin transformer blocks in the encoder enhance the capability of the model to identify small-scale landslides. The SDE module focuses on capturing detailed spatial information from shallow feature maps and transmitting it to the decoder, thereby improving the boundary accuracy of landslide segmentation. Furthermore, the expanded average pooling in the SI module alleviates the “hole” phenomenon and improves the continuity of landslide results by reducing noise within landslides. By attending to the channel features, the CA module makes the model better at recognizing pixels and regions that are difficult to distinguish visually.

A comparison of the results on the validation and test datasets shows the following. In recent years, most models used for landslide segmentation have still been convolution-based networks. However, these models, such as FCN, UNet, and the DeepLab series commonly used in landslide segmentation tasks, are limited to modeling landslides from low-level and high-level visual or spatial features. Models that combine convolution and transformer structures, which have been developed gradually in recent years, can capture the contextual information in landslide images but have still shown unsatisfactory results in landslide segmentation. These models mix the two network structures, which causes a certain degree of disharmony between the two modeling methods, and their large size introduces redundancy in the network structure, ultimately leading to relatively low accuracy in landslide segmentation. SwinUNet, which has gradually come into use for landslide segmentation in recent years, models landslide features entirely with the transformer structure and achieves end-to-end landslide segmentation. However, no structure built on this model has yet been designed specifically to improve landslide features. The network implemented in this paper, which modifies SwinUNet with structures designed specifically to improve landslide features, therefore achieves better accuracy in landslide segmentation tasks.

Finally, to further verify the practical effectiveness of our method, we conducted comparative experiments between our proposed method and the threshold-based object-oriented segmentation method, which is the most commonly used approach in current landslide segmentation applications. The experimental results are shown in Figure 11.

The results show substantial missed segmentation when landslides are segmented with the object-oriented method. The reason is that this method performs a hard segmentation based only on simple features of image pixel values. For targets with complex and diverse optical and texture features, such as landslides, the features constructed by this method are not comprehensive enough, and it is difficult to recognize targets with blurred features and targets with obvious features in the same image. Landslide segmentation tasks therefore require algorithms that can construct more complex features, and methods based on deep learning networks can meet this requirement. For this reason, experiments and discussions based on deep learning models are important and valuable in research on landslide segmentation.

4. Discussion

4.1. Computational Efficiency Analysis of Models

To evaluate our model comprehensively, we quantify the computational efficiency of the models in addition to analyzing the landslide segmentation accuracy. We select three indices to measure the complexity and efficiency of the models in our experiments: number of parameters (Params), floating-point operations (FLOPs), and frames per second (FPS). The values of these indices for all models are shown in Figure 12a–c.

The data reveal that the Params and FLOPs of UNet are significantly higher than those of the other models (Figure 12a,b). UNet is a symmetric structure based on pure convolution operations, whose dense computation results in a substantial increase in parameters compared to other models. However, this sharp increase in parameters does not translate into a notable advantage in landslide segmentation accuracy. According to the earlier analysis, the mIoU of UNet is roughly on par with that of TransUNet, even though the model complexity of TransUNet is approximately one sixth that of UNet (Figure 12a). The parameter utilization of UNet in landslide segmentation is therefore relatively low. FCN achieves better results in FPS (Figure 12c), but the accuracy analysis shows that its landslide segmentation performance is not optimal. The model complexity of FCN is much lower than that of UNet, although both are built solely on convolutions. This indicates that the FPS advantage of FCN stems from its simpler structure. However, the loss of accuracy caused by this lightweight design is not acceptable in landslide segmentation. DeepLabV3+, also built on pure convolution structures, has lower model complexity (Figure 12a), but its FPS is not promising. Nor does DeepLabV3+ show exceptional accuracy in landslide segmentation. This implies that the ASPP module in the network does not achieve the desired effect in landslide segmentation.

The models integrated with the swin transformer structure have relatively few parameters (Figure 12a). Their accuracy in landslide segmentation surpasses that of the networks based on pure convolution structures, demonstrating the advantage of the swin transformer structure in our experiments: it reduces the parameters while preserving segmentation accuracy and efficiency. It also demonstrates the importance of global contextual information in landslide recognition. Specifically, the advantage of TransFuse is modest, as it has higher model complexity and lower FPS. Although the fusion of two structures improves the segmentation accuracy to some extent, the overall redundancy of the network remains high, indicating room for improvement in the effective utilization of parameters. In contrast, TransUNet has a significant advantage in FPS (Figure 12c), mainly owing to its lightweight structure. It fuses the two structures by replacing the convolution structure in the encoder with the transformer structure, reducing parameters while incorporating global information modeling of the image. This enables it to achieve landslide segmentation accuracy comparable to that of UNet, which has far more parameters.

SwinUNet, based on a pure transformer structure, greatly reduces parameters, with its FPS being comparable to FCN. Additionally, previous results show that SwinUNet has higher accuracy in recognizing landslide boundaries compared to the fusion models (Table 1). The excellent performance of SwinUNet forms the basis for improving the accuracy of our model in landslide segmentation. Although the complexity of our model slightly increases compared to SwinUNet, there is a substantial improvement in landslide segmentation accuracy. Our model improves segmentation accuracy by about three points compared to SwinUNet (Table 1). Furthermore, compared to UNet, the mIoU increases by nearly two points, while the Params are nearly one tenth those of UNet, with the FPS also surpassing that of UNet (Figure 12a,c).

The computed values of APC, APL, and UTB for all models are shown in Figure 12d–f. The results suggest that the APL values of UNet and our model are lower than those of the other models (Figure 12e), while the APC of UNet is only 0.63 (Figure 12d), indicating the parameter redundancy of UNet. Our model has an APC of 6.55, second only to SwinUNet’s 7.58 (Figure 12d), signifying superior average performance and utilization of parameters in our model for landslide segmentation. Furthermore, our model excels in UTB (Figure 12f). Notably, FCN, despite having the highest FPS, does not hold an advantage in UTB, mainly due to its poor landslide segmentation accuracy.
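As a rough consistency check, the definition of APC (mIoU divided by Params) can be inverted to recover the parameter count that a reported APC implies. The worked line below uses the mIoU of 83.45 and the APC of 6.55 reported for our model, and assumes that mIoU enters the index in percent and that Params is expressed in the unit of Figure 12a (presumably millions, which this section does not state explicitly). The result is an implied value derived by arithmetic, not a figure reported in the experiments.

```latex
\mathrm{Params} = \frac{\mathrm{mIoU}}{\mathrm{APC}} = \frac{83.45}{6.55} \approx 12.74
```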

These indices comprehensively and quantitatively portray the performance of the models. The results unequivocally show the distinct advantage of our model for landslide segmentation.

4.2. Ablation Analysis

To assess the efficacy of the CASI and SDE modules in our model, we conduct ablation experiments on landslide segmentation in this section. The quantitative results are shown in Table 2. In the experiments, we add the channel attention (CA) module, spatial intersection (SI) module, and spatial detail enhancement (SDE) module to the baseline for training. We also train the network with pairwise combinations of these modules. To clearly present the accuracy improvement of landslide segmentation after the addition of each module, we depict the optimization of each accuracy index in Figure 13. Because a lower HD95 is better, we display its reciprocal in Figure 13. Specifically, Figure 13a–c show the optimization of the accuracy indices after adding the CA, SI, and SDE modules to the network, respectively. Figure 14 shows the landslide segmentation results of our ablation experiments.

(1) Effect of Channel Attention Module: Firstly, the CA module in baseline-CA leads to a decrease in HD95 (Figure 13a), indicating that the channel attention mechanism in the encoder can improve the accuracy of recognizing fuzzy pixels in the image. Integrating the CA module with the SI module (baseline-CASI) significantly improves the performance of the network (Figure 13a). Specifically, the FIoU increases by 1.61 points, while HD95 falls to about half of its original level (Table 2). This highlights the superior results achieved by the CA module when used in conjunction with the SI module. The effectiveness of this combination is attributed to the fact that both the CA module and the SI module are based on convolution structures, which differ from the transformer structure. Directly attaching the CA module to the backbone composed of transformer structures could lead to a divergence in feature enhancement. However, when it is integrated with the convolution-based SI module, significant feature enhancement effects are observed.

In Figure 14, row 2, the CA module in baseline-CA improves the accuracy of the foreground in the lower rectangle. Moreover, a comparison between baseline-CASI and baseline-SI reveals that the CA module significantly improves the accuracy of the results built upon the SI module. This is also evident in the upper rectangle of the group baseline-CASI/baseline-SI in row 1. The upper rectangles of the group baseline-CA/baseline in row 4 demonstrate the advantage of the CA module in identifying blurred pixels. In addition, the CA module in baseline-CA-SDE in the last row significantly improves the ability to distinguish clouds from landslides.

The channel attention expands the space of feature modeling for landslides and is beneficial in extracting complex features. Given the frequent presence of visually indistinguishable targets in remote sensing images, multi-channel attention construction can effectively alleviate this problem. Furthermore, landslides in remote sensing images often bear resemblance to other environmental features, such as clouds and buildings. The channel characteristics increase the differences among these features, thereby improving the accuracy of landslide recognition.
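The general idea of channel attention can be sketched in the squeeze-and-excitation style: each channel is summarized by global average pooling, a gate in (0, 1) is computed per channel, and the channel is rescaled by its gate. The sketch below is a deliberate simplification and not the CA module of AST-UNet: it uses a single linear layer followed by a sigmoid, and the function name, `weights`, and `biases` are hypothetical stand-ins for learned parameters.

```python
import math


def channel_attention(features, weights, biases):
    """Squeeze-and-excitation style gating on nested-list feature maps.

    features: list of C channels, each an H x W nested list of floats.
    weights:  C x C matrix (hypothetical learned excitation weights).
    biases:   length-C list (hypothetical learned biases).
    Returns the per-channel gates and the rescaled features.
    """
    # Squeeze: global average pooling over each channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in features]
    # Excite: one linear layer followed by a sigmoid gate per channel.
    gates = [
        1 / (1 + math.exp(-(sum(w * s for w, s in zip(w_row, squeezed)) + b)))
        for w_row, b in zip(weights, biases)
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(features, gates)]
    return gates, scaled


# Two 2x2 channels: one with a strong response, one with none.
feats = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
gates, scaled = channel_attention(feats, identity, [0.0, 0.0])
print([round(g, 4) for g in gates])
```

In a trained network the gates depend on learned weights, so two targets that look alike spatially can still receive different channel weightings, which is the effect the paragraph above attributes to channel features.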

(2) Effect of Spatial Intersection Module: In Table 2, the accuracy improvement of baseline-SI relative to the baseline is not significant. However, when combined with the CA module and SDE module, the SI module has a significant positive impact on landslide segmentation accuracy (Figure 13b). Compared with baseline-CA, the FIoU increases by 3.71 in baseline-CASI (Table 2). Similarly, compared to baseline-SDE, baseline-SI-SDE exhibits a 1.11 increase in mIoU, a 2.16 increase in FIoU, and a 0.91 decrease in HD95 (Table 2). The data clearly indicate that the SI module significantly improves the accuracy of the model in identifying landslides, with a more pronounced optimization effect when combined with other modules. Firstly, the CA module refines the spatial attention weights obtained from the SI module. Secondly, the SDE module expands the distribution of convolution structures across the network, enhancing the effectiveness of the SI module, which is itself based on convolution structures.

In Figure 14, the upper rectangles of the baseline-SI/baseline group in row 1, and the groups baseline-SI-SDE/baseline-SDE and ours/baseline-CA-SDE in row 2 all demonstrate the improvement of landslide continuity resulting from the addition of the SI module. The results in rows 3 and 4 further highlight the advantages of the SI module in improving the integrity of landslide recognition.

The SI module pools the image in both row and column directions, expanding the spatial attention range in all stages of the hierarchical global information modeling process. This enables the model to capture more relationships between landslides and their surroundings. The averaging process helps alleviate noise within the landslides and enhances the continuity of landslides in segmentation.
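The intuition behind row and column average pooling can be illustrated with a toy example. The sketch below is only the pooling idea, under the assumption that the row average and column average at each position are combined by a simple mean; the actual SI module also involves learned layers, and the function name is hypothetical.

```python
def intersection_average(feature):
    """Give each pixel the mean of its row average and its column average.

    feature: H x W nested list of floats (a single-channel response map).
    """
    h, w = len(feature), len(feature[0])
    row_mean = [sum(row) / w for row in feature]
    col_mean = [sum(feature[i][j] for i in range(h)) / h for j in range(w)]
    return [[(row_mean[i] + col_mean[j]) / 2 for j in range(w)]
            for i in range(h)]


# A 3x3 landslide response with a "hole" (0.0) at the centre.
patch = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
pooled = intersection_average(patch)
print(round(pooled[1][1], 4))
```

In this toy case the hole pixel's response rises from 0 to roughly two thirds because the landslide pixels sharing its row and column dominate the averages, which mirrors the way the averaging process described above suppresses noise and supports continuity within landslides.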

(3) Effect of the Spatial Detail Enhancement Module: The SDE module plays a crucial role in enhancing the accuracy of landslide segmentation, particularly reflected in the reduction in HD95, as shown in Table 2. Compared to the baseline model, baseline-SDE shows a decrease of 3.61 in HD95. When the SDE module is added to baseline-CA, HD95 decreases by 2.53 while FIoU increases by 1.84. Incorporating the SDE module into baseline-SI reduces HD95 from 13.73 to 3.17, showcasing a significant optimization effect: FIoU increases by 2.21, surpassing 70%, and mIoU increases by 1.95. Additionally, Figure 13c demonstrates that the SDE module yields positive optimization effects across almost all experiments.

In Figure 14, row 1, the models with the SDE module provide better recovery of boundary details in the lower rectangle, as is evident in the groups baseline-SDE/baseline, baseline-SI-SDE/baseline-SI, and ours/baseline-CASI. The enhancement of boundary details by the SDE module is also noticeable in the left rectangle of the baseline-SI-SDE, baseline-CA-SDE, and our models in row 3. In the last row, we observe that, compared to the baseline, the boundary recognition results of baseline-SDE are more closely aligned with the boundaries in the original image.

The SDE module enhances spatial details for the up-sampling process of the network, which are important for the resolution recovery of small feature maps. The number of landslide images available for research is often limited, so images are cut into smaller patches for input to the network, which results in a great loss of spatial details in the feature maps at the end of the encoder. The SDE module enables the network to focus more on detailed spatial information, effectively improving the performance of detail recovery on landslide boundaries.

5. Conclusions

In this paper, we propose a novel deep-learning model, AST-UNet, to achieve rapid and precise segmentation and recognition of landslides in remote sensing images. AST-UNet is founded on the U-shaped structure of the swin transformer, enhanced by the inclusion of the CASI and SDE modules. The spatial attention expansion in the SI module reduces noise and alleviates the “hole” phenomenon within landslides. The CA module not only fine-tunes the performance of the SI module but also boosts the model’s ability to identify fuzzy targets. The SDE module optimizes the skip-connection structure of the network, aiding in the recovery of spatial details from small feature maps and effectively raising the accuracy of detecting landslide boundaries. Through comprehensive experiments comparing AST-UNet with SOTA models, we demonstrate that our model achieves superior results in landslide segmentation with few parameters and low computational complexity while maintaining image processing speed. In addition, ablation experiments validate the effectiveness of all the additional modules integrated into AST-UNet.

Although our model achieves superior landslide segmentation accuracy, some issues remain unresolved. Due to the limited availability of images, the datasets for network training are not large enough, which may lead to over-fitting and affect the generalization of the model, especially for models with many parameters. Further work is therefore warranted to expand the training datasets for landslide segmentation. Additionally, as landslide segmentation results are typically represented in vector form, automation of the post-segmentation process remains to be achieved.

Author Contributions

Conceptualization, Y.W. and X.G.; methodology, Y.W., W.W. and B.L.; software, B.L.; validation, B.L.; formal analysis, B.L.; resources, Y.W.; writing—original draft preparation, B.L.; writing—review and editing, B.L. and Y.W.; visualization, B.L.; supervision, Y.W. and W.W.; project administration, Y.W. and X.G.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Plan Project of Linzhi, Xizang [grant number: SYQ2024-12] and the Key Project of Innovation LREIS [grant number: KPI007].

Data Availability Statement

The code is available at https://gitcode.com/lbxlld/Attention_Swin-Transformer_UNet_for_Landslide_Segmentation_in_Remotely_Sensed_Images/tree/code (accessed on 23 November 2024). The remote sensing data are available at https://sasclouds.com/chinese/home (accessed on 10 March 2024). The label data are available at https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2023JF007534 (accessed on 10 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ghorbanzadeh, O.; Blaschke, T.; Gholamnia, K.; Meena, S.R.; Tiede, D.; Aryal, J. Evaluation of Different Machine Learning Methods and Deep-Learning Convolutional Neural Networks for Landslide Detection. Remote Sens. 2019, 11, 196.
Lei, T.; Zhang, Y.; Lv, Z.; Li, S.; Liu, S.; Nandi, A.K. Landslide Inventory Mapping From Bitemporal Images Using Deep Convolutional Neural Networks. IEEE Geosci. Remote Sens. Lett. 2019, 16, 982–986.
Wang, X.; Fan, X.; Xu, Q.; Du, P. Change detection-based co-seismic landslide mapping through extended morphological profiles and ensemble strategy. ISPRS J. Photogramm. Remote Sens. 2022, 187, 225–239.
Ji, S.; Yu, D.; Shen, C.; Li, W.; Xu, Q. Landslide detection from an open satellite imagery and digital elevation model dataset using attention boosted convolutional neural networks. Landslides 2020, 17, 1337–1352.
Tang, X.; Tu, Z.; Wang, Y.; Liu, M.; Li, D.; Fan, X. Automatic Detection of Coseismic Landslides Using a New Transformer Method. Remote Sens. 2022, 14, 2884.
Mondini, A.C.; Guzzetti, F.; Chang, K.-T.; Monserrat, O.; Martha, T.R.; Manconi, A. Landslide failures detection and mapping using Synthetic Aperture Radar: Past, present and future. Earth-Sci. Rev. 2021, 216, 103574.
Tehrani, F.S.; Santinelli, G.; Herrera, M.H. Multi-Regional landslide detection using combined unsupervised and supervised machine learning. Geomat. Nat. Haz. Risk 2021, 12, 1015–1038.
Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 640–651.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015.
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision; Springer: Berlin/Heidelberg, Germany, 2018.
Xin, L.B.; Han, L.; Li, L.Z. Landslide Intelligent Recognition Based on Multi-source Data Fusion. J. Earth Sci. Environ. 2023, 45, 920–928.
Mao, J.Q.; He, J.; Liu, G.; Fu, R. Landslide recognition based on improved DeepLabV3+ algorithm. J. Nat. Disaster. 2023, 32, 227–234.
Shi, W.; Zhang, M.; Ke, H.; Fang, X.; Zhan, Z.; Chen, S. Landslide recognition by deep convolutional neural network and change detection. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4654–4672.
Fu, B.; Li, Y.; Han, Z.; Fang, Z.; Chen, N.; Hu, G.; Wang, W. RIPF-Unet for regional landslides detection: A novel deep learning model boosted by reversed image pyramid features. Nat. Hazards 2023, 119, 701–719.
Meena, S.R.; Soares, L.P.; Grohmann, C.H.; van Westen, C.; Bhuyan, K.; Singh, R.P.; Floris, M.; Catani, F. Landslide detection in the Himalayas using machine learning algorithms and U-Net. Landslides 2022, 19, 1209–1229.
Ganerød, A.J.; Franch, G.; Lindsay, E.; Calovi, M. Automating global landslide detection with heterogeneous ensemble deep-learning classification. Remote Sens. Appl. Soc. Environ. 2024, 36, 101384.
Sameen, M.I.; Pradhan, B. Landslide detection using residual networks and the fusion of spectral and topographic information. IEEE Access 2019, 7, 114363–114373.
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
Peng, H.; Zheng, S.; Li, X.; Yang, Z. Residual Module and Multi-scale Feature Attention Module for Exudate Segmentation. In Proceedings of the 2018 International Conference on Sensor Networks and Signal Processing (SNSP), Xi’an, China, 28–31 October 2018.
Shi, W.; Jiang, F.; Zhao, D. Single image super-resolution with dilated convolution based multi-scale information learning inception module. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017.
Sun, Y.; Dai, D.; Zhang, Q.; Wang, Y.; Xu, S.; Lian, C. MSCA-Net: Multi-scale contextual attention network for skin lesion segmentation. Pattern Recognit. 2023, 139, 109524.
Zhang, J.; Pan, B.; Zhang, Y.; Liu, Z.; Zheng, X. Building Change Detection in Remote Sensing Images Based on Dual Multi-Scale Attention. Remote Sens. 2022, 14, 5405.
Zhang, M.; Liu, Z.; Feng, J.; Liu, L.; Jiao, L. Remote Sensing Image Change Detection Based on Deep Multi-Scale Multi-Attention Siamese Transformer Network. Remote Sens. 2023, 15, 842.
Zhang, Z.; Liu, D.; Gao, D.; Shi, G. A novel spectral-spatial multi-scale network for hyperspectral image classification with the Res2Net block. Int. J. Remote Sens. 2022, 43, 751–777.
Lu, Z.; Peng, Y.; Li, W.; Yu, J.; Ge, D.; Han, L.; Xiang, W. An Iterative Classification and Semantic Segmentation Network for Old Landslide Detection Using High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4408813.
Zhu, Q.; Chen, L.; Hu, H.; Xu, B.; Zhang, Y.; Li, H. Deep fusion of local and non-local features for precision landslide recognition. arXiv 2020, arXiv:2002.08547.
Zhang, W.; Zhang, Y.; Zhang, L. Multiplanar Data Augmentation and Lightweight Skip Connection Design for Deep-Learning-Based Abdominal CT Image Segmentation. IEEE Trans. Instrum. Meas. 2023, 72, 2532111.
Locke, W.; Lokhmachev, N.; Huang, Y.; Li, X. Radio Map Estimation with Deep Dual Path Autoencoders and Skip Connection Learning. In Proceedings of the 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Toronto, ON, Canada, 5–8 September 2023.
Shi, Y.; Guo, E.; Zhu, S.; Gu, J.; Bai, L.; Han, J. Research on optimal skip connection scale in learning-based scattering imaging. In Proceedings of the Seventh Symposium on Novel Photoelectronic Detection Technology and Applications 2020, Kunming, China, 5–7 November 2020.
Xiong, S.; Tan, Y.; Li, Y.; Wen, C.; Yan, P. Subtask Attention Based Object Detection in Remote Sensing Images. Remote Sens. 2021, 13, 1925.
Fan, R.; Wang, L.; Feng, R.; Zhu, Y. Attention based Residual Network for High-Resolution Remote Sensing Imagery Scene Classification. In Proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Yokohama, Japan, 28 July–2 August 2019.
Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
Zhao, C.; Shikang, L.; Zhangjian, Q. SENet-optimized Deeplabv3+ landslide detection. Sci. Technol. Eng. 2022, 22, 14635–14643.
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
Peng, N.; Sun, S.; Wang, R.; Zhong, P. Combining interior and exterior characteristics for remote sensing image denoising. J. Appl. Remote Sens. 2016, 10, 025016.
Zhou, X.; Shao, Z.; Liu, J. Geographic ontology driven hierarchical semantic of remote sensing image. In Proceedings of the 2012 International Conference on Computer Vision in Remote Sensing, Xiamen, China, 16–18 December 2012.
Vaswani, A. Attention is all you need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
Dosovitskiy, A. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021.
Lv, P.; Ma, L.; Li, Q.; Du, F. ShapeFormer: A Shape-Enhanced Vision Transformer Model for Optical Remote Sensing Image Landslide Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2681–2689.
Li, D.; Tang, X.; Tu, Z.; Fang, C.; Ju, Y. Automatic Detection of Forested Landslides: A Case Study in Jiuzhaigou County, China. Remote Sens. 2023, 15, 3850.
Tang, X.; Tu, Z.; Ren, X.; Fang, C.; Wang, Y.; Liu, X.; Fan, X. A Multi-modal Deep Neural Network Model for Forested Landslide Detection. Geomat. Inf. Sci. Wuhan Univ. 2023, 49, 1566–1573.
Zhou, Y.; Peng, Y.; Li, W.; Yu, J.; Ge, D.; Xiang, W. A Hyper-pixel-wise Contrastive Learning Augmented Segmentation Network for Old Landslide Detection Using High-Resolution Remote Sensing Images and Digital Elevation Model Data. arXiv 2023, arXiv:2308.01251.
Li, P.; Wang, Y.; Si, T.; Ullah, K.; Han, W.; Wang, L. MFFSP: Multi-scale feature fusion scene parsing network for landslides detection based on high-resolution satellite images. Eng. Appl. Artif. Intell. 2024, 127, 107337.
Yang, Z.; Xu, C.; Li, L. Landslide Detection Based on ResU-Net with Transformer and CBAM Embedded: Two Examples with Geologically Different Environments. Remote Sens. 2022, 14, 2885.
Du, Y.; Huang, L.; Zhao, Z.; Li, G. Landslide body identification and detection of high-resolution remote sensing image based on DETR. Bull. Surv. Mapp. 2023, 5, 16–20.
Fu, R.; He, J.; Liu, G.; Li, W.; Mao, J.; He, M.; Lin, Y. Fast Seismic Landslide Detection Based on Improved Mask R-CNN. Remote Sens. 2022, 14, 3928.
Huang, Y.; Zhang, J.; He, H.; Jia, Y.; Chen, R.; Ge, Y.; Ming, Z.; Zhang, L.; Li, H. MAST: An Earthquake-Triggered Landslides Extraction Method Combining Morphological Analysis Edge Recognition With Swin-Transformer Deep Learning Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 2586–2595.
He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408715.
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2023; Volume 13803, pp. 205–218.
Ge, C.; Nie, Y.; Kong, F.; Xu, X. Improving Road Extraction for Autonomous Driving Using Swin Transformer Unet. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022.
Jing, Y.; Zhang, T.; Liu, Z.; Hou, Y.; Sun, C. Swin-ResUNet+: An edge enhancement module for road extraction from remote sensing images. Comput. Vis. Image Underst. 2023, 237, 103807.
Chen, X.; Liu, M.; Li, D.; Jia, J.; Yang, A.; Zheng, W.; Yin, L. Conv-trans dual network for landslide detection of multi-channel optical remote sensing images. Front. Earth Sci. 2023, 11, 1182145.

Figure 1. Examples of the landslide images and the corresponding labels.

Figure 2. The study region for landslide segmentation experiments.

Figure 3. The architecture of the proposed AST-UNet: CASI represents the channel attention and spatial intersection module and SDE represents the spatial detail enhancement module.

Figure 4. The interrelated characteristics between landslides and surroundings. (a) the correlation between landslides and surrounding water bodies; (b,c) the correlation between landslides and surrounding vegetation.

Figure 5. The challenges in landslide segmentation. (a,b) the “hole” phenomenon within the landslide; (c) the noise within the landslide; (d–f) disruptive land features around landslides.

Figure 6. The structure of the channel attention and spatial intersection (CASI) module.

Figure 7. Comparison of feature maps from the SDE module and the Swin Transformer blocks.

Figure 8. The structure of the spatial detail enhancement (SDE) module.

Figure 9. Comparison of landslide segmentation results between the proposed AST-UNet and the comparative models on the validation datasets.

Figure 10. Comparison of landslide segmentation results between the proposed AST-UNet and the comparative models on the test datasets.

Figure 11. The results of landslide segmentation with AST-UNet and the object-oriented algorithm.

Figure 12. Bar charts of the comparison of model properties: (a) parameters of the models; (b) FLOPs of the models; (c) FPS of the models; (d) APC of the models; (e) APL of the models; and (f) UTB of the models.

Figure 13. The improvement of evaluation indexes after adding our proposed modules: (a) the improvement from the CA module; (b) the improvement from the SI module; (c) the improvement from the SDE module.

Figure 14. Visualizations of results extracted from ablation experiments.

Table 1. Quantitative comparison with SOTA methods on the experimental datasets. The best results are shown in bold, with the second-best being underlined.

Method	F1-Score (%)	HD95	mIoU (%)	FIoU (%)	BIoU (%)
FCN	80.55	15.70	70.70	53.00	88.40
UNet	89.09	7.53	81.51	69.08	93.95
DeepLab V3+	86.79	11.59	78.98	62.53	95.43
TransFuse	87.51	10.74	79.10	65.32	92.89
TransUNet	88.92	12.91	81.39	68.18	94.59
SwinUNet	88.32	7.69	80.46	67.24	93.68
AST-UNet (ours)	90.14	3.73	83.45	70.81	96.09
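
The IoU columns of Table 1 are related: in every row, mIoU equals the mean of FIoU (foreground, landslide) and BIoU (background) to within rounding; for AST-UNet, (70.81 + 96.09) / 2 = 83.45. The following is a minimal pure-Python sketch of how these three overlap scores can be computed from a binary prediction mask and a binary label mask. The function names, the toy masks, and the handling of a class that is absent from both masks are assumptions made for illustration, not the authors' evaluation code; F1-score and HD95 are not covered here.

```python
def class_iou(pred, label, cls):
    """IoU of one class between two same-shaped 2D masks (lists of rows)."""
    inter = union = 0
    for pred_row, label_row in zip(pred, label):
        for p, t in zip(pred_row, label_row):
            in_pred, in_label = (p == cls), (t == cls)
            inter += in_pred and in_label   # pixel is cls in both masks
            union += in_pred or in_label    # pixel is cls in at least one mask
    # Assumption: a class absent from both masks counts as a perfect match.
    return inter / union if union else 1.0


def segmentation_ious(pred, label):
    """Foreground IoU (class 1), background IoU (class 0), and their mean."""
    fiou = class_iou(pred, label, 1)
    biou = class_iou(pred, label, 0)
    return {"FIoU": fiou, "BIoU": biou, "mIoU": (fiou + biou) / 2}


# Toy 3 x 4 masks (hypothetical): 1 = landslide, 0 = background.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
label = [[0, 1, 1, 1],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]

scores = segmentation_ious(pred, label)
print(scores)
```

In this toy case the two masks share 3 landslide pixels out of 5 in their union, so FIoU is 0.6, while BIoU is 7/9; the gap between the two illustrates why the foreground score is the more demanding index when landslides cover a small part of the image.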

Table 2. Ablation experiments for the proposed modules on the validation datasets.

Method	Modules			Evaluation Indexes
Method	CA	SI	SDE	F1-Score (%)	HD95	mIoU (%)	FIoU (%)	BIoU (%)
Baseline				88.32	7.69	80.46	67.24	93.68
Baseline-CA	√			88.18	6.72	80.85	65.74	95.97
Baseline-SI		√		88.35	13.73	80.98	67.84	94.12
Baseline-SDE			√	88.73	4.08	81.82	67.89	95.74
Baseline-CASI	√	√		88.85	6.52	81.59	69.45	93.72
Baseline-CA-SDE	√		√	88.75	4.19	81.62	67.58	95.66
Baseline-SI-SDE		√	√	89.85	3.17	82.93	70.05	95.81
AST-UNet (ours)	√	√	√	90.14	3.73	83.45	70.81	96.09
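
The ablation rows can be read as differences against the baseline, whose values match the SwinUNet row of Table 1. The sketch below copies the Table 2 values and computes the change in each index for a chosen variant; the dictionary layout and the helper name are illustrative assumptions. Lower HD95 is better, so a negative change in that column is an improvement.

```python
# Values copied from Table 2, in the order given by METRICS.
METRICS = ("F1", "HD95", "mIoU", "FIoU", "BIoU")
ABLATION = {
    "Baseline":        (88.32, 7.69, 80.46, 67.24, 93.68),
    "Baseline-CA":     (88.18, 6.72, 80.85, 65.74, 95.97),
    "Baseline-SI":     (88.35, 13.73, 80.98, 67.84, 94.12),
    "Baseline-SDE":    (88.73, 4.08, 81.82, 67.89, 95.74),
    "Baseline-CASI":   (88.85, 6.52, 81.59, 69.45, 93.72),
    "Baseline-CA-SDE": (88.75, 4.19, 81.62, 67.58, 95.66),
    "Baseline-SI-SDE": (89.85, 3.17, 82.93, 70.05, 95.81),
    "AST-UNet":        (90.14, 3.73, 83.45, 70.81, 96.09),
}


def deltas(name):
    """Change of each index relative to the baseline, rounded to 2 decimals."""
    base = ABLATION["Baseline"]
    return {m: round(v - b, 2) for m, v, b in zip(METRICS, ABLATION[name], base)}


full = deltas("AST-UNet")
best_hd95 = min(ABLATION, key=lambda n: ABLATION[n][1])  # lowest HD95 wins

# Sanity check: mIoU is the mean of FIoU and BIoU up to rounding of the table.
consistent = all(abs(m - (f + b) / 2) < 0.006 for _, _, m, f, b in ABLATION.values())

print(full, best_hd95, consistent)
```

For the full model this gives +1.82 F1-score, -3.96 HD95, and +2.99 mIoU over the baseline. The lowest HD95 in the table (3.17) belongs to Baseline-SI-SDE rather than to the full model.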

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
