1. Introduction
The Qinghai–Tibet Plateau (QTP) hosts the world’s highest and largest permafrost region in the low and middle latitudes [1, 2]. Permafrost on the QTP is warmer than that in the high Arctic [3], and nearly 40% of it is considered especially warm and unstable (ground temperature > −0.5 °C) [4, 5]. Warm permafrost is more sensitive to climate warming and therefore has a higher degradation potential [6, 7, 8]. Thaw-induced slope failure plays a critical role in the permafrost degradation process, posing a considerable threat to infrastructure in cold regions and having a significant environmental impact [7, 9, 10, 11, 12]. Retrogressive thaw slumps (RTSs) are a typical and regionally widespread type of slope failure in permafrost regions, triggered by the thaw of permafrost and the melt of massive ground ice [13].
The Qinghai–Tibet Railway, the Qinghai–Tibet Highway, and other linear engineering facilities (e.g., the Qinghai–Tibet DC power transmission line) are all located along the central Qinghai–Tibet Engineering Corridor (QTEC) from Golmud to Lhasa. RTSs threaten the stability of infrastructure on the QTP, modify landscapes, and accelerate organic carbon emission [14, 15, 16, 17]. Inventory maps of RTSs are important for evaluating risk and vulnerability [9, 15], performing carbon cycle studies [16], and evaluating landslide events [18, 19]. These applications all require a robust and effective method for detecting and segmenting RTSs.
Because an RTS is a type of landslide, the methods developed for landslide identification are also applicable to RTSs. Traditional methods include visual image interpretation [20, 21, 22], object-based image analysis (OBIA) algorithms [23, 24, 25], and pixel-based classification [26, 27, 28, 29]. Visual interpretation generally yields accurate results but is time-consuming and labor-intensive, making it difficult to create RTS inventory maps for large regions and long time series. The OBIA method is usually a two-step process consisting of segmentation and classification [30, 31]. Segmentation groups pixels into vector objects, and classification identifies target objects based on object characteristics derived from spectral, spatial, hierarchical, textural, and morphological information [31]. The effective object characteristics are all manually designed and rely heavily on the researcher’s experience in feature engineering [32, 33, 34], which makes the OBIA method costly, laborious, and time-consuming. Pixel-based methods, such as maximum likelihood, minimum distance, parallelepiped, ISODATA, and K-means, have been widely used for per-pixel classification and natural hazard inventory mapping, such as landslide detection [26, 27, 28]. However, these methods rely solely on the spectral information of individual pixels, ignoring geometric and contextual information [35, 36, 37], and fail to account for the intricate textures and significant spectral heterogeneity inherent in the imagery [38].
Machine learning methods, such as decision trees, support vector machines, artificial neural networks, and random forests, offer a promising data-driven approach for automatically identifying RTSs or landslides [35, 39]. Compared to traditional methods, machine learning demonstrates superior performance due to its enhanced capacity for automatically mining high-dimensional data [40]. However, these algorithms are fundamentally pixel-based, which makes it difficult for them to capture complex spatial structural relationships. Additionally, their vulnerability to overfitting and dependence on manual feature engineering hinder their application to complex tasks and large-scale data mining.
Alongside notable advances in algorithms, increases in computational power, and the availability of large-scale datasets, deep learning, as a key subfield of machine learning, has become a dominant approach for solving complex problems in fields such as computer vision and natural language processing [41]. Because the identification of RTSs is a typical semantic or instance segmentation task in computer vision, deep learning models designed for such tasks are inherently well suited to RTS mapping and hold considerable potential for it [34, 36, 42, 43, 44, 45, 46]. Recent studies have applied deep learning models to RTS mapping from remote sensing imagery, including CubeSat-based mapping, transfer learning strategies, and deep neural network frameworks [47, 48, 49, 50], demonstrating the feasibility of automated RTS identification. Nevertheless, accurate delineation of RTSs from medium-resolution Sentinel-2 imagery remains challenging. Compared with other landslides, RTSs are much smaller, with surface areas ranging from 0.02 to 20 ha and an average of 1.44 ha [51]. In addition, the colors and spectral characteristics of RTSs are similar to those of bare land in alpine meadow areas and of other types of ground collapse, making target–background discrimination and boundary delineation more difficult in medium-resolution imagery.
Although high-resolution (HR) and very-high-resolution (VHR) remote sensing imagery is well suited to creating RTS inventories [52], its utility remains limited because these technologies were developed relatively late and freely accessible data are lacking. Fortunately, the freely available Sentinel-2 satellite data offer a reliable alternative owing to their relatively high spatial resolution of 10 m, high global revisit frequency of 5 days, long temporal coverage of nearly 10 years, rich multispectral information, and high data quality [53, 54, 55]. Crucially, this extensive temporal archive provides invaluable data for characterizing the spatiotemporal evolution of RTSs, which is essential for understanding long-term permafrost degradation dynamics. Therefore, a robust model for automatic segmentation of RTSs from Sentinel-2 imagery is of great practical significance. However, RTS segmentation in Sentinel-2 imagery is hindered by the complex morphology and variable sizes of RTSs, as well as their low contrast and fuzzy boundaries against the surrounding landscape in medium-resolution data.
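To make the scale problem concrete, the RTS areas quoted earlier (0.02 to 20 ha, with an average of 1.44 ha) can be converted into counts of 10 m Sentinel-2 pixels. The following is a minimal arithmetic sketch; the function name `area_to_pixels` and the dictionary layout are illustrative assumptions, not part of the study.

```python
def area_to_pixels(area_ha: float, pixel_size_m: float) -> float:
    """Number of square pixels of side pixel_size_m needed to cover area_ha hectares."""
    area_m2 = area_ha * 10_000.0  # 1 ha = 10,000 m^2
    return area_m2 / (pixel_size_m ** 2)


# RTS surface areas quoted in the text (ha); the key names are illustrative only.
RTS_AREAS_HA = {"smallest": 0.02, "average": 1.44, "largest": 20.0}

# Pixel footprint of each RTS size at the 10 m Sentinel-2 resolution.
pixels_10m = {name: area_to_pixels(a, 10.0) for name, a in RTS_AREAS_HA.items()}

for name, count in pixels_10m.items():
    print(f"{name} RTS: about {count:.0f} pixels at 10 m")
```

Under these figures, an average RTS covers about 144 Sentinel-2 pixels and the smallest ones only about 2, which illustrates why boundaries are hard to resolve at this resolution.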
To address this challenge in RTS mapping from medium-resolution Sentinel-2 imagery, this study proposes a Multi-Scale Object-aware Context Attention Network (MOCA-Net), an end-to-end deep semantic segmentation framework. MOCA-Net uses the Swin Transformer as its backbone because of its strong hierarchical feature extraction capability. It employs a three-stage cascaded architecture comprising an encoder, a Feature Enhancement Network, and an Enhanced Decoder to improve target–background discrimination, contextual refinement, and boundary recovery.
The main contributions of this manuscript are as follows: (1) We develop MOCA-Net, a deep semantic segmentation framework featuring a three-stage cascaded architecture with dedicated feature enhancement and sophisticated decoding stages, specifically designed to improve target–background discrimination, contextual refinement, and boundary recovery for accurate RTS mapping from satellite imagery. (2) We evaluate the effectiveness of the MOCA-Net model in segmentation of RTSs along the QTEC using Sentinel-2 data.
4. Results
4.1. Model Comparison During Training Process
As shown in Figure 9a, the training and validation losses of MOCA-Net exhibit a smoother convergence trend with a smaller gap between them, indicating that the training process is stable and shows no obvious overfitting. In comparison, the Swin + Basic Decoder presents greater fluctuations, as evidenced by the validation loss curve in Figure 9a during epochs 8–16, which indicates instability in its training process. In Figure 9b, the mIoU curves also show the improved performance of MOCA-Net compared to Swin + Basic Decoder. The mIoU curve of MOCA-Net stabilizes above 0.83 from epoch 7 onward, with peak performance surpassing 0.84. By comparison, the mIoU curve of Swin + Basic Decoder only stabilizes around 0.83 and never surpasses 0.84. In addition, because of the transfer learning strategy, both MOCA-Net and Swin + Basic Decoder converge relatively quickly.
4.2. Model Comparison by Evaluation Metrics
The six metrics listed in Table 1 are used to evaluate the performance of the CNN-based models (DeepLabv3+ Xception71, FCN ResNet152, U-Net, and HRNet) and Transformer-based models (Vision Transformer, SegFormer, Swin + Basic Decoder, and MOCA-Net). These baseline models were selected to provide representative comparisons across both CNN-based and Transformer-based segmentation paradigms. Specifically, FCN, U-Net, DeepLabv3+, and HRNet represent widely used CNN-based architectures with different feature extraction and fusion strategies, whereas Vision Transformer, SegFormer, and Swin + Basic Decoder represent Transformer-based segmentation models. In particular, Swin + Basic Decoder serves as the most direct baseline for evaluating the effectiveness of the proposed feature enhancement and decoder design, while SegFormer is included as a more recent Transformer-based segmentation baseline. Therefore, the purpose of this comparison is to assess the relative effectiveness of MOCA-Net against representative segmentation frameworks.
As shown in Table 1, Transformer-based models surpass CNN-based models in optimization robustness (lower test loss) and segmentation accuracy (higher mIoU, RTS IoU, and Dice). Notably, MOCA-Net performs best among all compared models on these four metrics, with the lowest loss (0.0542), highest mIoU (0.8609), highest RTS IoU (0.7473), and highest Dice (0.8547). Although MOCA-Net does not attain the highest precision (0.8785, achieved by SegFormer), it retains competitive precision (0.8516) while achieving higher recall (0.8606), indicating a balanced trade-off between precision and detection sensitivity for RTSs.
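For readers less familiar with these metrics, the sketch below shows how precision, recall, RTS IoU, mIoU, and Dice are commonly computed from a binary prediction mask and a reference mask. It is a minimal illustration under stated assumptions: mIoU is taken as the mean of the RTS and background IoU, the metrics are pooled over all pixels, and the function names are hypothetical. The exact definitions and averaging scheme used for Table 1 may differ.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equally sized binary masks (1 = RTS, 0 = background)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Pixel-pooled metrics for a two-class (RTS vs. background) segmentation."""
    tp, fp, fn, tn = confusion_counts(pred, truth)

    def safe_div(num, den):
        return num / den if den else 0.0

    rts_iou = safe_div(tp, tp + fp + fn)
    background_iou = safe_div(tn, tn + fp + fn)
    return {
        "precision": safe_div(tp, tp + fp),   # share of predicted RTS pixels that are correct
        "recall": safe_div(tp, tp + fn),      # share of reference RTS pixels that are found
        "rts_iou": rts_iou,
        "miou": (rts_iou + background_iou) / 2.0,  # assumed: mean over the two classes
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Toy 3 x 4 masks, purely illustrative.
pred = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
truth = [[1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
metrics = segmentation_metrics(pred, truth)
```

In this toy case there are two true positives, one false positive, and one false negative, so the RTS IoU is 0.5 while the mIoU is 0.65: the large background class raises the class-averaged score, which is why RTS IoU is the more demanding of the two.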
4.3. Effect of ESRGAN Preprocessing
To assess whether the ESRGAN-based super-resolution preprocessing materially influences the reported segmentation performance, we further conducted a controlled comparison using the original-resolution Sentinel-2 inputs without ESRGAN enhancement. In this comparison, the same dataset split, random seed, training protocol, and evaluation procedure were retained, and only the ESRGAN preprocessing step was removed. The experiment was performed for both MOCA-Net and the Swin + Basic Decoder baseline in order to examine whether the effect of ESRGAN is consistent across models.
As shown in Table 2, removing ESRGAN led to a modest but consistent decrease in the overall segmentation performance of both models. For MOCA-Net, the mIoU decreased from 0.8609 to 0.8500 and the RTS-class IoU decreased from 0.7473 to 0.7275. For the Swin + Basic Decoder baseline, the mIoU decreased from 0.8542 to 0.8455 and the RTS-class IoU decreased from 0.7359 to 0.7196. Dice and recall also decreased in both models, whereas precision increased in the non-ESRGAN setting. This pattern suggests that removing ESRGAN made the predictions more conservative, thereby reducing commission errors but increasing omission errors, as reflected by the lower recall, Dice, mIoU, and RTS-class IoU.
At the same time, MOCA-Net remained superior to the Swin + Basic Decoder baseline under both the ESRGAN-enhanced and non-ESRGAN settings. This indicates that the performance improvement of MOCA-Net does not come entirely from ESRGAN preprocessing, but is also related to the feature enhancement and decoder design of the model itself. It should be emphasized that ESRGAN only enlarges the original 10 m Sentinel-2 imagery to an input size corresponding to 2.5 m, and it cannot reconstruct genuine 2.5 m surface details from the original 10 m Sentinel-2 imagery. Because such fine-scale ground information is not contained in the original 10 m imagery, ESRGAN may generate some texture or boundary details that look clearer but may not physically exist. Therefore, the performance improvement obtained with ESRGAN should be understood as the effect of an image-enhancement preprocessing step under the current experimental setting, rather than as evidence that Sentinel-2 imagery has obtained a true 2.5 m spatial resolution. Accordingly, the reliance on ESRGAN-enhanced inputs should be regarded as a limitation of the current experimental setting, rather than as a replacement for true high-resolution remote sensing imagery.
4.4. Stability Analysis and Statistical Comparison Across Random Seeds
To further evaluate the stability of the reported performance, we repeated the experiments under the ESRGAN-enhanced setting using five random seeds, namely 42, 62, 82, 102, and 122. Considering the computational cost, this repeated-run analysis was conducted for MOCA-Net and the Swin + Basic Decoder baseline. The detailed results of all ten runs are reported in Table 3. Here, the random seed controls both the random train/validation/test split and stochastic factors in training, such as parameter initialization, mini-batch shuffling, and random data augmentation. Therefore, repeating the experiments with different seeds helps assess whether the observed performance gain is robust to both data-partition variation and training randomness rather than arising from a favorable single run.
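The role of the seed in the data partition can be sketched as follows. This is a minimal illustration, not the study's code: the 70/15/15 split fractions, the sample count of 100, and the function name `split_indices` are assumptions made for the example. Only the five seed values come from the text.

```python
import random


def split_indices(n_samples, seed, train_frac=0.70, val_frac=0.15):
    """Shuffle sample indices with a seeded generator and cut them into three subsets.

    The fractions are hypothetical defaults; the test subset receives the remainder.
    """
    rng = random.Random(seed)          # private generator, so the split depends only on the seed
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


SEEDS = [42, 62, 82, 102, 122]  # the five seeds listed in the text

# One partition per seed; re-running with the same seed reproduces the same partition.
splits = {seed: split_indices(100, seed) for seed in SEEDS}
```

In a full training pipeline the same seed would also be passed to the deep learning framework's own random number generators so that initialization, shuffling, and augmentation are reproducible as well.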
As shown in Table 3, MOCA-Net consistently outperformed the Swin + Basic Decoder across all five random seeds. Specifically, MOCA-Net achieved higher mIoU, RTS-class IoU, and Dice in every repeated run, indicating that the observed improvement did not arise from a favorable single seed. The corresponding mean and standard deviation values are also listed in Table 3, showing relatively small variations across repeated runs for both models.
To further strengthen the statistical interpretation, Table 4 summarizes the repeated-run results in terms of mean ± standard deviation, 95% confidence intervals of the mean, and paired t-test p-values for the core segmentation metrics. MOCA-Net achieved a mean mIoU of 0.8595 ± 0.0012, an RTS-class IoU of 0.7471 ± 0.0028, and a Dice coefficient of 0.8546 ± 0.0019, compared with 0.8529 ± 0.0013, 0.7353 ± 0.0029, and 0.8468 ± 0.0021 for the Swin + Basic Decoder, respectively. Paired t-tests across the five seeds further indicated that the improvements in mIoU, RTS-class IoU, and Dice were statistically significant (all p < 0.001).
Overall, these repeated experiments provide additional evidence that the performance gain of MOCA-Net over the Swin + Basic Decoder baseline is stable and statistically supported under different random settings.
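The statistics behind this comparison can be sketched with the standard library alone. The snippet below assumes that the reported ± values are sample standard deviations over five runs and that the confidence intervals follow the usual Student's t construction; whether Table 4 was computed exactly this way is an assumption. The function names are hypothetical, and the per-seed scores needed for the paired test are those of Table 3, which are not reproduced here.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5 runs).
T_CRIT_95_DF4 = 2.776


def ci95_half_width(sample_sd, n_runs=5, t_crit=T_CRIT_95_DF4):
    """Half-width of the 95% confidence interval of the mean.

    The default t_crit is valid only for n_runs = 5.
    """
    return t_crit * sample_sd / math.sqrt(n_runs)


def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on per-seed score differences (a minus b)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Reported mIoU standard deviations over the five seeds.
moca_half = ci95_half_width(0.0012)
swin_half = ci95_half_width(0.0013)
print(f"MOCA-Net mIoU:            0.8595 +/- {moca_half:.4f}")
print(f"Swin + Basic Decoder mIoU: 0.8529 +/- {swin_half:.4f}")
```

Under these assumptions the two mIoU intervals have half-widths of roughly 0.0015 and 0.0016 and do not overlap. With five paired runs, a paired t statistic whose magnitude exceeds 2.776 corresponds to p < 0.05 (two-sided).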
4.5. Model Comparison by Visual Inspection of Segmentation Effect
While quantitative metrics have confirmed the effectiveness of the proposed model, visual comparisons with ground truth provide a more intuitive and distinct illustration of its advantages.
4.5.1. Sentinel-2 Imagery
To visually compare the segmentation performance of MOCA-Net and Swin + Basic Decoder on Sentinel-2 imagery, seven representative samples are presented in Figure 10 and Figure 11. These examples are intended to illustrate typical visual differences in boundary delineation, preservation of RTS morphology, and target–background confusion. The three samples in Figure 10 show clearer boundaries and more obvious contrast between RTSs and their surrounding landscape, whereas the samples in Figure 11 present more ambiguous transitions and less distinct edges. Although both models capture the general morphology of the seven morphologically diverse samples, MOCA-Net demonstrates superior boundary precision and better preservation of morphological details across all samples. For example, in the first instance shown in Figure 10, a distinct upward-concave morphological feature is visible in both the Sentinel-2 imagery and the corresponding RTS ground truth. Compared with the result of Swin + Basic Decoder, MOCA-Net delineates this curved structure more precisely. In the second example shown in Figure 10, the Swin + Basic Decoder fails to identify the upward-protruding part of the lower-left RTS area. In the first example of Figure 11, MOCA-Net also identifies a finer left-concave feature on the lower-right side of the RTS, which is not present in the reference label. This further supports the effectiveness of MOCA-Net in segmenting RTSs and also implies the effectiveness of the designed Feature Enhancement Network and Enhanced Decoder components. Notably, more accurate segmentation of RTSs, reflected by finer boundary delineation and more precise morphological representation, is of great significance for the long-term analysis of RTS change. The advantages of MOCA-Net revealed by visual inspection of the segmentation results are more evident than those reflected by the evaluation metrics.
4.5.2. UAV Imagery
While remote sensing imagery provides efficient coverage for large-scale RTS mapping, labels derived from image interpretation may contain uncertainties. Therefore, we compared the segmentation results of MOCA-Net and Swin + Basic Decoder against UAV orthophotos acquired during the summer of 2024 at two sites adjacent to the Beiluhe section of the QTEC (Figure 1). The centimeter-level UAV imagery serves as a higher-resolution qualitative reference for visual comparison (Figure 12). Site 1 represents a typical elliptical RTS morphology, while Site 2 exhibits irregular, complex structures. The UAV orthophotos correspond to the same RTS sites as the Sentinel-2 patches shown in Figure 12 and are used here only as qualitative reference data. Because RTS morphology and boundaries may evolve over time, the Sentinel-2 patches used in Figure 12 were independently extracted from summer 2024 imagery for these two sites, with sufficient surrounding context retained to facilitate visual comparison with the UAV orthophotos.
The comparison suggests that MOCA-Net better delineates RTS boundaries in these two examples. For Site 1, MOCA-Net produces a more geometrically accurate elliptical contour compared to the baseline model. For Site 2, MOCA-Net better captures the overall contour, showing improved capability in handling irregular RTS boundaries across different morphological types.
Although MOCA-Net identifies RTS objects and delineates relatively precise boundaries using Sentinel-2 imagery, discernible discrepancies in fine-scale details remain when compared with the centimeter-level UAV imagery. By comparing the Sentinel-2 and UAV imagery, it is evident that these limitations are primarily attributable to the inherent resolution of the input satellite data rather than algorithmic deficiencies. Therefore, the UAV comparison should be interpreted as a qualitative visual check rather than a formal quantitative validation. Nevertheless, the segmentation results indicate that MOCA-Net can provide useful RTS delineation for large-scale mapping and long-term spatiotemporal analysis.
4.6. Ablation Study and Mechanism Analysis
To verify the effectiveness of our proposed components, we conducted ablation experiments comparing four model configurations: (1) Baseline M0 (Swin + Basic Decoder), (2) M1 (adding Feature Enhancement Network), (3) M2 (adding Enhanced Decoder), and (4) M3 (MOCA-Net with both components) (Table 5).
The results show that each component contributes to performance improvements. M1 (Feature Enhancement Network) achieved mIoU of 0.8584 and improved precision to 0.8581, though with slightly lower recall. M2 (Enhanced Decoder) reached mIoU of 0.8572 with the highest precision (0.8628) but more noticeable recall reduction.
The complete model M3 achieved the best overall performance with mIoU of 0.8609 and RTS IoU of 0.7473. Most importantly, M3 maintained a better balance between precision (0.8516) and recall (0.8606), effectively integrating the strengths of both components.
The ablation results demonstrate that both the Feature Enhancement Network and the Enhanced Decoder contribute positively to the model’s overall segmentation performance. While the Feature Enhancement Network yields more balanced improvements across the evaluation metrics, the Enhanced Decoder attains the highest precision at the cost of a more pronounced drop in recall. When the two are integrated, MOCA-Net achieves the highest mIoU and RTS IoU and exhibits a well-balanced precision–recall trade-off, although its precision (0.8516) is lower than that of M1 and M2. These findings suggest that the two proposed components are complementary and jointly effective for the task of RTS segmentation.
To further interpret the performance differences observed in Table 5, particularly the contribution of the Feature Enhancement Network, we visualized the learned attention responses in Figure 13.
The local attention maps (third column), produced by the improved SE block, mainly highlight visually salient target-related regions and fine-scale feature responses. This behavior is consistent with the channel-wise reweighting mechanism of the SE block, which may help strengthen discriminative spectral–textural cues. However, the local attention still exhibits spurious activations in surrounding background regions (most noticeable in the third row), suggesting that relying solely on local cues may be insufficient to fully suppress context-induced ambiguity in Sentinel-2 imagery.
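For orientation, the channel-wise reweighting performed by a standard squeeze-and-excitation (SE) block can be sketched in a few lines. This illustrates the generic SE mechanism only and is not the improved SE block used in MOCA-Net: the weights are hand-picked toy values rather than learned parameters, bias terms are omitted, and the function name `se_reweight` is hypothetical.

```python
import math


def se_reweight(feature_map, w_reduce, w_expand):
    """Generic squeeze-and-excitation over a feature map stored as nested lists.

    feature_map: C channels, each an H x W list of floats.
    w_reduce:    (C / r) x C weight matrix of the bottleneck layer.
    w_expand:    C x (C / r) weight matrix of the expansion layer.
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling reduces each channel to one descriptor.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: bottleneck layer with ReLU, then expansion layer with sigmoid.
    hidden = [
        max(0.0, sum(w * z for w, z in zip(w_row, squeezed)))
        for w_row in w_reduce
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
        for w_row in w_expand
    ]
    # Scale: every value in a channel is multiplied by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input: two 2 x 2 channels and hand-picked weights (illustrative only).
features = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
scaled, gates = se_reweight(features, w_reduce=[[0.5, 0.5]], w_expand=[[1.0], [-1.0]])
```

Each gate lies between 0 and 1 and is shared by every pixel of its channel, so the block emphasizes or suppresses whole channels. The mechanism has no spatial selectivity of its own, which may be one reason why background responses survive in channel-reweighted features.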
In contrast, the global attention maps (fourth column), derived from the CFRM and the Non-local block, exhibit more spatially coherent and target-concentrated activation patterns. This observation is qualitatively consistent with the role of global context aggregation and long-range dependency modeling, which may help reduce irrelevant background responses.
Overall, the visual differences between the local and global attention maps qualitatively suggest that combining both mechanisms may contribute to more coherent target-focused feature responses. These visualizations are intended as illustrative qualitative evidence and should not be interpreted as direct proof of causal mechanism.
4.7. Computational Complexity Analysis
To further assess whether the performance gain justifies the additional model complexity, we compared MOCA-Net with the Swin + Basic Decoder baseline in terms of both segmentation accuracy and computational cost (Table 6).
As shown in Table 6, MOCA-Net contains 111.744 M trainable parameters, compared with 92.594 M for the Swin + Basic Decoder baseline. Its computational complexity also increases substantially, from 17.793 G to 89.031 G FLOPs. Under the same hardware setting, the average inference time increases from 32.962 ms/image to 38.059 ms/image. Although the FLOPs increase markedly, the increase in measured inference time remains relatively moderate, suggesting that the additional computations do not translate proportionally into runtime overhead in the tested environment. Although the absolute performance improvement over the baseline is moderate, notably a 1.55% relative increase in RTS-class IoU (from 0.7359 to 0.7473), the gain is consistently reflected in both quantitative metrics and qualitative boundary delineation. These results indicate that MOCA-Net achieves improved RTS segmentation at the cost of increased computational complexity, and the practical value of this trade-off should be considered according to the computational resources and deployment requirements of large-scale applications. Therefore, MOCA-Net may be more suitable for applications where boundary accuracy is prioritized, whereas the Swin + Basic Decoder remains a more efficient option for resource-constrained large-scale screening.
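The trade-off can be put on a common footing by expressing each quantity as a relative change from the baseline. The sketch below uses only the figures quoted in the text; the dictionary keys and the function name `relative_change` are illustrative.

```python
def relative_change(baseline, new):
    """Relative change from baseline to new, as a fraction of the baseline."""
    return (new - baseline) / baseline


# (Swin + Basic Decoder, MOCA-Net) pairs quoted in the text.
COMPARISON = {
    "parameters (M)": (92.594, 111.744),
    "FLOPs (G)": (17.793, 89.031),
    "inference time (ms/image)": (32.962, 38.059),
    "RTS-class IoU": (0.7359, 0.7473),
}

changes = {name: relative_change(*pair) for name, pair in COMPARISON.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.2%}")
```

In relative terms, MOCA-Net has roughly 21% more parameters and about five times the FLOPs of the baseline, yet its measured inference time is only about 15% longer, for a relative RTS-class IoU gain of about 1.55%.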
5. Conclusions
Segmenting RTSs in medium-resolution satellite imagery is challenging due to their complex morphology, variable sizes, and, more importantly, their low contrast and fuzzy boundaries against the surrounding landscape. To address these challenges, this paper proposes MOCA-Net, which employs a three-stage cascaded architecture comprising an encoder, a Feature Enhancement Network, and an Enhanced Decoder to improve target–background discrimination, contextual refinement, and boundary recovery.
Compared to the baseline Swin + Basic Decoder, MOCA-Net achieves improvements in mIoU (from 0.8542 to 0.8609, a relative gain of 0.78%) and RTS IoU (from 0.7359 to 0.7473, a relative gain of 1.55%). The ablation experiments demonstrate the effectiveness of the proposed components. Visual comparisons with both labels and UAV imagery show that MOCA-Net produces more accurate segmentation results for different RTS morphologies. These results indicate that the model has potential for automated RTS monitoring using medium-resolution satellite data and may provide a reference for large-scale permafrost degradation assessment in climate change research.
Despite MOCA-Net’s good performance, several limitations require further improvement. First, ESRGAN was used only as an image-enhancement preprocessing step. It cannot reconstruct genuine 2.5 m surface details from the original 10 m Sentinel-2 imagery, and the ESRGAN-related improvement should therefore be interpreted with caution. Second, the current approach relies on RGB optical imagery only and does not incorporate additional multispectral information or topographic variables, such as NIR/SWIR bands and DEM-derived slope and aspect, which may limit the discrimination of RTSs from spectrally similar background features and contribute to false positives in areas such as bare soil and other erosional landforms.
Future improvements should focus on multimodal integration incorporating multispectral, geological, and topographic data to reduce false positives, as well as spatiotemporal analysis for dynamic monitoring.