Article

RingFormer-Seg: A Scalable and Context-Preserving Vision Transformer Framework for Semantic Segmentation of Ultra-High-Resolution Remote Sensing Imagery

1 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
2 School of Remote Sensing and Information Engineering, Wuhan University, 129 Luoyu Road, Wuhan 430079, China
3 School of Computer Science, Wuhan University, Wuhan 430072, China
4 School of Resources and Environmental Engineering, Wuhan University of Technology, Wuhan 430070, China
5 School of Urban Design, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(17), 3064; https://doi.org/10.3390/rs17173064
Submission received: 27 July 2025 / Revised: 31 August 2025 / Accepted: 1 September 2025 / Published: 3 September 2025

Abstract

Semantic segmentation of ultra-high-resolution remote sensing (UHR-RS) imagery plays a critical role in land use and land cover analysis, yet it remains computationally intensive due to the enormous input size and high spatial complexity. Existing studies have commonly employed strategies such as patch-wise processing, multi-scale model architectures, lightweight networks, and representation sparsification to reduce resource demands, but they have often struggled to maintain long-range contextual awareness and scalability for inputs of arbitrary size. To address this, we propose RingFormer-Seg, a scalable Vision Transformer framework that enables long-range context learning through multi-device parallelism in UHR-RS image segmentation. RingFormer-Seg decomposes the input into spatial subregions and processes them through a distributed three-stage pipeline. First, the Saliency-Aware Token Filter (STF) selects informative tokens to reduce redundancy. Next, the Efficient Local Context Module (ELCM) enhances intra-region features via memory-efficient attention. Finally, the Cross-Device Context Router (CDCR) exchanges token-level information across devices to capture global dependencies. Fine-grained detail is preserved through the residual integration of unselected tokens, and a hierarchical decoder generates high-resolution segmentation outputs. We conducted extensive experiments on three benchmarks covering UHR-RS images from 2048 × 2048 to 8192 × 8192 pixels. Results show that our framework achieves top segmentation accuracy while significantly improving computational efficiency across the DeepGlobe, Wuhan, and Guangdong datasets. RingFormer-Seg offers a versatile solution for UHR-RS image segmentation and demonstrates potential for practical deployment in nationwide land cover mapping, supporting informed decision-making in land resource management, environmental policy planning, and sustainable development.

1. Introduction

Semantic segmentation of ultra-high-resolution remote sensing (UHR-RS) imagery (at least a million pixels per frame, often tens of millions or more) has become increasingly vital for Earth observation and large-scale land cover mapping [1,2]. By classifying each pixel into land use or land cover (LULC) categories, semantic segmentation enables the generation of detailed thematic maps that support applications such as urban planning, environmental monitoring, agriculture, disaster response, and natural resource management [3,4,5]. With the advancement of satellite and aerial imaging technologies, UHR geospatial imagery has become widely available [6,7,8]. While UHR-RS images offer rich spatial detail, their massive size and complexity pose significant challenges for semantic segmentation models [9,10], placing pressure on both computational efficiency and learning accuracy, as models must manage extreme resource demands while effectively capturing fine-grained details and long-range contextual dependencies [11,12].
Researchers have explored a variety of strategies to address these challenges. A straightforward approach is patch-wise processing, wherein a large image is divided into smaller tiles that are segmented independently [13]. This mitigates memory issues but often comes at the expense of global contextual awareness. The model has no knowledge of objects or patterns that span beyond each patch’s boundaries, introducing accuracy loss due to fragmented spatial information. Methods [14,15,16,17] to re-introduce context include overlapping tiles or multi-stage blending, but these add computational overhead and complexity. Another line of research employs multi-scale model architectures for global learning, integrating both local details and global context during both training and inference [12,18,19]. For example, GLNet [20] introduced a collaborative global–local framework that processes a downsampled global image alongside high-resolution local crops, effectively preserving broad context while focusing on local details. Similarly, WiCoNet [21] developed the Wide-Context Transformer, which integrates a context branch and Context Transformer to infuse wide-area context into local CNN predictions, significantly enhancing the segmentation of UHR-RS images. These global–local approaches enhance context modeling, but aggressive downsampling in multi-scale designs can overlook subtle yet important features, leading to accuracy loss, especially for UHR-RS imagery.
To reduce model complexity, researchers have introduced lightweight networks by utilizing simplified backbones, lightweight convolutions, and efficient attention mechanisms [22,23,24]. For instance, UNetFormer [25] integrates the global contextual capabilities of Transformers with the efficient multi-scale feature extraction inherent in U-Net architectures, significantly reducing parameters and computational load while maintaining high segmentation accuracy suitable for real-time tasks. Similarly, LightFormer [26] simplifies Transformer attention mechanisms and adopts a dual-path structure, effectively balancing local detail capture and global semantic context, thus substantially decreasing computational demands and ensuring real-time inference capabilities. Additionally, RTMamba [27] integrates efficient Mamba blocks into the U-Net framework to adaptively capture multi-scale contextual features, significantly reducing computational complexity for UHR-RS imagery. Still, lightweight models often sacrifice accuracy, especially in tasks requiring fine detail or subtle semantics, and their limited capacity can hinder generalization to complex or diverse datasets.
Another emerging direction is representation sparsification. Rather than treating all pixels or tokens equally, this method focuses computational effort on the most informative regions. Recent work has shown that significant speedups are possible by dynamically selecting or merging tokens in Vision Transformers while retaining segmentation quality [8,28,29]. For example, BPT [10] introduced a Boundary-Enhanced Patch-Merging Transformer that allocates higher token resolution to important areas and merges tokens in less important regions, thereby capturing both global and local information with minimal added cost. Likewise, spatial attention mechanisms have been employed to filter out irrelevant features. SAINet [30] proposed a spatially adaptive interaction network that first performs a coarse segmentation and identifies difficult or salient regions for further refinement, enabling the model to concentrate on pertinent areas and features. However, token sparsification in dense prediction remains challenging, as naive pruning can harm segmentation accuracy without reconstruction, and full-resolution outputs still require refined decoding to preserve detail.
Despite these advancements, most existing segmentation models still struggle to scale to arbitrarily large imagery while maintaining global context. Current approaches inherently face fundamental trade-offs between spatial fragmentation, limited representational flexibility, insufficient detail preservation, and difficulties in reconstructing coherent global outputs. In practice, segmenting nationwide satellite images or very large mosaics often requires bespoke tiling strategies and offline merging, which can lead to inconsistencies and missed contextual cues. There is a clear need for a scalable segmentation framework that can be distributed across multiple devices and still produce a seamless, coherent segmentation map as if the image were processed holistically.
In this work, we propose RingFormer-Seg, a novel Vision Transformer framework that enables scalable and context-preserving semantic segmentation for UHR-RS images. RingFormer-Seg divides large input images into subregions processed in parallel across multiple devices while exchanging contextual information through a lightweight ring-style communication topology. To balance efficiency and accuracy, it employs a Saliency-Aware Token Filter (STF) to select the most informative tokens, an Efficient Local Context Module (ELCM) to enhance intra-region features via memory-efficient attention, and a Cross-Device Context Routing (CDCR) mechanism to propagate context globally with minimal synchronization overhead. This design enables long-range dependency modeling across the entire image without requiring full-resolution inputs on a single device and preserves local detail through residual integration.
We evaluate RingFormer-Seg on three challenging UHR-RS benchmarks: DeepGlobe, Wuhan, and Guangdong, with image resolutions ranging from 2048 × 2048 to 8192 × 8192 pixels. The results demonstrate that RingFormer-Seg consistently achieves state-of-the-art (SOTA) segmentation accuracy across benchmarks specifically designed for UHR-RS images, striking a competitive balance between accuracy and computational efficiency and generalizing robustly across datasets with diverse content and varying image sizes.
In summary, the main contributions of this work are as follows:
  • We propose RingFormer-Seg, a distributed Vision Transformer framework for UHR-RS image segmentation, which partitions large-scale inputs across multiple devices to enable scalable, context-aware processing without memory overflow.
  • We design three complementary modules to enhance efficiency and scalability: the STF for selective token pruning, the ELCM for memory-efficient intra-region modeling, and the CDCR mechanism for lightweight global context exchange across devices.
  • We conduct extensive experiments on three challenging UHR-RS benchmarks, demonstrating that RingFormer-Seg consistently outperforms SOTA CNN- and transformer-based methods in segmentation accuracy while maintaining competitive computational efficiency, offering a practical and scalable solution for large-scale land cover mapping and Earth observation applications.

2. Methodology

2.1. The Overall Architecture of RingFormer-Seg

The overall framework of our proposed RingFormer-Seg is illustrated in Figure 1. RingFormer-Seg divides large input images into spatial subregions, each processed independently on a separate device, and enables long-range dependency modeling through a lightweight ring-style communication topology. Specifically, each device independently tokenizes its assigned image tile through a patch embedding module, transforming spatial patches into token embeddings that contain local semantic information. To efficiently represent each subregion, we introduce an STF, which selects the top-k% most informative tokens based on saliency scores derived from local semantic representations, reducing redundancy before context exchange (see Section 2.2 for the detailed implementation). The remaining less-salient tokens are aggregated via average pooling into compact representations (“Pooled Remaining Tokens”), thus preserving potentially valuable but less salient information.
These selected tokens are further refined using an ELCM, which consists of multiple Transformer blocks, each composed of a memory-efficient attention mechanism (Flash Multi-Head Attention [31]) and a feed-forward neural network (FFN). The ELCM enhances intra-region representations, allowing for higher spatial resolution processing within limited device memory (detailed in Section 2.3).
To propagate global context, RingFormer-Seg employs a novel CDCR mechanism, which iteratively circulates the refined tokens across devices in ring-wise order [32], progressively aggregating contextual cues from adjacent subregions. This inter-GPU feature exchange facilitates global semantic coherence and enables each subregion to integrate information from the entire image with minimal communication overhead compared to conventional synchronization methods (fully described in Section 2.4). Unlike global all-reduce schemes, CDCR requires no barriers: each GPU exchanges only the key and value matrices of the retained tokens with its two neighbors; for a ring of N devices, the routing completes in T = N − 1 hops, while queries remain local. This design avoids global synchronization stalls and bounds per-hop communication to a lightweight payload. Furthermore, we implement compute–communication overlap using dual CUDA streams and double-buffered token exchanges, allowing most link latency to be hidden behind local ELCM computation. These design choices ensure that CDCR maintains scalability across devices while keeping synchronization costs and latency overhead negligible in practice (see Section 4.1 for quantitative results). Meanwhile, tokens initially excluded by the STF and aggregated by average pooling are not discarded; instead, these pooled features are subsequently reintegrated into the SETR-Basic segmentation decoder via residual connections. This strategy ensures comprehensive preservation of both global context and fine-grained local details necessary for accurate segmentation. Finally, each device independently generates high-resolution segmentation maps using parallel decoders, and the outputs are seamlessly assembled by stitching the subregion predictions, which inherently ensures global consistency due to the CDCR-enabled context sharing.
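To make this data flow concrete, the following PyTorch-style sketch outlines the per-device forward pass. The callables patch_embed, stf, elcm, cdcr, and decoder are hypothetical placeholders for the components detailed in Sections 2.2, 2.3 and 2.4; the sketch illustrates the ordering of operations under these assumptions rather than the authors' implementation.

def ringformer_forward(tile, patch_embed, stf, elcm, cdcr, decoder, keep_ratio=0.7):
    # Schematic per-device pipeline: tokenize -> filter -> refine -> route -> decode.
    tokens = patch_embed(tile)                    # (L, d) token embeddings for this subregion
    salient, pooled = stf(tokens, keep_ratio)     # top-k% salient tokens + pooled remainder
    refined = elcm(salient)                       # memory-efficient intra-region attention
    routed = cdcr(refined)                        # ring-wise cross-device context exchange
    return decoder(routed, residual=pooled)       # decoder reintegrates pooled tokens residually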
To train the proposed RingFormer-Seg, we employ a hybrid segmentation loss function composed of a pixel-wise cross-entropy (CE) loss and a Dice loss. The CE loss facilitates accurate pixel-level semantic classification, while the Dice loss explicitly enhances boundary delineation, addressing the class imbalance inherent in remote sensing imagery. Formally, the overall segmentation loss $\mathcal{L}_{seg}$ is defined as:

$$\mathcal{L}_{seg} = \mathcal{L}_{CE} + \mathcal{L}_{Dice},$$

where $\mathcal{L}_{CE}$ is computed as:

$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c}),$$

and $\mathcal{L}_{Dice}$ is defined as:

$$\mathcal{L}_{Dice} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{2 \sum_{i=1}^{N} p_{i,c}\, y_{i,c} + \epsilon}{\sum_{i=1}^{N} \left( p_{i,c}^{2} + y_{i,c}^{2} \right) + \epsilon}.$$

Here, $p_{i,c}$ and $y_{i,c}$ represent the predicted probability and ground truth for class $c$ at pixel $i$, respectively; $N$ denotes the total number of pixels; $C$ is the number of semantic classes; and $\epsilon$ is a small smoothing constant to avoid numerical instability. The combined loss function effectively ensures comprehensive optimization of RingFormer-Seg, balancing accurate pixel-wise predictions and precise boundary localization, thereby significantly improving segmentation performance on UHR-RS datasets.
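For reference, a minimal PyTorch rendering of this hybrid loss is given below, assuming logits of shape (B, C, H, W) and integer-valued label maps of shape (B, H, W); it follows the CE and Dice definitions above and is not tied to any particular decoder.

import torch
import torch.nn.functional as F

def hybrid_seg_loss(logits, target, eps=1e-6):
    # logits: (B, C, H, W) raw class scores; target: (B, H, W) integer class labels
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)                                      # p_{i,c}
    onehot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()    # y_{i,c}
    inter = (probs * onehot).sum(dim=(0, 2, 3))                               # per-class sum_i p*y
    denom = (probs.pow(2) + onehot.pow(2)).sum(dim=(0, 2, 3))                 # per-class sum_i (p^2 + y^2)
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()                 # averaged over C classes
    return ce + dice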

2.2. Saliency-Aware Token Filter

To efficiently represent spatial subregions and reduce token redundancy for effective cross-region communication, we introduce the STF (Figure 2). STF adaptively selects tokens based on their saliency scores derived from local semantic representations, ensuring informative features are prioritized while preserving secondary yet valuable information.
Given dense patch embeddings extracted from an input subregion, STF evaluates the semantic importance of each token embedding $t_i$ via a saliency scoring function inspired by feature-based scoring methods. Specifically, we define the token-level saliency score $\mathrm{score}_{sal}(t_i)$ as follows:

$$\mathrm{score}_{sal}(t_i) = \mathrm{MSE}\!\left( f_{\text{ShuffleNetV2}}(t_i),\ f_{\text{ShuffleNetV2}}(\mathrm{blur}(t_i, r)) \right),$$

where $f_{\text{ShuffleNetV2}}(\cdot)$ denotes the semantic feature extractor, specifically implemented by a ShuffleNetV2 backbone, and $\mathrm{blur}(t_i, r)$ refers to downsampling the patch $p_i$, which corresponds to token $t_i$, by a factor of $r$ and then upsampling it back to the original resolution. This produces a blurred version of the patch, which is subsequently used for semantic difference scoring. The mean squared error ($\mathrm{MSE}$) quantifies the semantic information loss caused by the resolution reduction, with higher scores indicating higher saliency.
Using these scores, STF selects the top-$k\%$ salient tokens ($k = 70\%$, empirically selected for optimal performance) and pools the remaining less-salient tokens into a compact representation:

$$\left( T_{salient},\ T_{remain} \right) = \Phi_{\text{top-}k}\!\left( \{ \mathrm{score}_{sal}(t_i) \}_{i=1}^{L},\ 70\% \right),$$

$$t_{pooled} = \frac{1}{|T_{remain}|} \sum_{t_j \in T_{remain}} t_j,$$

where $L$ represents the total token count, and $\Phi_{\text{top-}k}$ denotes the top-$k\%$ token selection operation.
The salient tokens T s a l i e n t proceed to subsequent token refinement and global context integration stages, while the compact token t p o o l e d is preserved through residual connections for later reintegration into the decoder. STF thus ensures an optimal balance between computational efficiency and representational completeness, enabling RingFormer-Seg to leverage both detailed local semantics and comprehensive global context.
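A simplified sketch of the STF selection step is shown below. It assumes access to the image patches alongside their token embeddings, uses a generic feat_extractor callable (standing in for the ShuffleNetV2 backbone) that returns one flat feature vector per patch, and approximates the blur by bilinear down- and upsampling as described above; names and defaults are illustrative.

import torch
import torch.nn.functional as F

def saliency_token_filter(tokens, patches, feat_extractor, keep_ratio=0.7, blur_factor=4):
    # tokens:  (L, d) patch embeddings; patches: (L, 3, P, P) image patches matching the tokens
    # feat_extractor: lightweight backbone returning (L, f) feature vectors
    small = F.interpolate(patches, scale_factor=1.0 / blur_factor, mode="bilinear")
    blurred = F.interpolate(small, size=patches.shape[-2:], mode="bilinear")
    score = ((feat_extractor(patches) - feat_extractor(blurred)) ** 2).mean(dim=1)  # per-patch MSE
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = score.topk(k).indices                              # indices of the top-k% salient tokens
    mask = torch.zeros(tokens.shape[0], dtype=torch.bool, device=tokens.device)
    mask[keep] = True
    salient = tokens[mask]
    remain = tokens[~mask]
    pooled = remain.mean(dim=0, keepdim=True) if remain.numel() else tokens.new_zeros(1, tokens.shape[1])
    return salient, pooled, keep                              # keep is retained for later reintegration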

2.3. Efficient Local Context Module

To enhance the modeling of intra-region token interactions efficiently, we propose an ELCM, consisting of multiple lightweight Transformer blocks. Each Transformer block incorporates Flash Multi-Head Attention (FlashAttention) to significantly reduce memory usage and computational overhead, thus enabling high-resolution feature processing within limited device memory (Figure 3).
Formally, given the set of salient tokens $T_{salient}$ from the STF module, each token embedding $t_i$ is processed through a Transformer block comprising two main components: Flash Multi-Head Attention and a feed-forward network (FFN). The query ($Q$), key ($K$), and value ($V$) matrices are obtained via linear projections:

$$Q = T_{salient} W_Q, \quad K = T_{salient} W_K, \quad V = T_{salient} W_V.$$
Specifically, the detailed procedure of Flash Multi-Head Attention within ELCM is summarized in Algorithm 1, which efficiently computes attention outputs by decomposing the operation into two stages:
Stage 1: Block-wise Partial Attention. The tokens are partitioned into blocks to perform local computations, minimizing global memory access and enabling parallel computation on GPU streaming multiprocessors (SMs). For each query block $Q_i$ and key block $K_j$, the attention scores and partial outputs are computed as:

$$S = Q_i K_j^{\top}, \qquad (O_i, m_i, \ell_i) = \mathrm{stable\_softmax\_accumulate}(S, V_j),$$

where $S$ denotes the scaled dot-product attention scores, and the function $\mathrm{stable\_softmax\_accumulate}$ accumulates the partial outputs $O_i$, the maximum logit values $m_i$, and the log-sum-exp values $\ell_i$ for numerical stability.

Stage 2: Global Reduction and Normalization. The partial outputs from each block are globally aggregated to produce normalized attention outputs:

$$M = \max_j m_i^{(j)}, \qquad O_{tot} = \sum_j e^{\,m_i^{(j)} - M}\, O_i^{(j)}, \qquad \ell_{tot} = \sum_j e^{\,m_i^{(j)} - M}\, \ell_i^{(j)},$$

and the final attention output $A_i$ for block $i$ is obtained via:

$$A_i = \frac{O_{tot}}{\ell_{tot}}.$$
After FlashAttention, the tokens undergo layer normalization and pass through the feed-forward network (FFN) for nonlinear refinement:
$$t_i' = \mathrm{LayerNorm}(t_i + A_i), \qquad t_i'' = \mathrm{LayerNorm}\!\left(t_i' + \mathrm{MLP}(t_i')\right),$$

where $\mathrm{MLP}(\cdot)$ denotes a two-layer fully connected network with GELU activations.
By leveraging FlashAttention within Transformer blocks, ELCM effectively captures complex local dependencies among salient tokens, achieving a balance between representational power and computational efficiency. This design significantly improves the capability of processing higher-resolution token embeddings within constrained GPU memory budgets.
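The block structure can be summarized by the following PyTorch sketch, which relies on torch.nn.functional.scaled_dot_product_attention to dispatch to a FlashAttention-style kernel when the hardware and tensor layout allow it. The dimensions, head count, and post-norm arrangement follow the equations above, but the class is an illustrative sketch rather than the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ELCMBlock(nn.Module):
    # One ELCM Transformer block: multi-head attention + FFN, each with residual and LayerNorm.
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                                     # x: (B, L, dim) salient tokens
        B, L, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, L, head_dim) for scaled_dot_product_attention
        q, k, v = (t.view(B, L, self.heads, D // self.heads).transpose(1, 2) for t in (q, k, v))
        a = F.scaled_dot_product_attention(q, k, v)           # flash / memory-efficient kernel
        a = self.proj(a.transpose(1, 2).reshape(B, L, D))
        x = self.norm1(x + a)                                 # post-norm residual, as in the equations
        return self.norm2(x + self.mlp(x))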
Algorithm 1 FlashAttention in ELCM: Block-wise Accumulation and Global Reduction

for all query blocks $Q_i$ in parallel do
    // Stage 1: Block-wise Partial Attention
    load $Q_i$
    initialize $O_i \leftarrow 0$, $m_i \leftarrow -\infty$, $\ell_i \leftarrow 0$
    for all key blocks $j$ do
        load $K_j$, $V_j$
        $S \leftarrow Q_i K_j^{\top}$
        $(O_i, m_i, \ell_i) \leftarrow \mathrm{stable\_softmax\_accumulate}(S, V_j)$
    end for
    write back $(O_i, \ell_i, m_i)$
end for
for all query blocks $i$ do
    // Stage 2: Global Reduction & Normalize
    load $\{ O_i^{(j)}, \ell_i^{(j)}, m_i^{(j)} \}_j$
    $M \leftarrow \max_j m_i^{(j)}$
    $O_{tot} \leftarrow \sum_j e^{\,m_i^{(j)} - M}\, O_i^{(j)}$
    $\ell_{tot} \leftarrow \sum_j e^{\,m_i^{(j)} - M}\, \ell_i^{(j)}$
    $A_i \leftarrow O_{tot} / \ell_{tot}$
    output $A_i$
end for
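For clarity, a plain-PyTorch, single-head reference of the two-stage computation in Algorithm 1 is given below. The block size is illustrative, and the loop-based version trades the kernel-level tiling of FlashAttention for readability; it computes per-key-block partials with running max and normalizer statistics and then reduces them globally, which is numerically equivalent to standard softmax attention.

import torch

def blockwise_attention(Q, K, V, block=256):
    # Q: (L_q, d), K/V: (L_k, d); single head, scaling by sqrt(d) assumed
    d = Q.shape[-1]
    partial_O, partial_m, partial_l = [], [], []
    for j in range(0, K.shape[0], block):                 # Stage 1: block-wise partials
        S = (Q @ K[j:j + block].T) / d ** 0.5             # scores against one key block
        m = S.max(dim=-1).values                          # block-local max logits m_i
        P = torch.exp(S - m[:, None])
        partial_O.append(P @ V[j:j + block])              # unnormalized partial output O_i
        partial_m.append(m)
        partial_l.append(P.sum(dim=-1))                   # block-local normalizer l_i
    m = torch.stack(partial_m)                            # Stage 2: global reduction
    M = m.max(dim=0).values
    w = torch.exp(m - M[None, :])                         # rescale every block to the global max
    O_tot = (w[..., None] * torch.stack(partial_O)).sum(dim=0)
    l_tot = (w * torch.stack(partial_l)).sum(dim=0)
    return O_tot / l_tot[:, None]                         # A_i = O_tot / l_tot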

2.4. Cross-Device Context Routing

To efficiently propagate global context across spatial subregions, we propose a novel CDCR mechanism (Figure 4). Unlike traditional global synchronization strategies, CDCR iteratively circulates refined salient tokens in a ring-wise communication topology among distributed GPUs, progressively aggregating context from neighboring subregions with minimal communication overhead.
Formally, given $N$ GPUs arranged logically in a ring structure, let $Q_n$ represent the query tokens on GPU $n$ and $KV_n$ denote the corresponding key–value token embeddings from its subregion. The ring-wise token exchange iteratively updates each GPU's query tokens by attending to the key–value tokens received from the neighboring GPU. Specifically, at iteration step $t$, GPU $n$ computes its local attention output $Z_n^{(t)}$ via the multi-head attention mechanism as follows:

$$Z_n^{(t)} = \mathrm{Attention}\!\left( Q_n^{(t)},\ KV_{\mathrm{mod}(n-1,\,N)}^{(t)},\ KV_{\mathrm{mod}(n-1,\,N)}^{(t)} \right),$$

where the attention operation is defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} \right) V,$$

and $d$ denotes the dimensionality of the token embeddings.

Subsequently, GPU $n$ updates its token embeddings by integrating the attention outputs through residual connections:

$$Q_n^{(t+1)} = Q_n^{(t)} + Z_n^{(t)}.$$
This iterative procedure continues for a predefined number of steps T, progressively aggregating global context information into local embeddings. After T steps, each GPU effectively incorporates semantic information from all other GPUs, yielding globally coherent feature representations.
The CDCR mechanism achieves global semantic coherence while significantly reducing inter-device communication overhead, as each GPU only interacts with two immediate neighbors at each step. Moreover, this decentralized approach scales efficiently to large-scale GPU clusters interconnected via high-speed InfiniBand networks, making it highly suitable for UHR-RS image segmentation tasks.
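A minimal sketch of one possible realization of this ring exchange with torch.distributed is shown below. It assumes one process per GPU, an initialized process group, and an equal number of retained tokens on every device, and it uses single-head scaled-dot-product attention in place of the multi-head attention of the equations above; it is illustrative rather than the authors' implementation.

import torch
import torch.distributed as dist
import torch.nn.functional as F

def ring_context_routing(q, kv, steps=None):
    # q:  (L_q, d) local query tokens (stay on this device throughout)
    # kv: (L_kv, d) local retained tokens, used here as both keys and values
    rank, world = dist.get_rank(), dist.get_world_size()
    steps = (world - 1) if steps is None else steps          # T = N - 1 hops
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world
    for _ in range(steps):
        recv_buf = torch.empty_like(kv)
        # point-to-point exchange with the two ring neighbours (no global barrier)
        reqs = dist.batch_isend_irecv([
            dist.P2POp(dist.isend, kv.contiguous(), send_to),
            dist.P2POp(dist.irecv, recv_buf, recv_from),
        ])
        for r in reqs:
            r.wait()
        kv = recv_buf
        # attend to the neighbour's tokens and fold the result in residually
        z = F.scaled_dot_product_attention(
            q.unsqueeze(0), kv.unsqueeze(0), kv.unsqueeze(0)).squeeze(0)
        q = q + z
    return q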

3. Experiments

3.1. Datasets

To evaluate the proposed method, we conduct extensive experiments on three UHR-RS segmentation datasets that differ in geographic context, spatial scale, and annotation granularity: DeepGlobe [33], Wuhan [34], and Guangdong (a dataset specifically constructed for this study, with no existing public reference available). These datasets collectively enable a comprehensive assessment of both the effectiveness and generalizability of our approach.
DeepGlobe. The DeepGlobe Land Cover Classification dataset comprises 803 optical satellite images, each with a spatial resolution of approximately 0.5 m and a fixed size of 2448 × 2448 pixels. It provides pixel-level annotations for seven land cover categories. We follow standard practice by dividing the dataset into training, validation, and test subsets containing 455, 207, and 142 images, respectively.
Wuhan. The Wuhan dataset was acquired from the Wuhan-1 satellite, developed by Wuhan University and equipped with a high-resolution panchromatic and multispectral camera. It contains 20 scenes of optical imagery covering the urban areas of Wuhan city, with each image sized approximately 5000 × 6000 pixels and a spatial resolution of 0.5 m. The dataset includes dense, pixel-wise annotations across eight land cover categories (including background), supporting detailed semantic segmentation. We allocate 12 images for training, 4 for validation, and 4 for testing.
Guangdong. The Guangdong dataset was constructed using Gaofen-series satellite imagery and covers representative regions in Guangdong province. It comprises 15 UHR RGB images, each having a spatial resolution between 0.5 and 2 m and dimensions of 8192 × 8192 pixels. Annotations span 11 semantic categories, including both urban infrastructure and natural surface types. We assigned nine images to the training set, three to the validation set, and three to the test set.

3.2. Implementation Details

The experiments in this study were conducted using the deep learning framework PyTorch (version 2.4.1; PyTorch Foundation, San Francisco, CA, USA), accelerated by CUDA (version 12.4; NVIDIA Corporation, Santa Clara, CA, USA) on GPU hardware. All model training and inference procedures were performed on GPU servers equipped with NVIDIA A100 GPUs (40 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA). To ensure stability and efficiency during training, the model parameters were randomly initialized following a Gaussian distribution with a mean of 0 and a standard deviation of 0.02. For model optimization, the AdamW optimizer was employed with an initial learning rate of $1 \times 10^{-4}$ and a weight decay coefficient of 0.01. The learning rate schedule was implemented using the cosine annealing strategy over a total of 100 epochs. The batch size for all experiments was consistently set to 1. Additionally, the default top-k value in the proposed STF module was set to 70%.
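The optimization setup described above can be expressed compactly as follows; configure_training is a hypothetical helper name, and the Gaussian initialization is applied only to linear and convolutional layers for brevity.

import torch

def configure_training(model, epochs=100):
    # AdamW (lr 1e-4, weight decay 0.01) with cosine annealing over 100 epochs, as stated above
    for m in model.modules():
        if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
            torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)   # N(0, 0.02) initialization
            if m.bias is not None:
                torch.nn.init.zeros_(m.bias)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler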
For comparative experiments, input images from the DeepGlobe dataset were randomly cropped to 2048 × 2048 pixels, images from the Wuhan dataset were randomly cropped to 4096 × 4096 pixels, and images from the Guangdong dataset remained at their original size of 8192 × 8192 pixels to accommodate GPU memory constraints while ensuring consistency across methods. All baselines were re-trained in our environment under this unified protocol to guarantee fairness. We fixed the evaluation resolution for each dataset as above while preserving each method’s canonical operating regime during training: patch-wise methods (e.g., UNet and Swin) were trained on 1024 × 1024 tiles with stitching, global–local methods (e.g., GLNet and WiCoNet) followed their default global downsampling ratios, and full-image/distributed methods (e.g., RingFormer-Seg) were trained directly at the dataset’s native resolution. In OOM cases (e.g., GLNet on Guangdong), we avoided imposing further downsampling since preliminary trials showed significant accuracy degradation; instead, we retained each method’s original tiling or downsampling strategy to ensure optimal performance.
In the training phase, to enhance the generalization capability of the models on RS data, several standard data augmentation techniques suitable for RS imagery were utilized, including random horizontal flips, random vertical flips, random rotations (90°, 180°, and 270°), and random brightness and contrast adjustments. During the inference phase, no data augmentation strategies were applied, and all inferences were conducted using single-scale input images. Furthermore, the hardware and software configurations employed during inference were identical to those used during training to guarantee fairness and reproducibility of the evaluation process. Additionally, to maintain consistency during inference, the default inference mode provided by PyTorch was uniformly applied across all experiments, and gradient computations were disabled to reduce GPU memory overhead and improve inference efficiency.
In addition, we report efficiency and stability metrics for completeness. Inference time was measured as the average per-image runtime (batch size = 1, FP16) over 500 iterations on an A100 GPU, and parameter counts were obtained by summing all trainable weights in both the encoder and decoder. Training stability was achieved by employing normalization layers (LayerNorm and BatchNorm) to mitigate gradient vanishing/explosion and was further verified by smooth convergence across multiple runs.

3.3. Evaluation Metrics

The accuracy metric used in our experiments is the mean Intersection-over-Union (mIoU), a standard evaluation metric for semantic segmentation tasks. The mIoU provides a balanced evaluation across all semantic classes by computing the average Intersection over Union (IoU) per class, defined as:
$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i},$$

where $TP_i$, $FP_i$, and $FN_i$ represent the true positives, false positives, and false negatives for class $i$, respectively, and $N$ is the total number of semantic classes.
Specifically, for class i:
$$TP_i = \left| \{ x \mid x \in \text{class } i,\ \hat{x} = i \} \right|,$$

$$FP_i = \left| \{ x \mid x \notin \text{class } i,\ \hat{x} = i \} \right|,$$

$$FN_i = \left| \{ x \mid x \in \text{class } i,\ \hat{x} \neq i \} \right|,$$

where $x$ denotes a pixel, and $\hat{x}$ is the predicted class label for pixel $x$.
To assess computational efficiency, we report inference speed measured in frames per second (FPS), peak memory usage in megabytes (MB), and the number of floating-point operations (FLOPs), which reflect the computational complexity of the model during inference. For scalability evaluation, we additionally report speedup and parallel efficiency, where speedup is defined as the ratio of the single-GPU training time to the multi-GPU training time, and efficiency is defined as the speedup divided by the number of GPUs.
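For completeness, the reported accuracy and scalability metrics can be computed as in the following sketch, where the confusion matrix is assumed to be accumulated over the test set and the speedup/efficiency helper mirrors the definitions above.

import torch

def mean_iou(conf):
    # conf[i, j]: number of pixels with ground-truth class i predicted as class j
    tp = conf.diag().float()
    fp = conf.sum(dim=0).float() - tp      # predicted as class i but belonging to another class
    fn = conf.sum(dim=1).float() - tp      # belonging to class i but predicted as another class
    iou = tp / (tp + fp + fn).clamp(min=1e-12)
    return iou.mean().item()

def speedup_and_efficiency(t_single_gpu, t_multi_gpu, n_gpus):
    # e.g. 24.76 h / 7.20 h = 3.44x speedup and 3.44 / 4 = 86% efficiency on four GPUs (see Table 6)
    speedup = t_single_gpu / t_multi_gpu
    return speedup, speedup / n_gpus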

3.4. Comparative Methods

To evaluate the effectiveness of our RingFormer-Seg, we select representative comparative methods from four categories, as summarized in Table 1:
1. Patch-wise processing for local inference, which segments large images into small tiles independently, represented by UNet [35] and Swin Transformer [36];
2. Multi-scale model architectures for global learning, which integrate local and global contexts across multiple scales, represented by GLNet [20] and WiCoNet [21];
3. Lightweight networks for reduced computational cost, which balance accuracy and efficiency through simplified architectures, represented by LWGANet [37] and PyramidMamba [24];
4. Representation sparsification for compact model representation, which selectively allocates computational resources to informative regions, represented by AMR [28] and BPT [10].
These methods collectively provide comprehensive baselines for assessing segmentation accuracy and computational efficiency.

3.5. Experiment Results

We conducted comparative experiments of the proposed RingFormer-Seg against several SOTA methods on three benchmark datasets for UHR-RS imagery: DeepGlobe, Wuhan, and Guangdong. The experimental results are summarized in Table 1. The semantic segmentation visualization results for the three datasets are shown in Figure 5, Figure 6, and Figure 7, respectively.
For the DeepGlobe dataset (cropped 2048 × 2048 pixels), RingFormer-Seg achieved an mIoU of 72.04%, significantly surpassing patch-wise methods, multi-scale global learning methods, lightweight networks, and representation sparsification approaches. Although RingFormer-Seg has a relatively high parameter count (282.4 MB), it demonstrates excellent efficiency and accuracy balance, achieving a throughput of 8.56 img/s and a computational cost of 710.79 GFLOPs, thanks to the efficient FlashAttention mechanism integrated within the ELCM. Particularly, RingFormer-Seg significantly outperforms the multi-scale global learning methods GLNet (71.83%) and WiCoNet (71.44%), indicating that the proposed STF module and CDCR mechanism effectively avoid the loss of fine-grained semantic information commonly caused by aggressive downsampling in multi-scale methods. Moreover, compared to representation sparsification methods such as AMR (69.97%) and BPT (68.19%), the STF strategy adopted by RingFormer-Seg is more precise and efficient, effectively preserving local details critical for segmentation tasks while significantly reducing redundant computations.
In experiments on the larger Wuhan dataset (cropped 4096 × 4096 pixels), most existing methods, including lightweight networks, representation sparsification methods, and WiCoNet, failed to directly handle images on a single GPU due to out-of-memory (OOM) issues. However, RingFormer-Seg effectively supports the distributed processing of large-sized images via its distributed CDCR mechanism, achieving an mIoU of 58.16%, significantly outperforming methods such as UNet (53.98%), Swin Transformer (54.17%), and GLNet (57.08%). These results strongly demonstrate RingFormer-Seg’s ability to effectively integrate cross-device global contextual information during distributed processing, thereby overcoming spatial fragmentation issues inherent in patch-wise methods and achieving higher semantic segmentation accuracy and spatial consistency without sacrificing resolution or fine details as GLNet does.
In the experiments on the ultra-large Guangdong dataset (8192 × 8192 pixels), none of the competing methods other than patch-wise methods could directly process images of this magnitude, highlighting a severe scalability issue in existing global-processing approaches. By leveraging its inherent distributed design, RingFormer-Seg can flexibly scale processing capacity by increasing the number of GPUs, thereby smoothly segmenting images of arbitrary sizes. Consequently, it achieved the highest mIoU of 77.05%, significantly superior to the patch-wise methods UNet (72.68%) and Swin Transformer (73.87%). This advantage is mainly attributed to RingFormer-Seg’s innovative ring-shaped CDCR mechanism, which efficiently propagates global contextual information, ensuring global spatial consistency and semantic integrity in distributed processing outcomes. Furthermore, the synergistic action of the proposed STF and ELCM modules effectively filters and concentrates computational resources on semantically salient regions, reducing unnecessary computation overhead. This achieves an excellent balance between segmentation accuracy and input image size, fully demonstrating the scalability and robustness of the proposed model.
Although RingFormer-Seg has the largest parameter count among the compared methods (282.4 MB on DeepGlobe and 426.4 MB on Wuhan/Guangdong), the footprint remains small relative to the memory required for activations at UHR resolutions, and it can be further reduced by half-precision storage at inference. Importantly, the design avoids aggressive tiling or heavy downsampling, thereby maintaining global context and enabling a strong balance between efficiency and accuracy at native region resolution. In addition, the model offers deployment-friendly controls: the STF top-k ratio allows users to trade a small amount of accuracy for substantial reductions in FLOPs and latency without changing the parameter count, which supports practical adaptation to diverse hardware budgets.

3.6. Ablation Study

3.6.1. Effectiveness of RingFormer-Seg Architecture

Table 2 demonstrates the impact of the key components (STF, ELCM, and CDCR) within RingFormer-Seg on model performance. Using ViT-Base as the backbone, the experimental results indicate that progressively integrating these components effectively enhances the model’s performance across the three UHR-RS benchmark datasets. Notably, experiments utilizing only Full Tokens (FT) and Spatial Attention (SA) without the CDCR module encountered single-GPU out-of-memory (OOM) errors on the Wuhan and Guangdong datasets. This highlights that standard spatial attention alone cannot sufficiently address the high memory demands of UHR-RS imagery. After introducing the CDCR module, memory pressure on a single GPU was effectively alleviated through distributed context propagation, enabling successful model execution and achieving mIoU accuracies of 56.09% (Wuhan) and 74.19% (Guangdong). These results confirm that CDCR effectively facilitates global context fusion across multiple computational devices.
With the subsequent addition of the ELCM module, accuracy significantly improved on all three datasets, especially increasing the mIoU to 71.94% on the DeepGlobe dataset. This improvement indicates that ELCM, utilizing the efficient local attention mechanism FlashAttention, better captures local feature interactions, thereby enhancing the model’s ability to represent local semantic information in RS images. We also evaluated a configuration combining STF and ELCM without CDCR. On DeepGlobe, this variant reaches 72.03% mIoU, which is comparable to the performance of the configuration with CDCR and ELCM (72.04%). However, it still leads to OOM errors on Wuhan and Guangdong, further confirming that CDCR is indispensable for scaling to UHR inputs. Finally, integrating the STF module further boosted the mIoU accuracy to 72.04%, 58.16%, and 77.05% for the DeepGlobe, Wuhan, and Guangdong datasets, respectively. The STF module adaptively selects highly salient tokens, effectively reducing redundancy while retaining the most semantically informative features, thus further improving the model’s efficiency and semantic accuracy.

3.6.2. Impact of Selected Token Proportion in STF

Table 3 presents the influence of different top-k ratios in the Saliency-Aware Token Filter (STF) module on model performance and computational efficiency. As observed, decreasing the retained token ratio significantly improves inference throughput and reduces computational complexity (FLOPs), but it leads to a gradual decline in mIoU accuracy. When retaining all tokens (100%), the model achieves the highest accuracy (73.18%), but inference efficiency is relatively low (5.83 img/s). At a reduced token retention ratio of 70%, the model maintains relatively high accuracy (72.04%) while achieving an inference throughput of 8.56 img/s, representing an approximately 30% reduction in computational cost. This balance indicates that the 70% selection ratio optimally trades off between accuracy and efficiency. Further decreasing the token retention ratio below 55% results in a notable decline in mIoU accuracy (from 69.14% to 67.69%), suggesting that excessive token reduction leads to the loss of critical semantic information. Therefore, considering the trade-off between accuracy and efficiency, we selected a retention ratio of 70% as the default setting, ensuring effective capture of essential semantic information while significantly reducing computational overhead.

3.6.3. Comparison of Efficient Attention Mechanisms in Vision Transformers

Table 4 compares different Transformer-based attention mechanisms. Specifically, we selected widely used mechanisms, including Swin Attention, Focal Attention, and Memory-Efficient Attention, for benchmarking purposes. Experimental results indicate that our proposed ELCM consistently outperforms these alternatives across all metrics. In particular, ELCM achieves an mIoU of 72.04% on the DeepGlobe dataset, clearly surpassing Swin Attention (71.16%), Focal Attention (70.94%), and Memory-Efficient Attention (70.64%). Additionally, ELCM exhibits superior computational efficiency, achieving an inference speed of 6.31 img/s, significantly faster than Focal Attention (2.12 img/s), Memory-Efficient Attention (1.60 img/s), and Swin Attention (4.63 img/s). Furthermore, ELCM has the lowest computational overhead (986.90 GFLOPs), approximately one-fifth that of Memory-Efficient Attention. This advantage primarily arises from the FlashAttention mechanism employed by the ELCM module, which utilizes block-wise local attention computations, substantially reducing memory usage and GPU external memory access. This enables the model to efficiently process high-resolution feature representations. These results demonstrate that ELCM effectively balances high segmentation accuracy and computational efficiency, making it highly suitable for large-scale, fine-grained segmentation tasks in RS images.

4. Discussion

4.1. Scalability Analysis with Varying Image Sizes and Device Counts

In the scalability evaluation under varying image resolutions and GPU counts (Table 5), we found that RingFormer-Seg can scale almost linearly to larger UHR-RS images and across multiple GPUs, while consistently maintaining high accuracy. When processing images on a single GPU with resolutions increased from 1024 × 1024 to 4096 × 4096, the number of tokens grows quadratically with the image side length (4096, 16,384, and 65,536 tokens, respectively). However, benefiting from the dynamic filtering and aggregation mechanism of the STF, redundant tokens are mapped into compact representations, effectively managing GPU memory usage at a controlled level (below 40 GB in all cases). Simultaneously, mean Intersection over Union (mIoU) improves from 75.78% to 76.53%. Although higher resolutions naturally lead to increased training times and inference latencies, the efficient operator design of FlashAttention within the ELCM keeps these increments within acceptable limits. Specifically, training time increases from 11.35 h to 24.76 h, and inference latency rises from 39 ms to 694 ms, highlighting the resource management advantages of RingFormer-Seg in high-resolution scenarios.
When scaling to four GPUs for processing 8192 × 8192 resolution images, RingFormer-Seg utilizes the CDCR module’s ring-based communication topology to efficiently exchange global context across image sub-regions, thereby avoiding global synchronization bottlenecks. Even though the token sequence length reaches 262,144, the GPU memory footprint per GPU only slightly increases to 38.67 GB. Moreover, the training time (25.47 h) remains comparable to that of single-GPU processing at 4096 × 4096 resolution. Notably, the mIoU achieves 77.05%, which is an approximately 0.5% improvement over single-GPU results at 4096 × 4096 . This indicates that, by increasing the number of GPUs, RingFormer-Seg can continually improve global consistency and accuracy through parallelization and distributed context routing without significantly increasing memory pressure on individual GPUs.
When scaling beyond one GPU, the CDCR module demonstrates both efficiency and low overhead. On two GPUs with 8192 × 4096 inputs, the effective bandwidth reaches 9.12 GB/s with per-hop latency of only 1.52 ms, while memory footprint remains balanced (38.71 GB per GPU) and accuracy improves to 76.79%. At four GPUs and 8192 × 8192 resolution, bandwidth is 5.95 GB/s with a latency of 2.87 ms, yet training time (25.47 h) remains comparable to single-GPU 4096 × 4096 processing. Notably, the mIoU further rises to 77.05%, about 0.5% higher than the 4096 × 4096 single-GPU case. These results confirm that CDCR communication requires only moderate bandwidth and incurs low latency, avoiding synchronization bottlenecks even at extreme token counts (262,144).
Furthermore, RingFormer-Seg’s design exhibits potentially unlimited scalability: by adding more GPUs, CDCR can continuously incorporate new sub-region contexts into the communication ring. Meanwhile, the STF and ELCM modules ensure near-maximal GPU utilization without memory overflow. With more GPUs, the number of tokens transmitted per iteration remains constant, and the additional communication overhead due to an elongated ring topology can be linearly amortized, thus keeping the overall training and inference times manageable even at large scales. This advantage is achieved by the STF’s Saliency-Aware Token Filter efficiently pruning irrelevant or redundant tokens before global routing and by the block-wise streaming computation based on FlashAttention in ELCM, ensuring that only the most representative refined features, rather than the original set of tokens, are transmitted during each inter-GPU communication.
Through the synergistic effects of its three core modules, STF, ELCM, and CDCR, RingFormer-Seg achieves seamless scalability from low to ultra-high resolutions and from single to multiple GPUs. Across various image sizes and computational resources, this framework not only maintains high accuracy but also effectively controls memory and communication overhead, demonstrating its practical applicability and forward-looking capability in large-scale semantic segmentation tasks of UHR-RS imagery.

4.2. Scalability Analysis with Fixed Image Size Across Device Counts

To further isolate the impact of GPU counts on scalability, we performed an additional evaluation with a fixed image size of 4096 × 4096 pixels while varying the number of GPUs from 1 to 4 (Table 6). This design removes the confounding effect of varying resolutions and enables a fair assessment of pure GPU scalability.
As expected, RingFormer-Seg demonstrates near-linear acceleration when increasing device counts. Training time is reduced from 24.76 h on a single GPU to 13.10 h and 7.20 h on two and four GPUs, respectively, yielding a 1.89 × and 3.44 × speedup with corresponding parallel efficiencies of 94% and 86%. A similar trend is observed in inference: latency decreases from 694.06 ms (one GPU) to 212.35 ms (four GPUs), confirming that both training and inference benefit substantially from multi-GPU parallelization. Notably, segmentation accuracy remains stable (76.53% to 76.47%), indicating that performance gains stem from parallel efficiency rather than accuracy trade-offs.
These results are consistent with theoretical expectations based on parallel computing laws, where ideal linear scaling is hindered by inter-GPU communication and synchronization overheads. Nevertheless, the efficiency above 85% confirms that the CDCR’s ring-based topology, combined with the lightweight token routing of STF and FlashAttention’s streaming computation in ELCM, effectively minimizes overhead. This ensures that computation remains the dominant cost even in multi-GPU training.
Taken together, the two scalability evaluations provide a holistic view of RingFormer-Seg’s scalability. The varying-resolution analysis (Table 5) demonstrates the framework’s ability to handle increasingly large UHR-RS inputs by leveraging more GPUs, validating its applicability in real-world large-scale remote sensing tasks. Meanwhile, the fixed-size analysis (Table 6) isolates GPU scalability by controlling the input resolution, confirming near-linear speedup and high efficiency across multiple devices. These complementary results show that RingFormer-Seg not only adapts effectively to higher resolutions in practice but also achieves strong parallel efficiency under controlled conditions, ensuring both computational efficiency and practical applicability for large-scale semantic segmentation of ultra-high-resolution imagery.

4.3. Limitations and Future Works

Although RingFormer-Seg demonstrates strong performance in UHR-RS image segmentation, several limitations remain and suggest promising directions for future improvement. Below, we outline key areas that warrant further exploration.
  • Hardware-specific optimization of attention mechanisms: RingFormer-Seg leverages FlashAttention to enable memory-efficient token interaction within each device. However, this technique is tightly coupled with NVIDIA GPU architecture and relies on custom CUDA or Triton kernels, which limits portability across hardware platforms such as AMD GPUs or TPUs. Future work could explore hardware-agnostic efficient attention mechanisms or devise adaptable attention modules at the architectural level to ensure broader deployment compatibility.
  • Enhancing multi-scale representation efficiency: Semantic segmentation requires balancing fine-grained local detail with global context awareness. While RingFormer-Seg addresses this via token filtering and global context routing, its current architecture does not explicitly model multi-scale hierarchical features. Future variants may incorporate multi-scale token sampling or adaptive token pooling strategies to improve representation compactness and data utilization, especially for transformer-based models operating on UHR-RS imagery.
  • Extension to satellite video and multimodal temporal data: RingFormer-Seg’s parallel and context-sharing design naturally supports spatiotemporal applications, such as satellite video understanding or long-term time-series analysis. By processing temporal token sequences or integrating multimodal data (e.g., textual annotations and sensor metadata), RingFormer-Seg could serve as a foundation for general-purpose RS video or vision-language models. Evaluating RingFormer-Seg with larger spatial coverage or longer temporal sequences may require additional GPUs or distributed computational setups to demonstrate its scalability and effectiveness.
In summary, RingFormer-Seg presents a flexible and scalable foundation for large-scale image understanding. Extending its capabilities toward hardware-general attention, temporal modeling, and adaptive spatial design will further enhance its utility for Earth observation, land cover mapping, and multimodal urban analytics.

5. Conclusions

In this study, we introduced RingFormer-Seg, a scalable and context-preserving Vision Transformer framework specifically designed for semantic segmentation of UHR-RS imagery. RingFormer-Seg effectively addresses critical challenges related to long-range contextual modeling and computational scalability by integrating three key modules: the STF for selective token representation, the ELCM utilizing memory-efficient attention for refined local context modeling, and the CDCR mechanism for efficient global context propagation across distributed GPUs.
Comprehensive experiments on three challenging benchmarks (the DeepGlobe, Wuhan, and Guangdong datasets) demonstrated that RingFormer-Seg consistently achieves SOTA segmentation accuracy, significantly surpassing existing CNN- and transformer-based methods. Moreover, it successfully balances accuracy and computational efficiency across datasets with varying resolutions, confirming its robustness and scalability. In addition, we note that the design of RingFormer-Seg provides certain resilience to noisy or degraded inputs: STF helps filter out less informative or corrupted tokens, while CDCR enables complementary information exchange across regions, potentially mitigating local perturbations. A more systematic evaluation of robustness under explicit noise conditions will be pursued in future work. The proposed method shows strong potential for practical deployment in nationwide land cover mapping and large-scale Earth observation tasks, offering valuable support for informed decision-making in environmental management and sustainable development.

Author Contributions

Z.Z. and D.S. conceived the main idea. Z.Z., D.S., and G.G. developed the methodology and designed the experiments. Z.Z., W.H., and R.W. processed the data and conducted the experiments. Z.Z. and X.C. analyzed the results. Z.Z. prepared the original manuscript draft, and B.Y. supervised the research and provided critical revisions. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 42271354).

Data Availability Statement

The original contributions of this work are fully contained in the article. Requests for additional information may be directed to the corresponding or first author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep-learning-based semantic segmentation of remote sensing images: A survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 8370–8396. [Google Scholar] [CrossRef]
  2. Lv, J.; Shen, Q.; Lv, M.; Li, Y.; Shi, L.; Zhang, P. Deep learning-based semantic segmentation of remote sensing images: A review. Front. Ecol. Evol. 2023, 11, 1201125. [Google Scholar] [CrossRef]
  3. Putty, A.; Annappa, B.; Pariserum Perumal, S. Semantic Segmentation of Remotely Sensed Images for Land-use and Land-cover Classification: A Comprehensive Review. IETE Tech. Rev. 2025, 42, 222–237. [Google Scholar] [CrossRef]
  4. Macarringue, L.S.; Bolfe, É.L.; Pereira, P.R.M. Developments in land use and land cover classification techniques in remote sensing: A review. J. Geogr. Inf. Syst. 2022, 14, 1–28. [Google Scholar] [CrossRef]
  5. Ramos, L.; Sappa, A.D. Multispectral semantic segmentation for land cover classification: An overview. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14295–14336. [Google Scholar] [CrossRef]
  6. Ji, D.; Zhao, F.; Lu, H.; Tao, M.; Ye, J. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23621–23630. [Google Scholar]
  7. Du, B.; Shan, L.; Shao, X.; Zhang, D.; Wang, X.; Wu, J. Transform Dual-Branch Attention Net: Efficient Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images. Remote Sens. 2025, 17, 540. [Google Scholar] [CrossRef]
  8. Sun, Y.; Wang, M.; Huang, X.; Xin, C.; Sun, Y. Fast Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images via Score Map and Fast Transformer-Based Fusion. Remote Sens. 2024, 16, 3248. [Google Scholar] [CrossRef]
  9. Wang, J.; Chen, T.; Zheng, L.; Tie, J.; Zhang, Y.; Chen, P.; Luo, Z.; Song, Q. A multi-scale remote sensing semantic segmentation model with boundary enhancement based on UNetFormer. Sci. Rep. 2025, 15, 14737. [Google Scholar] [CrossRef] [PubMed]
  10. Sun, H.; Zhang, Y.; Xu, L.; Jin, S.; Chen, Y. Ultra-high resolution segmentation via boundary-enhanced patch-merging transformer. AAAI Conf. Artif. Intell. 2025, 39, 7087–7095. [Google Scholar] [CrossRef]
  11. Qin, R.; Liu, T. A review of landcover classification with very-high resolution remotely sensed optical images—Analysis unit, model scalability and transferability. Remote Sens. 2022, 14, 646. [Google Scholar] [CrossRef]
  12. Liu, W.; Li, Q.; Lin, X.; Yang, W.; He, S.; Yu, Y. Ultra-high resolution image segmentation via locality-aware context fusion and alternating local enhancement. Int. J. Comput. Vis. 2024, 132, 5030–5047. [Google Scholar] [CrossRef]
  13. Volpi, M.; Tuia, D. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 55, 881–893. [Google Scholar] [CrossRef]
  14. Du, L.; McCarty, G.W.; Zhang, X.; Lang, M.W.; Vanderhoof, M.K.; Li, X.; Huang, C.; Lee, S.; Zou, Z. Mapping forested wetland inundation in the Delmarva Peninsula, USA using deep convolutional neural networks. Remote Sens. 2020, 12, 644. [Google Scholar] [CrossRef]
  15. Brandt, M.; Tucker, C.J.; Kariryaa, A.; Rasmussen, K.; Abel, C.; Small, J.; Chave, J.; Rasmussen, L.V.; Hiernaux, P.; Diouf, A.A.; et al. An unexpectedly large count of trees in the West African Sahara and Sahel. Nature 2020, 587, 78–82. [Google Scholar] [CrossRef]
  16. Hossain, M.D.; Chen, D. A hybrid image segmentation method for building extraction from high-resolution RGB images. ISPRS J. Photogramm. Remote Sens. 2022, 192, 299–314. [Google Scholar] [CrossRef]
  17. Nevavuori, P.; Narra, N.; Lipping, T. Crop yield prediction with deep convolutional neural networks. Comput. Electron. Agric. 2019, 163, 104859. [Google Scholar] [CrossRef]
  18. Guo, S.; Liu, L.; Gan, Z.; Wang, Y.; Zhang, W.; Wang, C.; Jiang, G.; Zhang, W.; Yi, R.; Ma, L.; et al. Isdnet: Integrating shallow and deep networks for efficient ultra-high resolution segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4361–4370. [Google Scholar]
  19. Chen, W.; Li, Y.; Dang, B.; Zhang, Y. ElegantSeg: End-to-end holistic learning for extra-large image semantic segmentation. arXiv 2022, arXiv:2211.11316. [Google Scholar]
  20. Chen, W.; Jiang, Z.; Wang, Z.; Cui, K.; Qian, X. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8924–8933. [Google Scholar]
  21. Ding, L.; Lin, D.; Lin, S.; Zhang, J.; Cui, X.; Wang, Y.; Tang, H.; Bruzzone, L. Looking outside the window: Wide-context transformer for the semantic segmentation of high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  22. Zhang, J.; Qin, Q.; Ye, Q.; Ruan, T. ST-unet: Swin transformer boosted U-net with cross-layer feature enhancement for medical image segmentation. Comput. Biol. Med. 2023, 153, 106516. [Google Scholar] [CrossRef]
  23. Liu, Y.; Song, S.; Wang, M.; Gao, H.; Liu, J. DE-Unet: Dual-Encoder U-Net for Ultra-High Resolution Remote Sensing Image Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 12290–12302. [Google Scholar] [CrossRef]
  24. Wang, L.; Li, D.; Dong, S.; Meng, X.; Zhang, X.; Hong, D. PyramidMamba: Rethinking pyramid feature fusion with selective space state model for semantic segmentation of remote sensing imagery. arXiv 2024, arXiv:2406.10828. [Google Scholar] [CrossRef]
  25. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  26. Lv, X.; Zhang, P.; Li, S.; Gan, G.; Sun, Y. Lightformer: Light-weight transformer using svd-based weight transfer and parameter sharing. In Proceedings of the Findings of the Association for Computational Linguistics, ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 10323–10335. [Google Scholar]
  27. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A novel mamba architecture with a semantic transformer for efficient real-time remote sensing semantic segmentation. Remote Sens. 2024, 16, 2620. [Google Scholar] [CrossRef]
  28. Zhang, E.; Lyngaas, I.; Chen, P.; Wang, X.; Igarashi, J.; Huo, Y.; Munetomo, M.; Wahib, M. Adaptive patching for high-resolution image segmentation with transformers. In Proceedings of the SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 17–22 November 2024; pp. 1–16. [Google Scholar]
  29. Chen, H.; Feng, L.; Wu, W.; Zhu, X.; Leo, S.; Hu, K. F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation. arXiv 2025, arXiv:2506.07847. [Google Scholar]
  30. Song, W.; He, H.; Dai, J.; Jia, G. Spatially adaptive interaction network for semantic segmentation of high-resolution remote sensing images. Sci. Rep. 2025, 15, 15337. [Google Scholar] [CrossRef] [PubMed]
  31. Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]
  32. Liu, H.; Zaharia, M.; Abbeel, P. Ring attention with blockwise transformers for near-infinite context. arXiv 2023, arXiv:2310.01889. [Google Scholar] [CrossRef]
  33. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181. [Google Scholar]
  34. Li, D.; Wang, M.; Guo, H.; Jin, W. On China’s earth observation system: Mission, vision and application. Geo-Spat. Inf. Sci. 2025, 28, 303–321. [Google Scholar] [CrossRef]
  35. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  36. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  37. Lu, W.; Chen, S.B.; Ding, C.H.; Tang, J.; Luo, B. Lwganet: A lightweight group attention backbone for remote sensing visual tasks. arXiv 2025, arXiv:2501.10040. [Google Scholar]
Figure 1. The overall framework of the proposed RingFormer-Seg. It mainly consists of five modules: (1) Spatial Partitioning and Patch Embedding, dividing and embedding image patches into tokens; (2) Saliency-Aware Token Filter (STF), selecting salient tokens and pooling others; (3) Efficient Local Context Module (ELCM), refining tokens using Transformer blocks; (4) Cross-Device Context Routing (CDCR), exchanging tokens ring-wise across GPUs for global context; and (5) SETR-Basic Segmentation Decoder, decoding refined tokens and pooled features to generate segmentation results.
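To make the data flow in Figure 1 concrete, the following is a minimal single-device sketch of the five-stage pipeline in PyTorch. The module sizes, the linear saliency head, the single Transformer layer standing in for the ELCM stack, and the per-token classifier standing in for the SETR-Basic decoder are simplifying assumptions of this sketch, not the authors' implementation; the cross-device step (CDCR) is only marked by a comment here and is sketched separately after Figure 4.

```python
# Minimal single-device sketch of the RingFormer-Seg data flow in Figure 1.
# Shapes and stand-in modules are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Stage 1: split a subregion into 16x16 patches and embed them as tokens."""

    def __init__(self, in_ch=3, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                        # x: (B, C, H, W)
        t = self.proj(x)                         # (B, dim, H/16, W/16)
        return t.flatten(2).transpose(1, 2)      # (B, N, dim)


class RingFormerSegSketch(nn.Module):
    def __init__(self, dim=768, num_classes=7, keep_ratio=0.7):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)
        self.score = nn.Linear(dim, 1)           # Stage 2 (STF): saliency head
        self.local = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)  # Stage 3 (ELCM) stand-in
        self.head = nn.Linear(dim, num_classes)  # Stage 5: decoder stand-in
        self.keep_ratio = keep_ratio

    def forward(self, x):
        tokens = self.embed(x)                                    # (B, N, D)
        B, N, D = tokens.shape
        k = max(1, int(self.keep_ratio * N))
        saliency = self.score(tokens).squeeze(-1)                 # (B, N)
        top = saliency.topk(k, dim=1).indices                     # STF: keep the most salient tokens
        idx = top.unsqueeze(-1).expand(-1, -1, D)
        salient = torch.gather(tokens, 1, idx)                    # (B, k, D)
        refined = self.local(salient)                             # ELCM: intra-region refinement
        # Stage 4 (CDCR) would exchange `refined` ring-wise across GPUs here (see Figure 4).
        out = tokens.clone()                                      # residual integration:
        out.scatter_(1, idx, refined)                             # unselected tokens pass through
        return self.head(out)                                     # per-token class logits


if __name__ == "__main__":
    logits = RingFormerSegSketch()(torch.randn(1, 3, 256, 256))
    print(logits.shape)    # torch.Size([1, 256, 7]) for a 256x256 input and 16x16 patches
```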
Figure 2. Illustration of the Saliency-Aware Token Filter. The module applies Multi-Head Attention (MHA) to capture dependencies among the embedded tokens, while parallel convolutional layers refine the features spatially and channel-wise. The refined features are weighted and fused to emphasize the most informative components, yielding a saliency score for each token.
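Below is a hedged sketch of the filtering step: every token is scored, the top-k fraction is kept as salient tokens, and the remainder is average-pooled so that coarse context can later be re-integrated. The exact scoring head (MHA plus a 1-D convolution over the token sequence, fused by addition) is an assumption based on the caption above; only the rank-and-select behaviour is stated by the paper.

```python
# Hedged sketch of the Saliency-Aware Token Filter: score tokens, keep the
# top-k fraction as salient, and average-pool the rest as coarse context.
import torch
import torch.nn as nn


class SaliencyTokenFilter(nn.Module):
    def __init__(self, dim=768, heads=8, keep_ratio=0.7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.score = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                                 # tokens: (B, N, D)
        B, N, D = tokens.shape
        ctx, _ = self.attn(tokens, tokens, tokens)             # token-to-token dependencies
        ctx = ctx + self.conv(tokens.transpose(1, 2)).transpose(1, 2)  # spatial/channel refinement
        prob = torch.sigmoid(self.score(ctx)).squeeze(-1)      # saliency probability, (B, N)
        k = max(1, int(self.keep_ratio * N))
        keep = prob.topk(k, dim=1).indices                     # indices of salient tokens
        salient = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        dropped = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
        dropped[torch.arange(B, device=tokens.device).unsqueeze(1), keep] = False
        pooled = (tokens * dropped.unsqueeze(-1)).sum(1) / dropped.sum(1, keepdim=True).clamp(min=1)
        return salient, pooled, keep                           # salient: (B, k, D); pooled: (B, D)


salient, pooled, keep = SaliencyTokenFilter()(torch.randn(2, 1024, 768))
print(salient.shape, pooled.shape)   # torch.Size([2, 716, 768]) torch.Size([2, 768])
```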
Figure 3. Illustration of Efficient Local Context Module. (a) Functional block: Each ELCM block uses Flash MHA followed by a two-layer FFN, both with LayerNorm and residual skips, to refine local token features; (b) Hardware-aware dataflow: K/V tiles stream from HBM into each SM’s shared memory for on-chip attention and FFN, then write back, reducing off-chip traffic and capping on-chip memory.
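A minimal sketch of one ELCM block as described in Figure 3a follows: pre-norm multi-head attention and a two-layer FFN, each wrapped in a residual connection. PyTorch's scaled_dot_product_attention is used here as a stand-in that can dispatch to a FlashAttention-style fused kernel on supported GPUs (cf. the dataflow in Figure 3b); whether the authors use this exact kernel is not claimed by this sketch.

```python
# Minimal ELCM-block sketch: pre-norm Flash-style MHA + two-layer FFN,
# both with residual skips. Kernel choice is delegated to PyTorch's fused op.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ELCMBlock(nn.Module):
    def __init__(self, dim=768, heads=8, ffn_mult=4):
        super().__init__()
        self.heads = heads
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                                 nn.Linear(ffn_mult * dim, dim))

    def forward(self, x):                                     # x: (B, N, D)
        B, N, D = x.shape
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # reshape to (B, heads, N, head_dim) for the fused attention kernel
        q, k, v = (t.view(B, N, self.heads, D // self.heads).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)        # memory-efficient / flash path
        attn = attn.transpose(1, 2).reshape(B, N, D)
        x = x + self.proj(attn)                               # residual 1: attention branch
        return x + self.ffn(self.norm2(x))                    # residual 2: FFN branch


print(ELCMBlock()(torch.randn(1, 4096, 768)).shape)           # torch.Size([1, 4096, 768])
```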
Figure 4. Illustration of Cross-Device Context Routing. The refined salient tokens are circulated iteratively across GPUs arranged in a ring-wise topology. At each step, every GPU updates its local embeddings by attending to token embeddings received from its neighboring GPU, progressively aggregating global contextual information while minimizing communication overhead.
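The ring-wise circulation in Figure 4 can be expressed with point-to-point torch.distributed calls, as in the hedged sketch below: each rank posts a non-blocking send of its current token block to the next rank, receives the previous rank's block, folds it into its local embeddings, and relays the received block on the following hop. The cross_attend update is a placeholder, and the sketch assumes one process per GPU with an initialized NCCL process group (e.g., launched via torchrun); it is not the authors' routing code.

```python
# Hedged sketch of ring-wise token exchange across GPUs (one process per GPU).
# Run, for example, with: torchrun --nproc_per_node=4 cdcr_sketch.py
import torch
import torch.distributed as dist


def cross_attend(local, remote):
    """Placeholder cross-attention: update local tokens with remote context."""
    scores = local @ remote.transpose(-1, -2) / local.shape[-1] ** 0.5
    return local + torch.softmax(scores, dim=-1) @ remote


def ring_context_exchange(local_tokens):        # local_tokens: (B, k, D) on this rank's GPU
    rank, world = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world
    block, out = local_tokens.contiguous(), local_tokens
    for _ in range(world - 1):                  # after world-1 hops every block visits every rank
        recv_buf = torch.empty_like(block)
        req = dist.isend(block, dst=send_to)    # non-blocking send to the next rank in the ring
        dist.recv(recv_buf, src=recv_from)      # blocking receive from the previous rank
        req.wait()
        out = cross_attend(out, recv_buf)       # aggregate the neighbour's context locally
        block = recv_buf                        # relay the received block on the next hop
    return out


if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    tokens = torch.randn(1, 716, 768, device="cuda")
    print(dist.get_rank(), ring_context_exchange(tokens).shape)
    dist.destroy_process_group()
```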
Figure 5. Visual comparison of segmentation results on the DeepGlobe dataset using different semantic segmentation methods. (a) Original images; (b) ground truth; (c) UNet; (d) Swin Transformer; (e) GLNet; (f) WiCoNet; (g) LWGANet; (h) PyramidMamba; (i) AMR; (j) BPT; (k) RingFormer-Seg (ours).
Figure 6. Visual comparison of segmentation results on the Wuhan dataset using different semantic segmentation methods. (a) Original images; (b) ground truth; (c) UNet; (d) Swin Transformer; (e) GLNet; (f) RingFormer-Seg (ours). Methods that encountered OOM (out-of-memory) issues on this dataset are excluded from visualization.
Figure 7. Visual comparison of segmentation results on the Guangdong dataset using different semantic segmentation methods. (a) Original images; (b) ground truth; (c) UNet; (d) Swin Transformer; (e) RingFormer-Seg (ours). Methods that encountered OOM (out-of-memory) issues on this dataset are excluded from visualization.
Table 1. Comparison of semantic segmentation results across different methods on three benchmarks: DeepGlobe, Wuhan, and Guangdong. Metrics are mean Intersection over Union (mIoU, %), Parameters (Parm., MB), Throughput (Thrpt., img/s), and Floating-point Operations (FLOPs, G). “Patch-Wise Processing for Local Inference” methods split images into 1024 × 1024 tiles with 20% overlap and stitch predictions; all other methods perform full-image processing with default settings. “OOM” indicates GPU out-of-memory on single-GPU evaluation. Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Each dataset cell reports mIoU (%) ↑ / Parm. (MB) ↓ / Thrpt. (img/s) ↑ / FLOPs (G) ↓.

| Method | DeepGlobe (2048 × 2048 Pixel) | Wuhan (4096 × 4096 Pixel) | Guangdong (8192 × 8192 Pixel) |
|---|---|---|---|
| Patch-Wise Processing for Local Inference | | | |
| UNet | 68.91 / 34.53 / 8.54 / 1048.83 | 53.98 / 34.53 / 8.54 / 1048.83 | 72.68 / 34.53 / 8.54 / 1048.83 |
| Swin Transformer | 69.17 / 119.93 / 12.53 / 316.60 | 54.17 / 119.93 / 12.53 / 316.60 | 73.87 / 119.93 / 12.53 / 316.60 |
| Multi-Scale Model Architectures for Global Learning | | | |
| GLNet | 71.83 / 28.07 / 21.98 / 695.27 | 57.08 / 28.07 / 5.68 / 2781.09 | OOM |
| WiCoNet | 71.44 / 38.25 / 3.17 / 1163.47 | OOM | OOM |
| Lightweight Networks for Reduced Computational Cost | | | |
| LWGANet | 70.81 / 12.54 / 7.86 / 193.54 | OOM | OOM |
| PyramidMamba | 69.92 / 115.12 / 3.38 / 1500.89 | OOM | OOM |
| Representation Sparsification for Compact Model Representation | | | |
| AMR | 69.97 / 48.21 / 4.10 / 6654.95 | OOM | OOM |
| BPT | 68.19 / 17.71 / 0.54 / 290.02 | OOM | OOM |
| RingFormer-Seg (Ours) | 72.04 / 282.40 / 8.56 / 710.79 | 58.16 / 426.40 / 1.60 / 2843.51 | 77.05 / 426.40 / 1.52 / 2858.35 |
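For reference, the patch-wise baseline protocol stated in the Table 1 caption (1024 × 1024 tiles, 20% overlap, stitched predictions) can be sketched as below. Averaging overlapping logits is an assumption of this sketch, since the caption only states that tile predictions are stitched, and `segmenter` stands for any dense-prediction model.

```python
# Hedged sketch of patch-wise inference: slide a 1024x1024 window with 20%
# overlap, run the model per tile, and stitch by averaging overlapping logits.
import torch


@torch.no_grad()
def tiled_inference(model, image, num_classes, tile=1024, overlap=0.2):
    """image: (C, H, W) tensor with H, W >= tile; returns (num_classes, H, W) logits."""
    C, H, W = image.shape
    stride = int(tile * (1 - overlap))                        # 819 px step for 20% overlap
    logits = torch.zeros(num_classes, H, W, device=image.device)
    counts = torch.zeros(1, H, W, device=image.device)
    ys = sorted({min(y, H - tile) for y in range(0, H, stride)})
    xs = sorted({min(x, W - tile) for x in range(0, W, stride)})
    for y in ys:
        for x in xs:
            patch = image[:, y:y + tile, x:x + tile].unsqueeze(0)   # (1, C, tile, tile)
            out = model(patch)[0]                                   # (num_classes, tile, tile)
            logits[:, y:y + tile, x:x + tile] += out
            counts[:, y:y + tile, x:x + tile] += 1
    return logits / counts                                          # average overlapping regions


# usage with any dense-prediction model mapping (1, C, h, w) -> (1, K, h, w):
# pred = tiled_inference(segmenter, img_2048, num_classes=7).argmax(0)
```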
Table 2. Ablation study of RingFormer-Seg components on three UHR-RS benchmarks (DeepGlobe, Wuhan, and Guangdong). The base architecture is ViT-Base with full tokens (FT). Columns indicate inclusion of the Saliency-Aware Token Filter (STF), the Efficient Local Context Module (ELCM), and the Cross-Device Context Router (CDCR); “DP” denotes replacing CDCR with standard data parallelism. “√” indicates that the module is enabled. Performance is measured by mean Intersection over Union (mIoU,%). “OOM” marks GPU out-of-memory on single-GPU evaluation. Here, ↑ indicates that higher values are better.
| Structure | STF | ELCM | CDCR | DeepGlobe mIoU (%) ↑ | Wuhan mIoU (%) ↑ | Guangdong mIoU (%) ↑ |
|---|---|---|---|---|---|---|
| ViT-Base (Baseline) + | FT | SA | DP | 70.76 | OOM | OOM |
| | FT | SA | √ | 70.78 | 56.09 | 74.19 |
| | FT | √ | √ | 71.94 | 57.92 | 76.01 |
| | √ | √ | DP | 72.03 | OOM | OOM |
| | √ | √ | √ | 72.04 | 58.16 | 77.05 |
Table 3. Ablation study of the Saliency-Aware Token Filter (STF) module in RingFormer-Seg, in which the fraction of selected tokens (top-k ratio (%)) is varied. Tokens are ranked by their computed saliency probabilities, and only the top-k fraction is retained for subsequent attention computation. Evaluation metrics on the DeepGlobe dataset include mean Intersection-over-Union accuracy (mIoU (%)), the total number of model parameters (Params (MB)), inference throughput measured in images per second (Thrpt. (img/s)), and theoretical computational cost expressed in gigaflops (FLOPs (G)). Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
| Top-K Ratio (%) | mIoU (%) ↑ | Params (MB) ↓ | Throughput (img/s) ↑ | FLOPs (G) ↓ |
|---|---|---|---|---|
| 100 | 73.18 | 282.40 | 5.83 | 1007.32 |
| 85 | 72.79 | 282.40 | 6.74 | 858.95 |
| 70 | 72.04 | 282.40 | 8.56 | 710.79 |
| 55 | 69.14 | 282.40 | 10.69 | 562.76 |
| 40 | 67.69 | 282.40 | 14.41 | 414.82 |
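Since the parameter count is unchanged across ratios, the FLOPs column in Table 3 is, to a good approximation, affine in the retained-token fraction r. As an illustrative consistency check (not a cost model from the paper), a two-point fit through the 100% and 40% rows gives

$$
\mathrm{FLOPs}(r) \approx F_0 + r\,F_{\text{tok}}, \qquad
F_{\text{tok}} = \frac{1007.32 - 414.82}{1.0 - 0.4} \approx 987.5~\mathrm{G}, \qquad
F_0 \approx 1007.32 - 987.5 \approx 19.8~\mathrm{G},
$$

which predicts roughly 859.2 G at r = 0.85 and 711.1 G at r = 0.70, close to the measured 858.95 G and 710.79 G.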
Table 4. Comparison of different attention mechanisms based on the ViT-Base architecture, evaluated on the DeepGlobe dataset. Metrics include the total number of model parameters (Params (MB)), inference throughput in images per second (Throughput (img/s)), theoretical computational cost in gigaflops (FLOPs (G)), and mean Intersection-over-Union accuracy (mIoU (%)). Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
| Structure | Attention Mechanism | Params (MB) ↓ | Throughput (img/s) ↑ | FLOPs (G) ↓ | mIoU (%) ↑ |
|---|---|---|---|---|---|
| ViT-Base (Baseline) + | Swin Attention | 87.66 | 4.63 | 1266.41 | 71.16 |
| | Focal Attention | 91.16 | 2.12 | 1347.52 | 70.94 |
| | Memory-Efficient Attention | 155.90 | 1.60 | 5107.25 | 70.64 |
| | ELCM (Ours) | 60.25 | 6.31 | 986.90 | 72.04 |
Table 5. Scalability analysis of the proposed RingFormer-Seg method on the Guangdong dataset with varying image sizes and numbers of GPUs. The default architecture is ViT-Base with a patch size of 16 × 16. Metrics reported include GPU memory consumption in gigabytes (GPU Mem. (GB)), training time in hours (Train Time (h)), inference time per image in milliseconds (Infer. Time (ms)), inter-GPU communication overhead quantified by effective bandwidth (GB/s) and per-hop latency (ms), and segmentation accuracy measured by mean Intersection over Union (mIoU (%)). "—" indicates that the quantity is not applicable in the single-GPU setting (no inter-GPU communication) and is therefore not reported. Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better. "#GPUs" denotes the number of GPUs used.
| Image Size | Token Seq. | #GPUs | GPU Mem. (GB) ↓ | Train Time (h) ↓ | Infer. Time (ms) ↓ | Bandwidth (GB/s) ↑ | Latency (ms) ↓ | mIoU (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| 1024 × 1024 | 4096 | 1 | 4.51 | 11.35 | 39.16 | — | — | 75.78 |
| 2048 × 2048 | 16,384 | 1 | 11.50 | 19.60 | 136.03 | — | — | 76.06 |
| 4096 × 4096 | 65,536 | 1 | 38.75 | 24.76 | 694.06 | — | — | 76.53 |
| 8192 × 4096 | 131,072 | 2 | 38.71 | 24.97 | 1502.75 | 9.12 | 1.52 | 76.79 |
| 8192 × 8192 | 262,144 | 4 | 38.67 | 25.47 | 2336.56 | 5.95 | 2.87 | 77.05 |
Table 6. Scalability analysis of the proposed RingFormer-Seg method on the Guangdong dataset with a fixed image size of 4096 × 4096 pixels while varying the number of GPUs (1, 2, and 4). Metrics reported include training time in hours (Train Time (h)), speedup, parallel efficiency, inference time per image in milliseconds (Infer. Time (ms)), and segmentation accuracy measured by mean Intersection over Union (mIoU (%)). Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better. "#GPUs" denotes the number of GPUs used.
| #GPUs | Train Time (h) | Speedup | Efficiency | Infer. Time (ms) | mIoU (%) |
|---|---|---|---|---|---|
| 1 | 24.76 | 1.0× | 100% | 694.06 | 76.53 |
| 2 | 13.10 | 1.89× | 94% | 372.85 | 76.50 |
| 4 | 7.20 | 3.44× | 86% | 212.35 | 76.47 |
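Using the standard strong-scaling definitions, the Speedup and Efficiency columns in Table 6 follow directly from the reported training times:

$$
S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}, \qquad
S(2) = \frac{24.76}{13.10} \approx 1.89 \;(E \approx 94\%), \qquad
S(4) = \frac{24.76}{7.20} \approx 3.44 \;(E \approx 86\%).
$$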
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
