Figure 1.
The overall framework of the proposed RingFormer-Seg. It mainly consists of five modules: (1) Spatial Partitioning and Patch Embedding, dividing and embedding image patches into tokens; (2) Saliency-Aware Token Filter (STF), selecting salient tokens and pooling others; (3) Efficient Local Context Module (ELCM), refining tokens using Transformer blocks; (4) Cross-Device Context Routing (CDCR), exchanging tokens ring-wise across GPUs for global context; and (5) SETR-Basic Segmentation Decoder, decoding refined tokens and pooled features to generate segmentation results.
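To make the data flow of Figure 1 concrete, the following is a minimal single-GPU sketch of how the five stages connect. Every layer and name below (`RingFormerSegSketch`, `keep_ratio`, the stand-in modules) is an illustrative assumption rather than the released RingFormer-Seg implementation, and the Cross-Device Context Routing stage is omitted because it only applies in the multi-GPU setting (see Figure 4).

```python
# A minimal, single-GPU sketch of the Figure 1 pipeline (hypothetical stand-in layers).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RingFormerSegSketch(nn.Module):
    def __init__(self, in_ch=3, dim=768, num_classes=7, patch=16, keep_ratio=0.7):
        super().__init__()
        # (1) Spatial Partitioning and Patch Embedding
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        # (2) Saliency-Aware Token Filter: per-token saliency score
        self.saliency = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio
        # (3) Efficient Local Context Module: stand-in Transformer block
        self.elcm = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # (5) SETR-Basic-style decoder head: 1x1 classifier + bilinear upsampling
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):
        B, _, H, W = x.shape
        feat = self.patch_embed(x)                        # B x dim x h x w
        h, w = feat.shape[-2:]
        tokens = feat.flatten(2).transpose(1, 2)          # B x N x dim
        # Rank tokens by saliency, refine only the salient ones, keep the rest as-is
        scores = self.saliency(tokens).squeeze(-1)        # B x N
        k = max(1, int(self.keep_ratio * tokens.shape[1]))
        idx = scores.topk(k, dim=1).indices
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        refined = self.elcm(torch.gather(tokens, 1, gather_idx))
        tokens = tokens.scatter(1, gather_idx, refined)   # write refined tokens back
        feat = tokens.transpose(1, 2).reshape(B, -1, h, w)
        return F.interpolate(self.head(feat), size=(H, W), mode="bilinear", align_corners=False)
```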
Figure 2.
Illustration of Saliency-Aware Token Filter. This module dynamically aggregates features from various RS modalities, utilizing Multi-Head Attention (MHA) to capture cross-modality dependencies. Parallel convolutional layers then refine features spatially and channel-wise. The fused features are weighted and combined to emphasize the most informative aspects from each modality, resulting in a final score.
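A hedged sketch of the scoring-and-selection flow in Figure 2 is given below, assuming the tokens lie on a regular h × w grid; for brevity it operates on a single token stream rather than fusing multiple RS modalities. The class name `SaliencyTokenFilter`, the specific layer choices, and the mean-pooling of non-selected tokens are assumptions; only the overall flow (MHA fusion, parallel spatial/channel refinement, weighted combination into a final score, then top-k selection) follows the caption.

```python
# Illustrative sketch of the Saliency-Aware Token Filter; names and layer sizes are assumed.
import torch
import torch.nn as nn

class SaliencyTokenFilter(nn.Module):
    def __init__(self, dim=768, heads=8, keep_ratio=0.7):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # spatial branch
        self.channel = nn.Conv2d(dim, dim, kernel_size=1)                         # channel branch
        self.gate = nn.Linear(dim, 2)   # weights for combining the two branches
        self.score = nn.Linear(dim, 1)  # final per-token saliency score
        self.keep_ratio = keep_ratio

    def forward(self, tokens, h, w):
        B, N, C = tokens.shape
        fused, _ = self.mha(tokens, tokens, tokens)                # attention-based fusion
        grid = fused.transpose(1, 2).reshape(B, C, h, w)
        spatial = self.spatial(grid).flatten(2).transpose(1, 2)    # B x N x C, spatial refinement
        channel = self.channel(grid).flatten(2).transpose(1, 2)    # B x N x C, channel refinement
        w01 = torch.softmax(self.gate(fused), dim=-1)              # B x N x 2 branch weights
        combined = w01[..., :1] * spatial + w01[..., 1:] * channel
        scores = torch.sigmoid(self.score(combined)).squeeze(-1)   # B x N saliency probability
        k = max(1, int(self.keep_ratio * N))
        keep = scores.topk(k, dim=1).indices                       # indices of salient tokens
        salient = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, C))
        # Pool the non-selected tokens into a compact summary for the decoder
        mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, keep, False)
        pooled = (tokens * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True).clamp(min=1)
        return salient, pooled, keep
```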
Figure 3.
Illustration of Efficient Local Context Module. (a) Functional block: Each ELCM block uses Flash MHA followed by a two-layer FFN, both with LayerNorm and residual skips, to refine local token features; (b) Hardware-aware dataflow: K/V tiles stream from HBM into each SM’s shared memory for on-chip attention and FFN, then write back, reducing off-chip traffic and capping on-chip memory.
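For reference, the functional block in Figure 3a can be sketched as below. PyTorch's `scaled_dot_product_attention` is used because it dispatches to FlashAttention-style fused kernels when available; the pre-norm placement, head count, and FFN width are assumptions rather than the paper's exact configuration, and the HBM-to-shared-memory tiling of Figure 3b happens inside the fused kernel rather than in this Python code.

```python
# A minimal sketch of one ELCM block: Flash-style MHA + two-layer FFN, each with
# LayerNorm and a residual skip. Layer sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ELCMBlock(nn.Module):
    def __init__(self, dim=768, heads=12, ffn_mult=4):
        super().__init__()
        self.heads = heads
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x):                                 # x: B x N x dim
        B, N, C = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        # reshape to B x heads x N x head_dim for the fused attention kernel
        q, k, v = (t.reshape(B, N, self.heads, C // self.heads).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)    # fused/flash path when supported
        attn = attn.transpose(1, 2).reshape(B, N, C)
        x = x + self.proj(attn)                           # residual 1: attention sub-block
        x = x + self.ffn(self.norm2(x))                   # residual 2: two-layer FFN
        return x
```

Keeping attention and FFN in a single fused-kernel-friendly block is what allows the K/V tiles to stay on-chip, which is the point of the dataflow in Figure 3b.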
Figure 4.
Illustration of Cross-Device Context Routing. The refined salient tokens are circulated iteratively across GPUs arranged in a ring-wise topology. At each step, every GPU updates its local embeddings by attending to token embeddings received from its neighboring GPU, progressively aggregating global contextual information while minimizing communication overhead.
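A hedged sketch of one routing pass, written with `torch.distributed` point-to-point primitives, is shown below. The function name, the fixed token count per rank, and the simple residual cross-attention used to fold in the received tokens are illustrative assumptions; the sketch only mirrors the ring-wise circulation described in the caption and assumes the default process group is already initialized.

```python
# Illustrative ring-wise token routing across GPUs; not the exact RingFormer-Seg routine.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def ring_context_routing(local_tokens: torch.Tensor) -> torch.Tensor:
    """local_tokens: B x K x C refined salient tokens owned by this rank (same K on every rank)."""
    rank, world = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world
    ctx = local_tokens
    block = local_tokens.contiguous()                 # the token block currently circulating
    for _ in range(world - 1):
        recv = torch.empty_like(block)
        ops = [dist.P2POp(dist.isend, block, send_to),
               dist.P2POp(dist.irecv, recv, recv_from)]
        for req in dist.batch_isend_irecv(ops):       # overlapped send/recv avoids deadlock
            req.wait()
        # Update local embeddings by attending to the neighbour's tokens
        ctx = ctx + F.scaled_dot_product_attention(ctx, recv, recv)
        block = recv                                  # forward the received block next hop
    return ctx
```

After world_size − 1 hops every rank has attended once to every other rank's salient tokens, while each hop only moves a single token block between neighbouring GPUs, which keeps communication overhead low.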
Figure 5.
Visual comparison of segmentation results on the DeepGlobe dataset using different semantic segmentation methods. (a) Original images; (b) ground truth; (c) UNet; (d) Swin Transformer; (e) GLNet; (f) WiCoNet; (g) LWGANet; (h) PyramidMamba; (i) AMR; (j) BPT; (k) RingFormer-Seg (ours).
Figure 6.
Visual comparison of segmentation results on the Wuhan dataset using different semantic segmentation methods. (a) Original images; (b) ground truth; (c) UNet; (d) Swin Transformer; (e) GLNet; (f) RingFormer-Seg (ours). Methods that encountered OOM (out-of-memory) issues on this dataset are excluded from visualization.
Figure 7.
Visual comparison of segmentation results on the Guangdong dataset using different semantic segmentation methods. (a) Original images; (b) ground truth; (c) UNet; (d) Swin Transformer; (e) RingFormer-Seg (ours). Methods that encountered OOM (out-of-memory) issues on this dataset are excluded from visualization.
Table 1.
Comparison of semantic segmentation results across different methods on three benchmarks: DeepGlobe, Wuhan, and Guangdong. Metrics are mean Intersection over Union (mIoU, %), Parameters (Parm., MB), Throughput (Thrpt., img/s), and Floating-point Operations (FLOPs, G). “Patch-Wise Processing for Local Inference” methods split images into tiles with 20% overlap and stitch predictions; all other methods perform full-image processing with default settings. “OOM” indicates GPU out-of-memory on single-GPU evaluation. Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
Image sizes: DeepGlobe 2048 × 2048, Wuhan 4096 × 4096, Guangdong 8192 × 8192 pixels.

| Method | DeepGlobe mIoU (%) ↑ | DeepGlobe Parm. (MB) ↓ | DeepGlobe Thrpt. (img/s) ↑ | DeepGlobe FLOPs (G) ↓ | Wuhan mIoU (%) ↑ | Wuhan Parm. (MB) ↓ | Wuhan Thrpt. (img/s) ↑ | Wuhan FLOPs (G) ↓ | Guangdong mIoU (%) ↑ | Guangdong Parm. (MB) ↓ | Guangdong Thrpt. (img/s) ↑ | Guangdong FLOPs (G) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Patch-Wise Processing for Local Inference | | | | | | | | | | | | |
| UNet | 68.91 | 34.53 | 8.54 | 1048.83 | 53.98 | 34.53 | 8.54 | 1048.83 | 72.68 | 34.53 | 8.54 | 1048.83 |
| Swin Transformer | 69.17 | 119.93 | 12.53 | 316.60 | 54.17 | 119.93 | 12.53 | 316.60 | 73.87 | 119.93 | 12.53 | 316.60 |
| Multi-Scale Model Architectures for Global Learning | | | | | | | | | | | | |
| GLNet | 71.83 | 28.07 | 21.98 | 695.27 | 57.08 | 28.07 | 5.68 | 2781.09 | OOM | OOM | OOM | OOM |
| WiCoNet | 71.44 | 38.25 | 3.17 | 1163.47 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| Lightweight Networks for Reduced Computational Cost | | | | | | | | | | | | |
| LWGANet | 70.81 | 12.54 | 7.86 | 193.54 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| PyramidMamba | 69.92 | 115.12 | 3.38 | 1500.89 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| Representation Sparsification for Compact Model Representation | | | | | | | | | | | | |
| AMR | 69.97 | 48.21 | 4.10 | 6654.95 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| BPT | 68.19 | 17.71 | 0.54 | 290.02 | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| RingFormer-Seg (Ours) | 72.04 | 282.40 | 8.56 | 710.79 | 58.16 | 426.40 | 1.60 | 2843.51 | 77.05 | 426.40 | 1.52 | 2858.35 |
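For context on the patch-wise baselines in Table 1, the sketch below shows one plausible tile-and-stitch inference loop with 20% overlap; the tile size of 512 and the averaging of logits in overlapping regions are assumptions, since the exact protocol settings are not specified here.

```python
# Illustrative tile-and-stitch inference for UHR images with overlapping tiles.
import torch

@torch.no_grad()
def tiled_inference(model, image, num_classes, tile=512, overlap=0.2):
    """image: 1 x C x H x W. Returns stitched logits of shape 1 x num_classes x H x W."""
    _, _, H, W = image.shape
    stride = max(1, int(tile * (1 - overlap)))
    logits = torch.zeros(1, num_classes, H, W, device=image.device)
    counts = torch.zeros(1, 1, H, W, device=image.device)
    # Tile origins: regular stride plus a final tile flush with the image border
    ys = sorted(set(list(range(0, max(H - tile, 1), stride)) + [max(H - tile, 0)]))
    xs = sorted(set(list(range(0, max(W - tile, 1), stride)) + [max(W - tile, 0)]))
    for y in ys:
        for x in xs:
            y2, x2 = min(y + tile, H), min(x + tile, W)
            out = model(image[:, :, y:y2, x:x2])   # per-tile logits: 1 x num_classes x th x tw
            logits[:, :, y:y2, x:x2] += out
            counts[:, :, y:y2, x:x2] += 1
    return logits / counts                          # average logits where tiles overlap
```

Averaging over the overlap is a common way to suppress seam artifacts at tile borders, which is why overlapping tiles are used instead of a plain grid.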
Table 2.
Ablation study of RingFormer-Seg components on three UHR-RS benchmarks (DeepGlobe, Wuhan, and Guangdong). The base architecture is ViT-Base with full tokens (FT). Columns indicate inclusion of the Saliency-Aware Token Filter (STF), the Efficient Local Context Module (ELCM), and the Cross-Device Context Router (CDCR); “FT” and “SA” denote the baseline settings that replace STF with full tokens and ELCM with standard self-attention, respectively, and “DP” denotes replacing CDCR with standard data parallelism. “√” indicates that the module is enabled. Performance is measured by mean Intersection over Union (mIoU, %). “OOM” marks GPU out-of-memory on single-GPU evaluation. Here, ↑ indicates that higher values are better.
| Structure | STF | ELCM | CDCR | DeepGlobe mIoU (%) ↑ | Wuhan mIoU (%) ↑ | Guangdong mIoU (%) ↑ |
|---|---|---|---|---|---|---|
| ViT-Base (Baseline) + | FT | SA | DP | 70.76 | OOM | OOM |
| | FT | SA | √ | 70.78 | 56.09 | 74.19 |
| | FT | √ | √ | 71.94 | 57.92 | 76.01 |
| | √ | √ | DP | 72.03 | OOM | OOM |
| | √ | √ | √ | 72.04 | 58.16 | 77.05 |
Table 3.
Ablation study of the Saliency-Aware Token Filter (STF) module in RingFormer-Seg, in which the fraction of selected tokens (top-k ratio (%)) is varied. Tokens are ranked by their computed saliency probabilities, and only the top-k fraction is retained for subsequent attention computation. Evaluation metrics on the DeepGlobe dataset include mean Intersection-over-Union accuracy (mIoU (%)), the total number of model parameters (Params (MB)), inference throughput measured in images per second (Thrpt. (img/s)), and theoretical computational cost expressed in gigaflops (FLOPs (G)). Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
| Top-K Ratio (%) | mIoU (%) ↑ | Params (MB) ↓ | Throughput (img/s) ↑ | FLOPs (G) ↓ |
|---|---|---|---|---|
| 100 | 73.18 | 282.40 | 5.83 | 1007.32 |
| 85 | 72.79 | 282.40 | 6.74 | 858.95 |
| 70 | 72.04 | 282.40 | 8.56 | 710.79 |
| 55 | 69.14 | 282.40 | 10.69 | 562.76 |
| 40 | 67.69 | 282.40 | 14.41 | 414.82 |
Table 4.
Comparison of different attention mechanisms based on the ViT-Base architecture, evaluated on the DeepGlobe dataset. Metrics include the total number of model parameters (Params (MB)), inference throughput in images per second (Throughput (img/s)), theoretical computational cost in gigaflops (FLOPs (G)), and mean Intersection-over-Union accuracy (mIoU (%)). Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better.
| Structure | Attention Mechanism | Params (MB) ↓ | Throughput (img/s) ↑ | FLOPs (G) ↓ | mIoU (%) ↑ |
|---|---|---|---|---|---|
| ViT-Base (Baseline) + | Swin Attention | 87.66 | 4.63 | 1266.41 | 71.16 |
| | Focal Attention | 91.16 | 2.12 | 1347.52 | 70.94 |
| | Memory-Efficient Attention | 155.90 | 1.60 | 5107.25 | 70.64 |
| | ELCM (Ours) | 60.25 | 6.31 | 986.90 | 72.04 |
Table 5.
Scalability analysis of the proposed RingFormer-Seg method on the Guangdong dataset with varying image sizes and numbers of GPUs. The default architecture is ViT-Base with a patch size of . Metrics reported include GPU memory consumption in gigabytes (GPU Mem. (GB)), training time in hours (Train Time (h)), inference time per image in milliseconds (Infer. Time (ms)), inter-GPU communication overhead quantified by effective bandwidth (GB/s) and per-hop latency (ms), and segmentation accuracy measured by mean Intersection over Union (mIoU (%)). “—” indicates that the quantity is not applicable in the single-GPU setting (no inter-GPU communication) and is therefore not reported. Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better. “#GPUs” denotes the number of GPUs used.
| Image Size | Token Seq. | #GPUs | GPU Mem. (GB) ↓ | Train Time (h) ↓ | Infer. Time (ms) ↓ | Bandwidth (GB/s) ↑ | Latency (ms) ↓ | mIoU (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| 1024 × 1024 | 4096 | 1 | 4.51 | 11.35 | 39.16 | — | — | 75.78 |
| 2048 × 2048 | 16,384 | 1 | 11.50 | 19.60 | 136.03 | — | — | 76.06 |
| 4096 × 4096 | 65,536 | 1 | 38.75 | 24.76 | 694.06 | — | — | 76.53 |
| 8192 × 4096 | 131,072 | 2 | 38.71 | 24.97 | 1502.75 | 9.12 | 1.52 | 76.79 |
| 8192 × 8192 | 262,144 | 4 | 38.67 | 25.47 | 2336.56 | 5.95 | 2.87 | 77.05 |
Table 6.
Scalability analysis of the proposed RingFormer-Seg method on the Guangdong dataset with a fixed image size of 4096 × 4096 pixels while varying the number of GPUs (1, 2, and 4). Metrics reported include training time in hours (Train Time (h)), speedup, parallel efficiency, inference time per image in milliseconds (Infer. Time (ms)), and segmentation accuracy measured by mean Intersection over Union (mIoU (%)). Here, ↑ indicates that higher values are better, while ↓ indicates that lower values are better. “#GPUs” denotes the number of GPUs used.
| #GPUs | Train Time (h) ↓ | Speedup ↑ | Efficiency ↑ | Infer. Time (ms) ↓ | mIoU (%) ↑ |
|---|---|---|---|---|---|
| 1 | 24.76 | 1.0× | 100% | 694.06 | 76.53 |
| 2 | 13.10 | 1.89× | 94% | 372.85 | 76.50 |
| 4 | 7.20 | 3.44× | 86% | 212.35 | 76.47 |
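For reference, the speedup and parallel-efficiency columns in Table 6 are consistent with the standard strong-scaling definitions, illustrated here with the 4-GPU row:

```latex
S_N = \frac{T_1}{T_N} = \frac{24.76\,\text{h}}{7.20\,\text{h}} \approx 3.44\times,
\qquad
E_N = \frac{S_N}{N} = \frac{3.44}{4} \approx 86\%.
```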