Efficient Adaptation of Vision Foundation Model for High-Resolution Remote Sensing Image Segmentation via Spatial-Frequency Modeling and Sparse Refinement

Ding, Chenlong; Shi, Chengyi; Liu, Daofang; Shi, Zhihao; Lyu, Xin; Fang, Zhenyu; Liu, Xue; Meng, Lingqiang; Fang, Yiwei; Zhang, Chengming; Li, Xin

doi:10.3390/rs18091295

by Chenlong Ding ^1,†, Chengyi Shi ^1,†, Daofang Liu ^2,†, Zhihao Shi ^1, Xin Lyu ^1,3, Zhenyu Fang ^1, Xue Liu ^1, Lingqiang Meng ^1, Yiwei Fang ^1, Chengming Zhang ^1 and Xin Li ^1,3,*

^1 College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China

^2 Information Center, Yellow River Conservancy Commission (YRCC), Zhengzhou 450003, China

^3 Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing 211100, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Remote Sens. 2026, 18(9), 1295; https://doi.org/10.3390/rs18091295

Submission received: 23 March 2026 / Revised: 20 April 2026 / Accepted: 22 April 2026 / Published: 24 April 2026

(This article belongs to the Special Issue Deep Learning-Driven Hyperspectral Unmixing and Classification Techniques for Remote Sensing Images)

Highlights

What are the main findings?

This study proposes ADVMSeg, a parameter-efficient framework for high-resolution remote-sensing semantic segmentation, which combines an SF-Adapter for spatial-frequency-aware backbone adaptation with an ASR module for sparse hard-region refinement on top of a frozen DINOv3 backbone.
ADVMSeg achieves the best overall performance on GID-15, LoveDA, and ISPRS Potsdam, reaching 63.1%, 63.5%, and 81.4% mIoU, respectively.

What are the implications of the main findings?

The results demonstrate that the proposed SF-Adapter effectively improves the adaptation of frozen vision foundation-model features to remote-sensing semantic segmentation.
The results further show that the proposed ASR module effectively refines hard regions and improves segmentation quality with limited additional computation.
This demonstrates the effectiveness of the proposed adapter-based adaptation strategy for applying frozen vision foundation models to remote-sensing semantic segmentation.

Abstract

High-resolution remote-sensing semantic segmentation requires models to simultaneously capture global scene semantics and preserve fine-grained local structures. Although satellite-pretrained vision foundation models provide strong transferable representations, the features extracted by a frozen backbone remain insufficiently adapted to dense prediction, particularly for representing high-frequency details and multiscale local patterns. In addition, correcting residual prediction errors with dense full-map refinement introduces substantial computational redundancy, since hard errors are typically concentrated in only a small subset of locations. To address these challenges, we propose ADVMSeg, an efficient remote-sensing semantic segmentation framework built upon a frozen satellite-pretrained DINOv3 backbone. Specifically, we introduce a Spatial-Frequency Adapter (SF-Adapter) to improve backbone-level dense feature adaptation by jointly modeling global frequency responses and multiscale local spatial details in a lightweight bottleneck space. We further design an Adaptive Sparse Refinement (ASR) module after the pixel decoder, which identifies hard regions from coarse predictions via uncertainty and boundary cues, and performs targeted local cross-attention refinement only on selected critical locations. Extensive experiments on GID-15, LoveDA, and ISPRS Potsdam validate the effectiveness of the proposed framework. Under the unified setting, ADVMSeg achieves 63.1% mIoU on GID-15, 63.5% mIoU on LoveDA, and 81.4% mIoU on ISPRS Potsdam. These results support jointly improving backbone-level feature adaptation and prediction-stage computation allocation, within the evaluated setting of a frozen DINOv3 backbone and three representative remote-sensing semantic-segmentation datasets.

Keywords:

remote-sensing image semantic segmentation; vision foundation model; spatial-frequency adapter; sparse refinement

1. Introduction

Semantic segmentation of high-resolution remote-sensing images (HRSIs) is one of the fundamental tasks in Earth observation. Compared with natural-image segmentation, HRSI segmentation is more challenging because remote-sensing scenes often exhibit stronger scale variation, larger intra-class diversity, weaker inter-class separability, and more complex spatial layouts. As a result, accurate prediction requires not only global semantic understanding, but also faithful modeling of boundaries, small objects, elongated structures, and abrupt land-cover transitions [1,2].

Recent advances in vision transformers and vision foundation models (VFMs) have significantly reshaped the paradigm of dense prediction. In natural-image segmentation, representative transformer-based frameworks such as ViT, Swin Transformer, SETR, Segmenter, and SegFormer have demonstrated the strong potential of large-scale pretrained visual representations for dense perception [3,4,5,6,7]. In remote sensing, this trend has been further reinforced by foundation-model studies and surveys, including SatMAE, CROMA, Scale-MAE, Cross-Scale MAE, AnySat, SkySense, DINOv3, and recent reviews on remote-sensing foundation models [8,9,10,11,12,13,14,15,16]. This makes frozen-backbone adaptation an attractive alternative to training task-specific architectures from scratch or fully fine-tuning very large models. In remote sensing, such a setting is particularly appealing because annotated data are often limited, image resolution is high, and the cost of full-network optimization can be substantial. However, effectively transferring frozen-VFM features to HRSI semantic segmentation remains nontrivial.

Existing studies partly relate to this problem, but often address only part of it. On the one hand, remote-sensing segmentation methods have shown that stronger structural modeling can improve local delineation and contour quality. Classical CNN-based remote-sensing networks such as HRCNet, ABCNet, and MANet already highlighted the importance of contextual aggregation and structural sensitivity [17,18,19], while more recent methods such as DC-Swin and UNetFormer further improved long-range contextual modeling in high-resolution scene parsing [20,21]. Boundary-aware learning and geometry-guided refinement also indicate that explicit structural correction remains important for difficult regions [22,23,24,25]. On the other hand, recent VFM adaptation methods in remote sensing mainly emphasize semantic alignment, prompt design, or generic parameter-efficient transfer [26,27,28,29,30,31]. Meanwhile, selective refinement methods in computer vision suggest that dense prediction errors are spatially sparse rather than uniformly distributed [32,33,34]. Nevertheless, comparatively less attention has been paid to jointly addressing the spatial-frequency mismatch of frozen-backbone features and the redundant computation of dense full-map correction in HRSI segmentation.

In this work, we argue that the remaining difficulty can be understood from two complementary perspectives. The first is a representation adaptation gap. Although frozen transformer features provide strong global semantic priors, they are not fully adapted to the fine-grained structural requirements of HRSI dense prediction. In practice, this limitation is often reflected in blurred contours, missing thin structures, inaccurate object boundaries, and insufficient sensitivity to repetitive local patterns. Recent frequency-aware studies suggest that explicit handling of spatial-frequency variation can benefit remote-sensing dense prediction, as shown by wavelet-based spatial-frequency fusion and multi-frequency feature interaction [35,36]. Related remote-sensing imaging studies on hyperspectral reconstruction and cloud removal further support the value of structure-preserving and prior-guided modeling in complex Earth observation data [37,38]. The second is a computation allocation gap. Residual segmentation errors are usually concentrated in a relatively small number of hard regions, such as uncertain semantic transitions, boundary-heavy areas, and small-object neighborhoods. Applying equally dense refinement over the entire high-resolution prediction map may therefore introduce unnecessary computational overhead.

To this end, we propose ADVMSeg, a frozen-backbone adaptation framework for HRSI semantic segmentation. The method contains two coordinated components. First, we introduce a Spatial-Frequency Adapter (SF-Adapter), which is inserted into later transformer blocks to adapt frozen-backbone features in a lightweight bottleneck space. Rather than using a generic adapter only for task projection, SF-Adapter explicitly combines global spectral modulation and local multiscale spatial enhancement to improve the representation of fine structures and spatial-frequency variation. Second, we design an Adaptive Sparse Refinement (ASR) module after the pixel decoder. Instead of performing dense refinement over the full prediction map, ASR identifies a small subset of difficult regions from the coarse prediction and allocates additional local cross-attention computation only to those locations. In this way, ADVMSeg is designed to improve both feature adaptation and refinement efficiency under the evaluated frozen-VFM setting with frozen DINOv3.

The main contributions of this work are summarized as follows:

We formulate frozen-VFM adaptation for HRSI semantic segmentation from two complementary perspectives, namely the representation adaptation gap and the computation allocation gap, and build a unified framework around them.
We propose SF-Adapter, a spatial-frequency-aware lightweight adapter that improves frozen-backbone features by combining global spectral filtering, local multiscale spatial enhancement, and adaptive fusion.
We propose ASR, a post-decoding sparse refinement module that allocates additional computation only to hard regions, improving local correction quality while avoiding unnecessary dense full-map refinement.
Experiments on three representative benchmarks, namely GID-15, LoveDA, and ISPRS Potsdam, show that the proposed method performs favorably under a unified protocol with frozen DINOv3 and RGB-only input.

2. Related Work

2.1. Remote-Sensing Semantic Segmentation and Structure-Aware Modeling

Semantic segmentation of remote-sensing imagery has long been challenged by strong scale variation, complex spatial arrangement, weak object boundaries, and the coexistence of large homogeneous regions with fine structures [1,39]. Early progress was largely driven by CNN-based architectures and generic semantic segmentation frameworks, such as PSPNet and DeepLabV3+, together with remote-sensing-oriented variants that improved multiscale aggregation and dense decoding [40,41,42,43]. Representative remote-sensing models such as HRCNet, ABCNet, and MANet further demonstrated the value of high-resolution context extraction, bilateral contextual modeling, and multi-attention interaction for fine-resolution segmentation [17,18,19]. More recent transformer-based methods, including DC-Swin, UNetFormer, and DWin-HRFormer, further strengthened long-range contextual modeling and urban-scene parsing [20,21,44].

A related line of work emphasizes structure-aware modeling. Boundary-guided learning, semantic edge modeling, and class-guided structural interaction have been shown to be effective for improving contour quality and local delineation in remote-sensing imagery [22,23,24,45,46]. Geometry-aware refinement further indicates that post-encoding structural correction can still improve segmentation quality in difficult regions [25]. Another relevant direction focuses on frequency-related or spatial-frequency designs, motivated by the fact that remote-sensing scenes often contain repetitive textures, sharp category transitions, and scale-varying local patterns. Early representative examples include wavelet-based spatial-frequency fusion [35] and multi-frequency feature aggregation in vision-Mamba-style architectures [36]. Li et al. later introduced explicit frequency decoupling for remote-sensing segmentation [47]. Related remote-sensing imaging studies on hyperspectral reconstruction and cloud removal likewise support the value of structure-preserving priors and frequency-aware modeling in complex Earth observation data [37,38]. More recently, dual-domain decoupled fusion was explored to enhance cross-domain feature interaction [48], while frequency-guided denoising provided another route for suppressing interference during segmentation [49]. Related studies on image quality enhancement and anomaly detection further emphasize the importance of subtle-structure preservation and interference suppression in remote-sensing visual understanding [50,51,52]. This line was further extended to frequency-domain-enhanced spectral–spatial fusion [53]. A position-aware differential denoising design was also introduced to improve structure-sensitive prediction under complex scene variation [54]. 
These studies collectively show that successful HRSI segmentation depends not only on semantic abstraction but also on preserving subtle structure and handling heterogeneous spatial-frequency characteristics.

However, most of these methods are developed in fully trainable task-specific settings. By contrast, the focus of this work is a frozen-foundation-model setting, where the main question is not how to build a stronger, fully trainable backbone, but how to compensate for the mismatch between frozen high-level representations and the fine-grained structural requirements of HRSI dense prediction. Our SF-Adapter is positioned in this context: it is not a generic local enhancement branch, but a lightweight adaptation module that explicitly targets spatial-frequency mismatch in frozen-backbone features.

2.2. Foundation-Model Adaptation and Efficient Refinement

The rapid development of vision transformers and foundation models has made pretrained visual encoders an important starting point for remote-sensing interpretation. In the remote-sensing domain, recent works such as SatMAE, CROMA, Scale-MAE, Cross-Scale MAE, AnySat, SkySense, and DINOv3 suggest that large-scale pretraining can provide transferable geospatial representations across scenes, resolutions, and modalities [8,9,10,11,12,13,14]. The growing influence of this direction has also been summarized in recent remote-sensing foundation-model surveys [15,16]. As a consequence, parameter-efficient transfer has become increasingly attractive, especially when full-network fine-tuning is expensive or unnecessary.

Existing frozen-backbone adaptation methods mainly focus on semantic alignment or generic parameter-efficient transfer. Representative directions include standard adapters, prompt learning, and other lightweight fine-tuning strategies [26,27,28,29,30,31]. While these methods are effective, their main objective is usually to bridge the domain or task gap in a general sense. Comparatively less attention has been paid to a dense-prediction-specific issue that is especially important for HRSI segmentation, namely that frozen-backbone features may remain insufficiently adapted to boundary-sensitive, high-frequency, and multiscale local structures. In this regard, our SF-Adapter differs from a standard bottleneck adapter or generic PEFT module in that it explicitly decomposes the adaptation into a spectral branch, a spatial branch, and an adaptive fusion mechanism.

Besides feature adaptation, another important question is how to allocate refinement computation. Prior work in computer vision has shown that segmentation errors tend to be spatially sparse. Methods such as PointRend and SegFix improve dense prediction by refining uncertain points or boundary-sensitive areas rather than treating the full map uniformly [33,34]. Mask2Former further provides a strong query-based dense prediction framework that supports localized mask reasoning [32]. In remote sensing, many structure-aware modules also improve local prediction, but they are often applied densely over the full-resolution feature map [24,45,46,55]. Our ASR module follows a different design choice. It preserves the frozen encoder and the dense coarse prediction pipeline, and introduces sparse computation only after decoding. Moreover, ASR does not rely on uncertainty alone: it combines uncertainty and boundary response for routing, constructs sparse queries from multiscale dense features, and performs local cross-attention refinement before rendering residual corrections back to the dense prediction map. Therefore, the method is positioned not as a generic hard-example heuristic, but as a post-decoding sparse reasoning mechanism tailored to frozen-VFM HRSI segmentation.

3. Methods

3.1. Framework Overview

We propose ADVMSeg, a parameter-efficient remote-sensing semantic segmentation framework built upon a frozen satellite-pretrained DINOv3 backbone, a Mask2Former-style segmentation head, and two task-specific modules: the proposed Spatial-Frequency Adapter (SF-Adapter) and Adaptive Sparse Refinement (ASR). The overall design follows the principle of preserving the strong global semantics of a frozen vision foundation model while introducing lightweight task adaptation and selective prediction-stage refinement.

Given an input RGB image $I \in \mathbb{R}^{B \times 3 \times H_{0} \times W_{0}}$, where B is the batch size, the frozen DINOv3 ViT-L/16 backbone first converts the image into non-overlapping $16 \times 16$ patches and outputs patch-token features. Since the backbone uses a patch size of 16, the token grid resolution is

$$H = \frac{H_{0}}{16}, \quad W = \frac{W_{0}}{16}, \quad N = H \times W, \tag{1}$$

where N denotes the number of patch tokens. In all experiments, the backbone embedding dimension is $C = 1024$, corresponding to the ViT-L architecture.

Unless otherwise stated, uppercase letters such as B, C, D, H, and W denote tensor dimensions. Here, B is the batch size, C is the backbone embedding dimension, and H and W denote the spatial resolution of the patch-token grid. The symbols $H_{0}$ and $W_{0}$ denote the input image height and width, respectively, while $N = H \times W$ is the number of patch tokens. We use ℓ to index transformer blocks, and K denotes the number of semantic classes.

The frozen DINOv3 backbone provides strong global semantics and long-range contextual modeling, while the SF-Adapter supplements remote-sensing-oriented spatial-frequency cues in a parameter-efficient manner.

From a signal-processing perspective, dense prediction in HRSIs depends on information distributed across different spatial-frequency ranges. Low-frequency components are more closely related to large-scale semantic layout and global scene consistency, whereas high-frequency components are more relevant to boundaries, small objects, elongated structures, and abrupt spatial transitions. Although a frozen transformer backbone provides strong transferable semantic abstraction, it is not explicitly optimized for boundary-sensitive dense prediction and may therefore under-represent fine structural cues required by downstream segmentation. Accordingly, the role of SF-Adapter is not to perform generic feature enhancement, but to compensate for the spatial-frequency mismatch between frozen foundation features and the fine-grained structural requirements of remote-sensing dense prediction. On this basis, we design the SF-Adapter to jointly model global spectral responses and local multiscale spatial details in a lightweight bottleneck space.

On top of the backbone features, a Mask2Former-style decoder produces a coarse dense prediction. The ASR module then identifies a small subset of difficult regions from this coarse prediction and performs targeted local refinement. In this way, ADVMSeg jointly addresses two complementary challenges in frozen-foundation-model transfer for dense prediction: insufficient adaptation of backbone features to fine-grained local structures, and redundant dense refinement over easy regions. The overall architecture is shown in Figure 1.

3.2. Frozen Satellite-Pretrained Backbone and Mask2Former Head

3.2.1. Frozen DINOv3 Backbone

We adopt the official satellite-pretrained DINOv3 ViT-L/16 distilled checkpoint pretrained on SAT-493M as the frozen encoder. The backbone contains 24 transformer blocks and remains completely frozen during training. Let $x^{(\ell)} \in \mathbb{R}^{B \times (N + 1) \times C}$ denote the token sequence at the output of the ℓ-th transformer block. When a class token is present, it is excluded from the proposed adapter and segmentation head, and only the patch-token sequence is used:

$$x_{p}^{(\ell)} \in \mathbb{R}^{B \times N \times C}, \quad C = 1024. \tag{2}$$

The patch tokens are reshaped into a 2D feature grid as

$$X^{(\ell)} = \mathcal{R}(x_{p}^{(\ell)}) \in \mathbb{R}^{B \times C \times H \times W}, \tag{3}$$

where $\mathcal{R}(\cdot)$ denotes the token-to-feature-map reshape operator.
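The reshape in Equations (2) and (3) can be illustrated with a minimal NumPy sketch. It assumes that the class token, when present, sits at index 0 and that patch tokens are stored in row-major (raster) order; both are assumptions of this example rather than details stated in the text, and the function names are hypothetical.

```python
import numpy as np

def tokens_to_grid(x, H, W, has_cls=True):
    """Drop the optional class token and reshape (B, N, C) tokens to (B, C, H, W)."""
    # Assumption: the class token, if any, is the first token of the sequence.
    x_p = x[:, 1:, :] if has_cls else x               # Eq. (2): patch tokens only
    B, N, C = x_p.shape
    if N != H * W:
        raise ValueError("token count does not match the H x W grid")
    # Assumption: tokens are ordered row by row (raster order).
    return x_p.reshape(B, H, W, C).transpose(0, 3, 1, 2)   # Eq. (3)

def grid_to_tokens(X):
    """Inverse reshape: (B, C, H, W) feature map back to (B, N, C) tokens."""
    B, C, H, W = X.shape
    return X.transpose(0, 2, 3, 1).reshape(B, H * W, C)

# A 512 x 512 crop with patch size 16 gives a 32 x 32 token grid.
H0 = W0 = 512
H, W = H0 // 16, W0 // 16
rng = np.random.default_rng(0)
x = rng.standard_normal((1, H * W + 1, 1024)).astype(np.float32)
X = tokens_to_grid(x, H, W)
print(X.shape)
```

The inverse helper corresponds to the operator written as R^{-1} later in the adapter, so a round trip through both functions returns the original patch tokens.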

To balance semantic richness and computational efficiency, we use the outputs of four later transformer blocks, namely $\ell \in \{12, 16, 20, 24\}$, as the backbone features sent to the pixel decoder. This choice keeps sufficient semantic diversity across depths while avoiding excessive memory usage. In addition, the SF-Adapter is inserted into the last six transformer blocks, i.e., $\ell \in \{19, 20, 21, 22, 23, 24\}$, where the features are semantically strongest and most relevant to dense prediction.

3.2.2. Mask2Former-Style Segmentation Head

On top of the frozen backbone, we use a Mask2Former-style segmentation head consisting of a lightweight pixel decoder and a transformer decoder. Here, D denotes the unified decoder channel dimension, $N_{q}$ is the number of learnable decoder queries, and $L_{d}$ is the number of transformer decoder layers. The notation $P_{16}$, $P_{8}$, and $P_{4}$ refers to dense feature maps at progressively finer resolutions, where the subscripts indicate their effective stride with respect to the input image. Each selected backbone feature map is first projected to a unified channel dimension $D = 256$ through a $1 \times 1$ convolution:

$$F^{(\ell)} = \mathrm{Proj}_{1 \times 1}(X^{(\ell)}) \in \mathbb{R}^{B \times D \times H \times W}. \tag{4}$$

These projected features are then fed into an FPN-like pixel decoder to construct three dense feature levels:

$$\begin{aligned} P_{16} &\in \mathbb{R}^{B \times D \times H \times W}, \\ P_{8} &\in \mathbb{R}^{B \times D \times 2H \times 2W}, \\ P_{4} &\in \mathbb{R}^{B \times D \times 4H \times 4W}, \end{aligned} \tag{5}$$

where $P_{16}$ is the coarsest dense feature map, while $P_{4}$ is the highest-resolution feature map used by the refinement branch. For a $512 \times 512$ training crop, these feature maps have spatial sizes of $32 \times 32$, $64 \times 64$, and $128 \times 128$, respectively.
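As a quick check of the shape arithmetic in Equations (1) and (5), the following sketch derives the token grid and the three pyramid shapes from the crop size. The helper name `pyramid_shapes` and the requirement that the crop be divisible by the patch size are assumptions of this example.

```python
def pyramid_shapes(H0, W0, patch=16, D=256, B=1):
    """Return the token count and the (B, D, h, w) shapes of P16, P8, and P4."""
    # Assumption: the crop size is an exact multiple of the patch size.
    if H0 % patch or W0 % patch:
        raise ValueError("crop size must be divisible by the patch size")
    H, W = H0 // patch, W0 // patch                 # Eq. (1)
    return {
        "N": H * W,                                 # number of patch tokens
        "P16": (B, D, H, W),                        # stride 16
        "P8": (B, D, 2 * H, 2 * W),                 # stride 8
        "P4": (B, D, 4 * H, 4 * W),                 # stride 4
    }

shapes = pyramid_shapes(512, 512)
for name, shape in shapes.items():
    print(name, shape)
```

For the default crop this reproduces the spatial sizes quoted in the text, and it makes explicit that the finest level holds 16 times as many locations as the token grid.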

The transformer decoder follows a Mask2Former-style formulation with $N_{q} = 100$ learnable queries and $L_{d} = 9$ decoder layers. Deep supervision is applied following the standard Mask2Former design. The decoder produces a coarse dense semantic prediction:

$$L_{\text{coarse}} \in \mathbb{R}^{B \times K \times H_{4} \times W_{4}}, \tag{6}$$

where $(H_{4}, W_{4}) = (4H, 4W)$ and K is the number of semantic classes.

3.3. Spatial-Frequency Adapter

The proposed SF-Adapter operates in a low-dimensional bottleneck space and consists of a frequency-domain global filtering branch, a spatial multiscale detail enhancement branch, and an adaptive fusion module. The detailed structure is shown in Figure 2. Unless otherwise stated, all experiments use the same adapter hyperparameters: bottleneck width $r = 256$, number of radial frequency bands $K_{b} = 8$, dilation rates $\{1, 2, 3\}$ in the spatial branch, and dropout rate $p = 0.1$ after fusion.

In the following, r denotes the bottleneck width of the adapter, $K_{b}$ is the number of radial frequency bands, and c, k, and $(u, v)$ index the channel, band, and spatial-frequency location, respectively. In the spatial branch, $d_{j}$ denotes the dilation rate of the j-th depthwise convolution, and m is the number of dilation branches. The symbol $\alpha$ denotes the channel-wise fusion gate used to balance the spectral and spatial responses.

The design of SF-Adapter is motivated by the observation that different structural cues in remote-sensing imagery are distributed unevenly across the spatial-frequency spectrum. Large homogeneous regions and broad land-cover layout are more strongly associated with lower-frequency components, while object boundaries, thin structures, repetitive local patterns, and abrupt category transitions rely more heavily on higher-frequency responses. In the frozen-backbone setting, however, transformer features are primarily inherited from large-scale pretraining and are more naturally biased toward transferable semantic abstraction than toward boundary-sensitive downstream prediction. Therefore, the frequency branch in SF-Adapter should be understood as a targeted compensation mechanism for insufficient high-frequency structural representation, rather than as a generic enhancement module. Correspondingly, the spatial branch complements this process by strengthening local multiscale detail modeling, and the adaptive fusion module balances the relative contribution of the two branches according to the input feature statistics.

3.3.1. Adapter Placement and Overall Formulation

The SF-Adapter is inserted after the feed-forward network of each selected transformer block. For each block output patch-token sequence $x_{p}$, the adapter predicts a residual compensation term $\Delta x$ and adds it back to the original tokens:

$$x_{p}' = x_{p} + \Delta x. \tag{7}$$

This residual design stabilizes transfer because the adapter starts from a near-identity mapping and gradually learns task-specific corrections without disrupting the frozen-backbone representation.

3.3.2. Input Normalization and Bottleneck Projection

To stabilize the input distribution, we first apply LayerNorm to the patch tokens:

$$\tilde{x} = \mathrm{LN}(x_{p}). \tag{8}$$

The normalized tokens are then projected into a bottleneck space of dimension $r = 256$:

$$z = \tilde{x} W_{d} + b_{d}, \quad W_{d} \in \mathbb{R}^{C \times r}, \tag{9}$$

where $z \in \mathbb{R}^{B \times N \times r}$. After projection, $z$ is reshaped into a 2D feature map:

$$Z = \mathcal{R}(z) \in \mathbb{R}^{B \times r \times H \times W}. \tag{10}$$

3.3.3. Frequency-Domain Global Spectral Filtering

The first branch performs global spectral filtering to compensate for the insufficient representation of high-frequency structural cues in frozen-backbone features. We first transform $Z$ into the frequency domain using a two-dimensional Fourier transform for real-valued input:

$$\hat{Z} = \mathrm{rFFT2}(Z) \in \mathbb{C}^{B \times r \times H \times (W/2 + 1)}. \tag{11}$$

Instead of learning an independent weight for every frequency location, we partition the frequency plane into $K_{b} = 8$ radial bands. Let $M_{k} \in \{0, 1\}^{H \times (W/2 + 1)}$ denote the binary mask of the k-th band, satisfying

$$\sum_{k=1}^{K_{b}} M_{k}(u, v) = 1, \quad \forall (u, v). \tag{12}$$

For each channel c and band k, we learn a complex spectral gate

$$g_{c,k} = g_{c,k}^{(\mathrm{re})} + i\, g_{c,k}^{(\mathrm{im})}, \tag{13}$$

which is expanded to the full spectrum by

$$G(c, u, v) = \sum_{k=1}^{K_{b}} M_{k}(u, v)\, g_{c,k}. \tag{14}$$

The filtered spectrum is then computed via element-wise multiplication:

$$\hat{Y}_{\text{freq}} = \hat{Z} \odot G, \tag{15}$$

and mapped back to the spatial domain:

$$Y_{\text{freq}} = \mathrm{irFFT2}(\hat{Y}_{\text{freq}}) \in \mathbb{R}^{B \times r \times H \times W}. \tag{16}$$
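The band-wise gating of Equations (11) to (16) can be sketched in NumPy as follows. The text does not specify how the radial bands are delimited, so this example assumes equal-width bands in normalized frequency radius; the function names are likewise hypothetical, and the gates are fixed arrays here rather than learned parameters.

```python
import numpy as np

def radial_band_masks(H, W, num_bands=8):
    """Binary masks over the rFFT2 half-plane, shape (K_b, H, W // 2 + 1)."""
    fy = np.fft.fftfreq(H)[:, None]        # signed vertical frequencies
    fx = np.fft.rfftfreq(W)[None, :]       # non-negative horizontal frequencies
    radius = np.sqrt(fy ** 2 + fx ** 2)
    # Assumption: equal-width bands between zero and the largest radius.
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    band = np.digitize(radius, edges[1:-1])          # band index in [0, K_b - 1]
    return np.stack([band == k for k in range(num_bands)]).astype(np.float64)

def spectral_filter(Z, gate, masks):
    """Z: (B, r, H, W) real; gate: (r, K_b) complex; masks: (K_b, H, W // 2 + 1)."""
    Z_hat = np.fft.rfft2(Z, axes=(-2, -1))                        # Eq. (11)
    G = np.einsum("ck,khw->chw", gate, masks)                     # Eq. (14)
    Y_hat = Z_hat * G[None]                                       # Eq. (15)
    return np.fft.irfft2(Y_hat, s=Z.shape[-2:], axes=(-2, -1))    # Eq. (16)

rng = np.random.default_rng(0)
Z = rng.standard_normal((1, 4, 32, 32))
masks = radial_band_masks(32, 32)

# A gate of ones leaves the feature map unchanged.
identity_gate = np.ones((4, 8), dtype=np.complex128)
Y = spectral_filter(Z, identity_gate, masks)

# Keeping only the lowest band acts as a low-pass filter.
low_pass_gate = np.zeros((4, 8), dtype=np.complex128)
low_pass_gate[:, 0] = 1.0
Y_low = spectral_filter(Z, low_pass_gate, masks)
```

Because each frequency location belongs to exactly one band, the gate tensor holds only r × K_b complex values per adapter instead of one weight per frequency location, which is the parameter saving the band partition is meant to provide.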

3.3.4. Spatial Multiscale Detail Enhancement

The second branch enhances local structures through multiscale depthwise convolutions. For dilation rates $\{d_{j}\}_{j=1}^{m} = \{1, 2, 3\}$, the feature responses are computed as

$$U_{j} = \mathrm{DWConv}^{(d_{j})}(Z), \quad j = 1, \dots, m, \tag{17}$$

where each depthwise convolution uses a kernel size of $3 \times 3$. The multiscale responses are averaged:

$$U = \frac{1}{m} \sum_{j=1}^{m} U_{j}, \tag{18}$$

and then mixed with a pointwise convolution:

$$Y_{\text{spat}} = \mathrm{PWConv}(U) \in \mathbb{R}^{B \times r \times H \times W}. \tag{19}$$
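A minimal NumPy sketch of Equations (17) to (19) is given below. It assumes zero padding equal to the dilation rate so that the spatial size is preserved, and it omits biases; neither detail is stated in the text, and the function names are hypothetical.

```python
import numpy as np

def depthwise_conv3x3(Z, weight, dilation):
    """Depthwise 3x3 cross-correlation. Z: (B, r, H, W); weight: (r, 3, 3)."""
    B, r, H, W = Z.shape
    d = dilation
    # Assumption: zero padding of size d keeps the output at H x W.
    Zp = np.pad(Z, ((0, 0), (0, 0), (d, d), (d, d)))
    out = np.zeros_like(Z)
    for i in range(3):
        for j in range(3):
            # Shifted view of the padded input for kernel tap (i, j).
            patch = Zp[:, :, i * d:i * d + H, j * d:j * d + W]
            out += weight[None, :, i, j, None, None] * patch
    return out

def spatial_branch(Z, dw_weights, pw_weight, dilations=(1, 2, 3)):
    """dw_weights: one (r, 3, 3) kernel per dilation; pw_weight: (r, r)."""
    responses = [depthwise_conv3x3(Z, w, d)
                 for w, d in zip(dw_weights, dilations)]          # Eq. (17)
    U = sum(responses) / len(dilations)                           # Eq. (18)
    return np.einsum("oc,bchw->bohw", pw_weight, U)               # Eq. (19)

rng = np.random.default_rng(0)
r = 4
Z = rng.standard_normal((1, r, 16, 16))

# Kernels with only the centre tap set behave as identity maps.
centre = np.zeros((r, 3, 3))
centre[:, 1, 1] = 1.0
Y_spat = spatial_branch(Z, [centre] * 3, np.eye(r))

# With dilation 2, a single impulse spreads to nine taps spaced two pixels apart.
impulse = np.zeros((1, r, 16, 16))
impulse[0, :, 8, 8] = 1.0
spread = depthwise_conv3x3(impulse, np.ones((r, 3, 3)), dilation=2)
```

The impulse example shows why the three dilation rates give complementary receptive fields: the same nine weights cover 3 × 3, 5 × 5, and 7 × 7 neighborhoods without additional parameters.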

3.3.5. Adaptive Spatial-Frequency Fusion

The spectral and spatial branches are fused adaptively. First, global average pooling is applied to the bottleneck feature:

$$p = \mathrm{GAP}(Z) \in \mathbb{R}^{B \times r}. \tag{20}$$

This descriptor is passed through a two-layer MLP with hidden size $r/4$ and a sigmoid function $\sigma$ to generate a channel-wise fusion gate:

$$\alpha = \sigma(\mathrm{MLP}(p)) \in \mathbb{R}^{B \times r}. \tag{21}$$

Here, $\alpha$ is broadcast along the spatial dimensions. The fused output is computed as

$$Y = \alpha \odot Y_{\text{freq}} + (1 - \alpha) \odot Y_{\text{spat}}. \tag{22}$$

We then apply a GELU activation $\phi$ and dropout:

$$Y \leftarrow \mathrm{Drop}_{p=0.1}(\phi(Y)). \tag{23}$$
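The gating in Equations (20) to (23) can be sketched as follows. The activation inside the two-layer MLP is not specified in the text, so GELU is assumed here; GELU itself is written with the common tanh approximation, and dropout is omitted, which corresponds to inference-time behavior. All function and parameter names are hypothetical.

```python
import numpy as np

def gelu(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(Z, Y_freq, Y_spat, W1, b1, W2, b2):
    """Z, Y_freq, Y_spat: (B, r, H, W); W1: (r, r // 4); W2: (r // 4, r)."""
    p = Z.mean(axis=(2, 3))                          # Eq. (20): (B, r)
    # Assumption: GELU between the two MLP layers.
    hidden = gelu(p @ W1 + b1)
    alpha = sigmoid(hidden @ W2 + b2)                # Eq. (21): (B, r)
    a = alpha[:, :, None, None]                      # broadcast over H and W
    Y = a * Y_freq + (1.0 - a) * Y_spat              # Eq. (22)
    return gelu(Y), alpha                            # Eq. (23) without dropout

rng = np.random.default_rng(0)
r = 8
Z = rng.standard_normal((2, r, 4, 4))
Y_freq = rng.standard_normal((2, r, 4, 4))
Y_spat = rng.standard_normal((2, r, 4, 4))

# With a zeroed output layer the gate is 0.5 and both branches weigh equally.
W1, b1 = rng.standard_normal((r, r // 4)), np.zeros(r // 4)
W2, b2 = np.zeros((r // 4, r)), np.zeros(r)
Y, alpha = adaptive_fusion(Z, Y_freq, Y_spat, W1, b1, W2, b2)
```

Since the gate is computed per channel from pooled statistics, each bottleneck channel can lean toward the spectral or the spatial response independently while adding only two small linear layers.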

3.3.6. Residual Projection to Token Space

Finally, the fused bottleneck feature is reshaped back to token form and projected to the original embedding dimension:

$$\Delta x = \mathcal{R}^{-1}(Y) W_{u} + b_{u}, \quad W_{u} \in \mathbb{R}^{r \times C}. \tag{24}$$

The resulting residual is added to the original patch tokens according to Equation (7). If a class token exists, it bypasses the adapter and is concatenated back unchanged.
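The residual wrapper formed by Equations (7) to (10) and (24) can be sketched as below, with the two branches and the fusion step replaced by a pluggable `body` function. Several choices are assumptions of this example and not taken from the text: the LayerNorm has no learnable affine parameters, the class token sits at index 0, and the up-projection is zero-initialized to illustrate one common way of obtaining the near-identity start mentioned in Section 3.3.1. The dimensions are also reduced for readability.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the channel axis, without learnable scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adapter_residual(x, H, W, W_d, b_d, W_u, b_u, body=lambda Z: Z, has_cls=True):
    """x: (B, N [+ 1], C). `body` stands in for Eqs. (11)-(23)."""
    cls, x_p = (x[:, :1], x[:, 1:]) if has_cls else (None, x)
    z = layer_norm(x_p) @ W_d + b_d                       # Eqs. (8)-(9)
    B, N, r = z.shape
    Z = z.reshape(B, H, W, r).transpose(0, 3, 1, 2)       # Eq. (10)
    Y = body(Z)
    y = Y.transpose(0, 2, 3, 1).reshape(B, N, r)          # inverse reshape
    delta = y @ W_u + b_u                                 # Eq. (24)
    out = x_p + delta                                     # Eq. (7)
    # The class token bypasses the adapter and is concatenated back unchanged.
    return out if cls is None else np.concatenate([cls, out], axis=1)

rng = np.random.default_rng(0)
B, H, W, C, r = 1, 4, 4, 64, 16          # reduced sizes, not the paper's 1024 / 256
x = rng.standard_normal((B, H * W + 1, C))
W_d, b_d = rng.standard_normal((C, r)) * 0.02, np.zeros(r)

# Assumption: a zero-initialized up-projection makes the adapter an exact identity.
out_init = adapter_residual(x, H, W, W_d, b_d, np.zeros((r, C)), np.zeros(C))

# After training the up-projection is non-zero and patch tokens are corrected.
W_u = rng.standard_normal((r, C)) * 0.02
out_trained = adapter_residual(x, H, W, W_d, b_d, W_u, np.zeros(C))
```

In both calls the class token is returned untouched, while only the second call changes the patch tokens, which mirrors the intended behavior of a residual correction on top of a frozen representation.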

3.4. Adaptive Sparse Refinement

The proposed ASR is a sparse residual refinement branch operating on the coarse dense prediction map. It consists of hard-region scoring, sparse-query construction, local sparse refinement, and dense residual rendering. Its architecture is shown in Figure 3. Unless otherwise stated, the default ASR settings are fixed across all datasets: Top-M ratio $\rho = 1.0\%$, query dimension $D_{q} = 256$, local window size $w = 7$, number of refinement blocks $L_{r} = 2$, number of attention heads $h = 8$, and residual patch size $s = 4$.

Here, $\rho$ denotes the sparse routing ratio, $D_{q}$ is the sparse-query dimension, w is the local refinement window size, $L_{r}$ is the number of refinement blocks, h is the number of attention heads, and s is the side length of the predicted residual patch. We use $p = (x, y)$ to denote a selected spatial location, $\Omega$ to denote the set of Top-M routed locations, and M to denote the number of selected regions. The symbols U, B, and S denote the uncertainty score, boundary score, and final routing score, respectively. Within this subsection, U and B written with a pixel argument, as in U(i) and B(i), refer to these scores and are distinct from the multiscale response U in Equation (18) and from the batch size B.

3.4.1. Inputs and Overall Objective

ASR refines the coarse logit map $L_{\text{coarse}}$ using the multiscale dense features $P_{4}$, $P_{8}$, and $P_{16}$. It predicts a residual logit map $\Delta L$ and produces the final prediction as

$$L_{\text{final}} = L_{\text{coarse}} + \eta\, \Delta L, \tag{25}$$

where $\eta$ is a learnable scalar initialized to 1.0.

3.4.2. Hard-Region Scoring and Top-M Routing

We first estimate which regions deserve additional refinement. The coarse logits are converted to class probabilities:

P = Softmax (L_{coarse}) .

(26)

The uncertainty score is computed using pixel-wise entropy:

U (i) = - \sum_{k = 1}^{K} P_{k} (i) log P_{k} (i) .

(27)

The boundary response is estimated from the average gradient magnitude of the probability field:

B (i) = \frac{1}{K} \sum_{k = 1}^{K} {∥\nabla P_{k} (i)∥}_{1},

(28)

where ∇ is implemented by a

3 \times 3

Sobel operator in the horizontal and vertical directions.

Because entropy and boundary responses may have different scales, we normalize them independently to $[0, 1]$ on each image:

$$\bar{U}(i) = \frac{U(i) - U_{\min}}{U_{\max} - U_{\min} + \epsilon}, \qquad \bar{B}(i) = \frac{B(i) - B_{\min}}{B_{\max} - B_{\min} + \epsilon}, \tag{29}$$

where $\epsilon = 10^{-6}$. The final routing score is then defined as

$$S(i) = \lambda_{u} \bar{U}(i) + \lambda_{b} \bar{B}(i), \tag{30}$$

with $\lambda_{u} = \lambda_{b} = 0.5$ in all experiments.
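The scoring steps in Eqs. (26)-(30) can be sketched in plain Python on a tiny logit map. This is a minimal illustration, not the authors' implementation: the function names, the nested-list layout `logits[k][y][x]`, and the replicate padding used at the image border are assumptions, and the L1 norm of the Sobel gradient is taken here as $|G_x| + |G_y|$.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def softmax(logits):
    """Softmax over the K class logits of one pixel (Eq. 26)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Pixel-wise entropy U(i) (Eq. 27); terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sobel_l1(channel, y, x):
    """|G_x| + |G_y| of one probability channel at (y, x); replicate padding is assumed."""
    h, w = len(channel), len(channel[0])
    gx = gy = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            v = channel[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
            gx += SOBEL_X[dy + 1][dx + 1] * v
            gy += SOBEL_Y[dy + 1][dx + 1] * v
    return abs(gx) + abs(gy)

def minmax(grid, eps=1e-6):
    """Per-image min-max normalization to [0, 1] (Eq. 29)."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo + eps) for v in row] for row in grid]

def routing_score(logits, lam_u=0.5, lam_b=0.5):
    """Map logits[k][y][x] to the routing score S[y][x] (Eq. 30)."""
    k_cls, h, w = len(logits), len(logits[0]), len(logits[0][0])
    # chans[k][y][x] holds the probability of class k at pixel (y, x).
    chans = [[[0.0] * w for _ in range(h)] for _ in range(k_cls)]
    u = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            probs = softmax([logits[k][y][x] for k in range(k_cls)])
            u[y][x] = entropy(probs)
            for k in range(k_cls):
                chans[k][y][x] = probs[k]
    # Boundary response: Sobel magnitude averaged over classes (Eq. 28).
    b = [[sum(sobel_l1(chans[k], y, x) for k in range(k_cls)) / k_cls
          for x in range(w)] for y in range(h)]
    u_n, b_n = minmax(u), minmax(b)
    return [[lam_u * u_n[y][x] + lam_b * b_n[y][x] for x in range(w)] for y in range(h)]
```

On a two-class map whose left half confidently predicts one class and whose right half confidently predicts the other, the entropy is uniform, so the score is driven by the boundary term and peaks along the class transition.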

Given the score map $S \in \mathbb{R}^{B \times 1 \times H_{4} \times W_{4}}$, we select the top-$M$ locations independently for each image:

$$\Omega = \mathrm{TopM}(S), \tag{31}$$

where

$$M = \lfloor \rho H_{4} W_{4} \rfloor, \qquad \rho = 1.0\%. \tag{32}$$

For a $512 \times 512$ crop, $H_{4} = W_{4} = 128$, so the default setting yields $M = 163$ selected locations. During training, the routing score is computed from a detached version of $L_{\mathrm{coarse}}$ so that the non-differentiable top-$M$ selection does not destabilize the coarse prediction branch.
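The routing budget in Eq. (32) and the per-image selection in Eq. (31) can be checked with a few lines of Python. The helper names `num_routed` and `top_m` are hypothetical, and the score map is treated as a plain constant, which mirrors the detached routing score described above.

```python
import math

def num_routed(h4, w4, rho=0.01):
    """M = floor(rho * H4 * W4) (Eq. 32)."""
    return math.floor(rho * (h4 * w4))

def top_m(score, m):
    """Return the m highest-scoring (x, y) locations of one image's score map."""
    flat = [(score[y][x], y, x)
            for y in range(len(score)) for x in range(len(score[0]))]
    flat.sort(key=lambda t: t[0], reverse=True)
    return [(x, y) for _, y, x in flat[:m]]

# A 512 x 512 crop gives a 128 x 128 stride-4 grid: 0.01 * 16384 = 163.84.
print(num_routed(128, 128))  # 163
```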

3.4.3. Sparse-Query Construction

For each selected location $p = (x, y) \in \Omega$, we extract multiscale evidence from $P_{4}$, $P_{8}$, and $P_{16}$ using bilinear sampling:

$$\begin{aligned} f_{4}(p) &= \mathrm{Sample}(P_{4}, p), \\ f_{8}(p) &= \mathrm{Sample}(P_{8}, p/2), \\ f_{16}(p) &= \mathrm{Sample}(P_{16}, p/4). \end{aligned} \tag{33}$$

The sampled features are concatenated together with a 2D sine-cosine positional embedding $e_{\mathrm{pos}}(p)$ and projected to a sparse-query vector:

$$q(p) = \mathrm{MLP}([f_{4}(p), f_{8}(p), f_{16}(p), e_{\mathrm{pos}}(p)]) \in \mathbb{R}^{D_{q}}, \tag{34}$$

where $D_{q} = 256$. The query-construction MLP uses two fully connected layers with a hidden dimension of 512 and GELU activation. Stacking all selected locations yields the sparse-query tensor $Q \in \mathbb{R}^{B \times M \times D_{q}}$.
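The multiscale sampling in Eq. (33) can be illustrated with a small bilinear sampler. This sketch makes several assumptions: feature maps are single-channel nested lists `fmap[row][col]`, coordinates are index-aligned (a location $(x, y)$ on $P_{4}$ maps to $(x/2, y/2)$ on $P_{8}$, as in Eq. 33), and out-of-range coordinates are clamped to the border. The positional embedding and the MLP of Eq. (34) are omitted.

```python
import math

def bilinear(fmap, x, y):
    """Bilinearly sample fmap[row][col] at fractional (x, y), clamping to the border."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bot = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bot

def multiscale_evidence(p4, p8, p16, p):
    """Return [f_4(p), f_8(p), f_16(p)] for a location p = (x, y) on the stride-4 grid."""
    x, y = p
    return [bilinear(p4, x, y), bilinear(p8, x / 2, y / 2), bilinear(p16, x / 4, y / 4)]
```

With real features, each sampled value would be a channel vector rather than a scalar, and the three vectors would be concatenated before the projection.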

3.4.4. Local Sparse Refinement Blocks

For each sparse query, we perform local cross-attention over a neighborhood on the highest-resolution dense feature map $P_{4}$. For a selected location $p$, we extract a local window $N_{w}(p)$ centered at $p$ with size $w \times w$, where the default is $w = 7$. The local features inside this window are projected to keys and values:

$$K(p), V(p) \in \mathbb{R}^{w^{2} \times D_{q}}. \tag{35}$$

A multi-head local cross-attention block updates the sparse query:

$$\tilde{q}(p) = \mathrm{MCA}(q(p), K(p), V(p)), \tag{36}$$

where MCA denotes 8-head multi-head cross-attention. The attended query is then passed through a feed-forward network with an expansion ratio of 4:

$$q'(p) = \mathrm{FFN}(\tilde{q}(p)). \tag{37}$$

Each refinement block uses pre-normalization and residual connections. We stack $L_{r} = 2$ such blocks to obtain the final refined sparse queries.
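The core of Eq. (36) is one query attending to the $w^{2}$ features of its local window. The sketch below is a deliberately simplified single-head version without learned projections, normalization, residual connections, or the FFN, so it only illustrates the attention step. Clipping the window at the image border is an assumption, since the text does not state how border windows are handled.

```python
import math

def local_window(x, y, w, width, height):
    """Coordinates of the w x w window N_w(p) centered at (x, y), clipped to the map."""
    r = w // 2
    return [(xx, yy)
            for yy in range(y - r, y + r + 1)
            for xx in range(x - r, x + r + 1)
            if 0 <= xx < width and 0 <= yy < height]

def cross_attend(q, keys, values):
    """Single-head scaled dot-product attention of one query over local keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [wt / z for wt in weights]
    # Weighted sum of the value vectors, one output entry per value dimension.
    return [sum(wt * v[j] for wt, v in zip(weights, values))
            for j in range(len(values[0]))]
```

With the default $w = 7$, an interior location attends to 49 positions, which is what keeps the per-query cost independent of the image size.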

3.4.5. Residual Prediction and Dense Rendering

The refined sparse query at each selected location predicts a local logit residual using a block-wise formulation:

$$\Delta L_{p} = \mathrm{MLP}(q'(p)) \in \mathbb{R}^{K \times s \times s}, \tag{38}$$

where the default residual patch size is $s = 4$.

To make the rendering procedure fully reproducible, we explicitly define how each predicted residual patch is written back to the dense residual map. Let $p = (x, y) \in \Omega$ denote a selected location on the coarse prediction grid. Since the default patch size $s = 4$ is even, the selected location is treated as a rendering anchor rather than a mathematically strict single-pixel center. Specifically, the spatial support of the residual patch associated with $p$ is defined as

$$R_{p} = \left( [x - \delta^{-}, x + \delta^{+}] \times [y - \delta^{-}, y + \delta^{+}] \right) \cap I, \tag{39}$$

where $\delta^{-} = \lfloor (s - 1)/2 \rfloor$, $\delta^{+} = s - 1 - \delta^{-}$, and

$$I = [0, H_{4} - 1] \times [0, W_{4} - 1] \tag{40}$$

denotes the valid image lattice. Under the default setting $s = 4$, this means that the selected location $p$ is aligned with the upper-left element of the central $2 \times 2$ subregion of the rendered patch. When a predicted patch extends beyond the image boundary, only the valid in-image portion is retained, and the out-of-range entries are discarded.
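As a quick check of Eqs. (39)-(40), the patch support can be computed directly. The function name `patch_support` is hypothetical; `x` is clipped against the grid width and `y` against the grid height, which coincide for the square crops used here.

```python
def patch_support(x, y, s, width, height):
    """Inclusive bounds (x_min, x_max, y_min, y_max) of R_p, clipped to the lattice."""
    d_minus = (s - 1) // 2      # delta^- = floor((s - 1) / 2)
    d_plus = s - 1 - d_minus    # delta^+ = s - 1 - delta^-
    return (max(x - d_minus, 0), min(x + d_plus, width - 1),
            max(y - d_minus, 0), min(y + d_plus, height - 1))

# For s = 4: delta^- = 1 and delta^+ = 2, so an interior anchor (10, 10)
# covers columns 9..12 and rows 9..12; the anchor is the upper-left
# element of the central 2 x 2 block (columns 10..11, rows 10..11).
print(patch_support(10, 10, 4, 128, 128))  # (9, 12, 9, 12)
```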

The predicted patches are then rendered back to a dense residual map through scatter-add accumulation. Let $\pi_{p}(i)$ denote the local coordinate of pixel $i$ inside the valid part of the patch associated with $p$. We first accumulate the residual logits and the overlap count as

$$\Delta L_{\mathrm{sum}}(i) = \sum_{p \in \Omega} \mathbb{1}[i \in R_{p}] \, \Delta L_{p}(\pi_{p}(i)), \tag{41}$$

$$C(i) = \sum_{p \in \Omega} \mathbb{1}[i \in R_{p}]. \tag{42}$$

The final dense residual is obtained by overlap-normalized aggregation:

$$\Delta L(i) = \frac{\Delta L_{\mathrm{sum}}(i)}{C(i) + \epsilon}, \tag{43}$$

where $\epsilon = 10^{-6}$. This normalization prevents unstable magnitude accumulation when many residual patches overlap in boundary-dense regions, and keeps the rendering operation numerically stable near image borders.
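The accumulation in Eqs. (41)-(43) can be sketched as a loop over patches. This is a minimal single-channel illustration under stated assumptions: `patches` is a hypothetical list of `((x, y), patch)` pairs with `patch[row][col]`, and a practical implementation would more likely use a batched scatter-add on tensors.

```python
def render(patches, s, width, height, eps=1e-6):
    """Overlap-normalized rendering of s x s residual patches onto a dense map."""
    d_minus = (s - 1) // 2
    total = [[0.0] * width for _ in range(height)]   # Delta L_sum (Eq. 41)
    count = [[0] * width for _ in range(height)]     # overlap count C (Eq. 42)
    for (x, y), patch in patches:
        for r in range(s):
            for c in range(s):
                yy, xx = y - d_minus + r, x - d_minus + c
                # Entries that fall outside the lattice are discarded.
                if 0 <= yy < height and 0 <= xx < width:
                    total[yy][xx] += patch[r][c]
                    count[yy][xx] += 1
    # Eq. 43: pixels covered by no patch keep a residual of exactly 0.
    return [[total[yy][xx] / (count[yy][xx] + eps) for xx in range(width)]
            for yy in range(height)]
```

Where two patches with constant residuals 1 and 3 overlap, the rendered value is approximately their mean, 2, rather than their sum, which is the behavior the normalization is meant to guarantee.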

3.5. Optimization Objective and Trainable Parameters

The overall training objective combines supervision on the Mask2Former decoder output and the ASR-refined final prediction. In the loss formulation, $\mathcal{L}_{\mathrm{head}}$ denotes the standard Mask2Former set-prediction loss, including classification and mask losses with deep supervision across decoder layers, while $\mathcal{L}_{\mathrm{coarse}}$ and $\mathcal{L}_{\mathrm{final}}$ denote the auxiliary segmentation losses applied to the coarse and refined predictions, respectively. The coefficients $\lambda_{d}$, $\lambda_{c}$, and $\lambda_{f}$ control the relative weights of the Dice term, coarse-output supervision, and final-output supervision. The dense semantic supervision on the coarse and final logit maps, $L_{\mathrm{coarse}}$ and $L_{\mathrm{final}}$, is defined as

$$\begin{aligned} \mathcal{L}_{\mathrm{coarse}} &= \mathcal{L}_{\mathrm{seg}}(L_{\mathrm{coarse}}, Y), \\ \mathcal{L}_{\mathrm{final}} &= \mathcal{L}_{\mathrm{seg}}(L_{\mathrm{final}}, Y), \end{aligned} \tag{44}$$

where $Y$ is the ground-truth semantic map. The segmentation loss is defined as the sum of pixel-wise cross-entropy and Dice loss:

$$\mathcal{L}_{\mathrm{seg}} = \mathcal{L}_{\mathrm{CE}} + \lambda_{d} \mathcal{L}_{\mathrm{Dice}}. \tag{45}$$

In all experiments, we set $\lambda_{d} = 1.0$.

The total loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{head}} + \lambda_{c} \mathcal{L}_{\mathrm{coarse}} + \lambda_{f} \mathcal{L}_{\mathrm{final}}, \tag{46}$$

where $\lambda_{c} = 0.5$ and $\lambda_{f} = 1.0$. This choice weights the final refined output twice as heavily as the coarse output while maintaining stable learning of the coarse branch.
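The weighting in Eqs. (45)-(46) can be made concrete with a small example. The sketch assumes one common soft Dice variant (class-averaged, with a smoothing constant of 1). The text does not specify the Dice formulation, so `dice_loss` is illustrative only, and all function names are hypothetical.

```python
import math

def cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy; probs[i][k] are class probabilities."""
    return -sum(math.log(max(p[y], 1e-12)) for p, y in zip(probs, labels)) / len(labels)

def dice_loss(probs, labels, smooth=1.0):
    """Soft Dice loss averaged over classes (assumed variant, smoothing assumed)."""
    k = len(probs[0])
    losses = []
    for c in range(k):
        inter = sum(p[c] for p, y in zip(probs, labels) if y == c)
        denom = sum(p[c] for p in probs) + sum(1 for y in labels if y == c)
        losses.append(1.0 - (2.0 * inter + smooth) / (denom + smooth))
    return sum(losses) / k

def seg_loss(probs, labels, lam_d=1.0):
    """L_seg = L_CE + lambda_d * L_Dice (Eq. 45)."""
    return cross_entropy(probs, labels) + lam_d * dice_loss(probs, labels)

def total_loss(l_head, l_coarse, l_final, lam_c=0.5, lam_f=1.0):
    """L = L_head + lambda_c * L_coarse + lambda_f * L_final (Eq. 46)."""
    return l_head + lam_c * l_coarse + lam_f * l_final
```

For instance, `total_loss(1.0, 2.0, 3.0)` evaluates to 1.0 + 0.5 × 2.0 + 1.0 × 3.0 = 5.0, showing that a unit of final-output loss contributes twice as much as a unit of coarse-output loss.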

Under the frozen-backbone setting, the trainable parameters include the Mask2Former segmentation head, all SF-Adapters, and the ASR module, while the DINOv3 backbone remains completely frozen throughout training. More specifically, the trainable components are:

the $1 \times 1$ projections and the FPN-like pixel decoder used to construct $P_{16}$ , $P_{8}$ , and $P_{4}$ ;
the Mask2Former transformer decoder with 100 learnable queries;
the SF-Adapters inserted into the last six DINOv3 blocks;
the ASR routing, sparse-query construction, local cross-attention refinement, and residual rendering heads.

This design preserves the transferability of the frozen foundation encoder while keeping task adaptation and refinement lightweight and fully compatible with high-resolution remote-sensing dense prediction.

4. Experiments

This section presents the experimental evaluation of ADVMSeg for remote-sensing semantic segmentation. We first describe the experimental setup, including the dataset configuration, evaluation metrics, implementation details, and comparison protocol. We then compare ADVMSeg with representative segmentation baselines on GID-15, LoveDA, and ISPRS Potsdam. Finally, ablation studies are conducted to verify the individual and joint contributions of the proposed SF-Adapter and ASR, together with the resulting accuracy–efficiency trade-off.

4.1. Experimental Settings

4.1.1. Datasets and Evaluation Metrics

We evaluate ADVMSeg on three representative remote-sensing semantic segmentation benchmarks, namely GID-15, LoveDA, and ISPRS Potsdam. To ensure fair comparison across methods, all datasets are processed under a unified RGB-only setting and a consistent training–validation–test protocol.

These three datasets were selected intentionally rather than randomly, because they provide complementary evaluation perspectives for the proposed framework. Specifically, GID-15 mainly tests fine-grained land-cover discrimination and thus stresses representation adaptation. LoveDA introduces stronger cross-scene complexity through urban–rural discrepancy, scale variation, and cluttered backgrounds. ISPRS Potsdam places greater emphasis on high-resolution urban parsing, small-object delineation, and boundary-sensitive refinement. Therefore, the three datasets together represent complementary challenges, including fine-grained semantics, complex scene adaptation, and fine-structure delineation.

GID is a large-scale land-cover dataset constructed from Gaofen-2 satellite imagery and provides pixel-level annotated GF-2 images [56]. In this work, we adopt the GID-15 setting, which contains 15 fine-grained land-cover categories. Following the official benchmark protocol, we use 100 images for training, 10 images for validation, and 40 images for testing.

LoveDA is a large-scale land-cover benchmark for semantic segmentation and domain adaptation. It contains 5987 annotated high-resolution images at 0.3 m spatial resolution, covering both urban and rural scenes with seven semantic categories [57]. Following the official split, we use 2522 images for training, 1669 images for validation, and 1796 images for testing. Both urban and rural samples are jointly used in training and evaluation under the standard semantic segmentation setting.

ISPRS Potsdam is a very-high-resolution urban semantic labeling benchmark composed of 38 orthophoto patches of size $6000 \times 6000$ with a ground sampling distance of 5 cm [58]. Since the official benchmark mainly provides a public training set and a held-out test benchmark without a fixed validation partition, we follow a commonly adopted practice and construct an internal split from the publicly available annotated tiles. Specifically, we use 18 annotated tiles for training and 6 annotated tiles for validation. Since the official test labels are not publicly available, all quantitative results reported in this work on Potsdam are based on this internal split rather than the official hidden test benchmark. Therefore, the Potsdam results in this paper should be interpreted as controlled comparisons under our unified experimental protocol, and not as directly comparable replacements for results reported on the official benchmark server. Only the RGB channels are used in all experiments for consistency with the RGB satellite-pretrained DINOv3 backbone [14]. The near-infrared channel and DSM information are not used unless otherwise stated.

For all three datasets, large images are cropped into fixed-size patches for training. During training, each sample is randomly cropped to $512 \times 512$. For GID-15 and LoveDA, images smaller than the crop size are padded before cropping when necessary. For Potsdam, the original large tiles are sampled using random crops from the full-resolution annotated images.

Following common practice in semantic segmentation, we use mean Intersection over Union (mIoU) as the primary evaluation metric. In addition, we report the mean F1-score (mF1) and overall accuracy (OA). For refinement-related experiments on Potsdam, we further report Boundary IoU (B-IoU) to better quantify geometric delineation quality. For B-IoU, the boundary maps are generated from the predicted and ground-truth masks using a fixed morphological boundary width of 3 pixels on the evaluation map.
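The three summary metrics can all be derived from a class confusion matrix. The sketch below shows one conventional way to do so. Scoring a class that is absent from both prediction and ground truth as 0 is an assumption (some toolkits exclude such classes instead), and B-IoU is not covered.

```python
def metrics(conf):
    """Return (mIoU, mF1, OA) from conf[t][p]: pixels of true class t predicted as p."""
    k = len(conf)
    total = sum(sum(row) for row in conf)
    ious, f1s = [], []
    for c in range(k):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                         # missed pixels of class c
        fp = sum(conf[r][c] for r in range(k)) - tp    # pixels wrongly labeled c
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    oa = sum(conf[c][c] for c in range(k)) / total
    return sum(ious) / k, sum(f1s) / k, oa

# Two classes, 20 pixels: per-class IoU is 8/11 and 9/12, OA is 17/20.
miou, mf1, oa = metrics([[8, 2], [1, 9]])
```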

4.1.2. Implementation Details

We adopt the official DINOv3 ViT-L/16 distilled checkpoint pretrained on SAT-493M as the frozen backbone [14]. All backbone parameters remain frozen during training, and only the segmentation head and the proposed lightweight modules are optimized. All experiments were conducted on a single NVIDIA A40 GPU with a batch size of 8. We used the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$, weight decay of $0.05$, and Adam coefficients $(0.9, 0.999)$. The learning rate was scheduled by a polynomial decay policy with power $0.9$. The overall training process was iteration-based with 40,000 iterations.
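A polynomial decay schedule with these settings is commonly written as $\mathrm{lr}(t) = \mathrm{lr}_{0} (1 - t/T)^{0.9}$. The sketch below uses that standard form. The final learning rate of 0 and the absence of warm-up are assumptions, since the text does not state them.

```python
def poly_lr(iteration, base_lr=1e-4, max_iters=40_000, power=0.9, min_lr=0.0):
    """Polynomial learning-rate decay from base_lr to min_lr over max_iters."""
    frac = 1.0 - iteration / max_iters
    return (base_lr - min_lr) * frac ** power + min_lr

# Halfway through training the rate is 1e-4 * 0.5 ** 0.9, roughly 5.36e-5.
for it in (0, 20_000, 40_000):
    print(it, poly_lr(it))
```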

For the proposed SF-Adapter, the frequency-domain branch partitions the spectrum into $K_{b} = 8$ radial bands. In the spatial branch, we use three parallel depthwise convolutions with dilation rates $\{1, 2, 3\}$ to model multiscale local details. Unless otherwise stated, these settings are fixed across all datasets and experiments. During inference, sliding-window evaluation was adopted for large images, and the final prediction was obtained by stitching the overlapping windows.
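One way to lay out overlapping windows for sliding-window inference is shown below. The window size of 512 (matching the training crop) and the stride of 384 are hypothetical values chosen for illustration, as the text does not report them. The sketch only computes window offsets along one axis and leaves the stitching rule open.

```python
def window_starts(length, window=512, stride=384):
    """Start offsets along one axis so that windows cover [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window, stride))
    starts.append(length - window)  # last window sits flush with the image edge
    return starts

# A 6000-pixel Potsdam tile side yields 16 window positions per axis
# under these assumed settings, the last one starting at 5488.
print(len(window_starts(6000)), window_starts(6000)[-1])
```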

4.1.3. Baseline Settings and Comparison Protocol

To make the evaluation scientifically rigorous, all comparison results reported in this work are obtained from our own reimplementation under a unified experimental protocol. Specifically, representative CNN-based, Transformer-based, hybrid, and VFM-based segmentation methods are reimplemented and trained under the same setting as ADVMSeg, including consistent input modality, data preprocessing, optimization strategy, and evaluation protocol. In this way, the reported comparisons are intended to provide a fair and controlled assessment of the relative effectiveness of different methods on GID-15, LoveDA, and ISPRS Potsdam.

4.2. Benchmark Results

For clearer presentation, we report the cross-method comparison separately on GID-15 [56], LoveDA [57], and ISPRS Potsdam [58].

4.2.1. GID-15

Compared with GID-5, GID-15 is a substantially more challenging fine-grained benchmark because it decomposes broad land-cover regions into 15 semantic categories, including paddy field, irrigated land, dry cropland, garden, arbor forest, shrub land, natural meadow, artificial meadow, industrial land, urban residential, rural residential, traffic land, river, lake, and pond. Table 1 presents the comparison results.

GID-15 is considerably more challenging than coarse-category land-cover benchmarks because its finer label decomposition imposes substantially higher demands on semantic discrimination and local structural sensitivity. As shown in Table 1, ADVMSeg achieves the best overall performance under the unified protocol, reaching 63.1% mIoU, 75.4% mF1, and 77.6% OA. The improvement over recent strong baselines such as GeoSA, SkySense, and DBBANet indicates that the proposed framework remains effective when the task shifts from broad land-cover recognition to more detailed fine-grained parsing.

The class-wise pattern further clarifies where the gain comes from. ADVMSeg performs best on paddy fields, irrigated land, dry cropland, shrub land, natural meadow, artificial meadow, industrial land, traffic land, and river, while remaining highly competitive on the remaining categories. By contrast, GeoSA preserves slight advantages on arbor forest, urban residential, rural residential, lake, and pond, and CMTFNet performs best on garden. This distribution suggests that the strength of ADVMSeg is not merely a uniform increase in all categories, but a more consistent treatment of classes associated with fragmented spatial arrangement, mixed textures, and weak inter-class boundaries. Such a trend is consistent with the design of the method: SF-Adapter improves the representation of spatial-frequency details, while ASR further concentrates refinement on regions that remain structurally difficult after coarse prediction.

From an application perspective, these gains are meaningful because the improved categories are often associated with fragmented parcels, weak boundaries, and mixed-texture regions, which are among the most error-prone areas in practical land-cover mapping. In real workflows, segmentation errors in such regions usually propagate to subsequent manual correction, map updating, and region-level statistics. Therefore, the advantage of ADVMSeg on GID-15 is not only reflected in higher summary metrics, but also in a potentially lower correction burden for fine-grained land-cover interpretation.

4.2.2. LoveDA

LoveDA contains seven semantic categories and is widely recognized as a challenging benchmark due to strong scale variation, cluttered backgrounds, and evident urban–rural domain discrepancy. Table 2 reports the comparison results.

As Table 2 shows, ADVMSeg achieves the best overall performance on LoveDA despite these difficulties, reaching 63.5% mIoU, 76.7% mF1, and 69.2% OA. Compared with the strongest competing baselines, the proposed framework provides consistent gains in all three summary metrics, indicating that it generalizes effectively under large appearance variation and cross-domain scene complexity.

From the per-class IoU perspective, ADVMSeg achieves the best performance on six of the seven categories, namely background, road, water, barren, forest, and agriculture, while remaining below RSAM-Seg on building. The largest improvements appear on roads and barren land, which are typically associated with elongated geometry, fragmented spatial distribution, and ambiguous transitions with surrounding regions. This pattern suggests that ADVMSeg is particularly effective on categories that depend strongly on local structural cues and boundary preservation. It also indicates that the gain is achieved by improving fine-scale prediction without noticeably sacrificing global semantic consistency. This is practically important because roads and fragmented barren regions are also the categories most likely to suffer from breakage, false merging, and omission in real mapping workflows. Therefore, the gain here reflects not only higher IoU, but also better robustness on application-critical structures.

4.2.3. Potsdam

ISPRS Potsdam contains high-resolution urban scenes with complex geometric transitions and small objects. It serves as an ideal benchmark for evaluating fine-structure delineation and boundary refinement quality. Table 3 presents the quantitative comparison.

As Table 3 shows, ADVMSeg achieves the best overall performance on Potsdam, reaching 81.4% mIoU, 89.6% mF1, and 89.6% OA. This result is meaningful because Potsdam emphasizes very-high-resolution urban parsing, where performance is strongly influenced not only by semantic recognition, but also by local geometric precision, boundary quality, and the treatment of small structures. The consistent margin over recent baselines, therefore, suggests that ADVMSeg remains effective in urban scenes where dense prediction depends simultaneously on global semantics and fine-grained delineation.

The class-wise F1 pattern further clarifies the source of the gain. ADVMSeg performs best on low vegetation, tree, car, and clutter, while GeoSA remains slightly stronger on impervious surface and building. Compared with large rigid man-made structures, the categories on which ADVMSeg performs best are more sensitive to local ambiguity, transition-heavy regions, and irregular spatial organization. This trend indicates that the improvement is not simply a uniform boost across all urban categories, but a more effective resolution of structurally difficult regions. This improvement is valuable because errors on small urban objects, vegetation boundaries, and cluttered transition regions can directly affect fine-scale urban mapping, object inventory, and spatial management tasks. In high-resolution scenes, boundary leakage, object sticking, and small-target omission often dominate the cost of post-processing and manual revision. Therefore, the gain of ADVMSeg on Potsdam suggests not only stronger benchmark performance, but also better suitability for high-resolution urban interpretation workflows.

4.3. Qualitative Comparison

To further validate the effectiveness of ADVMSeg beyond quantitative metrics, Figure 4 presents representative qualitative comparisons on Potsdam, LoveDA, and GID-15. The selected examples are intentionally drawn from visually challenging scenes, including thin structures, irregular object boundaries, fragmented land-cover regions, and semantically confusing transitions.

On Potsdam, competing methods generally capture the main semantic layout, but they still exhibit noticeable boundary leakage and local shape distortion around buildings, roads, and small objects. In contrast, ADVMSeg produces sharper contours and more complete object structures, especially near complex building edges and narrow transition areas. This observation is consistent with the quantitative gains on Potsdam and further supports the effectiveness of ASR in refining hard regions with complex geometry.

On LoveDA, the differences are more evident in fragmented built-up regions and heterogeneous land-cover transitions. Competing methods tend to produce local misclassification or broken object structures, particularly in areas where roads, buildings, vegetation, and barren land are tightly interleaved. ADVMSeg yields more coherent region prediction and better preserves the continuity of elongated and fragmented structures, indicating that the proposed SF-Adapter effectively improves feature discrimination under strong scale variation and complex background interference.

On GID-15, most methods already provide relatively reasonable large-area semantic predictions, yet visible differences remain in boundary smoothness and class consistency near man-made structures and vegetation regions. ADVMSeg produces cleaner large-region segmentation with fewer isolated artifacts while also preserving sharper local transitions. This suggests that the proposed framework improves difficult local details while maintaining stable global semantic consistency across broad remote-sensing scenes.

Overall, the qualitative results are highly consistent with the quantitative comparisons. ADVMSeg achieves a favorable balance between global semantic coherence and fine-grained local delineation, which helps explain its strong performance across all three benchmarks.

4.4. Ablation and Analysis

4.4.1. Ablation Study on the Spatial-Frequency Adapter

To isolate the contribution of the proposed SF-Adapter, we perform progressive ablations built upon a frozen DINOv3-SAT backbone coupled with a Mask2Former decoding head. The baseline model bypasses the adaptation module entirely. We then gradually introduce a plain bottleneck adapter, the independent spatial and spectral branches, and finally the adaptive fusion mechanism. To evaluate module behavior under different scene characteristics, we report results on GID-15, LoveDA, and ISPRS Potsdam.

As shown in Table 4, the frozen DINOv3 with Mask2Former baseline already provides a reasonable transfer starting point, indicating that the pretrained foundation representation remains informative even without explicit task adaptation. However, the remaining gap on all three datasets also shows that dense remote-sensing prediction still requires additional structural adaptation beyond generic frozen transfer.

The plain bottleneck adapter yields only limited improvement, suggesting that simply inserting trainable parameters is insufficient to address the representation mismatch between frozen transformer features and high-resolution remote-sensing segmentation. By contrast, the spatial and spectral branches exhibit clearly different behaviors. The spatial branch is more beneficial on Potsdam, where local geometry and boundary quality are especially important, while the spectral branch produces relatively larger gains on GID-15 and LoveDA, where repetitive textures, large-area land-cover patterns, and stronger appearance variation are more prominent. Although directly combining the two branches already outperforms the single-branch variants, it remains inferior to the full SF-Adapter with adaptive fusion. This indicates that spatial and spectral cues are complementary but not uniformly useful across scenes, and that dynamic balancing between them is more effective than naive aggregation.

To further elucidate the internal mechanisms of the SF-Adapter and corroborate the quantitative findings, we provide a qualitative comparison of the feature activation maps and the corresponding segmentation outputs, as illustrated in Figure 5. A consistent observation is that the frozen baseline backbone tends to extract globally smooth representations that lack local high-frequency details, resulting in dispersed and blurry activation regions. In contrast, the integration of the SF-Adapter significantly sharpens the feature maps, precisely localizing object boundaries, fine structures, and complex textures. This feature-level enhancement directly translates to superior macro-level segmentation performance.

Specifically, for highly textured natural targets such as tree crowns, the frozen baseline tends to produce diffuse activation patterns, which are associated with fragmented predictions and insufficient boundary conformity. After introducing the SF-Adapter, the responses become more concentrated around relevant structures, leading to predictions that better follow irregular object contours. For artificial objects with clear geometric structure, the baseline often exhibits feature spreading across adjacent regions, making it difficult to preserve sharp corners and clear semantic transitions. In comparison, the SF-Adapter improves spatial localization and yields more regular geometric boundaries. Overall, the visual evidence suggests that the SF-Adapter improves the structural suitability of frozen foundation features for dense remote-sensing prediction.

4.4.2. Ablation Study on ASR

To verify the efficiency and precision of the proposed ASR module, we evaluate different refinement strategies on the ISPRS Potsdam dataset. Potsdam is selected because its complex urban layouts and intricate object boundaries are especially demanding on boundary quality. Specifically, we compare dense refinement, random sparse refinement, uncertainty-based selection, boundary-based selection, and the complete ASR module.

As shown in Table 5, the baseline equipped only with SF-Adapter already achieves strong performance on Potsdam, indicating that backbone-level feature adaptation resolves a substantial portion of the representation gap before refinement. Dense refinement over the full high-resolution prediction map yields the highest absolute accuracy, but at a clear computational cost. This confirms that brute-force refinement can be effective, yet is not an efficient solution for HRSIs.

The sparse variants reveal that the effectiveness of refinement depends strongly on routing quality. Random sparse refinement provides only a marginal gain, showing that sparsity alone is insufficient. In contrast, uncertainty-based and boundary-based selection both outperform random routing, suggesting that residual errors are indeed concentrated in ambiguous and boundary-sensitive regions. The full ASR module achieves the best sparse refinement result by combining both signals. Compared with the unrefined baseline, it improves mIoU by 0.6 points and B-IoU by 0.6 points with only moderate overhead, while recovering most of the benefit of dense refinement. This result supports the central motivation of ASR: refinement should not be applied everywhere, but should instead focus on the small subset of regions where coarse prediction remains unreliable.

To visually corroborate the efficacy of the ASR module in resolving localized prediction failures, Figure 6 illustrates its internal correction mechanism across diverse challenging scenarios. As observed in the initial predictions, the baseline often struggles with fine-grained spatial details, completely missing thin continuous topologies like narrow paths and failing to recall isolated buildings. It also produces blunt, imprecise boundaries around complex semantic structures. The ASR module systematically addresses these deficiencies by first generating an Uncertainty Map that precisely highlights these high-entropy, error-prone regions. Guided by this map, the Adaptive Top-M Sparse Routing dynamically allocates computational resources—visualized as clustered sampling points—exclusively to these critical areas, bypassing flat, easy-to-classify background regions. Subsequently, the Sparse Residual Correction applies targeted local feature compensations. Consequently, our final refined outputs successfully restore the continuous linear structures, accurately recover the previously lost tiny isolated targets, and significantly sharpen complex boundaries. This qualitative evidence confirms that ASR effectively functions as a highly targeted refinement stage, maximizing boundary quality and small-object recall while strictly constraining computational overhead to the most necessary spatial locations.

4.4.3. Ablation Study on the Interaction Between SF-Adapter and ASR

To clarify the individual and joint contributions of SF-Adapter and ASR, we conduct an inter-module ablation study on GID-15, LoveDA, and ISPRS Potsdam. Starting from a baseline that combines a frozen DINOv3 backbone with a Mask2Former decoder, we separately attach the SF-Adapter or ASR, and finally combine both modules to form the complete ADVMSeg framework.

As shown in Table 6, both proposed modules are individually effective, while their combination yields the best performance on all three benchmarks. Introducing ASR alone already improves the frozen baseline on GID-15, LoveDA, and Potsdam, indicating that selective hard-region refinement can correct part of the residual errors even when the backbone representation has not been explicitly adapted. Nevertheless, the gain of ASR alone remains smaller than that of SF-Adapter alone, which suggests that improving the coarse dense representation is the dominant factor in overall performance improvement.

By contrast, attaching SF-Adapter alone leads to substantially larger gains across all datasets. This indicates that the primary limitation of the frozen satellite-pretrained backbone lies at the representation level: without additional adaptation, the dense features remain insufficiently sensitive to local structure, high-frequency detail, and fine-scale class transitions. Most importantly, the complete ADVMSeg consistently outperforms either single-module variant. Relative to SF-Adapter alone, adding ASR further improves mIoU by 0.9 points on GID-15, 0.7 points on LoveDA, and 0.6 points on Potsdam. This pattern indicates a progressive and complementary interaction between the two modules: SF-Adapter strengthens the dense feature foundation, while ASR selectively allocates additional computation to the subset of regions that remain difficult after coarse prediction.

4.5. Efficiency Analysis

Since ADVMSeg is designed to improve dense prediction through lightweight backbone adaptation and selective sparse refinement on top of a frozen foundation model, a comprehensive efficiency evaluation is essential. In addition to segmentation accuracy, we are particularly concerned with three questions: (1) how much additional parameter and computational cost is introduced by the proposed modules; (2) whether the cost of ADVMSeg remains concentrated on a small set of meaningful components rather than being spread across the whole prediction map; and (3) whether the computational behavior of ASR remains favorable under different refinement budgets and input resolutions.

4.5.1. Overall Parameter and Computational Efficiency

We first compare the overall complexity of different adaptation and refinement strategies. Specifically, we report the total number of parameters, the number of trainable parameters, the trainable ratio, inference GFLOPs, inference latency, and peak training memory. The compared variants include the frozen DINOv3 with Mask2Former baseline, a plain bottleneck adapter, the proposed SF-Adapter, SF-Adapter with dense refinement, and the complete ADVMSeg with ASR.

For full fine-tuning, however, our current computational resources do not allow us to conduct a complete large-scale training run with an unfrozen DINOv3 backbone under the same protocol. Therefore, we include full fine-tuning only as a parameter-scale reference, reporting its total and trainable parameter counts for comparison, while no corresponding accuracy, GFLOPs, latency, or memory results are reported.

Table 7 shows that ADVMSeg preserves the core efficiency advantage of frozen-backbone adaptation. Compared with the full fine-tuning reference, ADVMSeg reduces the number of trainable parameters from 344.8 M to 50.5 M, corresponding to an 85.4% reduction. Meanwhile, within the complete ADVMSeg model itself, the trainable ratio remains only 14.2%. This comparison indicates that ADVMSeg remains firmly within a parameter-efficient adaptation regime, and that its performance gain is achieved without resorting to full backbone updating.

The plain bottleneck adapter increases the trainable parameter count only modestly relative to the frozen baseline, but the accuracy improvement remains limited. By contrast, SF-Adapter introduces only a small additional overhead beyond the plain adapter, yet brings a much larger gain in mIoU. This result is important because it shows that the benefit of SF-Adapter is not merely due to adding more parameters, but due to its targeted spatial-frequency design.

The comparison between ADVMSeg and dense refinement is particularly revealing. Dense refinement achieves the highest absolute mIoU in this reference setting, but it also incurs the largest computational and latency overhead. ADVMSeg recovers most of the accuracy gain of dense refinement while requiring substantially fewer GFLOPs, lower latency, and lower memory. Specifically, compared with dense refinement, ADVMSeg reduces the computational cost from 421.0 to 381.0 GFLOPs and the latency from 176.0 to 148.0 ms, while sacrificing only 0.3 percentage points in mIoU. This supports the claim that selective sparse refinement is a much more efficient way to correct residual errors in high-resolution remote-sensing prediction.

4.5.2. Module-Wise Complexity Decomposition

To further understand where the cost of ADVMSeg comes from, we decompose the complete model into the frozen backbone, the pixel decoder, the SF-Adapter, the ASR routing stage, and the ASR local refinement stage. For each component, we report its contribution to total GFLOPs and inference latency.

As shown in Table 8, the dominant cost of ADVMSeg still lies in the frozen backbone and the dense pixel decoding pipeline. Together, these two parts account for approximately 359.0 GFLOPs, or about 94.2% of the total computation, and 123.0 ms of the total latency. This is expected because high-resolution transformer-based dense prediction remains fundamentally dominated by dense feature extraction and decoding.

By comparison, the proposed SF-Adapter contributes only 7.0 GFLOPs and 6.0 ms, which is relatively minor compared with the full pipeline. This is an important observation: the representation-level gain brought by SF-Adapter is achieved with very limited additional cost.

Within ASR, the routing stage itself is almost negligible, contributing only 1.8 GFLOPs and 2.0 ms. Most of the extra refinement cost comes from the sparse local refinement stage, which adds 13.2 GFLOPs and 17.0 ms. This decomposition verifies that the computational burden of ASR is concentrated on a small number of selected hard regions rather than spread over the full prediction map. Therefore, the efficiency of ASR does not come from making refinement cheaper everywhere, but from restricting refinement to where it is actually needed.
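The decomposition can be verified by summing the per-module costs and expressing each as a share of the total. The snippet below is a small bookkeeping sketch using only the figures quoted in this subsection; the dictionary layout and variable names are illustrative choices, and the backbone and pixel decoder share one entry because the text reports them jointly.

```python
# Per-module cost of ADVMSeg as quoted in the text for Table 8:
# name -> (GFLOPs, latency in ms).
MODULE_COST = {
    "frozen backbone + pixel decoder": (359.0, 123.0),
    "SF-Adapter": (7.0, 6.0),
    "ASR routing": (1.8, 2.0),
    "ASR local refinement": (13.2, 17.0),
}

total_gflops = sum(g for g, _ in MODULE_COST.values())
total_ms = sum(t for _, t in MODULE_COST.values())

# Share of total computation attributed to each module.
shares = {name: g / total_gflops for name, (g, _) in MODULE_COST.items()}

for name, (g, t) in MODULE_COST.items():
    print(f"{name:32s} {g:6.1f} GFLOPs ({shares[name]:5.1%})  {t:5.1f} ms")
print(f"{'total':32s} {total_gflops:6.1f} GFLOPs           {total_ms:5.1f} ms")
```

The sums reproduce the 381.0 GFLOPs and 148.0 ms reported for the complete ADVMSeg model in Section 4.5.1, and the two ASR stages together account for roughly 4% of the total computation.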

4.5.3. Trade-Off Analysis Based on ASR Ablation

The ASR ablation results in Table 5 already provide a direct view of the refinement-stage accuracy–efficiency trade-off. Dense refinement gives the strongest absolute refinement result, improving mIoU by 0.9 points and B-IoU by 0.8 points over the unrefined baseline. However, this gain comes with a large cost increase of 55.0 GFLOPs and 44.0 ms latency. In contrast, the proposed ASR improves mIoU by 0.6 points and B-IoU by 0.6 points while requiring only 15.0 extra GFLOPs and 16.0 extra milliseconds.

From a marginal-efficiency perspective, ASR is substantially more favorable than dense refinement. Using the values in Table 5, ASR yields approximately 0.40 mIoU gain per additional 10 GFLOPs, whereas dense refinement yields only about 0.16. Measured against latency, ASR achieves roughly 0.38 mIoU gain per additional 10 ms, while dense refinement provides only about 0.20. Therefore, although dense refinement remains slightly stronger in absolute accuracy, ASR extracts substantially more usable accuracy from each unit of extra computation.
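The marginal-efficiency figures follow directly from dividing the mIoU gain by the extra cost and rescaling to a unit of 10 GFLOPs or 10 ms. A minimal sketch of this calculation is given below; the function name and the data layout are hypothetical, and the inputs are the Table 5 values quoted in the preceding paragraph.

```python
def gain_per_unit(miou_gain: float, extra_cost: float, unit: float = 10.0) -> float:
    """mIoU points gained per `unit` of additional cost (GFLOPs or ms)."""
    if extra_cost <= 0:
        raise ValueError("extra_cost must be positive")
    return miou_gain / extra_cost * unit


# name -> (mIoU gain in points, extra GFLOPs, extra latency in ms).
REFINEMENT = {
    "ASR": (0.6, 15.0, 16.0),
    "dense refinement": (0.9, 55.0, 44.0),
}

for name, (gain, gflops, ms) in REFINEMENT.items():
    per_gflops = gain_per_unit(gain, gflops)
    per_ms = gain_per_unit(gain, ms)
    print(f"{name}: {per_gflops:.2f} mIoU per 10 GFLOPs, {per_ms:.3f} mIoU per 10 ms")
```

With these inputs, ASR returns about 2.4 times as much mIoU per GFLOP as dense refinement and about 1.8 times as much per millisecond.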

The comparisons among sparse selection strategies further reveal that informative routing is necessary. Random sparse refinement improves performance only marginally, indicating that sparsity alone is insufficient. Uncertainty-based and boundary-based selection both perform better than random routing, suggesting that residual errors are indeed associated with ambiguous or boundary-sensitive regions. The full ASR module, which combines both signals, consistently achieves the best sparse refinement result. This confirms that ASR is not merely a lightweight substitute for dense refinement, but an informed refinement mechanism that targets the most valuable correction regions.

4.5.4. Scalability Analysis of Sparse Refinement

Because the computational behavior of ASR is controlled mainly by the refinement budget, we first vary the Top-M ratio on Potsdam to study how the number of selected hard locations affects both segmentation quality and efficiency.

As shown in Table 9, increasing the Top-M ratio initially improves both mIoU and B-IoU, indicating that allocating more sparse queries to difficult regions is beneficial when the refinement budget is too small. When the ratio increases from 0.25% to 1.0%, the mIoU improves from 80.9% to 81.4%, while B-IoU increases from 76.7% to 77.1%. This confirms that a moderate increase in sparse refinement capacity can effectively correct hard errors around object boundaries and complex local transitions.

However, the gain quickly saturates once the ratio exceeds 1.0%. Increasing the Top-M ratio from 1.0% to 2.0% no longer improves mIoU, and only marginally improves B-IoU. Further increasing the ratio to 5.0% leads to a large rise in computational cost, approaching the overhead of dense refinement, but without providing meaningful additional gains. This saturation behavior is particularly important: it indicates that ASR does not rely on brute-force dense correction. Instead, only a small subset of hard regions is sufficient to recover most of the achievable refinement benefit. Based on this trade-off, we select 1.0% as the default Top-M ratio in ADVMSeg. This value is intended as a practical shared default under the evaluated setting rather than a universally optimal choice. Because the refinement budget is defined by the relative ratio $\rho H_{4} W_{4}$ and the routed locations are determined by image-wise normalized uncertainty and boundary cues, the same $\rho$ can maintain a comparable refinement density across datasets while still adapting to the difficulty distribution of each image.
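How a single ratio ρ translates into a per-image set of routed locations can be sketched as follows. This is a minimal illustration rather than the ASR implementation: the function names, the min-max form of the image-wise normalization, the equal-weight sum of the two cues, the ceiling used to round the budget, and the floor of one location are all assumptions made here for concreteness. The text itself only states that the budget follows ρH_{4}W_{4} and that routing uses image-wise normalized uncertainty and boundary cues.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def minmax_normalize(grid: Grid) -> Grid:
    """Rescale one image's score map to [0, 1] (assumed image-wise normalization)."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0.0:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / span for v in row] for row in grid]


def top_m_routing(uncertainty: Grid, boundary: Grid, rho: float) -> List[Tuple[int, int]]:
    """Return the (row, col) positions of the M highest-scoring locations.

    M = ceil(rho * H4 * W4), with H4 x W4 the size of the score map and a
    floor of one location. The score is an assumed equal-weight sum of the
    two normalized cues.
    """
    h4, w4 = len(uncertainty), len(uncertainty[0])
    budget = max(1, math.ceil(rho * h4 * w4))
    u = minmax_normalize(uncertainty)
    b = minmax_normalize(boundary)
    scored = [(u[i][j] + b[i][j], i, j) for i in range(h4) for j in range(w4)]
    # Sort by descending score; ties are broken by position for determinism.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return [(i, j) for _, i, j in scored[:budget]]


# Toy 4 x 4 maps with one clearly hard location at (1, 2).
unc = [[0.1, 0.1, 0.2, 0.1],
       [0.1, 0.3, 0.9, 0.2],
       [0.1, 0.2, 0.4, 0.1],
       [0.0, 0.1, 0.1, 0.1]]
bnd = [[0.0, 0.0, 0.5, 0.0],
       [0.0, 0.5, 1.0, 0.5],
       [0.0, 0.0, 0.5, 0.0],
       [0.0, 0.0, 0.0, 0.0]]

selected = top_m_routing(unc, bnd, rho=0.125)  # budget = ceil(0.125 * 16) = 2
print(selected)
```

If H_{4} and W_{4} are read as the stride-4 feature-map size (an assumption here), a 512 × 512 window gives 128 × 128 = 16,384 candidate positions, so ρ = 1.0% corresponds to 164 routed locations under this ceiling convention. Because each cue is normalized per image, the selected set shifts toward whichever regions are hardest in that image while the budget stays fixed.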

We further investigate how the framework scales with different input resolutions. Since remote-sensing inference is often performed in a sliding-window manner over very large images, it is important to examine whether the extra cost introduced by sparse refinement remains manageable as the window size increases. For this purpose, we compare the SF-Adapter-only model, the complete ADVMSeg, and the dense-refinement variant under different input resolutions.

Table 10 shows that the computational cost of all variants increases with resolution, which is expected for high-resolution dense prediction. Two trends are especially noteworthy. First, the absolute accuracy difference among the three variants remains relatively stable as the resolution increases, which means that the benefit of ASR is not restricted to a particular window size. Second, the computational gap between ADVMSeg and dense refinement widens at larger resolutions.

At 512 × 512, ADVMSeg reduces the cost of dense refinement by 40.0 GFLOPs and 28.0 ms, while losing only 0.3 points in mIoU. At 768 × 768, the saving grows to 92.0 GFLOPs and 64.0 ms. At 1024 × 1024, the saving further expands to 172.0 GFLOPs and 119.0 ms. Therefore, the advantage of sparse refinement becomes more pronounced as the inference window grows. This is highly relevant for remote-sensing applications, where large image sizes and sliding-window inference are unavoidable in practice.
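To give a sense of how the per-window savings accumulate during sliding-window inference, the sketch below multiplies them by the number of windows needed to cover one scene. The per-window savings are the values quoted above; the 6000 × 6000 scene size, the non-overlapping stride, and the function name are hypothetical choices made only for illustration.

```python
import math

# Per-window savings of ADVMSeg over dense refinement, as quoted in the text:
# window size -> (GFLOPs saved, latency saved in ms).
SAVINGS = {512: (40.0, 28.0), 768: (92.0, 64.0), 1024: (172.0, 119.0)}


def num_windows(scene: int, window: int, stride: int) -> int:
    """Number of windows covering a square scene of side `scene` pixels."""
    per_side = 1 if scene <= window else math.ceil((scene - window) / stride) + 1
    return per_side * per_side


SCENE = 6000  # hypothetical square scene side in pixels (an assumption)

for window, (gflops_saved, ms_saved) in SAVINGS.items():
    n = num_windows(SCENE, window, stride=window)  # non-overlapping windows
    print(f"{window} x {window}: {n} windows, "
          f"{n * gflops_saved / 1000:.2f} TFLOPs and "
          f"{n * ms_saved / 1000:.2f} s saved per scene")
```

Overlapping strides, which are common in practice to suppress window-border artifacts, would increase the window count and therefore the absolute savings per scene.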

Overall, the efficiency analysis reveals a consistent picture. SF-Adapter improves backbone-level dense feature quality with only modest additional overhead, while ASR recovers most of the benefit of dense refinement at a much lower computational cost. The resulting ADVMSeg framework therefore achieves not only strong segmentation accuracy, but also a practically favorable balance among parameter efficiency, computational efficiency, and high-resolution scalability. This trade-off is particularly relevant for real deployment, where large-scene sliding-window inference, GPU memory constraints, and post-processing cost must be considered jointly. In such settings, a method that improves hard-region quality without introducing full-map dense refinement overhead is often more practically useful than a uniformly more expensive alternative.

5. Discussion

The results of this study suggest that effective frozen-VFM adaptation for high-resolution remote-sensing semantic segmentation depends not only on transferring strong pretrained semantics but also on compensating for two task-specific mismatches that remain after pretraining. The first is that frozen-backbone features, although semantically strong, are not fully optimized for boundary-sensitive dense prediction in remote-sensing scenes. This helps explain why a plain bottleneck adapter brings only limited gains, whereas the proposed SF-Adapter becomes more effective once spatial-frequency modeling is introduced explicitly. The second is that prediction errors are not distributed uniformly over the image. Instead, they are concentrated in a relatively small subset of uncertain and boundary-sensitive regions. Under this observation, ASR is beneficial not because it makes dense refinement universally stronger, but because it allocates additional computation more selectively. Therefore, the overall improvement of ADVMSeg should be understood as the result of jointly improving representation suitability and refinement allocation under the frozen-backbone setting, rather than as the isolated effect of a single architectural component.

At the same time, the present study has several limitations that should be kept in mind when interpreting the results. First, all experiments are conducted under the evaluated setting of a frozen satellite-pretrained DINOv3 backbone with RGB-only input, so the current conclusions are primarily directed at this adaptation regime rather than all possible foundation-model configurations. Second, although the three datasets used in this work cover complementary segmentation challenges, they do not fully represent all domain-shift conditions encountered in practical Earth observation applications. In particular, cross-region transfer, multimodal fusion, and more severe distribution shift remain to be studied further. Third, the current sparse refinement strategy adopts a shared Top-M routing ratio and fixed local refinement configuration across datasets for practical consistency. While this setting already shows a favorable accuracy–efficiency trade-off, it may not yet be the optimal choice for all scene scales, class distributions, or deployment constraints. These observations indicate that future progress may lie not only in designing stronger adaptation modules, but also in making the refinement budget, routing criterion, and adaptation granularity more scene-adaptive and backbone-agnostic.

6. Conclusions

In this work, we proposed ADVMSeg, a parameter-efficient remote-sensing semantic segmentation framework built upon a frozen satellite-pretrained DINOv3 backbone. To better adapt frozen foundation features to high-resolution dense prediction, we designed a Spatial-Frequency Adapter (SF-Adapter) to enhance global frequency responses and local spatial details in a lightweight bottleneck space. We further introduced an Adaptive Sparse Refinement (ASR) module to focus additional computation on hard regions, thereby improving boundary delineation and local correction efficiency. Experiments on GID-15, LoveDA, and the evaluated ISPRS Potsdam setting demonstrated that ADVMSeg achieves consistently strong performance under the studied unified protocol across fine-grained land-cover mapping, complex large-scale scenes, and high-resolution urban parsing. The ablation studies further confirmed that SF-Adapter is the main source of representation enhancement, while ASR provides complementary gains through selective sparse refinement. These results suggest that, under the evaluated setting, effective adaptation of frozen vision foundation models for remote-sensing semantic segmentation can be achieved without costly full-network fine-tuning. In future work, we will explore more adaptive refinement strategies for improving cross-scene generalization and efficiency.

Author Contributions

Conceptualization, C.D., C.S. and D.L.; methodology, C.D., C.S., D.L. and X.L. (Xin Li); software, C.D., Z.S. and Z.F.; validation, C.D., C.S., D.L. and X.L. (Xin Li); formal analysis, C.D., D.L., X.L. (Xin Lyu) and Y.F.; investigation, C.D., C.S., D.L., Z.S. and L.M.; resources, X.L. (Xin Li), C.D., D.L. and X.L. (Xue Liu); data curation, C.S., D.L., Z.F. and C.Z.; writing—original draft preparation, C.D., C.S., D.L. and Z.S.; writing—review and editing, X.L. (Xin Li), C.D., C.S. and X.L. (Xue Liu); visualization, C.S., Z.F., Y.F., X.L. (Xue Liu) and C.Z.; supervision, X.L. (Xin Li), X.L. (Xue Liu) and L.M.; project administration, X.L. (Xin Li), C.D. and X.L. (Xin Lyu); funding acquisition, X.L. (Xin Li), D.L. and X.L. (Xin Lyu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the National Key Research and Development Program of China under Grant No. 2024YFC3210801, National Natural Science Foundation of China under Grant No. 62401196, Fundamental Research Funds for the Central Universities (Grant No. B250201044), and Natural Science Foundation of Jiangsu Province under Grant No. BK20241508.

Data Availability Statement

Public datasets were used in this paper. The download links are: LoveDA [https://github.com/Junjue-Wang/LoveDA], accessed on 12 October 2025; ISPRS Potsdam [https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx], accessed on 12 October 2025; and GID-15 [https://captain-whu.github.io/GID15/], accessed on 12 October 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.
2. Wang, D.; Zhang, J.; Du, B.; Xia, G.-S.; Tao, D. An Empirical Study of Remote Sensing Pretraining. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5608020.
3. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
4. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
5. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
6. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272.
7. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
8. Cong, Y.; Khanna, S.; Meng, C.; Liu, P.; Rozi, E.; He, Y.; Burke, M.; Lobell, D.B.; Ermon, S. SatMAE: Pre-Training Transformers for Temporal and Multi-Spectral Satellite Imagery. Adv. Neural Inf. Process. Syst. 2022, 35, 197–211.
9. Fuller, A.; Millard, K.; Green, J.R. CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders. Adv. Neural Inf. Process. Syst. 2023, 36, 5506–5538.
10. Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4065–4076.
11. Tang, M.; Cozma, A.; Georgiou, K.; Qi, H. Cross-Scale MAE: A Tale of Multiscale Exploitation in Remote Sensing. Adv. Neural Inf. Process. Syst. 2023, 36, 20054–20066.
12. Astruc, G.; Gonthier, N.; Mallet, C.; Landrieu, L. AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 19–25 October 2025; pp. 19530–19540.
13. Guo, X.; Lao, J.; Dang, B.; Zhang, Y.; Yu, L.; Ru, L.; Zhong, L.; Huang, Z.; Wu, K.; Hu, D.; et al. SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27672–27683.
14. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104.
15. Lu, S.; Guo, J.; Zimmer-Dauphinee, J.R.; Nieusma, J.M.; Wang, X.; vanValkenburgh, P.; Wernke, S.A.; Huo, Y. Vision Foundation Models in Remote Sensing: A Survey. IEEE Geosci. Remote Sens. Mag. 2025, 13, 190–215.
16. Huo, C.; Chen, K.; Zhang, S.; Wang, Z.; Yan, H.; Shen, J.; Hong, Y.; Qi, G.; Fang, H.; Wang, Z. When Remote Sensing Meets Foundation Model: A Survey and Beyond. Remote Sens. 2025, 17, 179.
17. Xu, Z.; Zhang, W.; Zhang, T.; Li, J. HRCNet: High-Resolution Context Extraction Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2020, 13, 71.
18. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remotely Sensed Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98.
19. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607713.
20. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506105.
21. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-Like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
22. Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an Edge: Improving Semantic Image Segmentation with Boundary Detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172.
23. Zheng, X.; Huan, L.; Xia, G.-S.; Gong, J. Parsing Very High Resolution Urban Scene Images by Learning Deep ConvNets with Edge-Aware Loss. ISPRS J. Photogramm. Remote Sens. 2020, 170, 15–28.
24. Wang, Y.; Ding, W.; Zhang, R.; Li, H. Boundary-Aware Multitask Learning for Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 951–963.
25. Li, X.; Xu, F.; Liu, F.; Tong, Y.; Lyu, X.; Zhou, J. Semantic Segmentation of Remote Sensing Images by Interactive Representation Refinement and Geometric Prior-Guided Inference. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5400318.
26. Hu, L.; Yu, H.; Lu, W.; Yin, D.; Sun, X.; Fu, K. AiRs: Adapter in Remote Sensing for Parameter-Efficient Transfer Learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5605218.
27. Luo, M.; Zan, Y.; Khoshelham, K.; Ji, S. Domain Generalization for Semantic Segmentation of Remote Sensing Images via Vision Foundation Model Fine-Tuning. ISPRS J. Photogramm. Remote Sens. 2025, 230, 126–146.
28. Zou, X.; Zhang, S.; Li, K.; Wang, S.; Xing, J.; Jin, L.; Lang, C.; Tao, P. Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5636814.
29. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117.
30. Zhang, J.; Li, Y.; Yang, X.; Jiang, R.; Zhang, L. RSAM-Seg: A SAM-Based Model with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation. Remote Sens. 2025, 17, 590.
31. Zhang, Z.; Shu, D.; Liao, C.; Liu, C.; Zhao, Y.; Wang, R.; Huang, X.; Zhang, M.; Gong, J. FlexiSAM: A Flexible SAM-Based Semantic Segmentation Model for Land Cover Classification Using High-Resolution Multimodal Remote Sensing Imagery. ISPRS J. Photogramm. Remote Sens. 2025, 227, 594–612.
32. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289.
33. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. PointRend: Image Segmentation as Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9799–9808.
34. Yuan, Y.; Xie, J.; Chen, X.; Wang, J. SegFix: Model-Agnostic Boundary Refinement for Segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 489–506.
35. Yang, Y.; Yuan, G.; Li, J. SFFNet: A Wavelet-Based Spatial and Frequency Domain Fusion Network for Remote Sensing Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3000617.
36. Cao, Y.; Liu, C.; Wu, Z.; Zhang, L.; Yang, L. Remote Sensing Image Segmentation Using Vision Mamba and Multi-Scale Multi-Frequency Feature Fusion. Remote Sens. 2025, 17, 1390.
37. Chen, Y.; Lai, W.; He, W.; Zhao, X.-L.; Zeng, J. Hyperspectral Compressive Snapshot Reconstruction via Coupled Low-Rank Subspace Representation and Self-Supervised Deep Network. IEEE Trans. Image Process. 2024, 33, 926–941.
38. Chen, Y.; Chen, M.; He, W.; Zeng, J.; Huang, M.; Zheng, Y.-B. Thick Cloud Removal in Multitemporal Remote Sensing Images via Low-Rank Regularized Self-Supervised Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5506613.
39. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very High Resolution Urban Remote Sensing with Multimodal Deep Networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32.
40. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
41. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 801–818.
42. Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic Labeling in Very High Resolution Images via a Self-Cascaded Convolutional Neural Network. ISPRS J. Photogramm. Remote Sens. 2018, 145, 78–95.
43. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A Deep Learning Framework for Semantic Segmentation of Remotely Sensed Data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114.
44. Zhang, Z.; Huang, X.; Li, J. DWin-HRFormer: A High-Resolution Transformer Model with Directional Windows for Semantic Segmentation of Urban Construction Land. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400714.
45. Li, M.; Long, J.; Stein, A.; Wang, X. Using a Semantic Edge-Aware Multi-Task Neural Network to Delineate Agricultural Parcels from Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2023, 200, 24–40.
46. He, X.; Zhou, Y.; Liu, B.; Zhao, J.; Yao, R. Remote Sensing Image Semantic Segmentation via Class-Guided Structural Interaction and Boundary Perception. Expert Syst. Appl. 2024, 252, 124019.
47. Li, X.; Xu, F.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. A Frequency Decoupling Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5607921.
48. Li, X.; Xu, F.; Zhang, J.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. Dual-Domain Decoupled Fusion Network for Semantic Segmentation of Remote Sensing Images. Inf. Fusion 2025, 124, 103359.
49. Li, X.; Xu, F.; Zhang, J.; Zhang, H.; Lyu, X.; Liu, F.; Gao, H.; Kaup, A. Frequency-Guided Denoising Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5400217.
50. Yuan, F.; Chen, Y.; He, W.; Zeng, J. Feature Fusion-Guided Network With Sparse Prior Constraints for Unsupervised Hyperspectral Image Quality Improvement. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5511912.
51. Xiang, P.; Ali, S.; Zhang, J.; Jung, S.K.; Zhou, H. Pixel-Associated Autoencoder for Hyperspectral Anomaly Detection. Int. J. Appl. Earth Obs. Geoinf. 2024, 129, 103816.
52. Xiang, P.; Zhang, J.; Qi, S.; Jung, S.K.; Zhou, H.; Zhao, D. Hyperspectral Anomaly Detection Using Taylor Expansion and Weighted Irregular Block Filter. Infrared Phys. Technol. 2025, 150, 105942.
53. Li, X.; Xu, F.; Li, J.; Su, Y.; Li, L.; Lyu, X.; Xu, Z.; Kaup, A. Frequency Domain-Enhanced Spectral-Spatial Fusion Transformer for Semantic Segmentation of Remote Sensing Images. Inf. Fusion 2026, 132, 104248.
54. Li, X.; Shi, C.; Xu, N.; Su, Y.; Kaup, A.; Liu, D.; Li, X. Position-Aware Differential Denoising Transformer for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2026, 23, 5000405.
55. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612.
56. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-Cover Classification with High-Resolution Remote Sensing Images Using Transferable Deep Models. Remote Sens. Environ. 2020, 237, 111322.
57. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. In Proceedings of the 35th Neural Information Processing Systems Track on Datasets and Benchmarks, Montreal, QC, Canada, 6–14 December 2021.
58. Gerke, M.; Rottensteiner, F.; Wegner, J.D.; Sohn, G. ISPRS Semantic Labeling Contest. In Proceedings of the Photogrammetric Computer Vision (PCV), Zurich, Switzerland, 5–7 September 2014.
59. Zeng, Q.; Zhou, J.; Tao, J.; Chen, L.; Niu, X.; Zhang, Y. Multiscale Global Context Network for Semantic Segmentation of High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622913.
60. Huang, W.; Deng, F.; Liu, H.; Ding, M.; Yao, Q. Multiscale Semantic Segmentation of Remote Sensing Images Based on Edge Optimization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5616813.
61. Li, J.; Wei, Y.; Wei, T.; He, W. A Comprehensive Deep-Learning Framework for Fine-Grained Farmland Mapping From High-Resolution Images. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5601215.

Figure 1. Overall architecture of ADVMSeg. The framework consists of a frozen satellite-pretrained DINOv3 backbone, a Mask2Former-style segmentation head, the proposed SF-Adapter for spatial-frequency feature adaptation, and the ASR module for sparse hard-region refinement.

Figure 2. Architecture of the proposed SF-Adapter. It combines global spectral filtering and multiscale spatial enhancement in a bottleneck space, followed by adaptive fusion.

Figure 3. Architecture of the proposed ASR module. It identifies hard regions from coarse predictions, constructs sparse queries from multiscale dense features, performs local cross-attention refinement, and renders the predicted block-wise residuals back to the dense logit map.

Figure 4. Qualitative comparison of semantic segmentation results on Potsdam, LoveDA, and GID-15. From left to right, the columns correspond to the input image, ground-truth, SkySense, GeoSA, RSAM-Seg, and the proposed ADVMSeg. The selected examples highlight challenging scenarios involving complex boundaries, small structures, fragmented objects, and confusing land-cover transitions. Compared with the competing methods, ADVMSeg produces predictions with cleaner object contours, fewer spurious regions, and better semantic consistency in difficult areas.

Figure 5. Qualitative comparison of feature activation maps and segmentation outputs. The baseline (frozen backbone) exhibits blurry and dispersed activations, while the integration of our SF-Adapter significantly sharpens the features, leading to refined boundaries and accurate topology in the final predictions.

Figure 6. Qualitative analysis of the ASR module. The Uncertainty Map precisely localizes difficult regions where the baseline fails. Guided by this map, Adaptive Top-M Sparse Routing concentrates computational resources on these hard pixels, and Sparse Residual Correction applies local refinements, ultimately producing accurate and sharp final predictions.

Table 1. Comparison on GID-15. The mIoU, mF1, and OA values are measured under our unified protocol. The class columns report per-class IoU (%). PF: paddy field; IL: irrigated land; DC: dry cropland; Ga: garden; AF: arbor forest; SL: shrub land; NM: natural meadow; AM: artificial meadow; Ind: industrial land; UR: urban residential; RR: rural residential; TL: traffic land; Ri: river; La: lake; Po: pond. The bold text indicates the best results.

Method	mIoU	mF1	OA	PF	IL	DC	Ga	AF	SL	NM	AM	Ind	UR	RR	TL	Ri	La	Po
DeepLabV3+ (2018) [41]	56.4	69.3	74.3	62.8	76.9	58.0	26.9	75.6	5.2	57.5	31.0	51.8	62.7	51.1	51.9	86.9	77.0	70.2
DC-Swin (2022) [20]	59.1	71.9	75.4	65.1	78.2	61.0	30.8	76.3	8.1	59.4	39.6	54.7	64.6	52.4	55.2	89.0	79.2	72.6
CMTFNet (2023) [55]	59.6	72.6	75.7	64.4	78.4	62.2	39.5	76.3	9.0	62.1	41.3	53.4	63.9	52.6	54.8	90.2	74.1	71.8
MSGCNet (2024) [59]	59.3	72.2	75.5	64.8	77.9	61.7	33.8	76.0	8.8	60.8	40.1	54.1	64.2	52.1	54.9	89.8	78.6	72.2
SFFNet (2024) [35]	61.1	73.7	76.4	66.9	79.1	63.9	34.6	76.6	10.6	61.9	45.3	55.5	65.4	53.1	57.6	91.1	80.5	74.3
MSEONet (2025) [60]	57.5	70.5	74.8	63.6	77.3	59.4	29.1	75.8	6.7	58.3	35.0	52.9	63.3	51.7	53.0	87.8	77.8	71.0
DBBANet (2024) [61]	60.6	73.3	76.1	66.0	78.8	63.0	35.2	76.8	10.1	61.5	44.0	55.1	65.0	52.9	56.4	90.7	80.0	73.8
SkySense (2024) [13]	60.9	73.5	76.3	66.4	79.0	63.4	35.4	76.9	10.3	61.7	44.8	55.3	65.2	53.0	56.9	90.9	80.2	74.0
GeoSA (2025) [27]	62.1	74.5	76.8	67.5	79.8	64.5	35.9	77.2	11.2	62.6	46.1	56.0	66.7	54.3	58.4	91.8	83.3	76.4
RSAM-Seg (2025) [30]	61.0	73.6	76.4	66.7	78.9	63.7	34.8	76.6	10.8	61.8	45.0	55.6	65.3	53.1	57.2	91.0	80.4	74.2
ADVMSeg (Ours)	63.1	75.4	77.6	69.1	80.4	66.2	35.7	77.1	13.2	63.1	50.2	58.1	66.5	54.1	60.2	93.7	83.1	76.3
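
As a quick consistency check on Table 1, the reported mIoU can be recomputed from the per-class columns. The sketch below assumes that mIoU is the unweighted mean of the 15 class IoUs, which the caption does not state explicitly; the variable names are illustrative only.

```python
# Per-class IoU (%) for the ADVMSeg row of Table 1, in column order PF ... Po.
advmseg_iou = [69.1, 80.4, 66.2, 35.7, 77.1, 13.2, 63.1, 50.2,
               58.1, 66.5, 54.1, 60.2, 93.7, 83.1, 76.3]

# Assumption: mIoU is the plain (unweighted) mean over the 15 classes.
miou = sum(advmseg_iou) / len(advmseg_iou)
print(f"recomputed mIoU = {miou:.1f}%")
```

Under that assumption the recomputed value agrees with the 63.1% listed in the table, up to rounding of the per-class entries.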

Table 2. Comparison on LoveDA. The mIoU, mF1, and OA values are measured under our unified protocol. The class columns report per-class IoU (%). Bkg: background; Bld: building; Rd: road; Wat: water; Bar: barren; For: forest; Agr: agriculture. The bold text indicates the best results.

Method	mIoU	mF1	OA	Bkg	Bld	Rd	Wat	Bar	For	Agr
DeepLabV3+ (2018) [41]	47.6	62.5	52.3	46.8	49.5	51.1	65.9	12.4	42.7	64.9
DC-Swin (2022) [20]	56.0	70.4	59.7	57.3	60.8	63.2	78.0	23.9	52.1	56.7
CMTFNet (2023) [55]	58.5	72.5	61.8	59.2	62.6	65.5	79.2	26.4	54.1	62.5
MSGCNet (2024) [59]	57.4	71.6	61.0	58.4	61.7	64.2	78.6	25.0	53.0	61.0
SFFNet (2024) [35]	60.8	74.5	66.0	61.8	65.7	69.0	81.4	29.8	57.5	60.4
MSEONet (2025) [60]	55.7	70.1	59.2	56.9	60.2	62.7	77.5	22.3	51.2	59.1
DBBANet (2024) [61]	59.8	73.7	64.4	60.6	64.4	67.6	80.3	28.7	56.6	60.4
SkySense (2024) [13]	61.5	75.0	66.9	62.2	66.0	69.1	81.8	30.2	58.2	63.0
GeoSA (2025) [27]	62.4	75.8	67.5	63.1	67.4	70.0	82.4	31.4	59.8	62.7
RSAM-Seg (2025) [30]	62.7	76.0	68.7	64.0	68.3	71.4	82.8	32.1	60.5	60.0
ADVMSeg (Ours)	63.5	76.7	69.2	64.7	66.6	72.5	83.2	33.5	61.0	63.0

Table 3. Comparison on ISPRS Potsdam. The mIoU, mF1, and OA values are measured under our unified protocol on an internally constructed split from the public annotated tiles, rather than on the official hidden test benchmark. Therefore, these results are intended for controlled comparison under our setting and are not directly comparable to results reported on the official benchmark server. The class columns report per-class F1 (%). Imp: impervious surface; Bld: building; LowVeg: low vegetation; Tre: tree; Car: car; Clut: clutter. The bold text indicates the best results.

Method	mIoU	mF1	OA	Imp	Bld	LowVeg	Tre	Car	Clut
DeepLabV3+ (2018) [41]	70.9	82.8	84.5	85.5	89.6	78.6	75.1	85.9	81.8
DC-Swin (2022) [20]	78.3	87.7	87.9	89.2	93.7	82.6	83.1	90.1	87.2
CMTFNet (2023) [55]	80.0	88.8	88.6	90.4	95.0	83.6	84.6	90.8	88.3
MSGCNet (2024) [59]	79.6	88.5	88.4	90.0	94.8	83.3	84.2	90.6	88.0
SFFNet (2024) [35]	80.5	89.1	88.8	90.6	95.2	84.0	84.9	91.0	88.8
MSEONet (2025) [60]	77.9	87.4	87.6	88.8	94.1	82.1	82.8	89.8	86.7
DBBANet (2024) [61]	80.7	89.2	89.0	90.7	95.2	84.3	85.1	91.1	89.0
SkySense (2024) [13]	80.8	89.3	89.2	90.8	95.3	84.4	85.2	91.3	88.6
GeoSA (2025) [27]	81.1	89.4	89.3	91.0	95.4	84.5	85.3	91.2	89.1
RSAM-Seg (2025) [30]	80.6	89.1	88.9	90.7	95.3	84.3	85.1	91.1	88.2
ADVMSeg (Ours)	81.4	89.6	89.6	90.8	95.3	85.1	85.7	91.4	89.5
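
Table 3 mixes two metrics: the summary columns include mIoU, while the class columns are per-class F1. For a single class the two are linked by the standard identity IoU = F1 / (2 − F1), so the mIoU column can be roughly cross-checked from the F1 columns. The following is a minimal sketch under the assumption that mIoU and mF1 are unweighted means over the six classes; `f1_to_iou` is a hypothetical helper, not something taken from the paper.

```python
def f1_to_iou(f1_percent: float) -> float:
    """Convert a per-class F1 (Dice) score in % to IoU in % via IoU = F1 / (2 - F1)."""
    f1 = f1_percent / 100.0
    return 100.0 * f1 / (2.0 - f1)

# Per-class F1 (%) for the ADVMSeg row of Table 3: Imp, Bld, LowVeg, Tre, Car, Clut.
advmseg_f1 = [90.8, 95.3, 85.1, 85.7, 91.4, 89.5]

# Assumption: both summary metrics are unweighted means over the six classes.
mf1 = sum(advmseg_f1) / len(advmseg_f1)
miou = sum(f1_to_iou(f) for f in advmseg_f1) / len(advmseg_f1)
print(f"mF1 = {mf1:.1f}%, mIoU derived from F1 = {miou:.1f}%")
```

Because the table entries are rounded to one decimal, the derived values can only be expected to match the reported 89.6% mF1 and 81.4% mIoU approximately.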

Table 4. Ablation study of SF-Adapter variants using mIoU. The baseline uses a completely frozen DINOv3 backbone without explicit task adaptation. The bold text indicates the best results.

Variant	GID-15 mIoU (%)	LoveDA mIoU (%)	Potsdam mIoU (%)
Frozen DINOv3 + Mask2Former	59.3	60.1	79.1
+ Plain Bottleneck Adapter	59.4	60.8	79.6
+ Spatial Branch Only	59.9	61.2	80.4
+ Spectral Branch Only	60.1	61.7	79.9
+ Spectral + Spatial (Direct Sum)	61.0	61.9	80.5
+ Full SF-Adapter	62.2	62.8	80.8

Table 5. Ablation study of different refinement strategies on Potsdam. B-IoU denotes Boundary IoU. The baseline corresponds to the model equipped with the full SF-Adapter before refinement. The bold text indicates the best results.

Variant	Potsdam mIoU (%)	B-IoU (%)	Latency (ms)	GFLOPs
No Refinement (Baseline)	80.8	76.5	132.0	366.0
Random Sparse Refinement	80.9	76.6	145.0	378.0
Uncertainty-based Selection	81.0	76.7	146.0	379.0
Boundary-based Selection	80.9	76.9	146.0	379.0
Full ASR Module	81.4	77.1	148.0	381.0
Dense Refinement	81.7	77.3	176.0	421.0
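
Table 5 reports Boundary IoU (B-IoU) alongside mIoU. To illustrate why a boundary-restricted metric reacts more strongly to contour errors, here is a simplified pure-Python sketch of the general idea: IoU is computed only on the pixels of each mask that lie within a small distance `d` of its contour. The erosion-based boundary extraction, the distance `d = 1`, and the toy masks are assumptions for illustration; the evaluation code and boundary width used for the table are not specified in this section.

```python
from itertools import product

def boundary(mask: set, d: int = 1) -> set:
    """Pixels of `mask` removed by d steps of 8-connected erosion (its boundary band)."""
    inner = set(mask)
    for _ in range(d):
        # Keep a pixel only if it and all 8 neighbours are still foreground.
        inner = {(r, c) for (r, c) in inner
                 if all((r + dr, c + dc) in inner
                        for dr, dc in product((-1, 0, 1), repeat=2))}
    return mask - inner

def iou(a: set, b: set) -> float:
    """Intersection over union of two pixel sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def boundary_iou(gt: set, pred: set, d: int = 1) -> float:
    """IoU restricted to the boundary bands of the two masks."""
    return iou(boundary(gt, d), boundary(pred, d))

# Toy example: a 4x4 ground-truth square and a prediction shifted one pixel right.
gt = {(r, c) for r in range(4) for c in range(4)}
pred = {(r, c + 1) for (r, c) in gt}
print(f"mask IoU = {iou(gt, pred):.2f}, boundary IoU = {boundary_iou(gt, pred):.2f}")
```

In this toy case a one-pixel shift leaves the mask IoU at 0.60 but drops the boundary IoU to about 0.33, which is the kind of sensitivity that makes B-IoU a useful companion to mIoU when comparing refinement strategies.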

Table 6. Inter-module ablation study of SF-Adapter and ASR using mIoU. The complete model corresponds to ADVMSeg. The bold text indicates the best results.

SF-Adapter	ASR	GID-15	LoveDA	Potsdam
–	–	59.3	60.1	79.1
–	✓	60.6	61.7	79.9
✓	–	62.2	62.8	80.8
✓	✓	63.1	63.5	81.4

Table 7. Overall efficiency comparison on Potsdam. Tr. Ratio: proportion of trainable parameters to total parameters. Mem.: peak GPU memory during training. The bold text indicates the best results.

Variant	mIoU (%)	Params (M)	Tr. Params (M)	Tr. Ratio (%)	GFLOPs	Lat. (ms)	Mem. (GB)
Frozen DINOv3 + Mask2Former	79.1	344.8	40.8	11.8	360.0	128.0	15.4
+ Plain Bottleneck Adapter	79.6	349.1	45.1	12.9	364.0	131.0	16.1
+ SF-Adapter	80.8	350.9	46.9	13.4	366.0	132.0	16.9
+ SF-Adapter + Dense Refinement	81.7	357.2	53.2	14.9	421.0	176.0	18.8
ADVMSeg	81.4	354.5	50.5	14.2	381.0	148.0	17.6
Full Fine-Tuning (ref.)	–	344.8	344.8	100.0	–	–	–
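
The Tr. Ratio column follows directly from the two parameter columns, as defined in the caption of Table 7. The short sketch below recomputes it from the tabulated values and also prints the implied number of frozen parameters; the dictionary and function names are illustrative.

```python
# (total params, trainable params) in millions, copied from Table 7.
rows = {
    "Frozen DINOv3 + Mask2Former": (344.8, 40.8),
    "+ Plain Bottleneck Adapter": (349.1, 45.1),
    "+ SF-Adapter": (350.9, 46.9),
    "+ SF-Adapter + Dense Refinement": (357.2, 53.2),
    "ADVMSeg": (354.5, 50.5),
}

def trainable_ratio(total_m: float, trainable_m: float) -> float:
    """Proportion of trainable parameters to total parameters, in percent."""
    return 100.0 * trainable_m / total_m

for name, (total, trainable) in rows.items():
    frozen = total - trainable  # parameters that receive no gradient updates
    print(f"{name}: ratio {trainable_ratio(total, trainable):.1f}%, frozen {frozen:.1f} M")
```

The frozen remainder comes out at 304.0 M for every row, which is what one would expect if all variants share the same frozen backbone and differ only in the trainable modules placed around it.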

Table 8. Module-wise complexity decomposition of ADVMSeg on Potsdam. Routing includes uncertainty/boundary score generation and Top-M selection. Local Refinement denotes the sparse local cross-attention correction stage.

Component	GFLOPs	Latency (ms)
Frozen DINOv3 Backbone	298.0	101.0
Pixel Decoder	61.0	22.0
SF-Adapter	7.0	6.0
ASR Routing	1.8	2.0
ASR Local Refinement	13.2	17.0
Total	381.0	148.0

Table 9. Sensitivity analysis of the Top-M ratio in ASR on Potsdam. The default setting used in ADVMSeg is 1.0%. The bold text indicates the best results.

Top-M Ratio	Potsdam mIoU (%)	B-IoU (%)	GFLOPs	Latency (ms)
0.25%	80.9	76.7	374.0	140.0
0.5%	81.0	76.9	377.0	143.0
1.0%	81.4	77.1	381.0	148.0
2.0%	81.2	77.2	389.0	155.0
5.0%	81.2	77.2	420.0	173.0
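
The Top-M ratio in Table 9 controls what fraction of positions is routed to refinement. A minimal sketch of such a selection step is shown below: given one hard-region score per position, it keeps the M highest-scoring positions. The rounding rule (ceiling), the flattened score list, and the function name `top_m_indices` are assumptions for illustration; how ADVMSeg combines uncertainty and boundary scores, and at which feature resolution routing operates, is not described in this section.

```python
import heapq
import math

def top_m_indices(scores, ratio):
    """Return indices of the M highest-scoring positions, with M = ceil(ratio * N)."""
    m = max(1, math.ceil(ratio * len(scores)))
    return heapq.nlargest(m, range(len(scores)), key=scores.__getitem__)

# Hypothetical flattened hard-region scores for a tiny 4x5 map (row-major order).
scores = [0.05, 0.10, 0.90, 0.20, 0.15,
          0.30, 0.85, 0.95, 0.25, 0.10,
          0.05, 0.40, 0.80, 0.35, 0.05,
          0.02, 0.08, 0.12, 0.06, 0.01]

selected = top_m_indices(scores, ratio=0.25)  # 25% of 20 positions gives M = 5
print(sorted(selected))
```

With a fixed ratio, the number of refined positions grows linearly with the number of candidates, which is consistent with the gradual rise in GFLOPs and latency across the rows of Table 9.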

Table 10. Scalability comparison under different input resolutions on Potsdam.

Input Resolution	Variant	mIoU (%)	GFLOPs	Latency (ms)
512 × 512	SF-Adapter Only	80.8	366.0	132.0
512 × 512	ADVMSeg (ASR)	81.4	381.0	148.0
512 × 512	Dense Refinement	81.7	421.0	176.0
768 × 768	SF-Adapter Only	81.0	823.0	244.0
768 × 768	ADVMSeg (ASR)	81.5	856.0	273.0
768 × 768	Dense Refinement	81.8	948.0	337.0
1024 × 1024	SF-Adapter Only	81.0	1461.0	425.0
1024 × 1024	ADVMSeg (ASR)	81.5	1518.0	468.0
1024 × 1024	Dense Refinement	81.8	1690.0	587.0
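
One way to read Table 10 is to express each refinement variant as a relative overhead over the SF-Adapter-only model at the same input resolution. The sketch below does this arithmetic on the tabulated values; the dictionary layout and the `overhead` helper are illustrative and not part of the original evaluation.

```python
# (GFLOPs, latency in ms) per variant at each input size, copied from Table 10.
table10 = {
    512:  {"base": (366.0, 132.0), "asr": (381.0, 148.0), "dense": (421.0, 176.0)},
    768:  {"base": (823.0, 244.0), "asr": (856.0, 273.0), "dense": (948.0, 337.0)},
    1024: {"base": (1461.0, 425.0), "asr": (1518.0, 468.0), "dense": (1690.0, 587.0)},
}

def overhead(variant, base):
    """Relative increase (%) in GFLOPs and latency over the SF-Adapter-only model."""
    return tuple(100.0 * (v - b) / b for v, b in zip(variant, base))

for size, rows in table10.items():
    asr = overhead(rows["asr"], rows["base"])
    dense = overhead(rows["dense"], rows["base"])
    print(f"{size}x{size}: ASR +{asr[0]:.1f}% GFLOPs / +{asr[1]:.1f}% latency; "
          f"dense +{dense[0]:.1f}% GFLOPs / +{dense[1]:.1f}% latency")
```

On these numbers the sparse variant adds roughly 4% GFLOPs at every resolution, while dense refinement adds about 15 to 16%, so the gap between the two in absolute cost widens as the input grows.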

Share and Cite

MDPI and ACS Style

Ding, C.; Shi, C.; Liu, D.; Shi, Z.; Lyu, X.; Fang, Z.; Liu, X.; Meng, L.; Fang, Y.; Zhang, C.; et al. Efficient Adaptation of Vision Foundation Model for High-Resolution Remote Sensing Image Segmentation via Spatial-Frequency Modeling and Sparse Refinement. Remote Sens. 2026, 18, 1295. https://doi.org/10.3390/rs18091295
