Author Contributions
Conceptualization, C.D., C.S. and D.L.; methodology, C.D., C.S., D.L. and X.L. (Xin Li); software, C.D., Z.S. and Z.F.; validation, C.D., C.S., D.L. and X.L. (Xin Li); formal analysis, C.D., D.L., X.L. (Xin Lyu) and Y.F.; investigation, C.D., C.S., D.L., Z.S. and L.M.; resources, X.L. (Xin Li), C.D., D.L. and X.L. (Xue Liu); data curation, C.S., D.L., Z.F. and C.Z.; writing—original draft preparation, C.D., C.S., D.L. and Z.S.; writing—review and editing, X.L. (Xin Li), C.D., C.S. and X.L. (Xue Liu); visualization, C.S., Z.F., Y.F., X.L. (Xue Liu) and C.Z.; supervision, X.L. (Xin Li), X.L. (Xue Liu) and L.M.; project administration, X.L. (Xin Li), C.D. and X.L. (Xin Lyu); funding acquisition, X.L. (Xin Li), D.L. and X.L. (Xin Lyu). All authors have read and agreed to the published version of the manuscript.
Figure 1.
Overall architecture of ADVMSeg. The framework consists of a frozen satellite-pretrained DINOv3 backbone, a Mask2Former-style segmentation head, the proposed SF-Adapter for spatial-frequency feature adaptation, and the ASR module for sparse hard-region refinement.
Figure 2.
Architecture of the proposed SF-Adapter. It combines global spectral filtering and multiscale spatial enhancement in a bottleneck space, followed by adaptive fusion.
Figure 3.
Architecture of the proposed ASR module. It identifies hard regions from coarse predictions, constructs sparse queries from multiscale dense features, performs local cross-attention refinement, and renders the predicted block-wise residuals back to the dense logit map.
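The hard-region selection step described in this caption can be illustrated with a minimal sketch. It assumes that uncertainty is measured by per-pixel entropy averaged over square blocks, and that a fixed fraction of blocks is kept. The caption does not specify the scoring rule, so `entropy`, `select_hard_blocks`, `block`, and `ratio` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import math

def entropy(probs):
    """Shannon entropy of one pixel's class distribution (assumed uncertainty score)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_hard_blocks(prob_map, block, ratio):
    """Return (block_row, block_col) indices of the most uncertain blocks.

    prob_map: H x W nested list of per-pixel class probabilities.
    block:    side length of a square block in pixels (hypothetical parameter).
    ratio:    fraction of blocks to keep, e.g. 0.01 for a 1.0% Top-M ratio.
    """
    height, width = len(prob_map), len(prob_map[0])
    scores = {}
    for by in range(0, height, block):
        for bx in range(0, width, block):
            cells = [prob_map[y][x]
                     for y in range(by, min(by + block, height))
                     for x in range(bx, min(bx + block, width))]
            # Block score = mean pixel entropy inside the block.
            scores[(by // block, bx // block)] = sum(map(entropy, cells)) / len(cells)
    m = max(1, round(ratio * len(scores)))  # always keep at least one block
    return sorted(scores, key=scores.get, reverse=True)[:m]

# Toy 4 x 4 map with two classes: confident everywhere except the top-right block.
confident, ambiguous = [0.99, 0.01], [0.5, 0.5]
toy = [[confident] * 4 for _ in range(4)]
toy[0][2] = toy[0][3] = toy[1][2] = toy[1][3] = ambiguous

hard_blocks = select_hard_blocks(toy, block=2, ratio=0.25)
print(hard_blocks)
```

In this toy case only the ambiguous top-right block is selected. A real routing stage would presumably also fold in a boundary score, which this sketch omits.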
Figure 4.
Qualitative comparison of semantic segmentation results on Potsdam, LoveDA, and GID-15. From left to right, the columns show the input image, the ground truth, SkySense, GeoSA, RSAM-Seg, and the proposed ADVMSeg. The selected examples highlight challenging scenarios involving complex boundaries, small structures, fragmented objects, and confusing land-cover transitions. Compared with the competing methods, ADVMSeg produces predictions with cleaner object contours, fewer spurious regions, and better semantic consistency in difficult areas.
Figure 5.
Qualitative comparison of feature activation maps and segmentation outputs. The baseline (frozen backbone) exhibits blurry and dispersed activations, whereas adding our SF-Adapter visibly sharpens the features, leading to refined boundaries and more accurate topology in the final predictions.
Figure 6.
Qualitative analysis of the ASR module. The Uncertainty Map precisely localizes difficult regions where the baseline fails. Guided by this map, Adaptive Top-M Sparse Routing concentrates computational resources on these hard pixels, and Sparse Residual Correction applies local refinements, ultimately producing accurate and sharp final predictions.
Table 1.
Comparison on GID-15. The mIoU, mF1, and OA values are measured under our unified protocol. The class columns report per-class IoU (%). PF: paddy field; IL: irrigated land; DC: dry cropland; Ga: garden; AF: arbor forest; SL: shrub land; NM: natural meadow; AM: artificial meadow; Ind: industrial land; UR: urban residential; RR: rural residential; TL: traffic land; Ri: river; La: lake; Po: pond. The bold text indicates the best results.
| Method | mIoU | mF1 | OA | PF | IL | DC | Ga | AF | SL | NM | AM | Ind | UR | RR | TL | Ri | La | Po |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ (2018) [41] | 56.4 | 69.3 | 74.3 | 62.8 | 76.9 | 58.0 | 26.9 | 75.6 | 5.2 | 57.5 | 31.0 | 51.8 | 62.7 | 51.1 | 51.9 | 86.9 | 77.0 | 70.2 |
| DC-Swin (2022) [20] | 59.1 | 71.9 | 75.4 | 65.1 | 78.2 | 61.0 | 30.8 | 76.3 | 8.1 | 59.4 | 39.6 | 54.7 | 64.6 | 52.4 | 55.2 | 89.0 | 79.2 | 72.6 |
| CMTFNet (2023) [55] | 59.6 | 72.6 | 75.7 | 64.4 | 78.4 | 62.2 | 39.5 | 76.3 | 9.0 | 62.1 | 41.3 | 53.4 | 63.9 | 52.6 | 54.8 | 90.2 | 74.1 | 71.8 |
| MSGCNet (2024) [59] | 59.3 | 72.2 | 75.5 | 64.8 | 77.9 | 61.7 | 33.8 | 76.0 | 8.8 | 60.8 | 40.1 | 54.1 | 64.2 | 52.1 | 54.9 | 89.8 | 78.6 | 72.2 |
| SFFNet (2024) [35] | 61.1 | 73.7 | 76.4 | 66.9 | 79.1 | 63.9 | 34.6 | 76.6 | 10.6 | 61.9 | 45.3 | 55.5 | 65.4 | 53.1 | 57.6 | 91.1 | 80.5 | 74.3 |
| MSEONet (2025) [60] | 57.5 | 70.5 | 74.8 | 63.6 | 77.3 | 59.4 | 29.1 | 75.8 | 6.7 | 58.3 | 35.0 | 52.9 | 63.3 | 51.7 | 53.0 | 87.8 | 77.8 | 71.0 |
| DBBANet (2024) [61] | 60.6 | 73.3 | 76.1 | 66.0 | 78.8 | 63.0 | 35.2 | 76.8 | 10.1 | 61.5 | 44.0 | 55.1 | 65.0 | 52.9 | 56.4 | 90.7 | 80.0 | 73.8 |
| SkySense (2024) [13] | 60.9 | 73.5 | 76.3 | 66.4 | 79.0 | 63.4 | 35.4 | 76.9 | 10.3 | 61.7 | 44.8 | 55.3 | 65.2 | 53.0 | 56.9 | 90.9 | 80.2 | 74.0 |
| GeoSA (2025) [27] | 62.1 | 74.5 | 76.8 | 67.5 | 79.8 | 64.5 | 35.9 | 77.2 | 11.2 | 62.6 | 46.1 | 56.0 | 66.7 | 54.3 | 58.4 | 91.8 | 83.3 | 76.4 |
| RSAM-Seg (2025) [30] | 61.0 | 73.6 | 76.4 | 66.7 | 78.9 | 63.7 | 34.8 | 76.6 | 10.8 | 61.8 | 45.0 | 55.6 | 65.3 | 53.1 | 57.2 | 91.0 | 80.4 | 74.2 |
| ADVMSeg (Ours) | 63.1 | 75.4 | 77.6 | 69.1 | 80.4 | 66.2 | 35.7 | 77.1 | 13.2 | 63.1 | 50.2 | 58.1 | 66.5 | 54.1 | 60.2 | 93.7 | 83.1 | 76.3 |
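As a quick consistency check, the mIoU column can be compared with the per-class columns. The sketch below assumes that mIoU is the unweighted mean of the 15 per-class IoU values and uses the ADVMSeg row copied from Table 1. The helper name `mean_iou` is illustrative.

```python
# Per-class IoU (%) for the ADVMSeg row of Table 1, in column order PF ... Po.
advmseg_iou = [69.1, 80.4, 66.2, 35.7, 77.1, 13.2, 63.1, 50.2,
               58.1, 66.5, 54.1, 60.2, 93.7, 83.1, 76.3]

def mean_iou(per_class_iou):
    """Unweighted mean over classes (assumed definition of mIoU)."""
    return sum(per_class_iou) / len(per_class_iou)

miou = round(mean_iou(advmseg_iou), 1)
print(miou)
```

Under this assumption the rounded mean agrees with the 63.1 reported in the mIoU column.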
Table 2.
Comparison on LoveDA. The mIoU, mF1, and OA values are measured under our unified protocol. The class columns report per-class IoU (%). Bkg: background; Bld: building; Rd: road; Wat: water; Bar: barren; For: forest; Agr: agriculture. The bold text indicates the best results.
| Method | mIoU | mF1 | OA | Bkg | Bld | Rd | Wat | Bar | For | Agr |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ (2018) [41] | 47.6 | 62.5 | 52.3 | 46.8 | 49.5 | 51.1 | 65.9 | 12.4 | 42.7 | 64.9 |
| DC-Swin (2022) [20] | 56.0 | 70.4 | 59.7 | 57.3 | 60.8 | 63.2 | 78.0 | 23.9 | 52.1 | 56.7 |
| CMTFNet (2023) [55] | 58.5 | 72.5 | 61.8 | 59.2 | 62.6 | 65.5 | 79.2 | 26.4 | 54.1 | 62.5 |
| MSGCNet (2024) [59] | 57.4 | 71.6 | 61.0 | 58.4 | 61.7 | 64.2 | 78.6 | 25.0 | 53.0 | 61.0 |
| SFFNet (2024) [35] | 60.8 | 74.5 | 66.0 | 61.8 | 65.7 | 69.0 | 81.4 | 29.8 | 57.5 | 60.4 |
| MSEONet (2025) [60] | 55.7 | 70.1 | 59.2 | 56.9 | 60.2 | 62.7 | 77.5 | 22.3 | 51.2 | 59.1 |
| DBBANet (2024) [61] | 59.8 | 73.7 | 64.4 | 60.6 | 64.4 | 67.6 | 80.3 | 28.7 | 56.6 | 60.4 |
| SkySense (2024) [13] | 61.5 | 75.0 | 66.9 | 62.2 | 66.0 | 69.1 | 81.8 | 30.2 | 58.2 | 63.0 |
| GeoSA (2025) [27] | 62.4 | 75.8 | 67.5 | 63.1 | 67.4 | 70.0 | 82.4 | 31.4 | 59.8 | 62.7 |
| RSAM-Seg (2025) [30] | 62.7 | 76.0 | 68.7 | 64.0 | 68.3 | 71.4 | 82.8 | 32.1 | 60.5 | 60.0 |
| ADVMSeg (Ours) | 63.5 | 76.7 | 69.2 | 64.7 | 66.6 | 72.5 | 83.2 | 33.5 | 61.0 | 63.0 |
Table 3.
Comparison on ISPRS Potsdam. The mIoU, mF1, and OA values are measured under our unified protocol on an internally constructed split from the public annotated tiles, rather than on the official hidden test benchmark. Therefore, these results are intended for controlled comparison under our setting and are not directly comparable to results reported on the official benchmark server. The class columns report per-class F1 (%). Imp: impervious surface; Bld: building; LowVeg: low vegetation; Tre: tree; Car: car; Clut: clutter. The bold text indicates the best results.
| Method | mIoU | mF1 | OA | Imp | Bld | LowVeg | Tre | Car | Clut |
|---|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ (2018) [41] | 70.9 | 82.8 | 84.5 | 85.5 | 89.6 | 78.6 | 75.1 | 85.9 | 81.8 |
| DC-Swin (2022) [20] | 78.3 | 87.7 | 87.9 | 89.2 | 93.7 | 82.6 | 83.1 | 90.1 | 87.2 |
| CMTFNet (2023) [55] | 80.0 | 88.8 | 88.6 | 90.4 | 95.0 | 83.6 | 84.6 | 90.8 | 88.3 |
| MSGCNet (2024) [59] | 79.6 | 88.5 | 88.4 | 90.0 | 94.8 | 83.3 | 84.2 | 90.6 | 88.0 |
| SFFNet (2024) [35] | 80.5 | 89.1 | 88.8 | 90.6 | 95.2 | 84.0 | 84.9 | 91.0 | 88.8 |
| MSEONet (2025) [60] | 77.9 | 87.4 | 87.6 | 88.8 | 94.1 | 82.1 | 82.8 | 89.8 | 86.7 |
| DBBANet (2024) [61] | 80.7 | 89.2 | 89.0 | 90.7 | 95.2 | 84.3 | 85.1 | 91.1 | 89.0 |
| SkySense (2024) [13] | 80.8 | 89.3 | 89.2 | 90.8 | 95.3 | 84.4 | 85.2 | 91.3 | 88.6 |
| GeoSA (2025) [27] | 81.1 | 89.4 | 89.3 | 91.0 | 95.4 | 84.5 | 85.3 | 91.2 | 89.1 |
| RSAM-Seg (2025) [30] | 80.6 | 89.1 | 88.9 | 90.7 | 95.3 | 84.3 | 85.1 | 91.1 | 88.2 |
| ADVMSeg (Ours) | 81.4 | 89.6 | 89.6 | 90.8 | 95.3 | 85.1 | 85.7 | 91.4 | 89.5 |
Table 4.
Ablation study of SF-Adapter variants using mIoU. The baseline uses a completely frozen DINOv3 backbone without explicit task adaptation. The bold text indicates the best results.
| Variant | GID-15 mIoU (%) | LoveDA mIoU (%) | Potsdam mIoU (%) |
|---|---|---|---|
| Frozen DINOv3 + Mask2Former | 59.3 | 60.1 | 79.1 |
| + Plain Bottleneck Adapter | 59.4 | 60.8 | 79.6 |
| + Spatial Branch Only | 59.9 | 61.2 | 80.4 |
| + Spectral Branch Only | 60.1 | 61.7 | 79.9 |
| + Spectral + Spatial (Direct Sum) | 61.0 | 61.9 | 80.5 |
| + Full SF-Adapter | 62.2 | 62.8 | 80.8 |
Table 5.
Ablation study of different refinement strategies on Potsdam. B-IoU denotes Boundary IoU. The baseline corresponds to the model equipped with the full SF-Adapter before refinement. The bold text indicates the best results.
| Variant | Potsdam mIoU (%) | B-IoU (%) | Latency (ms) | GFLOPs |
|---|---|---|---|---|
| No Refinement (Baseline) | 80.8 | 76.5 | 132.0 | 366.0 |
| Random Sparse Refinement | 80.9 | 76.6 | 145.0 | 378.0 |
| Uncertainty-based Selection | 81.0 | 76.7 | 146.0 | 379.0 |
| Boundary-based Selection | 80.9 | 76.9 | 146.0 | 379.0 |
| Full ASR Module | 81.4 | 77.1 | 148.0 | 381.0 |
| Dense Refinement | 81.7 | 77.3 | 176.0 | 421.0 |
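One way to read the accuracy and cost columns together is to express the Full ASR Module as a fraction of what Dense Refinement gains and costs relative to the no-refinement baseline. The sketch below uses the mIoU and latency values from Table 5. The helper `share_of_dense` is a hypothetical name, and this ratio is an illustrative summary rather than a metric defined in the paper.

```python
# (mIoU %, latency ms) copied from Table 5 (Potsdam).
baseline = (80.8, 132.0)   # No Refinement
full_asr = (81.4, 148.0)   # Full ASR Module
dense = (81.7, 176.0)      # Dense Refinement

def share_of_dense(variant, base, ref):
    """Fraction of the reference's mIoU gain and latency overhead used by `variant`."""
    gain = (variant[0] - base[0]) / (ref[0] - base[0])
    cost = (variant[1] - base[1]) / (ref[1] - base[1])
    return gain, cost

gain_share, cost_share = share_of_dense(full_asr, baseline, dense)
print(f"mIoU gain share: {gain_share:.1%}, latency overhead share: {cost_share:.1%}")
```

With these numbers, the sparse module recovers about two thirds of the dense mIoU gain for a little over one third of the added latency.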
Table 6.
Inter-module ablation study of SF-Adapter and ASR using mIoU. The complete model corresponds to ADVMSeg. The bold text indicates the best results.
| SF-Adapter | ASR | GID-15 | LoveDA | Potsdam |
|---|---|---|---|---|
| | | 59.3 | 60.1 | 79.1 |
| | ✓ | 60.6 | 61.7 | 79.9 |
| ✓ | | 62.2 | 62.8 | 80.8 |
| ✓ | ✓ | 63.1 | 63.5 | 81.4 |
Table 7.
Overall efficiency comparison on Potsdam. Tr. Ratio: proportion of trainable parameters to total parameters. Mem.: peak GPU memory during training. The bold text indicates the best results.
| Variant | mIoU (%) | Params (M) | Tr. Params (M) | Tr. Ratio (%) | GFLOPs | Lat. (ms) | Mem. (GB) |
|---|---|---|---|---|---|---|---|
| Frozen DINOv3 + Mask2Former | 79.1 | 344.8 | 40.8 | 11.8 | 360.0 | 128.0 | 15.4 |
| + Plain Bottleneck Adapter | 79.6 | 349.1 | 45.1 | 12.9 | 364.0 | 131.0 | 16.1 |
| + SF-Adapter | 80.8 | 350.9 | 46.9 | 13.4 | 366.0 | 132.0 | 16.9 |
| + SF-Adapter + Dense Refinement | 81.7 | 357.2 | 53.2 | 14.9 | 421.0 | 176.0 | 18.8 |
| ADVMSeg | 81.4 | 354.5 | 50.5 | 14.2 | 381.0 | 148.0 | 17.6 |
| Full Fine-Tuning (ref.) | – | 344.8 | 344.8 | 100.0 | – | – | – |
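The Tr. Ratio column follows directly from its definition in the caption (trainable parameters divided by total parameters). The sketch below recomputes it from the Params and Tr. Params columns of Table 7 and also derives the frozen parameter count as their difference. The variable names are illustrative.

```python
# (total params in M, trainable params in M) copied from Table 7.
rows = {
    "Frozen DINOv3 + Mask2Former": (344.8, 40.8),
    "+ Plain Bottleneck Adapter": (349.1, 45.1),
    "+ SF-Adapter": (350.9, 46.9),
    "+ SF-Adapter + Dense Refinement": (357.2, 53.2),
    "ADVMSeg": (354.5, 50.5),
    "Full Fine-Tuning (ref.)": (344.8, 344.8),
}

# Trainable ratio in percent, rounded to one decimal as in the table.
ratios = {name: round(100.0 * trainable / total, 1)
          for name, (total, trainable) in rows.items()}

# Frozen parameters (M) = total minus trainable.
frozen = {name: round(total - trainable, 1)
          for name, (total, trainable) in rows.items()}

for name in rows:
    print(f"{name}: {ratios[name]}% trainable, {frozen[name]} M frozen")
```

The recomputed ratios agree with the table, and the frozen count comes out at 304.0 M for every variant except the full fine-tuning reference, which is consistent with a single frozen backbone shared across variants.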
Table 8.
Module-wise complexity decomposition of ADVMSeg on Potsdam. Routing includes uncertainty/boundary score generation and Top-M selection. Local Refinement denotes the sparse local cross-attention correction stage.
| Component | GFLOPs | Latency (ms) |
|---|---|---|
| Frozen DINOv3 Backbone | 298.0 | 101.0 |
| Pixel Decoder | 61.0 | 22.0 |
| SF-Adapter | 7.0 | 6.0 |
| ASR Routing | 1.8 | 2.0 |
| ASR Local Refinement | 13.2 | 17.0 |
| Total | 381.0 | 148.0 |
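The decomposition can be checked by summing the component rows and computing how much of the total is attributable to ASR (routing plus local refinement). The sketch below uses only the values in Table 8. Grouping the two ASR rows together is an illustrative choice.

```python
# (GFLOPs, latency ms) per component, copied from Table 8.
components = {
    "Frozen DINOv3 Backbone": (298.0, 101.0),
    "Pixel Decoder": (61.0, 22.0),
    "SF-Adapter": (7.0, 6.0),
    "ASR Routing": (1.8, 2.0),
    "ASR Local Refinement": (13.2, 17.0),
}

total_gflops = sum(g for g, _ in components.values())
total_latency = sum(t for _, t in components.values())

# Combined share of the two ASR stages, in percent of the total.
asr_names = ("ASR Routing", "ASR Local Refinement")
asr_gflops_share = 100.0 * sum(components[n][0] for n in asr_names) / total_gflops
asr_latency_share = 100.0 * sum(components[n][1] for n in asr_names) / total_latency

print(f"total: {total_gflops:.1f} GFLOPs, {total_latency:.1f} ms")
print(f"ASR share: {asr_gflops_share:.1f}% of GFLOPs, {asr_latency_share:.1f}% of latency")
```

The component rows sum to the reported totals. By this calculation ASR accounts for roughly 4% of the GFLOPs and roughly 13% of the latency.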
Table 9.
Sensitivity analysis of the Top-M ratio in ASR on Potsdam. The default setting used in ADVMSeg is 1.0%. The bold text indicates the best results.
| Top-M Ratio | Potsdam mIoU (%) | B-IoU (%) | GFLOPs | Latency (ms) |
|---|---|---|---|---|
| 0.25% | 80.9 | 76.7 | 374.0 | 140.0 |
| 0.5% | 81.0 | 76.9 | 377.0 | 143.0 |
| 1.0% | 81.4 | 77.1 | 381.0 | 148.0 |
| 2.0% | 81.2 | 77.2 | 389.0 | 155.0 |
| 5.0% | 81.2 | 77.2 | 420.0 | 173.0 |
Table 10.
Scalability comparison under different input resolutions on Potsdam.
| Input Resolution | Variant | mIoU (%) | GFLOPs | Latency (ms) |
|---|---|---|---|---|
| 512 × 512 | SF-Adapter Only | 80.8 | 366.0 | 132.0 |
| 512 × 512 | ADVMSeg (ASR) | 81.4 | 381.0 | 148.0 |
| 512 × 512 | Dense Refinement | 81.7 | 421.0 | 176.0 |
| 768 × 768 | SF-Adapter Only | 81.0 | 823.0 | 244.0 |
| 768 × 768 | ADVMSeg (ASR) | 81.5 | 856.0 | 273.0 |
| 768 × 768 | Dense Refinement | 81.8 | 948.0 | 337.0 |
| 1024 × 1024 | SF-Adapter Only | 81.0 | 1461.0 | 425.0 |
| 1024 × 1024 | ADVMSeg (ASR) | 81.5 | 1518.0 | 468.0 |
| 1024 × 1024 | Dense Refinement | 81.8 | 1690.0 | 587.0 |
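To make the scaling trend easier to read, the latency overhead of each refinement strategy can be expressed relative to the SF-Adapter Only variant at the same resolution. The sketch below uses the latency column of Table 10. The dictionary keys (`sf`, `asr`, `dense`) are shorthand introduced here.

```python
# Latency (ms) per input resolution, copied from Table 10.
latency = {
    512: {"sf": 132.0, "asr": 148.0, "dense": 176.0},
    768: {"sf": 244.0, "asr": 273.0, "dense": 337.0},
    1024: {"sf": 425.0, "asr": 468.0, "dense": 587.0},
}

# Relative latency overhead (%) over the SF-Adapter Only variant.
overhead = {
    res: {k: round(100.0 * (row[k] - row["sf"]) / row["sf"], 1)
          for k in ("asr", "dense")}
    for res, row in latency.items()
}

for res, values in overhead.items():
    print(f"{res} x {res}: ASR +{values['asr']}%, dense +{values['dense']}%")
```

By this measure the ASR overhead falls from about 12% to about 10% as resolution grows, while the dense overhead rises from about 33% to about 38%.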