Seeing Like Argus: Multi-Perspective Global–Local Context Learning for Remote Sensing Semantic Segmentation
Highlights
- ArgusNet introduces a Hybrid Global–Local Block (HGLB) that synergistically integrates Adaptive Windowed Additive Attention (AWAA) and SS2D-based state-space modeling, enabling complementary fine-grained local detail extraction and global long-range dependency modeling.
- The proposed Macro Guidance Module (MGM) effectively bridges semantic gaps between encoder and decoder features, significantly improving boundary continuity and segmentation consistency on multi-scale remote sensing objects.
- By jointly enhancing global–local context learning with linear computational complexity, ArgusNet improves segmentation accuracy in complex remote sensing scenes containing dense, small objects; large structures; and high intra-class variability.
- The framework demonstrates strong generalization across datasets (LoveDA, Potsdam), suggesting its suitability as a robust backbone for high-resolution land-cover mapping, urban analysis, and other geospatial applications requiring fine-grained semantic understanding.
Abstract
1. Introduction
- 1. We propose ArgusNet, a hybrid semantic segmentation network whose encoder is built from HGLBs, each composed of AWAA and VSS modules, and whose decoder is a lightweight convolutional design. This architecture combines the global modeling capacity of Mamba with the local modeling strengths of AWAA, thereby achieving substantial improvements in segmentation performance.
- 2. We design the AWAA module, which integrates window-based attention and a dynamic routing mechanism to realize multi-perspective local perception. This design improves dense small-object recognition while maintaining far lower computational complexity than conventional self-attention mechanisms.
- 3. We optimize the decoder architecture by introducing a three-stage feature fusion strategy that preserves fine-grained spatial details, and we incorporate the MGM to dynamically guide the injection of deep semantic features, thereby alleviating discontinuities in land-cover segmentation and further enhancing accuracy.
- 4. We conduct extensive experiments on two publicly available remote sensing datasets, LoveDA and Potsdam, showing that ArgusNet consistently outperforms or matches mainstream methods such as UNetMamba and SegFormer in segmentation accuracy, which supports the effectiveness of the proposed model.
2. Related Works
2.1. RS Semantic Segmentation Based on CNNs
2.2. Transformer- and Mamba-Based Semantic Segmentation
2.3. Multi-Scale Feature Fusion in RS
2.4. Large-Scale Remote Sensing Datasets and Multimodal Benchmarks
3. Methods
3.1. Overall Architecture
3.2. Adaptive Windowed Additive Attention
Algorithm 1. Adaptive Windowed Additive Attention (AWAA)

Require: Input feature map; window size p; number of shared vectors; number of routable vectors T; number of selected routable vectors.

Ensure: Refined feature map.

| Line | Operation |
|---|---|
| 1 | Step 0: Parameter Initialization |
| 2 | Initialize Shared Pool ∼ Xavier_Normal |
| 3 | Initialize Routable Pool ∼ Xavier_Normal |
| 4 | Step 1: Window Partitioning |
| 5 | ▹ Equation (6) |
| 6 | Step 2: Dynamic Routing Mechanism |
| 7 | ▹ Compute correlation scores, Equation (7) |
| 8 | ▹ Select top vector indices, Equation (8) |
| 9 | ▹ Routing weights for selected vectors, Equation (9) |
| 10 | Step 3: Linear Projections |
| 11 | ▹ |
| 12 | ▹ Initialize output matrix as zero tensor |
| 13 | Step 4: Shared Context Interaction |
| 14 | for each shared vector do |
| 15 | ▹ Shared attention weights, Equation (10) |
| 16 | ▹ Shared global query vector, Equation (11) |
| 17 | ▹ Aggregate shared global context, Equation (11) |
| 18 | end for |
| 19 | Step 5: Routed Context Interaction |
| 20 | for each selected routable vector do |
| 21 | ▹ Routed attention weights, Equation (10) |
| 22 | ▹ Routed global query vector, Equation (11) |
| 23 | ▹ Weighted routed global context, Equation (11) |
| 24 | end for |
| 25 | Step 6: Feature Fusion |
| 26 | ▹ Context fusion and projection, Equation (12) |
| 27 | ▹ Restore to spatial resolution |
| 28 | return |
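
The six steps of Algorithm 1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the equations referenced in the algorithm are not reproduced here, so the additive-attention form (softmax over token scores, a pooled global query, element-wise interaction with the keys) is borrowed from SwiftFormer-style additive attention, and the routing descriptor (window mean), the residual fusion with the queries, and every name (`awaa`, `additive_context`, `Wq`, `Wk`, `Wo`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(x, p):
    """(H, W, C) -> (num_windows, p*p, C); assumes H and W are divisible by p."""
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p, C)


def window_merge(w, p, H, W):
    """Inverse of window_partition: (num_windows, p*p, C) -> (H, W, C)."""
    C = w.shape[-1]
    x = w.reshape(H // p, W // p, p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)


def additive_context(Q, K, vec):
    """Additive attention driven by one vector per window (assumed form).

    Q, K: (num_windows, N, C); vec: (num_windows, C).
    Cost is linear in the number of tokens N.
    """
    C = Q.shape[-1]
    alpha = softmax(np.einsum("bnc,bc->bn", Q, vec) / np.sqrt(C), axis=1)
    q_global = (alpha[..., None] * Q).sum(axis=1, keepdims=True)  # pooled query
    return q_global * K  # broadcast the global query over all keys


def awaa(x, p, shared, routable, k, Wq, Wk, Wo):
    """Hypothetical AWAA forward pass following Steps 1-6 of Algorithm 1."""
    H, W, C = x.shape
    win = window_partition(x, p)                      # Step 1
    nw = win.shape[0]
    Q, K = win @ Wq, win @ Wk                         # Step 3
    out = np.zeros_like(Q)
    for vec in shared:                                # Step 4: shared context
        out += additive_context(Q, K, np.broadcast_to(vec, (nw, C)))
    if k > 0:                                         # Steps 2 and 5: routing
        scores = win.mean(axis=1) @ routable.T        # assumed descriptor: window mean
        top = np.argsort(-scores, axis=1)[:, :k]      # top-k routable vectors per window
        gate = softmax(np.take_along_axis(scores, top, axis=1), axis=1)
        for j in range(k):
            ctx = additive_context(Q, K, routable[top[:, j]])
            out += gate[:, j, None, None] * ctx
    out = (out + Q) @ Wo                              # Step 6: assumed residual fusion
    return window_merge(out, p, H, W)


# Usage with random stand-ins for the learned parameters.
rng = np.random.default_rng(0)
C, T = 8, 16
x = rng.standard_normal((16, 16, C))
shared = rng.standard_normal((1, C))
routable = rng.standard_normal((T, C))
Wq, Wk, Wo = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
y = awaa(x, p=4, shared=shared, routable=routable, k=1, Wq=Wq, Wk=Wk, Wo=Wo)
```

Because each attention vector only produces one score per token and one pooled query per window, the cost grows linearly with the number of tokens, which is the property the additive formulation is chosen for.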
3.3. Hybrid Global–Local Block
3.4. Macro Guidance Module
3.5. Architecture Variants
- ArgusNet-T: C = {64, 128, 320, 512}, layer numbers = {2, 2, 6, 2}
- ArgusNet-S: C = {96, 192, 384, 512}, layer numbers = {2, 2, 9, 2}
- ArgusNet-B: C = {128, 256, 512, 768}, layer numbers = {2, 2, 9, 2}
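
The three variants listed above can be written down as a small configuration table. This is only a sketch: the class name `ArgusNetVariant`, the field names, and the helper `total_blocks` are hypothetical and do not come from any released code; only the channel widths and layer numbers are taken from the list.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArgusNetVariant:
    """Per-stage channel widths C and layer numbers for one model scale."""
    channels: Tuple[int, int, int, int]
    depths: Tuple[int, int, int, int]


# Values copied from the variant list; the dictionary keys are illustrative.
VARIANTS = {
    "T": ArgusNetVariant(channels=(64, 128, 320, 512), depths=(2, 2, 6, 2)),
    "S": ArgusNetVariant(channels=(96, 192, 384, 512), depths=(2, 2, 9, 2)),
    "B": ArgusNetVariant(channels=(128, 256, 512, 768), depths=(2, 2, 9, 2)),
}


def total_blocks(name: str) -> int:
    """Total number of encoder layers across the four stages."""
    return sum(VARIANTS[name].depths)
```

For example, `total_blocks("T")` gives 12 layers, while the S and B variants both use 15 and differ only in channel width.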
4. Experiments and Results
4.1. Datasets
- (1) LoveDA Dataset: A high-resolution remote sensing dataset for domain-adaptive semantic segmentation that addresses urban–rural disparities. Collected by Wuhan University's RSIDEA team, it includes 5987 images (0.3 m resolution) covering three Chinese cities, with 166,768 annotated objects and 7 land-cover categories. The dataset is split into urban (buildings, roads) and rural (farmland, water bodies) domains to study cross-domain adaptation. Challenges include multi-scale objects and complex backgrounds. In our study, we merged the urban and rural data for training and validation and evaluated the results via the official test platform.
- (2) Potsdam Dataset: The Potsdam dataset contains 38 high-resolution images of 6000 × 6000 pixels over the city of Potsdam, Germany, with a ground sampling distance of 5 cm. It provides two modalities, the true orthophoto (TOP) and the digital surface model (DSM). The DSM encodes surface height, whereas the TOP provides the optical imagery; in this work, we use the RGB TOP images and ignore the DSM. Following the experimental setup of [57], we use 24 images for training and 14 images for testing. The test set comprises tiles 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13. Five categories are evaluated, namely impervious surface, building, tree, low vegetation, and car. Each image was divided into 512 × 512 sub-images using overlapping partitioning with a stride of 256 pixels.
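
The overlapping partitioning of the Potsdam images can be sketched as follows. The window size (512) and stride (256) come from the text; how the image border is handled is not stated, so appending a final tile flush with the border is an assumption, and the function names `tile_origins` and `tile_boxes` are hypothetical.

```python
def tile_origins(size, window=512, stride=256):
    """Top-left coordinates of overlapping tiles along one axis.

    Assumption: if the regular grid does not end exactly at the border,
    one extra tile is placed flush with the border so no pixels are dropped.
    """
    if size < window:
        raise ValueError("image is smaller than the tile window")
    origins = list(range(0, size - window + 1, stride))
    if origins[-1] != size - window:
        origins.append(size - window)
    return origins


def tile_boxes(height, width, window=512, stride=256):
    """All (top, left, bottom, right) boxes covering a height x width image."""
    return [
        (top, left, top + window, left + window)
        for top in tile_origins(height, window, stride)
        for left in tile_origins(width, window, stride)
    ]


boxes = tile_boxes(6000, 6000)
```

Under this border assumption, a 6000 × 6000 image yields 23 origins per axis and therefore 23 × 23 = 529 sub-images; a different border policy (padding or discarding the remainder) would change that count.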
4.2. Implementation Details
- (1) Training Settings: To ensure a fair comparison, all competing methods are implemented within the MMSegmentation framework with a fixed random seed. All experiments are conducted on two NVIDIA RTX3090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA) with a batch size of 16. For faster convergence, we adopt the AdamW optimizer with an initial learning rate of and employ a PolyLR schedule to progressively decay the learning rate, with no further hyperparameter tuning. For the competing methods, we use either the default configuration files provided by the framework or their officially reported optimal configurations. The same data augmentation pipeline, including random scaling, random rotation, random cropping, random flipping, and random photometric distortion, is applied in all experiments.
- (2) Loss Function: We use the cross-entropy (CE) loss as the training objective.
- (3) Evaluation Metrics: We assess model performance using the mean F1 score (mF1) and the mean intersection over union (mIoU).
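
The loss and metrics named above can be sketched with their standard textbook definitions: cross-entropy as the mean negative log-probability of the true class, per-class IoU as TP / (TP + FP + FN), and per-class F1 as 2TP / (2TP + FP + FN), each averaged over classes. The paper's own notation is not reproduced here, so the function names and the handling of classes that never occur (scored as 0) are assumptions.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean cross-entropy; logits: (N, C) floats, labels: (N,) class indices."""
    z = logits - logits.max(axis=1, keepdims=True)           # stabilize exp
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(labels.size), labels].mean()


def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    idx = target * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def miou_mf1(cm):
    """Mean IoU and mean F1 over classes from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    # Assumption: a class with no pixels at all contributes 0 rather than NaN.
    iou = tp / np.maximum(tp + fp + fn, 1)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return iou.mean(), f1.mean()


# Tiny worked example with two classes and four pixels.
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
cm = confusion_matrix(pred, target, num_classes=2)
miou, mf1 = miou_mf1(cm)
```

In the worked example the per-class IoU values are 1/2 and 2/3, and the per-class F1 values are 2/3 and 4/5, so F1 is always at least as large as IoU for the same class, which matches the gap between the mIoU and mF1 columns in the result tables.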
4.3. Ablation Study
- (1) Effectiveness of Routing Strategies. To evaluate the routing strategy, we conduct a controlled experiment on the validation set of the LoveDA dataset (Table 1). The baseline is a single shared vector with routing disabled (Table 1, last row), and the remaining settings vary one parameter at a time. The baseline achieves the lowest performance (mIoU = 50.12, mF1 = 65.37), whereas every routing-enabled model improves on it by more than 2 points in both metrics. The best performance is obtained with one shared vector, one selected routable vector, and T = 16, where mIoU increases to 53.22 (+3.10 percentage points) and mF1 increases to 68.44 (+3.07 percentage points). This configuration balances feature diversity and utilization. However, when T is increased from 16 to 32 with one shared vector and one selected routable vector, mIoU decreases by 0.39 and mF1 decreases by 0.33. With no shared vector and two selected routable vectors, the reductions are more pronounced, with mIoU and mF1 decreasing by 0.53 and 0.42, respectively. This degradation likely arises because an excessively large T leads to sparse feature allocation. In contrast, when T is fixed, increasing the number of selected routable vectors (the experts) results in only minor fluctuations in performance, suggesting that the current number of experts is already sufficient for the task. Increasing it further introduces knowledge redundancy and has a limited impact on overall performance.
- (2) Effectiveness of HGLB and MGM. Within the HGLB, AWAA and VSS are responsible for local and global feature modeling, respectively. To verify their necessity, we switch each component on or off in a controlled manner, in combination with the on/off configuration of MGM; the complete results are summarized in Table 2. The best segmentation performance is achieved when all modules are enabled (last row of Table 2). When VSS is retained while AWAA is disabled (second row of Table 2), the network can still capture long-range dependencies, but the lack of local boundary characterization leads to a drop in mIoU from 53.22% to 49.77% and in mF1 from 68.44% to 65.66%.
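
The differences quoted in the routing ablation can be re-derived from the rows of Table 1 with a few lines of arithmetic. The key layout `(shared, selected, T)` is an interpretation of the table's first three columns, inferred from the baseline description (one shared vector, routing disabled); the dictionary and helper names are hypothetical.

```python
# (shared vectors, selected routable vectors, T) -> (mIoU, mF1), from Table 1.
TABLE1 = {
    (0, 1, 16): (52.72, 68.10),
    (0, 1, 32): (52.17, 67.67),
    (0, 2, 16): (52.73, 68.11),
    (0, 2, 32): (52.20, 67.69),
    (1, 1, 16): (53.22, 68.44),
    (1, 1, 32): (52.83, 68.11),
    (1, 2, 16): (53.20, 68.43),
    (1, 2, 32): (52.79, 68.07),
    (1, 0, 0): (50.12, 65.37),   # baseline: routing disabled
}


def delta(setting, reference):
    """Change in (mIoU, mF1) of `setting` relative to `reference`, in points."""
    return tuple(round(a - b, 2) for a, b in zip(TABLE1[setting], TABLE1[reference]))


best = max(TABLE1, key=lambda s: TABLE1[s][0])   # setting with the highest mIoU
gain_over_baseline = delta(best, (1, 0, 0))
drop_large_pool = delta((1, 1, 32), (1, 1, 16))
drop_no_shared = delta((0, 2, 32), (0, 2, 16))
```

The same helper makes it easy to check any other pair of rows, for instance the effect of adding a second selected vector at fixed T.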
4.4. Comparison to the State-of-the-Art Methods
- (1) Comparisons on the LoveDA Dataset: The comparative results on the LoveDA dataset are summarized in Table 4. Across the three model scales (T, S, B), ArgusNet achieves segmentation performance that is comparable to or better than existing state-of-the-art methods. ArgusNet-B attains an mIoU of 54.89%, the highest among all compared models, exceeding OverLoCK-B, which has a similar parameter scale, by 1.32 percentage points. This supports the effectiveness of the proposed synergistic design of HGLB and MGM. At the medium scale, ArgusNet-S achieves an mIoU of 53.36%, outperforming SegFormer-B4 (52.94%) and Swin-T (51.90%) while requiring fewer than half the FLOPs of Swin-T, although more than SegFormer-B4 (Table 3). The lightweight ArgusNet-T also achieves competitive results, with an mIoU of 52.70%, surpassing SegFormer-B2 (52.35%) and UNetMamba (51.20%), albeit with more parameters and FLOPs than either (Table 3). In terms of class-wise performance, the ArgusNet series shows clear advantages in categories with complex textures and high inter-class similarity, such as Forest and Agricultural. For instance, ArgusNet-B achieves 65.61% IoU on the Agricultural category, exceeding the best competing method by 2.42 percentage points, which is consistent with AWAA enhancing detail preservation in boundary-complex regions. Moreover, ArgusNet maintains competitive performance on structured categories such as Building, reflecting the general adaptability of local–global feature modeling across diverse land-cover classes.
- (2) Comparisons on the Potsdam Dataset: The comparative results on the Potsdam dataset are summarized in Table 5. ArgusNet achieves the highest mIoU and mF1 within each of the three model scales. In particular, ArgusNet-B achieves the highest overall performance, with 89.31% in mIoU and 94.55% in mF1, and also obtains the best results in all categories. Compared with OverLoCK-B, which has a similar parameter size, ArgusNet-B improves mF1 by 0.56 percentage points, indicating its advantage in fine-grained segmentation of high-resolution urban scenes. The medium-scale ArgusNet-S attains 88.87% in mIoU and 94.13% in mF1, outperforming SegFormer-B4 (88.50% and 93.92%) and Swin-T (88.23% and 93.61%). It also leads in all categories among the medium-scale models, demonstrating the efficiency and generalization of the joint modeling of AWAA and VSS. The lightweight ArgusNet-T achieves 88.74% in mIoU, which surpasses UNetMamba (86.90%) and RS3Mamba (87.72%). Category-wise analysis shows that the ArgusNet series achieves accuracy comparable to the best competing methods in well-structured classes such as Impervious Surface and Building, while delivering the highest accuracy in small-object categories such as Car. Notably, ArgusNet-B reaches 94.77% in the Car category, highlighting its advantage in handling small objects and capturing precise boundaries. Furthermore, ArgusNet also performs favorably in the Tree category, indicating that the proposed global–local feature fusion strategy enables stable performance in vegetation classes with large texture variations.
- (3) Complexity Comparison: The complexity of each model was measured in terms of the number of parameters (M) and floating-point operations (FLOPs, in G). All results were obtained on a single NVIDIA 3090 GPU with an input size of 512 × 512, as reported in Table 3. Methods based on CNNs, such as DeepLabV3+ and FCN, require substantially higher computation due to dilated convolutions in the ResNet backbone. Under comparable parameter sizes, their FLOPs are more than 2.6 times those of ArgusNet-T (177 G and 198 G versus 67 G), while producing lower segmentation accuracy. ArgusNet-S achieves competitive accuracy with roughly half the FLOPs of Swin Transformer-T (111 G versus 236 G), and ArgusNet-B with about two-thirds the FLOPs of Swin Transformer-B (197 G versus 298 G), indicating that ArgusNet provides an effective balance between accuracy and efficiency.
- (4) Generalization Capability Test: To evaluate the generalization ability of ArgusNet under domain shift, we conducted an additional domain generalization experiment on the LoveDA dataset. The model was trained on the Urban domain and tested on the Rural domain, following the same training process as previously described. We compared ArgusNet-T with the second- and third-best methods of similar model size from Table 4 (SegFormer-B2 and DeepLabV3+), and the results are shown in Table 6. Due to severe data imbalance and significant style differences, all three methods, including ArgusNet, experienced performance degradation. Nevertheless, our method still achieved the best mIoU (40.51%, versus 38.69% and 36.33%), although not the best IoU in every class, which is consistent with the observations from the previous experiments and further supports the potential of ArgusNet.
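
The FLOPs ratios discussed in the complexity comparison can be checked directly against Table 3. The snippet below is a small arithmetic sketch; the dictionary and the helper `flops_ratio` are hypothetical names, and the values are the FLOPs (G) column of Table 3 for the models mentioned in the text.

```python
# FLOPs in G at 512 x 512 input, copied from Table 3.
FLOPS_G = {
    "DeepLabV3+ (R50)": 177,
    "FCN (R50)": 198,
    "ArgusNet-T": 67,
    "SwinTransformer-T": 236,
    "ArgusNet-S": 111,
    "SwinTransformer-B": 298,
    "ArgusNet-B": 197,
}


def flops_ratio(model, reference):
    """FLOPs of `model` divided by FLOPs of `reference`, rounded to 2 decimals."""
    return round(FLOPS_G[model] / FLOPS_G[reference], 2)


cnn_vs_tiny = {
    name: flops_ratio(name, "ArgusNet-T")
    for name in ("DeepLabV3+ (R50)", "FCN (R50)")
}
small_vs_swin_t = flops_ratio("ArgusNet-S", "SwinTransformer-T")
base_vs_swin_b = flops_ratio("ArgusNet-B", "SwinTransformer-B")
```

The two CNN baselines come out at 2.64 and 2.96 times the FLOPs of ArgusNet-T, ArgusNet-S at 0.47 of Swin Transformer-T, and ArgusNet-B at 0.66 of Swin Transformer-B.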
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Maus, V.; Câmara, G.; Cartaxo, R.; Sanchez, A.; Ramos, F.M.; De Queiroz, G.R. A time-weighted dynamic time warping method for land-use and land-cover mapping. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 3729–3739.
2. Sahar, L.; Muthukumar, S.; French, S.P. Using aerial imagery and GIS in automated building footprint extraction and shape recognition for earthquake risk assessment of urban inventories. IEEE Trans. Geosci. Remote Sens. 2010, 48, 3511–3520.
3. Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A fully convolutional neural network for automatic building extraction from high-resolution remote sensing images. Remote Sens. 2020, 12, 1050.
4. Chen, J.; Xia, M.; Wang, D.; Lin, H. Double branch parallel network for segmentation of buildings and waters in remote sensing images. Remote Sens. 2023, 15, 1536.
5. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
6. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
7. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
8. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752.
9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
10. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
11. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
12. Wu, C.; Wu, F.; Qi, T.; Huang, Y.; Xie, X. Fastformer: Additive attention can be all you need. arXiv 2021, arXiv:2108.09084.
13. Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. Swiftformer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 17425–17436.
14. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
15. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 62429–62442.
16. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063.
17. Li, H.; Pan, H.; Liu, X.; Ren, J.; Du, Z.; Cao, J. GLVMamba: A Global-Local Visual State Space Model for Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4412115.
18. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. Rs-mamba for large remote sensing image dense prediction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5633314.
19. Luo, H.; Chen, C.; Fang, L.; Zhu, X.; Lu, L. High-resolution aerial images semantic segmentation using deep fully convolutional network with channel attention mechanism. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3492–3507.
20. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
21. Li, X.; Yang, X.; Li, X.; Lu, S.; Ye, Y.; Ban, Y. GCDB-UNet: A novel robust cloud detection approach for remote sensing images. Knowl.-Based Syst. 2022, 238, 107890.
22. Gao, Y.; Cao, H.; Cai, W.; Zhou, G. Pixel-level road crack detection in UAV remote sensing images based on ARD-Unet. Measurement 2023, 219, 113252.
23. Wang, Z.; Wang, J.; Yang, K.; Wang, L.; Su, F.; Chen, X. Semantic segmentation of high-resolution remote sensing images based on a class feature attention mechanism fused with Deeplabv3+. Comput. Geosci. 2022, 158, 104969.
24. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408715.
25. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
26. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
27. Hatamizadeh, A.; Yin, H.; Heinrich, G.; Kautz, J.; Molchanov, P. Global context vision transformers. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 12633–12646.
28. Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Shahbaz Khan, F. Edgenext: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 3–20.
29. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451.
30. Li, Y.; Yuan, G.; Wen, Y.; Hu, J.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. Efficientformer: Vision transformers at mobilenet speed. Adv. Neural Inf. Process. Syst. 2022, 35, 12934–12949.
31. Hamilton, M.; Zhang, Z.; Hariharan, B.; Snavely, N.; Freeman, W.T. Unsupervised semantic segmentation by distilling feature correspondences. arXiv 2022, arXiv:2203.08414.
32. Wang, X.; Girdhar, R.; Yu, S.X.; Misra, I. Cut and learn for unsupervised object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3124–3134.
33. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026.
34. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193.
35. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405.
36. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A novel mamba architecture with a semantic transformer for efficient real-time remote sensing semantic segmentation. Remote Sens. 2024, 16, 2620.
37. Wan, Z.; Zhang, P.; Wang, Y.; Yong, S.; Stepputtis, S.; Sycara, K.; Xie, Y. Sigma: Siamese mamba network for multi-modal semantic segmentation. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2025; pp. 1734–1744.
38. Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491.
39. Zhou, W.; Kamata, S.i.; Wang, H.; Wong, M.S.; Hou, H.C. Mamba-in-mamba: Centralized mamba-cross-scan in tokenized mamba model for hyperspectral image classification. Neurocomputing 2025, 613, 128751.
40. Dong, J.; Yin, H.; Li, H.; Li, W.; Zhang, Y.; Khan, S.; Khan, F.S. Dual hyperspectral mamba for efficient spectral compressive imaging. arXiv 2024, arXiv:2406.00449.
41. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
42. Wang, X.; Wang, S.; Ning, C.; Zhou, H. Enhanced feature pyramid network with deep semantic embedding for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7918–7932.
43. Gu, Y.; Wang, B.; Xu, B. A FPN-based framework for vehicle detection in aerial images. In Proceedings of the 2018 2nd International Conference on Video and Image Processing, Hong Kong, 29–31 December 2018; pp. 60–64.
44. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
45. Liu, W.; Shu, Y.; Tang, X.; Liu, J. Remote sensing image segmentation using dual attention mechanism Deeplabv3+ algorithm. Trop. Geogr. 2020, 40, 303–313.
46. Li, Y.; Chen, W.; Huang, X.; Gao, Z.; Li, S.; He, T.; Zhang, Y. MFVNet: A deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation. Sci. China Inf. Sci. 2023, 66, 140305.
47. Ma, P.; Ren, J.; Sun, G.; Zhao, H.; Jia, X.; Yan, Y.; Zabalza, J. Multiscale superpixelwise prophet model for noise-robust feature extraction in hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5508912.
48. Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566.
49. Zheng, X.; Wang, B.; Du, X.; Lu, X. Mutual attention inception network for remote sensing visual question answering. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5606514.
50. Zi, X.; Xiao, J.; Shi, Y.; Tao, X.; Li, J.; Braytee, A.; Prasad, M. RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; pp. 12905–12911.
51. Wang, J.; Xuan, W.; Qi, H.; Liu, Z.; Liu, K.; Wu, Y.; Chen, H.; Song, J.; Xia, J.; Zheng, Z.; et al. DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response. arXiv 2025, arXiv:2505.21089.
52. Wang, Z.; Prabha, R.; Huang, T.; Wu, J.; Rajagopal, R. Skyscript: A large and semantically diverse vision-language dataset for remote sensing. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5805–5813.
53. Xiao, A.; Xuan, W.; Wang, J.; Huang, J.; Tao, D.; Lu, S.; Yokoya, N. Foundation models for remote sensing and earth observation: A survey. IEEE Geosci. Remote Sens. Mag. 2025, 13, 297–324.
54. Lu, S.; Guo, J.; Zimmer-Dauphinee, J.R.; Nieusma, J.M.; Wang, X.; Wernke, S.A.; Huo, Y. Vision foundation models in remote sensing: A survey. IEEE Geosci. Remote Sens. Mag. 2025, 13, 190–215.
55. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733.
56. 2D Semantic Labeling Contest—Potsdam. ISPRS 2D Semantic Labeling Contest. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 2 February 2026).
57. Hanyu, T.; Yamazaki, K.; Tran, M.; McCann, R.A.; Liao, H.; Rainwater, C.; Adkins, M.; Cothren, J.; Le, N. AerialFormer: Multi-resolution transformer for aerial image segmentation. Remote Sens. 2024, 16, 2930.
58. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
59. Lou, M.; Yu, Y. OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 128–138.
60. Zhu, E.; Chen, Z.; Wang, D.; Shi, H.; Liu, X.; Wang, L. Unetmamba: An efficient unet-like mamba for semantic segmentation of high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2024, 22, 6001205.










Table 1. Ablation of routing strategies on the LoveDA validation set.

| Shared vectors | Selected routable vectors | T | mIoU | mF1 |
|---|---|---|---|---|
| 0 | 1 | 16 | 52.72 | 68.10 |
| 0 | 1 | 32 | 52.17 | 67.67 |
| 0 | 2 | 16 | 52.73 | 68.11 |
| 0 | 2 | 32 | 52.20 | 67.69 |
| 1 | 1 | 16 | 53.22 | 68.44 |
| 1 | 1 | 32 | 52.83 | 68.11 |
| 1 | 2 | 16 | 53.20 | 68.43 |
| 1 | 2 | 32 | 52.79 | 68.07 |
| 1 | 0 | 0 | 50.12 | 65.37 |

Table 2. Ablation of the HGLB components (AWAA, VSS) and MGM. Class columns report IoU per class (%).

| AWAA | VSS | MGM | Background | Building | Road | Water | Barren | Forest | Agricultural | mIoU | mF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✓ | 49.83 | 54.44 | 47.95 | 64.74 | 27.36 | 38.52 | 48.01 | 47.26 | 64.20 |
| ✗ | ✓ | ✓ | 52.14 | 57.57 | 53.04 | 65.62 | 27.50 | 42.73 | 49.75 | 49.77 | 65.66 |
| ✓ | ✓ | ✗ | 53.29 | 60.17 | 55.81 | 70.48 | 28.80 | 38.28 | 58.06 | 52.13 | 67.42 |
| ★ | ✓ | ✗ | 52.83 | 58.72 | 54.69 | 68.02 | 27.31 | 38.11 | 56.14 | 50.83 | 65.92 |
| ✓ | ✓ | ✓ | 54.63 | 61.45 | 56.63 | 71.00 | 29.44 | 39.20 | 60.17 | 53.22 | 68.44 |

Table 3. Complexity comparison at an input size of 512 × 512.

| Method | Year | Type | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| DeepLabV3+ (R50) | 2018 | CNN | 41.2 | 177 |
| SegFormer-B2 | 2021 | Attention | 24.7 | 25 |
| UNetFormer (R50) | 2021 | Hybrid | 32 | 66 |
| UNetMamba | 2024 | SSM | 14 | 38 |
| RS3Mamba | 2024 | SSM | 43.4 | 40 |
| ArgusNet-T (Ours) | – | Hybrid | 38.5 | 67 |
| FCN (R50) | 2015 | CNN | 47.1 | 198 |
| SegFormer-B4 | 2021 | Attention | 61.4 | 59 |
| SwinTransformer-T | 2021 | Attention | 58.9 | 236 |
| ArgusNet-S (Ours) | – | Hybrid | 60.6 | 111 |
| SwinTransformer-B | 2021 | Attention | 120 | 298 |
| AerialFormer-B | 2024 | Hybrid | 114 | 133 |
| OverLoCK-B | 2025 | CNN | 127 | 297 |
| ArgusNet-B (Ours) | – | Hybrid | 113 | 197 |

Table 4. Comparison on the LoveDA dataset. Class columns report IoU per class (%).

| Method | Background | Building | Road | Water | Barren | Forest | Agricultural | mIoU |
|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ R50 | 41.41 | 55.88 | 54.25 | 79.62 | 22.56 | 42.69 | 62.47 | 51.70 |
| SegFormer B2 | 43.56 | 56.13 | 56.11 | 79.52 | 22.38 | 46.54 | 62.18 | 52.35 |
| UNetFormer R50 | 41.95 | 55.95 | 53.65 | 77.97 | 16.80 | 44.68 | 61.85 | 50.40 |
| UNetMamba | 42.55 | 54.80 | 54.91 | 78.58 | 24.14 | 45.35 | 58.11 | 51.20 |
| RS3Mamba | 43.69 | 55.17 | 54.88 | 77.63 | 16.72 | 47.23 | 62.76 | 51.15 |
| ArgusNet-T (Ours) | 44.40 | 56.20 | 56.91 | 79.73 | 21.35 | 47.45 | 62.88 | 52.70 |
| FCN R50 | 44.83 | 57.30 | 55.86 | 79.17 | 25.08 | 45.36 | 62.48 | 52.87 |
| SegFormer B4 | 44.89 | 57.41 | 55.12 | 79.58 | 25.23 | 45.69 | 62.67 | 52.94 |
| Swin-T | 44.27 | 54.44 | 56.46 | 78.42 | 23.99 | 44.12 | 61.58 | 51.90 |
| ArgusNet-S (Ours) | 45.47 | 56.64 | 56.95 | 79.94 | 23.26 | 47.49 | 63.80 | 53.36 |
| Swin-B | 44.62 | 58.11 | 58.18 | 79.18 | 23.50 | 46.53 | 58.42 | 52.65 |
| AerialFormer-B | 46.12 | 59.15 | 58.12 | 80.57 | 17.47 | 46.47 | 63.14 | 53.00 |
| OverLoCK-B | 46.15 | 58.62 | 55.45 | 80.12 | 24.81 | 46.68 | 63.19 | 53.57 |
| ArgusNet-B (Ours) | 45.85 ± 0.12 | 59.27 ± 0.17 | 59.30 ± 0.04 | 80.67 ± 0.09 | 25.70 ± 0.23 | 47.85 ± 0.13 | 65.61 ± 0.20 | 54.89 ± 0.04 |

Table 5. Comparison on the Potsdam dataset. Class columns report IoU per class (%).

| Method | Impervious Surface | Building | Low Vegetation | Tree | Car | mIoU | mF1 |
|---|---|---|---|---|---|---|---|
| DeepLabV3+ R50 | 90.01 | 95.13 | 79.26 | 80.08 | 93.40 | 87.58 | 92.92 |
| SegFormer B2 | 90.03 | 95.28 | 80.10 | 80.15 | 93.24 | 87.76 | 92.81 |
| UNetFormer R50 | 89.21 | 94.82 | 78.98 | 78.96 | 93.03 | 87.00 | 92.90 |
| UNetMamba | 89.37 | 94.52 | 79.41 | 79.32 | 91.92 | 86.90 | 92.88 |
| RS3Mamba | 90.09 | 95.48 | 80.12 | 79.79 | 93.13 | 87.72 | 93.33 |
| ArgusNet-T (Ours) | 90.85 | 95.84 | 81.51 | 80.95 | 94.58 | 88.74 | 93.81 |
| FCN R50 | 89.99 | 95.00 | 79.94 | 80.90 | 94.42 | 88.05 | 93.52 |
| SegFormer B4 | 90.61 | 95.38 | 81.09 | 81.05 | 94.38 | 88.50 | 93.92 |
| Swin-T | 90.25 | 95.47 | 81.14 | 80.16 | 94.11 | 88.23 | 93.61 |
| ArgusNet-S (Ours) | 90.98 | 95.89 | 81.63 | 81.11 | 94.73 | 88.87 | 94.13 |
| Swin-B | 91.02 | 96.07 | 81.59 | 81.38 | 94.40 | 88.89 | 94.00 |
| AerialFormer-B | 90.55 | 95.78 | 80.55 | 80.07 | 94.10 | 88.21 | 93.60 |
| OverLoCK-B | 90.91 | 96.13 | 81.36 | 81.34 | 94.55 | 88.86 | 93.99 |
| ArgusNet-B (Ours) | 91.56 ± 0.03 | 96.81 ± 0.02 | 81.69 ± 0.04 | 81.72 | 94.77 ± 0.05 | 89.31 ± 0.0004 | 94.55 ± 0.03 |

Table 6. Domain generalization on LoveDA (trained on Urban, tested on Rural). Class columns report IoU per class (%).

| Method | Background | Building | Road | Water | Barren | Forest | Agricultural | mIoU |
|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ R50 | 57.41 | 35.31 | 36.99 | 53.10 | 11.00 | 7.91 | 52.59 | 36.33 |
| SegFormer B2 | 49.64 | 41.74 | 38.10 | 55.24 | 6.83 | 26.52 | 52.82 | 38.69 |
| ArgusNet-T (Ours) | 55.89 ± 0.77 | 46.25 ± 0.56 | 39.14 ± 0.79 | 53.87 ± 1.24 | 7.92 ± 0.55 | 26.68 ± 0.69 | 53.80 ± 0.68 | 40.51 ± 0.20 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Chen, H.; Feng, Y.; Wang, K.; Liao, M.; Zhai, H.; Xia, T.; Zhang, Y.; Jiao, J.; Wen, C. Seeing Like Argus: Multi-Perspective Global–Local Context Learning for Remote Sensing Semantic Segmentation. Remote Sens. 2026, 18, 521. https://doi.org/10.3390/rs18030521

