CLIP-Driven with Dynamic Feature Selection and Alignment Network for Referring Remote Sensing Image Segmentation
Highlights
- We propose CD2FSAN, a CLIP-driven single-stage framework that integrates dynamic feature selection, multi-scale aggregation and alignment, and a dynamic rotation correction decoder, explicitly addressing scale variation, orientation diversity, and cross-modal misalignment in referring remote sensing image segmentation.
- Extensive experiments demonstrate state-of-the-art performance, while ablation studies and qualitative visualizations confirm the effectiveness of the proposed modules and the robustness of CD2FSAN in small-object localization, rotated target delineation, and fine-grained vision–language alignment.
- This work pioneers a new CLIP-based paradigm for remote sensing segmentation that does not rely on SAM-based prompt engineering, achieving both high accuracy and computational efficiency, as verified by the efficiency and complexity analysis.
- The framework effectively narrows the modality gap between natural-image–trained CLIP features and domain-specific remote sensing imagery, enabling accurate segmentation under complex spatial and geometric variations.
Abstract
1. Introduction
- We propose CD2FSAN (CLIP-Driven Dynamic Feature Selection and Alignment Network), a single-stage, CLIP-tailored framework for RRSIS that jointly optimizes cross-modal alignment, small-object localization, and geometry-aware decoding in complex aerial scenes.
- To enhance cross-modal alignment and fine-grained segmentation, we design an integrated visual–language alignment and geometry-aware decoding framework. The DFSM adaptively selects language-consistent features from CLIP’s hierarchy based on the principle of maximizing cross-modal information, thereby strengthening cross-modal correspondence under domain shifts (a code sketch of this selection principle follows this list). The MAAM constructs scale-consistent representations for small-object localization and multi-scale fusion. The DRCD employs differentiable affine transformations to align receptive fields with object orientations, enabling precise segmentation of rotated or irregular targets.
- Performance evaluations on three public RRSIS benchmarks demonstrate that CD2FSAN achieves state-of-the-art segmentation accuracy with competitive computational efficiency. Validation under complex remote sensing scenarios further confirms that the proposed framework maintains robust cross-modal alignment and fine-grained segmentation performance even in scenes with large scale variations, object rotations, and background interference, highlighting its practical value for real-world vision–language remote sensing applications.
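To make the selection principle behind the DFSM more concrete, the sketch below illustrates one plausible realization in PyTorch: each candidate layer of the CLIP visual hierarchy is scored against the sentence embedding, and the hierarchy is fused with softmax weights. The class and argument names are our own assumptions for illustration and do not reproduce the released implementation.

```python
# Hedged sketch: text-guided selection over a hierarchy of CLIP visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFeatureSelection(nn.Module):
    """Fuses a hierarchy of CLIP visual features with text-conditioned softmax weights."""
    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, txt_dim)  # project visual tokens into the text space

    def forward(self, layer_feats, txt_emb):
        # layer_feats: list of [B, N, vis_dim] token maps from different CLIP blocks
        # txt_emb:     [B, txt_dim] sentence embedding from the CLIP text encoder
        scores = []
        for feats in layer_feats:
            pooled = self.vis_proj(feats).mean(dim=1)                      # [B, txt_dim] layer descriptor
            scores.append(F.cosine_similarity(pooled, txt_emb, dim=-1))   # language consistency of this layer
        weights = torch.softmax(torch.stack(scores, dim=1), dim=1)        # [B, L] selection weights
        fused = sum(weights[:, i, None, None] * layer_feats[i] for i in range(len(layer_feats)))
        return fused                                                       # [B, N, vis_dim] language-consistent features
```

In this reading, layers whose pooled descriptors align best with the referring expression dominate the fused representation, which is one way to obtain language-consistent features under a domain shift.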
2. Related Work
2.1. Referring Image Segmentation
2.2. Referring Remote Sensing Image Segmentation
3. Materials and Methods
3.1. Overview
3.2. Image and Text Feature Encoding
3.3. Dynamic Feature Selection Mechanism
3.4. Multi-Scale Aggregation and Alignment Module
3.5. Dynamic Rotation Correction Decoder
4. Results
4.1. Dataset and Implementation Details
- RefSegRS. The dataset comprises 4420 image–text–label triplets drawn from 285 scenes. The training, validation, and test splits contain 2172, 431, and 1817 triplets, respectively, corresponding to 151, 31, and 103 scenes. Fourteen categories (for example, road, vehicle, car, van, and building) and five attributes are annotated. Images are 512 × 512 pixels at a ground sampling distance (GSD) of 0.13 m.
- RRSIS-D. This benchmark contains 17,402 triplets of images, segmentation masks, and referring expressions. The training, validation, and test sets include 12,181, 1740, and 3481 triplets, respectively. It covers 20 semantic categories (for example, airplane, golf field, expressway service area, baseball field, and stadium) and seven attributes. Images are 800 × 800 pixels, with GSD ranging from 0.5 m to 30 m.
- RISBench. The dataset includes 52,472 image–language–label triplets. The training, validation, and test partitions contain 26,300, 10,013, and 16,159 triplets, respectively. It features 26 categories with eight attributes. All images are resized to 512 × 512 pixels, with GSD ranging from 0.1 m to 30 m. A sketch of how such image–expression–mask triplets can be wrapped for training follows this list.
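For orientation, a minimal dataset wrapper is sketched below; the JSON split file and directory layout are assumptions for illustration, not the benchmarks’ official formats.

```python
# Hedged sketch of a triplet dataset wrapper; file names and layout are placeholders.
from pathlib import Path
from dataclasses import dataclass
from PIL import Image
import json

@dataclass
class ReferringSample:
    image: Image.Image      # aerial image (e.g., 512x512 or 800x800 RGB)
    expression: str         # natural-language referring expression
    mask: Image.Image       # binary segmentation mask of the referred object

class ReferringRSDataset:
    """Loads image-expression-mask triplets listed in a JSON split file."""
    def __init__(self, root: str, split: str = "train"):
        self.root = Path(root)
        # assumed format: a list of {"image": ..., "mask": ..., "expression": ...}
        with open(self.root / f"{split}.json") as f:
            self.items = json.load(f)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> ReferringSample:
        item = self.items[idx]
        return ReferringSample(
            image=Image.open(self.root / "images" / item["image"]).convert("RGB"),
            expression=item["expression"],
            mask=Image.open(self.root / "masks" / item["mask"]).convert("L"),
        )
```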
- Overall Intersection over Union (oIoU): This metric is calculated as the ratio of the cumulative intersection area to the cumulative union area across all test samples, with an emphasis on larger objects.
- Mean Intersection over Union (mIoU): mIoU is computed by averaging the IoU values between predicted masks and ground truth annotations for each test sample, treating both small and large objects equally.
- Precision@X: Precision@X measures the percentage of test samples for which the IoU between the predicted mask and the ground truth exceeds a threshold X (0.5 to 0.9 in our experiments). This metric evaluates the model’s accuracy at specific IoU thresholds, reflecting its performance in object localization; a short sketch of all three metrics follows this list.
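The sketch below shows one straightforward way to compute the three metrics from binary masks; function and variable names are ours, and the thresholds follow the Pr@0.5–Pr@0.9 columns reported in the tables.

```python
# Minimal sketch of oIoU, mIoU, and Precision@X on binary masks.
import numpy as np

def compute_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """preds, gts: lists of boolean numpy arrays of identical shape."""
    inters, unions, ious = 0, 0, []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        inters += inter
        unions += union
        ious.append(inter / union if union > 0 else 1.0)  # empty target and empty prediction count as perfect
    ious = np.asarray(ious)
    oiou = inters / unions                                # cumulative ratio; dominated by large objects
    miou = ious.mean()                                    # per-sample average; small objects weigh equally
    prec_at = {t: float((ious > t).mean()) for t in thresholds}  # Precision@X
    return oiou, miou, prec_at
```

Note how oIoU pools intersections and unions before dividing, so large objects dominate, whereas mIoU averages per-sample IoUs and therefore treats small and large objects equally.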
4.2. Comparisons with Other Methods
4.3. Ablation Study
4.4. Visualization and Qualitative Analysis
4.5. Efficiency and Complexity Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Yuan, Z.; Mou, L.; Hua, Y.; Zhu, X.X. RRSIS: Referring Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12.
- Liu, S.; Ma, Y.; Zhang, X.; Wang, H.; Ji, J.; Sun, X.; Ji, R. Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 26648–26658.
- Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28.
- Cheng, J.; Deng, C.; Su, Y.; An, Z.; Wang, Q. Methods and datasets on semantic segmentation for Unmanned Aerial Vehicle remote sensing images: A review. ISPRS J. Photogramm. Remote Sens. 2024, 211, 1–34.
- Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417.
- Duan, L.; Lafarge, F. Towards Large-Scale City Reconstruction from Satellites. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 89–104.
- Abid, S.K.; Chan, S.W.; Sulaiman, N.; Bhatti, U.; Nazir, U. Present and Future of Artificial Intelligence in Disaster Management. In Proceedings of the 2023 International Conference on Engineering Management of Communication and Technology (EMCTECH), Vienna, Austria, 16–18 October 2023; pp. 1–7.
- Zhao, W.; Lyu, R.; Zhang, J.; Pang, J.; Zhang, J. A fast hybrid approach for continuous land cover change monitoring and semantic segmentation using satellite time series. Int. J. Appl. Earth Obs. Geoinf. 2024, 134, 104222.
- Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402.
- Ji, R.; Tan, K.; Wang, X.; Tang, S.; Sun, J.; Niu, C.; Pan, C. PatchOut: A novel patch-free approach based on a transformer-CNN hybrid framework for fine-grained land-cover classification on large-scale airborne hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2025, 138, 104457.
- Ding, H.; Liu, C.; Wang, S.; Jiang, X. Vision-Language Transformer and Query Generation for Referring Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 16321–16330.
- Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. LAVT: Language-Aware Vision Transformer for Referring Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 18134–18144.
- Jing, Y.; Kong, T.; Wang, W.; Wang, L.; Li, L.; Tan, T. Locate then Segment: A Strong Pipeline for Referring Image Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 9853–9862.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Birmingham, UK, 2021; Volume 139, pp. 8748–8763.
- Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. CRIS: CLIP-Driven Referring Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11676–11685.
- Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from Natural Language Expressions. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 108–124.
- Li, R.; Li, K.; Kuo, Y.C.; Shu, M.; Qi, X.; Shen, X.; Jia, J. Referring Image Segmentation via Recurrent Refinement Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5745–5753.
- Shi, H.; Li, H.; Meng, F.; Wu, Q. Key-Word-Aware Network for Referring Expression Image Segmentation. In Proceedings of the Computer Vision–ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; pp. 38–54.
- Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-Modal Self-Attention Network for Referring Image Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10494–10503.
- Hu, Z.; Feng, G.; Sun, J.; Zhang, L.; Lu, H. Bi-Directional Relationship Inferring Network for Referring Image Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4423–4432.
- Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1760–1770.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
- Kim, N.; Kim, D.; Kwak, S.; Lan, C.; Zeng, W. ReSTR: Convolution-free Referring Image Segmentation Using Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 18124–18133.
- Liu, J.; Ding, H.; Cai, Z.; Zhang, Y.; Kumar Satzoda, R.; Mahadevan, V.; Manmatha, R. PolyFormer: Referring Image Segmentation as Sequential Polygon Generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18653–18663.
- Zhu, C.; Zhou, Y.; Shen, Y.; Luo, G.; Pan, X.; Lin, M.; Chen, C.; Cao, L.; Sun, X.; Ji, R. SeqTR: A Simple Yet Universal Network for Visual Grounding. In Proceedings of the Computer Vision–ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; pp. 598–615.
- Liu, C.; Ding, H.; Jiang, X. GRES: Generalized Referring Expression Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–23 June 2023; pp. 23592–23601.
- Quan, W.; Deng, P.; Wang, K.; Yan, D.M. CGFormer: ViT-Based Network for Identifying Computer-Generated Images with Token Labeling. IEEE Trans. Inf. Forensics Secur. 2024, 19, 235–250.
- Wu, J.; Li, X.; Xu, S.; Yuan, H.; Ding, H.; Yang, Y.; Li, X.; Zhang, J.; Tong, Y.; Jiang, X.; et al. Towards Open Vocabulary Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5092–5113.
- Dong, Z.; Sun, Y.; Liu, T.; Zuo, W.; Gu, Y. Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation. arXiv 2025, arXiv:2410.08613.
- Lei, S.; Xiao, X.; Zhang, T.; Li, H.C.; Shi, Z.; Zhu, Q. Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–11.
- Shi, L.; Zhang, J. Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5.
- Chen, K.; Zhang, J.; Liu, C.; Zou, Z.; Shi, Z. RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models. arXiv 2025, arXiv:2501.06809.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 12077–12090.
- Li, Y.; Li, Z.Y.; Zeng, Q.; Hou, Q.; Cheng, M.M. Cascade-CLIP: Cascaded vision-language embeddings alignment for zero-shot semantic segmentation. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
- Ruan, Z.; Pu, N.; Chen, J.; Gao, S.; Guo, Y.; Kong, Q.; Xie, Y.; Wei, Y. Few-shot event-based action recognition. Neural Netw. 2025, 191, 107750.
- Cho, Y.; Yu, H.; Kang, S.J. Cross-Aware Early Fusion with Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation. IEEE Trans. Multimed. 2024, 26, 5823–5833.
| Metric | Model | Publication | Visual Encoder | Text Encoder | RefSegRS Val | RefSegRS Test | RRSIS-D Val | RRSIS-D Test | RISBench Val | RISBench Test |
|---|---|---|---|---|---|---|---|---|---|---|
| oIoU | RRN [18] | CVPR-18 | ResNet-101 | LSTM | 69.24 | 65.06 | 66.53 | 66.43 | 47.28 | 49.67 |
| oIoU | BRINet [21] | CVPR-20 | ResNet-101 | LSTM | 61.59 | 58.22 | 70.73 | 69.68 | 46.27 | 48.73 |
| oIoU | ETRIS [20] | ICCV-23 | ResNet-101 | CLIP | 72.89 | 65.96 | 72.75 | 71.06 | 64.09 | 67.61 |
| oIoU | CRIS [16] | CVPR-22 | CLIP | CLIP | 72.14 | 65.87 | 70.98 | 70.46 | 66.26 | 69.11 |
| oIoU | CrossVLT [37] | TMM-23 | Swin-B | BERT | 76.12 | 69.73 | 76.25 | 75.48 | 69.77 | 74.33 |
| oIoU | LAVT [13] | CVPR-22 | Swin-B | BERT | 78.50 | 71.86 | 76.27 | 76.16 | 69.39 | 74.15 |
| oIoU | LGCE [1] | TGRS-24 | Swin-B | BERT | 83.56 | 76.81 | 76.68 | 76.34 | 68.81 | 73.87 |
| oIoU | RMSIN [2] | CVPR-24 | Swin-B | BERT | 74.40 | 68.31 | 78.27 | 77.79 | 69.51 | 74.09 |
| oIoU | FIANet [31] | TGRS-25 | Swin-B | BERT | 85.51 | 78.28 | 76.77 | 76.05 | - | - |
| oIoU | CroBIM [30] | Arxiv-25 | Swin-B | BERT | 78.85 | 72.30 | 76.24 | 76.37 | 69.08 | 73.61 |
| oIoU | CD2FSAN | - | CLIP | CLIP | 87.04 | 79.88 | 79.04 | 78.43 | 70.32 | 74.84 |
| mIoU | RRN [18] | CVPR-18 | ResNet-101 | LSTM | 50.81 | 41.88 | 46.06 | 45.64 | 42.65 | 43.18 |
| mIoU | BRINet [21] | CVPR-20 | ResNet-101 | LSTM | 38.73 | 31.51 | 51.41 | 49.45 | 41.54 | 42.91 |
| mIoU | ETRIS [20] | ICCV-23 | ResNet-101 | CLIP | 54.03 | 43.11 | 55.21 | 54.21 | 51.13 | 53.06 |
| mIoU | CRIS [16] | CVPR-22 | CLIP | CLIP | 53.74 | 43.26 | 50.75 | 49.69 | 53.64 | 55.18 |
| mIoU | CrossVLT [37] | TMM-23 | Swin-B | BERT | 55.27 | 42.81 | 59.87 | 58.48 | 61.54 | 62.84 |
| mIoU | LAVT [13] | CVPR-22 | Swin-B | BERT | 61.53 | 47.40 | 57.72 | 56.82 | 60.45 | 61.93 |
| mIoU | LGCE [1] | TGRS-24 | Swin-B | BERT | 72.51 | 59.96 | 60.16 | 59.37 | 60.44 | 62.13 |
| mIoU | RMSIN [2] | CVPR-24 | Swin-B | BERT | 54.24 | 42.63 | 65.10 | 64.20 | 61.78 | 63.07 |
| mIoU | FIANet [31] | TGRS-25 | Swin-B | BERT | 80.61 | 68.63 | 62.99 | 63.64 | - | - |
| mIoU | CroBIM [30] | Arxiv-25 | Swin-B | BERT | 65.79 | 52.69 | 63.99 | 64.24 | 67.52 | 67.32 |
| mIoU | CD2FSAN | - | CLIP | CLIP | 76.95 | 66.96 | 66.47 | 65.37 | 68.36 | 69.74 |
| Method | oIoU | mIoU | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 |
|---|---|---|---|---|---|---|---|
| LAVT [13] | 76.27 | 57.72 | 65.23 | 58.79 | 50.29 | 40.11 | 23.05 |
| CRIS [16] | 70.98 | 50.75 | 56.44 | 47.87 | 39.77 | 29.31 | 11.84 |
| LGCE [1] | 76.68 | 60.16 | 68.10 | 60.61 | 51.45 | 42.34 | 23.85 |
| FIANet [31] | 76.77 | 62.99 | 74.20 | 66.15 | 54.08 | 41.27 | 22.30 |
| RMSIN [2] | 78.27 | 65.10 | 68.39 | 61.72 | 52.24 | 41.44 | 23.16 |
| CroBIM [30] | 76.24 | 63.99 | 74.20 | 66.15 | 54.08 | 41.38 | 22.30 |
| CD2FSAN | 79.04 | 66.47 | 78.28 | 70.11 | 56.78 | 41.38 | 20.57 |
| Category | LAVT [13] | RMSIN [2] | LGCE [1] | FIANet [31] | CD2FSAN |
|---|---|---|---|---|---|
| Airport | 66.44 | 68.08 | 68.11 | 68.66 | 68.61 |
| Golf field | 56.53 | 56.11 | 56.43 | 57.07 | 64.22 |
| Expressway service area | 76.08 | 76.68 | 71.19 | 77.35 | 72.31 |
| Baseball field | 68.56 | 66.93 | 70.93 | 70.44 | 88.43 |
| Stadium | 81.77 | 83.09 | 84.90 | 84.87 | 84.43 |
| Ground track field | 81.84 | 81.91 | 82.54 | 82.00 | 83.06 |
| Storage tank | 71.33 | 73.65 | 73.33 | 76.99 | 74.89 |
| Basketball court | 70.71 | 72.26 | 74.37 | 74.86 | 88.43 |
| Chimney | 65.54 | 68.42 | 68.44 | 68.41 | 79.85 |
| Tennis court | 74.98 | 76.68 | 75.63 | 78.48 | 72.88 |
| Overpass | 66.17 | 70.14 | 67.67 | 70.01 | 65.63 |
| Train station | 57.02 | 62.67 | 58.19 | 61.30 | 68.32 |
| Ship | 63.47 | 64.64 | 63.48 | 65.96 | 68.32 |
| Expressway toll station | 63.01 | 65.71 | 61.63 | 64.82 | 72.32 |
| Dam | 61.61 | 68.70 | 64.54 | 71.31 | 66.11 |
| Harbor | 60.05 | 60.40 | 60.47 | 62.03 | 57.77 |
| Bridge | 30.48 | 36.74 | 34.24 | 37.94 | 43.53 |
| Vehicle | 42.60 | 47.63 | 43.12 | 49.66 | 49.72 |
| Windmill | 35.32 | 41.99 | 40.76 | 46.72 | 63.76 |
| Average | 62.82 | 65.39 | 64.21 | 66.78 | 69.98 |
| | Method | Val P@0.5 | Val P@0.6 | Val P@0.7 | Val P@0.8 | Val P@0.9 | Val mIoU | Test P@0.5 | Test P@0.6 | Test P@0.7 | Test P@0.8 | Test P@0.9 | Test mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | Baseline | 45.98 | 35.92 | 26.26 | 15.86 | 4.48 | 42.59 | 39.73 | 29.93 | 22.01 | 12.41 | 4.14 | 38.58 |
| (b) | (a) + RFS (4,7,12) | 44.19 | 34.02 | 24.71 | 14.65 | 3.67 | 41.75 | 38.67 | 28.51 | 18.10 | 10.34 | 2.12 | 37.39 |
| (c) | (a) + RFS (6,9,12) | 50.97 | 40.45 | 31.43 | 19.31 | 6.14 | 46.21 | 45.51 | 36.09 | 26.61 | 15.86 | 4.31 | 43.66 |
| (d) | (a) + DFSM | 57.82 | 51.32 | 42.53 | 32.30 | 15.40 | 50.62 | 52.24 | 42.53 | 31.15 | 12.53 | 5.74 | 48.22 |
| (e) | (d) + Transformer | 71.84 | 62.29 | 51.21 | 37.59 | 17.30 | 60.31 | 71.65 | 61.96 | 50.47 | 35.94 | 18.04 | 59.18 |
| (f) | (d) + MAAM | 76.32 | 66.60 | 54.82 | 40.80 | 19.71 | 64.87 | 72.39 | 62.37 | 50.83 | 36.11 | 17.81 | 62.55 |
| (g) | (f) + DRCD (full) | 78.28 | 70.11 | 56.78 | 41.38 | 20.57 | 66.47 | 73.14 | 63.46 | 51.19 | 36.37 | 19.42 | 65.37 |
| Rotation Bases (n) | oIoU (%) | mIoU (%) | Pr@0.5 | Pr@0.7 | Pr@0.9 |
|---|---|---|---|---|---|
|  | 78.61 | 66.13 | 78.07 | 56.10 | 20.19 |
|  | 77.99 | 65.19 | 76.69 | 55.91 | 19.89 |
|  | 79.04 | 66.47 | 78.28 | 56.78 | 20.57 |
|  | 78.27 | 65.46 | 78.24 | 56.44 | 20.36 |
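As a loose illustration of what a bank of rotation bases can look like, the sketch below rotates a feature map by n evenly spaced angles using differentiable affine resampling and fuses the rotated copies with per-pixel learned weights. This is our own approximation of the idea behind the DRCD, not the paper’s exact implementation.

```python
# Hedged illustration of "rotation bases" via differentiable affine resampling.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationBasisFusion(nn.Module):
    """Rotates a feature map by n candidate angles and fuses the copies adaptively."""
    def __init__(self, channels: int, num_bases: int = 4):
        super().__init__()
        self.angles = [i * math.pi / num_bases for i in range(num_bases)]  # candidate orientations
        self.weight_head = nn.Conv2d(channels, num_bases, kernel_size=1)   # per-pixel basis weights

    def forward(self, x):                                   # x: [B, C, H, W]
        rotated = []
        for a in self.angles:
            theta = torch.tensor([[math.cos(a), -math.sin(a), 0.0],
                                  [math.sin(a),  math.cos(a), 0.0]],
                                 device=x.device, dtype=x.dtype).expand(x.size(0), -1, -1)
            grid = F.affine_grid(theta, x.shape, align_corners=False)      # differentiable rotation grid
            rotated.append(F.grid_sample(x, grid, align_corners=False))
        stack = torch.stack(rotated, dim=1)                                # [B, n, C, H, W]
        weights = torch.softmax(self.weight_head(x), dim=1).unsqueeze(2)   # [B, n, 1, H, W]
        return (weights * stack).sum(dim=1)                                # orientation-adaptive features
```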
| Model | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|
| Baseline | 183.84 | 88.54 | 18.23 |
| +DFSM | 186.28 | 97.27 | 16.82 |
| +MAAM | 192.09 | 120.44 | 15.32 |
| +DRCD (Full Model) | 197.54 | 148.20 | 14.96 |
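The parameter counts and throughput above can be reproduced with a simple measurement loop such as the one sketched below; the model, input shape, and device are placeholders, FLOPs would additionally require a profiler such as thop or fvcore, and a referring model would also consume a text input, omitted here for brevity.

```python
# Hedged sketch for measuring parameters (M) and FPS of a segmentation model.
import time
import torch

def measure_efficiency(model: torch.nn.Module, input_shape=(1, 3, 480, 480),
                       warmup: int = 10, iters: int = 50, device: str = "cuda"):
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6   # parameters in millions
    with torch.no_grad():
        for _ in range(warmup):                                   # warm up kernels and caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    fps = iters / (time.time() - start)                           # frames per second
    return params_m, fps
```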