Hierarchical Prompt Engineering for Remote Sensing Scene Understanding with Large Vision–Language Models
Highlights
- A hierarchical prompting framework that makes coarse-to-fine decisions significantly improves remote sensing scene recognition, especially on fine-grained categories.
- With lightweight parameter-efficient adaptation (LoRA/QLoRA), even a 7B vision–language model can match or surpass full fine-tuning while using much less compute and labeled data.
- The method generalizes well to new datasets and sensing domains with limited or no retraining, showing strong zero-shot and few-shot transferability for real-world remote sensing tasks.
- We provide five structured AID dataset variants and a reproducible evaluation protocol, offering a practical benchmark for robust and transparent assessment of remote sensing VLMs.
Abstract
1. Introduction
1.1. Background and Significance
1.2. Related Work
1.3. Proposed Approach and Contributions
2. Materials and Methods
2.1. AID Preparation
2.2. Prompt Design
- (i) a predefined decision flow that first predicts a coarse scene type and then refines it to a fine-grained category;
- (ii) an ontology-based label space (Table 1) that restricts each decision step to only the relevant subset of classes;
- (iii) a constrained output schema that forces the model to answer in machine-readable fields rather than produce an unconstrained caption.
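The three components above can be combined into a minimal sketch of the coarse-to-fine decision flow. This is an illustration, not the authors' code: `vlm_query(image, prompt) -> str` is a hypothetical wrapper around the vision–language model, and `ONTOLOGY` is an abbreviated stand-in for the Table 1 label space.

```python
# Abbreviated ontology in the style of Table 1 (coarse class -> fine classes).
ONTOLOGY = {
    "Natural": ["beach", "forest", "river", "mountain"],
    "Urban/Man-made": ["dense_residential", "industrial_area"],
    "Transportation and Infrastructure": ["airport", "bridge", "harbor"],
    "Sports and Farmland": ["stadium", "circular_farmland"],
}

def classify(image, vlm_query):
    """Two-step decision flow: coarse group first, then a fine-grained
    class restricted to that group's subset of the label space."""
    # Step 1: coarse decision, restricted to the four major groups.
    coarse_prompt = (
        "Classify the scene into exactly one of: "
        + ", ".join(ONTOLOGY)
        + ". Answer with the field <Coarse class>: <name> only."
    )
    coarse = vlm_query(image, coarse_prompt)
    # Step 2: fine decision, restricted to the chosen group's classes.
    fine_prompt = (
        f"The scene is {coarse}. Choose one fine-grained class from: "
        + ", ".join(ONTOLOGY[coarse])
        + ". Answer with the field <Fine class>: <name> only."
    )
    return coarse, vlm_query(image, fine_prompt)
```

Restricting step 2 to the selected group's classes is what keeps each decision small, which is the point of the ontology-based label space.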
2.3. External Evaluation Datasets and Label Mapping
3. Results
3.1. Experimental Setup
3.1.1. Hyperparameters
3.1.2. Evaluation Metrics
3.2. Main Results
3.2.1. Ablation Study
- (a) Prompt Engineering Analysis
- (b) Model Architecture Comparison
- (c) Effect of Model Size
3.2.2. Data Version Impact
3.2.3. Training Strategy Comparison
3.2.4. Comparison with Classical CNNs
3.2.5. Data Efficiency Under Limited Labels
3.2.6. Cross-Domain Generalization
4. Discussion
4.1. Effectiveness of Hierarchical Prompting
4.2. Trade-Offs Between Model Size and Accuracy
4.3. Limitations and Potential Failure Cases
4.4. Future Work
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Prompt Templates
Appendix A.1. V0 Prompt: Direct Category Selection
Appendix A.2. V1 Prompt: Coarse-To-Fine Step-By-Step Classification
Appendix A.3. V2 Prompt: Streamlined Coarse-to-Fine Reasoning
Appendix B. Dataset Variant Design and Prompt–Variant Alignment
Appendix B.1. Goals
Appendix B.2. Variant Construction and Prompt Rationale
| Variant | Design Goal (Robustness Axis) | Construction Rule (Leakage Control) | Prompt Alignment (Why This Prompt) | Artifacts |
|---|---|---|---|---|
| - | Baseline comparability | Standard random split; no augmentation | Generic flat prompt, used as the neutral reference | Split index (train/val/test) |
| - | Orientation/flip invariance stress | Global rotations (0°/90°/180°/270°) and H/V flips applied naïvely | Add orientation-invariant wording and spatial cues | Augmentation list |
| - | Leakage-safe augmentation | Split-before-augment: apply all augmentations only to training; validation/test untouched | Keep wording unchanged to isolate the protocol effect (naïve vs. leakage-safe) | Train-only aug manifests |
| - | Semantic disambiguation/label noise | Merge near-synonymous/ambiguous classes; remove borderline samples; run a near-duplicate audit across splits | Emphasize structural/semantic descriptors in a coarse → fine template | Cleaned indices + audit logs |
| - | Clean + invariance (combined) | Apply the leakage-safe augmentation protocol after cleaning | Streamlined hierarchical prompt (coarse → fine) for efficiency and clarity | All manifests and scripts |
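The split-before-augment rule used by the leakage-safe variant can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the list-returning `augment` interface are ours, not the authors' code.

```python
import random

def split_before_augment(samples, augment, test_frac=0.2, seed=0):
    """Leakage-safe protocol: split first, then augment only the training
    portion. Validation/test stay untouched, so no augmented view of a
    held-out image can leak into training."""
    rng = random.Random(seed)
    samples = samples[:]          # don't mutate the caller's list
    rng.shuffle(samples)
    n_test = int(len(samples) * test_frac)
    test, train = samples[:n_test], samples[n_test:]
    # augment(s) returns a list of augmented views of sample s.
    train_aug = [aug for s in train for aug in augment(s)]
    return train + train_aug, test
```

Contrast with the naïve variant, which augments the whole pool before splitting and can therefore place rotated/flipped copies of a test image in the training set.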
- (i) Class consolidation.
- (ii) Ambiguous-sample removal.
Appendix B.3. Leakage Prevention and Duplicate Audit
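As a rough illustration of a cross-split duplicate audit in the spirit of the cited perceptual-hashing and near-duplicate work [31,32], here is a stdlib-only difference-hash (dHash) sketch. The 8×9 grayscale-grid input (a list of lists) is an assumption for brevity; a real audit would downsample full images through a perceptual-hash library.

```python
def dhash(gray_8x9):
    """64-bit difference hash: for each of 8 rows of 9 pixels, record
    whether each pixel is darker than its right neighbour."""
    bits = []
    for row in gray_8x9:
        bits += [row[i] < row[i + 1] for i in range(8)]
    return sum(b << i for i, b in enumerate(bits))

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def near_duplicates(train_hashes, test_hashes, max_dist=4):
    """Flag (test_idx, train_idx) pairs whose hashes are within max_dist
    bits of each other, i.e. candidate train/test leakage pairs."""
    return [(i, j) for i, ht in enumerate(test_hashes)
                   for j, hr in enumerate(train_hashes)
                   if hamming(ht, hr) <= max_dist]
```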
Appendix C. Label Mapping for External Datasets
Appendix C.1. Mapping Policy and Notation
Appendix C.2. UC Merced (UCM) → Our Hierarchy/AID Fine Classes
- (1) Agriculture—generic farmland (AID distinguishes circular vs. rectangular farmland).
- (2) Buildings—generic built-up category (AID uses more specific Urban subclasses).
| UCM Class | Coarse Class | UCM Class | Coarse Class | UCM Class | Coarse Class |
|---|---|---|---|---|---|
| Agriculture | Sports and Farmland | Forest | Natural | Overpass | Transportation and Infrastructure |
| Airplane | Transportation and Infrastructure | Freeway | Transportation and Infrastructure | Parking Lot | Transportation and Infrastructure |
| Baseball Diamond | Sports and Farmland | Golf Course | Sports and Farmland | River | Natural |
| Beach | Natural | Harbor | Transportation and Infrastructure | Runway | Transportation and Infrastructure |
| Buildings | Urban/Man-made | Intersection | Transportation and Infrastructure | Sparse Residential | Urban/Man-made |
| Chaparral | Natural | Medium Residential | Urban/Man-made | Storage Tanks | Urban/Man-made |
| Dense Residential | Urban/Man-made | Mobile Home Park | Urban/Man-made | Tennis Court | Sports and Farmland |
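For coarse-level evaluation, the UCM mapping table above reduces to a dictionary lookup. The table is transcribed verbatim; the helper function and its normalization rule are our illustration, not the authors' code.

```python
# Coarse-level mapping for UCM (Appendix C.2), transcribed from the table.
UCM_TO_COARSE = {
    "Agriculture": "Sports and Farmland",
    "Airplane": "Transportation and Infrastructure",
    "Baseball Diamond": "Sports and Farmland",
    "Beach": "Natural",
    "Buildings": "Urban/Man-made",
    "Chaparral": "Natural",
    "Dense Residential": "Urban/Man-made",
    "Forest": "Natural",
    "Freeway": "Transportation and Infrastructure",
    "Golf Course": "Sports and Farmland",
    "Harbor": "Transportation and Infrastructure",
    "Intersection": "Transportation and Infrastructure",
    "Medium Residential": "Urban/Man-made",
    "Mobile Home Park": "Urban/Man-made",
    "Overpass": "Transportation and Infrastructure",
    "Parking Lot": "Transportation and Infrastructure",
    "River": "Natural",
    "Runway": "Transportation and Infrastructure",
    "Sparse Residential": "Urban/Man-made",
    "Storage Tanks": "Urban/Man-made",
    "Tennis Court": "Sports and Farmland",
}

def coarse_label(ucm_class: str) -> str:
    """Normalize case/whitespace before lookup; KeyError on unmapped classes."""
    return UCM_TO_COARSE[ucm_class.strip().title()]
```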
Appendix C.3. WHU-RS19 → Our Hierarchy/AID Fine Classes
- (1) Farmland—generic farmland (AID splits by geometry).
- (2) Park—urban green/parkland (no exact AID fine class).
- (3) Residential area—density not specified (AID separates dense/medium/sparse).
- (4) Football field—could appear with or without an athletics track; this ambiguity is handled at the fine level but is irrelevant here, since we evaluate at the coarse level.
| WHU Class | Coarse Class | WHU Class | Coarse Class | WHU Class | Coarse Class |
|---|---|---|---|---|---|
| Airport | Transportation and Infrastructure | Football field | Sports and Farmland | Parking lot | Transportation and Infrastructure |
| Beach | Natural | Forest | Natural | Pond | Natural |
| Bridge | Transportation and Infrastructure | Industrial area | Urban/Man-made | Port | Transportation and Infrastructure |
| Commercial area | Urban/Man-made | Meadow | Natural | Railway station | Transportation and Infrastructure |
| Desert | Natural | Mountain | Natural | Residential area | Urban/Man-made |
| Farmland | Sports and Farmland | Park | Urban/Man-made | River | Natural |
| - | - | Viaduct | Transportation and Infrastructure | - | - |
Appendix C.4. RSSCN7 → Our Hierarchy/AID Fine Classes
- (1) Field—generic farmland (AID splits circular/rectangular).
- (2) Residential—density not specified (AID distinguishes dense/medium/sparse).
| RSSCN7 Class | Coarse Class | RSSCN7 Class | Coarse Class | RSSCN7 Class | Coarse Class |
|---|---|---|---|---|---|
| Grass | Natural | Industrial | Urban/Man-made | Residential | Urban/Man-made |
| River | Natural | Field | Sports and Farmland | Parking | Transportation and Infrastructure |
| - | - | Forest | Natural | - | - |
Appendix D. Fine-Grained Classification Limitations
Appendix E. Hierarchical Multi-Label Extension
- (i) Start from the root; apply a calibrated threshold to each node's predicted score.
- (ii) If a node's score meets its threshold, expand to its children; otherwise prune the subtree.
- (iii) Return all activated leaves and (optionally) their ancestors. A simple repair step activates the ancestors of every activated leaf so that the ancestry constraint is satisfied.
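The three steps above can be sketched as a top-down traversal. This is a minimal illustration under stated assumptions: the dictionary-based tree, score, and threshold representations are ours, and the default threshold of 0.5 is arbitrary.

```python
def hierarchical_predict(tree, scores, thresholds):
    """Top-down thresholded expansion (Appendix E sketch).
    tree:       {node: [children]}; leaves are absent or map to [].
    scores:     {node: predicted probability in [0, 1]}.
    thresholds: {node: calibrated threshold for that node}.
    Returns the set of activated nodes. Because a node is only expanded
    after being activated, every activated leaf's ancestors are also
    activated, so the ancestry constraint holds by construction."""
    activated = set()
    frontier = ["root"]
    while frontier:
        node = frontier.pop()
        if scores.get(node, 1.0) >= thresholds.get(node, 0.5):
            activated.add(node)
            frontier.extend(tree.get(node, []))  # expand children
        # else: prune the whole subtree rooted at this node
    return activated
```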
References
- Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sutskever, I. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), Online, 18–24 July 2021.
- Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Mensch, A.; Millican, K.; Reynolds, M.; Ring, R.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. arXiv 2022, arXiv:2204.14198.
- Castelluccio, M.; Poggi, G.; Sansone, C.; Verdoliva, L. Land Use Classification in Remote Sensing Images by Convolutional Neural Networks. arXiv 2015, arXiv:1508.00092.
- Yang, Y.; Newsam, S. Bag-of-Visual-Words and Spatial Extensions for Land-Use Classification. In Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010.
- Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.-S. Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756.
- Helber, P.; Bischke, B.; Dengel, A.; Borth, D. EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2217–2226.
- Basu, S.; Ganguly, S.; Mukhopadhyay, S.; DiBiano, R.; Karki, M.; Nemani, R. DeepSat: A Learning Framework for Satellite Imagery. In Proceedings of the 23rd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), Seattle, WA, USA, 3–6 November 2015.
- Luo, Z.; Chen, Y.; Zhang, Z.; Zhao, T. RS5M: A Large-Scale Vision–Language Dataset for Remote Sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024.
- Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Liu, Y.; Li, X. RSGPT: A Remote Sensing Vision Language Model and Benchmark. ISPRS J. Photogramm. Remote Sens. 2025, 224, 272–286.
- Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. GeoChat: Grounded Large Vision-Language Model for Remote Sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27831–27840.
- Li, X.; Ding, J.; Elhoseiny, M. VRSBench: A Versatile Vision–Language Benchmark Dataset for Remote Sensing Image Understanding. Adv. Neural Inf. Process. Syst. (NeurIPS) Datasets Benchmarks Track 2024, 37, 3229–3242.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709.
- Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216.
- Brade, S.; Wang, B.; Sousa, M.; Oore, S.; Grossman, T. Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST 2023), San Francisco, CA, USA, 29 October–1 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; Article 96, 14 pp.
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416.
- Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Ricci, R.; Melgani, F. RS-LLaVA: A Large Vision–Language Model for Joint Captioning and VQA in Remote Sensing. Remote Sens. 2024, 16, 1477.
- Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A Universal MLLM for Multi-Sensor Remote Sensing. arXiv 2024, arXiv:2401.16822v3.
- Ye, F.; Huang, L.; Liang, S.; Chi, K. Decomposed Two-Stage Prompt Learning for Few-Shot Named Entity Recognition. Information 2023, 14, 262.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903.
- Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv 2023, arXiv:2305.10601v2.
- Li, X.; Wen, C.; Hu, Y.; Zhou, N. RS-CLIP: Zero-Shot Remote Sensing Scene Classification via Contrastive Vision–Language Supervision. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103497.
- El Khoury, K.; Zanella, M.; Gérin, B.; Godelaine, T.; Macq, B.; Mahmoudi, S.; De Vleeschouwer, C.; Ayed, I.B. Enhancing Remote Sensing Vision–Language Models for Zero-Shot Scene Classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025.
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195.
- Hu, E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685.
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. Adv. Neural Inf. Process. Syst. (NeurIPS) 2023, 36, 10088–10115.
- Wang, N.; Yang, A.; Cui, Z.; Ding, Y.; Xue, Y.; Su, Y. Capsule Attention Network for Hyperspectral Image Classification. Remote Sens. 2024, 16, 4001.
- Chen, X.; Wang, Z.; Wang, K.; Jia, H.; Han, Z.; Tang, Y. Multi-Dimensional Low-Rank with Weighted Schatten p-Norm Minimization for Hyperspectral Anomaly Detection. Remote Sens. 2024, 16, 74.
- Wang, N.; Cui, Z.; Lan, Y.; Zhang, C.; Xue, Y.; Su, Y.; Li, A. Large-Scale Hyperspectral Image-Projected Clustering via Doubly Stochastic Graph Learning. Remote Sens. 2025, 17, 1526.
- Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A Benchmark Dataset for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
- Zauner, C. Implementation and Benchmarking of Perceptual Image Hash Functions. Master's Thesis, Upper Austria University of Applied Sciences, Hagenberg, Austria, 2010.
- Barz, B.; Denzler, J. Do We Train on Test Data? Purging CIFAR of Near-Duplicates. J. Imaging 2020, 6, 41.
- Balestra, M.; Paolanti, M.; Pierdicca, R. WHU-RS19 ABZSL: An Attribute-Based Dataset for Remote Sensing Image Understanding. Remote Sens. 2025, 17, 2384.
- Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep Learning Based Feature Selection for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325.
- Rajbhandari, S.; Rasley, J.; Ruwase, O.; He, Y. ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20), Atlanta, GA, USA, 16–19 November 2020.
- Dettmers, T.; Lewis, M.; Shleifer, S.; Zettlemoyer, L. LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale. arXiv 2022, arXiv:2208.07339.
- Xu, Y.; Xie, L.; Gu, X.; Chen, X.; Chang, H.; Zhang, H.; Chen, Z.; Zhang, X.; Tian, Q. QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024.
- Zhao, W.; Xu, H.; Hu, J.; Wang, X.; Mao, Y.; Zhang, D. SWIFT: A Scalable Lightweight Infrastructure for Fine-Tuning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025.
- Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; et al. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv 2024, arXiv:2409.12191.
- Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019.











| Coarse Class | All Fine-Grained Classes |
|---|---|
| Natural | mountain, lake, forest, beach, cloud, desert, island, river, meadow, snowberg, sea_ice, chaparral, wetland |
| Urban/Man-made | dense_residential, medium_residential, sparse_residential, mobile_home_park, industrial_area, commercial_area, church, palace, storage_tank, terrace, thermal_power_station |
| Transportation and Infrastructure | airport, airplane, freeway, bridge, railway, railway_station, harbor, intersection, overpass, roundabout, runway, parking_lot, ship |
| Sports and Farmland | baseball_diamond, basketball_court, tennis_court, golf_course, ground_track_field, stadium, circular_farmland, rectangular_farmland |
| Version | Key Operations | Details |
|---|---|---|
| - | Original dataset | A total of 45 classes, 100 images per class; no augmented data. |
| - | Data augmentation | Each image was transformed with 6 operations: rotate (0°, 90°, 180°, 270°), horizontal flip, and vertical flip. |
| - | Augmentation after split [32] | Training and testing sets are split first; augmentation is applied only to the training set. |
| - | Class merging and data cleaning | Merge 6 groups of visually overlapping categories, reducing the classes from 45 → 39, and remove 17 ambiguous samples. |
| - | Post-cleaning augmentation | Based on the cleaned dataset, split into training and testing sets; apply augmentation to the training set only. |
| Version | Main Strategy | Key Features | Output Format |
|---|---|---|---|
| V0 | Direct classification | Select one category from a list without explanation | Category name only |
| V1 | Coarse-to-fine hierarchical reasoning | Step 1: coarse classification into 4 major groups; Step 2: fine-grained classification within the selected group; reasoning required | \|<Coarse class>\|… \|<Fine class>\|… \|<Reasoning>\|… |
| V2 | Simplified hierarchical reasoning | Same two-step classification as V1 but with streamlined instructions to reduce cognitive load | \|<Coarse class>\|… \|<Fine class>\|… \|<Reasoning>\|… |
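Responses in the hierarchical field format can be parsed into machine-readable fields. A hedged sketch, assuming a literal `|<Field>|value` layout as shown in the Output Format column; the regex and helper name are our illustration:

```python
import re

# Matches "<Field>|value" where Field is one of the three schema fields
# and value runs up to the next "|" delimiter.
FIELD_RE = re.compile(r"<(Coarse class|Fine class|Reasoning)>\|([^|]*)")

def parse_response(text: str) -> dict:
    """Extract the constrained output fields from a model response."""
    return {name: value.strip() for name, value in FIELD_RE.findall(text)}
```

Constraining the output to fixed fields is what makes accuracy scoring mechanical: the fine-class field can be compared directly against the gold label, with no free-text caption to interpret.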
| Model | V0 Prompt | V1 Prompt | V2 Prompt |
|---|---|---|---|
| Qwen2VL-72B | 29.33% | 52.00% | 51.78% |
| Qwen2.5VL-7B | 90.67% | 96.44% | 96.11% |
| Model Architecture | Qwen2VL | Qwen2.5VL | Improvement |
|---|---|---|---|
| Coarse class | 51.78% | 96.56% | +44.78% |
| Natural | 44.62% | 93.85% | +49.23% |
| Urban/Man-made | 13.64% | 78.64% | +65.00% |
| Transportation and Infrastructure | 18.08% | 91.15% | +73.07% |
| Sports and Farmland | 2.50% | 95.00% | +92.50% |
| Model Size | 7B | 72B | Improvement |
|---|---|---|---|
| Coarse class | 96.11% | 96.56% | +0.45% |
| Natural | 92.31% | 93.85% | +1.54% |
| Urban/Man-made | 81.36% | 78.64% | −2.72% |
| Transportation and Infrastructure | 90.77% | 91.15% | +0.38% |
| Sports and Farmland | 89.38% | 95.00% | +5.62% |
| Training Strategy | LoRA | Full | Improvement |
|---|---|---|---|
| Coarse class | 96.11% | 96.00% | −0.11% |
| Natural | 92.31% | 93.08% | +0.77% |
| Urban/Man-made | 81.36% | 81.82% | +0.46% |
| Transportation and Infrastructure | 90.77% | 90.38% | −0.39% |
| Sports and Farmland | 89.38% | 88.12% | −1.26% |
| Model | Trainable Params (M) | Peak Mem (GB) | Top-1 Accuracy (%) | Macro-F1 (%) |
|---|---|---|---|---|
| ResNet-50 (ImageNet) | 23.60 | 3.139 | 90.67 | 90.46 |
| MobileNetV2 (ImageNet) | 2.282 | 2.626 | 89.78 | 89.81 |
| EfficientNet-B0 (ImageNet) | 4.065 | 2.968 | 91.11 | 91.09 |
| Qwen2.5VL-7B + LoRA + HPE | 8.39 | 14.00 | 91.33 | 91.32 |
| Labeled (%) | ResNet-50 (Top-1/F1) | MobileNetV2 (Top-1/F1) | EfficientNet-B0 (Top-1/F1) | Qwen2.5VL-7B + LoRA (Top-1/F1) |
|---|---|---|---|---|
| 1% | 27.89/26.25 | 53.44/53.46 | 29.00/28.81 | 81.44/82.63 |
| 5% | 49.33/47.63 | 67.00/65.94 | 53.67/53.02 | 87.22/87.90 |
| 10% | 69.44/69.44 | 81.11/80.98 | 75.00/74.51 | 85.22/84.92 |
| 25% | 84.33/84.22 | 90.00/89.94 | 86.22/86.13 | 87.11/87.04 |
| Dataset | Few-Shot LoRA Coarse Top-1 | Few-Shot LoRA Coarse F1 | Zero-Shot Coarse Top-1 | Zero-Shot Coarse F1 |
|---|---|---|---|---|
| AID- | 97.78 | 97.67 | - | - |
| UCM | 95.70 | 95.55 | 87.20 | 87.04 |
| WHU | 96.81 | 95.92 | 93.90 | 91.98 |
| RSSCN7 | 83.56 | 81.23 | 78.27 | 71.83 |
| Coarse Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Natural | 86.7/92.7 | 82.8/83.0 | 84.7/87.6 | 1152 |
| Urban/Man-made | 80.4/95.6 | 88.7/82.7 | 84.3/88.7 | 768 |
| Transportation and Infrastructure | 66.7/60.4 | 34.1/76.6 | 42.4/67.5 | 384 |
| Sports and Farmland | 56.0/71.3 | 88.0/94.0 | 75.9/81.1 | 384 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, T.; Ai, J. Hierarchical Prompt Engineering for Remote Sensing Scene Understanding with Large Vision–Language Models. Remote Sens. 2025, 17, 3727. https://doi.org/10.3390/rs17223727

