Ladder-Side-Tuning of Visual Foundation Model for City-Scale Individual Tree Detection from High-Resolution Remote Sensing Images
Highlights
- We propose Tree-SAM, a ladder-side-tuned SAM framework with three task-specific modules (CCFB, HIAN, and CAAH) to enable robust city-scale individual tree crown instance detection under heterogeneous urban, mixed, and forest scenes.
- Tree-SAM achieves the highest F1 and AP@50 across datasets and scenarios, reaching F1/AP@50 of 0.762/0.478 (forest), 0.732/0.454 (mixed), and 0.830/0.526 (urban) on GZ-Tree Crown, and demonstrates strong cross-region robustness in zero-shot transfer to BAMFORESTS and SZ-Dataset.
- The high-precision automated workflow enables large-scale, individual-level tree monitoring, providing critical data support for urban forest management, carbon stock estimation, and ecological assessment.
- This study establishes an efficient adaptation paradigm for vision foundation models in remote sensing, showing that parameter-efficient fine-tuning can effectively bridge the domain gap for specialized downstream tasks.
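To make the idea of ladder side-tuning more concrete, the following is a minimal sketch in the spirit of the general LST scheme [57]: a frozen backbone runs unchanged while a much narrower, trainable side network receives down-projected backbone activations through learned scalar gates. It is not the Tree-SAM architecture; the CCFB, HIAN, and CAAH modules are not modeled, and all names, widths, and the gating form used here (`ladder_forward`, `down`, `alphas`, `temperature`) are illustrative assumptions.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def relu(x):
    return [max(0.0, v) for v in x]


def rand_matrix(n, m, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(n)]


def ladder_forward(x, backbone, down, side, alphas, temperature=1.0):
    """Run a frozen backbone and a narrow side network in parallel.

    Assumed gating: at each layer the side state is mixed with a
    down-projected backbone activation using mu = sigmoid(alpha / T).
    """
    h_b = x
    h_s = linear(x, down[0])  # side network starts from a down-projected input
    for w_b, w_down, w_s, alpha in zip(backbone, down[1:], side, alphas):
        h_b = relu(linear(h_b, w_b))  # frozen backbone layer (never updated)
        mu = 1.0 / (1.0 + math.exp(-alpha / temperature))
        shortcut = linear(h_b, w_down)  # ladder connection into the side network
        mixed = [mu * b + (1.0 - mu) * s for b, s in zip(shortcut, h_s)]
        h_s = relu(linear(mixed, w_s))  # trainable side layer
    return h_b, h_s


rng = random.Random(0)
D, d, L = 8, 2, 3  # backbone width, side width, number of layers (toy values)
backbone = [rand_matrix(D, D, rng) for _ in range(L)]
down = [rand_matrix(D, d, rng) for _ in range(L + 1)]
side = [rand_matrix(d, d, rng) for _ in range(L)]
alphas = [0.0] * L  # gates start at mu = 0.5
x = [rng.uniform(-1.0, 1.0) for _ in range(D)]

h_b, h_s = ladder_forward(x, backbone, down, side, alphas)
frozen = L * D * D
trainable = (L + 1) * D * d + L * d * d + L
print(len(h_b), len(h_s), frozen, trainable)
```

In this toy setting only the down-projections, the side layers, and the gates would receive gradients, so the trainable parameter count stays well below that of the frozen backbone, and no backpropagation through the backbone is needed.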
Abstract
1. Introduction
- (1) Convolutional neural network (CNN)-based individual tree detection methods [17,18,19]. Early convolution-based frameworks often combined deep learning with traditional algorithms (e.g., watershed segmentation) [20] or used standard object detectors (e.g., Faster R-CNN, YOLO) [21] to delineate irregular tree crowns [10]. Although effective in tree-covered regions, CNN-based segmentation frameworks are inherently limited in modeling individual instances [22], particularly in dense urban forests where tree crowns are embedded in complex, highly heterogeneous backgrounds [23]. These limitations typically necessitate elaborate post-processing to achieve object-level separation. Mask R-CNN introduced multi-scale feature representations and instance-aware mechanisms, which significantly improved detection performance for individual trees [24]. It led to specialized models such as Detectree2 for tropical forests [25] and DeepForest for diverse geographical datasets [26,27]. However, Mask R-CNN still relies on convolutional backbones, which are limited in capturing long-range dependencies and scene-level semantics. This reduces robustness in complex urban contexts where background interference is prominent and object boundaries are ambiguous [22,28].
- (2) Transformer and graph-based individual tree detection methods. Transformers model long-range dependencies and global semantics through self-attention, enabling dynamic attention across the entire image [29,30,31] and demonstrating strong structural awareness in complex urban forest scenes [32,33]. Various Transformer architectures, such as Swin Transformer [34], DeiT [35], and SegFormer [36], have been successfully adapted for large-area tree mapping and classification. Furthermore, Graph Convolutional Networks (GCNs) have shown strong capabilities in modeling complex spatial relationships and processing multimodal data, such as integrating UAV-based multispectral images and LiDAR point clouds for urban tree species classification [37]. However, because they lack built-in spatial locality and translation equivariance, Transformers are less effective at preserving spatial continuity [38]. During early encoding stages, they tend to lose fine-grained textures and boundary cues, which reduces their capacity to represent small-scale tree canopies. This limitation is especially pronounced when segmenting small or indistinct tree crowns, where the model produces weaker responses and instance-level segmentation performance declines. Both CNN- and Transformer-based methods typically require large amounts of labeled data and exhibit limited generalization across diverse urban scenes [39].
- (3) The emergence of the Segment Anything Model (SAM) [40], a large-scale pretrained vision model, has introduced new opportunities for urban individual tree detection [41,42,43]. With its powerful global feature modeling and strong cross-task generalization capabilities, SAM has shown notable potential in complex background perception and global feature extraction, and has been extensively explored across various downstream visual tasks [44,45,46,47,48] and remote sensing applications [49]. Within the remote sensing domain, SAM exhibits remarkable capability in image segmentation and large-scale mapping, attaining state-of-the-art performance across diverse downstream applications [50,51,52,53]. However, because of the domain shift from its training data and the absence of task-specific supervision, out-of-the-box SAM degrades notably when applied directly to urban individual tree detection. To address this, recent studies have investigated prompt-based solutions, such as using bounding boxes from specialized detectors or generating tree-center heatmaps for crown segmentation [43]. Prompt-based methods, however, must pre-localize tree centers or bounding boxes before invoking SAM, i.e., a detect-then-segment cascade, in which prompt misplacement or omission often yields over-extended masks (covering extra background or neighboring objects) or under-segmented masks, particularly in scenarios involving complex crown boundaries, multi-scale canopy structures, and small object recognition [54]. Moreover, their performance varies markedly across biomes and forest types, for example, between plantation and natural forests, and among boreal, temperate, and tropical settings [55].
2. Materials and Methods
2.1. Data Sources
2.1.1. Dataset 1: GZ-Tree Crown
2.1.2. Dataset 2: BAMFORESTS
2.1.3. Dataset 3: SZ-Dataset
2.2. Method
2.2.1. Network Structure
2.2.2. Cross-Correlation Feature Backbone
2.2.3. Hierarchical Instance Aggregation Neck
2.2.4. Context-Aware Adaptation Head
2.3. Metrics
2.3.1. Intersection over Union (IoU) and Matching Criterion
2.3.2. Precision, Recall, and Detection IoU
2.3.3. F1-Score and mAP@0.50
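As a rough illustration of the matching criterion and the counting behind precision and recall, the sketch below matches predictions to ground-truth crowns greedily by IoU. It is a simplified stand-in, not the paper's evaluation code: crowns are represented by axis-aligned boxes rather than masks, the 0.5 threshold is assumed from the "@0.50" in the metric name, and the function names (`box_iou`, `match_counts`, `precision_recall`) are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_counts(preds, gts, iou_thr=0.5):
    """Greedy one-to-one matching; preds is a list of (score, box) pairs."""
    matched = set()
    tp = 0
    # Higher-scoring predictions get the first pick of ground-truth crowns.
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions without a matching crown
    fn = len(gts) - tp    # crowns that no prediction matched
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


gts = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
preds = [
    (0.9, (1, 1, 10, 10)),    # overlaps the first crown
    (0.8, (20, 20, 29, 31)),  # overlaps the second crown
    (0.3, (60, 60, 70, 70)),  # background false alarm
]
tp, fp, fn = match_counts(preds, gts)
p, r = precision_recall(tp, fp, fn)
print(tp, fp, fn, round(p, 3), round(r, 3))
```

In the toy example two predictions are matched, one is a false positive, and one crown is missed, so precision and recall both come out at two thirds.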
3. Results
3.1. Implementation Details
3.2. Ablation Study
3.3. Comparison with SOTA Methods
3.3.1. Model Performance in GZ-Tree Crown
3.3.2. Model Performance in BAMFORESTS
- (1) Zero-Shot Cross-Domain Generalization on the BAMFORESTS Dataset
- (2) Domain Adaptation Performance on the BAMFORESTS Dataset
3.3.3. Model Performance in SZ-Dataset
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zhang, X.; Huang, H.; Tu, K.; Li, R.; Zhang, X.; Wang, P.; Li, Y.; Yang, Q.; Acerman, A.C.; Guo, N. Effects of Plant Community Structural Characteristics on Carbon Sequestration in Urban Green Spaces. Sci. Rep. 2024, 14, 7382.
- Ettinger, A.K.; Bratman, G.N.; Carey, M.; Hebert, R.; Hill, O.; Kett, H.; Levin, P.; Murphy-Williams, M.; Wyse, L. Street Trees Provide an Opportunity to Mitigate Urban Heat and Reduce Risk of High Heat Exposure. Sci. Rep. 2024, 14, 3266.
- Feng, R.; Wang, F.; Liu, S.; Qi, W.; Zhengchen, R.; Wang, D. Synergistic Effects of Urban Forest on Urban Heat Island-Air Pollution-Carbon Stock in Mega-Urban Agglomeration. Urban For. Urban Green. 2025, 103, 128590.
- Corro, L.M.; Bagstad, K.J.; Heris, M.P.; Ibsen, P.C.; Schleeweis, K.G.; Diffendorfer, J.E.; Troy, A.; Megown, K.; O’Neil-Dunne, J.P.M. An Enhanced National-Scale Urban Tree Canopy Cover Dataset for the United States. Sci. Data 2025, 12, 490.
- Nowak, D.; Crane, D.; Stevens, J.; Hoehn, R.; Walton, J.; Bond, J. A Ground-Based Method of Assessing Urban Forest Structure and Ecosystem Services. Arboric. Urban For. 2008, 34, 347–358.
- Shojanoori, R.; Shafri, H.Z.M. Review on the Use of Remote Sensing for Urban Forest Monitoring. Arboric. Urban For. 2016, 42, 400–417.
- Erker, T.; Wang, L.; Lorentz, L.; Stoltman, A.; Townsend, P.A. A Statewide Urban Tree Canopy Mapping Method. Remote Sens. Environ. 2019, 229, 148–158.
- He, D.; Shi, Q.; Liu, X.; Zhong, Y.; Zhang, L. Generating 2m Fine-Scale Urban Tree Cover Product over 34 Metropolises in China Based on Deep Context-Aware Sub-Pixel Mapping Network. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102667.
- Zhen, Z.; Quackenbush, L.J.; Zhang, L. Trends in Automatic Individual Tree Crown Detection and Delineation—Evolution of LiDAR Data. Remote Sens. 2016, 8, 333.
- Freudenberg, M.; Magdon, P.; Nölke, N. Individual Tree Crown Delineation in High-Resolution Remote Sensing Images Based on U-Net. Neural Comput. Applic. 2022, 34, 22197–22207.
- Liu, K.; Li, T.; Peng, D. Aerial Image Object Detection Based on RGB-Infrared Multibranch Progressive Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14.
- Wang, H.; Li, J.; Van de Voorde, T.; Zhou, C.; De Maeyer, P.; Ma, Y.; Shen, Z. Individual Populus euphratica Tree Detection in Sparse Desert Forests Based on Constrained 2-D Bin Packing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–19.
- Gu, J.; Congalton, R.G. Individual Tree Crown Delineation from UAS Imagery Based on Region Growing by Over-Segments with a Competitive Mechanism. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4402411.
- Gomes, M.F.; Maillard, P.; Deng, H. Individual Tree Crown Detection in Sub-Meter Satellite Imagery Using Marked Point Processes and a Geometrical-Optical Model. Remote Sens. Environ. 2018, 211, 184–195.
- Brandt, M.; Tucker, C.J.; Kariryaa, A.; Rasmussen, K.; Abel, C.; Small, J.; Chave, J.; Rasmussen, L.V.; Hiernaux, P.; Diouf, A.A.; et al. An Unexpectedly Large Count of Trees in the West African Sahara and Sahel. Nature 2020, 587, 78–82.
- Sun, Y.; Li, Z.; He, H.; Guo, L.; Zhang, X.; Xin, Q. Counting Trees in a Subtropical Mega City Using the Instance Segmentation Method. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102662.
- Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36.
- Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.
- Beloiu, M.; Heinzmann, L.; Rehush, N.; Gessler, A.; Griess, V.C. Individual Tree-Crown Detection and Species Identification in Heterogeneous Forests Using Aerial RGB Imagery and Deep Learning. Remote Sens. 2023, 15, 1463.
- Lassalle, G.; Ferreira, M.P.; La Rosa, L.E.C.; de Souza Filho, C.R. Deep Learning-Based Individual Tree Crown Delineation in Mangrove Forests Using Very-High-Resolution Satellite Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 189, 220–235.
- dos Santos, A.A.; Marcato Junior, J.; Araújo, M.S.; Di Martini, D.R.; Tetila, E.C.; Siqueira, H.L.; Aoki, C.; Eltner, A.; Matsubara, E.T.; Pistori, H. Assessment of CNN-Based Methods for Individual Tree Detection on Images Captured by RGB Cameras Attached to UAVs. Sensors 2019, 19, 3595.
- Zhao, H.; Morgenroth, J.; Pearse, G.; Schindler, J. A Systematic Review of Individual Tree Crown Detection and Delineation with Convolutional Neural Networks (CNN). Curr. For. Rep. 2023, 9, 149–170.
- Zheng, J.; Yuan, S.; Li, W.; Fu, H.; Yu, L.; Huang, J. A Review of Individual Tree Crown Detection and Delineation from Optical Remote Sensing Images: Current Progress and Future. IEEE Geosci. Remote Sens. Mag. 2025, 13, 209–236.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870.
- Ball, J.G.C.; Hickman, S.H.M.; Jackson, T.D.; Koay, X.J.; Hirst, J.; Jay, W.; Archer, M.; Aubry-Kientz, M.; Vincent, G.; Coomes, D.A. Accurate Delineation of Individual Tree Crowns in Tropical Forests from Aerial RGB Imagery Using Mask R-CNN. Remote Sens. Ecol. Conserv. 2023, 9, 641–655.
- Weinstein, B.G.; Marconi, S.; Bohlman, S.; Zare, A.; White, E. Individual Tree-Crown Detection in RGB Imagery Using Semi-Supervised Deep Learning Neural Networks. Remote Sens. 2019, 11, 1309.
- Weinstein, B.G.; Marconi, S.; Aubry-Kientz, M.; Vincent, G.; Senyondo, H.; White, E.P. DeepForest: A Python Package for RGB Deep Learning Tree Crown Delineation. Methods Ecol. Evol. 2020, 11, 1743–1751.
- Gan, Y.; Wang, Q.; Iio, A. Tree Crown Detection and Delineation in a Temperate Deciduous Forest from UAV RGB Imagery Using Deep Learning Approaches: Effects of Spatial Resolution and Species Characteristics. Remote Sens. 2023, 15, 778.
- Wang, R.; Ma, L.; He, G.; Johnson, B.A.; Yan, Z.; Chang, M.; Liang, Y. Transformers for Remote Sensing: A Systematic Review and Analysis. Sensors 2024, 24, 3495.
- Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.-S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
- Gao, T.; Gao, Z.; Ji, H.; Ao, W.; Song, W. Query Adaptive Transformer and Multiprototype Rectification for Few-Shot Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5651413.
- Li, X.; Cheng, Y.; Fang, Y.; Liang, H.; Xu, S. 2DSegFormer: 2-D Transformer Model for Semantic Segmentation on Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4709413.
- Jiang, T.; Freudenberg, M.; Kleinn, C.; Lüddecke, T.; Ecker, A.; Nölke, N. Detection Transformer-Based Approach for Mapping Trees Outside Forests on High Resolution Satellite Imagery. Ecol. Inform. 2025, 87, 103114.
- Vinod, P.; Behera, M.; Jaya Prakash, A.; Hebbar, R.; Srivastav, S. A Novel Multitask Transformer Deep Learning Architecture for Joint Classification and Segmentation of Horticulture Plantations Using Very High-Resolution Satellite Imagery. Comput. Electron. Agric. 2024, 227, 109540.
- Joshi, D.; Witharana, C. Vision Transformer-Based Unhealthy Tree Crown Detection in Mixed Northeastern US Forests and Evaluation of Annotation Uncertainty. Remote Sens. 2025, 17, 1066.
- Li, X.; Wang, L.; Guan, H.; Chen, K.; Zang, Y.; Yu, Y. Urban Tree Species Classification Using UAV-Based Multispectral Images and LiDAR Point Clouds. J. Geovisualization Spat. Anal. 2023, 8, 5.
- Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-Based Visual Segmentation: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163.
- Dersch, S.; Schöttl, A.; Krzystek, P.; Heurich, M. Towards Complete Tree Crown Delineation by Instance Segmentation with Mask R–CNN and DETR Using UAV-Based Multispectral Imagery and Lidar Data. ISPRS Open J. Photogramm. Remote Sens. 2023, 8, 100037.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y. Segment Anything. arXiv 2023, arXiv:2304.02643.
- Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216.
- Wang, H.; Köser, K.; Ren, P. Large Foundation Model Empowered Discriminative Underwater Image Enhancement. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5609317.
- Lungo Vaschetti, J.; Arnaudo, E.; Rossi, C. TreePseCo: Scaling Individual Tree Crown Segmentation Using Large Vision Models. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2025, 48, 275–282.
- Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment Anything in Medical Images. Nat. Commun. 2024, 15, 654.
- Zhang, C.; Liu, L.; Cui, Y.; Huang, G.; Lin, W.; Yang, Y.; Hu, Y. A Comprehensive Survey on Segment Anything Model for Vision and Beyond. arXiv 2023, arXiv:2305.08196.
- Ke, L.; Ye, M.; Danelljan, M.; Liu, Y.; Tai, Y.-W.; Tang, C.-K.; Yu, F. Segment Anything in High Quality. arXiv 2023, arXiv:2306.01567.
- Osco, L.P.; Wu, Q.; de Lemos, E.L.; Gonçalves, W.N.; Ramos, A.P.M.; Li, J.; Marcato, J. The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103540.
- Ma, X.; Wu, Q.; Zhao, X.; Zhang, X.; Pun, M.-O.; Huang, B. SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5636916.
- Sun, X.; Liu, J.; Shen, H.; Zhu, X.; Hu, P. On Efficient Variants of Segment Anything Model: A Survey. Int. J. Comput. Vis. 2025, 133, 7406–7436.
- Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117.
- Sun, J.; Yan, S.; Yao, X.; Gao, B.; Yang, J. A Segment Anything Model Based Weakly Supervised Learning Method for Crop Mapping Using Sentinel-2 Time Series Images. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104085.
- Wang, D.; Zhang, J.; Du, B.; Xu, M.; Liu, L.; Tao, D.; Zhang, L. SAMRS: Scaling-up remote sensing segmentation dataset with segment anything model. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23), New Orleans, LA, USA, 10–16 December 2023.
- Liu, N.; Xu, X.; Su, Y.; Zhang, H.; Li, H.-C. PointSAM: Pointly-Supervised Segment Anything Model for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15.
- Zhou, T.; Xia, W.; Zhang, F.; Chang, B.; Wang, W.; Yuan, Y.; Konukoglu, E.; Cremers, D. Image Segmentation in Foundation Model Era: A Survey. arXiv 2024, arXiv:2408.12957.
- Teng, M.; Ouaknine, A.; Laliberté, E.; Bengio, Y.; Rolnick, D.; Larochelle, H. Assessing SAM for Tree Crown Instance Segmentation from Drone Imagery. arXiv 2025, arXiv:2503.20199.
- Chai, S.; Jain, R.K.; Teng, S.; Liu, J.; Li, Y.; Tateyama, T.; Chen, Y. Ladder Fine-Tuning Approach for SAM Integrating Complementary Network. Procedia Comput. Sci. 2024, 246, 4951–4958.
- Sung, Y.-L.; Cho, J.; Bansal, M. LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 12991–13005.
- Troles, J.; Schmid, U.; Fan, W.; Tian, J. BAMFORESTS: Bamberg Benchmark Forest Dataset of Individual Tree Crowns in Very-High-Resolution UAV Images. Remote Sens. 2024, 16, 1935.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016, arXiv:1506.01497.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. arXiv 2017, arXiv:1611.05431.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. arXiv 2019, arXiv:1906.09756.
- Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784.
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. arXiv 2022, arXiv:2112.01527.
- Lou, M.; Zhang, S.; Zhou, H.-Y.; Yang, S.; Wu, C.; Yu, Y. TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 11534–11547.
- Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders. arXiv 2023, arXiv:2301.00808.
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019.
- Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. arXiv 2020, arXiv:2003.10152.









| Configuration | Model | P | R | DIoU | F1 | AP@50 |
|---|---|---|---|---|---|---|
| Baseline | SAM | 0.111 | 0.714 | 0.106 | 0.192 | 0.121 |
| Baseline | Swin | 0.402 | 0.558 | 0.305 | 0.467 | 0.212 |
| Baseline | ResNeXt | 0.606 | 0.562 | 0.412 | 0.583 | 0.295 |
| Baseline+CCFB+HIAN | SAM+Swin | 0.517 | 0.824 | 0.466 | 0.636 | 0.341 |
| Baseline+CCFB+HIAN | SAM+ResNeXt | 0.739 | 0.839 | 0.647 | 0.786 | 0.526 |
| Baseline+CCFB+HIAN+CAAH | Tree-SAM (Swin) | 0.723 | 0.824 | 0.627 | 0.771 | 0.474 |
| Baseline+CCFB+HIAN+CAAH | Tree-SAM (ResNeXt) | 0.821 | 0.839 | 0.709 | 0.830 | 0.526 |
| Configuration | Model | P | R | DIoU | F1 | AP@50 |
|---|---|---|---|---|---|---|
| Baseline | SAM | 0.319 | 0.141 | 0.108 | 0.196 | 0.093 |
| Baseline | Swin | 0.586 | 0.426 | 0.327 | 0.493 | 0.344 |
| Baseline | ResNeXt | 0.647 | 0.456 | 0.365 | 0.535 | 0.363 |
| Baseline+CCFB+HIAN | SAM+Swin | 0.525 | 0.773 | 0.455 | 0.625 | 0.335 |
| Baseline+CCFB+HIAN | SAM+ResNeXt | 0.669 | 0.758 | 0.552 | 0.711 | 0.433 |
| Baseline+CCFB+HIAN+CAAH | Tree-SAM (Swin) | 0.702 | 0.773 | 0.582 | 0.736 | 0.428 |
| Baseline+CCFB+HIAN+CAAH | Tree-SAM (ResNeXt) | 0.767 | 0.758 | 0.616 | 0.762 | 0.478 |
| Scenario | Model | P | R | DIoU | F1 | AP@50 |
|---|---|---|---|---|---|---|
| Forest | C-Mask R-CNN | 0.532 | 0.669 | 0.421 | 0.593 | 0.305 |
| Forest | YOLACT | 0.686 | 0.716 | 0.539 | 0.701 | 0.409 |
| Forest | SOLOv2 | 0.808 | 0.408 | 0.372 | 0.542 | 0.315 |
| Forest | RTMDet | 0.813 | 0.521 | 0.465 | 0.635 | 0.414 |
| Forest | Mask2Former | 0.794 | 0.432 | 0.389 | 0.560 | 0.341 |
| Forest | ConvNeXtV2 | 0.239 | 0.107 | 0.079 | 0.147 | 0.073 |
| Forest | TransXNet | 0.315 | 0.432 | 0.223 | 0.364 | 0.150 |
| Forest | SAM-DET | 0.754 | 0.413 | 0.364 | 0.534 | 0.343 |
| Forest | Tree-SAM | 0.767 | 0.758 | 0.616 | 0.762 | 0.478 |
| Mixed | C-Mask R-CNN | 0.542 | 0.645 | 0.418 | 0.589 | 0.331 |
| Mixed | YOLACT | 0.649 | 0.513 | 0.402 | 0.573 | 0.334 |
| Mixed | SOLOv2 | 0.695 | 0.715 | 0.544 | 0.705 | 0.441 |
| Mixed | RTMDet | 0.727 | 0.448 | 0.383 | 0.554 | 0.347 |
| Mixed | Mask2Former | 0.769 | 0.407 | 0.363 | 0.532 | 0.319 |
| Mixed | ConvNeXtV2 | 0.380 | 0.174 | 0.136 | 0.239 | 0.109 |
| Mixed | TransXNet | 0.402 | 0.558 | 0.305 | 0.467 | 0.213 |
| Mixed | SAM-DET | 0.711 | 0.514 | 0.425 | 0.597 | 0.318 |
| Mixed | Tree-SAM | 0.737 | 0.726 | 0.577 | 0.732 | 0.454 |
| Urban | C-Mask R-CNN | 0.698 | 0.725 | 0.551 | 0.711 | 0.435 |
| Urban | YOLACT | 0.764 | 0.618 | 0.519 | 0.683 | 0.465 |
| Urban | SOLOv2 | 0.603 | 0.752 | 0.503 | 0.669 | 0.394 |
| Urban | RTMDet | 0.732 | 0.657 | 0.529 | 0.692 | 0.515 |
| Urban | Mask2Former | 0.828 | 0.669 | 0.588 | 0.740 | 0.482 |
| Urban | ConvNeXtV2 | 0.424 | 0.248 | 0.186 | 0.313 | 0.138 |
| Urban | TransXNet | 0.357 | 0.613 | 0.291 | 0.451 | 0.197 |
| Urban | SAM-DET | 0.817 | 0.717 | 0.618 | 0.764 | 0.517 |
| Urban | Tree-SAM | 0.821 | 0.839 | 0.710 | 0.830 | 0.526 |
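
The F1 and DIoU columns in the per-scenario comparison can be cross-checked from the P and R columns alone. The sketch below does this for the three Tree-SAM rows. F1 is the usual harmonic mean of precision and recall. The DIoU formula used here, TP / (TP + FP + FN), is an assumption: it reproduces the tabulated values to within rounding, but the definition is not stated in the text shown here. The helper names are hypothetical.

```python
def f1_from_pr(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def diou_from_pr(p, r):
    """TP / (TP + FP + FN), rewritten in terms of precision and recall.

    Since 1/P = (TP + FP)/TP and 1/R = (TP + FN)/TP, it follows that
    1/P + 1/R - 1 = (TP + FP + FN)/TP.
    """
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (1.0 / p + 1.0 / r - 1.0)


# Tree-SAM rows from the table: scenario -> (P, R, reported DIoU, reported F1)
rows = {
    "forest": (0.767, 0.758, 0.616, 0.762),
    "mixed": (0.737, 0.726, 0.577, 0.732),
    "urban": (0.821, 0.839, 0.710, 0.830),
}

for name, (p, r, diou, f1) in rows.items():
    print(
        name,
        "F1:", round(f1_from_pr(p, r), 3), "reported", f1,
        "| DIoU:", round(diou_from_pr(p, r), 3), "reported", diou,
    )
```

Because P and R are themselves rounded to three decimals, the recomputed values can differ from the reported ones in the last digit.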
| Model | Scene | P | R | DIoU | F1 | AP@50 |
|---|---|---|---|---|---|---|
| C-Mask R-CNN | Forest | 0.146 | 0.197 | 0.091 | 0.168 | 0.061 |
| YOLACT | Forest | 0.171 | 0.113 | 0.073 | 0.136 | 0.058 |
| SOLOv2 | Forest | 0.194 | 0.263 | 0.126 | 0.223 | 0.108 |
| RTMDet | Forest | 0.210 | 0.153 | 0.097 | 0.177 | 0.093 |
| Mask2Former | Forest | 0.204 | 0.107 | 0.075 | 0.140 | 0.072 |
| ConvNeXtV2 | Forest | 0.131 | 0.263 | 0.096 | 0.175 | 0.030 |
| TransXNet | Forest | 0.131 | 0.263 | 0.096 | 0.175 | 0.041 |
| SAM-DET | Forest | 0.290 | 0.217 | 0.141 | 0.248 | 0.141 |
| Tree-SAM | Forest | 0.316 | 0.361 | 0.202 | 0.337 | 0.190 |
| Model | mAP | mAP@50 | mAP@75 | mAP_m | mAP_l |
|---|---|---|---|---|---|
| Mask R-CNN | 0.421 | 0.715 | 0.449 | 0.167 | 0.496 |
| Mask2Former | 0.386 | 0.701 | 0.395 | 0.153 | 0.448 |
| RTMDet | 0.429 | 0.728 | 0.463 | 0.175 | 0.497 |
| Tree-SAM | 0.466 | 0.754 | 0.514 | 0.200 | 0.537 |
| Scenario | Model | P | R | DIoU | F1 | AP@50 |
|---|---|---|---|---|---|---|
| Forest | C-Mask R-CNN | 0.365 | 0.492 | 0.265 | 0.419 | 0.216 |
| Forest | YOLACT | 0.428 | 0.282 | 0.205 | 0.340 | 0.199 |
| Forest | SOLOv2 | 0.485 | 0.658 | 0.387 | 0.558 | 0.324 |
| Forest | RTMDet | 0.526 | 0.382 | 0.284 | 0.443 | 0.289 |
| Forest | Mask2Former | 0.510 | 0.267 | 0.212 | 0.350 | 0.213 |
| Forest | ConvNeXtV2 | 0.169 | 0.120 | 0.076 | 0.141 | 0.070 |
| Forest | TransXNet | 0.234 | 0.471 | 0.185 | 0.313 | 0.129 |
| Forest | SAM-DET | 0.517 | 0.387 | 0.284 | 0.443 | 0.325 |
| Forest | Tree-SAM | 0.564 | 0.644 | 0.430 | 0.601 | 0.377 |
| Mixed | C-Mask R-CNN | 0.339 | 0.267 | 0.175 | 0.299 | 0.168 |
| Mixed | YOLACT | 0.427 | 0.492 | 0.296 | 0.457 | 0.266 |
| Mixed | SOLOv2 | 0.255 | 0.658 | 0.225 | 0.367 | 0.230 |
| Mixed | RTMDet | 0.455 | 0.035 | 0.034 | 0.065 | 0.041 |
| Mixed | Mask2Former | 0.338 | 0.463 | 0.243 | 0.391 | 0.234 |
| Mixed | ConvNeXtV2 | 0.272 | 0.375 | 0.187 | 0.315 | 0.144 |
| Mixed | TransXNet | 0.110 | 0.154 | 0.068 | 0.128 | 0.058 |
| Mixed | SAM-DET | 0.431 | 0.492 | 0.298 | 0.460 | 0.298 |
| Mixed | Tree-SAM | 0.490 | 0.761 | 0.425 | 0.597 | 0.370 |
| Urban | C-Mask R-CNN | 0.419 | 0.684 | 0.351 | 0.519 | 0.318 |
| Urban | YOLACT | 0.327 | 0.620 | 0.272 | 0.428 | 0.292 |
| Urban | SOLOv2 | 0.388 | 0.684 | 0.329 | 0.495 | 0.291 |
| Urban | RTMDet | 0.302 | 0.684 | 0.265 | 0.419 | 0.312 |
| Urban | Mask2Former | 0.454 | 0.604 | 0.350 | 0.518 | 0.337 |
| Urban | ConvNeXtV2 | 0.145 | 0.349 | 0.114 | 0.205 | 0.090 |
| Urban | TransXNet | 0.095 | 0.745 | 0.092 | 0.169 | 0.073 |
| Urban | SAM-DET | 0.417 | 0.717 | 0.358 | 0.527 | 0.341 |
| Urban | Tree-SAM | 0.480 | 0.833 | 0.438 | 0.609 | 0.386 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Huang, C.; Ding, Y.; Xiao, K.; Liu, R.; Sun, Y. Ladder-Side-Tuning of Visual Foundation Model for City-Scale Individual Tree Detection from High-Resolution Remote Sensing Images. Remote Sens. 2026, 18, 819. https://doi.org/10.3390/rs18050819

