Ladder-Side-Tuning of Visual Foundation Model for City-Scale Individual Tree Detection from High-Resolution Remote Sensing Images

Huang, Chen; Ding, Ying; Xiao, Kun; Liu, Rong; Sun, Ying

doi:10.3390/rs18050819


by Chen Huang, Ying Ding, Kun Xiao, Rong Liu and Ying Sun *

School of Geography and Planning, Sun Yat-sen University & Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Guangzhou 510275, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(5), 819; https://doi.org/10.3390/rs18050819

Submission received: 16 January 2026 / Revised: 26 February 2026 / Accepted: 4 March 2026 / Published: 6 March 2026

(This article belongs to the Special Issue The Recent Progression of Machine Learning in Remote Sensing: Theory and Modelling (Second Edition))


Highlights

What are the main findings?

We propose Tree-SAM, a ladder-side-tuned SAM framework with three task-specific modules (CCFB, HIAN, and CAAH) to enable robust city-scale individual tree crown instance detection under heterogeneous urban, mixed, and forest scenes.
Tree-SAM consistently achieves the best accuracy across datasets and scenarios, reaching F1/AP@50 of 0.762/0.478 (forest), 0.732/0.454 (mixed), and 0.830/0.526 (urban) on GZ-Tree Crown, and demonstrating strong cross-region robustness in zero-shot transfer to BAMFORESTS and SZ-Dataset.

What are the implications of the main findings?

The high-precision automated workflow enables large-scale, individual-level tree monitoring, providing critical data support for urban forest management, carbon stock estimation, and ecological assessment.
This study establishes an efficient adaptation paradigm for vision foundation models in remote sensing, proving that parameter-efficient fine-tuning can effectively bridge the domain gap for specialized downstream tasks.

Abstract

Accurate detection of individual trees is essential for urban forest management and ecological assessment, yet it remains challenging due to heterogeneous backgrounds, variable tree crown sizes, and significant variations across urban scenarios. To address these issues, we propose Tree-SAM, a city-scale individual tree detection architecture built upon the visual foundation model Segment Anything Model (SAM) and equipped with three task-specific modules, i.e., Cross-Correlation Feature Backbone (CCFB), Hierarchical Instance Aggregation Neck (HIAN), and Context-Aware Adaptation Head (CAAH). These modules synergistically fuse general semantics with fine-grained structural cues, enable multi-scale feature aggregation, and adaptively refine predictions based on specific scene contexts. On the GZ-Tree Crown dataset, Tree-SAM achieves F1-scores of 0.762, 0.732, and 0.830, with corresponding AP@50 values of 0.478, 0.454, and 0.526 in the forest, mixed, and urban scenarios, respectively, consistently ranking first across all scenes and demonstrating strong adaptability to diverse intra-city landscapes. Additional evaluations on the BAMFORESTS dataset and the SZ-Dataset further confirm its robustness across varied geographic contexts. Tree-SAM thus offers an automated framework for large-scale urban tree mapping and supplies reliable data support for urban forest management, carbon stock estimation, and ecological assessment.

Keywords:

high-resolution imagery; individual tree detection; Segment Anything Model (SAM); Ladder-Side Tuning (LST); city-wide tree mapping

1. Introduction

As a vital component of urban ecological infrastructure, urban forests provide a wide array of ecosystem services, including carbon storage and sequestration, climate regulation, and the mitigation of urban heat island effects [1,2,3,4]. However, conventional approaches to individual tree delineation in urban areas are often constrained by high labor intensity, substantial financial costs, and susceptibility to human-induced errors [5,6]. With the continued advancement of satellite-based remote sensing technologies, it has become possible to acquire large-scale, spatially consistent observations of tree canopy structures [7,8]. In particular, high-resolution optical imagery offers significant advantages in capturing fine-scale spatial details, enabling robust and scalable identification of individual tree crowns. This provides a reliable data foundation for large-scale tree mapping and spatial pattern analysis at the city level.

An increasing number of studies have explored the use of high-resolution RGB imagery for individual tree detection [9,10,11]. Various tailored techniques, such as constrained 2D bin packing [12], competitive region growing [13], and marked point processes [14], have demonstrated strong quantitative performance in diverse forest scenes. In parallel, several large-scale studies have proven the feasibility of regional or continental mapping, successfully delineating billions of trees in dryland ecosystems [15] and millions of trees across metropolitan areas [16]. Despite these advances, individual tree detection at the regional scale remains a significant challenge, because a single region often spans diverse scenarios such as urban, mixed, and forest scenes. This difficulty arises from the high structural variability and semantic ambiguity introduced by visually cluttered backgrounds containing man-made or other natural elements. These factors often lead to blurred object boundaries, inconsistent spatial scales, and degraded model generalization across diverse urban scenes and datasets.

In response to the aforementioned challenges, diverse approaches have been proposed to improve individual tree detection in diverse scenarios. These methods can be broadly categorized into three main classes:

(1): Convolutional neural network (CNN)-based individual tree detection methods [17,18,19]. Early convolution-based frameworks often combined deep learning with traditional algorithms (e.g., watershed segmentation) [20] or used standard object detectors (e.g., Faster R-CNN, YOLO) [21] to delineate irregular tree crowns [10]. Despite their effectiveness in tree-covered regions, CNN-based segmentation frameworks are inherently limited in modeling individual instances [22], particularly in dense urban forests where tree crowns are interspersed with complex, highly heterogeneous backgrounds [23]. Such limitations typically necessitate elaborate post-processing to achieve object-level separation. Mask R-CNN introduced multi-scale feature representations and instance-aware mechanisms, significantly improving detection performance for individual trees [24]. This has led to specialized models such as Detectree2 for tropical forests [25] and DeepForest for diverse geographical datasets [26,27]. However, despite these improvements, Mask R-CNN still relies on convolutional backbones, which are limited in capturing long-range dependencies and scene-level semantics. This reduces robustness in complex urban contexts where background interference is prominent and object boundaries are ambiguous [22,28].
(2): Transformer and graph-based individual tree detection methods. Transformers model long-range dependencies and global semantics through self-attention mechanisms, enabling dynamic attention across the entire image [29,30,31] and demonstrating strong structural awareness in complex urban forest scenes [32,33]. Various Transformer architectures, such as Swin Transformer [34], DeiT [35], and SegFormer [36], have been successfully adapted for large-area tree mapping and classification. Furthermore, Graph Convolutional Networks (GCNs) have shown strong capabilities in modeling complex spatial relationships and processing multimodal data, such as integrating UAV-based multispectral images and LiDAR point clouds for urban tree species classification [37]. However, owing to their lack of spatial locality and translation equivariance, Transformers are less effective in preserving spatial continuity [38]. During early encoding stages, they tend to lose fine-grained textures and boundary cues, which reduces their capacity to represent small-scale tree canopies. This limitation becomes especially pronounced when segmenting small or indistinct tree crowns, where the model produces diminished responses and weakened instance-level segmentation performance. Both CNN- and Transformer-based methods typically require large amounts of labeled data and exhibit limited generalization across diverse urban scenes [39].
(3): Foundation-model-based individual tree detection methods. The emergence of the Segment Anything Model (SAM) [40], a large-scale pretrained vision model, has introduced new opportunities for urban individual tree detection [41,42,43]. With its powerful global feature modeling and strong cross-task generalization capabilities, SAM has shown notable potential in complex background perception and global feature extraction, and has been extensively explored across various downstream visual tasks [44,45,46,47,48] and remote sensing applications [49]. Within the remote sensing domain, SAM exhibits remarkable capability in image segmentation and large-scale mapping, consistently attaining state-of-the-art performance across diverse downstream applications [50,51,52,53]. However, due to domain shift in the training data and the absence of task-specific supervision, out-of-the-box SAM suffers notable performance degradation when directly applied to urban individual tree detection. To address this, recent studies have investigated prompt-based solutions, such as utilizing bounding boxes from specialized detectors or generating tree-center heatmaps for crown segmentation [43]. However, prompt-based methods require pre-localizing tree centers or bounding boxes to invoke SAM, i.e., a detect-then-segment cascade, where prompt misplacement or omission often yields over-extended masks (covering extra background or neighboring objects) or under-segmented masks, particularly in scenarios involving complex crown boundaries, multi-scale canopy structures, and small object recognition [54]. Moreover, their performance varies markedly across biomes, for example, between plantation and natural forests, and among boreal, temperate, and tropical settings [55].

As recent evidence indicates that Ladder-Side Tuning (LST) of a foundation model is a reliable way to enhance SAM for specific tasks [56,57], we adopt LST to efficiently adapt SAM and propose a framework named Tree-SAM for city-scale individual tree crown detection under diverse scenarios. To address the above limitations, Tree-SAM introduces (i) a Cross-Correlation Feature Backbone (CCFB) that fuses SAM’s global tokens with low-level boundary details, curbing over-extended masks; (ii) a Hierarchical Instance Aggregation Neck (HIAN) that integrates a standard FPN–RPN pipeline to reorganize CCFB’s multi-stage features into a scale-aware pyramid and proposal stream, improving instance separation for closely spaced heterogeneous crowns and reducing under-segmentation; and (iii) a Context-Aware Adaptation Head (CAAH) that dynamically conditions the decoder for cross-scene robustness. Collectively, these components address boundary overshoot, dense-crown instance fusion, and scene variability, enabling reliable, city-wide individual tree mapping in heterogeneous environments.

2. Materials and Methods

2.1. Data Sources

We employ three datasets for model development and evaluation: GZ-Tree Crown, BAMFORESTS [58], and SZ-Dataset. GZ-Tree Crown is used for primary model training and testing. To assess the model’s generalization across varied urban scenarios, BAMFORESTS and SZ-Dataset are utilized, representing distinct geographic regions and structural complexities. BAMFORESTS facilitates the analysis of domain shift in structured urban forests, while the SZ-Dataset features diverse and complex metropolitan landscapes. Due to its available training annotations, BAMFORESTS also supports domain adaptation analysis.

2.1.1. Dataset 1: GZ-Tree Crown

High-resolution optical remote sensing imagery of Guangzhou City was acquired between September 2017 and January 2019 using a digital RGB camera mounted on a Yun-5 turboprop aircraft, under clear-sky conditions. The images have a ground sampling distance (GSD) of approximately 0.1 m. This dataset covers three representative scenarios: forest, mixed, and urban. To ensure ecological and morphological diversity, tree samples were collected from residential neighborhoods, urban parks, farmlands, and forested areas, encompassing diverse conditions for city-scale individual tree detection tasks. Image patches were randomly extracted from these heterogeneous scenes to form a representative sample of Guangzhou’s complex urban environment. Because most trees in Guangzhou remain evergreen, seasonal variability was excluded as a factor during sample selection. The final training dataset comprises 477 image patches, each with a spatial dimension of 1024 × 1024 pixels, while the test set includes an additional 51 patches. Each tree crown instance was manually annotated for instance segmentation through visual interpretation using ArcGIS 10.4.1 and LabelMe 3.16.7.

2.1.2. Dataset 2: BAMFORESTS

BAMFORESTS (Bamberg Benchmark Forest Dataset of Individual Tree Crowns) is a publicly available high-resolution RGB dataset designed for instance segmentation of individual tree crowns. Collected in the Bamberg region of southern Germany, it contains four distinct acquisition sites covering various temperate vegetation types and forest structures. The UAV-acquired imagery maintains consistent daylight conditions with a GSD of 5 cm to 7 cm. All tree crowns were manually annotated with instance-level polygon masks and provided in COCO-compatible JSON format. The dataset includes a total of 3949 image tiles and 4089 labeled tree instances, representing a diverse mix of deciduous and coniferous species.

2.1.3. Dataset 3: SZ-Dataset

This dataset consists of high-resolution RGB remote sensing imagery collected from Shenzhen, China, using Google Earth. The imagery has a GSD of approximately 0.15 m and covers a variety of complex urban scenarios, including residential blocks, road networks, urban parks, forest patches, and mixed-use landscapes. To ensure annotation consistency with the other datasets, tree crowns were manually labeled with instance-level segmentation masks using visual interpretation. This dataset is utilized exclusively for testing.

2.2. Method

2.2.1. Network Structure

Tree-SAM is developed based on a pre-trained vision foundation model and enhanced with adaptive attention mechanisms. As illustrated in Figure 1, the architecture comprises three main components. (1) CCFB: This module encodes both general semantic information and fine-grained structural details by integrating a frozen ViT-based SAM encoder with a novel Cross-Ladder-Side Attention (CLSA) mechanism. The CLSA facilitates gated, stage-wise fusion between SAM-derived features and lightweight convolutional branches, thereby enhancing crown boundary delineation to curb over-extended masks and reduce background confusion. (2) HIAN: This component integrates a Feature Pyramid Network (FPN) [59] and a Region Proposal Network (RPN) [60] to support multi-scale feature aggregation and instance-level prediction, particularly for closely spaced heterogeneous crowns. It maintains spatial consistency across hierarchical feature representations and enhances the delineation of tree crowns with varying sizes and structural complexities through a unified multi-level aggregation strategy. (3) CAAH: This module integrates a RoI-based segmentation head with a scenario adapter head that dynamically calibrates segmentation thresholds and crown overlap constraints according to varying scene characteristics. Such adaptive adjustment improves generalization across regionally heterogeneous remote sensing scenes.

2.2.2. Cross-Correlation Feature Backbone

CCFB is designed by fusing SAM’s global token features with a CNN-based local stream providing texture cues that facilitate boundary delineation, thereby improving adherence to true crown contours and limiting spillover into background or neighboring objects. CCFB follows the LST principle [57] for parameter-efficient adaptation of a large pre-trained model. In this configuration, the frozen SAM encoder acts as the primary pre-trained backbone, while the lightweight convolutional branch functions as the parallel “ladder-side” network, with the CLSA modules performing the essential lateral feature fusion at each encoder stage.

(1) SAM-based General Feature Encoder: In complex remote sensing scenarios, individual trees are frequently affected by spectrally similar backgrounds such as shrubs, grass, or shadows. Under these conditions, the General Feature Encoder Module aims to distinguish trees from complex backgrounds. This module is built upon the recently proposed vision foundation model—Segment Anything Model (SAM). The Vision Transformer (ViT) encoder in SAM possesses powerful multi-scale semantic representation capabilities and can generate structure-sensitive high-level features. This effectively compensates for the limited ability of lightweight CNNs or shallow Transformers to capture abstract semantics, particularly in scenarios where trees exhibit high spectral similarity with the background or under complex illumination conditions. In our architecture, we adopt the frozen image encoder component of SAM as the general feature extraction branch, utilizing its strong semantic generalization capabilities trained on over 1 billion masks and 11 million images. The ViT architecture is pretrained using a Masked Autoencoder (MAE) strategy, which enables the model to reconstruct input content from partial observations and learn robust visual priors.

In SAM, each input image $I \in \mathbb{R}^{H \times W \times 3}$ is first partitioned into non-overlapping square patches of size $P \times P$, flattened, and linearly projected to form a sequence of patch embeddings:

$$X_{0} = \mathrm{Proj}(\mathrm{Patchify}(I)) + E_{pos}, \quad X_{0} \in \mathbb{R}^{N \times D} \tag{1}$$

where $N = \frac{H \times W}{P^{2}}$ is the number of patches; $D$ is the embedding dimension; and $E_{pos}$ is the positional embedding that retains spatial information.

The embedded sequence $X_{0}$ is then processed through $L$ stacked Transformer blocks, each comprising a multi-head self-attention (MSA) layer followed by a feedforward network (FFN). The transformation at the $l$-th layer is defined as follows:

$$X_{l} = \mathrm{FFN}(\mathrm{MSA}(X_{l-1})) + X_{l-1}, \quad l = 1, \dots, L \tag{2}$$

The final output $X_{L}$ is reshaped and interpolated into a set of multiscale spatial feature maps:

$$\{X_{SAM}^{1}, X_{SAM}^{2}, X_{SAM}^{3}, X_{SAM}^{4}\} \tag{3}$$

Each $X_{SAM}^{i}$ corresponds to a hierarchical stage of the encoder, representing progressively higher-level and more abstract semantic features as the spatial resolution decreases. Importantly, the SAM encoder remains frozen during training to preserve its generalized representation power and to reduce training cost and overfitting risks when applied to domain-specific remote sensing imagery.
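As a small worked example of the patch count in Equation (1), the sketch below computes N for a 1024 × 1024 input, the patch size used for the GZ-Tree Crown samples. The value P = 16 is an assumption taken from the public SAM ViT configuration rather than from this section, and `num_patches` is a hypothetical helper name.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping P x P patches, N = (H * W) / P^2."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Assumed values: 1024 x 1024 input and P = 16 (not stated in this section).
n = num_patches(1024, 1024, 16)
grid_side = 1024 // 16
print(f"N = {n} tokens, arranged as a {grid_side} x {grid_side} grid")
```

Under these assumptions the token sequence is short enough to be reshaped back into a square grid, which is what makes the later reshaping into spatial feature maps possible.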

(2) Cross-Ladder-Side Attention Module (CLSA): The overall structure of the CLSA module is illustrated in Figure 2. By integrating global and local features in a stage-wise, scale-adaptive manner, the CLSA module curbs over-extended masks in individual tree crown delineation. The CLSA module operates in parallel with the SAM encoder and is applied at each encoder stage. It receives two spatially aligned feature maps: the global feature map $X_{sam}$ extracted from the frozen SAM encoder, and the local feature map $X_{add}$ computed from a residual convolutional block applied to the original input image. The global SAM-derived features provide context-aware semantic priors, enabling the model to distinguish trees from backgrounds. The local convolutional features retain fine-grained spatial details, which are essential for accurately delineating tree crowns.

The SAM encoder (ViT) provides frozen, global semantic features, while the local convolutional branch captures fine-grained structural details. However, directly fusing these heterogeneous features may introduce semantic conflicts or noise due to their differing abstraction levels. This challenge is especially pronounced in individual tree crown delineation, where accurate segmentation requires both high-level context and precise boundary cues. In dense or visually complex scenes, improper fusion can lead to missed or merged tree instances, highlighting the need for a carefully designed integration mechanism. The gated fusion mechanism (α) allows the network to dynamically emphasize global or local information depending on the scene complexity, ensuring flexible adaptation to varying tree crown shapes and sizes.

$$X_{sam} = \mathrm{SAM}_{encoder}(x) \in \mathbb{R}^{H \times W \times C} \tag{4}$$

$$X_{add} = \mathrm{Additional}_{encoder}(x) \in \mathbb{R}^{H \times W \times C} \tag{5}$$

$$X = \alpha \cdot X_{sam} + (1 - \alpha) \cdot X_{add} \tag{6}$$

This fused tensor $X$ is first passed through a 1 × 1 convolution to unify the channel dimension. It is then processed in parallel by four depth-wise separable convolution layers with kernel sizes of 1 × 1, 3 × 3, 5 × 5, and 7 × 7, enabling the module to capture spatial features at multiple receptive field sizes. The use of multi-receptive-field convolutions further enhances the ability to model trees of different scales and structural complexities within a unified feature space.

$$X_{multi} = \sum_{k \in \{1, 3, 5, 7\}} \mathrm{DWConv}_{k \times k}(X) \tag{7}$$

The CLSA module fuses global semantics and local spatial cues through gated cross-branch fusion and scale-adaptive convolutions. Multi-scale convolutional outputs are aggregated and refined to form stage-wise embeddings, which are passed to the decoder. This design compensates for spatial imprecision in token-based representations while maintaining computational efficiency and alignment with the encoder hierarchy, ultimately enhancing tree crown instance segmentation by preserving both semantic abstraction and spatial detail.
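The gated fusion of Equation (6) and the multi-kernel sum of Equation (7) can be sketched for a single channel in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: α is a fixed scalar here (in the model it is produced by the gating mechanism), the kernels are placeholder identity kernels standing in for learned weights, only the depth-wise part of the depth-wise separable convolution is shown, and all function names are hypothetical.

```python
def gated_fusion(x_sam, x_add, alpha):
    """Eq. (6) applied element-wise to two equally sized 2D maps."""
    return [[alpha * s + (1.0 - alpha) * a for s, a in zip(row_s, row_a)]
            for row_s, row_a in zip(x_sam, x_add)]


def conv_same(x, kernel):
    """Single-channel k x k cross-correlation, zero padding, stride 1."""
    h, w, k = len(x), len(x[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, z = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= z < w:
                        acc += kernel[di][dj] * x[y][z]
            out[i][j] = acc
    return out


def multi_scale_sum(x, kernels):
    """Eq. (7): element-wise sum of the branch outputs for k in {1, 3, 5, 7}."""
    outs = [conv_same(x, kernels[k]) for k in (1, 3, 5, 7)]
    return [[sum(o[i][j] for o in outs) for j in range(len(x[0]))]
            for i in range(len(x))]


def identity_kernel(k):
    """Placeholder kernel (1 at the centre); real weights would be learned."""
    ker = [[0.0] * k for _ in range(k)]
    ker[k // 2][k // 2] = 1.0
    return ker


x_sam = [[1.0] * 4 for _ in range(4)]   # toy "global" feature map
x_add = [[0.0] * 4 for _ in range(4)]   # toy "local" feature map
fused = gated_fusion(x_sam, x_add, alpha=0.25)
kernels = {k: identity_kernel(k) for k in (1, 3, 5, 7)}
x_multi = multi_scale_sum(fused, kernels)
```

Because depth-wise convolution treats every channel independently, the single-channel case already shows how the four receptive fields contribute to one summed response.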

2.2.3. Hierarchical Instance Aggregation Neck

The Hierarchical Instance Aggregation (HIAN) module serves as a task-oriented integration neck between the foundation-model encoding and instance-level tree crown segmentation. It receives the CCFB hybrid feature stream (SAM semantics enriched with Ladder-Side–tuned spatial details) and uses a standard FPN to convert the multi-stage features into a multi-resolution pyramid. It plays a pivotal role in enabling multi-scale tree crown delineation, especially for closely spaced heterogeneous crowns.

(1) Multi-Scale Feature Aggregation Module: In real-world remote sensing imagery, trees exhibit substantial variability in size, shape, and canopy structure. Regionally, trees in urban areas may appear as big, isolated crowns, while trees in forests can form overlapping clusters. To handle this diversity, HIAN leverages the FPN to aggregate features across multiple semantic levels, which fuses the CLSA-enhanced outputs from four encoder stages into a unified multiscale representation:

$$F_{i} = \mathcal{G}(X^{(i)}), \quad i = 1, 2, 3, 4 \tag{8}$$

where $X^{(i)}$ denotes the output of the $i$-th encoder stage, and $\mathcal{G}(\cdot)$ represents the top-down fusion and lateral connection operations in FPN. The resulting feature hierarchy $\{F_{i}\}$ maintains fine-grained spatial information across different resolutions. This multi-level representation ensures that small trees are preserved and enhanced in higher-resolution layers, large or complex crowns are semantically reinforced in deeper layers, and contextual coherence is maintained across scales, avoiding information fragmentation.

(2) Instance-Level Localization Module: The embedded RPN generates candidate object regions by learning to localize potential tree crowns in a scale-invariant manner. This is particularly crucial in contiguous canopy environments or heterogeneous landscapes, where closely packed trees require fine-grained spatial discrimination, and under varying illumination and sensor conditions. The RPN is applied on each feature level to generate candidate object regions (tree crowns). It operates by sliding a small network over the feature map and predicting two outputs for each anchor box $a_{j}$: an objectness score $s_{j} \in [0, 1]$, indicating the likelihood that the anchor covers a valid object, and a bounding box regression vector $\Delta b_{j} \in \mathbb{R}^{4}$, encoding the coordinate offsets that refine the anchor box. Together, these outputs help identify potential tree crowns regardless of size or shape, which is crucial in contiguous or cluttered canopy scenes. They are computed as follows:

$$s_{j} = \sigma(w_{s}^{\top} \phi(a_{j})), \quad \Delta b_{j} = W_{r} \phi(a_{j}) \tag{9}$$

where $\phi(a_{j})$ is the feature vector extracted at anchor $a_{j}$; $w_{s}$ is a trainable vector for objectness classification; $W_{r}$ is a trainable matrix for bounding box regression; and $\sigma(\cdot)$ is the sigmoid activation function. After scoring all anchors, non-maximum suppression (NMS) is applied to filter overlapping boxes and retain the top-k proposals $\{p_{k}\}$, which are forwarded to the RoI head for further classification and segmentation.
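The scoring and filtering steps described above can be sketched as follows. This is a plain greedy NMS for illustration, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form; the IoU threshold of 0.7 and the cap of 1000 proposals are hypothetical defaults, not values reported in the paper, and the function names are illustrative.

```python
import math


def objectness(w_s, phi):
    """s_j = sigmoid(w_s^T phi(a_j)) for one anchor feature vector."""
    z = sum(w * f for w, f in zip(w_s, phi))
    return 1.0 / (1.0 + math.exp(-z))


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms_top_k(boxes, scores, iou_thr=0.7, k=1000):
    """Greedy NMS: keep highest-scoring boxes, drop strong overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
        if len(keep) == k:
            break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms_top_k(boxes, scores, iou_thr=0.5)
```

In this toy case the second box overlaps the first with an IoU of about 0.68, so it is suppressed at a threshold of 0.5 but would survive at 0.7, which illustrates how the threshold controls separation of closely spaced crowns.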

2.2.4. Context-Aware Adaptation Head

The term Context-Aware Adaptation emphasizes that the module not only performs standard prediction tasks such as classification and segmentation but also adapts its behavior to different scene contexts. Because scenes are highly heterogeneous within a region, we derive the scene type from the 2019 ESRI 10 m Land Cover dataset (https://livingatlas.arcgis.com/landcover/, accessed on 20 September 2025). Each georeferenced 1024 × 1024 patch is aligned to the LULC coordinate system, the LULC data are clipped to the patch footprint, and category proportions are computed. The scene type is then assigned using pre-defined rules: Urban if the proportion of “Built Area” pixels > 50%; Forest if the proportion of “Trees” pixels > 50%; otherwise, Mixed. The resulting label is used to select the corresponding CAAH refinement configuration. Regarding the thresholds used in CAAH, we provide a default setting selected based on the best performance on a held-out validation subset of the training data. These thresholds are configurable and can be lightly recalibrated using a small validation subset when transferring to a new region or imagery source, if desired.
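The scene-type rule stated above can be written as a short function. The sketch assumes that the clipped LULC pixels are already available as a flat list of class names; the reprojection and clipping steps are omitted, and `class_fractions` and `scene_type` are hypothetical helper names.

```python
from collections import Counter


def class_fractions(lulc_pixels):
    """Proportion of each land-cover class among the clipped LULC pixels."""
    counts = Counter(lulc_pixels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def scene_type(fractions):
    """Urban if 'Built Area' > 50%, Forest if 'Trees' > 50%, else Mixed."""
    if fractions.get("Built Area", 0.0) > 0.5:
        return "Urban"
    if fractions.get("Trees", 0.0) > 0.5:
        return "Forest"
    return "Mixed"


pixels = ["Built Area"] * 6 + ["Trees"] * 3 + ["Water"]
label = scene_type(class_fractions(pixels))
```

Since both conditions use a strict 50% threshold, at most one of them can hold for a given patch, so the order of the two checks does not change the result.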

(1) RoI Head: Each proposal $p_{k}$ is mapped to the multiscale feature maps using RoIAlign to obtain a fixed-size feature representation, which ensures spatial alignment between the proposal and feature grid, preserving fine spatial details.

$$\psi(p_{k}) = \mathrm{RoIAlign}(\{F_{i}\}, p_{k}) \tag{10}$$

The aligned feature tensor $\psi(p_{k})$ is passed through three parallel prediction heads: classification, bounding box refinement, and instance mask generation. Each predicted tuple $\{c_{k}, \hat{b}_{k}, M_{k}\}$ encodes the semantic label, geometric extent, and binary crown mask of the k-th tree instance. This structure enables precise instance-level localization and segmentation beyond the spatial constraints of anchor boxes, which is particularly important for delineating irregularly shaped, elliptical, or partially occluded tree crowns in complex remote sensing imagery.

(2) Scenario Adapter Head: To enhance post-prediction consistency across diverse urban and vegetated environments, we introduce a Scenario Adapter Head that applies scene-specific refinement strategies during inference. This component incorporates two complementary mechanisms, confidence score thresholding and canopy overlay adjustment, which are tailored to different landscape contexts, namely urban, forest, and mixed-area scenarios.

In urban areas, a higher confidence threshold is enforced to suppress false positives caused by buildings and roads, which often resemble tree crowns in appearance. Moreover, fragmented crown predictions are merged to reflect the typically well-pruned and spatially isolated structure of urban trees, where individual crowns tend to be compact and clearly separated due to regular maintenance. In contrast, forest areas adopt a lower threshold to retain smaller or partially occluded tree crown predictions. In these contiguous vegetated environments, large overlapping crowns are deliberately split into finer segments, improving the delineation of individual tree crowns in multilayered canopies with high crown density (Figure 3). In mixed or transitional areas, which often exhibit high spatial and structural heterogeneity, no explicit threshold tuning or overlay adjustment is applied. This design choice avoids overfitting scene-specific heuristics and preserves model generalization and computational efficiency across varied landscapes.
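The scene-specific behavior described above can be pictured as a small lookup table that selects a confidence threshold and an overlay action per scene. Every numeric value below is hypothetical and chosen only for illustration; this section does not report the thresholds used, and the merge and split operations themselves are not sketched.

```python
# Hypothetical configuration: the numbers are illustrative placeholders,
# not the thresholds used in the paper.
SCENE_CONFIG = {
    "Urban":  {"score_thr": 0.7, "overlay": "merge"},  # stricter, merge fragments
    "Forest": {"score_thr": 0.3, "overlay": "split"},  # looser, split large crowns
    "Mixed":  {"score_thr": 0.5, "overlay": None},     # default, no adjustment
}


def refine(predictions, scene):
    """Filter predictions by the scene's threshold and report the overlay action."""
    cfg = SCENE_CONFIG[scene]
    kept = [p for p in predictions if p["score"] >= cfg["score_thr"]]
    return kept, cfg["overlay"]


preds = [{"id": 0, "score": 0.4}, {"id": 1, "score": 0.6}, {"id": 2, "score": 0.8}]
urban_kept, urban_action = refine(preds, "Urban")
forest_kept, forest_action = refine(preds, "Forest")
```

Keeping the configuration in one table makes it straightforward to recalibrate the values on a small validation subset when moving to a new region or imagery source.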

2.3. Metrics

To quantitatively evaluate the performance of the proposed model in urban individual tree crown detection, we adopt a set of widely used metrics, including Precision, Recall, F1-score, Detection IoU (DIoU), and Average Precision at IoU = 0.5 (AP@50). These metrics comprehensively assess the detection quality from the perspectives of localization accuracy, completeness, and overall robustness under different confidence thresholds.

2.3.1. Intersection over Union (IoU) and Matching Criterion

Intersection over Union (IoU) is a standard metric used to measure the spatial overlap between a predicted object and the ground truth. It is computed as follows:

$$\mathrm{IoU} = \frac{A \cap B}{A \cup B} \tag{11}$$

where A is the predicted tree crown area, and B is the ground truth crown. Following the criterion used in Detectree2, we adopt an IoU threshold of 0.5 to determine whether a prediction is considered a True Positive (TP). Specifically, as illustrated in Figure 4, a predicted crown is considered a TP if its IoU with any ground truth crown is ≥0.5. Predictions with IoU < 0.5 are counted as False Positives (FP), and unmatched ground truth instances are regarded as False Negatives (FN).
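The matching criterion can be sketched with crown masks represented as sets of pixel coordinates. The one-to-one greedy assignment, in which each ground-truth crown can be matched at most once and predictions are processed in the order given, is an assumption of this sketch; the text only specifies the 0.5 threshold.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def count_matches(preds, gts, thr=0.5):
    """Return (TP, FP, FN) using greedy one-to-one matching at IoU >= thr."""
    matched = set()
    tp = fp = 0
    for p in preds:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = mask_iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j is not None and best_iou >= thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(matched)
    return tp, fp, fn


gt_a = {(r, c) for r in range(4) for c in range(4)}        # 4 x 4 crown
gt_b = {(20, 20), (20, 21)}                                # missed crown
pred_a = {(r, c) for r in range(4) for c in range(1, 5)}   # shifted by 1 px
pred_b = {(10, 10)}                                        # spurious detection
tp, fp, fn = count_matches([pred_a, pred_b], [gt_a, gt_b])
```

Here `pred_a` overlaps `gt_a` in 12 of 20 union pixels (IoU 0.6) and counts as a TP, `pred_b` is an FP, and `gt_b` remains unmatched as an FN.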

2.3.2. Precision, Recall, and Detection IoU

Based on the TP, FP, and FN counts, we compute the following metrics:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{12}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{13}$$

$$\mathrm{Detection\ IoU} = \frac{TP}{TP + FP + FN} \tag{14}$$

Precision measures the accuracy of the predicted crowns, indicating the proportion of correctly predicted trees among all predictions. Recall evaluates the completeness of the detection, reflecting how many ground truth instances are successfully retrieved. We adopt Detection IoU (DIoU) to measure the overall hit rate of the detection results.
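A short worked example makes the relation between Equations (12) to (14) concrete. The counts used below are made-up numbers for illustration only, and `detection_metrics` is a hypothetical helper name.

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall, and Detection IoU from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    diou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, diou


# Illustrative counts only: 80 correct crowns, 20 spurious, 20 missed.
precision, recall, diou = detection_metrics(tp=80, fp=20, fn=20)
print(f"P = {precision:.3f}, R = {recall:.3f}, DIoU = {diou:.3f}")
```

Because DIoU penalizes false positives and false negatives in the same denominator, it is never larger than either Precision or Recall.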

2.3.3. F1-Score, AP@50, and mAP

In addition to the above standard metrics, F1-score balances precision and recall, which is especially useful in the case of imbalanced TP/FP/FN distributions. To evaluate the detection performance, we utilize the Average Precision (AP). We fix the IoU threshold at 0.5 and compute the interpolated AP (AP@50) by integrating over varying confidence thresholds as follows:

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{15}$$

$$\mathrm{AP@50} = \int_{0}^{1} \mathrm{Precision}(r) \, dr \quad (\text{at IoU} = 0.5) \tag{16}$$

Furthermore, to assess the model’s localization robustness under stricter matching criteria, the Mean Average Precision (mAP) is computed by averaging the AP scores evaluated at 10 uniformly distributed IoU thresholds (ranging from 0.50 to 0.95 with a step size of 0.05):

$$\mathrm{mAP} = \frac{1}{10} \sum_{\mathrm{IoU} = 0.50}^{0.95} AP(\mathrm{IoU}), \quad \Delta \mathrm{IoU} = 0.05 \tag{17}$$
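Equations (15) to (17) can be sketched numerically as follows. The all-point interpolation used to approximate the integral in Equation (16) is an assumption of this sketch, since the exact interpolation scheme is not specified here, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, Eq. (15)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def average_precision(scores, is_tp, n_gt):
    """Area under the interpolated precision-recall curve at one IoU threshold."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # Interpolation: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_ap(ap_per_threshold):
    """Eq. (17): mean of the AP values over the listed IoU thresholds."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


# The ten IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Toy example: three predictions, two ground-truth crowns.
ap50 = average_precision([0.9, 0.8, 0.7], [True, False, True], n_gt=2)
```

In the toy example the interpolated curve has precision 1 up to recall 0.5 and precision 2/3 up to recall 1, giving an AP of 5/6.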

3. Results

3.1. Implementation Details

Our method was implemented in PyTorch, based on MMDetection 3.3.0. All input images were resized to a fixed spatial resolution of 1024 × 1024 pixels. Following standard practices, we applied random rotation and horizontal flipping for data augmentation to improve generalization. The ViT-B variant of SAM was employed as a frozen General Feature Encoder, which provides high-level semantic priors learned from large-scale natural image datasets. To reduce the risk of overfitting and limit computational overhead, the SAM encoder was not fine-tuned during training. Instead, optimization was performed only on the other three modules: the CLSA fusion, HIAN instance segmentation, and CAAH adaptation. The network was trained using a learning rate of 0.001, with a linear warm-up strategy applied over the first 250 iterations to stabilize early-stage optimization. We employed the Stochastic Gradient Descent (SGD) optimizer with momentum set to 0.9 and weight decay of 1 × 10⁻⁴. All experiments were conducted on a workstation equipped with a single NVIDIA RTX 3080 GPU. All models were trained using a unified 36-epoch schedule, with an early stopping strategy based on mAP to prevent overfitting.
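The linear warm-up described above can be sketched as a plain function of the iteration index. The base learning rate (0.001) and warm-up length (250 iterations) come from the text; the starting factor of 0.001 is an assumed value, and the sketch does not model any later learning-rate decay.

```python
def warmup_lr(iteration, base_lr=0.001, warmup_iters=250, start_factor=0.001):
    """Learning rate under linear warm-up; start_factor is an assumed value."""
    if iteration >= warmup_iters:
        return base_lr
    progress = iteration / warmup_iters
    return base_lr * (start_factor + (1.0 - start_factor) * progress)


for it in (0, 125, 250, 1000):
    print(it, warmup_lr(it))
```

The rate grows linearly from `base_lr * start_factor` at iteration 0 to the full base rate at iteration 250 and stays constant afterwards in this simplified view.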

3.2. Ablation Study

To evaluate the contribution of each component in the Tree-SAM framework, we perform ablation experiments by sequentially removing each proposed module from the full model. These experiments are conducted under two representative scenarios: Urban and Forest, while the detailed ablation results evaluating the foundational CCFB and HIAN modules under the Mixed scenario are provided in the Supplementary Information (Table S3). The SAM-B [40] model from Meta is adopted as the General Feature Encoder, using uniformly distributed grid-based point prompts to perform global segmentation. Swin-Tiny [61] and ResNeXt-101 [62] serve as side-branch backbones within the CCFB module, participating in cross-level fusion with SAM features through the CLSA design. Table 1 and Table 2 report the ablation results based on five evaluation metrics: Precision (P), Recall (R), Detection IoU (DIoU), F1-score (F1), and Average Precision (AP@50), where AP is evaluated at an IoU threshold of 0.5.

In both Urban and Forest scenarios, adding the CCFB and HIAN modules yields significant improvements over the baselines. Incorporating the CAAH module further enhances precision and recall while markedly improving AP@50, demonstrating the benefit of scenario-adaptive calibration across different urban contexts. Under the Urban scenario, the F1-score of SAM+ResNeXt increases from 0.583 to 0.786 (+20.3%), with AP@50 rising from 0.295 to 0.526; adding CAAH further boosts F1 to 0.830. A similar trend is observed with the Swin backbone, where Tree-SAM (Swin) achieves an F1-score of 0.771 after adding CAAH, outperforming the Swin baseline by +30.4%, with AP@50 also improved to 0.474. In the Forest scenario, SAM+ResNeXt improves its F1-score from 0.535 to 0.711 (+17.6%), with AP@50 increasing from 0.363 to 0.433; adding CAAH further raises F1 to 0.762 and AP@50 to 0.478. Tree-SAM (Swin) also benefits significantly in the Forest setting, achieving an F1-score of 0.736 (+24.3%) with AP@50 rising from 0.344 to 0.428.
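The gains quoted for the ResNeXt variants can be checked directly from the reported F1-scores. The short sketch below recomputes them and shows that the quoted figures correspond to absolute differences in F1 (percentage points), not relative improvements over the baseline. Only the F1 values stated in the text are used; the dictionary layout and function name are illustrative.

```python
# (baseline F1, F1 with CCFB + HIAN, F1 with CAAH added), as quoted in the text
F1_RESNEXT = {
    "urban": (0.583, 0.786, 0.830),
    "forest": (0.535, 0.711, 0.762),
}


def gains(baseline: float, improved: float) -> tuple:
    """Return the (absolute, relative) gain of improved over baseline."""
    absolute = improved - baseline
    relative = absolute / baseline
    return absolute, relative


if __name__ == "__main__":
    for scene, (base, fused, full) in F1_RESNEXT.items():
        abs_fused, rel_fused = gains(base, fused)
        abs_full, rel_full = gains(base, full)
        print(f"{scene}: CCFB+HIAN +{abs_fused:.3f} absolute ({rel_fused:.1%} relative); "
              f"full model +{abs_full:.3f} absolute ({rel_full:.1%} relative)")
```

Reporting the absolute difference keeps the ablation comparable across scenarios whose baselines differ.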

Across both urban areas and forest scenes, the proposed modules consistently yield performance gains on different backbones (Swin-Tiny and ResNeXt-101). The concurrent improvements in F1 and AP@50 demonstrate the complementary nature of these modules, as well as their strong cross-architecture adaptability and multi-scenario generalization capability. Therefore, all subsequent experiments adopt SAM as the General Feature Encoder and employ ResNeXt as the trainable guiding branch within the CLSA module to construct the Tree-SAM framework.

3.3. Comparison with SOTA Methods

To assess the individual tree instance detection performance of the proposed Tree-SAM framework under different scenarios, we conduct experiments on the GZ-Tree Crown dataset, BAMFORESTS, and SZ-Dataset.

3.3.1. Model Performance in GZ-Tree Crown

We compare Tree-SAM with a set of SOTA instance segmentation models, including the classical two-stage convolutional network C-Mask R-CNN [63], the one-stage convolutional detection framework RTMDet [64], the Transformer-based Mask2Former [65], the foundation model-based SAM-DET [50], as well as the hybrid architecture TransXNet [66] and the modern CNN ConvNeXtV2 [67]. To ensure comprehensive benchmarking, one-stage convolutional instance segmentation methods YOLACT [68] and SOLOv2 [69] are also included as reference models. As summarized in Table 3, the proposed Tree-SAM achieves the highest F1-score and AP@50 in all three scenarios. Specifically, Tree-SAM attains F1/AP@50 values of 0.762/0.478 in Forest, 0.732/0.454 in Mixed, and 0.830/0.526 in Urban scenes, outperforming all compared models. These results demonstrate that the Tree-SAM framework maintains robust individual tree instance detection performance across diverse scenarios.

Figure 5 shows the visual performance of these models in different scenarios. In the urban scenario, individual tree detection faces severe background confusion due to contiguous buildings and heterogeneous ground objects. Tree-SAM achieves the best performance among all models. As illustrated in Figure 6, the red boxes indicate misclassifications caused by background confusion, while the blue boxes highlight missed detections of small-crown tree instances. CNN-based methods (e.g., C-Mask R-CNN) tend to misclassify non-vegetation areas as trees, producing a large number of false positives. This results in a precision of only 0.698 and an F1-score of 0.711, as the high false positive rate substantially lowers the overall accuracy. Transformer-based approaches (e.g., Mask2Former), while achieving a higher precision of 0.828, experience reduced recall (0.669) when tree crown boundaries are obscured by shadows or occlusions, which limits their ability to detect all instances and constrains the final F1-score (0.740). Tree-SAM attains a precision of 0.821 and a recall of 0.839, with F1/AP@50 reaching 0.830/0.526, achieving the best balance between precision and recall. This result shows that the model leverages cross-layer features from the CCFB fused through the CLSA module to enhance crown–background separation and boundary preservation, thereby improving prediction stability in complex urban scenes. The red boxes illustrate suppressed background interference, while the yellow boxes highlight better crown retention and reduced over-extended masks.
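The precision, recall, and F1 values quoted for the urban scenario can be cross-checked with the F1 definition. The sketch below recomputes F1 for Tree-SAM and Mask2Former from their reported precision and recall, and also derives the recall implied by a reported precision and F1 (useful for C-Mask R-CNN, whose urban recall is not quoted in the text). The implied value is computed from rounded figures and is therefore only approximate; the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    return f1 * precision / (2 * precision - f1)


# (precision, recall, reported F1) in the urban scenario, as quoted in the text
URBAN_RESULTS = {
    "Tree-SAM": (0.821, 0.839, 0.830),
    "Mask2Former": (0.828, 0.669, 0.740),
}

if __name__ == "__main__":
    for model, (p, r, reported) in URBAN_RESULTS.items():
        print(f"{model}: computed F1 = {f1_score(p, r):.3f}, reported F1 = {reported:.3f}")
    # C-Mask R-CNN: precision 0.698 and F1 0.711 are quoted; recall is derived.
    print(f"C-Mask R-CNN implied recall = {implied_recall(0.698, 0.711):.3f}")
```

Because F1 is a harmonic mean, Mask2Former's higher precision cannot offset its lower recall, which is why its F1 stays below that of Tree-SAM.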

In the forest scenario, individual tree detection is challenged by scale variability, the abundance of small crowns, and contiguous canopies that often lead to missed detections. As illustrated in Figure 6, the blue boxes indicate missed detections of small-crown trees, while the green boxes highlight under-segmentation issues for crowns of varying scales. Overall, the compared models exhibit generally low recall in this scenario, with Transformer-based methods such as Mask2Former reaching only 0.432, and CNN-based approaches like C-Mask R-CNN achieving higher recall (0.669) but limited F1-score (0.593) due to imbalanced precision–recall performance under multi-scale conditions. Tree-SAM achieves a recall of 0.758 in the forest scenario, with F1/AP@50 values of 0.762/0.478, significantly outperforming all models. Recall improves by more than 32% over Mask2Former and nearly +9% over C-Mask R-CNN. The HIAN module further integrates fine-grained and high-level features across scales, enabling the model to detect both small and large crowns and maintain high recall and overall segmentation accuracy for closely spaced heterogeneous crowns, reducing under-segmentation.

In contrast to forest and urban scenarios, where trees tend to follow more regular or planned spatial arrangements, the mixed scenario exhibits highly irregular and fragmented patterns of tree distribution. Tree instances in such areas often appear within fragmented green patches, roadside margins, or low-density vegetation interspersed with built structures, making crown boundaries less distinguishable. As shown in Table 3, Tree-SAM achieves the highest F1-score (0.732) and AP@50 (0.454), outperforming both CNN-based baselines, such as C-Mask R-CNN (F1 = 0.589), and Transformer-based models like Mask2Former (F1 = 0.560). Compared to SOLOv2 (F1 = 0.705), Tree-SAM still gains +2.7% in overall detection accuracy.

To further validate the overall performance and cross-scenario stability, we randomly sampled 10 test images from each of the forest, mixed, and urban scenarios and computed per-image F1-score and AP@50 values. Performance distributions of multiple representative methods are visualized in Figure 7. Tree-SAM achieves the highest median values for both metrics with the narrowest interquartile range, indicating not only superior average performance but also reduced variability and greater stability across different scenarios. CAAH modulates prediction behavior according to scene-specific characteristics, enabling the model to adaptively maintain high-quality segmentation across heterogeneous scenarios. The observed reduction in performance variance across forest, mixed, and urban scenarios confirms the intended role of CAAH in enhancing cross-scenario stability.
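The two summary statistics read off the boxplots, the median and the interquartile range (IQR), can be computed per method as sketched below. The per-image scores here are hypothetical placeholders chosen only to illustrate the calculation; they are not values from this study. The quartile convention (`method="inclusive"`) is also an assumption, since plotting libraries differ in how they interpolate quartiles.

```python
import statistics
from typing import Sequence, Tuple


def median_and_iqr(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (median, interquartile range) of per-image scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q2, q3 - q1


# Hypothetical per-image F1 values for two methods (NOT results from the paper).
scores_a = [0.70, 0.72, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80, 0.82]
scores_b = [0.40, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]

if __name__ == "__main__":
    for name, scores in (("method A", scores_a), ("method B", scores_b)):
        med, iqr = median_and_iqr(scores)
        print(f"{name}: median = {med:.3f}, IQR = {iqr:.3f}")
```

A higher median with a smaller IQR is the pattern described above: better typical performance with less image-to-image variation.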

3.3.2. Model Performance in BAMFORESTS

We also test the proposed model on BAMFORESTS. BAMFORESTS provides complete annotations and is collected entirely from forest scenarios, characterized by densely packed tree crowns and low background heterogeneity. These properties make it an ideal benchmark for assessing model performance under a specific scenario type. Two settings, domain generalization (DG) and domain adaptation (DA), are used for comparison.

DG: No BAMFORESTS labels are used during training. Models are trained on GZ-Tree Crown and directly tested on BAMFORESTS to evaluate their zero-shot generalization capability on an unlabelled target domain.

DA: GZ-Tree Crown is used as the source domain to pretrain Mask R-CNN, RTMDet, Mask2Former, and Tree-SAM. The pretrained models are then fine-tuned on the annotated BAMFORESTS data to assess their adaptability under limited target-domain supervision.

(1): Zero-Shot Cross-Domain Generalization on the BAMFORESTS Dataset

We assess the zero-shot generalization performance by training all models on GZ-Tree Crown and directly testing them on BAMFORESTS without any target-domain supervision. Table 4 summarizes their performance in terms of P, R, AC, F1, and AP@50. Compared to the source-domain performance on GZ-Tree Crown, all models show a substantial performance drop when transferred to BAMFORESTS, reflecting the significant domain gap between the two datasets. The degradation is largely attributed to differences in sensor characteristics between the source and target domains, which hinder direct feature transfer. Overall, models leveraging large-scale visual priors, such as SAM-DET and Tree-SAM, retain relatively higher F1-scores (0.248 and 0.337, respectively), indicating that foundation-model-based approaches are more resilient under cross-domain zero-shot settings.

(2): Domain Adaptation Performance on the BAMFORESTS Dataset

To evaluate the effect of limited target-domain supervision, we perform DA experiments on the BAMFORESTS dataset using a small portion of labeled target-domain images. All models are first trained on GZ-Tree Crown and subsequently fine-tuned on 10% of the BAMFORESTS training samples with a fixed schedule of 36 epochs. Input images are cropped from the original high-resolution imagery into 1024 × 1024 patches to maintain consistent spatial context. The checkpoint achieving the best mAP on the validation set is selected for evaluation. We adopt segmentation metrics to assess DA performance: mAP (mean Average Precision over IoU thresholds 0.50:0.95), mAP@50 (AP at IoU = 0.50), mAP@75 (AP at IoU = 0.75), and scale-specific AP for medium (mAP_m) and large (mAP_l) tree crowns. Table 5 summarizes the DA results under this setting. Compared to the zero-shot evaluation, all models demonstrate substantial improvements across all metrics, highlighting the effectiveness of incorporating just 10% labeled target-domain data for adapting to BAMFORESTS.
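The two bookkeeping steps of this protocol, drawing a 10% subset of the target-domain training samples and keeping the checkpoint with the best validation mAP, can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the 10% subset was drawn, so uniform random sampling with a fixed seed is assumed, and the file names and validation scores below are hypothetical placeholders.

```python
import random
from typing import Dict, List, Sequence


def sample_fraction(items: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Draw a reproducible random subset containing the given fraction of items."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * fraction))
    return rng.sample(list(items), k)


def best_checkpoint(val_map_by_epoch: Dict[int, float]) -> int:
    """Return the epoch whose checkpoint has the highest validation mAP."""
    return max(val_map_by_epoch, key=val_map_by_epoch.get)


# Hypothetical tile names and validation scores, for illustration only.
train_tiles = [f"tile_{i:04d}.tif" for i in range(200)]
subset = sample_fraction(train_tiles, 0.10)
val_map = {1: 0.21, 12: 0.39, 24: 0.44, 36: 0.43}

if __name__ == "__main__":
    print(f"fine-tuning on {len(subset)} of {len(train_tiles)} tiles")
    print(f"selected checkpoint: epoch {best_checkpoint(val_map)}")
```

Fixing the seed keeps the subset identical across the compared models, so differences in the DA results reflect the models rather than the sampled tiles.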

As shown in Table 5, Tree-SAM achieved the highest overall segmentation performance, with a mAP of 0.466, while also maintaining leading results at a high IoU threshold with a mAP@75 of 0.514. This reflects the framework’s strong capability in tree-crown boundary delineation and fine-scale crown segmentation. In terms of multi-scale crown analysis, Tree-SAM reached mAP_m and mAP_l values of 0.200 and 0.537, respectively, surpassing other models in multi-scale feature adaptation and effectively capturing the spatial characteristics of tree crowns across different size levels in the forest scene. This performance trend aligns with Tree-SAM’s design of integrating cross-scale features and general contextual information during feature extraction. Notably, this experiment was conducted with limited samples in a specific forest scene, demonstrating that Tree-SAM can maintain stable segmentation results and strong cross-domain adaptability under a single-scenario condition. These findings indicate that the framework can learn general feature representations from diverse urban scenarios while adapting quickly and maintaining accuracy in a specific scene.

The DG and DA experiments on the BAMFORESTS dataset jointly demonstrate the cross-domain robustness of Tree-SAM. Under the DG setting, where no target-domain labels are available, the large domain shift makes it difficult for conventional deep learning models to generalize to the new scene, while Tree-SAM, leveraging large-scale visual priors, maintains comparatively higher segmentation performance. In the DA setting, with limited target-domain annotations, Tree-SAM further improves both overall segmentation accuracy and multi-scale crown adaptability, indicating efficient adaptation to specific environments. These results suggest that Tree-SAM combines zero-shot generalization with rapid domain adaptation, supporting its applicability to diverse urban scenarios.

3.3.3. Model Performance in SZ-Dataset

To evaluate the model’s generalization performance in complex scenarios, we assessed Tree-SAM on the SZ-Dataset. This dataset also encompasses diverse urban, forest, and mixed scenarios, providing a rigorous benchmark to examine the model’s adaptability to heterogeneous urban environments. The experiments were conducted under a zero-shot setting: no target-domain labels were used, and all models were tested directly with source-domain weights to evaluate cross-domain generalization capability. We adopted the same evaluation metrics as for the GZ-Tree Crown dataset and validated the model across the three scenarios.

Table 6 reports the zero-shot evaluation results of Tree-SAM and other models on the SZ-Dataset, demonstrating consistent individual tree detection performance in complex urban–forest scenarios. Compared with representative models, Tree-SAM achieves higher individual tree detection accuracy and F1-score. In the Urban scene, Tree-SAM achieves a Precision of 0.480 and an F1-score of 0.609, higher than C-Mask R-CNN (0.519) and Mask2Former (0.518). With comparable Precision, the gain in F1-score indicates a better precision–recall balance in densely built-up areas with scattered crowns. In the Forest scene, Tree-SAM attains a Recall of 0.644 and an AP@50 of 0.377. Its AP@50 exceeds that of SOLOv2 (0.324) at a comparable Recall (0.644 vs. 0.658), and both values exceed those of Mask2Former (0.267/0.213). The improved AP@50 alongside competitive Recall demonstrates the model’s ability to maintain instance-level accuracy over large contiguous canopy structures. In the Mixed scene, Tree-SAM achieves Precision/Recall/F1 values of 0.490/0.761/0.597, outperforming YOLACT (0.457) and SAM-DET (0.460) with higher F1 and Recall, suggesting adaptability to heterogeneous urban–forest transition zones. Overall, Tree-SAM maintains consistent single-tree instance segmentation performance under a zero-shot setting across diverse urban–forest scenarios. Figure 8 presents qualitative examples from the SZ-Dataset, where Tree-SAM delineates individual tree crowns consistently across varied spatial configurations while preserving segmentation quality.

Building on the SZ-Dataset experiments, we established a fully automated, city-wide instance-level tree-mapping workflow for Shenzhen, enabling large-scale spatial detections and statistics of individual trees. As illustrated in Figure 9, the resulting dataset covers a total area of 1952.47 km² with approximately 11,963,806 trees. Despite substantial variations in tree counts across districts, Tree-SAM maintains stable detection performance across diverse urban scenarios. In large-scale regions such as Bao’an and Longgang, the model detected 2,273,131 and 2,499,873 trees, respectively, while in dense urban cores such as Luohu and Futian, it identified 728,883 and 459,553 trees, accurately delineating individual instances in different scenes. This citywide experiment validates the adaptability and detection stability of the Tree-SAM framework in large-scale heterogeneous urban scenarios.
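The citywide figures above imply an average detected-tree density and per-district shares that can be derived with simple arithmetic, as sketched below. Only the counts and total area quoted in the text are used; district areas are not given, so per-district densities are not computed. The variable names are illustrative.

```python
TOTAL_TREES = 11_963_806      # detected trees citywide, as quoted in the text
TOTAL_AREA_KM2 = 1952.47      # mapped area in km², as quoted in the text

# Detected trees for the four districts named in the text.
DISTRICT_TREES = {
    "Bao'an": 2_273_131,
    "Longgang": 2_499_873,
    "Luohu": 728_883,
    "Futian": 459_553,
}

# Average number of detected trees per square kilometre over the mapped area.
density_per_km2 = TOTAL_TREES / TOTAL_AREA_KM2

# Fraction of all detected trees that falls in each named district.
district_share = {name: count / TOTAL_TREES for name, count in DISTRICT_TREES.items()}

if __name__ == "__main__":
    print(f"average density: {density_per_km2:.1f} trees per km²")
    for name, share in district_share.items():
        print(f"{name}: {share:.1%} of detected trees")
```

These are counts of detected instances, so the derived density inherits whatever omission and commission errors remain in the zero-shot predictions.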

4. Discussion

Urban-scale mapping of individual tree crowns is increasingly demanded by planners and ecologists, yet remains technically challenging under realistic city conditions. In this study, Tree-SAM achieves the overall best performance across multiple datasets and heterogeneous urban scenes, consistently outperforming both CNN-based and Transformer-based baselines in terms of accuracy, robustness, and cross-domain generalization.

First, compared with single-backbone methods, Tree-SAM benefits from an explicit fusion of CNN and SAM features. In the ablation experiments on the GZ-Tree Crown dataset, integrating CCFB and HIAN on top of a ResNeXt backbone already yields substantial gains over the plain CNN baseline, and adding CAAH further improves performance. In the urban scenario, the F1-score increases from 0.583 (ResNeXt baseline) to 0.830 for Tree-SAM (ResNeXt) (+0.247), while AP@50 rises from 0.295 to 0.526 (+0.231). In the forest scenario, F1 improves from 0.535 to 0.762 (+0.227) and AP@50 from 0.363 to 0.478 (+0.115). Similar trends are observed when using Swin as the side branch. These consistent improvements over both pure CNN and SAM-based variants confirm that the cross-correlation backbone with hierarchical aggregation is an effective way to exploit the complementary strengths of SAM’s global tokens and CNN-based local textures.

Second, Tree-SAM shows clear advantages when moving from single, homogeneous scenes to large and heterogeneous urban areas. On the GZ-Tree Crown dataset, Tree-SAM attains the best performance simultaneously in forest, mixed, and urban scenarios, indicating that the multi-level fusion and Context-Aware Adaptation are effective across diverse scene types. Per-image boxplots further show that Tree-SAM not only improves median F1 and AP@50 but also narrows the performance variance across test images, indicating more stable behavior under scene changes. When deployed for city-scale mapping in Shenzhen, the framework delineates approximately 11.96 million tree instances over 1952.47 km² while maintaining consistent detection quality across central business districts, high-density residential zones, industrial parks, and peri-urban areas. This scene-level and city-level consistency provides empirical evidence that the proposed multi-level fusion and Context-Aware Adaptation modules are not overfitted to a particular scenario, but form a robust and reliable architecture for large-scale urban tree-crown delineation.

Third, Tree-SAM exhibits favorable cross-dataset and cross-region generalization, which is a key advantage of building on a vision foundation model. Under the zero-shot domain generalization setting from GZ-Tree Crown to BAMFORESTS, Tree-SAM achieves an F1-score of 0.337 and AP@50 of 0.190, outperforming the strongest competing baseline (SAM-DET, F1 = 0.248, AP@50 = 0.141). With only 10% labeled target data for domain adaptation, Tree-SAM further improves to a mAP of 0.466 and mAP@75 of 0.514 on BAMFORESTS, outperforming Mask R-CNN, RTMDet, and Mask2Former by noticeable margins and achieving the best medium- and large-crown AP. On the SZ-Dataset, evaluated in a fully zero-shot manner, Tree-SAM again attains the highest F1 and AP@50 across forest, mixed, and urban scenes. These results jointly suggest that Ladder-Side Tuning and context-aware adapters adapt SAM to the urban tree-crown task without severely overfitting to a single city or dataset, effectively leveraging the pretrained knowledge encoded in the foundation model to balance in-domain accuracy and cross-domain robustness.

Furthermore, regarding the adaptation paradigm of foundation models, we also evaluated a prompt-based cascade baseline (i.e., detector + SAM) to comprehensively benchmark our framework. As detailed in the Supplementary Information (SI, Section S1), combining hard spatial pre-prompts (e.g., bounding boxes) with our soft feature-fusion LST architecture did not yield synergistic improvements. Instead, we observed performance degradation, particularly in dense forest scenarios. This is likely because the rigid spatial constraints imposed by bounding boxes may conflict with the adaptive feature alignment process. These findings indicate that the pure LST architecture adopted by Tree-SAM provides a more robust alternative, which bypasses pre-detector prompts to rely on continuous structural feature extraction.

Our study still has two main limitations. First, the current evaluation mainly focuses on high-resolution RGB imagery collected under subtropical, predominantly leaf-on conditions (evergreen or mixed forests). To partially broaden the geographic coverage, we provide preliminary cross-city, full-coverage experiments in the Supplementary Material (Figure S1). Nevertheless, Tree-SAM has not yet been systematically validated under strong phenological and climatic shifts (e.g., leaf-off deciduous periods or snow-covered canopies), nor under multimodal settings that could provide complementary cues (e.g., LiDAR for vertical structure or multispectral imagery for improved robustness to spectral/phenological variations). Future work will therefore extend the benchmark to multi-season, multi-climate scenarios and investigate RGB, LiDAR, and multispectral fusion to further improve generality for large-scale urban tree mapping. Second, due to its reliance on a SAM-B encoder, Tree-SAM still incurs a relatively high computational cost for city-scale inference. To quantify this overhead, we report a computational cost comparison (total parameters, trainable parameters, FLOPs, and FPS) against representative baselines in the Supplementary Material (Table S2). To improve scalability without sacrificing accuracy, future work will explore lighter foundation encoders (e.g., MobileSAM-style variants), model distillation, and deployment-oriented optimizations such as mixed-precision inference and pruning.

5. Conclusions

In this study, we proposed Tree-SAM, a city-scale individual tree detection architecture built upon the LST-tuned visual foundation model SAM and tailored for high-resolution remote sensing imagery. CCFB incorporates the proposed CLSA module to effectively fuse SAM-derived general semantics with fine-grained structural cues, thereby reducing over-extended masks when delineating small or contiguous tree crowns. HIAN strengthens multi-scale feature interactions to better capture diverse crown sizes, while CAAH adjusts refinement strategies for urban, forest, and mixed scenarios, reducing false positives from background structures and improving robustness under heterogeneous conditions. Collectively, these modules enable Tree-SAM to achieve consistently superior performance across multiple urban and cross-regional datasets, effectively addressing key challenges in individual tree detection, such as small or contiguous crowns and background interference. Further improvements are needed to sustain detection robustness in densely forested scenarios, particularly when extending the application to larger and more diverse geographic areas, which will be the focus of future work.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs18050819/s1.

Author Contributions

C.H.: methodology, formal analysis, investigation, validation, writing—original draft preparation, writing—review and editing; Y.D.: investigation, validation, and visualization; K.X.: data acquisition, investigation; R.L.: investigation, data acquisition. Y.S.: conceptualization, supervision, writing—review and editing, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China (Grant No. 42571405 and Grant No. 42171308) and in part by the Natural Science Foundation of Guangdong Province (Grant No. 2024A1515012081). We thank the anonymous reviewers for their insightful comments.

Data Availability Statement

The GZ-Tree Crown dataset was obtained from the Million Trees project (available at: https://milliontrees.idtrees.org/, accessed on 20 September 2025). The SZ-Dataset was derived from imagery downloaded from Google Earth (https://earth.google.com/, accessed on 20 September 2025); in this study, we display a subset of samples to visualize the prediction results. The BAMFORESTS dataset is publicly available from DLR (https://www.dlr.de/en/eoc/about-us/remote-sensing-technology-institute/photogrammetry-and-image-analysis/public-datasets/bamforests, accessed on 20 September 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Zhang, X.; Huang, H.; Tu, K.; Li, R.; Zhang, X.; Wang, P.; Li, Y.; Yang, Q.; Acerman, A.C.; Guo, N. Effects of Plant Community Structural Characteristics on Carbon Sequestration in Urban Green Spaces. Sci. Rep. 2024, 14, 7382. [Google Scholar] [CrossRef]
Ettinger, A.K.; Bratman, G.N.; Carey, M.; Hebert, R.; Hill, O.; Kett, H.; Levin, P.; Murphy-Williams, M.; Wyse, L. Street Trees Provide an Opportunity to Mitigate Urban Heat and Reduce Risk of High Heat Exposure. Sci. Rep. 2024, 14, 3266. [Google Scholar] [CrossRef] [PubMed]
Feng, R.; Wang, F.; Liu, S.; Qi, W.; Zhengchen, R.; Wang, D. Synergistic Effects of Urban Forest on Urban Heat Island-Air Pollution-Carbon Stock in Mega-Urban Agglomeration. Urban For. Urban Green. 2025, 103, 128590. [Google Scholar] [CrossRef]
Corro, L.M.; Bagstad, K.J.; Heris, M.P.; Ibsen, P.C.; Schleeweis, K.G.; Diffendorfer, J.E.; Troy, A.; Megown, K.; O’Neil-Dunne, J.P.M. An Enhanced National-Scale Urban Tree Canopy Cover Dataset for the United States. Sci. Data 2025, 12, 490. [Google Scholar] [CrossRef] [PubMed]
Nowak, D.; Crane, D.; Stevens, J.; Hoehn, R.; Walton, J.; Bond, J. A Ground-Based Method of Assessing Urban Forest Structure and Ecosystem Services. Arboric. Urban For. 2008, 34, 347–358. [Google Scholar] [CrossRef]
Shojanoori, R.; Shafri, H.Z.M. Review on the Use of Remote Sensing for Urban Forest Monitoring. Arboric. Urban For. 2016, 42, 400–417. [Google Scholar] [CrossRef]
Erker, T.; Wang, L.; Lorentz, L.; Stoltman, A.; Townsend, P.A. A Statewide Urban Tree Canopy Mapping Method. Remote Sens. Environ. 2019, 229, 148–158. [Google Scholar] [CrossRef]
He, D.; Shi, Q.; Liu, X.; Zhong, Y.; Zhang, L. Generating 2m Fine-Scale Urban Tree Cover Product over 34 Metropolises in China Based on Deep Context-Aware Sub-Pixel Mapping Network. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102667. [Google Scholar] [CrossRef]
Zhen, Z.; Quackenbush, L.J.; Zhang, L. Trends in Automatic Individual Tree Crown Detection and Delineation—Evolution of LiDAR Data. Remote Sens. 2016, 8, 333. [Google Scholar] [CrossRef]
Freudenberg, M.; Magdon, P.; Nölke, N. Individual Tree Crown Delineation in High-Resolution Remote Sensing Images Based on U-Net. Neural Comput. Applic. 2022, 34, 22197–22207. [Google Scholar] [CrossRef]
Liu, K.; Li, T.; Peng, D. Aerial Image Object Detection Based on RGB-Infrared Multibranch Progressive Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–14. [Google Scholar] [CrossRef]
Wang, H.; Li, J.; Van de Voorde, T.; Zhou, C.; De Maeyer, P.; Ma, Y.; Shen, Z. Individual Populus euphratica Tree Detection in Sparse Desert Forests Based on Constrained 2-D Bin Packing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–19. [Google Scholar] [CrossRef]
Gu, J.; Congalton, R.G. Individual Tree Crown Delineation from UAS Imagery Based on Region Growing by Over-Segments with a Competitive Mechanism. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4402411. [Google Scholar] [CrossRef]
Gomes, M.F.; Maillard, P.; Deng, H. Individual Tree Crown Detection in Sub-Meter Satellite Imagery Using Marked Point Processes and a Geometrical-Optical Model. Remote Sens. Environ. 2018, 211, 184–195. [Google Scholar] [CrossRef]
Brandt, M.; Tucker, C.J.; Kariryaa, A.; Rasmussen, K.; Abel, C.; Small, J.; Chave, J.; Rasmussen, L.V.; Hiernaux, P.; Diouf, A.A.; et al. An Unexpectedly Large Count of Trees in the West African Sahara and Sahel. Nature 2020, 587, 78–82. [Google Scholar] [CrossRef]
Sun, Y.; Li, Z.; He, H.; Guo, L.; Zhang, X.; Xin, Q. Counting Trees in a Subtropical Mega City Using the Instance Segmentation Method. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102662. [Google Scholar] [CrossRef]
Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
Beloiu, M.; Heinzmann, L.; Rehush, N.; Gessler, A.; Griess, V.C. Individual Tree-Crown Detection and Species Identification in Heterogeneous Forests Using Aerial RGB Imagery and Deep Learning. Remote Sens. 2023, 15, 1463. [Google Scholar] [CrossRef]
Lassalle, G.; Ferreira, M.P.; La Rosa, L.E.C.; de Souza Filho, C.R. Deep Learning-Based Individual Tree Crown Delineation in Mangrove Forests Using Very-High-Resolution Satellite Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 189, 220–235. [Google Scholar] [CrossRef]
dos Santos, A.A.; Marcato Junior, J.; Araújo, M.S.; Di Martini, D.R.; Tetila, E.C.; Siqueira, H.L.; Aoki, C.; Eltner, A.; Matsubara, E.T.; Pistori, H. Assessment of CNN-Based Methods for Individual Tree Detection on Images Captured by RGB Cameras Attached to UAVs. Sensors 2019, 19, 3595. [Google Scholar] [CrossRef]
Zhao, H.; Morgenroth, J.; Pearse, G.; Schindler, J. A Systematic Review of Individual Tree Crown Detection and Delineation with Convolutional Neural Networks (CNN). Curr. For. Rep. 2023, 9, 149–170. [Google Scholar] [CrossRef]
Zheng, J.; Yuan, S.; Li, W.; Fu, H.; Yu, L.; Huang, J. A Review of Individual Tree Crown Detection and Delineation from Optical Remote Sensing Images: Current Progress and Future. IEEE Geosci. Remote Sens. Mag. 2025, 13, 209–236. [Google Scholar] [CrossRef]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870. [Google Scholar] [PubMed]
Ball, J.G.C.; Hickman, S.H.M.; Jackson, T.D.; Koay, X.J.; Hirst, J.; Jay, W.; Archer, M.; Aubry-Kientz, M.; Vincent, G.; Coomes, D.A. Accurate Delineation of Individual Tree Crowns in Tropical Forests from Aerial RGB Imagery Using Mask R-CNN. Remote Sens. Ecol. Conserv. 2023, 9, 641–655. [Google Scholar] [CrossRef]
Weinstein, B.G.; Marconi, S.; Bohlman, S.; Zare, A.; White, E. Individual Tree-Crown Detection in RGB Imagery Using Semi-Supervised Deep Learning Neural Networks. Remote Sens. 2019, 11, 1309. [Google Scholar] [CrossRef]
Weinstein, B.G.; Marconi, S.; Aubry-Kientz, M.; Vincent, G.; Senyondo, H.; White, E.P. DeepForest: A Python Package for RGB Deep Learning Tree Crown Delineation. Methods Ecol. Evol. 2020, 11, 1743–1751. [Google Scholar] [CrossRef]
Gan, Y.; Wang, Q.; Iio, A. Tree Crown Detection and Delineation in a Temperate Deciduous Forest from UAV RGB Imagery Using Deep Learning Approaches: Effects of Spatial Resolution and Species Characteristics. Remote Sens. 2023, 15, 778. [Google Scholar] [CrossRef]
Wang, R.; Ma, L.; He, G.; Johnson, B.A.; Yan, Z.; Chang, M.; Liang, Y. Transformers for Remote Sensing: A Systematic Review and Analysis. Sensors 2024, 24, 3495. [Google Scholar] [CrossRef]
Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.-S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
Gao, T.; Gao, Z.; Ji, H.; Ao, W.; Song, W. Query Adaptive Transformer and Multiprototype Rectification for Few-Shot Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5651413.
Li, X.; Cheng, Y.; Fang, Y.; Liang, H.; Xu, S. 2DSegFormer: 2-D Transformer Model for Semantic Segmentation on Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4709413.
Jiang, T.; Freudenberg, M.; Kleinn, C.; Lüddecke, T.; Ecker, A.; Nölke, N. Detection Transformer-Based Approach for Mapping Trees Outside Forests on High Resolution Satellite Imagery. Ecol. Inform. 2025, 87, 103114.
Vinod, P.; Behera, M.; Jaya Prakash, A.; Hebbar, R.; Srivastav, S. A Novel Multitask Transformer Deep Learning Architecture for Joint Classification and Segmentation of Horticulture Plantations Using Very High-Resolution Satellite Imagery. Comput. Electron. Agric. 2024, 227, 109540.
Joshi, D.; Witharana, C. Vision Transformer-Based Unhealthy Tree Crown Detection in Mixed Northeastern US Forests and Evaluation of Annotation Uncertainty. Remote Sens. 2025, 17, 1066.
Li, X.; Wang, L.; Guan, H.; Chen, K.; Zang, Y.; Yu, Y. Urban Tree Species Classification Using UAV-Based Multispectral Images and LiDAR Point Clouds. J. Geovisualization Spat. Anal. 2023, 8, 5.
Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-Based Visual Segmentation: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163.
Dersch, S.; Schöttl, A.; Krzystek, P.; Heurich, M. Towards Complete Tree Crown Delineation by Instance Segmentation with Mask R–CNN and DETR Using UAV-Based Multispectral Imagery and Lidar Data. ISPRS Open J. Photogramm. Remote Sens. 2023, 8, 100037.
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y. Segment Anything. arXiv 2023, arXiv:2304.02643.
Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216.
Wang, H.; Köser, K.; Ren, P. Large Foundation Model Empowered Discriminative Underwater Image Enhancement. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5609317.
Lungo Vaschetti, J.; Arnaudo, E.; Rossi, C. TreePseCo: Scaling Individual Tree Crown Segmentation Using Large Vision Models. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2025, 48, 275–282.
Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment Anything in Medical Images. Nat. Commun. 2024, 15, 654.
Zhang, C.; Liu, L.; Cui, Y.; Huang, G.; Lin, W.; Yang, Y.; Hu, Y. A Comprehensive Survey on Segment Anything Model for Vision and Beyond. arXiv 2023, arXiv:2305.08196.
Ke, L.; Ye, M.; Danelljan, M.; Liu, Y.; Tai, Y.-W.; Tang, C.-K.; Yu, F. Segment Anything in High Quality. arXiv 2023, arXiv:2306.01567.
Osco, L.P.; Wu, Q.; de Lemos, E.L.; Gonçalves, W.N.; Ramos, A.P.M.; Li, J.; Marcato, J. The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103540.
Ma, X.; Wu, Q.; Zhao, X.; Zhang, X.; Pun, M.-O.; Huang, B. SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5636916.
Sun, X.; Liu, J.; Shen, H.; Zhu, X.; Hu, P. On Efficient Variants of Segment Anything Model: A Survey. Int. J. Comput. Vis. 2025, 133, 7406–7436.
Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117.
Sun, J.; Yan, S.; Yao, X.; Gao, B.; Yang, J. A Segment Anything Model Based Weakly Supervised Learning Method for Crop Mapping Using Sentinel-2 Time Series Images. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104085.
Wang, D.; Zhang, J.; Du, B.; Xu, M.; Liu, L.; Tao, D.; Zhang, L. SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23), New Orleans, LA, USA, 10–16 December 2023.
Liu, N.; Xu, X.; Su, Y.; Zhang, H.; Li, H.-C. PointSAM: Pointly-Supervised Segment Anything Model for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15.
Zhou, T.; Xia, W.; Zhang, F.; Chang, B.; Wang, W.; Yuan, Y.; Konukoglu, E.; Cremers, D. Image Segmentation in Foundation Model Era: A Survey. arXiv 2024, arXiv:2408.12957.
Teng, M.; Ouaknine, A.; Laliberté, E.; Bengio, Y.; Rolnick, D.; Larochelle, H. Assessing SAM for Tree Crown Instance Segmentation from Drone Imagery. arXiv 2025, arXiv:2503.20199.
Chai, S.; Jain, R.K.; Teng, S.; Liu, J.; Li, Y.; Tateyama, T.; Chen, Y. Ladder Fine-Tuning Approach for SAM Integrating Complementary Network. Procedia Comput. Sci. 2024, 246, 4951–4958.
Sung, Y.-L.; Cho, J.; Bansal, M. LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 12991–13005.
Troles, J.; Schmid, U.; Fan, W.; Tian, J. BAMFORESTS: Bamberg Benchmark Forest Dataset of Individual Tree Crowns in Very-High-Resolution UAV Images. Remote Sens. 2024, 16, 1935.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016, arXiv:1506.01497.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. arXiv 2017, arXiv:1611.05431.
Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. arXiv 2019, arXiv:1906.09756.
Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784.
Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. arXiv 2022, arXiv:2112.01527.
Lou, M.; Zhang, S.; Zhou, H.-Y.; Yang, S.; Wu, C.; Yu, Y. TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 11534–11547.
Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders. arXiv 2023, arXiv:2301.00808.
Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019.
Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. arXiv 2020, arXiv:2003.10152.

Figure 1. The overall architecture of Tree-SAM. (a) CCFB: Cross-Correlation Feature Backbone; (b) HIAN: Hierarchical Instance Aggregation Neck; (c) CAAH: Context-Aware Adaptation Head; (d) global feature extraction with SAM encoder; (e) local feature extraction via convolutional branch; (f) RPN with multi-level feature fusion; (g) RoI head for candidate tree instance selection; (h) scenario adapter head for context-specific proposal optimization.

Figure 2. Cross-Ladder-Side Attention (CLSA) module. SAM encoder features and locally extracted convolutional features are fused via a learnable gating mechanism, followed by multi-scale refinement using depth-wise convolutions. Different colors in the multi-scale refinement stage represent feature maps generated by depth-wise convolutions with different kernel sizes (1 × 1, 3 × 3, 5 × 5, and 7 × 7). The output is passed to the Hierarchical Instance Aggregation Neck (HIAN) for further processing.

Figure 3. Illustration of the proposed Scenario Adapter Head, which applies scene-specific refinement during inference. Green regions represent predictions with confidence scores above the threshold, while gray regions denote predictions below it.

Figure 4. Illustration of true positive, false positive, and false negative cases in individual tree crown detection. Panels (a), (b), and (c) show predicted crowns (gray) and ground truth crowns (green) with IoU < 0.5, IoU = 0.5, and IoU > 0.5, respectively.
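The IoU-based counting of true positives, false positives, and false negatives illustrated in Figure 4 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes axis-aligned boxes (x1, y1, x2, y2) in place of crown masks, greedy one-to-one matching, and an inclusive threshold (IoU ≥ 0.5 counts as a match). The function names box_iou and count_matches are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds, truths, iou_thr=0.5):
    """Greedy one-to-one matching; returns (tp, fp, fn).

    preds are assumed to be sorted by descending confidence.
    The inclusive comparison (>=) is an assumption of this sketch.
    """
    unmatched = list(range(len(truths)))
    tp = 0
    for p in preds:
        best_j, best_iou = None, 0.0
        for j in unmatched:
            iou = box_iou(p, truths[j])
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= iou_thr:
            tp += 1
            unmatched.remove(best_j)  # each ground-truth crown matches once
    fp = len(preds) - tp   # predictions without a matching crown
    fn = len(unmatched)    # crowns that no prediction matched
    return tp, fp, fn


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 0, 11, 10), (50, 50, 60, 60)]
print(count_matches(preds, truths))
```

In this toy case the first prediction overlaps the first crown well above the threshold, the second prediction overlaps nothing, and the second crown is missed, giving one true positive, one false positive, and one false negative.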

Figure 5. Visual results of Tree-SAM and compared methods for individual tree instance detection in forest (F), mixed (M), and urban (U) scenes (GZ-Tree Crown dataset).

Figure 6. Comparison of urban tree detection results across models (Blue: missed detections of small-crown trees, Red: background-induced false positives, Green: under-segmentation of multi-scale crowns, Yellow: over-extended masks).

Figure 7. Boxplots of performance distribution. (a) F1-score and (b) AP@50 for representative models across all scenarios (GZ-Tree Crown dataset).

Figure 8. Qualitative visualization of individual tree instance detection in urban (U), forest (F), and mixed (M) scenarios of the SZ-Dataset.

Figure 9. Spatial distribution and detection results of Tree-SAM on the SZ-Dataset. (a) Area and tree count by administrative region. (b) Tree count by scenario type (urban, forest, and mixed) in each administrative region. (c) Spatial distribution of tree density across regions, where darker shades of green indicate higher tree density. (d) Proportions of urban, forest, and mixed trees in each region. (e–g) Representative individual tree detection results in urban, forest, and mixed scenes. Administrative districts are abbreviated as follows: LH, Luohu; FT, Futian; NS, Nanshan; BA, Bao’an; LG, Longgang; YT, Yantian; LGH, Longhua; PS, Pingshan; GM, Guangming.

Table 1. Ablation study on module combinations under an urban scenario (GZ-Tree Crown dataset).

Configuration	Model	P	R	DIoU	F1	AP@50
Baseline	SAM	0.111	0.714	0.106	0.192	0.121
Baseline	Swin	0.402	0.558	0.305	0.467	0.212
Baseline	ResNext	0.606	0.562	0.412	0.583	0.295
Baseline+CCFB+HIAN	SAM+Swin	0.517	0.824	0.466	0.636	0.341
Baseline+CCFB+HIAN	SAM+ResNext	0.739	0.839	0.647	0.786	0.526
Baseline+CCFB+HIAN+CAAH	Tree-SAM (Swin)	0.723	0.824	0.627	0.771	0.474
Baseline+CCFB+HIAN+CAAH	Tree-SAM (ResNext)	0.821	0.839	0.709	0.830	0.526
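
As a quick consistency check on Table 1, the F1 column can be recomputed from P and R. The sketch below assumes the usual harmonic-mean definition, F1 = 2PR/(P + R), which this table does not restate; the helper name f1_score and the dictionary layout are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from Table 1.
table1 = {
    "SAM": (0.111, 0.714, 0.192),
    "Swin": (0.402, 0.558, 0.467),
    "ResNext": (0.606, 0.562, 0.583),
    "SAM+Swin": (0.517, 0.824, 0.636),
    "SAM+ResNext": (0.739, 0.839, 0.786),
    "Tree-SAM (Swin)": (0.723, 0.824, 0.771),
    "Tree-SAM (ResNext)": (0.821, 0.839, 0.830),
}

for model, (p, r, reported) in table1.items():
    recomputed = f1_score(p, r)
    # Differences of about 0.001 are expected because P and R are
    # themselves rounded to three decimals in the table.
    print(f"{model}: recomputed={recomputed:.3f} reported={reported:.3f}")
```

Under this assumption, every recomputed value agrees with the reported F1 to within about 0.001, which is the discrepancy expected from rounding P and R to three decimals.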