1. Introduction
Accurate identification of species underpins sustainable forest management, biodiversity conservation, and ecological research [
1]. As a living fossil with significant ecological, medicinal, and cultural value,
Ginkgo biloba L. holds a unique position in forest ecosystems [
2]. Timely and precise monitoring of Ginkgo populations is crucial for genetic resource protection, urban greening planning, and understanding its response to environmental changes.
Traditional forestry surveys, relying on manual fieldwork, are often labor-intensive, time-consuming, and limited in spatial coverage, making them impractical for large-scale, repetitive monitoring [
3]. The advancement of remote sensing, particularly with Unmanned Aerial Vehicles (UAVs), has provided a powerful alternative. UAVs offer unprecedented flexibility and the ability to acquire imagery at very high spatial and temporal resolutions [
4], capturing fine-scale details of tree crowns such as texture, shape, and structure that are essential for species identification [
5,
6,
7]. This technology has been widely adopted for various forestry tasks, including forest health monitoring [
8], mapping vegetation in complex terrains [
9], and classifying trees [
10]. However, identifying Ginkgo trees in natural, mixed-forest environments using remote sensing remains highly challenging. The green-leaf period presents the greatest difficulty, as Ginkgo crowns share highly similar visual characteristics with many other broadleaf species and frequently overlap with neighboring canopies, causing interspecies confusion. Conversely, although Ginkgo is visually distinctive during the yellow-leaf period, its bright autumn coloration can resemble that of other deciduous species undergoing senescence, increasing the risk of false identification.
In concert with the development of UAV technology, deep learning, especially Convolutional Neural Networks (CNNs), has emerged as the state of the art for image analysis in forestry [
11]. For instance, U-Net and its variants have demonstrated robust performance in semantic segmentation for forest cover and change detection [
12,
13]. Instance segmentation models like Mask R-CNN have been employed for delineating individual tree crowns and identifying species in complex mixed forests, with studies showing that integrating multi-source data such as multispectral imagery and Canopy Height Models (CHMs) can significantly enhance accuracy [
14,
15]. Concurrently, object detection models like the You Only Look Once (YOLO) series have been utilized for their efficiency in real-time tree counting and localization tasks [
16,
17].
Despite these advancements, significant challenges persist. The performance of single-stage detection models often declines in complex forests where dense and irregular canopies make simultaneous localization and classification difficult. The bounding boxes generated frequently contain substantial background noise from adjacent trees or the understory, which compromises the classifier’s accuracy. Another challenge lies in the phenological dynamics of temperate and subtropical forests. As demonstrated by Cloutier et al., the accuracy of tree species classification can vary significantly with the timing of image acquisition, with peak autumn coloration not always yielding the best results due to increased intra-species variability in senescence and leaf fall [
18]. This highlights the need for methods that are robust to phenological changes or can effectively leverage them.
To address the limitations of existing methods for Ginkgo identification, this study decoupled the complex task into two stages, segmentation and classification. This Segmentation-Then-Classification (STC) strategy first focuses on precisely delineating individual tree crowns to isolate them from the complex background, and then performs classification on these clean inputs. This approach is supported by recent advances in computer vision and remote sensing. For instance, the emerging Segment Anything Model (SAM) provides a powerful foundational framework for zero-shot object segmentation, demonstrating remarkable generalization across diverse domains by leveraging large-scale training [
19]. Although SAM excels in generic object segmentation, its application to fine-grained botanical tasks like tree species classification often requires subsequent specialized processing, which aligns with and validates our proposed STC paradigm. In forestry remote sensing, a study developed a watershed–spectral–textural-controlled normalized cut algorithm for individual tree segmentation, showing that improved crown delineation significantly boosts species classification accuracy [
20]. Similarly, combining spectral and textural features from UAV multispectral imagery with robust crown segmentation has been shown to yield more reliable tree species and health assessments [
21]. By integrating the generalizable segmentation capability of foundational models like SAM with a dedicated classification stage, our STC strategy effectively addresses challenges from complex backgrounds and interspecies similarities, thereby enhancing the robustness and precision of Ginkgo identification. For the classification stage, we adopt ResNet-101, a deep convolutional neural network built upon the ResNet architecture [
22]. ResNet introduces residual connections that alleviate the vanishing gradient problem in very deep networks, enabling more efficient feature propagation and optimization. The extended 101-layer version provides enhanced representational capacity, allowing the model to extract fine-grained visual cues [
23]. The main aim of this work is therefore to propose and evaluate a novel two-stage framework for accurate Ginkgo tree identification from UAV imagery. We leverage a fine-tuned SAM for precise canopy segmentation, followed by a ResNet-101-based model for robust classification. We test this approach on a challenging dataset collected from a mixed-forest environment across two key phenological periods and benchmark its performance against the YOLOv8 model. This framework provides a reliable and high-precision solution for Ginkgo identification, offering valuable technological support for biodiversity monitoring and refined forest resource management.
2. Materials and Methods
2.1. Study Area
Tianmu Mountain National Nature Reserve (119°24′11″~119°28′21″ E, 30°18′30″~30°24′55″ N) is located in Lin’an District, Hangzhou, Zhejiang Province, China. This reserve is a subtropical mixed forest of deciduous and evergreen broad-leaved trees, representing one of China’s most significant hotspots for subtropical flora [
24], and contains many naturally distributed Ginkgo populations.
The precise geographical coordinates and ground-surveyed data of the Ginkgo trees in our study plots were obtained from the extensive ecological records provided by GinkgoDB (
https://ginkgo.zju.edu.cn/) [
25], an ecological genome database for
Ginkgo biloba [
26]. The three monitoring plots were strategically placed along an altitudinal gradient: Plot ① foothill, Plot ② mid-slope, and Plot ③ summit (
Figure 1). Plot ① (30°19′21.7″ N, 119°26′32.6″ E) covers 88,387 m², ranges from 343 to 405 m in elevation, and contains 44 Ginkgo trees. Plot ② (30°20′06.07″ N, 119°26′08.6″ E) spans 26,049 m² between 631 and 735 m elevation, features a V-shaped topography, and includes 45 Ginkgo trees. Plot ③ (30°20′34.9″ N, 119°25′56.2″ E) covers 117,695 m², ranges from 1000 to 1158 m in elevation, and contains 48 Ginkgo trees.
Although the current study focuses on the Tianmu Mountain region, future work will expand data collection to additional sites across different ecological zones. Incorporating Ginkgo populations from more diverse geographic and environmental conditions will enable further validation of the model’s generalization ability and support the development of a truly robust and widely applicable identification framework.
2.2. Data Acquisition
The primary dataset for this study comprises visible-light canopy imagery collected from Plots ①–③ in 2023. Two distinct UAV systems were employed for image acquisition: the DJI Mavic 3E, equipped with a standard RGB sensor, and the DJI Mavic 3 Multispectral (DJI Innovation Technology Co., Ltd., Shenzhen, China). Both UAVs are equipped with identical high-resolution RGB cameras featuring a 4/3 CMOS sensor with 20 million effective pixels, producing images at a resolution of 5280 × 3956 pixels. The lens specifications include a field of view (FOV) of 84°, an equivalent focal length of 24 mm, a variable aperture ranging from f/2.8 to f/11, and a focus range from 1 m to infinity. The camera’s ISO range spans from 100 to 6400.
Crucially, although the Mavic 3 Multispectral possesses multi-band capabilities, only the RGB sensor data from both UAV platforms were used, in keeping with this study's focus on RGB-based identification. Deploying two systems allowed RGB imagery to be acquired more efficiently and from varied flight angles and perspectives, thereby enriching the RGB dataset for subsequent analysis; the multispectral bands themselves were not used.
To ensure high image quality and minimize distortion from environmental factors, all flight missions were executed between 10:00 a.m. and 3:00 p.m. under clear, sunny conditions with ample sunlight and low wind speeds. A terrain-following flight mode was employed, maintaining a consistent flight height of 50 m above ground level and a flight speed of 3 m/s to guarantee data clarity and mitigate motion blur. Flights followed an automated pattern in which images were captured at equal distance intervals along planned, parallel flight lines. High overlap rates, set at 80% for both forward and side overlap, were maintained to ensure robust image stitching and 3D model generation.
2.3. Data Pre-Processing
The collected raw aerial RGB images were subsequently processed using DJI Terra software (version 4.3.0, DJI, Shenzhen, China). After each flight mission, the corresponding raw image set was directly imported into DJI Terra, where the software automatically recognized the geographic coordinates and camera parameters embedded in the image metadata [
27].
The processing workflow included automated image alignment using a structure-from-motion algorithm, followed by generation of a dense point cloud and a high-resolution orthomosaic for each study plot. Approximately 250 images were used as input for Plot 1, 175 for Plot 2, and 400 for Plot 3. During alignment, DJI Terra automatically excluded images with insufficient keypoint matches to ensure alignment accuracy; because automatic exposure settings were used, underexposure was minimal and required no additional filtering. As the UAV platforms were equipped with high-precision real-time kinematic (RTK) positioning, no ground control points (GCPs) were required for georeferencing.
The final 3D models and orthomosaics were visually inspected to confirm spatial consistency and geometric accuracy before being used for subsequent analyses. The resulting orthomosaics possess very high spatial resolution, specifically 12,053 × 22,874 pixels for Plot 1, 10,679 × 13,072 pixels for Plot 2, and 25,513 × 24,009 pixels for Plot 3.
2.4. Dataset Construction
Figure 2 presents excerpts from the orthoimages of the plots, captured in different months. It is evident that during the leaf-off period, after the trees have shed their leaves, the feature differences among tree trunks are minimal, making Ginkgo identification from these images challenging. Therefore, to ensure identification accuracy, post-defoliation imagery was excluded from the study's dataset.
Concurrently, significant variations in tree phenology were observed among the plots due to differences in elevation. Ginkgo trees, over their annual growth cycle, exhibit distinct seasonal changes in leaf color, namely a green-leaf period and a yellow-leaf period. During the green-leaf period, the leaves are vibrantly green and actively photosynthesize, a stage that is critical for the tree’s growth and energy accumulation. As the season transitions into autumn, chlorophyll within the Ginkgo leaves degrades while the relative concentration of other pigments, such as carotenoids, increases. This process causes the leaves to turn yellow, and this vibrant golden-yellow hue is a prominent feature that distinguishes Ginkgo from other tree species.
As shown in the figure, the Ginkgo leaves in Plot ③ had already turned distinctly yellow by November 3, whereas those in Plots ① and ② did not display this bright coloration until the images acquired on November 23. Consequently, the imagery from Plot ③ on November 3 and from Plots ① and ② on November 23 was classified as yellow-leaf period data, with all other imagery being classified as green-leaf period data.
To ensure the reliability of this phenological classification and the accurate identification of Ginkgo trees, GPS data of individual Ginkgo trees were collected by a handheld GPS device during field surveys, as reported in GinkgoDB (
https://ginkgo.zju.edu.cn/), and overlaid with the generated orthomosaics. This direct spatial matching allowed precise localization of each Ginkgo individual within the imagery. Subsequently, leaf-color changes within these verified tree crowns were visually examined to confirm their phenological stage. This combined approach ensured that the classification accurately represented the true phenological state of the target Ginkgo trees and minimized potential misidentification caused by co-occurring species.
Based on the ground-surveyed geographic coordinates of Ginkgo trees within each plot, 2000 × 2000-pixel sub-images were extracted from the orthomosaics, with each sub-image centered on an individual tree. These sub-images constituted the primary data for the subsequent research. Because all Ginkgo trees had completely shed their leaves by December, the UAV imagery acquired in December was excluded from subsequent analyses. In total, 531 sub-images were obtained for the green-leaf stage and 137 for the yellow-leaf stage.
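For illustration, the crop-extraction step can be sketched as follows. This is a minimal example assuming a georeferenced GeoTIFF orthomosaic read with the rasterio library; the file name and coordinates shown are hypothetical rather than taken from the study.

```python
import rasterio
from rasterio.windows import Window

def extract_crown_subimage(ortho_path, lon, lat, size=2000):
    """Crop a size x size pixel sub-image centred on a surveyed tree location."""
    with rasterio.open(ortho_path) as src:
        # Convert the geographic coordinate to a pixel row/column via the
        # orthomosaic's affine transform.
        row, col = src.index(lon, lat)
        half = size // 2
        window = Window(col - half, row - half, size, size)
        # boundless=True pads with zeros when a tree sits near the plot edge.
        return src.read(window=window, boundless=True, fill_value=0)

# Hypothetical example: one 2000 x 2000 crop from the Plot 1 orthomosaic.
patch = extract_crown_subimage("plot1_orthomosaic.tif", 119.4424, 30.3227)
```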
To address the potential issue of data scarcity and to enhance the model’s generalization capabilities, data augmentation was introduced during the training process [
28], including smoothing, sharpening, degree adjustment, histogram equalization, adaptive histogram equalization, and rotations at three angles of 90°, 180°, and 270°, which were chosen based on their proven effectiveness in remote sensing classification [
29]. The resulting collection of sub-images was partitioned into training, validation, and test sets at a 5:2:2 ratio. The split was performed while ensuring that no Ginkgo tree present in the validation or test sets also appeared in the training set, and this partitioning strategy was applied consistently to both the green-leaf and yellow-leaf datasets. The final image distribution for the green-leaf period was 2655 for training, 1062 for validation, and 1062 for testing; for the yellow-leaf period it was 685 for training, 274 for validation, and 274 for testing. For model training, positive samples were defined as sub-images containing Ginkgo crowns verified by ground-surveyed coordinates, while negative samples represented non-Ginkgo tree crowns.
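The augmentation pipeline can be illustrated with the following OpenCV sketch. Kernel sizes and CLAHE settings are illustrative assumptions, and the ambiguously described "degree adjustment" step is omitted because its exact operation is not specified in the text.

```python
import cv2
import numpy as np

def augment(image_bgr):
    """Return augmented variants of one sub-image (illustrative settings)."""
    variants = []
    # Smoothing (Gaussian blur) and sharpening (unsharp-style kernel).
    variants.append(cv2.GaussianBlur(image_bgr, (5, 5), 0))
    sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    variants.append(cv2.filter2D(image_bgr, -1, sharpen_kernel))
    # Histogram equalization on the luminance channel.
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    variants.append(cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR))
    # Adaptive histogram equalization (CLAHE) on the luminance channel.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    ycrcb2 = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb2[:, :, 0] = clahe.apply(ycrcb2[:, :, 0])
    variants.append(cv2.cvtColor(ycrcb2, cv2.COLOR_YCrCb2BGR))
    # Rotations at 90, 180, and 270 degrees.
    for rot in (cv2.ROTATE_90_CLOCKWISE, cv2.ROTATE_180, cv2.ROTATE_90_COUNTERCLOCKWISE):
        variants.append(cv2.rotate(image_bgr, rot))
    return variants
```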
2.5. Object Detection via YOLOv8
A direct approach to identifying individual Ginkgo trees is to frame the problem as an object detection task. The YOLO (You Only Look Once) series has become a prominent tool for tree species identification. The YOLOv8 model family provides five variants with scaling complexity: nano (n), small (s), medium (m), large (l), and extra-large (x) [
30]. To balance detection accuracy with computational efficiency, we selected a model based on preliminary experiments [
31]. These tests revealed that larger models offered only marginal performance gains on our dataset. Therefore, the lightweight YOLOv8n model was adopted for all subsequent experiments. Key architectural enhancements in YOLOv8 include a decoupled head (
Figure 3), which separates the classification and bounding box regression tasks into independent branches. This allows for specialized optimization of each sub-task and improves the model’s ability to accurately detect objects across a wide range of scales. The efficiency of the YOLOv8n architecture makes it particularly well-suited for applications requiring rapid inference [
32].
All images were manually annotated with bounding boxes using the LabelImg software (version 1.8.6, LabelImg; GitHub; available at
https://github.com/tzutalin/labelImg, accessed on 22 April 2024). We conducted two separate training experiments: (1) a model trained exclusively on green-leaf data, and (2) a model trained exclusively on yellow-leaf data. All models were based on the lightweight YOLOv8n architecture, initialized with weights pre-trained on the COCO dataset [
33]. Training was performed using a Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.937 and a batch size of 64. A cosine annealing scheduler was used to manage the learning rate. Models were trained for a maximum of 500 epochs. To prevent overfitting, an early stopping strategy with a patience of 50 epochs was employed, terminating training when no improvement in validation performance was observed within that period. The model weights that yielded the best performance on the validation set were saved and used for the final evaluation.
2.6. Two-Stage Segment-Then-Classify Framework
As an alternative to direct object detection, we propose and evaluate a novel two-stage segment-then-classify (STC) strategy. The STC strategy first leverages the Segment Anything Model (SAM) to perform precise segmentation of tree canopies [
19]. Subsequently, a ResNet-101–based classification model is applied to the segmented outputs. The overall workflow of the proposed STC strategy is illustrated in
Figure 4.
Specifically, the STC strategy consists of the following stages. Stage 1: Instance Segmentation with SAM. The Segment Anything Model is a powerful foundation model for image segmentation, developed as the centerpiece of the “Segment Anything” project. As shown in
Figure 5, the SAM architecture comprises three main components: an image encoder, a prompt encoder, and a mask decoder. The image encoder, a high-capacity Vision Transformer (ViT) pre-trained with Masked Autoencoders (MAEs), provides the core representational power. Its training on an expansive dataset of 11 million images and over 1 billion masks grants it remarkable accuracy and generalization capabilities [
34]. A key feature of SAM is its strong zero-shot performance, enabling it to segment novel objects and images without task-specific fine-tuning. Furthermore, SAM can be efficiently fine-tuned on downstream datasets to achieve better performance on specialized segmentation tasks. In this study, SAM was fine-tuned to generate high-quality segmentation masks for potential tree objects, effectively isolating them from the background and surrounding vegetation.
As the initial step of the STC strategy, the quality of image segmentation has a direct impact on the final Ginkgo identification accuracy. To this end, fine-tuning SAM is crucial for developing a robust model tailored to our specific task. We performed systematic fine-tuning experiments on the green-leaf dataset, focusing on five key parameters that control segmentation behavior:
points_per_side determines the number of sampling points along each side of the image and therefore controls the granularity of mask generation; a higher value increases boundary precision but also raises computational cost. pred_iou_thresh and stability_score_thresh are two mask-quality filters that govern confidence and stability, respectively: the first ensures that only high-confidence masks are retained, while the second discards unstable predictions, reducing over-segmentation and false boundaries. crop_n_layers defines the number of pyramid levels used for multi-scale analysis, allowing the model to detect both small and large crowns effectively. Finally, crop_n_points_downscale_factor adjusts the density of sampling points during cropping to balance efficiency and local detail preservation.
Each parameter was tuned individually using a one-variable-at-a-time approach, varying it across a predefined range while keeping the others fixed at default values. The tested levels included 16, 32, 64, and 128 for points_per_side; 0.80, 0.85, 0.90, and 0.95 for both pred_iou_thresh and stability_score_thresh; and 1, 2, 3, and 4 for crop_n_layers and crop_n_points_downscale_factor.
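In practice, these parameters correspond to the arguments of the SamAutomaticMaskGenerator class in the segment-anything library. The sketch below shows how one candidate configuration could be instantiated; a ViT-H checkpoint and the input file name are assumptions, and the parameter values are chosen for illustration rather than reproducing the study's final settings.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (ViT-H assumed) and configure automatic mask generation
# with the five parameters discussed above.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=64,
    pred_iou_thresh=0.90,
    stability_score_thresh=0.90,
    crop_n_layers=2,
    crop_n_points_downscale_factor=2,
)

# Generate candidate canopy masks for one sub-image (H x W x 3, uint8, RGB).
image_rgb = cv2.cvtColor(cv2.imread("ginkgo_subimage.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image_rgb)
```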
Optimal values were determined based on visual inspection of crown completeness and independence. The final configuration achieved the best trade-off between accuracy and computational efficiency, producing masks that were continuous and non-overlapping. A post-processing step further ensured that each extracted tree crown was independent and structurally complete before being passed to the ResNet-101 classifier. Without this refinement, adjacent crowns or fragmented canopy segments could have been incorrectly merged or split, introducing noise and ambiguity into the classification stage.
Stage 2: Classification with ResNet-101. The segmented tree images produced in Stage 1 were then passed to a classifier for final identification. For this task, the ResNet-101 model served as the backbone. ResNet-101 is a deep convolutional neural network renowned for addressing the degradation problem in very deep networks [
22]. We employed a transfer learning approach by initializing the model with ResNet-101 weights pre-trained on the ImageNet dataset. The original fully connected layer was replaced with a new layer adapted to the classes of our dataset, and the entire network was fine-tuned to learn task-specific features.
The ResNet-101 classifier was trained using the SGD optimizer with a momentum of 0.9 and a batch size of 64. A step decay schedule was adopted: the initial learning rate was set to 0.01 and decayed by a factor of 10 every 30 epochs. Training lasted up to 100 epochs. After each epoch, performance was evaluated on the validation set, and the checkpoint achieving the highest validation accuracy was saved for subsequent testing and evaluation.
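A minimal PyTorch/torchvision sketch of this transfer-learning setup is given below. The number of output classes (two: Ginkgo vs. non-Ginkgo crown) and the omitted data-loading code are assumptions; the optimizer, learning-rate schedule, and epoch budget follow the values stated above.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2  # Ginkgo vs. non-Ginkgo crown (assumption)

# ImageNet-pretrained ResNet-101 with the final fully connected layer replaced.
model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# SGD with momentum 0.9; step decay of the learning rate by 10x every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    # ... iterate over the segmented crown images, compute criterion, and
    # update the model (training loop omitted for brevity) ...
    scheduler.step()
```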
2.7. Post-Processing and Optimization of SAM Masks
SAM tends to output multiple masks for a single object to ensure that at least one is valid. This often results in overlapping or hierarchical masks that require post-processing to generate a final set of independent and complete canopy segmentations. In our task, two primary types of overlap were observed. In the first, a larger, coarse mask of a canopy fully encompasses one or more smaller, more detailed masks of the same canopy; these masks typically share a common boundary (Figure 6a–c). In the second, small spurious masks appear within a crown, such as a shadowed area, a sunlit leaf cluster, or a patch with different texture or color; these false masks are usually caused by local illumination differences or within-crown texture variation and do not represent distinct tree crowns (Figure 6d).
To address these issues, a rule-based post-processing pipeline was developed. Step 1: Iterative Overlap Resolution. An iterative algorithm was implemented to resolve all mask overlaps. First, the algorithm iterates through all generated masks to identify any pair with overlapping pixels. For each overlapping pair, the smaller mask is subtracted from the larger mask using a Boolean operation, creating a new “residual” mask. Next, a topological analysis is performed on this residual mask to check for the presence of internal holes. If the residual mask is contiguous and contains no holes, this corresponds to the first overlap scenario, a coarse mask containing a finer one. In this case, the original smaller mask and the new residual mask are both retained, effectively partitioning the original coarse mask (
Figure 7a). In the other case, if the residual mask contains an internal hole, this corresponds to the second scenario, a spurious mask within a larger canopy. Here, the smaller mask is considered an artifact and is discarded, while only the original, larger mask is kept (
Figure 7b). This process is repeated iteratively until no overlapping masks remain in the set.
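The overlap-resolution logic can be summarized in the following simplified sketch, which operates on boolean mask arrays and uses a hole-filling test as a proxy for the topological analysis described above. It is an illustrative approximation under those assumptions, not the exact implementation.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

def resolve_overlaps(masks):
    """Iteratively resolve overlapping masks (list of boolean HxW arrays)."""
    masks = [m.astype(bool) for m in masks]
    changed = True
    while changed:
        changed = False
        for i in range(len(masks)):
            for j in range(len(masks)):
                if i == j or not np.any(masks[i] & masks[j]):
                    continue
                large, small = (i, j) if masks[i].sum() >= masks[j].sum() else (j, i)
                residual = masks[large] & ~masks[small]
                if binary_fill_holes(residual).sum() > residual.sum():
                    # Residual has an internal hole: the small mask is a spurious
                    # patch inside a complete crown, so discard it.
                    del masks[small]
                else:
                    # Otherwise partition the coarse mask into the finer mask
                    # plus its residual (both are kept).
                    masks[large] = residual
                changed = True
                break
            if changed:
                break
    return masks
```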
Step 2: Filtering Elongated Artifacts. Beyond overlaps, the segmentation and post-processing steps can introduce two types of elongated, non-canopy artifacts: narrow gaps between adjacent canopies that are incorrectly identified as distinct objects (Figure 8a), and thin, sliver-like residuals that are byproducts of the overlap resolution process (Figure 8b). To eliminate these artifacts, a filtering step based on the aspect ratio of each mask’s bounding box was implemented: masks with an aspect ratio exceeding a threshold of 2.0 were classified as erroneous artifacts and removed.
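A corresponding sketch of the aspect-ratio filter is shown below, computing each mask's axis-aligned bounding box directly from its pixel coordinates; the function name and interface are hypothetical.

```python
import numpy as np

def filter_elongated(masks, max_aspect_ratio=2.0):
    """Drop masks whose bounding box is more elongated than the threshold."""
    kept = []
    for m in masks:
        ys, xs = np.nonzero(m)
        if xs.size == 0:
            continue  # skip empty masks
        w = xs.max() - xs.min() + 1
        h = ys.max() - ys.min() + 1
        if max(w, h) / min(w, h) <= max_aspect_ratio:
            kept.append(m)
    return kept
```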
To provide a direct, quantitative comparison with the YOLOv8 baseline, the outputs of our STC strategy were converted into a standard object detection format. This was achieved by first using the SURF (Speeded Up Robust Features) image matching algorithm to determine the original coordinates of each positively classified Ginkgo segment. These coordinates were then used to define a final bounding box. The confidence score for each box was inherited directly from the classifier’s output, resulting in a prediction file directly comparable to that of YOLOv8.
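Assuming the sub-image's offset within the orthomosaic is already known (in the study it was recovered by SURF matching), the conversion of a positively classified segment into a detection record can be sketched as follows; the function and its interface are illustrative only.

```python
import numpy as np

def segment_to_detection(mask, offset_xy, confidence):
    """Convert a classified crown mask to an image-space box (x1, y1, x2, y2, score).

    offset_xy is the sub-image's top-left position in the orthomosaic, assumed
    known here rather than recovered by SURF matching as in the study.
    """
    ys, xs = np.nonzero(mask)
    x1, y1 = xs.min() + offset_xy[0], ys.min() + offset_xy[1]
    x2, y2 = xs.max() + offset_xy[0], ys.max() + offset_xy[1]
    return float(x1), float(y1), float(x2), float(y2), float(confidence)
```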
2.8. Workstation and Model Evaluation
All data processing, model construction, and analysis in this study were performed on a workstation running the Windows 10 operating system. The hardware was configured with an Intel® Core™ i9-12900K CPU, an NVIDIA® GeForce® RTX™ 3090 Ti GPU, and 128 GB of RAM.
Four metrics were selected to evaluate the identification performance for Ginkgo: Precision (P), Recall (R), F1-score (F1), and mean Average Precision at an Intersection over Union (IoU) threshold of 0.50 (mAP50). As defined in
Table 1, prediction results are categorized into four types: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). Precision and Recall are calculated based on the counts of these outcomes, with the formulas provided below:

P = TP / (TP + FP)

R = TP / (TP + FN)
The F1-score is the harmonic mean of Precision and Recall, used to provide a comprehensive measure of their combined performance. It is calculated as follows:

F1 = (2 × P × R) / (P + R)
The mean Average Precision (mAP) offers an objective evaluation of an object detection model’s effectiveness by considering Precision and Recall across different object classes and confidence thresholds. It is calculated as follows, where APi is the Average Precision for class i and N is the total number of classes:

mAP = (1 / N) × Σ APi (summed over i = 1, …, N)

This study reports mAP at a 0.50 Intersection over Union threshold (mAP50). For all the above metrics, a value closer to 100% indicates better model performance.
4. Discussion
Our study proposed and validated an innovative two-stage STC strategy, which has proven to be a highly effective method for the accurate identification of Ginkgo in complex, mixed-forest environments using high-resolution UAV imagery. The experimental results clearly demonstrate that by decoupling the tasks of canopy segmentation and species classification, the STC strategy significantly outperforms the end-to-end YOLOv8 object detection model, particularly in challenging phenological stages.
4.1. Limitations of the YOLOv8 Detection Approach
The relatively low sensitivity of YOLOv8 can be explained by the challenges inherent in forest canopy detection. As shown in
Figure 9, the complex and overlapping structures of tree crowns often lead to inaccurate boundary localization. An analysis of the prediction results suggests two potential reasons for its poor performance. First, the tree canopies have complex and irregular shapes, making it difficult for the model to accurately determine their boundaries and separate them from one another (
Figure 9a). Second, the annotated bounding boxes often contain excessive background information. This makes it difficult for the model to distinguish between the target and background, causing it to learn irrelevant background features that negatively impact the detection results (
Figure 9b).
4.2. The Superiority of the Two-Stage STC Framework
The core strength of our methodology lies in its two-stage design, which effectively mitigates the inherent limitations of single-stage object detection models in forestry applications. Traditional detectors such as YOLOv8 must simultaneously perform localization and classification, an approach that struggles when tree canopies are irregularly shaped, densely distributed, or overlapping. As shown in our qualitative analysis (
Figure 14), YOLOv8 frequently generates bounding boxes that include substantial background clutter such as neighboring canopies, understory vegetation, or bare ground. This inclusion of irrelevant background degrades the quality of the extracted features and ultimately reduces detection performance.
The STC strategy overcomes this fundamental issue. In the first stage, the fine-tuned SAM provides precise, pixel-level instance segmentation of individual tree crowns. This step effectively isolates the object of interest, delivering a clean representation of the canopy with minimal background noise to the next stage. In the second stage, the ResNet-101 classifier can focus solely on the intrinsic features of the segmented canopy. Freed from the burden of localization and background suppression, the classifier learns more discriminative feature representations, leading to a dramatic increase in accuracy. This strategy of decomposing complex tasks into specialized sub-tasks has also been shown to improve accuracy in remote sensing classification.
4.3. The Critical Role of Phenology in Ginkgo Identification
Our findings strongly reaffirm the importance of phenology in tree species identification. The performance gap between the yellow-leaf and green-leaf stages was significant for both the STC strategy and YOLOv8. The distinct golden-yellow foliage of Ginkgo in autumn provides a powerful spectral signature that simplifies the classification task, leading to an outstanding F1-score of 92.96% with STC.
The strongest evidence of STC’s effectiveness, however, is its performance in the green-leaf stage. During this period, Ginkgo trees are spectrally similar to many other broadleaf species, a common challenge in forest remote sensing [
8]. Despite this, the STC strategy achieved an F1-score of 70.22%, representing a 31.27 percentage point improvement over the augmented YOLOv8 model and surpassing YOLOv8’s best performance even in the yellow-leaf stage (68.46%). This finding suggests that accurate segmentation is not merely a preprocessing step but a critical enabler that allows the ResNet-101 classifier to focus on the intrinsic visual patterns of the canopy. By removing background interference and ensuring independent, complete crown inputs, the classifier is able to learn subtle yet decisive textural and structural cues, such as leaf density, crown morphology, and branching texture, which would otherwise be obscured in noisy detection boxes. Future work may extend this analysis by quantifying these learned structural features or visualizing attention maps within the classifier to confirm the importance of fine-grained canopy texture during the green-leaf phase.
Interestingly, the study by Cloutier et al. offers a more nuanced perspective on autumn phenology [
18]. They found that in a temperate mixed forest, which included species such as
Acer saccharum,
Fagus grandifolia, and
Abies balsamea, classification accuracy was highest at the onset of fall coloration (F1-score of 0.72) and lowest at peak coloration (F1-score of 0.61). They attributed this decline to increased intra-species variability in the timing of senescence and leaf drop, which introduced more visual noise.
While our study found the yellow-leaf period to be optimal, the contrast with Cloutier et al. [
18] highlights that the consistency and uniformity of phenological change, not just the distinctness of color, are critical for achieving the highest accuracy. In our study area, Ginkgo trees exhibited a highly synchronized transition to the yellow-leaf phase, resulting in a consistent and homogeneous canopy appearance that facilitated stable feature learning. Conversely, when phenological transitions occur asynchronously within or among species, increased intra-class variability can obscure texture and structure, ultimately degrading model accuracy. This insight suggests that achieving high accuracy in phenology-dependent classification tasks requires not only selecting visually distinct periods but also considering the temporal coherence of phenological change. Furthermore, the work of Huang et al. [
35], which focused on species such as Quercus, Acer, Ulmus, and Fraxinus during the leaf-off winter period, demonstrates the potential for species identification by leveraging structural branch features, showing that different phenological windows provide unique and complementary information.
4.4. Limitations and Future Research Directions
Despite the success of our proposed method, certain limitations exist and open up promising avenues for future research.
4.4.1. Data Fusion
Our study relied solely on RGB imagery. While cost-effective, RGB data lack the rich spectral information provided by multispectral or hyperspectral sensors. The work of Htun et al. offers a clear direction for improvement: integrating multispectral bands with a Canopy Height Model derived from UAV data significantly improved the performance of a Mask R-CNN model for broadleaf species classification in mixed forests [
14]. Adopting a similar multi-source data fusion approach, as also supported by Gyawali et al. [
36] and Wan et al. [
37], by incorporating NIR and structural data such as height from LiDAR, could substantially enhance the ability of our model to differentiate Ginkgo during the challenging green-leaf period.
4.4.2. Model Generalization
This study was conducted in a specific geographical region in eastern China. The model’s generalization ability across different forest ecosystems, with varying Ginkgo genotypes and environmental conditions, warrants further investigation. Expanding the dataset to include more diverse locations would be essential for developing a truly robust and universally applicable Ginkgo identification tool, as demonstrated by Pyo et al., who tested their U-Net model across different regions of South Korea [
38]. Moreover, this framework can be retrained or fine-tuned with limited additional data to detect other broadleaf or coniferous species in different ecological settings. This adaptability makes STC a promising foundation for multi-species forest monitoring systems.
4.4.3. Time-Series Analysis
In addition, while we examined two distinct phenological snapshots, a dense time-series analysis could provide even greater accuracy. Building on the work of Cloutier et al. [
18], who utilized seven UAV acquisitions in a single growing season, and Zhou et al. [
39], who used a 22-year time series, a similar approach could capture the unique temporal signature of Ginkgo, further improving classification robustness.
4.4.4. Operational Considerations
While the proposed STC framework achieved substantially higher accuracy than YOLOv8, it also introduces additional computational and operational costs. STC demands a more complex workflow, including mask generation, post-processing, and classification, which may limit its direct deployment for real-time or onboard UAV applications. Future work could explore model compression, GPU acceleration, or lightweight segmentation architectures to reduce inference time and improve operational efficiency in field applications.