1. Introduction
Accurate identification of species underpins sustainable forest management, biodiversity conservation, and ecological research [
1]. As a living fossil with significant ecological, medicinal, and cultural value,
Ginkgo biloba L. holds a unique position in forest ecosystems [
2]. Timely and precise monitoring of Ginkgo populations is crucial for genetic resource protection, urban greening planning, and understanding its response to environmental changes.
Traditional forestry surveys, relying on manual fieldwork, are often labor-intensive, time-consuming, and limited in spatial coverage, making them impractical for large-scale, repetitive monitoring [
3]. The advancement of remote sensing, particularly with Unmanned Aerial Vehicles (UAVs), has provided a powerful alternative. UAVs offer unprecedented flexibility and the ability to acquire imagery at very high spatial and temporal resolutions [
4], capturing fine-scale details of tree crowns such as texture, shape, and structure that are essential for species identification [
5,
6,
7]. This technology has been widely adopted for various forestry tasks, including forest health monitoring [
8], mapping vegetation in complex terrains [
9], and classifying trees [
10]. However, identifying Ginkgo trees in natural, mixed-forest environments using remote sensing remains highly challenging. The green-leaf period presents the greatest difficulty, as Ginkgo crowns share highly similar visual characteristics with many other broadleaf species and frequently overlap with neighboring canopies, causing interspecies confusion. Conversely, although Ginkgo is visually distinctive during the yellow-leaf period, its bright autumn coloration can resemble that of other deciduous species undergoing senescence, increasing the risk of false identification.
In concert with the development of UAV technology, deep learning, especially Convolutional Neural Networks (CNNs), has emerged as the state of the art for image analysis in forestry [
11]. For instance, U-Net and its variants have demonstrated robust performance in semantic segmentation for forest cover and change detection [
12,
13]. Instance segmentation models like Mask R-CNN have been employed for delineating individual tree crowns and identifying species in complex mixed forests, with studies showing that integrating multi-source data such as multispectral imagery and Canopy Height Models (CHMs) can significantly enhance accuracy [
14,
15]. Concurrently, object detection models like the You Only Look Once (YOLO) series have been utilized for their efficiency in real-time tree counting and localization tasks [
16,
17].
Despite these advancements, significant challenges persist. The performance of single-stage detection models often declines in complex forests where dense and irregular canopies make simultaneous localization and classification difficult. The bounding boxes generated frequently contain substantial background noise from adjacent trees or the understory, which compromises the classifier’s accuracy. Another challenge lies in the phenological dynamics of temperate and subtropical forests. As demonstrated by Cloutier et al., the accuracy of tree species classification can vary significantly with the timing of image acquisition, with peak autumn coloration not always yielding the best results due to increased intra-species variability in senescence and leaf fall [
18]. This highlights the need for methods that are robust to phenological changes or can effectively leverage them.
To address the limitations of existing methods for Ginkgo identification, this study decoupled the complex task into two stages, segmentation and classification. This Segmentation-Then-Classification (STC) strategy first focuses on precisely delineating individual tree crowns to isolate them from the complex background, and then performs classification on these clean inputs. This approach is supported by recent advances in computer vision and remote sensing. For instance, the emerging Segment Anything Model (SAM) provides a powerful foundational framework for zero-shot object segmentation, demonstrating remarkable generalization across diverse domains by leveraging large-scale training [
19]. Although SAM excels in generic object segmentation, its application to fine-grained botanical tasks like tree species classification often requires subsequent specialized processing, which aligns with and validates our proposed STC paradigm. In forestry remote sensing, a study developed a watershed–spectral–textural-controlled normalized cut algorithm for individual tree segmentation, showing that improved crown delineation significantly boosts species classification accuracy [
20]. Similarly, combining spectral and textural features from UAV multispectral imagery with robust crown segmentation has been shown to yield more reliable tree species and health assessments [
21]. By integrating the generalizable segmentation capability of foundational models like SAM with a dedicated classification stage, our STC strategy effectively addresses challenges from complex backgrounds and interspecies similarities, thereby enhancing the robustness and precision of Ginkgo identification. For the classification stage, we adopt ResNet-101, a deep convolutional neural network built upon the ResNet architecture [
22]. ResNet introduces residual connections that alleviate the vanishing gradient problem in very deep networks, enabling more efficient feature propagation and optimization. The extended 101-layer version provides enhanced representational capacity, allowing the model to extract fine-grained visual cues [
23]. The main aim of this work is therefore to propose and evaluate a novel two-stage framework for accurate Ginkgo tree identification from UAV imagery. We leverage a fine-tuned SAM for precise canopy segmentation, followed by a ResNet-101-based model for robust classification. We test this approach on a challenging dataset collected from a mixed-forest environment across two key phenological periods and benchmark its performance against the YOLOv8 model. This framework provides a reliable and high-precision solution for Ginkgo identification, offering valuable technological support for biodiversity monitoring and refined forest resource management.
2. Materials and Methods
2.1. Study Area
Tianmu Mountain National Nature Reserve (119°24′11″~119°28′21″ E, 30°18′30″~30°24′55″ N) is located in Lin’an District, Hangzhou, Zhejiang Province, China. This reserve is a subtropical mixed forest of deciduous and evergreen broad-leaved trees, representing one of China’s most significant hotspots for subtropical flora [
24], and contains many naturally distributed Ginkgo populations.
The precise geographical coordinates and ground-surveyed data of the Ginkgo trees in our study plots were obtained from the extensive ecological records provided by GinkgoDB (
https://ginkgo.zju.edu.cn/) [
25], an ecological genome database for
Ginkgo biloba [
26]. The three monitoring plots were strategically placed along an altitudinal gradient: Plot ① foothill, Plot ② mid-slope, and Plot ③ summit (
Figure 1). Plot ① (30°19′21.7″ N, 119°26′32.6″ E) covers 88,387 m², ranges from 343 to 405 m in elevation, and contains 44 Ginkgo trees. Plot ② (30°20′06.07″ N, 119°26′08.6″ E) spans 26,049 m² between 631 and 735 m elevation, features a V-shaped topography, and includes 45 Ginkgo trees. Plot ③ (30°20′34.9″ N, 119°25′56.2″ E) covers 117,695 m², ranges from 1000 to 1158 m in elevation, and contains 48 Ginkgo trees.
Although the current study focuses on the Tianmu Mountain region, future work will expand data collection to additional sites across different ecological zones. Incorporating Ginkgo populations from more diverse geographic and environmental conditions will enable further validation of the model’s generalization ability and support the development of a truly robust and widely applicable identification framework.
2.2. Data Acquisition
The primary dataset for this study comprises visible-light canopy imagery collected from Plots ①–③ in 2023. Two distinct UAV systems were employed for image acquisition: the DJI Mavic 3E, equipped with a standard RGB sensor, and the DJI Mavic 3 Multispectral (DJI Innovation Technology Co., Ltd., Shenzhen, China). Both UAVs are equipped with identical high-resolution RGB cameras featuring a 4/3 CMOS sensor with 20 million effective pixels, producing images at a resolution of 5280 × 3956 pixels. The lens specifications include a field of view (FOV) of 84°, an equivalent focal length of 24 mm, a variable aperture ranging from f/2.8 to f/11, and a focus range from 1 m to infinity. The camera’s ISO range spans from 100 to 6400.
Crucially, although the Mavic 3 Multispectral possesses multi-band capabilities, only the RGB sensor data from both UAV platforms were used, in keeping with this study's focus on RGB-based identification. Deploying two systems allowed RGB imagery to be acquired more efficiently and from varied flight angles and perspectives, thereby enriching the RGB dataset for subsequent analysis; the multispectral bands themselves were not used.
To ensure high image quality and minimize distortion from environmental factors, all flight missions were executed between 10:00 a.m. and 3:00 p.m. under clear, sunny conditions with ample sunlight and low wind speeds. A terrain-following flight mode was employed, maintaining a consistent flight height of 50 m above ground level and a flight speed of 3 m/s to guarantee data clarity and mitigate motion blur. Flights followed an automated pattern in which images were captured at equal distance intervals along planned, parallel flight lines. High overlap rates, set at 80% for both forward and side overlap, were maintained to ensure robust image stitching and 3D model generation.
2.3. Data Pre-Processing
The collected raw aerial RGB images were subsequently processed using DJI Terra software (version 4.3.0, DJI, Shenzhen, China). After each flight mission, the corresponding raw image set was directly imported into DJI Terra, where the software automatically recognized the geographic coordinates and camera parameters embedded in the image metadata [
27].
The processing workflow included automated image alignment using a structure-from-motion algorithm, followed by generation of a dense point cloud and a high-resolution orthomosaic for each study plot. Approximately 250 images were used as input for Plot 1, 175 for Plot 2, and 400 for Plot 3. During alignment, DJI Terra automatically excluded images with insufficient keypoint matches to ensure alignment accuracy; because automatic exposure settings were used, underexposure was minimal and required no additional filtering. As the UAV platforms were equipped with high-precision real-time kinematic (RTK) positioning, no ground control points (GCPs) were required for georeferencing.
The final 3D models and orthomosaics were visually inspected to confirm spatial consistency and geometric accuracy before being used for subsequent analyses. The resulting orthomosaics possess very high spatial resolution, specifically 12,053 × 22,874 pixels for Plot 1, 10,679 × 13,072 pixels for Plot 2, and 25,513 × 24,009 pixels for Plot 3.
2.4. Dataset Construction
Figure 2 presents excerpts from the orthoimages of the plots, captured in different months. It is evident that during the leaf-off period, after the trees have shed their leaves, the feature differences among tree trunks are minimal, making Ginkgo identification from these images challenging. Therefore, to ensure identification accuracy, post-defoliation imagery was excluded from the study's dataset.
Concurrently, significant variations in tree phenology were observed among the plots due to differences in elevation. Ginkgo trees, over their annual growth cycle, exhibit distinct seasonal changes in leaf color, namely a green-leaf period and a yellow-leaf period. During the green-leaf period, the leaves are vibrantly green and actively photosynthesize, a stage that is critical for the tree’s growth and energy accumulation. As the season transitions into autumn, chlorophyll within the Ginkgo leaves degrades while the relative concentration of other pigments, such as carotenoids, increases. This process causes the leaves to turn yellow, and this vibrant golden-yellow hue is a prominent feature that distinguishes Ginkgo from other tree species.
As shown in the figure, the Ginkgo leaves in Plot ③ had already turned distinctly yellow by November 3, whereas those in Plots ① and ② did not display this bright coloration until the images acquired on November 23. Consequently, the imagery from Plot ③ on November 3 and from Plots ① and ② on November 23 was classified as yellow-leaf period data, with all other imagery being classified as green-leaf period data.
To ensure the reliability of this phenological classification and the accurate identification of Ginkgo trees, GPS data of individual Ginkgo trees were collected by a handheld GPS device during field surveys, as reported in GinkgoDB (
https://ginkgo.zju.edu.cn/), and overlaid with the generated orthomosaics. This direct spatial matching allowed precise localization of each Ginkgo individual within the imagery. Subsequently, leaf-color changes within these verified tree crowns were visually examined to confirm their phenological stage. This combined approach ensured that the classification accurately represented the true phenological state of the target Ginkgo trees and minimized potential misidentification caused by co-occurring species.
Based on the ground-surveyed geographic coordinates of Ginkgo trees within each plot, 2000 × 2000-pixel sub-images were extracted from the orthomosaics, with each sub-image centered on an individual tree. These sub-images constituted the primary data for the subsequent research. Because all Ginkgo trees had completely shed their leaves by December, the UAV imagery acquired in December was excluded from subsequent analyses. In total, 531 sub-images were obtained for the green-leaf stage and 137 for the yellow-leaf stage.
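For illustration, the crop-extraction step can be sketched as follows. This is a minimal example assuming a georeferenced GeoTIFF orthomosaic read with the rasterio library; the file name and coordinates shown are hypothetical rather than taken from the study.

```python
import rasterio
from rasterio.windows import Window

def extract_crown_subimage(ortho_path, lon, lat, size=2000):
    """Crop a size x size pixel sub-image centred on a surveyed tree location."""
    with rasterio.open(ortho_path) as src:
        # Convert the geographic coordinate to a pixel row/column via the
        # orthomosaic's affine transform.
        row, col = src.index(lon, lat)
        half = size // 2
        window = Window(col - half, row - half, size, size)
        # boundless=True pads with zeros when a tree sits near the plot edge.
        return src.read(window=window, boundless=True, fill_value=0)

# Hypothetical example: one 2000 x 2000 crop from the Plot 1 orthomosaic.
patch = extract_crown_subimage("plot1_orthomosaic.tif", 119.4424, 30.3227)
```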
To address the potential issue of data scarcity and to enhance the model’s generalization capabilities, data augmentation was introduced during the training process [
28], including smoothing, sharpening, degree adjustment, histogram equalization, adaptive histogram equalization, and rotations at three angles of 90°, 180°, and 270°, which were chosen based on their proven effectiveness in remote sensing classification [
29]. The resulting collection of sub-images was partitioned into training, validation, and test sets at a 5:2:2 ratio. The split was performed while ensuring that no Ginkgo tree present in the validation or test sets also appeared in the training set, and this partitioning strategy was applied consistently to both the green-leaf and yellow-leaf datasets. The final image distribution for the green-leaf period was 2655 for training, 1062 for validation, and 1062 for testing; for the yellow-leaf period it was 685 for training, 274 for validation, and 274 for testing. For model training, positive samples were defined as sub-images containing Ginkgo crowns verified by ground-surveyed coordinates, while negative samples represented non-Ginkgo tree crowns.
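The augmentation pipeline can be illustrated with the following OpenCV sketch. Kernel sizes and CLAHE settings are illustrative assumptions, and the ambiguously described "degree adjustment" step is omitted because its exact operation is not specified in the text.

```python
import cv2
import numpy as np

def augment(image_bgr):
    """Return augmented variants of one sub-image (illustrative settings)."""
    variants = []
    # Smoothing (Gaussian blur) and sharpening (unsharp-style kernel).
    variants.append(cv2.GaussianBlur(image_bgr, (5, 5), 0))
    sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    variants.append(cv2.filter2D(image_bgr, -1, sharpen_kernel))
    # Histogram equalization on the luminance channel.
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    variants.append(cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR))
    # Adaptive histogram equalization (CLAHE) on the luminance channel.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    ycrcb2 = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb2[:, :, 0] = clahe.apply(ycrcb2[:, :, 0])
    variants.append(cv2.cvtColor(ycrcb2, cv2.COLOR_YCrCb2BGR))
    # Rotations at 90, 180, and 270 degrees.
    for rot in (cv2.ROTATE_90_CLOCKWISE, cv2.ROTATE_180, cv2.ROTATE_90_COUNTERCLOCKWISE):
        variants.append(cv2.rotate(image_bgr, rot))
    return variants
```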
2.5. Object Detection via YOLOv8
A direct approach to identifying individual Ginkgo trees is to frame the problem as an object detection task. The YOLO (You Only Look Once) series has become a prominent tool for tree species identification. The YOLOv8 model family provides five variants with scaling complexity: nano (n), small (s), medium (m), large (l), and extra-large (x) [
30]. To balance detection accuracy with computational efficiency, we selected a model based on preliminary experiments [
31]. These tests revealed that larger models offered only marginal performance gains on our dataset. Therefore, the lightweight YOLOv8n model was adopted for all subsequent experiments. Key architectural enhancements in YOLOv8 include a decoupled head (
Figure 3), which separates the classification and bounding box regression tasks into independent branches. This allows for specialized optimization of each sub-task and improves the model’s ability to accurately detect objects across a wide range of scales. The efficiency of the YOLOv8n architecture makes it particularly well-suited for applications requiring rapid inference [
32].
All images were manually annotated with bounding boxes using the LabelImg software (version 1.8.6, LabelImg; GitHub; available at
https://github.com/tzutalin/labelImg, accessed on 22 April 2024). We conducted two separate training experiments: (1) a model trained exclusively on green-leaf data, and (2) a model trained exclusively on yellow-leaf data. All models were based on the lightweight YOLOv8n architecture, initialized with weights pre-trained on the COCO dataset [
33]. Training was performed using a Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.937 and a batch size of 64. A cosine annealing scheduler was used to manage the learning rate. Models were trained for a maximum of 500 epochs. To prevent overfitting, an early stopping strategy with a patience of 50 epochs was employed, terminating training when no improvement in validation performance was observed within that period. The model weights that yielded the best performance on the validation set were saved and used for the final evaluation.
2.6. Two-Stage Segment-Then-Classify Framework
As an alternative to direct object detection, we propose and evaluate a novel two-stage segment-then-classify (STC) strategy. The STC strategy first leverages the Segment Anything Model (SAM) to perform precise segmentation of tree canopies [
19]. Subsequently, a ResNet-101–based classification model is applied to the segmented outputs. The overall workflow of the proposed STC strategy is illustrated in
Figure 4.
Specifically, the STC strategy consists of the following stages. Stage 1: Instance Segmentation with SAM. The Segment Anything Model is a powerful foundation model for image segmentation, developed as the centerpiece of the “Segment Anything” project. As shown in
Figure 5, the SAM architecture comprises three main components: an image encoder, a prompt encoder, and a mask decoder. The image encoder, a high-capacity Vision Transformer (ViT) pre-trained with Masked Autoencoders (MAEs), provides the core representational power. Its training on an expansive dataset of 11 million images and over 1 billion masks grants it remarkable accuracy and generalization capabilities [
34]. A key feature of SAM is its strong zero-shot performance, enabling it to segment novel objects and images without task-specific fine-tuning. Furthermore, SAM can be efficiently fine-tuned on downstream datasets to achieve better performance on specialized segmentation tasks. In this study, SAM was fine-tuned to generate high-quality segmentation masks for potential tree objects, effectively isolating them from the background and surrounding vegetation.
As the initial step of the STC strategy, the quality of image segmentation has a direct impact on the final Ginkgo identification accuracy. To this end, fine-tuning SAM is crucial for developing a robust model tailored to our specific task. We performed systematic fine-tuning experiments on the green-leaf dataset, focusing on five key parameters that control segmentation behavior:
points_per_side determines the number of sampling points along each side of the image and therefore controls the granularity of mask generation; a higher value increases boundary precision but also raises computational cost. pred_iou_thresh and stability_score_thresh are two mask-quality filters that govern confidence and stability, respectively: the first ensures that only high-confidence masks are retained, while the second discards unstable predictions, reducing over-segmentation and false boundaries. crop_n_layers defines the number of pyramid levels used for multi-scale analysis, allowing the model to detect both small and large crowns effectively. Finally, crop_n_points_downscale_factor adjusts the density of sampling points during cropping to balance efficiency and local detail preservation.
Each parameter was tuned individually using a one-variable-at-a-time approach, varying it across a predefined range while keeping the others fixed at default values. The tested levels included 16, 32, 64, and 128 for points_per_side; 0.80, 0.85, 0.90, and 0.95 for both pred_iou_thresh and stability_score_thresh; and 1, 2, 3, and 4 for crop_n_layers and crop_n_points_downscale_factor.
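In practice, these parameters correspond to the arguments of the SamAutomaticMaskGenerator class in the segment-anything library. The sketch below shows how one candidate configuration could be instantiated; a ViT-H checkpoint and the input file name are assumptions, and the parameter values are chosen for illustration rather than reproducing the study's final settings.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (ViT-H assumed) and configure automatic mask generation
# with the five parameters discussed above.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=64,
    pred_iou_thresh=0.90,
    stability_score_thresh=0.90,
    crop_n_layers=2,
    crop_n_points_downscale_factor=2,
)

# Generate candidate canopy masks for one sub-image (H x W x 3, uint8, RGB).
image_rgb = cv2.cvtColor(cv2.imread("ginkgo_subimage.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image_rgb)
```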
Optimal values were determined based on visual inspection of crown completeness and independence. The final configuration achieved the best trade-off between accuracy and computational efficiency, producing masks that were continuous and non-overlapping. A post-processing step further ensured that each extracted tree crown was independent and structurally complete before being passed to the ResNet-101 classifier. Without this refinement, adjacent crowns or fragmented canopy segments could have been incorrectly merged or split, introducing noise and ambiguity into the classification stage.
Stage 2: Classification with ResNet-101. The segmented tree images produced in Stage 1 were then passed to a classifier for final identification. For this task, the ResNet-101 model served as the backbone. ResNet-101 is a deep convolutional neural network renowned for addressing the degradation problem in very deep networks [
22]. We employed a transfer learning approach by initializing the model with ResNet-101 weights pre-trained on the ImageNet dataset. The original fully connected layer was replaced with a new layer adapted to the classes of our dataset, and the entire network was fine-tuned to learn task-specific features.
The ResNet-101 classifier was trained using the SGD optimizer with a momentum of 0.9 and a batch size of 64. A step decay schedule was adopted: the initial learning rate was set to 0.01 and decayed by a factor of 10 every 30 epochs. Training lasted up to 100 epochs. After each epoch, performance was evaluated on the validation set, and the checkpoint achieving the highest validation accuracy was saved for subsequent testing and evaluation.
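A minimal PyTorch/torchvision sketch of this transfer-learning setup is given below. The number of output classes (two: Ginkgo vs. non-Ginkgo crown) and the omitted data-loading code are assumptions; the optimizer, learning-rate schedule, and epoch budget follow the values stated above.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2  # Ginkgo vs. non-Ginkgo crown (assumption)

# ImageNet-pretrained ResNet-101 with the final fully connected layer replaced.
model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# SGD with momentum 0.9; step decay of the learning rate by 10x every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    # ... iterate over the segmented crown images, compute criterion, and
    # update the model (training loop omitted for brevity) ...
    scheduler.step()
```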
2.7. Post-Processing and Optimization of SAM Masks
SAM tends to output multiple masks for a single object to ensure that at least one is valid. This often results in overlapping or hierarchical masks that require post-processing to generate a final set of independent and complete canopy segmentations. In our task, two primary types of overlap were observed. In the first, a larger, coarse mask of a canopy fully encompasses one or more smaller, more detailed masks of the same canopy; these masks typically share a common boundary (Figure 6a–c). In the second, small spurious masks appear within a crown, such as a shadowed area, a sunlit leaf cluster, or a patch with different texture or color; these false masks are usually caused by local illumination differences or within-crown texture variation and do not represent distinct tree crowns (Figure 6d).
To address these issues, a rule-based post-processing pipeline was developed. Step 1: Iterative Overlap Resolution. An iterative algorithm was implemented to resolve all mask overlaps. First, the algorithm iterates through all generated masks to identify any pair with overlapping pixels. For each overlapping pair, the smaller mask is subtracted from the larger mask using a Boolean operation, creating a new “residual” mask. Next, a topological analysis is performed on this residual mask to check for the presence of internal holes. If the residual mask is contiguous and contains no holes, this corresponds to the first overlap scenario, a coarse mask containing a finer one. In this case, the original smaller mask and the new residual mask are both retained, effectively partitioning the original coarse mask (
Figure 7a). In the other case, if the residual mask contains an internal hole, this corresponds to the second scenario, a spurious mask within a larger canopy. Here, the smaller mask is considered an artifact and is discarded, while only the original, larger mask is kept (
Figure 7b). This process is repeated iteratively until no overlapping masks remain in the set.
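The overlap-resolution logic can be summarized in the following simplified sketch, which operates on boolean mask arrays and uses a hole-filling test as a proxy for the topological analysis described above. It is an illustrative approximation under those assumptions, not the exact implementation.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

def resolve_overlaps(masks):
    """Iteratively resolve overlapping masks (list of boolean HxW arrays)."""
    masks = [m.astype(bool) for m in masks]
    changed = True
    while changed:
        changed = False
        for i in range(len(masks)):
            for j in range(len(masks)):
                if i == j or not np.any(masks[i] & masks[j]):
                    continue
                large, small = (i, j) if masks[i].sum() >= masks[j].sum() else (j, i)
                residual = masks[large] & ~masks[small]
                if binary_fill_holes(residual).sum() > residual.sum():
                    # Residual has an internal hole: the small mask is a spurious
                    # patch inside a complete crown, so discard it.
                    del masks[small]
                else:
                    # Otherwise partition the coarse mask into the finer mask
                    # plus its residual (both are kept).
                    masks[large] = residual
                changed = True
                break
            if changed:
                break
    return masks
```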
Step 2: Filtering Elongated Artifacts. Beyond overlaps, the segmentation and post-processing steps can introduce two types of elongated, non-canopy artifacts: narrow gaps between adjacent canopies that are incorrectly identified as distinct objects (Figure 8a), and thin, sliver-like residuals that are byproducts of the overlap resolution process (Figure 8b). To eliminate these artifacts, a filtering step based on the aspect ratio of each mask’s bounding box was implemented: masks with an aspect ratio exceeding a threshold of 2.0 were classified as erroneous artifacts and removed.
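A corresponding sketch of the aspect-ratio filter is shown below, computing each mask's axis-aligned bounding box directly from its pixel coordinates; the function name and interface are hypothetical.

```python
import numpy as np

def filter_elongated(masks, max_aspect_ratio=2.0):
    """Drop masks whose bounding box is more elongated than the threshold."""
    kept = []
    for m in masks:
        ys, xs = np.nonzero(m)
        if xs.size == 0:
            continue  # skip empty masks
        w = xs.max() - xs.min() + 1
        h = ys.max() - ys.min() + 1
        if max(w, h) / min(w, h) <= max_aspect_ratio:
            kept.append(m)
    return kept
```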
To provide a direct, quantitative comparison with the YOLOv8 baseline, the outputs of our STC strategy were converted into a standard object detection format. This was achieved by first using the SURF (Speeded Up Robust Features) image matching algorithm to determine the original coordinates of each positively classified Ginkgo segment. These coordinates were then used to define a final bounding box. The confidence score for each box was inherited directly from the classifier’s output, resulting in a prediction file directly comparable to that of YOLOv8.
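Assuming the sub-image's offset within the orthomosaic is already known (in the study it was recovered by SURF matching), the conversion of a positively classified segment into a detection record can be sketched as follows; the function and its interface are illustrative only.

```python
import numpy as np

def segment_to_detection(mask, offset_xy, confidence):
    """Convert a classified crown mask to an image-space box (x1, y1, x2, y2, score).

    offset_xy is the sub-image's top-left position in the orthomosaic, assumed
    known here rather than recovered by SURF matching as in the study.
    """
    ys, xs = np.nonzero(mask)
    x1, y1 = xs.min() + offset_xy[0], ys.min() + offset_xy[1]
    x2, y2 = xs.max() + offset_xy[0], ys.max() + offset_xy[1]
    return float(x1), float(y1), float(x2), float(y2), float(confidence)
```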
2.8. Workstation and Model Evaluation
All data processing, model construction, and analysis in this study were performed on a workstation running the Windows 10 operating system. The hardware was configured with an Intel® Core™ i9-12900K CPU, an NVIDIA® GeForce® RTX™ 3090 Ti GPU, and 128 GB of RAM.
Four metrics were selected to evaluate the identification performance for Ginkgo: Precision (P), Recall (R), F1-score (F1), and mean Average Precision at an Intersection over Union (IoU) threshold of 0.50 (mAP50). As defined in
Table 1, prediction results are categorized into four types: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). Precision and Recall are calculated based on the counts of these outcomes, with the formulas provided below:

P = TP / (TP + FP)

R = TP / (TP + FN)
The F1-score is the harmonic mean of Precision and Recall, used to provide a comprehensive measure of their combined performance. It is calculated as follows:

F1 = (2 × P × R) / (P + R)
The mean Average Precision (mAP) offers an objective evaluation of an object detection model’s effectiveness by considering Precision and Recall across different object classes and confidence thresholds. It is calculated as follows, where APi is the Average Precision for class i and N is the total number of classes:

mAP = (1 / N) × Σ APi (summed over i = 1, …, N)

This study reports mAP at a 0.50 Intersection over Union threshold (mAP50). For all the above metrics, a value closer to 100% indicates better model performance.
4. Discussion
Our study proposed and validated an innovative two-stage STC strategy, which has proven to be a highly effective method for the accurate identification of Ginkgo in complex, mixed-forest environments using high-resolution UAV imagery. The experimental results clearly demonstrate that by decoupling the tasks of canopy segmentation and species classification, the STC strategy significantly outperforms the end-to-end YOLOv8 object detection model, particularly in challenging phenological stages.
4.1. Limitations of the YOLOv8 Detection Approach
The relatively low sensitivity of YOLOv8 can be explained by the challenges inherent in forest canopy detection. As shown in
Figure 9, the complex and overlapping structures of tree crowns often lead to inaccurate boundary localization. An analysis of the prediction results suggests two potential reasons for its poor performance. First, the tree canopies have complex and irregular shapes, making it difficult for the model to accurately determine their boundaries and separate them from one another (
Figure 9a). Second, the annotated bounding boxes often contain excessive background information. This makes it difficult for the model to distinguish between the target and background, causing it to learn irrelevant background features that negatively impact the detection results (
Figure 9b).
4.2. The Superiority of the Two-Stage STC Framework
The core strength of our methodology lies in its two-stage design, which effectively mitigates the inherent limitations of single-stage object detection models in forestry applications. Traditional detectors such as YOLOv8 must simultaneously perform localization and classification, an approach that struggles when tree canopies are irregularly shaped, densely distributed, or overlapping. As shown in our qualitative analysis (
Figure 14), YOLOv8 frequently generates bounding boxes that include substantial background clutter such as neighboring canopies, understory vegetation, or bare ground. This inclusion of irrelevant background degrades the quality of the extracted features and ultimately reduces detection performance.
The STC strategy overcomes this fundamental issue. In the first stage, the fine-tuned SAM provides precise, pixel-level instance segmentation of individual tree crowns. This step effectively isolates the object of interest, delivering a clean representation of the canopy with minimal background noise to the next stage. In the second stage, the ResNet-101 classifier can focus solely on the intrinsic features of the segmented canopy. Freed from the burden of localization and background suppression, the classifier learns more discriminative feature representations, leading to a dramatic increase in accuracy. This strategy of decomposing complex tasks into specialized sub-tasks has also been shown to improve accuracy in remote sensing classification.
4.3. The Critical Role of Phenology in Ginkgo Identification
Our findings strongly reaffirm the importance of phenology in tree species identification. The performance gap between the yellow-leaf and green-leaf stages was significant for both the STC strategy and YOLOv8. The distinct golden-yellow foliage of Ginkgo in autumn provides a powerful spectral signature that simplifies the classification task, leading to an outstanding F1-score of 92.96% with STC.
The strongest evidence of STC’s effectiveness, however, is its performance in the green-leaf stage. During this period, Ginkgo trees are spectrally similar to many other broadleaf species, a common challenge in forest remote sensing [
8]. Despite this, the STC strategy achieved an F1-score of 70.22%, representing a 31.27 percentage point improvement over the augmented YOLOv8 model and surpassing YOLOv8’s best performance even in the yellow-leaf stage (68.46%). This finding suggests that accurate segmentation is not merely a preprocessing step but a critical enabler that allows the ResNet-101 classifier to focus on the intrinsic visual patterns of the canopy. By removing background interference and ensuring independent, complete crown inputs, the classifier is able to learn subtle yet decisive textural and structural cues, such as leaf density, crown morphology, and branching texture, which would otherwise be obscured in noisy detection boxes. Future work may extend this analysis by quantifying these learned structural features or visualizing attention maps within the classifier to confirm the importance of fine-grained canopy texture during the green-leaf phase.
Interestingly, the study by Cloutier et al. offers a more nuanced perspective on autumn phenology [
18]. They found that in a temperate mixed forest, which included species such as
Acer saccharum,
Fagus grandifolia, and
Abies balsamea, classification accuracy was highest at the onset of fall coloration (F1-score of 0.72) and lowest at peak coloration (F1-score of 0.61). They attributed this decline to increased intra-species variability in the timing of senescence and leaf drop, which introduced more visual noise.
While our study found the yellow-leaf period to be optimal, the contrast with Cloutier et al. [
18] highlights that the consistency and uniformity of phenological change, not just the distinctness of color, are critical for achieving the highest accuracy. In our study area, Ginkgo trees exhibited a highly synchronized transition to the yellow-leaf phase, resulting in a consistent and homogeneous canopy appearance that facilitated stable feature learning. Conversely, when phenological transitions occur asynchronously within or among species, increased intra-class variability can obscure texture and structure, ultimately degrading model accuracy. This insight suggests that achieving high accuracy in phenology-dependent classification tasks requires not only selecting visually distinct periods but also considering the temporal coherence of phenological change. Furthermore, the work of Huang et al. [
35], which focused on species such as Quercus, Acer, Ulmus, and Fraxinus during the leaf-off winter period, demonstrates the potential for species identification by leveraging structural branch features, showing that different phenological windows provide unique and complementary information.
4.4. Limitations and Future Research Directions
Despite the success of our proposed method, certain limitations exist and open up promising avenues for future research.
4.4.1. Data Fusion
Our study relied solely on RGB imagery. While cost-effective, RGB data lack the rich spectral information provided by multispectral or hyperspectral sensors. The work of Htun et al. offers a clear direction for improvement: integrating multispectral bands with a Canopy Height Model derived from UAV data significantly improved the performance of a Mask R-CNN model for broadleaf species classification in mixed forests [
14]. Adopting a similar multi-source data fusion approach, as also supported by Gyawali et al. [
36] and Wan et al. [
37], by incorporating NIR and structural data such as height from LiDAR, could substantially enhance the ability of our model to differentiate Ginkgo during the challenging green-leaf period.
4.4.2. Model Generalization
This study was conducted in a specific geographical region in eastern China. The model’s generalization ability across different forest ecosystems, with varying Ginkgo genotypes and environmental conditions, warrants further investigation. Expanding the dataset to include more diverse locations would be essential for developing a truly robust and universally applicable Ginkgo identification tool, as demonstrated by Pyo et al., who tested their U-Net model across different regions of South Korea [
38]. Moreover, this framework can be retrained or fine-tuned with limited additional data to detect other broadleaf or coniferous species in different ecological settings. This adaptability makes STC a promising foundation for multi-species forest monitoring systems.
4.4.3. Time-Series Analysis
In addition, while we examined two distinct phenological snapshots, a dense time-series analysis could provide even greater accuracy. Building on the work of Cloutier et al. [
18], who utilized seven UAV acquisitions in a single growing season, and Zhou et al. [
39], who used a 22-year time series, a similar approach could capture the unique temporal signature of Ginkgo, further improving classification robustness.
4.4.4. Operational Considerations
While the proposed STC framework achieved substantially higher accuracy than YOLOv8, it also introduces additional computational and operational costs. STC demands a more complex workflow, including mask generation, post-processing, and classification, which may limit its direct deployment for real-time or onboard UAV applications. Future work could explore model compression, GPU acceleration, or lightweight segmentation architectures to reduce inference time and improve operational efficiency in field applications.