1. Introduction
The frequency and severity of weather events such as drought and hailstorms are increasing under the effects of climate change [1]. These abiotic stresses increase the susceptibility of trees to attack from biotic agents such as insect pests and fungal pathogens [2,3,4]. Globally, drought-related tree mortality in softwood plantations is becoming a significant management issue [5]. In Australia, two biotic agents that magnify the impacts of drought in Pinus plantations are the five-spined bark beetle (Ips grandicollis (Eichhoff)) and diplodia shoot blight (Diplodia sapinea (Fries) Fuckel). Australian pine plantations consist of numerous management units (MUs), each containing compartments of the same age class, with MUs ranging in age from newly planted to scheduled for harvest. Crown damage symptoms, including those caused by D. sapinea, can vary across these units both spatially and over time.
Diplodia sapinea is an opportunistic pathogen infecting many conifer species and has a global distribution [6,7]. Trees of all growth stages are susceptible to this disease, with symptoms presenting as dead or dying shoots, branches or crown tops. Severe infection can kill the tree. The infected needles may, for a brief period, appear yellow (chlorotic) before becoming red or brown (necrotic) in colour. Over time, the necrotic needles fall, leaving defoliated shoots and branches. These crown-scale symptoms are also common to several other damaging agents in Australia, including Ips grandicollis, and so require diagnosis by a forest health expert.
Conventional forest health monitoring often involves aerial and ground-based manual surveys. Forest health aerial sketch-mapping surveys remain standard practice in North America [8] and for pine plantations in Australia [9] and New Zealand [10] to identify the extent and severity of damaging agents and processes [11,12]. In addition to mapping the extent and severity of damaged tree crowns, an experienced observer can often diagnose the damaging agent from the aircraft. However, this methodology is subjective, can be imprecise, and relies on the experience of the aerial observer. A rapidly expanding alternative is the application of semantic segmentation and classification techniques to high-spatial-resolution, remotely sensed imagery acquired from UAVs, aircraft and high-resolution satellite systems [13,14,15,16,17,18,19].
Recent advances in digital image processing and deep learning (DL) are now being applied to aerial imagery by commercial forest service providers to automate tree inventories and health surveys [20,21,22]. However, their workflows are commercial-in-confidence and so are not freely available to educational institutions, government agencies, or the forest industry more broadly.
AI-based image analysis was initially dominated by traditional machine learning (ML) methods such as random forest (RF) and support vector machine (SVM) but over the past decade has principally been replaced by DL architectures [
13,
15]. DL techniques such as object detection, pixel-based semantic segmentation and instance segmentation techniques can be applied for tree health surveys [
15,
23,
24]. Object detection techniques identify the presence and location of tree crowns and draw bounding boxes around them, whereas pixel-based semantic segmentation classifies every pixel in an image into a predefined category, producing a pixel-wise mask for each classified crown. These approaches present different trade-offs in computational cost, spatial detail, and suitability for specific monitoring tasks: object detection is typically more efficient for tree counting and individual-level assessment, while semantic segmentation offers finer spatial resolution of within-crown symptom patterns and coverage of affected areas [
25]. Instance segmentation, another DL approach that combines object detection with pixel-level masks, also can be used for tree surveys [
24,
26] and represents a promising direction for future investigation. While previous studies have demonstrated the potential of deep learning for the detection of unhealthy tree crowns, they have typically focused on either object detection approaches or semantic segmentation methods. In our study, we compared both object detection and semantic segmentation approaches on the same dataset for diplodia shoot blight detection. This comparison is critical for determining which approach provides the optimal balance of accuracy, computational efficiency, and practical applicability for operational forest health monitoring. Our study addresses this gap by evaluating state-of-the-art models from both paradigms.
The most common DL architectures applied for automated tree detection have been based on convolutional neural networks (CNNs). CNNs require training using a large number of labelled samples that for forest health applications are usually created by manual annotation of tree crowns. Importantly the training samples must cover sufficient variation to avoid model underfitting or overfitting. A common technique to reduce inaccuracies from this source is to use pre-trained models [
13,
23]. In addition, DL object detection models can contain one or two stages and often leverage CNNs. Two-stage detectors comprise a region proposal module and an object detection/classification module [
13]. Initially, two-stage CNNs such as Faster R-CNN and Mask R-CNN, were shown to outperform traditional ML approaches applied for tree crown detection and classification [
13,
27]. One-stage detectors integrate the tasks of object classification and localisation of the bounding box or mask into a single global problem and produce detections in one stage [
13,
28]. The application of YOLO (You Only Look Once) frameworks has been successfully demonstrated for object detection and instance segmentation of tree health in high-spatial-resolution imagery [
13,
29,
30]. The YOLO architecture family evolves rapidly, with frequent releases of improved versions. YOLOv12, the most recent stable version at the time of our study, was selected as the representative object detection model [
31].
Importantly, CNNs are re-trainable, being able to incorporate multiple, unique dataset characteristics and hence increasing their robustness for plantation compartments of variable age classes; however, their performance can be challenged in complex scenes having variable image conditions and complex background features [
14]. Far fewer studies have evaluated the performance of vision transformers for the segmentation and classification of tree crowns [
14] as well as a hybrid approach which combines the strengths of CNN-based and transformer-based architectures [
32,
33]. Vision transformers undertake computer vision tasks by dividing an image into patches and processing them with a self-attention mechanism to model local and global relationships across image patches. Vision transformers typically require larger amounts of training data while CNNs can perform well with relatively smaller datasets.
A large portion of the DL studies for forest health have been classifying damaged tree crowns as a single category [
14,
19,
32,
34,
35,
36]. Some studies have incorporated multiple damage classes, such as distinguishing between infested and dead trees [
37,
38,
39], but these typically represent distinct and visually obvious damage stages. For certain damaging agents including diplodia shoot blight, the affected tree crowns can present a progression of coloured crown symptoms, which also indicates the duration or severity of the infestation. In contrast to binary or distinct class schemes, our study focused on more transitional symptom classes that represent gradual disease progression. The progression of tree crown symptoms from green to yellow to red also occurs with other important damaging agents of pine trees, for example, numerous bark beetle species [
40] and sirex infestation [
28]. In addition, in these Australian plantations, unhealthy trees affected by the aphid
Essigella californica, cyclaneusma needle cast or magnesium deficiency present crown symptoms that are mostly yellow in colour. Identifying and quantifying the proportion of partially affected crowns can also indicate the severity or duration of a damaging event. Therefore, having separate classes for yellow crowns, red-brown crowns and partially damaged crowns can assist targeted on-ground diagnosis and help forest managers make informed decisions regarding treatment priorities. For example, stands with a high proportion of dead trees may warrant salvage to recover timber before trees deteriorate, while stands with predominantly dead tops may warrant thinning to alleviate water stress and halt or slow the progression of the disease.
However, distinguishing these symptom classes presents significant challenges for both manual annotation and automated prediction. The colour gradient from green to yellow to red is continuous rather than discrete, making the definition of class boundaries inherently subjective and dependent on annotator expertise and judgement. This is further complicated by natural variation in needle colour due to stand age, seasonal phenology, site conditions, and silvicultural status. Furthermore, the varying lighting conditions in aerial imagery captured at different times of year and day increased the difficulty of creating consistent annotations for each class across different study sites.
To address these challenges and evaluate the feasibility of automated multi-class symptom detection, in this study, we compared the accuracy and computational efficiency of several representative DL approaches across different architectural paradigms for detecting and classifying discoloured
Pinus radiata crowns affected by diplodia shoot blight across five plantations in New South Wales (NSW), South Australia (SA) and Victoria (VIC). We evaluated two complementary approaches: (1) object detection using YOLOv12 [
31] combined with the Segment Anything Model (SAM) [
41] for identifying and classifying individual tree crowns into three severity classes (yellow, red-brown, and dead tops), and (2) pixel-level semantic segmentation using three architectures—U-Net (CNN baseline with ResNet-34 encoder) [
42], SegFormer (vision transformer) [
43], and EVitNet (CNN-transformer hybrid) [
32]—for mapping yellow and red-brown discoloured pixels. These two approaches provide different levels of information for plantation management decisions. While object detection provides location and approximate crown size, it has limited capacity to quantify actual affected crown area, particularly for areas with a large portion of partially discoloured crowns. Semantic segmentation captures subtle pixel-level changes at the crown level that are difficult to detect with object detection alone, especially for early-stage infections where less than 50% of the crown is affected. Furthermore, integrating both methods enables accurate measurement of the proportion of affected area within each crown, providing the detailed spatial information needed for targeted management decisions such as selective thinning or salvage cutting.
Our overall aim was to identify the advantages and disadvantages of these DL models and to provide recommendations for applying DL to high-spatial-resolution imagery for assessing crowns affected by diplodia shoot blight and other damaging agents that present similar crown damage symptoms in Pinus spp. plantations.
2. Materials and Methods
The methodology consisted of seven main stages (
Figure 1): data acquisition and preprocessing, manual annotation, data splitting, model training (object detection and semantic segmentation approaches), performance evaluation on the test set, comparative analysis, and integrated assessment.
2.1. Study Sites and Image Acquisition
The study was conducted across five
P. radiata plantation sites with varying topographies located in three Australian states: New South Wales (NSW), South Australia (SA), and Victoria (VIC). The NSW sites included Kangaroo Vale and Carabost, while the SA and VIC sites—Langkoop, Myora, and Dartmoor—are situated within the Green Triangle (GT) region (
Figure 2). The GT is a major plantation forestry and wood products region spanning the border area between the southeast of SA and southwest of VIC. These timber plantations are subject to multiple abiotic and biotic disturbances. At the time of image acquisition, these sites were all affected by drought and
D. sapinea to varying extents. Depending on the duration and severity of the drought and stand conditions, tree crowns present colour symptoms ranging from pale green or yellow to orange, red or brown. Dead tops, individual branches or entire crowns can become necrotic. For our deep learning model development, crown classification was based on predominant visible colour symptoms rather than tree physiological status (details in
Section 2.2.1).
High-resolution aerial imagery was acquired by Xmap [
44] between 2023 and 2024 under clear sky conditions. The Kangaroo Vale and Langkoop sites were acquired in September 2023, while the Carabost, Myora, and Dartmoor sites were acquired in July 2024. The aerial platform used for 2023 imagery was a Cessna 172 (Textron Aviation, Wichita, KS, USA) fitted with a nadir camera hatch and external camera pod for dual camera use. The dual camera setup comprised a Nikon D850 (Nikon Corporation, Tokyo, Japan) (Bayer filter) and a modified Nikon D850 (hot filter removed) with a HOYA 52 mm infrared (R72) external filter (Hoya Corporation, Tokyo, Japan) fitted to the lens. The lenses used for Kangaroo Vale were Nikkor 85 mm prime lenses (Nikon Corporation, Tokyo, Japan), while for Langkoop were Nikkor 50 mm prime lenses (Nikon Corporation, Tokyo, Japan). The camera photos were captured with 80% forward overlap and 70% side overlap. The Kangaroo Vale and Langkoop sites were flown at 7800 ft and 4600 ft, respectively. Aerial imagery for sites in 2024 was acquired using a Cessna 172 equipped with a wheel strut camera pod. The sensor was a Fujifilm GFX 100S camera (Fujifilm Corporation, Tokyo, Japan) with 50 mm Fujinon GF lens (Fujifilm Corporation, Tokyo, Japan) mounted in nadir orientation. Images were captured at 3900 ft above ground level. These acquisition specifications resulted in delivered imagery having a ground sample distance (GSD) of 0.09 m with positional accuracy of approximately 0.27 m (±3 pixels) for all study sites between 2023 and 2024. Digital surface models (DSMs) were generated from stereo photogrammetry and used to orthorectify the raw images. Final orthomosaics were then produced using dodging (local brightness adjustment) [
45] and colour balancing techniques to ensure radiometric consistency across the imagery. The 2023 imagery included four bands (red, green, blue, and near-infrared), while the 2024 imagery contained only three bands (red, green, and blue). For consistency across all sites, only the three visible bands were used in this study due to the absence of NIR data for the 2024 acquisitions. It is acknowledged that while NIR data are advantageous for detecting unhealthy vegetation, in Australia the more commonly available RGB cameras allow for more rapid and flexible operational deployment.
2.2. Data Preparation
2.2.1. Class Definitions
Based on visual interpretation of the aerial imagery, affected tree crowns were classified according to their dominant colour symptoms. Two classification schemes were developed to accommodate the different requirements of object detection and semantic segmentation approaches. For individual tree crown detection and classification (tree-level analysis), three classes were defined (
Figure 3a):
Yellow: Tree crowns showing predominantly yellow-coloured needles.
Red-brown: Tree crowns displaying predominantly orange, red, and/or brown coloured needles.
Dead tops: Tree crowns in which more than 50% but less than 90% of crown pixels exhibited yellow and/or red-brown discolouration, indicating dead branches or shoots, with the remainder of the crown retaining green needles.
For semantic segmentation (pixel-level analysis), another three-class scheme was used (
Figure 3b):
Yellow: Crown pixels showing yellow colours.
Red-brown: Crown pixels displaying orange, red, or brown colours.
Background: All other pixels including green crowns, shadows, and ground.
The different classification schemes reflect the distinct analytical capabilities of each approach. Object detection operates at the tree crown level, enabling assessment of within-crown colour variation and the identification of trees with mixed symptom patterns (e.g., dead tops class). In contrast, semantic segmentation focuses on spectral classification of individual pixels without tree-level aggregation.
2.2.2. Image Annotations and Preprocessing
Due to the different input requirements of object detection and semantic segmentation models, separate annotation datasets were produced for each approach. Annotations were produced by a forest health expert using ArcGIS Pro 3.5.2 (Esri, Redlands, CA, USA) and subsequently verified by an independent GIS technician to ensure labelling accuracy and consistency. For object detection, bounding boxes were drawn to encompass the entire crown extent of each affected tree. The bounding box annotations were converted to text files in YOLO format, containing the class label and normalised coordinates (centre x, centre y, width, height) for each detection instance.
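The YOLO label format described above can be illustrated with a short sketch. The function name, the pixel-coordinate input convention (x_min, y_min, x_max, y_max) and the class index mapping are assumptions made for illustration; the text does not state how the ArcGIS Pro annotations were exported.

```python
def bbox_to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Format one box as a YOLO label line: class cx cy w h (all normalised to 0-1)."""
    cx = (x_min + x_max) / 2.0 / img_w   # normalised centre x
    cy = (y_min + y_max) / 2.0 / img_h   # normalised centre y
    w = (x_max - x_min) / img_w          # normalised width
    h = (y_max - y_min) / img_h          # normalised height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical class indices; the actual mapping is not given in the text.
CLASS_IDS = {"yellow": 0, "red-brown": 1, "dead tops": 2}

# A red-brown crown occupying pixels x 160-480, y 320-640 in a 640 x 640 subtile.
line = bbox_to_yolo_line(CLASS_IDS["red-brown"], 160, 320, 480, 640, 640, 640)
print(line)
```

Each annotated crown would contribute one such line to the text file associated with its image.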
For semantic segmentation, a modified image dataset was also created to enable pixel-level classification while managing annotation effort. First, bounding boxes were drawn around all trees containing any yellow or red-brown pixels, excluding those already classified as fully yellow or red-brown trees in the object detection dataset. The Segment Anything Model (SAM) was then applied within these bounding boxes to generate precise crown segments. These crown segments were subsequently masked out from the original imagery to create a modified dataset containing only fully symptomatic yellow and red-brown trees and background. This approach eliminated the need to annotate all individual discoloured pixels while avoiding potential confusion from crowns with mixed symptoms that could introduce ambiguous training signals. Within the remaining areas in the modified images, detailed polygons were manually delineated by the forest health expert following the boundaries of continuous yellow and red-brown crowns. These polygon annotations were rasterised into three-class mask images (yellow, red-brown, and background) matching the spatial resolution of the input imagery.
Figure 3 shows examples of the annotation strategies used for both object detection and semantic segmentation.
In total, 10,893 individual tree crowns were annotated across all five study sites for the object detection task, comprising 1751 yellow trees, 5747 red-brown trees, and 3395 dead top trees. For semantic segmentation, 7498 yellow and red-brown tree crowns were annotated with detailed polygons after masking.
To prepare the imagery for model training, a two-stage tiling approach was implemented. First, each orthomosaic was divided into large, non-overlapping tiles of 3040 × 3040 pixels (
Figure 4a). These large tiles were then randomly partitioned into training, validation, and test sets at a 7:2:1 ratio. This approach prevented data leakage by ensuring that overlapping subtiles generated in the subsequent step would not span across different dataset splits. Within each large tile, smaller subtiles of 640 × 640 pixels were generated using a sliding window approach with a stride of 480 pixels, creating 25% overlap between adjacent subtiles (
Figure 4b). This overlapping strategy ensured that tree crowns near tile boundaries were fully captured in at least one subtile while providing additional training samples to improve model robustness [
46]. Tiles and subtiles containing no annotated tree crowns were excluded from the dataset. Following this procedure, the final dataset comprised 2821 training, 746 validation, and 426 test images, all measuring 640 × 640 pixels. We used RGB as three input bands for our DL models as they were all pretrained on natural colour imagery, and the RGB combination has been demonstrated to be effective for detecting unhealthy tree crowns [
14,
39]. This RGB-only approach establishes baseline performance for future comparison with additional spectral bands or derived indices.
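The two-stage tiling arithmetic can be checked with a minimal sketch. It assumes the sliding window starts at the tile origin and that the 7:2:1 split is applied to large-tile identifiers before any subtiles are cut; the function names and the random seed are hypothetical.

```python
import random


def subtile_origins(tile_size=3040, window=640, stride=480):
    """Top-left offsets of sliding-window subtiles along one axis of a large tile."""
    return list(range(0, tile_size - window + 1, stride))


def split_tiles(tile_ids, ratios=(7, 2, 1), seed=0):
    """Randomly partition large-tile IDs into train/val/test before subtiling."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


origins = subtile_origins()          # offsets along one axis of a 3040-pixel tile
overlap = (640 - 480) / 640          # fraction shared by adjacent subtiles
train, val, test = split_tiles(range(100))
print(origins, overlap, len(train), len(val), len(test))
```

With these values, six windows fit exactly along each axis (the last one ends at pixel 3040), so a full large tile yields 36 candidate subtiles before empty ones are discarded. Because the split is made on large tiles, overlapping subtiles never straddle two dataset partitions.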
2.3. Individual Tree Crown Detection and Classification
A pipeline combining object detection and instance segmentation was investigated in this study to provide an efficient approach for extracting and classifying individual tree crowns.
You Only Look Once (YOLO) is one of the most popular object detection frameworks [
47]. The state-of-the-art YOLOv12 model was utilised in this study. Compared to previous YOLO models using traditional CNN-based approaches, YOLOv12 combines convolutional feature extraction with an attention-centric architecture and an improved feature aggregation module based on residual efficient layer aggregation networks (R-ELAN) [
31]. It blends CNN and transformer-style components, achieving outstanding speed and accuracy.
We employed the YOLOv12m model, pretrained on the Microsoft COCO (Common Objects in Context) dataset [
48], which was implemented using the Ultralytics framework. The training process consisted of 150 epochs with a batch size of 16 and an initial learning rate of 0.001, following a cosine learning rate schedule with a final learning rate fraction of 0.01. The AdamW optimiser was used with a momentum of 0.937 and weight decay of 0.0005. Five warmup epochs were employed to gradually adapt the pretrained model to the target dataset. A comprehensive set of augmentation strategies was applied, including HSV colour augmentation, geometric augmentations (rotation, translation, scaling, and flipping), and advanced augmentations (mosaic, mixup, and copy-paste) [
49].
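For readers wishing to reproduce a comparable configuration, the reported hyperparameters can be collected as below. The dictionary keys are what we assume to be the corresponding Ultralytics training argument names and should be verified against the framework documentation; the cosine function is one common form of cosine decay and is not confirmed to match the framework's internal schedule.

```python
import math

# Hyperparameters reported in the text, keyed by assumed Ultralytics argument names.
train_args = {
    "epochs": 150,
    "batch": 16,
    "imgsz": 640,            # subtile size used in this study
    "lr0": 0.001,            # initial learning rate
    "lrf": 0.01,             # final learning rate as a fraction of lr0
    "cos_lr": True,          # cosine learning rate schedule
    "optimizer": "AdamW",
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 5,
}


def cosine_lr(epoch, lr0=0.001, lrf=0.01, epochs=150):
    """Cosine decay from lr0 at epoch 0 to lr0 * lrf at the final epoch (warmup ignored)."""
    factor = lrf + (1.0 - lrf) * 0.5 * (1.0 + math.cos(math.pi * epoch / epochs))
    return lr0 * factor


for epoch in (0, 75, 150):
    print(epoch, cosine_lr(epoch))
```

Under this form the learning rate falls from 0.001 to 0.00001 over the 150 epochs. If the Ultralytics `YOLO.train` interface is used, a dictionary like this would typically be unpacked into the call together with a dataset configuration file.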
To obtain precise crown boundaries beyond the rectangular bounding boxes provided by YOLO, SAM was employed for instance segmentation. SAM, developed by Meta AI, is a cutting-edge image segmentation model that can produce high-quality object masks from input prompts such as points or boxes [
50]. The bounding boxes predicted by the YOLO model were directly used as prompts for SAM with minimal preprocessing. SAM 2.1-hiera-large model [
41] was utilised in this study, leveraging its improved accuracy and efficiency for generating precise tree crown masks from the YOLO-detected bounding boxes. To remove artefacts from the SAM output, post-filtering was applied to ensure mask quality. First, any crown segment not intersecting with its corresponding bounding box was removed as such masks likely represented segmentation errors. Second, any crown segment with an area exceeding 1.5 times the area of its bounding box was removed as such masks likely represented segmentation errors extending beyond the target tree. Finally, when multiple crown segments were generated for a single bounding box, only the segment with the largest area was retained, assuming it represented the primary tree crown. This three-stage filtering process ensured reliable extraction of individual tree crown boundaries.
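The three filtering rules can be sketched as follows. This is an illustrative reading, not the authors' code: masks are represented as sets of (row, col) pixels, the bounding box uses inclusive pixel bounds, and the function name is hypothetical.

```python
def filter_crown_segments(segments, box, max_area_ratio=1.5):
    """Apply the three filters to the candidate masks returned for one bounding box.

    segments: list of sets of (row, col) pixels, one set per candidate mask.
    box: (row_min, col_min, row_max, col_max), inclusive pixel bounds.
    Returns the retained mask, or None if every candidate is rejected.
    """
    r0, c0, r1, c1 = box
    box_area = (r1 - r0 + 1) * (c1 - c0 + 1)

    def intersects(seg):
        return any(r0 <= r <= r1 and c0 <= c <= c1 for r, c in seg)

    # Filter 1: drop masks that do not touch the prompting box.
    kept = [s for s in segments if intersects(s)]
    # Filter 2: drop masks larger than 1.5 times the box area.
    kept = [s for s in kept if len(s) <= max_area_ratio * box_area]
    # Filter 3: keep only the largest remaining mask.
    return max(kept, key=len) if kept else None


box = (0, 0, 1, 1)                                      # 2 x 2 box, area 4
crown = {(0, 0), (0, 1), (1, 0)}                        # plausible crown mask
stray = {(5, 5)}                                        # outside the box
spill = {(r, c) for r in range(3) for c in range(3)}    # 9 pixels > 1.5 * 4
speck = {(1, 1)}                                        # small fragment
result = filter_crown_segments([stray, spill, speck, crown], box)
print(sorted(result))
```

In practice the masks would be raster arrays, but the same three tests apply.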
2.4. Semantic Segmentation of Affected Tree Crown Classes
To evaluate and compare different deep learning architectures for mapping tree crown symptoms, three representative models were investigated: a CNN-based U-Net as a baseline, a lightweight CNN-transformer hybrid model—Easy Vision Transformer Net (EVitNet), and a vision transformer-based model—SegFormer.
U-Net is a symmetric CNN architecture composed of an encoder–decoder structure with skip connections to recover spatial details [
42]. It was originally developed for biomedical image segmentation but has been widely adopted across various domains. The encoder repeatedly applies convolutions and pooling operations to capture increasingly abstract features at progressively lower resolutions, while the decoder upsamples features back to the original resolution. Skip connections pass feature maps from each encoder stage directly to its corresponding decoder stage, preserving spatial details lost through downsampling. In this study, U-Net was initially evaluated with both ResNet-34 and ResNet-50 encoders pretrained on ImageNet. ImageNet is a large-scale dataset containing over 14 million images across 1000 object categories [
51]. While not domain-specific to forestry, ImageNet pre-training provides models with general visual feature extraction capabilities (edges, textures, colour patterns) that transfer effectively to specialised tasks, improving performance and training efficiency. Between the two U-Net variants evaluated, ResNet-34 achieved comparable segmentation accuracy to ResNet-50 while requiring substantially shorter training time (6.6 h vs. 16.0 h), and was therefore selected as the baseline CNN model for comparison.
EVitNet is a lightweight CNN-transformer combined architecture initially designed by [
32] for detecting pine wilt disease in drone imagery. It combines a MobileViT-based encoder that alternates between CNN blocks (for local feature extraction) and lightweight vision transformer blocks (for global context modelling) with a U-Net-style decoder that uses expanded convolutions to improve upsampling accuracy without adding parameters. The model’s hybrid architecture preserves spatial detail through skip connections while capturing global features via self-attention. In this study, we adapted the original EVitNet by replacing the custom MobileViT blocks with Apple’s official ImageNet pretrained MobileViT-XXS model as the encoder backbone.
SegFormer is an efficient transformer-based semantic segmentation architecture that employs a hierarchical transformer encoder (Mix Transformer, MiT) to capture multi-scale features through self-attention mechanisms, combined with a lightweight all-MLP (Multilayer Perceptron) decoder [
43]. This design achieves strong segmentation performance while addressing the computational challenges of traditional heavy encoders and complex decoders. In this study, we employed the MiT-B0 encoder pretrained on the ImageNet dataset as the backbone.
All three segmentation models were pretrained on ImageNet and fine-tuned on the study dataset using identical training configurations to ensure fair comparison. Training was conducted for 150 epochs with 15 warmup epochs. The AdamW optimiser [
52] was employed with an initial learning rate of 1 × 10⁻⁴ and weight decay of 0.01, using a linear warmup followed by a cosine annealing learning rate schedule with a minimum learning rate of 1 × 10⁻⁶. To address the severe class imbalance inherent in the dataset (where background pixels vastly outnumber symptomatic pixels and imbalance exists between disease severity classes), we employed generalised dice loss. This loss function automatically assigns higher weights to underrepresented classes during training and has been shown to be effective for imbalanced segmentation tasks [
53]. A comprehensive data augmentation strategy was implemented using the Albumentations library [
54] to improve model robustness and generalisation. Geometric augmentations included random horizontal flipping, vertical flipping, rotation, and affine transformations with scaling and translation. Colour space augmentations comprised random brightness and contrast adjustments, hue-saturation-value shifts, and channel shuffling. To simulate real-world imaging conditions and improve robustness, we applied Gaussian noise, random fog effects, and coarse dropout. Finally, images were normalised using dataset-specific mean and standard deviation values before conversion to tensors.
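To make the class weighting in the generalised dice loss concrete, the sketch below follows a common formulation in which each class is weighted by the inverse square of its reference pixel count. Whether the study's implementation matches this form exactly is an assumption, and the pure-Python list representation is for illustration only.

```python
def generalised_dice_loss(probs, targets, eps=1e-6):
    """Generalised dice loss over flattened pixels.

    probs: list (one entry per class) of predicted probabilities per pixel.
    targets: list (one entry per class) of one-hot reference labels per pixel.
    """
    numerator = 0.0
    denominator = 0.0
    for p_c, t_c in zip(probs, targets):
        ref = sum(t_c)
        # Rare classes have a small reference count and so a large weight.
        # Note that a class absent from the batch receives a very large weight.
        weight = 1.0 / (ref * ref + eps)
        numerator += weight * sum(p * t for p, t in zip(p_c, t_c))
        denominator += weight * sum(p + t for p, t in zip(p_c, t_c))
    return 1.0 - 2.0 * numerator / (denominator + eps)


# Four pixels: three background, one yellow.
targets = [[1, 1, 1, 0], [0, 0, 0, 1]]
perfect = generalised_dice_loss(targets, targets)
inverted = generalised_dice_loss([[0, 0, 0, 1], [1, 1, 1, 0]], targets)
print(perfect, inverted)
```

A perfect prediction gives a loss close to 0 and a fully inverted one gives 1. The single yellow pixel carries about nine times the weight of the background class, so errors on the rare class are not swamped by the background.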
2.5. Integration of Object Detection and Semantic Segmentation Outputs
To leverage the complementary strengths of both approaches, we developed an integrated assessment workflow that combines outputs from object detection (tree-level) and semantic segmentation (pixel-level). This integration enables within-crown damage quantification by overlaying pixel-level disease classifications onto individual tree crown boundaries identified through the object detection approach (YOLO + SAM). Specifically, for each detected tree crown, we calculate the proportion of pixels classified as yellow or red-brown, providing precise metrics of disease severity and spatial distribution within individual crowns. This dual-level analysis supports more nuanced management decisions compared to either approach alone, enabling forest managers to distinguish between trees with minor branch symptoms versus those with extensive crown damage, even when both fall within the same nominal class.
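The within-crown proportion described above amounts to counting classified pixels inside each crown mask. A minimal sketch follows, with hypothetical label codes (0 = background, 1 = yellow, 2 = red-brown) and a crown mask given as a list of pixel coordinates.

```python
YELLOW, RED_BROWN = 1, 2  # hypothetical label codes; 0 is background


def crown_symptom_fractions(crown_pixels, class_map):
    """Fraction of one crown's pixels labelled yellow or red-brown.

    crown_pixels: iterable of (row, col) inside one crown mask (e.g. from YOLO + SAM).
    class_map: 2D list of per-pixel class codes from semantic segmentation.
    """
    pixels = list(crown_pixels)
    if not pixels:
        return {"yellow": 0.0, "red_brown": 0.0, "affected": 0.0}
    n_yellow = sum(class_map[r][c] == YELLOW for r, c in pixels)
    n_red = sum(class_map[r][c] == RED_BROWN for r, c in pixels)
    n = len(pixels)
    return {
        "yellow": n_yellow / n,
        "red_brown": n_red / n,
        "affected": (n_yellow + n_red) / n,
    }


class_map = [
    [0, 1, 1, 0],
    [0, 2, 2, 0],
    [0, 0, 2, 0],
]
crown = [(0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
fractions = crown_symptom_fractions(crown, class_map)
print(fractions)
```

Here 2 of 8 crown pixels are yellow and 3 of 8 are red-brown, so 62.5% of the crown is affected.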
2.6. Implementation Details
All experiments were conducted in a Windows Subsystem for Linux 2 (WSL2) environment running on a desktop workstation equipped with dual Intel Xeon Gold 6136 CPUs (3.00 GHz, 24 cores total) and 256 GB RAM. Model training and inference were performed on an NVIDIA Quadro P6000 GPU with 24 GB memory. The deep learning framework employed was PyTorch 2.6.0 with CUDA 12.6 and Python 3.11.11. Speed measurements used a batch size of 1 and an input size of 3 × 640 × 640; inference times represent the mean of 100 iterations after 10 warmup iterations.
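The timing protocol (10 warmup iterations followed by the mean of 100 timed iterations) can be sketched generically. The helper name and the stand-in workload are hypothetical, and whether the authors timed each iteration separately or the loop as a whole is not stated.

```python
import time


def benchmark(fn, warmup=10, iters=100):
    """Return (mean seconds per call, calls per second) after warmup runs."""
    for _ in range(warmup):
        fn()                      # warmup calls are not timed
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    mean_s = elapsed / iters
    fps = 1.0 / mean_s if mean_s > 0 else float("inf")
    return mean_s, fps


# Stand-in workload. For a GPU model, fn would run one forward pass on a
# 1 x 3 x 640 x 640 input and synchronise the device before returning.
calls = []
mean_s, fps = benchmark(lambda: calls.append(sum(range(1000))))
print(f"{mean_s * 1e3:.4f} ms per call, {fps:.1f} calls per second")
```

FPS is simply the reciprocal of the mean inference time at batch size 1.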
Table 1 summarises the model specifications and computational requirements of all deep learning models employed in this study. GFLOPs (giga floating-point operations) quantify the computational complexity of a single forward pass through the model, serving as a hardware-independent measure of model efficiency. FPS (frames per second) measures the inference speed, representing the number of images the model can process per second under standardised conditions. SAM 2.1, used for instance segmentation in the detection pipeline, is not included in
Table 1 as it was applied in a prompt-based manner using pretrained weights without additional training.
2.7. Accuracy Assessment and Evaluation Metrics
The results predicted from the object detection model were evaluated using several standard metrics, including precision, recall, F1 score and mAP.
These metrics rely on the calculation of intersection over union (IoU), a fundamental measure that quantifies the overlap between predicted and ground truth bounding boxes. IoU is defined as
$$\mathrm{IoU} = \frac{\text{Area of overlap}}{\text{Area of union}}$$
With IoU values ranging from 0 to 1, where 0 signifies no overlap and 1 denotes a perfect match, IoU serves as a crucial threshold to determine the correctness of a detection. For a given IoU threshold (α), true positives (TP) are detections where objects are correctly labelled and the IoU between the predicted and ground truth bounding boxes exceeds the threshold. False positives (FP) occur when objects are incorrectly labelled or the IoU falls below the threshold. False negatives (FN) represent missed detections of objects present in the ground truth. Based on these concepts, precision is defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Precision represents the percentage of correctly detected trees among all predicted trees. Similarly, recall is defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Recall represents the percentage of correctly detected trees among all ground truth trees.
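The F1 score, serving as the harmonic mean of precision and recall, provides a balanced measure of the model’s performance, considering both false positives and false negatives. It is expressed as:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$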
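The IoU, precision, recall and F1 definitions in this section can be illustrated with a small sketch. The corner-based box format (x_min, y_min, x_max, y_max) and the function names are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Two 10 x 10 boxes offset by 5 pixels: overlap 50, union 150.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(iou, p, r, f1)
```

In this example the IoU is one third, so the detection would count as a false positive at a threshold of 0.5.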
Precision-recall values at different confidence thresholds are calculated to form a precision-recall curve. The average precision (AP) is computed as the area under this curve (AUC), representing the trade-off between precision and recall in object detection at a given IoU threshold. Commonly, AP is calculated at an IoU threshold of 0.5 (AP50 or mAP@0.5) as a standard benchmark. Additionally, AP50–95 (or mAP@[0.5:0.95]) represents the mean AP calculated across IoU thresholds ranging from 0.5 to 0.95 in 0.05 increments, providing a more comprehensive evaluation of localisation accuracy. To obtain the mean average precision (mAP), the AP values for each individual class are calculated, and the final mAP is derived by averaging these class-specific AP values over the total number of classes:
$$\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{AP}_i$$
where $\mathrm{AP}_i$ is the AP of class $i$ and $n$ is the number of target classes. mAP serves as a comprehensive metric, providing an overall evaluation of the model’s effectiveness across diverse object categories. In this study, both mAP50 and mAP50–95 were computed to evaluate detection performance at different localisation precision requirements.
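As a rough illustration of how AP and mAP follow from these definitions, the sketch below computes AP for one class at one IoU threshold and then averages over classes. It assumes that each detection has already been flagged as a true or false positive, and it uses all-point interpolation with a monotone precision envelope. That interpolation is a common convention adopted here as an assumption, not a detail stated in this study, and the function names are hypothetical.

```python
def average_precision(detections, n_truth):
    """Area under the precision-recall curve for one class at one IoU threshold.

    detections : list of (confidence, is_true_positive) pairs
    n_truth    : number of ground truth objects of this class
    """
    if n_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    tp = fp = 0
    recalls, precisions = [], []
    for _conf, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_truth)
        precisions.append(tp / (tp + fp))

    # Monotone envelope: precision at each recall level becomes the best
    # precision achieved at that recall or any higher recall (assumption).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the step-shaped curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP as the unweighted mean of class-specific AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

mAP50–95 would repeat this calculation for each IoU threshold from 0.5 to 0.95 in steps of 0.05 and average the ten resulting mAP values.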
Semantic segmentation performance was evaluated using the same fundamental metrics (IoU, precision, recall, and F1 score) as object detection, but computed at the pixel level rather than bounding box level (Table 2).
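The pixel-level counterparts can be sketched in the same way. The example below assumes the standard pixel-counting definitions, in which each pixel of a given class is a true positive, false positive or false negative. It takes label maps as nested lists for simplicity, and the function name is hypothetical; the exact formulations used in this study are those given in Table 2.

```python
def pixel_metrics(pred, truth, cls):
    """Pixel-level IoU, precision, recall and F1 for one class.

    pred, truth : 2-D lists of class labels with identical shape
    cls         : the class label to evaluate
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == cls and t == cls:
                tp += 1  # pixel correctly assigned to the class
            elif p == cls:
                fp += 1  # pixel predicted as the class but labelled otherwise
            elif t == cls:
                fn += 1  # pixel of the class that the model missed

    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}
```

Because the counts are taken over pixels and not over matched objects, no IoU threshold or box matching step is involved.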
5. Conclusions
In this study, we evaluated DL models for identifying discoloured tree crowns affected by diplodia shoot blight in P. radiata plantations, comparing two complementary approaches: object detection using YOLO combined with SAM to detect and classify individual tree crowns, and semantic segmentation using three different architectures including CNN-based U-Net, vision transformer-based SegFormer, and CNN-transformer hybrid EVitNet. Based on crown colour symptoms, three damage classes (yellow, red-brown, dead tops) and three classes (yellow, red-brown and background) were defined for object detection and semantic segmentation, respectively. The YOLO model achieved an overall mAP50 of 0.766 and mAP50–95 of 0.447 across all three classes, with red-brown crowns demonstrating the highest detection accuracy (mAP50: 0.918, F1 score: 0.851). For semantic segmentation, both SegFormer and EVitNet models outperformed the baseline U-Net, with SegFormer showing the strongest performance (IoU of 0.662 for red-brown and 0.542 for yellow). EVitNet achieved slightly lower but comparable accuracy to SegFormer while demonstrating superior training efficiency with its lighter architecture, requiring less than half of the training time (9.4 h vs. 20.5 h). The two approaches serve complementary application roles. Object detection combined with SAM is most effective for tree-level assessment and can detect crowns with heterogeneous symptom patterns, while semantic segmentation excels at providing damage information at the pixel level, facilitating accurate area quantification. Integrating both approaches can provide both tree-level identity and precise symptom quantification within individual crowns, offering comprehensive information spanning individual tree details to spatial symptom mapping. These capabilities support calibration and validation of satellite-based monitoring systems and assist in prioritisation of ground-based diagnosis or interventions. 
Importantly, this study was conducted across five geographically dispersed sites spanning multiple states over two acquisition years, demonstrating the operational scalability of aerial imagery-based deep learning workflows for forest health surveillance.
Future research could explore end-to-end instance segmentation architectures such as YOLO11-Seg [55] or Mask2Former [56] to investigate whether they could potentially streamline the workflow and improve both accuracy and efficiency. Manual annotations are labour-intensive and time-consuming. Semi-automated annotation workflows [57] could be explored to determine whether they can enable more rapid development of larger training datasets, which would facilitate rapid model adaptation to new regions. While our study required substantial manual annotation, the resulting trained models can serve as a foundation for future applications. Through transfer learning and few-shot learning approaches, these models could be adapted to detect similar forest diseases or deployed in new plantation sites with significantly reduced annotation requirements. The consistency and accuracy of training data annotations are key to developing reliable and robust models [14]. Alternative machine learning approaches, such as unsupervised clustering or foundation model-assisted annotation, could be explored to establish more consistent class boundaries that reduce dependence on annotators’ subjective judgements. Vision-Language Models (VLMs) represent another promising direction, potentially reducing annotation requirements by leveraging semantic text-image alignment to identify visual features based on natural language descriptions rather than extensive labelled datasets [58,59]. This capability could be particularly valuable for detecting early symptoms like chlorosis, though rigorous validation against standard deep learning approaches would be essential before operational deployment. While this study focused on representative models from CNN, Transformer, and hybrid architectures, future work could expand the comparison to include additional semantic segmentation models (e.g., DeepLabv3+, PSPNet) and conduct ablation studies to identify optimal architectural components for forest disease detection tasks. Finally, the input channels for DL models can be expanded beyond RGB by including NIR (when available), or derived vegetation indices. Such spectral enhancements would be particularly valuable for improving detection of spectrally ambiguous yellow symptoms, which showed consistently lower accuracy than red-brown crowns across both detection and segmentation approaches in this study.